[00:02:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2085-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:10:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1135524 [00:10:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1135524 (owner: 10TrainBranchBot) [00:10:37] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:11:46] 06SRE-OnFire, 10Cassandra, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is too high - https://phabricator.wikimedia.org/T390630#10728283 (10Eevans) [00:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:22:16] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 (10Eevans) 03NEW [00:22:24] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10728300 (10Eevans) p:05Triage→03Medium [00:24:40] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:28:59] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1135524 (owner: 10TrainBranchBot) [00:34:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728306 (10phaultfinder) [00:37:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [00:53:13] PROBLEM - SSH on bast3007 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:54:13] RECOVERY - SSH on bast3007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728319 (10phaultfinder) [01:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728347 (10phaultfinder) [01:27:24] (03PS1) 10Krinkle: mc: remove unused "memcached-pecl" definition from wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135529 (https://phabricator.wikimedia.org/T371378) [01:27:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:28:19] (03PS2) 
10Krinkle: mc: remove unused "memcached-pecl" definition from wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135529 (https://phabricator.wikimedia.org/T371378) [01:37:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [01:48:51] (03CR) 10Bartosz Dziewoński: [C:03+1] mc: remove unused "memcached-pecl" definition from wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135529 (https://phabricator.wikimedia.org/T371378) (owner: 10Krinkle) [01:55:32] (03CR) 10Creynolds: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1135522 (owner: 10Creynolds) [01:57:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [01:58:13] (03PS2) 10Creynolds: dumps enterprise copy update [puppet] - 10https://gerrit.wikimedia.org/r/1135522 [02:10:37] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:12:21] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10728390 (10Andrew) a:05Andrew→03dcaro After a bit of monkeying with the raid settings (to mark the new drive as non-raid) I can now see the drive in l... 
[02:17:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:27:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [02:52:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [03:04:45] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728424 (10phaultfinder) [03:15:43] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:02:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2085-production-search-codfw is not indexing - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [04:17:36] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:24:40] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:37:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [05:04:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728492 (10phaultfinder) [05:14:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728507 (10phaultfinder) [05:19:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:23:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10728509 (10Marostegui) p:05Triage→03Medium 
I can be the point of contact for this task, with the exception of aqs1022 and restbase1045 [05:27:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:38:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10728520 (10Marostegui) [05:39:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10728522 (10Marostegui) @RobH I've filled out the databases, pc and dbproxy. We are ready to do this (ideally in 2-3 hosts batches) anytime, we just need coordi... [05:42:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [05:42:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [05:47:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T0600) [06:00:04] marostegui, Amir1, and federico3: 
#bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T0600). [06:02:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [06:08:26] 06SRE, 10Wikimedia-Mailing-lists: mailman/postorius: errors when changing subscription or when trying to unsubscribe - https://phabricator.wikimedia.org/T391260#10728547 (10Krd) The problem seems to have disappeared- [06:17:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:27:43] (03CR) 10Awight: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1135481 (owner: 10Awight) [06:40:42] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:43:20] 06SRE, 10Prod-Kubernetes, 06Traffic, 10Wikidata, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10728638 (10Silvan_WMDE) a:03Silvan_WMDE [06:43:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:44:02] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:32] 
RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:45:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc2 T391454', diff saved to https://phabricator.wikimedia.org/P74822 and previous config saved to /var/cache/conftool/dbconfig/20250410-064511-marostegui.json [06:45:15] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [06:45:58] (03PS1) 10Marostegui: pc2: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1135640 (https://phabricator.wikimedia.org/T391454) [06:46:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [06:46:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance [06:47:13] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:47:20] (03CR) 10Marostegui: [C:03+2] pc2: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1135640 (https://phabricator.wikimedia.org/T391454) (owner: 10Marostegui) [06:50:42] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:51:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - 
https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [06:52:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc2 T391454', diff saved to https://phabricator.wikimedia.org/P74823 and previous config saved to /var/cache/conftool/dbconfig/20250410-065208-marostegui.json [06:52:12] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [06:55:20] !log Migrate pc2 to MariaDB 10.11 T391454 [06:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T0700) [07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:14] (03PS1) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [07:00:37] (03CR) 10CI reject: [V:04-1] benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [07:07:31] (03PS2) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [07:07:53] (03CR) 10CI reject: [V:04-1] benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [07:13:25] (03CR) 10Slyngshede: [C:03+2] Add .tox to gitignore [software/bitu] - 10https://gerrit.wikimedia.org/r/1135448 (owner: 10Majavah) [07:14:08] (03PS5) 10Brouberol: airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) [07:14:08] (03PS2) 10Brouberol: airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) [07:14:16] (03CR) 10CI reject: [V:04-1] airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [07:14:19] (03CR) 10CI reject: [V:04-1] airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) (owner: 10Brouberol) [07:15:08] (03CR) 10Brouberol: airflow: scrape additional metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [07:15:12] (03PS6) 10Brouberol: airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 
(https://phabricator.wikimedia.org/T391332) [07:15:43] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:16:39] (03Merged) 10jenkins-bot: Add .tox to gitignore [software/bitu] - 10https://gerrit.wikimedia.org/r/1135448 (owner: 10Majavah) [07:16:59] (03PS7) 10Brouberol: airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) [07:16:59] (03PS3) 10Brouberol: airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) [07:17:06] (03CR) 10CI reject: [V:04-1] airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [07:17:09] (03CR) 10CI reject: [V:04-1] airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) (owner: 10Brouberol) [07:17:52] (03PS8) 10Brouberol: airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) [07:17:52] (03PS4) 10Brouberol: airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) [07:19:06] (03CR) 10Tiziano Fogli: [C:03+2] perf/navtiming: Add CPU long task alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135408 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [07:19:14] (03CR) 10Slyngshede: "Do we need to keep the avg() to compensate for having two hosts and two counters. 
If I run the query I get four values, so that would trig" [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [07:19:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:44] (03Merged) 10jenkins-bot: perf/navtiming: Add CPU long task alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135408 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [07:21:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db1180 to s6 vslow/dump', diff saved to https://phabricator.wikimedia.org/P74824 and previous config saved to /var/cache/conftool/dbconfig/20250410-072127-marostegui.json [07:22:31] (03CR) 10Vgutierrez: cdn: Unify ats/haproxy/varnish upgrade cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [07:24:36] (03CR) 10Btullis: [C:03+1] "Cool." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) (owner: 10Brouberol) [07:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728719 (10phaultfinder) [07:31:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10728726 (10Marostegui) [07:35:49] !log upload liberica 0.12 to bookworm-wikimedia (apt.wm.o) [07:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:30] (03CR) 10Brouberol: airflow: scrape additional metrics (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [07:39:58] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-canary [07:40:02] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [07:40:02] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:40:14] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [07:40:31] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-canary [07:40:52] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [07:40:52] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:41:04] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [07:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:43:25] (03PS3) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [07:44:07] !log rollback to liberica 0.11 in lvs1013 [07:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:30] FIRING: LibericaDiffFPCheck: Liberica instance lvs1013:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [07:44:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [07:46:48] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting A:liberica-canary [07:47:02] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling A:liberica-canary [07:47:09] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling A:liberica-canary [07:47:13] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin pooling 
A:liberica-canary [07:47:24] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling A:liberica-canary [07:47:26] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting A:liberica-canary [07:49:25] (03PS1) 10Slyngshede: Netbox: Temporarily remove Netbox alerting [alerts] - 10https://gerrit.wikimedia.org/r/1135673 [07:49:30] RESOLVED: LibericaDiffFPCheck: Liberica instance lvs1013:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [07:52:39] (03CR) 10Vgutierrez: [C:03+2] liberica,hiera: Add IPv6 endpoints for prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1127907 (https://phabricator.wikimedia.org/T379238) (owner: 10Vgutierrez) [07:54:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [07:56:10] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-canary [07:56:42] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-canary [07:58:59] (03PS1) 10DCausse: cirrussearch: Fix CirrusBackendErrorRateTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1135675 [08:00:13] (03CR) 10CI reject: [V:04-1] cirrussearch: Fix CirrusBackendErrorRateTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1135675 (owner: 10DCausse) [08:01:26] (03PS2) 10DCausse: cirrussearch: Fix CirrusBackendErrorRateTooHigh [alerts] - 
10https://gerrit.wikimedia.org/r/1135675 [08:01:50] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting A:liberica-canary [08:02:12] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.upgrade (exit_code=1) restarting A:liberica-canary [08:02:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2085-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [08:04:55] (03CR) 10Brouberol: [C:03+1] cirrussearch: Fix CirrusBackendErrorRateTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1135675 (owner: 10DCausse) [08:05:38] (03CR) 10DCausse: [C:03+2] cirrussearch: Fix CirrusBackendErrorRateTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1135675 (owner: 10DCausse) [08:05:43] FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:06:49] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10728784 (10VRiley-WMF) [08:06:52] (03Merged) 10jenkins-bot: cirrussearch: Fix CirrusBackendErrorRateTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1135675 (owner: 10DCausse) [08:07:04] (03PS1) 10Volans: netbox: fix reports runs [puppet] - 10https://gerrit.wikimedia.org/r/1135677 [08:09:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [08:10:24] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1135677 (owner: 10Volans) [08:13:39] (03CR) 10Volans: [C:03+2] netbox: fix reports runs [puppet] - 10https://gerrit.wikimedia.org/r/1135677 (owner: 10Volans) [08:13:48] (03CR) 10Filippo Giunchedi: "The alerts get deployed to prometheus (not thanos) so each sees its own site (i.e. one netbox host per site). AFAICS both hosts export the" [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [08:14:40] (03CR) 10Elukey: [C:03+2] Add citoid-ingress CNAMEs for the Istio ingress [dns] - 10https://gerrit.wikimedia.org/r/1135433 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [08:16:02] !log elukey@dns1004 START - running authdns-update [08:16:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10728838 (10VRiley-WMF) [08:16:32] PROBLEM - OpenSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 52, active_shards: 84, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 19, delayed_unassigned_shards: 0, number_of_pendin [08:16:33] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 78.50467289719626 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:16:33] PROBLEM - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: 
False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 52, active_shards: 84, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 19, delayed_unassigned_shards: 0, number_of_pendin [08:16:33] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 78.50467289719626 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:16:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10728839 (10VRiley-WMF) [08:17:14] PROBLEM - OpenSearch health check for shards on 9200 on relforge1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 52, active_shards: 84, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 19, delayed_unassigned_shards: 0, number_of_pendin [08:17:14] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 78.50467289719626 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:17:14] PROBLEM - OpenSearch health check for shards on 9200 on relforge1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f85d26a41c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [08:17:14] org/wiki/Search%23Administration [08:17:36] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:18:28] !log elukey@dns1004 END - running authdns-update [08:19:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10728847 (10MatthewVernon) [08:20:02] !log upload liberica 0.13 to bookworm-wikimedia (apt.wm.o) [08:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:13] FIRING: [2x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:25] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-canary [08:22:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [08:22:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-canary [08:24:02] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:24:40] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:26:48] 
PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:28:54] (03CR) 10Elukey: [C:03+2] services: add extra fqdn to the citoid's ingress config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135449 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [08:30:14] RECOVERY - OpenSearch health check for shards on 9200 on relforge1009 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 53, active_shards: 91, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 13, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [08:30:14] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.04672897196261 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:30:14] RECOVERY - OpenSearch health check for shards on 9200 on relforge1008 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 53, active_shards: 91, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 13, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [08:30:14] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.04672897196261 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:30:32] RECOVERY - OpenSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 53, active_shards: 91, 
relocating_shards: 0, initializing_shards: 3, unassigned_shards: 13, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [08:30:32] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.04672897196261 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:30:32] RECOVERY - OpenSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 53, active_shards: 91, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 13, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [08:30:32] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.04672897196261 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:31:35] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10728883 (10VRiley-WMF) Collecting a report on this and will update when I have a ticket with Dell. 
[08:32:13] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:35:30] FIRING: [3x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [08:40:10] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/citoid: sync [08:40:13] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: sync [08:40:29] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: sync [08:40:30] FIRING: [5x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [08:40:33] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: sync [08:41:18] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: sync [08:41:21] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: sync [08:41:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10728916 (10cmooney) >>! In T387145#10726404, @Vgutierrez wrote: > This isn't a big issue at the moment given that we don't need `L2` adjacency anymore. (also cp hosts are on rows A to D... 
[08:44:47] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10728918 (10Marostegui) Thank you - you can reboot and power off the host as much as you need, it's not accessible, data is corrupted and it's out of production [08:45:30] FIRING: [8x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [08:46:40] (03CR) 10Elukey: "@mvolz@wikimedia.org Hi! I am moving citoid to be behind the Istio K8s gateway, this means the following:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [08:46:48] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:47:13] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:22] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10728920 (10VRiley-WMF) Dell work order number is 208331710. 
Currently this is under investigation [08:50:30] FIRING: [11x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [08:50:33] (03CR) 10Peter Fischer: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1135430 (https://phabricator.wikimedia.org/T388549) (owner: 10DCausse) [08:51:09] (03PS4) 10DCausse: opensearch: allow setting LD_LIBRARY_PATH [puppet] - 10https://gerrit.wikimedia.org/r/1135430 (https://phabricator.wikimedia.org/T388549) [08:51:10] (03PS2) 10DCausse: cirrussearch: enable knn native lib [puppet] - 10https://gerrit.wikimedia.org/r/1135441 (https://phabricator.wikimedia.org/T388549) [08:52:02] (03CR) 10Btullis: "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [08:52:15] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135430 (https://phabricator.wikimedia.org/T388549) (owner: 10DCausse) [08:52:20] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135441 (https://phabricator.wikimedia.org/T388549) (owner: 10DCausse) [08:53:31] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7003.magru.wmnet} and A:liberica [08:54:19] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7003.magru.wmnet} and A:liberica [08:55:30] FIRING: [15x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [08:55:43] ^^ that's me, totally expected [08:57:11] (03PS5) 10Slyngshede: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 [08:57:29] !log vgutierrez@cumin1002 START - 
Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-magru and not P{lvs7003.magru.wmnet} and A:liberica [08:57:43] (03CR) 10Slyngshede: "I think we should do the global then and just get the one alert. I've updated the tag." [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [08:58:24] (03CR) 10Mvolz: [C:03+1] "Go ahead!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [08:58:31] (03CR) 10Fabfur: [C:03+1] haproxy: enable requestctl rules everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1135431 (owner: 10Giuseppe Lavagetto) [08:59:07] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-magru and not P{lvs7003.magru.wmnet} and A:liberica [08:59:18] (03CR) 10Btullis: [C:03+2] opensearch: allow setting LD_LIBRARY_PATH [puppet] - 10https://gerrit.wikimedia.org/r/1135430 (https://phabricator.wikimedia.org/T388549) (owner: 10DCausse) [09:00:03] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-ulsfo and A:liberica [09:00:30] FIRING: [15x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:02:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-ulsfo and A:liberica [09:02:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10728943 (10cmooney) a:03Jhancock.wm [09:03:54] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-eqsin and A:liberica [09:05:30] FIRING: [15x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - 
https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:07:15] (03CR) 10Cathal Mooney: [C:03+2] Cloudsw: adjust routing-policies to reflect change to IBGP [homer/public] - 10https://gerrit.wikimedia.org/r/1134234 (https://phabricator.wikimedia.org/T389958) (owner: 10Cathal Mooney) [09:07:46] (03Merged) 10jenkins-bot: Cloudsw: adjust routing-policies to reflect change to IBGP [homer/public] - 10https://gerrit.wikimedia.org/r/1134234 (https://phabricator.wikimedia.org/T389958) (owner: 10Cathal Mooney) [09:07:48] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-eqsin and A:liberica [09:08:41] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-drmrs and A:liberica [09:10:30] FIRING: [15x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:11:00] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-drmrs and A:liberica [09:11:12] (03CR) 10Vgutierrez: benthos: install benthos on all cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:11:35] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-esams and A:liberica [09:13:44] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-esams and A:liberica [09:13:58] LibericaStaleConfig alert should recover soon(TM) [09:15:30] RESOLVED: [14x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig 
[09:15:56] (03CR) 10Vgutierrez: [C:03+2] wmflib,liberica: Add support for DNS healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1129326 (https://phabricator.wikimedia.org/T389211) (owner: 10Vgutierrez) [09:16:08] (03CR) 10Vgutierrez: [C:03+2] liberica: Allow configuring UDP services [puppet] - 10https://gerrit.wikimedia.org/r/1128892 (https://phabricator.wikimedia.org/T389210) (owner: 10Vgutierrez) [09:17:32] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet [09:17:38] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1053.eqiad.wmnet [09:18:16] (03Abandoned) 10Btullis: Configure the ceph-csi-rbd storageclass to retain PVs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134200 (https://phabricator.wikimedia.org/T391087) (owner: 10Btullis) [09:22:10] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10728997 (10VRiley-WMF) After talking to Dell about this ticket, they are escalating this to a higher tier of support. Will update when I hear back from them. Hopefully we can get this resolved once and for... 
[09:23:26] (03PS4) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [09:23:45] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1053.eqiad.wmnet [09:24:17] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet [09:25:30] (03CR) 10Fabfur: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135431 (owner: 10Giuseppe Lavagetto) [09:25:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:26:37] (03CR) 10Brouberol: "You can also clean up all sections starting with" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [09:27:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:32:31] !log decom 2x10G lag from cloudsw1-c8-eqiad to asw2-b-eqiad T391489 [09:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:34] T391489: Decom eqiad row B <-> cloudsw links - https://phabricator.wikimedia.org/T391489 [09:33:40] jouncebot: nowandnext [09:33:40] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [09:33:40] In 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1000) [09:34:00] (03CR) 10Brouberol: [C:03+2] airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 
10Brouberol) [09:34:04] (03CR) 10Brouberol: [C:03+2] airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) (owner: 10Brouberol) [09:34:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom eqiad row B <-> cloudsw links - https://phabricator.wikimedia.org/T391489#10729017 (10cmooney) Indeed there is nothing there in row B on any of those vlans. ` cmooney@cloudsw1-c8-eqiad> show ethernet-switching table interface ae1... [09:34:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom eqiad row B <-> cloudsw links - https://phabricator.wikimedia.org/T391489#10729018 (10cmooney) p:05Triage→03Medium [09:36:19] (03Merged) 10jenkins-bot: airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [09:36:22] (03Merged) 10jenkins-bot: airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) (owner: 10Brouberol) [09:37:13] FIRING: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:37:14] (03CR) 10Clément Goubert: alertmanager: add task receivers for 4 teams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [09:37:33] (03PS5) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [09:37:55] (03CR) 10Elukey: [C:03+2] services: point rest-gateway to the ingress 
citoid endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [09:38:22] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:38:57] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: sync [09:39:03] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: sync [09:39:07] (03CR) 10Clément Goubert: [V:03+2] php: mwscript bugfix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135379 (https://phabricator.wikimedia.org/T387208) (owner: 10Clément Goubert) [09:39:09] (03CR) 10Clément Goubert: [V:03+2 C:03+2] php: mwscript bugfix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135379 (https://phabricator.wikimedia.org/T387208) (owner: 10Clément Goubert) [09:40:06] !log Rebuilding php base images to pick up 1135379 - T387208 [09:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:09] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [09:41:30] FIRING: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:42:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:42:13] RESOLVED: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:44:04] !log cgoubert@deploy1003 Started scap sync-world: Rebuilding mediawiki images to pick up new base images 1135379 - T387208 [09:45:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:45:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:50:25] (03PS6) 10Filippo Giunchedi: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [09:50:33] !log fabfur@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading A:liberica [09:50:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:53:11] (03PS7) 10Filippo Giunchedi: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [09:55:04] !log fabfur@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading A:liberica [09:55:10] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:55:16] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:56:22] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/9eea39c718e6bfc887c06dff129f5a8dfdbc25fa650ec51c535e9e01d4e9215d/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [09:56:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - 
https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:56:33] hmmm [09:56:56] claime ^^ was resolved due to the cookbook run [09:56:57] (03CR) 10Filippo Giunchedi: "Please see PS7 which is global and will avoid duplicates" [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [09:57:21] fabfur: I'm hmming at the disk space on deploy1003 because I'm in the middle of a full image build there [09:57:34] ack! [09:58:11] but it looks like a false positive [09:58:25] the dir exists, for some reason the prom exporter can't access it [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1000) [10:02:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [10:09:23] (03PS1) 10Hnowlan: rest-gateway: enable ingress at route-level for citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135683 (https://phabricator.wikimedia.org/T391457) [10:09:59] (03PS8) 10Filippo Giunchedi: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [10:12:48] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10729123 (10OKarakaya-WMF) hi @Jelto @achou and I... 
[10:13:00] (03CR) 10Elukey: [C:03+1] rest-gateway: enable ingress at route-level for citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135683 (https://phabricator.wikimedia.org/T391457) (owner: 10Hnowlan) [10:15:06] (03CR) 10Elukey: [C:03+2] rest-gateway: enable ingress at route-level for citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135683 (https://phabricator.wikimedia.org/T391457) (owner: 10Hnowlan) [10:18:04] (03CR) 10Slyngshede: [C:03+2] Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [10:19:15] !log cgoubert@deploy1003 sync-world aborted: Rebuilding mediawiki images to pick up new base images 1135379 - T387208 (duration: 35m 23s) [10:19:18] (03Merged) 10jenkins-bot: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [10:19:18] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [10:19:35] !log cgoubert@deploy1003 Started scap sync-world: Rebuilding mediawiki images to pick up new base images 1135379 - T387208 [10:20:49] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: sync [10:21:01] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: sync [10:23:46] !log phedenskog@deploy1003 Started deploy [performance/navtiming@94fa387]: Disable navtiming performance metrics in Graphite [10:23:55] !log phedenskog@deploy1003 Finished deploy [performance/navtiming@94fa387]: Disable navtiming performance metrics in Graphite (duration: 00m 50s) [10:25:09] (03CR) 10Federico Ceratto: [C:03+1] hiera: Add zarcillo service to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135382 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [10:25:11] (03CR) 10Federico Ceratto: [C:03+2] hiera: Add zarcillo service to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135382 
(https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [10:26:48] !log rest-gateway from now on calls citoid on its ingress endpoint [10:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:19] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: sync [10:28:25] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: sync [10:31:56] (03PS1) 10Tiziano Fogli: perf/real_user_monitoring: add rec rules [puppet] - 10https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) [10:36:22] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [10:42:50] augh it rolled back [10:43:20] well since the image was rebuilt, I can just do a backport that I need and it should pick it up [10:44:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133935 (https://phabricator.wikimedia.org/T390972) (owner: 10Clément Goubert) [10:45:03] (03Merged) 10jenkins-bot: MWScript.php: exit code on mesh, longer timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133935 (https://phabricator.wikimedia.org/T390972) (owner: 10Clément Goubert) [10:45:56] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] [10:46:00] T390972: Restart CronJobs on failure of the service mesh - https://phabricator.wikimedia.org/T390972 [10:46:00] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [10:46:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker2048:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:46:47] (03PS1) 10Ladsgroup: Bump thumbnail steps to 85% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135688 (https://phabricator.wikimedia.org/T360589) [10:49:20] claime: please let me know when you're done <3 [10:49:27] Amir1: sure [10:50:13] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: codfw: route IPv4-only subnet [puppet] - 10https://gerrit.wikimedia.org/r/1135689 (https://phabricator.wikimedia.org/T391325) [10:51:15] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: codfw: route IPv4-only subnet [puppet] - 10https://gerrit.wikimedia.org/r/1135689 (https://phabricator.wikimedia.org/T391325) [10:51:35] (03PS1) 10Lucas Werkmeister (WMDE): Revert "logspam: Consolidate CurlFactory cURL errors" [puppet] - 10https://gerrit.wikimedia.org/r/1135690 [10:52:01] (03CR) 10CI reject: [V:04-1] Revert "logspam: Consolidate CurlFactory cURL errors" [puppet] - 10https://gerrit.wikimedia.org/r/1135690 (owner: 10Lucas Werkmeister (WMDE)) [10:52:36] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135689 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [10:53:42] (03CR) 10Lucas Werkmeister (WMDE): "Optional suggestion." 
[puppet] - 10https://gerrit.wikimedia.org/r/1135690 (owner: 10Lucas Werkmeister (WMDE)) [10:54:08] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:54:13] T390972: Restart CronJobs on failure of the service mesh - https://phabricator.wikimedia.org/T390972 [10:54:13] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [10:54:24] (03PS2) 10Lucas Werkmeister (WMDE): Revert "logspam: Consolidate CurlFactory cURL errors" [puppet] - 10https://gerrit.wikimedia.org/r/1135690 (https://phabricator.wikimedia.org/T371633) [10:55:16] !log cgoubert@deploy1003 cgoubert: Continuing with sync [10:56:25] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker2048:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:11] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] (duration: 22m 15s) [11:08:15] T390972: Restart CronJobs on failure of the service mesh - https://phabricator.wikimedia.org/T390972 [11:08:15] Amir1: done [11:08:15] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [11:08:19] you can go ahead [11:08:24] Thank you! 
[11:08:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135688 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [11:09:45] (03Merged) 10jenkins-bot: Bump thumbnail steps to 85% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135688 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [11:10:09] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135688|Bump thumbnail steps to 85% (T360589)]] [11:10:12] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:12:53] (03CR) 10Clément Goubert: [C:03+2] mwcron: Allow setting ttlsecondsafterfinished [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [11:15:52] PROBLEM - Restbase root url on restbase1041 is CRITICAL: connect to address 10.64.48.40 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [11:15:58] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1135688|Bump thumbnail steps to 85% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:16:01] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:17:41] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:17:49] Cool I broke the cronjobs [11:17:56] (only mw-cron so it's ok) [11:18:05] but I'll have to do an image rebuild >< [11:26:24] (03PS1) 10Clément Goubert: fpm-multiversion-base: mwscript bugfix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135694 [11:26:30] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135688|Bump thumbnail steps to 85% (T360589)]] (duration: 16m 20s) [11:26:33] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:27:10] (03CR) 10Clément Goubert: [V:03+2 
C:03+2] fpm-multiversion-base: mwscript bugfix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135694 (owner: 10Clément Goubert) [11:27:11] (03CR) 10Klausman: [C:03+1] changeprop: add liftwing RRLA source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135153 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [11:27:48] jouncebot: nowandnext [11:27:48] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [11:27:48] In 0 hour(s) and 32 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1200) [11:28:54] !log Rebuilding php base images to pick up 1135694 - T387208 [11:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:57] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [11:32:06] !log cgoubert@deploy1003 Started scap sync-world: Rebuilding mediawiki images to pick up new base images 1135694 - T387208 [11:32:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.122s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:36:32] (03CR) 10Hnowlan: [C:03+1] changeprop: add liftwing RRLA source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135153 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [11:37:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.122s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:39:39] (03CR) 10Kevin Bazira: [C:03+2] changeprop: add liftwing RRLA source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135153 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [11:41:07] (03Merged) 10jenkins-bot: changeprop: add liftwing RRLA source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135153 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [11:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:01] (03PS2) 10Federico Ceratto: Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) [11:45:32] (03CR) 10Cathal Mooney: [C:03+1] cloudgw: codfw: route IPv4-only subnet [puppet] - 10https://gerrit.wikimedia.org/r/1135689 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [11:46:22] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/93f8eff6bd73716e7141eb29978dd3904fa659ca2bf6ce840715169c87d1dd63/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [11:48:08] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: codfw: route IPv4-only subnet [puppet] - 10https://gerrit.wikimedia.org/r/1135689 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [11:48:52] (03CR) 10AikoChou: [C:03+1] ml-services: update 
RRLA output stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135054 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [11:50:16] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet [11:50:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1159.eqiad.wmnet with reason: Maintenance [11:50:31] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet [11:50:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T391056)', diff saved to https://phabricator.wikimedia.org/P74827 and previous config saved to /var/cache/conftool/dbconfig/20250410-115037-fceratto.json [11:50:55] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:52:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:53:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T391056)', diff saved to https://phabricator.wikimedia.org/P74828 and previous config saved to /var/cache/conftool/dbconfig/20250410-115328-fceratto.json [11:53:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:53:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [11:54:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [11:55:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:55:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [11:56:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:56:42] (03PS1) 10Effie Mouzeli: php8.1: minor fixes 
[puppet] - 10https://gerrit.wikimedia.org/r/1135698 (https://phabricator.wikimedia.org/T391452) [11:56:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:56:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:57:16] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet [11:57:28] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet [11:57:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:57:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [11:58:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [11:58:37] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135698 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [11:58:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [11:59:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [11:59:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1200) [12:00:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [12:00:22] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update RRLA output stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135054 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [12:01:11] !log btullis@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host an-worker1169 [12:01:31] !log btullis@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1169 [12:01:47] (03Merged) 10jenkins-bot: ml-services: update RRLA output stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135054 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [12:02:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2085-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:03:21] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:03:45] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [12:34:54] (03CR) 10AOkoth: [C:03+2] site: revert releases to production role [puppet] - 10https://gerrit.wikimedia.org/r/1135444 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [12:35:55] RECOVERY - SSH on stat1009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:36:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet [12:37:10] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-wf2001.codfw.wmnet [12:38:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T391056)', diff saved to https://phabricator.wikimedia.org/P74832 and previous config saved to /var/cache/conftool/dbconfig/20250410-123850-fceratto.json [12:38:54] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:39:06] !log 
fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:39:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:39:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T391056)', diff saved to https://phabricator.wikimedia.org/P74833 and previous config saved to /var/cache/conftool/dbconfig/20250410-123931-fceratto.json [12:41:12] (03PS2) 10Filippo Giunchedi: perf/real_user_monitoring: add rec rules [puppet] - 10https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: 10Tiziano Fogli) [12:41:50] (03CR) 10Filippo Giunchedi: [C:03+1] "I made a small change to keep the "base" metric name unchanged (between : :) and move the anonymous reference at the end, please let me kn" [puppet] - 10https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: 10Tiziano Fogli) [12:42:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T391056)', diff saved to https://phabricator.wikimedia.org/P74834 and previous config saved to /var/cache/conftool/dbconfig/20250410-124222-fceratto.json [12:43:55] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2001.codfw.wmnet [12:43:59] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-wf2002.codfw.wmnet [12:45:05] !log reedy@deploy1003 Synchronized wmf-config/interwiki-labs.php: Update! 
(duration: 14m 07s) [12:47:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:34] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug1002.eqiad.wmnet with reason: host reimage [12:51:06] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2002.codfw.wmnet [12:52:10] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug1002.eqiad.wmnet with reason: host reimage [12:56:24] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [12:56:53] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [12:56:54] (03CR) 10Jforrester: [C:03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135521 (owner: 10Reedy) [12:57:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P74836 and previous config saved to /var/cache/conftool/dbconfig/20250410-125729-fceratto.json [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1300). [13:00:04] abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:02:30] o/ [13:02:38] Couple of minutes late for the deployment. [13:04:15] (03CR) 10Dwisehaupt: "Whoops, sorry about that. That was probably an accidental click on my part. 
Sorry for any additional work I caused and I'll look into what" [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [13:05:03] (03PS1) 10Volans: ganeti-netbox-sync: fix puppetdb import [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135713 [13:06:58] hi, is anyone around to do the backports? [13:08:36] (03PS1) 10Federico Ceratto: pool.py: In dry-run mode do not monitor connection drain [cookbooks] - 10https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) [13:08:36] (03CR) 10Federico Ceratto: "Small speedup, tested with:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) (owner: 10Federico Ceratto) [13:09:08] (03PS1) 10Arturo Borrero Gonzalez: network: data: fix entry prefix to include 'cloud-instances' [puppet] - 10https://gerrit.wikimedia.org/r/1135715 (https://phabricator.wikimedia.org/T391325) [13:10:08] (03PS1) 10Slyngshede: Docker: Update Docker build [software/bitu] - 10https://gerrit.wikimedia.org/r/1135716 [13:10:31] (03PS1) 10Btullis: Temporarily put an-worker1169 back into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1135717 (https://phabricator.wikimedia.org/T390169) [13:11:49] (03CR) 10Btullis: [C:03+2] Temporarily put an-worker1169 back into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1135717 (https://phabricator.wikimedia.org/T390169) (owner: 10Btullis) [13:12:05] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135715 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [13:12:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P74837 and previous config saved to /var/cache/conftool/dbconfig/20250410-131237-fceratto.json [13:12:49] !log klausman@deploy1003 helmfile [eqiad] START 
helmfile.d/services/eventgate-main: sync [13:13:20] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [13:13:56] (03CR) 10Bking: [V:04-1] "Do not merge until after the row A hosts are fully reimaged." [puppet] - 10https://gerrit.wikimedia.org/r/1134755 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:17:37] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (releases2003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:18:08] o/ [13:18:12] abijeet: I can deploy [13:18:24] Lucas_WMDE, thanks! [13:18:51] (03PS6) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [13:19:06] (03CR) 10Slyngshede: [C:03+2] Docker: Update Docker build [software/bitu] - 10https://gerrit.wikimedia.org/r/1135716 (owner: 10Slyngshede) [13:19:12] huh, we have two diffConfig builds now [13:19:34] (03CR) 10Volans: [C:03+1] "LGTM, optional alternatives inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) (owner: 10Federico Ceratto) [13:19:48] also, the mwdebug1002 SSH host key changed? 
o_O [13:20:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135337 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [13:20:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135340 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [13:20:49] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1169.eqiad.wmnet with OS bullseye [13:20:57] hmm, https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/mwdebug1002.eqiad.wmnet hasn’t been touched in four years 🤔 [13:21:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10729653 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1... 
[13:21:27] (03Merged) 10jenkins-bot: AX: Enable Quick Surveys extension on Asturian and Lombard wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135337 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [13:21:31] (03Merged) 10jenkins-bot: AX: Enable entry-points on Asturian and Lombard wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135340 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [13:21:36] (03Merged) 10jenkins-bot: Docker: Update Docker build [software/bitu] - 10https://gerrit.wikimedia.org/r/1135716 (owner: 10Slyngshede) [13:22:00] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1135337|AX: Enable Quick Surveys extension on Asturian and Lombard wiki (T390023)]], [[gerrit:1135340|AX: Enable entry-points on Asturian and Lombard wiki (T390023)]] [13:22:03] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023 [13:22:30] ok after running wmf-update-known-hosts-production, I can ssh into mwdebug1002 again [13:22:38] presumably that means the wikitech page is outdated :/ [13:23:26] CC effie just in case this is related to T391452, I guess… [13:23:26] T391452: Migrate mwdebug* hosts to PHP8.1 - https://phabricator.wikimedia.org/T391452 [13:26:38] !log expand LVs on prometheus instances (k8s-dse) [13:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:09] !log lucaswerkmeister-wmde@deploy1003 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1135337|AX: Enable Quick Surveys extension on Asturian and Lombard wiki (T390023)]], [[gerrit:1135340|AX: Enable entry-points on Asturian and Lombard wiki (T390023)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:27:11] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023 [13:27:28] abijeet: please test :) [13:27:33] Lucas_WMDE, on it [13:27:44] 
FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:27:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T391056)', diff saved to https://phabricator.wikimedia.org/P74838 and previous config saved to /var/cache/conftool/dbconfig/20250410-132744-fceratto.json [13:27:48] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:27:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: Maintenance [13:27:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T391056)', diff saved to https://phabricator.wikimedia.org/P74839 and previous config saved to /var/cache/conftool/dbconfig/20250410-132756-fceratto.json [13:28:05] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwdebug1002.eqiad.wmnet with OS bullseye [13:28:57] (03CR) 10Xcollazo: [C:03+1] dumps enterprise copy update [puppet] - 10https://gerrit.wikimedia.org/r/1135522 (owner: 10Creynolds) [13:29:10] (03CR) 10Xcollazo: [C:03+1] "CC @btullis@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1135522 (owner: 10Creynolds) [13:30:17] (03CR) 10Fabfur: benthos: install benthos on all cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [13:30:37] ok, https://phabricator.wikimedia.org/T391452#10729673 confirms that mwdebug1002 got a new SSH key [13:30:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T391056)', diff saved to https://phabricator.wikimedia.org/P74840 and previous config saved to 
/var/cache/conftool/dbconfig/20250410-133046-fceratto.json [13:30:57] Lucas_WMDE, looks good. [13:31:11] !log lucaswerkmeister-wmde@deploy1003 abi, lucaswerkmeister-wmde: Continuing with sync [13:31:12] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! Thanks." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135713 (owner: 10Volans) [13:31:14] ok, thanks! [13:31:26] PROBLEM - Restbase root url on restbase1028 is CRITICAL: connect to address 10.64.0.208 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [13:33:08] (03CR) 10Fabfur: [C:03+2] haproxy: enable requestctl rules everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1135431 (owner: 10Giuseppe Lavagetto) [13:34:27] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1005.eqiad.wmnet [13:34:28] !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135431 to enable haproxy requestctl rules everywhere (T370745) [13:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:32] T370745: Integrate requestctl haproxy rules into our TLS terminator - https://phabricator.wikimedia.org/T370745 [13:36:24] Lucas_WMDE: is logstash working for you? it seems as of ~1h ago, there are no more messages from mediawiki [13:36:32] nope, same here [13:36:38] logspam-watch still has messages though [13:36:48] ok, so udp2log isn't affected [13:36:55] 3 in the last 10 minutes, which sounds plausible enough [13:37:27] hm.. 
those may be from mw-api-int [13:37:42] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135337|AX: Enable Quick Surveys extension on Asturian and Lombard wiki (T390023)]], [[gerrit:1135340|AX: Enable entry-points on Asturian and Lombard wiki (T390023)]] (duration: 15m 42s) [13:37:44] Lucas_WMDE: which host do you run that from [13:37:45] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023 [13:37:50] within the last 60 minutes there’s a huge spike of “could not enqueue jobs” [13:37:57] (03PS1) 10Jforrester: WikifunctionsClientUsageUpdateJob: Don't pass a heavy Title in, just the scalars [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135723 (https://phabricator.wikimedia.org/T391533) [13:37:58] Krinkle: mwlog1002.eqiad.wmnet [13:39:57] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [13:40:04] (03PS1) 10Brouberol: airflow-test-k8s: increase the reserved resources for the airflow-test-k8s scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135724 (https://phabricator.wikimedia.org/T391556) [13:41:20] (03CR) 10Jelto: "Thanks for the cookbook! I left a few comments in-line." [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [13:41:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10729720 (10Jhancock.wm) @elukey old disk restored! 
[13:41:39] (03PS1) 10Jforrester: Set WikiLambdaClientTargetAPI default value to protocol-relative, so HSTS doesn't sting us [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135725 (https://phabricator.wikimedia.org/T391534) [13:42:24] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 449143808 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:43:06] Lucas_WMDE, thanks for your help! [13:43:24] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 91936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:44:13] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [13:44:28] np [13:44:35] !log UTC afternoon backport+config window done [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:17] Lucas_WMDE: yeah, I reimaged it again btw [13:45:28] (03CR) 10Volans: "As discussed on IRC I did a quick pass for the cookbook-only bits, I'll leave the logic and details to your team." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [13:45:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P74841 and previous config saved to /var/cache/conftool/dbconfig/20250410-134553-fceratto.json [13:46:08] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [13:46:08] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:46:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1005.eqiad.wmnet [13:46:55] effie: I guess I got lucky and ran wmf-update-known-hosts-production after the second reimage so I didn’t get the error again ^^ [13:47:05] haha [13:47:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:48:10] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw2278, mw2279 - https://phabricator.wikimedia.org/T391001#10729755 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:48:11] (03CR) 10Andrew Bogott: [C:03+2] Remove final traces of cloudcontrol1005.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1135136 (https://phabricator.wikimedia.org/T391413) (owner: 10Andrew Bogott) [13:49:24] ah right, I can see another key change command in yesterday’s journal [13:49:35] 10ops-eqiad, 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudcontrol1005.eqiad.wmnet - https://phabricator.wikimedia.org/T391413#10729762 (10Andrew) [13:49:36] (wmf-update-known-hosts-production prints a diff of the old and new keys, which is quite nice) [13:49:38] !log jiji@cumin1002 conftool action : set/pooled=yes; 
selector: name=mwdebug1002.eqiad.wmnet [13:50:07] (03PS7) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [13:50:10] effie: do you know how https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/mwdebug1002.eqiad.wmnet gets updated? because that was my first idea for checking whether the key change was expected or not [13:50:31] (though I’m not 100% sure if it was up-to-date before, because the old fingerprint in the journal is in a different format than the wiki page) [13:50:39] (03PS8) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [13:51:04] (03CR) 10Volans: [C:03+2] ganeti-netbox-sync: fix puppetdb import [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135713 (owner: 10Volans) [13:51:56] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2142'] [13:52:06] (03CR) 10Ssingh: geo-maps: add mapping for Peru (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1135469 (owner: 10CDobbins) [13:52:08] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2142'] [13:52:50] Lucas_WMDE: I will do so in a bit, the reimaging finished like half an hour ago :p [13:52:59] alright, thanks :D [13:53:03] sorry to bother you ^^ [13:53:09] (03Merged) 10jenkins-bot: ganeti-netbox-sync: fix puppetdb import [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135713 (owner: 10Volans) [13:53:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10729811 (10Jhancock.wm) starting with firmware updates. hopefully we'll get a more concise error. 
[13:54:50] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:55:03] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:55:06] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:55:35] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:56:52] !log jiji@cumin1002 conftool action : set/pooled=inactive; selector: name=mwdebug2002.codfw.wmnet [13:57:06] (03PS5) 10Ssingh: P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [13:59:02] (03PS6) 10Ssingh: P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [13:59:59] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:01:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P74842 and previous config saved to /var/cache/conftool/dbconfig/20250410-140100-fceratto.json [14:02:25] (03PS1) 10Nik Gkountas: Catalog ContentTranslation tables [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) [14:03:25] (03PS1) 10Ssingh: package_builder: add packages for nginx build [puppet] - 10https://gerrit.wikimedia.org/r/1135731 (https://phabricator.wikimedia.org/T205378) [14:03:48] (03CR) 10CI reject: [V:04-1] package_builder: add packages for nginx build [puppet] - 10https://gerrit.wikimedia.org/r/1135731 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) 
[14:04:31] (PS2) Ssingh: package_builder: add packages for nginx build [puppet] - https://gerrit.wikimedia.org/r/1135731 (https://phabricator.wikimedia.org/T205378) [14:06:06] (PS1) Ssingh: Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) [14:06:18] (PS1) Xcollazo: Temporarily exclude mediawikiwiki from the dumps due to multiple failures that don't allow other dumps to move forward. [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) [14:06:46] (CR) CI reject: [V:-1] Temporarily exclude mediawikiwiki from the dumps due to multiple failures that don't allow other dumps to move forward. [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) (owner: Xcollazo) [14:07:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [14:09:17] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-misc1001.eqiad.wmnet [14:09:39] (PS2) Xcollazo: Temporarily exclude mediawikiwiki from the dumps [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) [14:09:58] (CR) Xcollazo: "@btullis@wikimedia.org can you add the proper `Host:` to the commit description so that we can run PPC?"
[puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) (owner: Xcollazo) [14:10:22] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:14:03] (PS1) Effie Mouzeli: switch mwdebug2002 to php8.1 [puppet] - https://gerrit.wikimedia.org/r/1135736 (https://phabricator.wikimedia.org/T391452) [14:14:26] (CR) CI reject: [V:-1] switch mwdebug2002 to php8.1 [puppet] - https://gerrit.wikimedia.org/r/1135736 (https://phabricator.wikimedia.org/T391452) (owner: Effie Mouzeli) [14:14:44] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1001.eqiad.wmnet [14:14:47] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-misc1002.eqiad.wmnet [14:15:51] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2074 to cirrussearch2074 [14:16:05] (PS2) Effie Mouzeli: switch mwdebug2002 to php8.1 [puppet] - https://gerrit.wikimedia.org/r/1135736 (https://phabricator.wikimedia.org/T391452) [14:16:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T391056)', diff saved to https://phabricator.wikimedia.org/P74843 and previous config saved to /var/cache/conftool/dbconfig/20250410-141608-fceratto.json [14:16:11] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:16:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:16:14] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:16:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T391056)', diff saved to
https://phabricator.wikimedia.org/P74844 and previous config saved to /var/cache/conftool/dbconfig/20250410-141619-fceratto.json [14:16:58] (PS1) Volans: ganeti-netbox-sync: skip puppetdb import [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135737 [14:18:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2142'] [14:18:16] Lucas_WMDE: turns out I need superpowers to update the fingerprints :p [14:18:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T391056)', diff saved to https://phabricator.wikimedia.org/P74845 and previous config saved to /var/cache/conftool/dbconfig/20250410-141845-fceratto.json [14:19:09] effie: that’s why I phrased it as “do you know how it gets updated” instead of “are you going to” xP [14:19:15] I didn’t know if you even had the superpowers [14:19:29] (I think I asked for them a few years ago but it was declined, quite sensibly tbh) [14:19:40] Lucas_WMDE: I am sorry I didnt [14:20:13] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1002.eqiad.wmnet [14:20:21] ah, contentadmin is the user group apparently T216126 [14:20:22] T216126: Requesting contentadmin access for 'Lucas Werkmeister (WMDE)' on Wikitech - https://phabricator.wikimedia.org/T216126 [14:20:41] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2074 to cirrussearch2074 - bking@cumin2002" [14:20:54] (CR) Cathal Mooney: [C:+1] "LGTM. I guess we could drop the equivalent on line 213 to reduce the lines of code, at the cost of wasted cycles."
[software/netbox-extras] - https://gerrit.wikimedia.org/r/1135737 (owner: Volans) [14:21:00] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2074 to cirrussearch2074 - bking@cumin2002" [14:21:00] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:01] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2074 [14:21:12] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2074 [14:21:43] !log stop curator_actions_cluster_wide.service on logging-sd1001 - forcemerge causing kafka lag [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2074 to cirrussearch2074 [14:23:04] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2074.codfw.wmnet with OS bullseye [14:23:15] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2074 [14:24:41] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:25:50] (CR) Arturo Borrero Gonzalez: [C:+2] network: data: fix entry prefix to include 'cloud-instances' [puppet] - https://gerrit.wikimedia.org/r/1135715 (https://phabricator.wikimedia.org/T391325) (owner: Arturo Borrero Gonzalez) [14:26:12] (PS2) CDobbins: geo-maps: add mapping for Peru [dns] - https://gerrit.wikimedia.org/r/1135469 [14:28:05] (PS3) CDobbins: geo-maps: add mapping for Peru [dns] - https://gerrit.wikimedia.org/r/1135469 [14:28:46] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2074 - bking@cumin2002" [14:28:52] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2074 - bking@cumin2002" [14:28:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:53] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2074.codfw.wmnet 138.0.192.10.in-addr.arpa 8.3.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:28:56] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2074.codfw.wmnet 138.0.192.10.in-addr.arpa 8.3.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:28:57] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2074 [14:29:19] (CR) Volans: [C:+2] "not really because otherwise the second API call will have no vm_names and in that case django would return all the VMs." [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135737 (owner: Volans) [14:29:47] (PS4) Clément Goubert: alertmanager: add task receivers for 4 teams [puppet] - https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková) [14:31:04] (CR) CDobbins: geo-maps: add mapping for Peru (1 comment) [dns] - https://gerrit.wikimedia.org/r/1135469 (owner: CDobbins) [14:31:24] (Merged) jenkins-bot: ganeti-netbox-sync: skip puppetdb import [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135737 (owner: Volans) [14:31:38] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2074 [14:31:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2074 [14:31:46] (PS1) Cwhite: logstash: move err field mitigation to effective place [puppet] - https://gerrit.wikimedia.org/r/1135740 (https://phabricator.wikimedia.org/T390215) [14:31:49] (CR)
Ssingh: [C:+1] geo-maps: add mapping for Peru [dns] - https://gerrit.wikimedia.org/r/1135469 (owner: CDobbins) [14:31:51] (PS2) Kamila Součková: alertmanager: Route 3 teams' task-severity alerts to Phab [puppet] - https://gerrit.wikimedia.org/r/1135418 (https://phabricator.wikimedia.org/T385709) [14:33:24] (PS3) Clément Goubert: alertmanager: Route 4 teams' task-severity alerts to Phab [puppet] - https://gerrit.wikimedia.org/r/1135418 (https://phabricator.wikimedia.org/T385709) (owner: Kamila Součková) [14:33:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P74847 and previous config saved to /var/cache/conftool/dbconfig/20250410-143352-fceratto.json [14:34:25] (Abandoned) Clément Goubert: alertmanager: add route for task-severity data-persistence alerts [puppet] - https://gerrit.wikimedia.org/r/1135413 (https://phabricator.wikimedia.org/T385709) (owner: Kamila Součková) [14:35:05] (CR) Clément Goubert: alertmanager: add task receivers for 4 teams (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková) [14:36:41] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:36:53] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:36:58] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:37:26] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:37:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2085-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress -
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:37:47] (PS1) Slyngshede: Release version 0.1.11 [software/bitu] - https://gerrit.wikimedia.org/r/1135741 [14:37:57] (PS13) Tiziano Fogli: netbox-hiera: adding pdu type [puppet] - https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) [14:37:58] (PS45) Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [14:37:58] (PS5) Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) [14:38:37] (PS1) Hashar: mediawiki: remove unsupported LimitNOFILESoft directive [puppet] - https://gerrit.wikimedia.org/r/1135743 (https://phabricator.wikimedia.org/T389422) [14:39:51] (PS5) FNegri: openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [14:41:12] (CR) Clément Goubert: [C:+2] alertmanager: Route 4 teams' task-severity alerts to Phab [puppet] - https://gerrit.wikimedia.org/r/1135418 (https://phabricator.wikimedia.org/T385709) (owner: Kamila Součková) [14:41:16] (PS1) Arturo Borrero Gonzalez: network: data: cloud: drop unused CIDR 172.16.132.0/22 [puppet] - https://gerrit.wikimedia.org/r/1135744 [14:41:17] (CR) Clément Goubert: [C:+2] alertmanager: add task receivers for 4 teams [puppet] - https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková) [14:41:23] (CR) Hashar: jobrunner: increase open files limit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967870 (https://phabricator.wikimedia.org/T344428) (owner: Giuseppe Lavagetto)
[14:41:31] (CR) CI reject: [V:-1] openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: FNegri) [14:42:06] (CR) Tiziano Fogli: netbox-hiera: adding pdu type (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli) [14:43:18] (CR) Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox (7 comments) [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli) [14:43:42] (PS1) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 [14:43:56] (CR) Arturo Borrero Gonzalez: [C:+2] network: data: cloud: drop unused CIDR 172.16.132.0/22 [puppet] - https://gerrit.wikimedia.org/r/1135744 (owner: Arturo Borrero Gonzalez) [14:45:35] (PS6) FNegri: openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [14:45:36] (PS2) Clément Goubert: alertmanager: Fix indentation [puppet] - https://gerrit.wikimedia.org/r/1135747 [14:46:05] (CR) CI reject: [V:-1] openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: FNegri) [14:46:08] (CR) Clément Goubert: [C:+2] alertmanager: Fix indentation [puppet] - https://gerrit.wikimedia.org/r/1135747 (owner: Clément Goubert) [14:47:01] (PS1) Filippo Giunchedi: logstash: temporarily disable curator forcemerge [puppet] - https://gerrit.wikimedia.org/r/1135748 [14:47:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2074.codfw.wmnet with reason: host reimage [14:48:24] (PS2) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] -
https://gerrit.wikimedia.org/r/1135746 [14:48:46] (PS7) FNegri: openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [14:49:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P74848 and previous config saved to /var/cache/conftool/dbconfig/20250410-144900-fceratto.json [14:49:14] (CR) CI reject: [V:-1] openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: FNegri) [14:51:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2074.codfw.wmnet with reason: host reimage [14:52:18] (PS8) FNegri: openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [14:53:22] (CR) Scott French: [C:+1] switch mwdebug2002 to php8.1 [puppet] - https://gerrit.wikimedia.org/r/1135736 (https://phabricator.wikimedia.org/T391452) (owner: Effie Mouzeli) [14:53:48] (PS3) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 [14:54:15] (CR) CI reject: [V:-1] profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey) [14:54:33] (PS1) FNegri: openstack: delete old py2 script [puppet] - https://gerrit.wikimedia.org/r/1135750 [14:55:48] (PS4) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 [14:56:12] (CR) CI reject: [V:-1] profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey) [14:56:13] (PS2) Cwhite: logstash: temporarily disable curator forcemerge [puppet] -
https://gerrit.wikimedia.org/r/1135748 (https://phabricator.wikimedia.org/T390215) (owner: Filippo Giunchedi) [14:56:59] (PS2) FNegri: openstack: delete old py2 script [puppet] - https://gerrit.wikimedia.org/r/1135750 [14:56:59] (PS9) FNegri: openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [14:58:28] (PS5) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 [14:58:53] (CR) Cwhite: [C:+2] logstash: temporarily disable curator forcemerge [puppet] - https://gerrit.wikimedia.org/r/1135748 (https://phabricator.wikimedia.org/T390215) (owner: Filippo Giunchedi) [14:58:54] (CR) CI reject: [V:-1] profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey) [15:00:05] brennen and dancy: May I have your attention please! Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1500) [15:00:29] (CR) Dzahn: "no worries at all! What it needed were some special headers, similar to the Bug: line that specifies the list of hosts the experimental ch" [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt) [15:01:27] (PS6) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 [15:02:28] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10730199 (Jhancock.wm) @cmooney this is complete! [15:03:49] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10730203 (Jhancock.wm) the NIC card has perished. I am opening an return with Dell.
[15:04:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T391056)', diff saved to https://phabricator.wikimedia.org/P74849 and previous config saved to /var/cache/conftool/dbconfig/20250410-150407-fceratto.json [15:04:12] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:04:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:04:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T391056)', diff saved to https://phabricator.wikimedia.org/P74850 and previous config saved to /var/cache/conftool/dbconfig/20250410-150431-fceratto.json [15:06:03] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5252/co" [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey) [15:06:05] (CR) Ahmon Dancy: [C:+1] Revert "scap: Use PHP 8.1 when executing maintenance scripts" [puppet] - https://gerrit.wikimedia.org/r/1135509 (https://phabricator.wikimedia.org/T390225) (owner: Scott French) [15:06:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T391056)', diff saved to https://phabricator.wikimedia.org/P74851 and previous config saved to /var/cache/conftool/dbconfig/20250410-150658-fceratto.json [15:07:15] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10730222 (Clement_Goubert) RIP. Thanks. [15:08:50] (CR) Elukey: [V:+1] "Hi folks!
This is a proposal to kick off a discussion about adding SLOs for services on K8s that are served by the Istio ingress (hence th" [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey) [15:10:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2074.codfw.wmnet with OS bullseye [15:10:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:43] (PS1) Clément Goubert: alertmanager: Add team/project receivers for Phab [puppet] - https://gerrit.wikimedia.org/r/1135753 (https://phabricator.wikimedia.org/T385709) [15:11:01] (PS1) Clément Goubert: alertmanager: Add routing for task alerts [puppet] - https://gerrit.wikimedia.org/r/1135754 (https://phabricator.wikimedia.org/T385709) [15:12:34] SRE, collaboration-services, Wikimedia-Mailing-lists, Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10730291 (Quiddity) @LSobanski I thought this bug was just a brief problem on Monday(?), but the missing emails still haven't appeared, if w...
[15:14:52] (PS2) Cwhite: logstash: move err field mitigation to effective place [puppet] - https://gerrit.wikimedia.org/r/1135740 (https://phabricator.wikimedia.org/T390215) [15:15:01] (PS1) Jforrester: WikiLambdaApiBase: Add logging for every remaining dieWith?(Z)Error [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135755 [15:15:06] (CR) Clément Goubert: [C:+2] mediawiki: remove unsupported LimitNOFILESoft directive [puppet] - https://gerrit.wikimedia.org/r/1135743 (https://phabricator.wikimedia.org/T389422) (owner: Hashar) [15:16:02] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10730305 (Jhancock.wm) SR208354425. np! [15:17:19] (CR) Cwhite: [C:+2] logstash: move err field mitigation to effective place [puppet] - https://gerrit.wikimedia.org/r/1135740 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite) [15:20:12] (CR) Tiziano Fogli: "I took a look around and saw that, in other cases, we append the filter to the original metric name:" [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli) [15:22:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P74852 and previous config saved to /var/cache/conftool/dbconfig/20250410-152206-fceratto.json [15:23:10] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2075 to cirrussearch2075 [15:23:32] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:24:15] (CR) Hnowlan: [C:+1] alertmanager: Add team/project receivers for Phab [puppet] - https://gerrit.wikimedia.org/r/1135753 (https://phabricator.wikimedia.org/T385709) (owner: Clément Goubert) [15:26:06] (CR) Effie Mouzeli: [C:+2] switch mwdebug2002 to php8.1 [puppet] - https://gerrit.wikimedia.org/r/1135736
(https://phabricator.wikimedia.org/T391452) (owner: Effie Mouzeli) [15:29:55] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mwdebug2002.codfw.wmnet with OS bullseye [15:35:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P74854 and previous config saved to /var/cache/conftool/dbconfig/20250410-153713-fceratto.json [15:38:49] (CR) Andrew Bogott: [C:+1] "looks great, thank you for the cleanup!" [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: FNegri) [15:40:43] FIRING: [3x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:41:59] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2075 to cirrussearch2075 - bking@cumin2002" [15:42:01] (PS1) Andrew Bogott: Remove files and manifests for openstack version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1135756 [15:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:42] (CR) Andrew Bogott: [C:+1] openstack: delete old py2 script [puppet] - https://gerrit.wikimedia.org/r/1135750 (owner: FNegri) [15:43:14] (CR) FNegri: [C:+2] openstack: delete old py2 script [puppet] - https://gerrit.wikimedia.org/r/1135750 (owner: FNegri) [15:43:16] (CR) FNegri: [C:+2] openstack: remove references to noauth-project [puppet] - https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: FNegri) [15:44:34] (CR) CI reject: [V:-1] Remove files and manifests for openstack version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1135756 (owner: Andrew Bogott) [15:45:07] (PS1) Clément Goubert: mw:periodic_jobs: Absent updatetranslationstats [puppet] - https://gerrit.wikimedia.org/r/1135759 (https://phabricator.wikimedia.org/T388539) [15:45:09] (PS1) Clément Goubert: mw:periodic_jobs: Cleanup updatetranslationstats [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) [15:45:13] ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10730449 (RobH) [15:45:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:40] ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10730453 (RobH) >>! In T391540#10728509, @Marostegui wrote: > I can be the point of contact for this task, with the exception of aqs1022 and restbase1045 Do...
[15:46:42] (PS2) Andrew Bogott: Remove files and manifests for openstack version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1135756 [15:48:26] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug2002.codfw.wmnet with reason: host reimage [15:48:39] (CR) Herron: "Really nice Luca, thanks! I added a few comments about details inline, it looks great overall. Appreciate the cleanup and the new define" [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey) [15:49:15] (CR) CI reject: [V:-1] Remove files and manifests for openstack version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1135756 (owner: Andrew Bogott) [15:49:41] (PS3) Andrew Bogott: Remove files and manifests for openstack version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1135756 [15:50:39] SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10730475 (MatthewVernon) I've made two apus user accounts - gitlab-rw with a 350G quota and gitlab-ro with a 1G quota; the creden...
[15:50:43] FIRING: [3x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:51:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:52:05] (CR) CI reject: [V:-1] Remove files and manifests for openstack version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1135756 (owner: Andrew Bogott) [15:52:06] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug2002.codfw.wmnet with reason: host reimage [15:52:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T391056)', diff saved to https://phabricator.wikimedia.org/P74855 and previous config saved to /var/cache/conftool/dbconfig/20250410-155220-fceratto.json [15:52:23] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:52:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1210.eqiad.wmnet with reason: Maintenance [15:52:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T391056)', diff saved to https://phabricator.wikimedia.org/P74856 and previous config saved to /var/cache/conftool/dbconfig/20250410-155241-fceratto.json [15:54:32] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases2003.codfw.wmnet with reason: Bookworm Re-image [15:55:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T391056)', diff saved to https://phabricator.wikimedia.org/P74857 and previous config saved to /var/cache/conftool/dbconfig/20250410-155528-fceratto.json [15:56:35] ops-eqiad, SRE, DC-Ops, serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599 (RobH) NEW [15:57:25] (PS4) Andrew Bogott: Remove files and manifests for openstack version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1135756 [15:57:25] (PS1) Andrew Bogott: wmcs-package-build: remove an unnecessary 'global' [puppet] - https://gerrit.wikimedia.org/r/1135761 [15:57:25] (PS1) Andrew Bogott: tcpircbot: remove an unnecessary 'global' [puppet] - https://gerrit.wikimedia.org/r/1135762 [15:57:45] ops-eqiad, SRE, DC-Ops, serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599#10730520 (RobH) p:Triage→High a:Kappakayala @Kappakayala I think you'd be the person to triage this within #service-ops and assign a point of contact for fe... [15:59:25] (PS1) Volans: hosts: add a new hosts module with a Host class [software/spicerack] - https://gerrit.wikimedia.org/r/1135763 [15:59:25] (PS1) Volans: hosts: add a is_dns_propagated() method to Host [software/spicerack] - https://gerrit.wikimedia.org/r/1135764 [16:00:04] (CR) CI reject: [V:-1] Remove files and manifests for openstack version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1135756 (owner: Andrew Bogott) [16:00:05] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS.
[16:01:24] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2075 to cirrussearch2075 - bking@cumin2002" [16:01:24] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:25] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2075 [16:01:41] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10730550 (10RobH) [16:02:01] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2075 [16:02:02] (03PS5) 10Andrew Bogott: Remove files and manifests for openstack version 'bobcat' [puppet] - 10https://gerrit.wikimedia.org/r/1135756 [16:02:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2075 to cirrussearch2075 [16:04:14] 10ops-eqiad, 06SRE, 06DC-Ops: relocate snapshot1017 out of eqiad D6 - https://phabricator.wikimedia.org/T391601 (10RobH) 03NEW [16:04:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599#10730569 (10Clement_Goubert) Tagging @eevans for sessionstore host. For `wikikube-worker` they can change VLAN/IP and move rack. Just tell us when you want to move them so... [16:04:34] (03CR) 10CI reject: [V:04-1] Remove files and manifests for openstack version 'bobcat' [puppet] - 10https://gerrit.wikimedia.org/r/1135756 (owner: 10Andrew Bogott) [16:04:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10730573 (10phaultfinder) [16:04:39] (03CR) 10Scott French: "Thanks, Ahmon!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1135509 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [16:04:55] (03CR) 10Scott French: [C:03+2] Revert "scap: Use PHP 8.1 when executing maintenance scripts" [puppet] - 10https://gerrit.wikimedia.org/r/1135509 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [16:05:59] 10ops-eqiad, 06SRE, 06DC-Ops: relocate snapshot1017 out of eqiad D6 - https://phabricator.wikimedia.org/T391601#10730585 (10RobH) p:05Triage→03High @ArielGlenn, Normally I turf these over to the SRE sub-team manager in charge of the server, but snapshot hosts are a slightly different beast than the rest... [16:06:15] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10730588 (10RobH) [16:06:18] (03PS6) 10Andrew Bogott: Remove files and manifests for openstack version 'bobcat' [puppet] - 10https://gerrit.wikimedia.org/r/1135756 [16:06:28] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2075.codfw.wmnet with OS bullseye [16:06:39] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2075 [16:06:48] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:08:51] 10ops-eqiad, 06SRE, 06DC-Ops: relocate sretest1002 out of D6 - https://phabricator.wikimedia.org/T391602 (10RobH) 03NEW [16:09:20] 10ops-eqiad, 06SRE, 06DC-Ops: relocate sretest1002 out of D6 - https://phabricator.wikimedia.org/T391602#10730608 (10RobH) sretest1002 shows active in netbox, but I'm not certain what team is using it right now. https://netbox.wikimedia.org/dcim/devices/2123/interfaces/ Does anyone in #dc-ops know? [16:09:27] 10ops-eqiad, 06SRE, 06DC-Ops: relocate snapshot1017 out of eqiad D6 - https://phabricator.wikimedia.org/T391601#10730610 (10ArielGlenn) >>! 
In T391601#10730585, @RobH wrote: > @ArielGlenn, > > Normally I turf these over to the SRE sub-team manager in charge of the server, but snapshot hosts are a slightly d... [16:09:27] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135756 (owner: 10Andrew Bogott) [16:09:31] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10730611 (10RobH) [16:09:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:00] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10730614 (10RobH) a:05Jclark-ctr→03RobH [16:10:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P74858 and previous config saved to /var/cache/conftool/dbconfig/20250410-161036-fceratto.json [16:13:07] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10730622 (10RobH) @Jclark-ctr, I propose the following cadence for this project: * Rob creates the sub-tasks and follows up with the various sub-teams and managers per sub-task * Rob and... [16:13:16] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:06] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:15:31] (03CR) 10Scott French: [C:03+1] alertmanager: Add team/project receivers for Phab [puppet] - 10https://gerrit.wikimedia.org/r/1135753 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:15:47] 10ops-eqiad, 06SRE, 06DC-Ops: relocate snapshot1017 out of eqiad D6 - https://phabricator.wikimedia.org/T391601#10730629 (10RobH) a:05ArielGlenn→03BTullis [16:16:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3321 MB (3% inode=98%): /tmp 3321 MB (3% inode=98%): /var/tmp 3321 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [16:16:17] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: relocate snapshot1017 out of eqiad D6 - https://phabricator.wikimedia.org/T391601#10730632 (10RobH) [16:16:29] (03CR) 10Scott French: [C:03+1] alertmanager: Add routing for task alerts [puppet] - 10https://gerrit.wikimedia.org/r/1135754 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:17:19] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10730639 (10RobH) [16:17:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:17:33] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:17:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:18:05] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 
06DC-Ops: relocate (3) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10730640 (10RobH) [16:18:23] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2075 - bking@cumin2002" [16:18:28] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2075 - bking@cumin2002" [16:18:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:29] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2075.codfw.wmnet 145.0.192.10.in-addr.arpa 5.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:18:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2075.codfw.wmnet 145.0.192.10.in-addr.arpa 5.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:18:33] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2075 [16:18:34] (03CR) 10Hnowlan: [C:03+1] mw:periodic_jobs: Absent updatetranslationstats [puppet] - 10https://gerrit.wikimedia.org/r/1135759 (https://phabricator.wikimedia.org/T388539) (owner: 10Clément Goubert) [16:18:42] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2075 [16:18:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2075 [16:18:43] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: relocate snapshot1017 out of eqiad D6 - https://phabricator.wikimedia.org/T391601#10730642 (10RobH) 05Open→03Declined I've folded this into T391539 which has the other #data-persistence-sre hosts for migration! Thanks for the feedback. 
[16:19:04] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10730647 (10RobH) [16:19:13] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10730649 (10RobH) [16:19:34] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: relocate snapshot1017 out of eqiad D6 - https://phabricator.wikimedia.org/T391601#10730650 (10RobH) a:05BTullis→03None [16:19:40] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: relocate snapshot1017 out of eqiad D6 - https://phabricator.wikimedia.org/T391601#10730651 (10RobH) [16:19:59] (03CR) 10Clément Goubert: [C:03+2] alertmanager: Add team/project receivers for Phab [puppet] - 10https://gerrit.wikimedia.org/r/1135753 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:20:07] (03CR) 10Clément Goubert: [C:03+2] alertmanager: Add routing for task alerts [puppet] - 10https://gerrit.wikimedia.org/r/1135754 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:20:14] (03CR) 10Clément Goubert: [C:03+2] mw:periodic_jobs: Absent updatetranslationstats [puppet] - 10https://gerrit.wikimedia.org/r/1135759 (https://phabricator.wikimedia.org/T388539) (owner: 10Clément Goubert) [16:20:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:24:40] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld 
[16:25:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P74859 and previous config saved to /var/cache/conftool/dbconfig/20250410-162542-fceratto.json [16:26:14] (03CR) 10Scott French: [C:03+1] mw:periodic_jobs: Cleanup updatetranslationstats [puppet] - 10https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: 10Clément Goubert) [16:30:17] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwdebug2002.codfw.wmnet with OS bullseye [16:30:31] (03PS1) 10Clément Goubert: alertmanager: Fix bad indentation [puppet] - 10https://gerrit.wikimedia.org/r/1135779 (https://phabricator.wikimedia.org/T385709) [16:30:36] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135779 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:33:35] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135779 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:33:57] !log jiji@cumin1002 conftool action : set/pooled=yes; selector: name=mwdebug2002.codfw.wmnet [16:34:14] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2075.codfw.wmnet with reason: host reimage [16:36:34] (03PS1) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135781 (https://phabricator.wikimedia.org/T385782) [16:37:01] (03CR) 10CI reject: [V:04-1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135781 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:37:05] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10730747 (10Dzahn) I deleted all frozen messages older than 14 
days.. which was 1284 messages. And ran an exim command to make it try to re-d... [16:37:39] (03CR) 10Andrea Denisse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135779 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:37:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2075.codfw.wmnet with reason: host reimage [16:38:42] (03PS1) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) [16:39:07] (03CR) 10CI reject: [V:04-1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:39:10] (03Abandoned) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135781 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:40:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T391056)', diff saved to https://phabricator.wikimedia.org/P74860 and previous config saved to /var/cache/conftool/dbconfig/20250410-164049-fceratto.json [16:40:53] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:41:05] (03PS2) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) [16:41:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1216.eqiad.wmnet with reason: Maintenance [16:41:37] (03CR) 10CI reject: [V:04-1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:42:03] !log fceratto@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1245.eqiad.wmnet with reason: Maintenance [16:42:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [16:43:20] (03PS3) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) [16:43:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:44:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74861 and previous config saved to /var/cache/conftool/dbconfig/20250410-164400-fceratto.json [16:44:44] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135779 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:45:56] PROBLEM - Hadoop NodeManager on an-worker1197 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:59] (03CR) 10Andrea Denisse: [C:03+1] alertmanager: Fix bad indentation [puppet] - 10https://gerrit.wikimedia.org/r/1135779 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:46:25] (03CR) 10Clément Goubert: [C:03+2] alertmanager: Fix bad indentation [puppet] - 10https://gerrit.wikimedia.org/r/1135779 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:46:55] (03PS1) 10Vgutierrez: trafficserver: disable cache write on Cache-Control: private [puppet] - 10https://gerrit.wikimedia.org/r/1135790 [16:47:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:47:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74862 and previous config saved to /var/cache/conftool/dbconfig/20250410-164753-fceratto.json [16:47:57] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:48:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10730802 (10RobH) [16:48:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10730820 (10Ladsgroup) >>! In T391540#10730449, @RobH wrote: >>>! In T391540#10728509, @Marostegui wrote: >> I can be the point of contact for this task, with t... [16:49:07] (03CR) 10CI reject: [V:04-1] trafficserver: disable cache write on Cache-Control: private [puppet] - 10https://gerrit.wikimedia.org/r/1135790 (owner: 10Vgutierrez) [16:51:29] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:58:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2075.codfw.wmnet with OS bullseye [17:00:04] bd808: Time to snap out of that daydream and deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1700) [17:00:20] Nothing for my window this week. 
[17:03:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P74863 and previous config saved to /var/cache/conftool/dbconfig/20250410-170300-fceratto.json [17:03:55] (03PS1) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 [17:08:09] (03PS36) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [17:08:09] (03CR) 10Federico Ceratto: "I updated the code and tested it with dry run, mypy and unit test." [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [17:11:24] (03PS3) 10Federico Ceratto: Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) [17:12:56] RECOVERY - Hadoop NodeManager on an-worker1197 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:12:59] (03CR) 10Dreamy Jazz: [C:03+1] InitializeSettings: add wgSecurePollEditOtherWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134660 (https://phabricator.wikimedia.org/T384302) (owner: 10Novem Linguae) [17:13:36] 06SRE, 10Wikimedia-Mailing-lists: mailman/postorius: errors when changing subscription or when trying to unsubscribe - https://phabricator.wikimedia.org/T391260#10730962 (10Dzahn) The timing makes it look like this was also related to T391330 where mailman was restarted. Assuming that fixed this one. 
[17:14:09] 06SRE, 10Wikimedia-Mailing-lists: mailman/postorius: errors when changing subscription or when trying to unsubscribe - https://phabricator.wikimedia.org/T391260#10730979 (10Dzahn) 05Open→03Resolved a:03Dzahn optimistically calling it resolved [17:18:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P74865 and previous config saved to /var/cache/conftool/dbconfig/20250410-171808-fceratto.json [17:19:19] bd808: OK if I use your window then? [17:19:48] (03CR) 10Dzahn: "after thinking about it some more, let's reuse the migration class we already have before creating another special role. if not we can alw" [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [17:19:57] (03Abandoned) 10Dzahn: phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [17:24:05] James_F: please do :) [17:24:11] <3 [17:24:41] James_F: Do you have a mediawiki change to backport? If so, I'd like to deploy it [17:24:44] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2111 to cirrussearch2111 [17:24:58] dancy: I do! Three of them, all non-i18n. [17:25:07] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:25:26] dancy: 1135755 1135725 1135723 [17:25:32] Awesome. All three in one batch? [17:25:43] Yes please. [17:25:53] They'll only affect test2wiki. [17:25:56] (03CR) 10Dzahn: "will do what Ahmon and Brennen say here" [puppet] - 10https://gerrit.wikimedia.org/r/1135690 (https://phabricator.wikimedia.org/T371633) (owner: 10Lucas Werkmeister (WMDE)) [17:25:57] (In practice.) [17:25:59] ok. 
[17:26:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135755 (owner: 10Jforrester) [17:26:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135725 (https://phabricator.wikimedia.org/T391534) (owner: 10Jforrester) [17:26:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135723 (https://phabricator.wikimedia.org/T391533) (owner: 10Jforrester) [17:27:27] Now we're racing against the CI pressure from Reedy's post-security-release merges. [17:28:10] Actually, looks like they've all now finished. [17:28:53] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:29:13] (03Merged) 10jenkins-bot: WikiLambdaApiBase: Add logging for every remaining dieWith?(Z)Error [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135755 (owner: 10Jforrester) [17:31:07] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2111 to cirrussearch2111 - bking@cumin2002" [17:31:26] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2111 to cirrussearch2111 - bking@cumin2002" [17:31:26] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:31:27] !log bking@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host cirrussearch2111 [17:31:47] (03Merged) 10jenkins-bot: Set WikiLambdaClientTargetAPI default value to protocol-relative, so HSTS doesn't sting us [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135725 (https://phabricator.wikimedia.org/T391534) (owner: 10Jforrester) [17:31:49] (03Merged) 10jenkins-bot: WikifunctionsClientUsageUpdateJob: Don't pass a heavy Title in, just the scalars [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135723 (https://phabricator.wikimedia.org/T391533) (owner: 10Jforrester) [17:32:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2111 [17:32:20] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1135755|WikiLambdaApiBase: Add logging for every remaining dieWith?(Z)Error]], [[gerrit:1135725|Set WikiLambdaClientTargetAPI default value to protocol-relative, so HSTS doesn't sting us (T391534)]], [[gerrit:1135723|WikifunctionsClientUsageUpdateJob: Don't pass a heavy Title in, just the scalars (T391533)]] [17:32:25] T391534: VE integration API calls fail as they're insecure on an HSTS site (not preserving protocol?) 
- https://phabricator.wikimedia.org/T391534 [17:32:25] T391533: EventBus complains that the wikifunctionsUsageUpdate has a non-scalar parameter - https://phabricator.wikimedia.org/T391533 [17:33:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2111 to cirrussearch2111 [17:33:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74866 and previous config saved to /var/cache/conftool/dbconfig/20250410-173315-fceratto.json [17:33:19] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:33:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:33:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T391056)', diff saved to https://phabricator.wikimedia.org/P74867 and previous config saved to /var/cache/conftool/dbconfig/20250410-173339-fceratto.json [17:35:17] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2111.codfw.wmnet on all recursors [17:35:20] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2111.codfw.wmnet on all recursors [17:35:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2111.codfw.wmnet with OS bullseye [17:35:48] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2111 [17:35:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2111 [17:37:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T391056)', diff saved to https://phabricator.wikimedia.org/P74868 and previous config saved to /var/cache/conftool/dbconfig/20250410-173735-fceratto.json [17:37:50] !log dancy@deploy1003 dancy, jforrester: Backport for [[gerrit:1135755|WikiLambdaApiBase: Add 
logging for every remaining dieWith?(Z)Error]], [[gerrit:1135725|Set WikiLambdaClientTargetAPI default value to protocol-relative, so HSTS doesn't sting us (T391534)]], [[gerrit:1135723|WikifunctionsClientUsageUpdateJob: Don't pass a heavy Title in, just the scalars (T391533)]] synced to the testservers (https://wikitech.wikimedi [17:37:50] a.org/wiki/Mwdebug) [17:37:54] T391534: VE integration API calls fail as they're insecure on an HSTS site (not preserving protocol?) - https://phabricator.wikimedia.org/T391534 [17:37:54] T391533: EventBus complains that the wikifunctionsUsageUpdate has a non-scalar parameter - https://phabricator.wikimedia.org/T391533 [17:38:23] James_F: lemme know when ready to proceed. [17:38:29] Checking, sorry. [17:39:04] dancy: Awesome, please continue. [17:39:08] (Different error this time.) [17:39:10] !log dancy@deploy1003 dancy, jforrester: Continuing with sync [17:39:11] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10731074 (10Andrew) Regarding item 1 and 3, I think I my check for detecting redirect pages was too broad and excluded other useful pages. I'm rebui... [17:45:48] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135755|WikiLambdaApiBase: Add logging for every remaining dieWith?(Z)Error]], [[gerrit:1135725|Set WikiLambdaClientTargetAPI default value to protocol-relative, so HSTS doesn't sting us (T391534)]], [[gerrit:1135723|WikifunctionsClientUsageUpdateJob: Don't pass a heavy Title in, just the scalars (T391533)]] (duration: 13m 28s) [17:45:52] T391534: VE integration API calls fail as they're insecure on an HSTS site (not preserving protocol?) - https://phabricator.wikimedia.org/T391534 [17:45:52] T391533: EventBus complains that the wikifunctionsUsageUpdate has a non-scalar parameter - https://phabricator.wikimedia.org/T391533 [17:45:57] thx James_F [17:46:10] dancy: Thank *you*! 
[17:46:12] (03PS2) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 [17:46:49] (03CR) 10Ahmon Dancy: [C:03+1] "I ran a deployment today and did not see any of the curl errors, so I'm okay with this change." [puppet] - 10https://gerrit.wikimedia.org/r/1135690 (https://phabricator.wikimedia.org/T371633) (owner: 10Lucas Werkmeister (WMDE)) [17:52:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:52:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P74869 and previous config saved to /var/cache/conftool/dbconfig/20250410-175242-fceratto.json [17:53:00] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2111.codfw.wmnet with OS bullseye [17:53:28] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2111.codfw.wmnet with OS bullseye [17:53:32] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2111 [17:53:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2111 [17:56:46] (03CR) 10Ladsgroup: Catalog ContentTranslation tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [18:00:01] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2111.codfw.wmnet with OS bullseye [18:00:05] brennen and dancy: Your horoscope predicts another MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1800). 
[18:00:16] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2111'] [18:03:49] o/ [18:05:01] (03PS3) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 [18:05:11] o/ [18:06:19] jouncebot: nowandnext [18:06:19] For the next 1 hour(s) and 53 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T1800) [18:06:19] In 1 hour(s) and 53 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T2000) [18:07:00] !log 1.44.0-wmf.24 train status (T386219): logs quiet, no current blockers, moving to all wikis [18:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:03] T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219 [18:07:34] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135805 (https://phabricator.wikimedia.org/T386219) [18:07:35] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135805 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot) [18:07:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P74870 and previous config saved to /var/cache/conftool/dbconfig/20250410-180749-fceratto.json [18:07:55] (03PS4) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 [18:08:13] Once you are done with the train, any chance I can deploy a no-op config change? 
[18:08:28] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135805 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot) [18:09:16] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2111'] [18:11:23] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2111'] [18:11:24] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10731162 (10VRiley-WMF) For reference, here are the previous tickets that have been made for this specific unit 188297490 - April 5th 2024 197398410 - September 10th 2024 198075128 - September 23rd 2024 2... [18:11:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cirrussearch2111'] [18:11:59] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2111'] [18:12:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [18:13:04] (03PS1) 10Jforrester: WikifunctionsClientUsageUpdateJob: Also init targetPageNamespace [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135807 [18:13:13] (03PS1) 10Jforrester: Special pages: Don't list or let execute repo-only ones on client wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135808 (https://phabricator.wikimedia.org/T391594) [18:13:38] Dreamy_Jazz: And after you (or bundled with) I've got a couple of minor fixes. [18:14:29] Fine to bundle if wanted [18:14:36] It's a no-op [18:14:58] Patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1134660 [18:15:00] Ack. But let's not assume the train is fine, lest we tempt fate. 
[18:15:12] :D [18:15:19] Dreamy_Jazz: Oh, neat, I can sling that out for you when things are ready. [18:15:36] Thanks! [18:20:14] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.24 refs T386219 [18:20:17] T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219 [18:21:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2111'] [18:22:10] Dreamy_Jazz, James_F: give it a couple minutes more for the train to bake in, and all yours [18:22:14] Of course. [18:22:16] Thanks! [18:22:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T391056)', diff saved to https://phabricator.wikimedia.org/P74871 and previous config saved to /var/cache/conftool/dbconfig/20250410-182257-fceratto.json [18:23:00] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:23:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2178.codfw.wmnet with reason: Maintenance [18:23:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74872 and previous config saved to /var/cache/conftool/dbconfig/20250410-182319-fceratto.json [18:24:59] (logs look clean.) [18:26:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74873 and previous config saved to /var/cache/conftool/dbconfig/20250410-182652-fceratto.json [18:27:30] Cool, let's do it. 
[18:28:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135807 (owner: 10Jforrester) [18:28:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135808 (https://phabricator.wikimedia.org/T391594) (owner: 10Jforrester) [18:28:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134660 (https://phabricator.wikimedia.org/T384302) (owner: 10Novem Linguae) [18:28:52] (03Merged) 10jenkins-bot: InitializeSettings: add wgSecurePollEditOtherWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134660 (https://phabricator.wikimedia.org/T384302) (owner: 10Novem Linguae) [18:29:33] (03Merged) 10jenkins-bot: WikifunctionsClientUsageUpdateJob: Also init targetPageNamespace [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135807 (owner: 10Jforrester) [18:31:02] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2111'] [18:32:10] (03Merged) 10jenkins-bot: Special pages: Don't list or let execute repo-only ones on client wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135808 (https://phabricator.wikimedia.org/T391594) (owner: 10Jforrester) [18:32:27] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1135807|WikifunctionsClientUsageUpdateJob: Also init targetPageNamespace]], [[gerrit:1135808|Special pages: Don't list or let execute repo-only ones on client wikis (T391594)]], [[gerrit:1134660|InitializeSettings: add wgSecurePollEditOtherWikis (T384302)]] [18:32:31] T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally 
raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594 [18:32:31] T384302: SecurePoll: Restrict creation of foreign and global elections - https://phabricator.wikimedia.org/T384302 [18:32:54] couple of reports of 503s from esams starting ~3 mins ago [18:32:55] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2111'] [18:33:17] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cirrussearch2111'] [18:33:43] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2111.codfw.wmnet with OS bullseye [18:33:48] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2111 [18:33:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2111 [18:34:10] hmm, seemingly most of those for cp3073 [18:37:17] !log jforrester@deploy1003 novemlinguae, jforrester: Backport for [[gerrit:1135807|WikifunctionsClientUsageUpdateJob: Also init targetPageNamespace]], [[gerrit:1135808|Special pages: Don't list or let execute repo-only ones on client wikis (T391594)]], [[gerrit:1134660|InitializeSettings: add wgSecurePollEditOtherWikis (T384302)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:38:32] !log jforrester@deploy1003 novemlinguae, jforrester: Continuing with sync [18:40:28] (03Abandoned) 10Vgutierrez: trafficserver: disable cache write on Cache-Control: private [puppet] - 10https://gerrit.wikimedia.org/r/1135790 (owner: 10Vgutierrez) [18:42:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P74875 and previous config saved to /var/cache/conftool/dbconfig/20250410-184159-fceratto.json [18:45:09] !log jforrester@deploy1003 Finished scap sync-world: Backport for 
[[gerrit:1135807|WikifunctionsClientUsageUpdateJob: Also init targetPageNamespace]], [[gerrit:1135808|Special pages: Don't list or let execute repo-only ones on client wikis (T391594)]], [[gerrit:1134660|InitializeSettings: add wgSecurePollEditOtherWikis (T384302)]] (duration: 12m 42s) [18:45:13] Dreamy_Jazz: Deployed. [18:45:14] T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594 [18:45:14] T384302: SecurePoll: Restrict creation of foreign and global elections - https://phabricator.wikimedia.org/T384302 [18:45:20] Thanks! [18:50:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1317:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1317 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:57:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P74877 and previous config saved to /var/cache/conftool/dbconfig/20250410-185706-fceratto.json [18:57:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2111.codfw.wmnet with OS bullseye [18:57:59] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2111.codfw.wmnet with OS bullseye [18:58:04] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2111 [18:58:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2111 [19:11:19] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1135715 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [19:12:14] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74878 and previous config saved to /var/cache/conftool/dbconfig/20250410-191214-fceratto.json [19:12:18] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:12:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2192.codfw.wmnet with reason: Maintenance [19:12:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74879 and previous config saved to /var/cache/conftool/dbconfig/20250410-191226-fceratto.json [19:13:11] !log removing 1 file for legal compliance [19:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74880 and previous config saved to /var/cache/conftool/dbconfig/20250410-191459-fceratto.json [19:16:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3436 MB (3% inode=98%): /tmp 3436 MB (3% inode=98%): /var/tmp 3436 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [19:20:33] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2111.codfw.wmnet with reason: host reimage [19:22:47] !log removing 2 files for legal compliance [19:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2111.codfw.wmnet with reason: host reimage [19:30:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2192', diff saved to https://phabricator.wikimedia.org/P74881 and previous config saved to /var/cache/conftool/dbconfig/20250410-193007-fceratto.json [19:34:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10731347 (10phaultfinder) [19:36:43] (03CR) 10Dzahn: [C:03+2] Revert "logspam: Consolidate CurlFactory cURL errors" [puppet] - 10https://gerrit.wikimedia.org/r/1135690 (https://phabricator.wikimedia.org/T371633) (owner: 10Lucas Werkmeister (WMDE)) [19:43:53] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:43:58] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:44:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2111.codfw.wmnet with OS bullseye [19:45:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P74882 and previous config saved to /var/cache/conftool/dbconfig/20250410-194514-fceratto.json [19:45:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1317:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1317 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:45:53] (03CR) 10Dwisehaupt: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 
(https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [19:53:53] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:56:27] FIRING: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74883 and previous config saved to /var/cache/conftool/dbconfig/20250410-200022-fceratto.json [20:00:25] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:00:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2201.codfw.wmnet with reason: Maintenance [20:01:27] FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:02:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2211.codfw.wmnet with reason: Maintenance [20:02:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74884 and previous config saved to /var/cache/conftool/dbconfig/20250410-200233-fceratto.json [20:02:51] (03PS1) 10Andrew Bogott: eqiad1 openstack: enforce_policy_scope=True [puppet] - 10https://gerrit.wikimedia.org/r/1135819 (https://phabricator.wikimedia.org/T330759) [20:02:52] (03PS1) 10Andrew Bogott: codfw1dev: enforce_new_policy_defaults=true [puppet] - 10https://gerrit.wikimedia.org/r/1135820 (https://phabricator.wikimedia.org/T330759) [20:03:33] (03PS2) 10Andrew Bogott: eqiad1 openstack: enforce_policy_scope=True [puppet] - 10https://gerrit.wikimedia.org/r/1135819 (https://phabricator.wikimedia.org/T330759) [20:03:33] (03PS2) 10Andrew Bogott: codfw1dev: enforce_new_policy_defaults=true [puppet] - 10https://gerrit.wikimedia.org/r/1135820 (https://phabricator.wikimedia.org/T330759) [20:06:25] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74885 and previous config saved to /var/cache/conftool/dbconfig/20250410-200625-fceratto.json [20:06:27] RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:06:29] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:07:34] !log cdobbins@dns1004 START - running authdns-update [20:09:56] !log cdobbins@dns1004 END - running authdns-update [20:12:15] (03CR) 10CDobbins: [C:03+2] geo-maps: add mapping for Peru [dns] - 10https://gerrit.wikimedia.org/r/1135469 (owner: 10CDobbins) [20:12:35] (03PS3) 10Andrew Bogott: eqiad1 openstack: enforce_policy_scope=True [puppet] - 10https://gerrit.wikimedia.org/r/1135819 (https://phabricator.wikimedia.org/T330759) [20:12:35] (03PS3) 10Andrew Bogott: codfw1dev: enforce_new_policy_defaults=true [puppet] - 10https://gerrit.wikimedia.org/r/1135820 (https://phabricator.wikimedia.org/T330759) [20:12:36] (03PS1) 10Andrew Bogott: Update policy tests to work with the new network stack [puppet] - 10https://gerrit.wikimedia.org/r/1135824 [20:13:34] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2091 to cirrussearch2091 [20:13:56] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:14:27] !log cdobbins@dns1004 START - running authdns-update [20:17:17] !log cdobbins@dns1004 END - running authdns-update [20:17:17] (03PS2) 10Andrew Bogott: Update policy tests to work with the new network stack [puppet] - 10https://gerrit.wikimedia.org/r/1135824 [20:17:18] (03PS4) 10Andrew Bogott: eqiad1 openstack: enforce_policy_scope=True [puppet] - 10https://gerrit.wikimedia.org/r/1135819 
(https://phabricator.wikimedia.org/T330759) [20:17:18] (03PS4) 10Andrew Bogott: codfw1dev: enforce_new_policy_defaults=true [puppet] - 10https://gerrit.wikimedia.org/r/1135820 (https://phabricator.wikimedia.org/T330759) [20:17:53] (03CR) 10Andrew Bogott: [C:03+2] Update policy tests to work with the new network stack [puppet] - 10https://gerrit.wikimedia.org/r/1135824 (owner: 10Andrew Bogott) [20:17:57] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2091 to cirrussearch2091 - bking@cumin2002" [20:18:53] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:21:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P74886 and previous config saved to /var/cache/conftool/dbconfig/20250410-202132-fceratto.json [20:22:51] (03CR) 10Andrew Bogott: [C:03+2] eqiad1 openstack: enforce_policy_scope=True [puppet] - 10https://gerrit.wikimedia.org/r/1135819 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:22:53] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: enforce_new_policy_defaults=true [puppet] - 10https://gerrit.wikimedia.org/r/1135820 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:24:01] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2091 to cirrussearch2091 - bking@cumin2002" [20:24:01] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:24:02] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2091 [20:24:32] !log bking@cumin2002 END (PASS) 
- Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2091 [20:25:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2091 to cirrussearch2091 [20:26:18] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [20:26:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2091.codfw.wmnet with OS bullseye [20:27:06] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [20:28:08] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2091.codfw.wmnet with OS bullseye [20:30:11] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [20:30:20] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2091.codfw.wmnet with OS bullseye [20:34:27] !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [20:34:36] !log bking@cumin1002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091 [20:34:42] !log bking@cumin1002 START - Cookbook sre.dns.netbox [20:36:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P74887 and previous config saved to /var/cache/conftool/dbconfig/20250410-203640-fceratto.json [20:40:05] !log bking@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2091 - bking@cumin1002" [20:40:11] !log bking@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2091 - bking@cumin1002" [20:40:11] !log bking@cumin1002 END (PASS) 
- Cookbook sre.dns.netbox (exit_code=0) [20:40:11] !log bking@cumin1002 START - Cookbook sre.dns.wipe-cache cirrussearch2091.codfw.wmnet 99.0.192.10.in-addr.arpa 9.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:40:15] !log bking@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2091.codfw.wmnet 99.0.192.10.in-addr.arpa 9.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:40:15] !log bking@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2091 [20:41:42] !log bking@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2091 [20:41:42] !log bking@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091 [20:48:53] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:51:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74888 and previous config saved to /var/cache/conftool/dbconfig/20250410-205148-fceratto.json [20:51:52] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:52:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2223.codfw.wmnet with reason: Maintenance [20:52:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T391056)', diff saved to https://phabricator.wikimedia.org/P74889 and previous config saved to /var/cache/conftool/dbconfig/20250410-205211-fceratto.json [20:55:36] PROBLEM - mailman archives on lists1004 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:55:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:56:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T391056)', diff saved to https://phabricator.wikimedia.org/P74890 and previous config saved to /var/cache/conftool/dbconfig/20250410-205606-fceratto.json [20:56:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.992 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:57:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T2100) [21:11:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P74891 and previous config saved to /var/cache/conftool/dbconfig/20250410-211114-fceratto.json [21:16:38] !log bking@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2091.codfw.wmnet with OS bullseye [21:22:26] (03PS1) 10Ryan Kemper: sre.elasticsearch.rolling-operation: don't use http for dhcp for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) [21:22:54] (03CR) 10Ryan Kemper: "We're not positive we want this change, but just getting the patch up for now just in case" [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [21:26:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P74892 and 
previous config saved to /var/cache/conftool/dbconfig/20250410-212621-fceratto.json [21:28:13] (03PS1) 10Fabfur: haproxy: start staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 [21:29:08] (03CR) 10CI reject: [V:04-1] sre.elasticsearch.rolling-operation: don't use http for dhcp for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [21:29:48] 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 10Discovery-Search (2025.03.22 - 2025.04.11): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639 (10bking) 03NEW [21:30:02] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:35:11] (03PS1) 10Ryan Kemper: cirrus: enable opensearch roles in row D [puppet] - 10https://gerrit.wikimedia.org/r/1135828 (https://phabricator.wikimedia.org/T388610) [21:35:40] (03PS2) 10Ryan Kemper: cirrus: enable opensearch roles in row D [puppet] - 10https://gerrit.wikimedia.org/r/1135828 (https://phabricator.wikimedia.org/T388610) [21:36:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3585 MB (3% inode=98%): /tmp 3585 MB (3% inode=98%): /var/tmp 3585 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [21:36:24] (03CR) 10Bking: [C:03+2] cirrus: enable opensearch roles in row D [puppet] - 10https://gerrit.wikimedia.org/r/1135828 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [21:41:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db2223 (T391056)', diff saved to https://phabricator.wikimedia.org/P74893 and previous config saved to /var/cache/conftool/dbconfig/20250410-214128-fceratto.json [21:41:32] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:41:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2228.codfw.wmnet with reason: Maintenance [21:41:59] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:42:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T391056)', diff saved to https://phabricator.wikimedia.org/P74894 and previous config saved to /var/cache/conftool/dbconfig/20250410-214205-fceratto.json [21:45:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T391056)', diff saved to https://phabricator.wikimedia.org/P74896 and previous config saved to /var/cache/conftool/dbconfig/20250410-214533-fceratto.json [21:50:41] FIRING: [3x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:55:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:57:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:57:29] (03PS4) 10Ryan Kemper: cirrussearch: update conftool data with new hostnames (row A) [puppet] - 10https://gerrit.wikimedia.org/r/1134755 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:58:23] (03CR) 
10Bking: [C:03+2] cirrussearch: update conftool data with new hostnames (row A) [puppet] - 10https://gerrit.wikimedia.org/r/1134755 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [22:00:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P74897 and previous config saved to /var/cache/conftool/dbconfig/20250410-220040-fceratto.json [22:02:44] (03PS1) 10Ryan Kemper: cirrus: fix some incorrect servers [puppet] - 10https://gerrit.wikimedia.org/r/1135834 (https://phabricator.wikimedia.org/T388610) [22:05:35] (03PS2) 10Ryan Kemper: cirrus: fix some incorrect servers [puppet] - 10https://gerrit.wikimedia.org/r/1135834 (https://phabricator.wikimedia.org/T388610) [22:11:00] (03CR) 10Bking: [C:03+2] cirrus: fix some incorrect servers [puppet] - 10https://gerrit.wikimedia.org/r/1135834 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [22:15:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:15:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P74898 and previous config saved to /var/cache/conftool/dbconfig/20250410-221548-fceratto.json [22:17:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [22:19:39] (03PS1) 10Novem Linguae: CommonSettings: remove outdated SecurePoll comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) [22:20:08] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: 
name=cirrussearch2055.codfw.wmnet|cirrussearch2056.codfw.wmnet|cirrussearch2062.codfw.wmnet|cirrussearch2068.codfw.wmnet|cirrussearch2069.codfw.wmnet|cirrussearch2074.codfw.wmnet|cirrussearch2075.codfw.wmnet|cirrussearch2087.codfw.wmnet|cirrussearch2088.codfw.wmnet|cirrussearch2089.codfw.wmnet|cirrussearch2090.codfw.wmnet|cirrussearch2091.codf [22:20:08] w.wmnet|cirrussearch2111.codfw.wmnet [22:20:41] RESOLVED: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:23:53] FIRING: [5x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:27:11] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:28:53] FIRING: [5x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:30:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T391056)', diff saved to https://phabricator.wikimedia.org/P74899 and previous config saved to /var/cache/conftool/dbconfig/20250410-223055-fceratto.json [22:30:59] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:33:37] PROBLEM - Hadoop NodeManager on an-worker1196 is CRITICAL: 
PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:34:28] (CR) RLazarus: [C:+1] tcpircbot: remove an unnecessary 'global' [puppet] - https://gerrit.wikimedia.org/r/1135762 (owner: Andrew Bogott) [22:37:11] RESOLVED: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:43:53] FIRING: [3x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2072:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:55:37] RECOVERY - Hadoop NodeManager on an-worker1196 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:29:34] ops-eqiad, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644 (phaultfinder) NEW [23:34:28] (CR) Dzahn: [C:+2] phabricator: apply phabricator::migration role on host phab1005 [puppet] - https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [23:36:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3394 MB (3% inode=98%): /tmp 3394 MB (3% inode=98%): /var/tmp 3394 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [23:40:13] (PS1) 
TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135837 [23:40:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135837 (owner: TrainBranchBot) [23:45:02] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:45:02] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [23:46:30] (PS1) Dzahn: phabricator: add puppet7 enforcing to migration role [puppet] - https://gerrit.wikimedia.org/r/1135838 [23:49:56] (CR) Dzahn: [C:+2] phabricator: add puppet7 enforcing to migration role [puppet] - https://gerrit.wikimedia.org/r/1135838 (owner: Dzahn) [23:52:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135837 (owner: TrainBranchBot)