[00:11:17] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 659.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:13:03] (03CR) 10Andrew Bogott: "Great!" [puppet] - 10https://gerrit.wikimedia.org/r/1095192 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [00:13:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:22:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:03] (03CR) 10Andrew Bogott: [C:03+2] validatecloudvpsfqdn.py: Only support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121423 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [00:38:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1121790 [00:38:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1121790 (owner: 10TrainBranchBot) [00:43:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:48:47] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1121790 (owner: 10TrainBranchBot) [01:08:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/next 
[core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1121791 [01:08:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1121791 (owner: 10TrainBranchBot) [01:29:58] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1121791 (owner: 10TrainBranchBot) [01:46:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/00973272266ba27e08fe3256829775817a567c2c8295af953c5a57e917d32fe8/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:03:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:06:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:08:17] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 18.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:33:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:43] PROBLEM - snapshot of s8 in codfw on backupmon1001 is CRITICAL: Last snapshot for s8 at codfw (db2198) taken on 2025-02-24 01:44:27 is 1164 GiB, but the previous one was 1506 GiB, a change of -22.7 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:58:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate grafana-labs.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:34:01] 06SRE, 10Wikimedia-Mailing-lists, 07Chinese-Sites: Request a mailing list for Chinese Wikipedia editors with access to edit filter - https://phabricator.wikimedia.org/T387079#10574036 (10Shizhao) [04:22:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:51:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:01:45] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29349 bytes in 6.709 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:06:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikitech-static [05:17:47] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29322 bytes in 9.834 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:20:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:52:47] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29364 bytes in 8.942 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:55:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:03:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:20:11] RECOVERY - snapshot of s7 in codfw on backupmon1001 is OK: Last snapshot for s7 at codfw (db2198) taken on 2025-02-24 05:20:19 (832 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [06:33:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:56:47] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29359 bytes in 9.461 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:58:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate grafana-labs.discovery.wmnet is about 
to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:59:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:01:45] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29360 bytes in 7.879 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:04:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:16:36] 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#10574128 (10fgiunchedi) Yes exactly, whenever we ship apache httpd we should be shipping apache_exporter too. 
[07:24:41] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29358 bytes in 2.997 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:27:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:38:44] (03PS1) 10Aklapper: phabricator weekly changes email: Catch more color/icon tag issues [puppet] - 10https://gerrit.wikimedia.org/r/1122067 [07:40:41] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 28953 bytes in 3.886 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:43:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:50:41] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 28953 bytes in 2.553 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:54:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T0800) [08:00:05] LD: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:12:47] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29222 bytes in 9.435 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:15:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:19:25] * urbanecm cannot see LD [08:20:21] * urbanecm decides to use window [08:20:30] (03CR) 10Urbanecm: [C:03+2] revalidateLinkRecommendations: Initialize $allowedChecksums [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121600 (https://phabricator.wikimedia.org/T387001) (owner: 10Urbanecm) [08:22:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:16] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4963/co" [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [08:29:53] (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Initialize $allowedChecksums [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121600 (https://phabricator.wikimedia.org/T387001) (owner: 10Urbanecm) [08:30:44] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1121600|revalidateLinkRecommendations: Initialize $allowedChecksums (T387001)]] [08:30:48] T387001: Error: Typed property GrowthExperiments\Maintenance\RevalidateLinkRecommendations::$allowedChecksums must not be accessed before initialization - https://phabricator.wikimedia.org/T387001 [08:31:43] RECOVERY - Wikitech-static main 
page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29358 bytes in 5.803 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:35:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:44:20] (03PS1) 10Brouberol: define main airflow public/discovery DNS records [dns] - 10https://gerrit.wikimedia.org/r/1122074 (https://phabricator.wikimedia.org/T386282) [08:47:07] (03PS1) 10Brouberol: airflow: define caching an ATS redirection rule for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122075 (https://phabricator.wikimedia.org/T386282) [08:47:09] (03PS1) 10Brouberol: airflow: define OIDC service block for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) [08:49:47] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29306 bytes in 8.878 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:52:46] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121600|revalidateLinkRecommendations: Initialize $allowedChecksums (T387001)]] (duration: 22m 02s) [08:52:47] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:52:49] (03PS1) 10Brouberol: airflow: define the airflow-main namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122077 (https://phabricator.wikimedia.org/T386282) [08:52:50] T387001: Error: Typed property GrowthExperiments\Maintenance\RevalidateLinkRecommendations::$allowedChecksums must not be accessed before initialization - https://phabricator.wikimedia.org/T387001 [08:52:50] (03PS1) 10Brouberol: airflow-main: add ns to the 
list of tenant operator namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122078 (https://phabricator.wikimedia.org/T386282) [08:52:52] (03PS1) 10Brouberol: airflow-main: add values and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122079 (https://phabricator.wikimedia.org/T386282) [08:55:55] 06SRE: sqlite::db can get stuck on zero byte file database - https://phabricator.wikimedia.org/T387112 (10fgiunchedi) 03NEW [08:55:57] (03PS1) 10Vgutierrez: varnish: tests, do not merge [puppet] - 10https://gerrit.wikimedia.org/r/1122080 [08:57:14] * urbanecm deploying a security patch [08:57:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.053s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:58:52] (03PS1) 10Filippo Giunchedi: pontoon: new script to wait for puppet to converge [puppet] - 10https://gerrit.wikimedia.org/r/1122081 [09:02:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.053s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:03:05] (03PS1) 10Filippo Giunchedi: pontoon: clarify instructions post-enroll [puppet] - 10https://gerrit.wikimedia.org/r/1122082 [09:03:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:03:39] RECOVERY - Wikitech-static main page has content on 
wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 28942 bytes in 0.518 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [09:06:59] (03PS2) 10Vgutierrez: varnish: tests, do not merge [puppet] - 10https://gerrit.wikimedia.org/r/1122080 [09:07:02] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: new script to wait for puppet to converge [puppet] - 10https://gerrit.wikimedia.org/r/1122081 (owner: 10Filippo Giunchedi) [09:07:10] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: clarify instructions post-enroll [puppet] - 10https://gerrit.wikimedia.org/r/1122082 (owner: 10Filippo Giunchedi) [09:08:58] !log Deployed security patch for T386963 [09:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:58] (03PS1) 10Filippo Giunchedi: TMP cert auth not required when naggen talks to pdb [puppet] - 10https://gerrit.wikimedia.org/r/1122083 [09:11:58] (03PS1) 10Filippo Giunchedi: pontoonctl: add fqdn output for list-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122084 [09:11:58] (03PS1) 10Filippo Giunchedi: pontoon: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/1122085 [09:11:58] (03PS1) 10Filippo Giunchedi: pontoon: add rolegroups functionality [puppet] - 10https://gerrit.wikimedia.org/r/1122086 [09:12:19] (03CR) 10Brouberol: [C:03+2] airflow-analytics: enable kerberos to allow airflow-research to hit the API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121568 (https://phabricator.wikimedia.org/T386933) (owner: 10Brouberol) [09:12:22] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2199.codfw.wmnet with reason: Upgrade and rebuild tables [09:12:41] (03CR) 10CI reject: [V:04-1] pontoon: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/1122085 (owner: 10Filippo Giunchedi) [09:12:51] (03CR) 10CI reject: [V:04-1] pontoon: add rolegroups functionality [puppet] - 10https://gerrit.wikimedia.org/r/1122086 (owner: 10Filippo 
Giunchedi) [09:13:05] (03CR) 10Stevemunene: [C:03+1] define main airflow public/discovery DNS records [dns] - 10https://gerrit.wikimedia.org/r/1122074 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:15:09] (03CR) 10Filippo Giunchedi: [C:03+2] pontoonctl: add fqdn output for list-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122084 (owner: 10Filippo Giunchedi) [09:15:17] (03PS2) 10Filippo Giunchedi: pontoonctl: add fqdn output for list-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122084 [09:16:20] (03CR) 10Filippo Giunchedi: [C:03+2] pontoonctl: add fqdn output for list-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122084 (owner: 10Filippo Giunchedi) [09:18:44] (03CR) 10Stevemunene: airflow: define caching an ATS redirection rule for airflow.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122075 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:20:12] (03CR) 10Stevemunene: "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:20:27] (03CR) 10Stevemunene: [C:03+1] airflow: define the airflow-main namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122077 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:20:45] (03CR) 10Stevemunene: [C:03+1] airflow-main: add ns to the list of tenant operator namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122078 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:20:53] (03CR) 10Ayounsi: [C:03+1] Remove obsolete ospf interface from cr2-esams [homer/public] - 10https://gerrit.wikimedia.org/r/1121484 (https://phabricator.wikimedia.org/T386766) (owner: 10Papaul) [09:24:17] !log cloudsw2-d5-eqiad> restart analytics-agent gracefully - T387018 [09:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:20] T387018: gNMIc connection not working for cloudsw2-d5-eqiad - 
https://phabricator.wikimedia.org/T387018 [09:26:01] (03Abandoned) 10Filippo Giunchedi: TMP cert auth not required when naggen talks to pdb [puppet] - 10https://gerrit.wikimedia.org/r/1122083 (owner: 10Filippo Giunchedi) [09:27:59] (03PS1) 10Filippo Giunchedi: rake: default to python3 [puppet] - 10https://gerrit.wikimedia.org/r/1122090 [09:28:34] (03CR) 10Brouberol: [C:03+2] airflow: define the airflow-main namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122077 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:28:37] (03CR) 10Brouberol: [C:03+2] airflow-main: add ns to the list of tenant operator namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122078 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:30:25] (03CR) 10CI reject: [V:04-1] rake: default to python3 [puppet] - 10https://gerrit.wikimedia.org/r/1122090 (owner: 10Filippo Giunchedi) [09:32:09] (03CR) 10Brouberol: airflow: define caching an ATS redirection rule for airflow.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122075 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:32:13] !log Start GrowthExperiments:revalidateLinkRecommendations.php for frwiki, eswiki, ptwiki and idwiki (T385780) [09:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:17] T385780: Retrain Add Link models before deploying Surfacing Structured Tasks - https://phabricator.wikimedia.org/T385780 [09:32:25] (03CR) 10Stevemunene: [C:03+1] airflow-main: add values and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122079 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:32:25] (03PS2) 10Brouberol: airflow: define caching an ATS redirection rule for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122075 (https://phabricator.wikimedia.org/T386282) [09:32:26] (03PS2) 10Brouberol: airflow: define OIDC service block for 
airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) [09:32:46] (03Merged) 10jenkins-bot: airflow: define the airflow-main namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122077 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:32:53] (03Merged) 10jenkins-bot: airflow-main: add ns to the list of tenant operator namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122078 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:33:04] 06SRE, 06Infrastructure-Foundations, 10netops: gNMIc connection not working for cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T387018#10574276 (10ayounsi) The switch is running a too old junos version for `analytics-agent`. I tried `cloudsw2-d5-eqiad> restart SDN-Telemetry gracefully` instead, but th... [09:33:19] (03CR) 10Stevemunene: [C:03+1] airflow: define caching an ATS redirection rule for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122075 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:33:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:33:53] (03PS3) 10Brouberol: airflow: define OIDC service block for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) [09:34:28] (03CR) 10Filippo Giunchedi: "This was only a test to see what failed, for the record:" [puppet] - 10https://gerrit.wikimedia.org/r/1122090 (owner: 10Filippo Giunchedi) [09:35:22] (03CR) 10Filippo Giunchedi: "Riccardo I'm interested in your thoughts in general on this (100% not urgent) though it might be time we 
assume python3 vs python2" [puppet] - 10https://gerrit.wikimedia.org/r/1122090 (owner: 10Filippo Giunchedi) [09:36:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:37:43] (03CR) 10Brouberol: [C:03+2] define main airflow public/discovery DNS records [dns] - 10https://gerrit.wikimedia.org/r/1122074 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:37:52] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2183.codfw.wmnet with reason: Upgrade and rebuild tables [09:37:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:38:01] !log brouberol@dns1004 START - running authdns-update [09:38:16] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2184.codfw.wmnet with reason: Upgrade and rebuild tables [09:39:19] (03PS2) 10Filippo Giunchedi: pontoon: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/1122085 [09:39:20] (03PS2) 10Filippo Giunchedi: pontoon: add rolegroups functionality [puppet] - 10https://gerrit.wikimedia.org/r/1122086 [09:39:59] !log brouberol@dns1004 END - running authdns-update [09:41:02] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/1122085 (owner: 10Filippo Giunchedi) [09:41:06] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add rolegroups functionality [puppet] - 10https://gerrit.wikimedia.org/r/1122086 (owner: 10Filippo Giunchedi) [09:41:53] (03PS3) 10Filippo Giunchedi: pontoon: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/1122085 [09:41:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120534 (owner: 10Gmodena) [09:41:59] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: reorganize code 
[puppet] - 10https://gerrit.wikimedia.org/r/1122085 (owner: 10Filippo Giunchedi) [09:42:16] (03PS3) 10Filippo Giunchedi: pontoon: add rolegroups functionality [puppet] - 10https://gerrit.wikimedia.org/r/1122086 [09:42:23] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: add rolegroups functionality [puppet] - 10https://gerrit.wikimedia.org/r/1122086 (owner: 10Filippo Giunchedi) [09:44:33] (03CR) 10Brouberol: [C:03+2] airflow-main: add values and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122079 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:48:22] (03PS4) 10Brouberol: airflow: define OIDC service block for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) [09:48:22] (03PS1) 10Brouberol: airflow-main: define user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1122091 (https://phabricator.wikimedia.org/T386282) [09:50:21] (03PS2) 10Elukey: services: update Kartotherian's replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926) [09:50:49] (03CR) 10Elukey: "I realized that a resource quota change is also required to spin up this capacity, added to the change!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [09:51:09] 06SRE, 06Infrastructure-Foundations, 10netops: gNMIc connection not working for cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T387018#10574369 (10cmooney) >>! In T387018#10574276, @ayounsi wrote: > The switch is running a too old junos version for `analytics-agent`. I tried `cloudsw2-d5-eqiad> restar... 
[09:52:18] (03CR) 10Stevemunene: [C:03+1] airflow: define OIDC service block for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:52:41] (03CR) 10Stevemunene: [C:03+1] airflow-main: define user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1122091 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:52:55] (03CR) 10Brouberol: [C:03+2] airflow-main: define user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1122091 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:53:19] (03PS2) 10Brouberol: airflow-main: define user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1122091 (https://phabricator.wikimedia.org/T386282) [09:53:19] (03PS3) 10Brouberol: airflow: define caching an ATS redirection rule for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122075 (https://phabricator.wikimedia.org/T386282) [09:53:19] (03PS5) 10Brouberol: airflow: define OIDC service block for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) [09:55:44] (03CR) 10Brouberol: [C:03+2] airflow-main: define user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1122091 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:56:53] (03PS3) 10Vgutierrez: varnish: tests, do not merge [puppet] - 10https://gerrit.wikimedia.org/r/1122080 [09:57:10] (03PS5) 10Fabfur: hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) [09:57:12] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on cr2-magru with reason: IBGP instability from cr1 to cr2 in magru causing ping failures from alert1002 [09:57:28] (03PS3) 10Elukey: services: update Kartotherian's replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121391 
(https://phabricator.wikimedia.org/T386926) [09:58:10] (03PS4) 10Elukey: services: update Kartotherian's replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926) [09:58:40] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:59:04] 06SRE, 06Infrastructure-Foundations, 10netops: gNMIc connection not working for cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T387018#10574426 (10ayounsi) Enabling traceoptions shows a `no shared cipher` error on the switch : ` Feb 24 09:33:58 ssl_transport_security.c:948: Handshake failed with fatal... [10:00:06] (03PS4) 10Vgutierrez: varnish: tests, do not merge [puppet] - 10https://gerrit.wikimedia.org/r/1122080 [10:01:02] (03PS6) 10Brouberol: airflow: define OIDC service block for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) [10:01:02] (03PS4) 10Brouberol: airflow: define caching an ATS redirection rule for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122075 (https://phabricator.wikimedia.org/T386282) [10:05:32] (03CR) 10Filippo Giunchedi: [C:03+2] grafana: set timeinterval 30s for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/1058106 (https://phabricator.wikimedia.org/T371102) (owner: 10Filippo Giunchedi) [10:06:36] (03Abandoned) 10Vgutierrez: varnish: tests, do not merge [puppet] - 10https://gerrit.wikimedia.org/r/1122080 (owner: 10Vgutierrez) [10:07:22] (03CR) 10Vgutierrez: [V:03+1 C:03+1] "I ran the tests hacking our varnish/files/tests/Dockerfile, please do not forget to update it in the future. 
Text & upload tests are ha" [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [10:07:30] (03CR) 10Ayounsi: [C:03+1] Update policy for K8s BGP to allow a wider range of v4 prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/1121438 (https://phabricator.wikimedia.org/T375845) (owner: 10Cathal Mooney) [10:07:55] !log set grafana thanos datasource interval to 30s - T371102 [10:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:58] T371102: Include long-retention Prometheus data from Thanos into Grafana queries - https://phabricator.wikimedia.org/T371102 [10:12:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [10:12:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [10:13:49] (03CR) 10Volans: "Yeah it's probably time :) We do have still py2 on very few hosts that are about to go away soon~ish and I think most of the code runs alr" [puppet] - 10https://gerrit.wikimedia.org/r/1122090 (owner: 10Filippo Giunchedi) [10:15:30] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics-Cluster + Kerberos for @Michael - https://phabricator.wikimedia.org/T387114 (10Michael) 03NEW [10:15:36] (03CR) 10Fabfur: hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:18:42] (03CR) 10Vgutierrez: [C:03+1] hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:20:55] (03CR) 10JMeybohm: [C:03+2] pki::get_cert: Allow to get the same cert twice [puppet] - 10https://gerrit.wikimedia.org/r/1120464 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:20:58] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:23:31] (03PS1) 10ZhaoFJx: cowikimedia: Change the logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122095 (https://phabricator.wikimedia.org/T386872) [10:23:35] (03CR) 10Brouberol: [C:03+2] airflow: define OIDC service block for airflow.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1122076 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:23:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122095 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [10:24:19] (03CR) 10Fabfur: [C:03+2] hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:24:33] (03PS2) 10Ayounsi: Expose _gql_execute to wmf-netbox [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [10:24:34] (03CR) 10Ayounsi: "Ready-ish. I'll probably need help for the tests." 
[software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [10:26:23] (03CR) 10Jgiannelos: [C:03+2] pcs: Increase TTL for cassandra storage in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121619 (owner: 10Jgiannelos) [10:26:52] (03CR) 10Jgiannelos: [C:03+1] trafficserver: use mobileapps directly for hewiki APIs [puppet] - 10https://gerrit.wikimedia.org/r/1117508 (https://phabricator.wikimedia.org/T372746) (owner: 10Hnowlan) [10:27:45] (03Merged) 10jenkins-bot: pcs: Increase TTL for cassandra storage in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121619 (owner: 10Jgiannelos) [10:28:40] (03CR) 10JMeybohm: [C:03+2] Add second pair of kubeconfig files for restricted users [puppet] - 10https://gerrit.wikimedia.org/r/1120462 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:30:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:31:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121473 (owner: 10Albertoleoncio) [10:34:27] !log hashar@deploy2002 Started deploy [integration/docroot@59d9e3f]: update links to microsites source code - T300171 [10:34:30] T300171: Move micro sites from Ganeti to Kubernetes and from Gerrit to GitLab - https://phabricator.wikimedia.org/T300171 [10:34:35] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics-Cluster + Kerberos for @Michael - https://phabricator.wikimedia.org/T387114#10574565 (10Urbanecm_WMF) [10:34:38] !log hashar@deploy2002 Finished deploy [integration/docroot@59d9e3f]: update links to microsites source code - T300171 (duration: 00m 10s) [10:34:48] (03PS1) 
10Vgutierrez: hiera: Add scapy to cookbooks dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1122097 (https://phabricator.wikimedia.org/T373020) [10:35:03] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics-Cluster + Kerberos for @Michael - https://phabricator.wikimedia.org/T387114#10574574 (10Urbanecm_WMF) [10:35:25] (03PS1) 10Fabfur: benthos: fix previous change (renamed instance) [puppet] - 10https://gerrit.wikimedia.org/r/1122098 (https://phabricator.wikimedia.org/T329332) [10:36:02] (03PS1) 10Federico Ceratto: pool.py: Add basic typing to allow mypy checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1122099 (https://phabricator.wikimedia.org/T383760) [10:37:56] (03CR) 10Vgutierrez: [C:03+1] benthos: fix previous change (renamed instance) [puppet] - 10https://gerrit.wikimedia.org/r/1122098 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:39:29] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1122097 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [10:40:11] (03CR) 10Fabfur: [C:03+2] benthos: fix previous change (renamed instance) [puppet] - 10https://gerrit.wikimedia.org/r/1122098 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:40:36] (03CR) 10Vgutierrez: [C:03+2] hiera: Add scapy to cookbooks dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1122097 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [10:42:06] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics-Cluster + Kerberos for @Michael - https://phabricator.wikimedia.org/T387114#10574725 (10DMburugu) I approve the request. 
[10:44:45] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [10:44:56] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1005* for test ability to ban opensearch node - brouberol@cumin2002 - T387030 [10:44:57] !log brouberol@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1005* for test ability to ban opensearch node - brouberol@cumin2002 - T387030 [10:45:00] T387030: Recompile/repackage elasticsearch-madvise for Opensearch - https://phabricator.wikimedia.org/T387030 [10:52:42] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [10:58:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate grafana-labs.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T1100) [11:01:26] 06SRE, 06Infrastructure-Foundations: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098#10574793 (10Fabfur) Just for info: this happened today on cp4037 installing Benthos [11:02:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:03:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:10:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:13:22] 
(03PS6) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [11:15:32] (03CR) 10Federico Ceratto: "Just a basic readability/safety improvement" [cookbooks] - 10https://gerrit.wikimedia.org/r/1122099 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [11:28:00] (03CR) 10Filippo Giunchedi: "Thank you for the feedback and good point re: python2 files. I ran a quick audit of .py files without shebang and run 2to3 on them to get " [puppet] - 10https://gerrit.wikimedia.org/r/1122090 (owner: 10Filippo Giunchedi) [11:32:07] (03CR) 10Ladsgroup: [C:04-1] "Sorry, we can't explain what those reasons are. I hope you understand." [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [11:33:49] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10574847 (10Ladsgroup) >>! In T379942#10557092, @Ladsgroup wrote: > The second batch is around 70% done now. Will probably finish in a week or so. Almost done now. Probably by tomorrow. 
[11:35:08] (03PS1) 10Fabfur: benthos: removing benthos on cp-text_ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1122108 (https://phabricator.wikimedia.org/T256098) [11:37:00] (03PS2) 10Fabfur: benthos: removing benthos on cp-text_ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1122108 (https://phabricator.wikimedia.org/T256098) [11:38:02] (03CR) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [11:38:02] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122108 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [11:45:31] (03PS4) 10Krinkle: tables-catalog: remove module_deps table [puppet] - 10https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T385997) (owner: 10Hokwelum) [11:45:34] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: remove module_deps table [puppet] - 10https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T385997) (owner: 10Hokwelum) [11:45:35] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: remove module_deps table [puppet] - 10https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T385997) (owner: 10Hokwelum) [11:46:36] (03CR) 10Kamila Součková: [C:03+1] services: update Kartotherian's replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [11:47:08] !log mvernon@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum three overlarge container dbs [11:47:15] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10574878 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=a7456e01-a522-4f97-b5f1-504f2a01a14e) set by mvernon@cumin... [11:53:52] !log mvernon@cumin1002 START - Cookbook sre.hosts.remove-downtime for ms-be1066.eqiad.wmnet [11:53:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1066.eqiad.wmnet [11:55:57] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10574922 (10MatthewVernon) ms-be1066 alerted again; I vacuumed the 3 4.1G ones, leaving 13 4.0G ones left. As before, they vacuum down... [12:01:14] (03CR) 10Urbanecm: [C:04-1] mediawiki: Migrate one dry-run job to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [12:12:19] (03CR) 10Hnowlan: "lgtm mostly, one note" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [12:17:31] !log disable BGP to edgeuno in magru - T387006 [12:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:35] T387006: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006 [12:18:58] (03PS11) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (https://phabricator.wikimedia.org/T372749) [12:20:54] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 2052085968 and 93 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:21:34] (03CR) 10Jgiannelos: pcs: Expose port for native prometheus metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [12:22:18] (03CR) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [12:22:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:37] (03Abandoned) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [12:22:54] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 80224 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:21] (03CR) 10Hnowlan: [C:03+1] pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [12:30:40] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:30:55] (03CR) 10Jgiannelos: [C:03+2] pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [12:32:07] (03Merged) 10jenkins-bot: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [12:33:09] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070484 (owner: 10PipelineBot) [12:33:09] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121404 (owner: 10PipelineBot) [12:33:09] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117916 
(owner: 10PipelineBot) [12:33:10] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075872 (owner: 10PipelineBot) [12:35:35] (03CR) 10Fabfur: [C:03+2] benthos: removing benthos on cp-text_ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1122108 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [12:43:25] (03PS1) 10Cathal Mooney: Remove SONiC interface naming pattern and add Nokia SRL one [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122128 (https://phabricator.wikimedia.org/T371088) [12:46:19] (03CR) 10JMeybohm: [C:03+2] cert-manager: Allow prometheus to scrape all components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120193 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:46:54] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 73236512 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:47:19] (03PS2) 10Cathal Mooney: Remove SONiC interface naming pattern and add Nokia SRL one [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122128 (https://phabricator.wikimedia.org/T371088) [12:47:54] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 95584 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:48:23] (03CR) 10Ayounsi: [C:03+1] Remove SONiC interface naming pattern and add Nokia SRL one [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122128 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [12:50:45] (03Merged) 10jenkins-bot: cert-manager: Allow prometheus to scrape all components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120193 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:52:05] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: add new nokia int dns - cmooney@cumin1002" [12:52:10] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new nokia int dns - cmooney@cumin1002" [12:52:11] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:53:56] (03CR) 10Cathal Mooney: [C:03+2] Remove SONiC interface naming pattern and add Nokia SRL one [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122128 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [12:55:51] (03Merged) 10jenkins-bot: Remove SONiC interface naming pattern and add Nokia SRL one [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122128 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [12:57:22] FIRING: ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:57] (03PS1) 10Elukey: knative-serving: fix drop capabilities [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122129 (https://phabricator.wikimedia.org/T369493) [12:58:55] (03PS2) 10Elukey: knative-serving: fix drop capabilities [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122129 (https://phabricator.wikimedia.org/T369493) [12:59:33] RESOLVED: ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:28] PROBLEM - Check unit status of 
httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:00:35] (03PS1) 10Cathal Mooney: Do not check vlan is valid for irb parent names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122132 (https://phabricator.wikimedia.org/T371088) [13:02:15] (03CR) 10Elukey: [C:03+2] services: update Kartotherian's replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [13:04:33] FIRING: [7x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:25] (03CR) 10Ayounsi: [C:03+1] Do not check vlan is valid for irb parent names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122132 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:10:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [13:12:57] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1122099 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [13:14:53] (03CR) 10Cathal Mooney: [C:03+2] Do not check vlan is valid for irb parent names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122132 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:15:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [13:16:12] (03PS1) 10Ayounsi: Netbox: fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1122133 [13:17:36] (03Merged) 10jenkins-bot: Do not check vlan is valid for irb parent names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1122132 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:17:50] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117884 (owner: 10PipelineBot) [13:17:52] (03PS2) 10Ayounsi: Netbox: fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1122133 [13:17:55] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116778 (owner: 10PipelineBot) [13:21:17] (03PS8) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [13:21:23] (03CR) 10CI reject: [V:04-1] WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [13:22:19] (03CR) 10CI reject: [V:04-1] Netbox: fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1122133 (owner: 10Ayounsi) [13:22:56] (03Restored) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121404 (owner: 10PipelineBot) [13:23:10] (03PS2) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121404 [13:23:17] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1121404 (owner: 10PipelineBot) [13:24:23] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121404 (owner: 10PipelineBot) [13:29:45] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:30:32] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:30:54] (03CR) 10Volans: sre.mysql.pool: sanity check for depool operations (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [13:34:07] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:34:38] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3511 MB (3% inode=98%): /tmp 3511 MB (3% inode=98%): /var/tmp 3511 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [13:35:28] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet [13:35:42] (03PS13) 10Tiziano Fogli: cloudgw: move icmp checks under wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) [13:35:42] (03CR) 10Tiziano Fogli: "Thank you @dcaro@wikimedia.org, and thank you @aborrero@wikimedia.org." 
[puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [13:35:49] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:35:54] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet [13:37:18] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:37:23] (03PS1) 10Jgiannelos: pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 [13:37:32] (03CR) 10CI reject: [V:04-1] pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 (owner: 10Jgiannelos) [13:37:35] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:37:47] (03PS2) 10Jgiannelos: pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 [13:37:51] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. [13:38:06] (03PS1) 10Ayounsi: Add GraphQL queries to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1122138 [13:38:17] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [13:38:18] (03CR) 10Jgiannelos: "Helm failure:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 (owner: 10Jgiannelos) [13:38:47] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[13:40:16] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [13:40:51] (03PS3) 10Jgiannelos: pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 [13:40:55] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [13:40:59] (03CR) 10CI reject: [V:04-1] pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 (owner: 10Jgiannelos) [13:41:14] (03PS4) 10Jgiannelos: pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 [13:41:47] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test nokia switches - cmooney@cumin1002" [13:41:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test nokia switches - cmooney@cumin1002" [13:41:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:41:52] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [13:42:15] (03CR) 10Vgutierrez: [C:03+1] trafficserver: use mobileapps directly for hewiki APIs [puppet] - 10https://gerrit.wikimedia.org/r/1117508 (https://phabricator.wikimedia.org/T372746) (owner: 10Hnowlan) [13:42:23] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [13:43:16] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1004.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [13:44:25] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:47:14] (03PS1) 10Elukey: services: raise memory for Kartotherian pods [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1122140 (https://phabricator.wikimedia.org/T386926) [13:47:18] (03PS3) 10Ayounsi: Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 [13:48:01] (03CR) 10Volans: Fix tox bug (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [13:48:43] (03PS2) 10Elukey: services: raise memory for Kartotherian pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122140 (https://phabricator.wikimedia.org/T386926) [13:48:56] (03PS4) 10Ayounsi: Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 [13:50:02] (03PS5) 10Ayounsi: Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 [13:50:16] (03CR) 10Ayounsi: Fix tox bug (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [13:52:18] !log re-enable BGP to edgeuno in magru - T387006 [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:22] T387006: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006 [13:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:18] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [13:53:26] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10575129 (10ayounsi) > Our datacenter engineering team has concluded the on-site activity, and no problems were found on our side. Could you please confirm... 
[13:56:12] (03CR) 10CI reject: [V:04-1] Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [13:56:38] (03CR) 10Effie Mouzeli: [C:03+1] aptrepo: update pcre2 backport from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [13:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T1400) [14:00:05] albertoleoncio: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#10575145 (10MatthewVernon) I spoke to @cmooney about this in Atlanta, and I think my understanding is: # This can be done host-by-host # Only c... [14:00:20] Hi! 
[14:00:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:00:33] I’m here but would really like for someone else to do the deployment :/ [14:02:08] Lucas_WMDE: Let's see if someone else is online [14:02:22] FIRING: [7x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:02:31] (03CR) 10Klausman: [C:03+1] knative-serving: fix drop capabilities [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122129 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:03:37] 06SRE, 10Wikimedia-Mailing-lists, 07Chinese-Sites: Request a mailing list for Chinese Wikipedia editors with access to edit filter - https://phabricator.wikimedia.org/T387079#10575157 (10Yiming) [14:04:07] (03CR) 10Elukey: [C:03+2] services: raise memory for Kartotherian pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122140 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [14:05:01] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#10575159 (10ayounsi) @MatthewVernon that's correct. Thanks!
[14:05:06] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 (owner: 10Jgiannelos) [14:05:36] (03CR) 10Jgiannelos: [C:03+2] pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 (owner: 10Jgiannelos) [14:05:38] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [14:07:02] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10575160 (10RobH) Ok, had some issues with the firmware version not allowing me to pull up HTTPS but updating it to the latest resolved the issue. Opened 206013283 for cp7015 as the example service tag num... [14:07:27] (03Merged) 10jenkins-bot: pcs: Change port name to max 15 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122137 (owner: 10Jgiannelos) [14:10:03] Lucas_WMDE: Yeah... we only have you here :-) [14:10:08] (03PS1) 10Jgiannelos: pcs: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122142 [14:10:32] (03PS2) 10Jgiannelos: pcs: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122142 [14:11:03] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] pcs: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122142 (owner: 10Jgiannelos) [14:11:31] (03CR) 10Jgiannelos: [C:03+2] pcs: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122142 (owner: 10Jgiannelos) [14:11:59] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! 
Checking my local repo I had added some stuff under device_type in the device query: https://phabricator.wikimedia.org/P73509" [homer/public] - 10https://gerrit.wikimedia.org/r/1122138 (owner: 10Ayounsi) [14:12:05] alright [14:12:09] I should also backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1121354 anyway [14:12:14] so let’s do that I guess [14:12:33] !log elukey@puppetserver1001 conftool action : set/pooled=inactive:weight=5; selector: name=wikikube-worker1004.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:12:40] !log elukey@puppetserver1001 conftool action : set/pooled=inactive:weight=5; selector: name=wikikube-worker1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:12:42] (03PS1) 10Lucas Werkmeister (WMDE): Fix bad state transition on unknown snak type [extensions/Wikibase] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122143 (https://phabricator.wikimedia.org/T384625) [14:12:44] (03Merged) 10jenkins-bot: pcs: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122142 (owner: 10Jgiannelos) [14:13:05] albertoleoncio: is there a task for your change? [14:13:39] Nope. It's just a logo update of our affiliate, it didn't seem necessary. [14:13:53] so far I’ve only found T386402 for a related change [14:13:53] T386402: Request to move translatable page: Wiki_Movement_Brazil_User_Group at Meta-Wiki - https://phabricator.wikimedia.org/T386402 [14:13:58] IMHO it would be good to have a task for it [14:14:14] (03Abandoned) 10Fabfur: hiera: add haproxy dummy ring configuration everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1120228 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:14:17] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:14:33] Can I tag this same ticket?
[14:14:51] I don’t think so, that task is very specific [14:14:51] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:15:02] I could imagine a general “Work related to Wikimedia Brasil rename” task [14:15:06] with that and a new task as subtasks [14:15:08] (03PS1) 10Cathal Mooney: Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) [14:15:10] Ok. Let me do that really quick. [14:15:17] depends on if there’s going to be more related tasks later [14:15:25] otherwise just a task for the logo would also work imho [14:15:27] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:15:39] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "start gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122143 (https://phabricator.wikimedia.org/T384625) (owner: 10Lucas Werkmeister (WMDE)) [14:16:23] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [14:16:28] (03CR) 10CI reject: [V:04-1] Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) (owner: 10Cathal Mooney) [14:17:05] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:17:41] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:17:56] (03PS2) 10Albertoleoncio: brwikimedia: update icon, logo and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121473 (https://phabricator.wikimedia.org/T387125) [14:17:57] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:18:33] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:19:02] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test nokia switches - cmooney@cumin1002" [14:19:24] Lucas_WMDE: Done [14:19:25] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test nokia switches - cmooney@cumin1002" [14:19:25] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:29] thanks, looking [14:19:31] (03PS2) 10Cathal Mooney: Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) [14:20:58] (03CR) 10Lucas Werkmeister (WMDE): brwikimedia: update icon, logo and wordmark (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121473 (https://phabricator.wikimedia.org/T387125) (owner: 10Albertoleoncio) [14:21:02] (03CR) 10CI reject: [V:04-1] Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) (owner: 10Cathal Mooney) [14:21:24] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [14:24:27] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1005.eqiad.wmnet with OS bullseye [14:25:08] (03PS3) 10Cathal Mooney: Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) [14:26:09] (03Merged) 10jenkins-bot: Fix bad state transition on unknown snak type [extensions/Wikibase] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122143 (https://phabricator.wikimedia.org/T384625) (owner: 10Lucas Werkmeister (WMDE)) [14:26:15] hm ok [14:26:18] (03CR) 10Brouberol: [C:03+2] relforge: reassign relforge1005 to Opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1121711 
(https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [14:26:19] then let’s deploy that before the config change [14:26:32] (03CR) 10CI reject: [V:04-1] Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) (owner: 10Cathal Mooney) [14:26:45] (03PS3) 10Ssingh: varnish: add schoolwiki.in to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1115031 (https://phabricator.wikimedia.org/T383210) [14:26:48] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1122143|Fix bad state transition on unknown snak type (T384625)]] [14:26:52] T384625: Special:EntityData, dump creation: LogicException: Bad transition: 10 -> 10 - https://phabricator.wikimedia.org/T384625 [14:27:40] I'm running the logo script again. It may take a minute, or two... [14:27:56] ack, thanks [14:27:59] (03CR) 10Ssingh: [C:03+2] varnish: add schoolwiki.in to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1115031 (https://phabricator.wikimedia.org/T383210) (owner: 10Ssingh) [14:29:35] already tested on mwdebug, https://www.wikidata.org/wiki/Special:EntityData/Q4941387.ttl?flavor=dump&revision=2305050347 no longer crashes there (though k8s isn’t quite finished deploying to mwdebug yet) [14:29:47] (03PS3) 10Albertoleoncio: brwikimedia: update icon, logo and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121473 (https://phabricator.wikimedia.org/T387125) [14:30:23] (03PS4) 10Cathal Mooney: Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) [14:30:31] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1122143|Fix bad state transition on unknown snak type (T384625)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:30:33] !log 
lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:30:37] (03CR) 10Albertoleoncio: brwikimedia: update icon, logo and wordmark (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121473 (https://phabricator.wikimedia.org/T387125) (owner: 10Albertoleoncio) [14:31:09] done [14:31:34] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [14:32:54] !log sudo cumin -b11 A:cp-upload 'run-puppet-agent' [14:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:14] thanks! [14:34:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10575244 (10phaultfinder) [14:36:24] (03CR) 10Ssingh: [C:03+1] Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) (owner: 10Cathal Mooney) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:06] (03CR) 10Cathal Mooney: [C:03+2] Add IPv6 reverse entries for test vlans in Nokia lab codfw rack A8 [dns] - 10https://gerrit.wikimedia.org/r/1122144 (https://phabricator.wikimedia.org/T385217) (owner: 10Cathal Mooney) [14:37:20] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122143|Fix bad state transition on unknown snak type (T384625)]] (duration: 10m 31s) [14:37:24] T384625: Special:EntityData, dump creation: LogicException: Bad transition: 10 -> 10 - https://phabricator.wikimedia.org/T384625 [14:37:35] !log cmooney@dns2005 START - running authdns-update [14:37:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1121473 (https://phabricator.wikimedia.org/T387125) (owner: 10Albertoleoncio) [14:39:05] (03Merged) 10jenkins-bot: brwikimedia: update icon, logo and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121473 (https://phabricator.wikimedia.org/T387125) (owner: 10Albertoleoncio) [14:39:13] !log cmooney@dns2005 END - running authdns-update [14:39:21] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1121473|brwikimedia: update icon, logo and wordmark (T387125)]] [14:39:28] T387125: brwikimedia: update icon, logo and wordmark - https://phabricator.wikimedia.org/T387125 [14:41:39] !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1005.eqiad.wmnet with reason: host reimage [14:42:06] LGTM on k8s-mwdebug =D [14:42:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [14:42:44] Deployment kartotherian-main in kartotherian at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=kartotherian&var-deployment=kartotherian-main - ... 
[14:42:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:43:05] nice ^^ [14:43:07] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, albertoleoncio: Backport for [[gerrit:1121473|brwikimedia: update icon, logo and wordmark (T387125)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:43:40] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, albertoleoncio: Continuing with sync [14:45:22] 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#10575256 (10Dzahn) For style reasons we should not include profiles directly inside modules though. So I would say that is more like a `profile::prometheus::apache_exporter`... [14:45:29] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1005.eqiad.wmnet with reason: host reimage [14:46:14] (03PS1) 10Bking: wdqs-categories: use new split graph hosts (wdqs-main) for categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122151 (https://phabricator.wikimedia.org/T375520) [14:46:46] (03CR) 10Ssingh: "Nice and clean!" [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [14:47:02] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10575269 (10ssingh) 05Open→03Resolved @Gnoeee: This has been rolled out and should now be live. Please feel free to re-open this task if there are any issues. Thank you! 
[14:47:32] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [14:47:38] (03CR) 10Dzahn: [C:03+2] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Sylheti" [dns] - 10https://gerrit.wikimedia.org/r/1119722 (https://phabricator.wikimedia.org/T386441) (owner: 10Gerrit maintenance bot) [14:47:54] (03PS2) 10Gerrit maintenance bot: Add syl to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1119722 (https://phabricator.wikimedia.org/T386441) [14:48:15] (03PS1) 10Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) [14:50:21] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [14:50:38] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121473|brwikimedia: update icon, logo and wordmark (T387125)]] (duration: 11m 17s) [14:50:41] T387125: brwikimedia: update icon, logo and wordmark - https://phabricator.wikimedia.org/T387125 [14:51:05] !log printf 'https://en.wikipedia.org/static/images/%s\n' icons/brwikimedia.svg mobile/copyright/wikimedia-wordmark-br.svg project-logos/brwikimedia{,-{1.5,2}x}.png | mwscript-k8s --comment=T387125 --attach -- purgeList enwiki [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:18] albertoleoncio: should be done [14:51:26] and congratulations on becoming a chapter \o/ [14:51:44] Yep, it's live now =D [14:51:52] and thanks!
:-) [14:52:21] hm, but if I force-reload https://br.wikimedia.org/wiki/P%C3%A1gina_principal it still shows me the WMB logo o_O [14:53:01] !log printf 'https://br.wikimedia.org/static/images/%s\n' icons/brwikimedia.svg mobile/copyright/wikimedia-wordmark-br.svg project-logos/brwikimedia{,-{1.5,2}x}.png | mwscript-k8s --comment='T387125 (should not be necessary but let’s try it?)' --attach -- purgeList enwiki [14:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] nope, still the old logo https://br.wikimedia.org/static/images/icons/brwikimedia.svg [14:53:33] Here it shows the new logo... [14:53:50] weird [14:54:03] Maybe a cache problem. I'm on magru cache server, so maybe that's the problem [14:54:09] yeah it’s gotta be some kind of caching [14:54:15] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10575282 (10Ranjithsiji) @ssingh Thank you for doing this. This will be helpful to schoolwiki. I will check with the server engineer of school wiki to test this. And we will implement this c... [14:54:16] but the purgeList maintenance script is supposed to purge the cache ^^ [14:54:39] Lucas_WMDE: how did you run the maint script? [14:55:03] urbanecm: mwscript-k8s, as logged to SAL [14:55:10] https://phabricator.wikimedia.org/T387125#10575277 [14:55:30] with en.wikipedia.org URLs first, and then br.wikimedia.org when the former didn’t seem to work [14:55:31] (03PS1) 10Elukey: admin_ng: increase cpu request/limits for kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122154 (https://phabricator.wikimedia.org/T386926) [14:55:36] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1119722 (https://phabricator.wikimedia.org/T386441) (owner: 10Gerrit maintenance bot) [14:55:46] it should be en.wikipedia.org AFAIK [14:56:20] maybe try the oldschool way (mwscript)?
[14:57:09] (03PS1) 10Ssingh: wikimedia-ech: add ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1122155 [14:57:38] I guess I can try it [14:57:43] Well, I tried even https://aa.wikipedia.org/static/images/icons/brwikimedia.svg, it shows the new logo here [14:57:50] !log dzahn@dns1004 START - running authdns-update [14:58:01] that one also shows me the old logo [14:58:13] IIUC /static/ is mapped to en.wikipedia.org internally, so the host wouldn’t matter for caching purposes [14:58:13] sounds like a partial purge :/ [14:58:27] (03PS2) 10Ssingh: wikimedia-ech: add ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) [14:58:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate grafana-labs.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:58:55] !log DNS - new Wikimedia project language 'syl' - https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Sylheti [14:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:58] !log printf 'https://en.wikipedia.org/static/images/%s\n' icons/brwikimedia.svg mobile/copyright/wikimedia-wordmark-br.svg project-logos/brwikimedia{,-{1.5,2}x}.png | mwscript purgeList enwiki # T387125 [14:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:02] T387125: brwikimedia: update icon, logo and wordmark - https://phabricator.wikimedia.org/T387125 [14:59:06] that did the trick [14:59:11] oh that sucks :( [14:59:17] here too [14:59:28] question is, was that a quirk or does k8s have problems with purges [14:59:47] I’ll go make a task [14:59:47] !log dzahn@dns1004 END - running authdns-update [14:59:55] I wouldn’t have expected that to have any effect [15:00:05] urbanecm, Cyndywikime, and sergi0: It is that lovely time of the day again! 
You are hereby commanded to deploy Community Configuration migration. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T1500). [15:00:06] doesn’t the cache live outside k8s anyway? or at least outside mw-on-k8s [15:00:18] I thought purgeList just sends special HTTP commands to the caches [15:00:20] anyway [15:00:24] !log UTC afternoon backport+config window done [15:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:26] urbanecm: your window now :P [15:00:26] it does, but the purge command might be stopped somewhere [15:00:31] yeah, could be [15:00:40] thank you! waiting on my colleagues now :) [15:01:00] am here :) [15:01:22] (03CR) 10Dzahn: [C:03+2] phabricator weekly changes email: Catch more color/icon tag issues [puppet] - 10https://gerrit.wikimedia.org/r/1122067 (owner: 10Aklapper) [15:03:22] filed T387127 [15:03:24] T387127: mwscript-k8s purgeList does not reliably purge cached URLs - https://phabricator.wikimedia.org/T387127 [15:03:26] ty! 
[15:04:37] (03PS1) 10Sergio Gimeno: LevelingUp: schema migration for GELevelingUpKeepGoingNotificationThresholds [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122156 (https://phabricator.wikimedia.org/T369551) [15:05:39] (something something two hard problems in computer science) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:57] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1005.eqiad.wmnet with OS bullseye [15:07:04] (03PS1) 10Fabfur: hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) [15:07:45] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:09:19] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge [15:09:21] !log brouberol@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge [15:09:45] (03PS2) 10Sergio Gimeno: LevelingUp: schema migration for GELevelingUpKeepGoingNotificationThresholds [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122156 (https://phabricator.wikimedia.org/T369551) [15:09:45] (03PS1) 10Sergio Gimeno: Remove GELevelingUpKeepGoingNotificationThresholds usages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122158 (https://phabricator.wikimedia.org/T369551) [15:10:10] (03CR) 10Sergio Gimeno: [C:03+2] LevelingUp: schema migration for GELevelingUpKeepGoingNotificationThresholds [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/1122156 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [15:10:19] (03CR) 10Sergio Gimeno: [C:03+2] Remove GELevelingUpKeepGoingNotificationThresholds usages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122158 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [15:11:12] (03PS1) 10Ottomata: eventgate-logging-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122159 (https://phabricator.wikimedia.org/T383814) [15:12:26] urbanecm: please let me know when you are done! I'd like to deploy ^^ [15:13:01] ottomata: i can ping you, but note i have two one-hour windows :( [15:14:53] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet [15:16:38] (03CR) 10Scott French: [C:03+1] "Always happy to see more types!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1122099 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [15:21:01] !log Run `foreachwikiindblist growthexperiments CommunityConfiguration:setVersionData GrowthHomepage 1.0.0` T369551 [15:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:05] T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551 [15:22:01] (03Merged) 10jenkins-bot: LevelingUp: schema migration for GELevelingUpKeepGoingNotificationThresholds [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122156 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [15:22:05] (03CR) 10CI reject: [V:04-1] Remove GELevelingUpKeepGoingNotificationThresholds usages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122158 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [15:23:15] !log sgimeno@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1122156|LevelingUp: schema migration for GELevelingUpKeepGoingNotificationThresholds (T369551)]] [15:23:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:24:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10575404 (10Jhancock.wm) @elukey I pulled and then reinserted a disk. all yours. [15:25:47] (03CR) 10Hnowlan: [C:03+1] admin_ng: increase cpu request/limits for kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122154 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [15:27:35] !log klausman@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ml-lab1001.eqiad.wmnet [15:29:10] urbanecm: okay! I can wait. was hoping to deploy something before my meetings all start in 30 mins, but it is not urgent. It should be fine, but I'd rather wait on this until window is clear since if it is not fine it could affect MW client logging intake. [15:29:47] okay. in this case, let's wait. i expect to need all of the time i reserved in the calendar though [15:32:26] 👍 [15:32:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:33:38] (03PS2) 10Scott French: php8.1: use pcre2 backport [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120588 (https://phabricator.wikimedia.org/T386006) [15:33:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:33:47] (03PS2) 10Fabfur: hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) [15:34:20] (03CR) 10Subramanya Sastry: "This is now on all wikis, and we can roll out this patch this week."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: 10C. Scott Ananian) [15:34:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:35:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:35:11] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [15:36:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:36:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:37:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:38:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:38:24] (03CR) 10Sergio Gimeno: [C:03+2] "Flaky? 
Metric timer test" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122158 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [15:38:32] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1122156|LevelingUp: schema migration for GELevelingUpKeepGoingNotificationThresholds (T369551)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:38:36] T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551 [15:38:55] (03PS3) 10Fabfur: hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) [15:39:39] (03CR) 10Brouberol: [C:03+2] "The backing service is up:" [puppet] - 10https://gerrit.wikimedia.org/r/1122075 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [15:39:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:40:09] !log sgimeno@deploy2002 sgimeno: Continuing with sync [15:42:28] (03PS1) 10Eevans: aqs: Upgrade to 'dev' target version (4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122164 (https://phabricator.wikimedia.org/T386969) [15:42:59] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122164 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [15:44:09] (03CR) 10Elukey: [C:03+2] admin_ng: increase cpu request/limits for kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122154 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [15:46:01] (03PS1) 10Brouberol: airflow-main: add missing config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122166 (https://phabricator.wikimedia.org/T386282) [15:46:03] (03PS2) 10Eevans: aqs: Upgrade to 'dev' target version (4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122164 
(https://phabricator.wikimedia.org/T386969) [15:46:26] (03CR) 10Bking: [C:03+1] airflow-main: add missing config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122166 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [15:46:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [15:46:54] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10575544 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye [15:46:54] (03PS3) 10Scott French: php8.1: use pcre2 backport [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120588 (https://phabricator.wikimedia.org/T386006) [15:46:57] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122164 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [15:47:39] (03CR) 10Brouberol: [C:03+2] airflow-main: add missing config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122166 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [15:50:09] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122156|LevelingUp: schema migration for GELevelingUpKeepGoingNotificationThresholds (T369551)]] (duration: 26m 53s) [15:50:12] T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551 [15:50:48] (03Merged) 10jenkins-bot: Remove GELevelingUpKeepGoingNotificationThresholds usages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122158 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [15:53:18] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet [15:53:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy 
WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:53:59] ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132 (RobH) NEW [15:54:09] ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10575589 (RobH) [15:54:21] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1122158|Remove GELevelingUpKeepGoingNotificationThresholds usages (T369551)]] [15:54:22] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. [15:55:18] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [15:55:37] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'. [15:55:54] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [15:56:22] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [15:57:17] !log Run `sgimeno@mwmaint2002:~$ foreachwikiindblist growthexperiments CommunityConfiguration:migrateConfig GrowthHomepage 2.0.0` T369551 [15:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:21] T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551 [15:57:39] (CR) Effie Mouzeli: [C:+1] "One nit, I would suggest we bump to 8.1.34-1-s1, so to indicate this is an important change." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1120588 (https://phabricator.wikimedia.org/T386006) (owner: Scott French) [16:00:04] urbanecm and Kemayo: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Deployment training deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T1600). [16:00:37] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2088.codfw.wmnet [16:01:24] We're still backporting one change from the CC migration, will take us around 10min :pray [16:01:33] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10575652 (elukey) Thanks a lot @Jhancock.wm! This is what I see: ` [Mon Feb 24 15:23:26 2025] sd 0:2:7:0: SCSI device is removed [Mo... [16:06:55] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [16:09:55] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1122158|Remove GELevelingUpKeepGoingNotificationThresholds usages (T369551)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:10:01] T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551 [16:10:04] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [16:10:21] !log sgimeno@deploy2002 sgimeno: Continuing with sync [16:12:23] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [16:12:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [16:12:44] Deployment kartotherian-main in kartotherian at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=kartotherian&var-deployment=kartotherian-main - ...
[16:12:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [16:13:09] (PS1) Elukey: admin_ng: extra bump for Kartotherian's requested CPUs [deployment-charts] - https://gerrit.wikimedia.org/r/1122169 (https://phabricator.wikimedia.org/T386926) [16:13:43] the alert for kartotherian should be fixed by --^ [16:14:44] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1004.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:14:50] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:15:12] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2003.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:15:17] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2004.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:16:29] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1122170 [16:17:22] sergi0: please ping me once it's safe to take over [16:17:58] urbanecm: sure! [16:20:32] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122158|Remove GELevelingUpKeepGoingNotificationThresholds usages (T369551)]] (duration: 26m 10s) [16:20:36] T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551 [16:20:41] urbanecm: all yours! [16:20:44] thank you!
[16:23:16] (PS1) DLynch: Archives Flow subpages but not archived pages [extensions/Flow] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1122178 (https://phabricator.wikimedia.org/T383775) [16:23:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:23:44] !log Run `sgimeno@mwmaint2002:~$ foreachwikiindblist growthexperiments CommunityConfiguration:migrateConfig GrowthHomepage 2.0.1` T369551 [16:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:22] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/Flow] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1122178 (https://phabricator.wikimedia.org/T383775) (owner: DLynch) [16:25:55] (CR) Papaul: [C:+2] Remove obsolete ospf interface from cr2-esams [homer/public] - https://gerrit.wikimedia.org/r/1121484 (https://phabricator.wikimedia.org/T386766) (owner: Papaul) [16:30:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [16:30:41] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10575775 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be20...
[16:31:40] (CR) Hnowlan: [C:+1] admin_ng: extra bump for Kartotherian's requested CPUs [deployment-charts] - https://gerrit.wikimedia.org/r/1122169 (https://phabricator.wikimedia.org/T386926) (owner: Elukey) [16:32:16] (Merged) jenkins-bot: Archives Flow subpages but not archived pages [extensions/Flow] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1122178 (https://phabricator.wikimedia.org/T383775) (owner: DLynch) [16:32:34] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1122178|Archives Flow subpages but not archived pages (T383775)]] [16:32:38] T383775: Run Flow migration script at *Phase 2a* wikis, second run - https://phabricator.wikimedia.org/T383775 [16:33:14] !log reprepro include pcre2_10.42-1~wmf11+1 into component/pcre2 - T386006 [16:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:17] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [16:37:49] (CR) Elukey: [C:+2] admin_ng: extra bump for Kartotherian's requested CPUs [deployment-charts] - https://gerrit.wikimedia.org/r/1122169 (https://phabricator.wikimedia.org/T386926) (owner: Elukey) [16:38:09] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1122178|Archives Flow subpages but not archived pages (T383775)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:38:12] T383775: Run Flow migration script at *Phase 2a* wikis, second run - https://phabricator.wikimedia.org/T383775 [16:39:17] !log kemayo@deploy2002 kemayo: Continuing with sync [16:40:18] !log sukhe@dns1004 START - running authdns-update ["Add syl to langlist helper"] [16:40:37] (PS1) Ssingh: P:dns::auth: log commit message to SAL for authdns-update [puppet] - https://gerrit.wikimedia.org/r/1122192 [16:41:19] (PS4) Scott French: php8.1: use pcre2 backport [docker-images/production-images] - https://gerrit.wikimedia.org/r/1120588
(https://phabricator.wikimedia.org/T386006) [16:41:47] (CR) Ssingh: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4972/console" [puppet] - https://gerrit.wikimedia.org/r/1122192 (owner: Ssingh) [16:42:10] !log sukhe@dns1004 END - running authdns-update ["Add syl to langlist helper"] [16:43:25] (CR) Scott French: "Thanks for the discussion, Effie! This also has the benefit of being compliant with the version parsing / comparison rules despite our unf" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1120588 (https://phabricator.wikimedia.org/T386006) (owner: Scott French) [16:45:03] (CR) Effie Mouzeli: [C:+1] php8.1: use pcre2 backport [docker-images/production-images] - https://gerrit.wikimedia.org/r/1120588 (https://phabricator.wikimedia.org/T386006) (owner: Scott French) [16:45:52] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122178|Archives Flow subpages but not archived pages (T383775)]] (duration: 13m 18s) [16:45:56] T383775: Run Flow migration script at *Phase 2a* wikis, second run - https://phabricator.wikimedia.org/T383775 [16:47:04] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:47:18] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:51:25] (PS8) Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) [16:54:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3565 MB (3% inode=98%): /tmp 3565 MB (3% inode=98%): /var/tmp 3565 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [16:57:37] (CR) CI reject: [V:-1] sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [16:59:54] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10575872 (Jhancock.wm) a:Jhancock.wm [17:00:05] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T1700). [17:03:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:07:48] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10575911 (Jhancock.wm) @MatthewVernon it's hitting the wrong puppet server, but the server has an os installed and is sshable if you wanna see if the drives are behaving now.... [17:11:58] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10575922 (MatthewVernon) Much less frequent (and only two devices now), but still there :-/: ` Feb 23 01:34:08 ms-be2075 kernel: [109197.342692] sd 0:0:24:0: Power-on or device...
[17:12:44] (PS1) Vgutierrez: add scapy to setup.py dependencies [software/spicerack] - https://gerrit.wikimedia.org/r/1122194 (https://phabricator.wikimedia.org/T373020) [17:13:20] (CR) Vgutierrez: "CI scapy error should be fixed by I2c748daf5805213f8e94b4eabcb175c1cb3eec8e" [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:13:59] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10575925 (Jhancock.wm) those should be the boot disks. so we at least eliminated the errors on the others. gonna try a few things and i'll get back to you. [17:14:18] (CR) Volans: [C:+1] "LGTM, question on the version inline." [software/spicerack] - https://gerrit.wikimedia.org/r/1122194 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:16:35] (CR) Volans: add scapy to setup.py dependencies [software/spicerack] - https://gerrit.wikimedia.org/r/1122194 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:16:39] (CR) Vgutierrez: "hmm so this is the wrong repo... :)" [software/spicerack] - https://gerrit.wikimedia.org/r/1122194 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:16:51] (Abandoned) Vgutierrez: add scapy to setup.py dependencies [software/spicerack] - https://gerrit.wikimedia.org/r/1122194 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:18:44] (PS9) Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) [17:19:24] (CR) Vgutierrez: "scratch that...
I've added scapy to setup.py in this CR 😊" [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:24:36] ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-druid100[56] - https://phabricator.wikimedia.org/T387142 (RobH) NEW [17:24:56] ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-druid100[56] - https://phabricator.wikimedia.org/T387142#10575987 (RobH) [17:25:31] (CR) CI reject: [V:-1] sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:30:39] (PS10) Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) [17:32:38] (CR) Ssingh: "Looks good! Mostly nits but one potential blocker for --services." [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:33:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:36:42] (CR) CI reject: [V:-1] sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:37:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1149:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1149 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:41:45] ops-eqiad, DC-Ops, Traffic: Q3:test NIC for lvs1019 - https://phabricator.wikimedia.org/T387145 (RobH) NEW [17:41:50] ops-eqiad, DC-Ops, Traffic: Q3:test NIC for lvs1019 -
https://phabricator.wikimedia.org/T387145#10576084 (RobH) [17:51:28] (PS11) Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) [17:53:27] (PS60) CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) [17:53:47] ops-eqiad, DC-Ops, Traffic: Q3:test NIC for lvs1019 - https://phabricator.wikimedia.org/T387145#10576115 (Vgutierrez) per T381118#10436453 it should be lvs1017 or lvs1018, not lvs1019 [17:54:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:54:14] (CR) CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins) [17:55:04] (CR) CI reject: [V:-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins) [17:56:40] (PS6) Herron: aux-k8s-ctrl codfw: apply role [puppet] - https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) [17:57:08] SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request a mailing list for Chinese Wikipedia editors with access to edit filter - https://phabricator.wikimedia.org/T387079#10576124 (Dzahn) We can create the list, I see no problem with this but since there is no clear and simple definition of "private" i...
[17:57:27] (CR) CI reject: [V:-1] sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [17:59:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T1800) [18:00:05] ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T1800). nyaa~ [18:01:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:02:01] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:02:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:03:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:03:28] SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request a mailing list for Chinese Wikipedia editors with access to edit filter - https://phabricator.wikimedia.org/T387079#10576157 (Dzahn) Open→Resolved a:Dzahn List has been created. ` lists1004:~] $ sudo mailman-wrapper create --owner... [18:07:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1149:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1149 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:10:09] SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request a mailing list for Chinese Wikipedia editors with access to edit filter - https://phabricator.wikimedia.org/T387079#10576183 (Yiming) @Dzahn Thank you very much! :) I can now access list management through Postorius. But I'm not getting relevan... [18:11:44] SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request a mailing list for Chinese Wikipedia editors with access to edit filter - https://phabricator.wikimedia.org/T387079#10576187 (Dzahn) @Yiming I might actually be wrong about an email being generated by that command. It's possible I just assumed...
[18:14:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1149:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1149 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:14:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3440 MB (3% inode=98%): /tmp 3440 MB (3% inode=98%): /var/tmp 3440 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [18:17:33] (PS1) Cathal Mooney: Rename YAML var "evpn_bgp" to "switch_ibgp" [homer/public] - https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) [18:18:06] (CR) CI reject: [V:-1] Rename YAML var "evpn_bgp" to "switch_ibgp" [homer/public] - https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney) [18:19:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:19:59] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:20:32] (PS2) Cathal Mooney: Rename YAML var "evpn_bgp" to "switch_ibgp" [homer/public] - https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) [18:24:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1149:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1149 -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:27:23] SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request a mailing list for Chinese Wikipedia editors with access to edit filter - https://phabricator.wikimedia.org/T387079#10576315 (Yiming) @Dzahn Ok, I got you. Everything is working fine for me, thanks again! [18:30:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1149:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1149 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:35:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1149:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1149 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:38:25] (PS11) BCornwall: varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 [puppet] - https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) [18:39:30] (CR) BCornwall: varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [18:43:20] (CR) Ssingh: [C:+1] varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 [puppet] - https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [18:51:59] (CR) BCornwall: [C:+2] varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 [puppet] - https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [18:58:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate grafana-labs.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO -
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:06:41] (PS61) CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) [19:10:42] (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1115984 (owner: Ncmonitor) [19:11:38] (CR) BCornwall: [C:+1] wikimedia-ech: add ncredir-parking [dns] - https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh) [19:15:00] (CR) Eevans: [C:+2] aqs: Upgrade to 'dev' target version (4.1.8) [puppet] - https://gerrit.wikimedia.org/r/1122164 (https://phabricator.wikimedia.org/T386969) (owner: Eevans) [19:15:09] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T387157 (Dillon) NEW [19:19:17] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for DHardy-WMF - https://phabricator.wikimedia.org/T387157#10576576 (KSarabia-WMF) [19:20:11] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs[2002-2012].codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [19:20:20] FIRING: [2x] ProbeDown: Service vrts2002:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10576610 (phaultfinder) [19:31:12] !log reprepro include php8.1_8.1.31-1+wmf11u2 into component/php81 - T386006 [19:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:17] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer -
https://phabricator.wikimedia.org/T386006 [19:31:54] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for DHardy-WMF - https://phabricator.wikimedia.org/T387157#10576629 (KSarabia-WMF) [19:35:20] RESOLVED: [2x] ProbeDown: Service vrts2002:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:41:06] vrts2002 was maintenance / upgrade. will leave a note to add downtimes to docs. [19:41:57] !log reprepro include php-apcu_5.1.23-1+wmf11u2 into component/php81 - T386006 [19:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:01] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [19:45:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:45:31] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:49:34] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:20] RESOLVED:
CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:50:31] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:52:20] (PS1) Andrew Bogott: vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122228 [19:53:04] (PS2) Andrew Bogott: vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122228 [19:53:13] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122228 (owner: Andrew Bogott) [19:53:30] (CR) CI reject: [V:-1] vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122228 (owner: Andrew Bogott) [19:55:41] (PS3) Andrew Bogott: vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122228 [19:56:06] (CR) CI reject: [V:-1] vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122228 (owner: Andrew Bogott) [19:58:02] (PS1) Andrew Bogott: vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122230 [19:58:24] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122228 (owner: Andrew Bogott) [19:59:44] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1122228 (owner: Andrew Bogott) [20:01:59] (PS1)
Clare Ming: Make CTR instrument pass in experiment name [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1122232 (https://phabricator.wikimedia.org/T384911) [20:02:22] (PS12) Ssingh: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez) [20:02:22] (Abandoned) Andrew Bogott: vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122228 (owner: Andrew Bogott) [20:02:24] (PS2) Andrew Bogott: vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122230 [20:02:30] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122230 (owner: Andrew Bogott) [20:02:33] !log xcollazo@deploy2002 Started deploy [analytics/refinery@9975731]: Regular analytics weekly train [analytics/refinery@99757316] [20:05:14] (PS3) Andrew Bogott: vendordata: get host IP from metadata [puppet] - https://gerrit.wikimedia.org/r/1122230 [20:05:19] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122230 (owner: Andrew Bogott) [20:08:29] !log xcollazo@deploy2002 Finished deploy [analytics/refinery@9975731]: Regular analytics weekly train [analytics/refinery@99757316] (duration: 05m 55s) [20:08:39] !log xcollazo@deploy2002 Started deploy [analytics/refinery@9975731] (thin): Regular analytics weekly train THIN [analytics/refinery@99757316] [20:09:28] !log xcollazo@deploy2002 Finished deploy [analytics/refinery@9975731] (thin): Regular analytics weekly train THIN [analytics/refinery@99757316] (duration: 00m 48s) [20:09:42] !log xcollazo@deploy2002 Started deploy [analytics/refinery@9975731] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@99757316] [20:10:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 24 UTC late backport
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122232 (https://phabricator.wikimedia.org/T384911) (owner: 10Clare Ming) [20:10:19] !log xcollazo@deploy2002 Finished deploy [analytics/refinery@9975731] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@99757316] (duration: 00m 36s) [20:10:33] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [20:11:59] (03CR) 10Santiago Faci: [C:03+1] "Looks good!" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122232 (https://phabricator.wikimedia.org/T384911) (owner: 10Clare Ming) [20:18:23] (03PS13) 10Ssingh: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [20:24:57] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [20:35:06] (03PS14) 10Ssingh: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [20:41:33] (03PS1) 10Eevans: restbase2024: Upgrade to 'dev' (aka 4.1.8) as canary [puppet] - 10https://gerrit.wikimedia.org/r/1122241 (https://phabricator.wikimedia.org/T386969) [20:41:34] (03PS1) 10Eevans: restbase: upgrade cluster to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122242 (https://phabricator.wikimedia.org/T386969) [20:41:36] (03PS1) 10Eevans: ml-cache: upgrade cluster to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122243 (https://phabricator.wikimedia.org/T386969) 
[20:41:57] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs[2002-2012].codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [20:42:25] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [20:43:29] (03PS4) 10Andrew Bogott: vendordata: get host IP from metadata [puppet] - 10https://gerrit.wikimedia.org/r/1122230 [20:44:58] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on vrts2002.codfw.wmnet with reason: znuny upgrade [20:45:47] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122241 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [20:52:37] (03CR) 10Andrew Bogott: [C:03+2] vendordata: get host IP from metadata [puppet] - 10https://gerrit.wikimedia.org/r/1122230 (owner: 10Andrew Bogott) [20:54:59] (03CR) 10Ssingh: sre.loadbalancer: Add migrate-service-ipip cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [20:56:26] (03PS1) 10Jdlrobson: Lazy Load Images Part 2 [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122245 (https://phabricator.wikimedia.org/T366402) [20:58:49] (03CR) 10CI reject: [V:04-1] Lazy Load Images Part 2 [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122245 (https://phabricator.wikimedia.org/T366402) (owner: 10Jdlrobson) [20:58:56] anyone available for a quick review of the gerrit 1120152? ^^ [20:59:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 10.71% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. 
It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T2100). [21:00:05] Jdlrobson, ZhaoFJx, gmodena, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] o/ [21:00:12] i can deploy [21:00:17] thanks [21:00:18] o/ [21:00:25] Jdlrobson: are you around? [21:00:52] cjming: yeh we need a few more mins though to get the security patch ready [21:00:56] so feel free to jump to next one [21:01:00] cjming I've got a patch that I can deploy myself when you are done with the others in queue [21:01:13] Jdlrobson: sounds good [21:01:30] gmodena: sounds good [21:02:01] ZhaoFJx: i'll start with yours then [21:02:09] great [21:02:16] (03PS2) 10ZhaoFJx: kywiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121688 (https://phabricator.wikimedia.org/T386617) [21:02:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:02:31] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:02:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121688
(https://phabricator.wikimedia.org/T386617) (owner: 10ZhaoFJx) [21:03:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [21:03:17] (03Merged) 10jenkins-bot: kywiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121688 (https://phabricator.wikimedia.org/T386617) (owner: 10ZhaoFJx) [21:03:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [21:03:33] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1121688|kywiki: Add namespace aliases (T386617)]] [21:03:37] T386617: Add namespace aliases for the Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T386617 [21:03:38] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10576981 (10EPIC) NDA is signed on my part since some time ago, and SSH key should be updated as well. Just waiting for the rest. [21:06:31] ZhaoFJx: 1st patch on test servers if you'd like to verify - lmk if/when to sync [21:06:38] checking [21:07:00] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10576988 (10Dzahn) @KFrancis Did you receive the real name meanwhile? 
[21:07:16] !log cjming@deploy2002 cjming, zhaofjx: Backport for [[gerrit:1121688|kywiki: Add namespace aliases (T386617)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:59] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [21:08:02] cjming And all good [21:08:03] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [21:08:06] !log cjming@deploy2002 cjming, zhaofjx: Continuing with sync [21:08:31] (03PS2) 10ZhaoFJx: zhwiki: Change abusefilter-editor group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121687 (https://phabricator.wikimedia.org/T386879) [21:10:38] (03PS2) 10Jdlrobson: Lazy Load Images Part 2 [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122245 (https://phabricator.wikimedia.org/T366402) [21:12:34] cjming: we should be ready now when you are done with ZhaoFJx 's change. 
[21:13:13] well if u have extra time please give a quick look at 1120152 (prolly in merge conflict with 1121687 now) [21:13:30] Jdlrobson: sounds good - i'll do yours next then [21:14:01] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on vrts1003.eqiad.wmnet with reason: znuny upgrade [21:14:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:14:39] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121688|kywiki: Add namespace aliases (T386617)]] (duration: 11m 06s) [21:14:43] T386617: Add namespace aliases for the Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T386617 [21:16:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121687 (https://phabricator.wikimedia.org/T386879) (owner: 10ZhaoFJx) [21:17:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:17:24] Jdlrobson: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/1122245 says it's a part 2 - is this the one you want deployed now? 
[21:17:31] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:17:38] cjming: yes please [21:17:41] we did part 1 last week :) [21:17:41] (03Merged) 10jenkins-bot: zhwiki: Change abusefilter-editor group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121687 (https://phabricator.wikimedia.org/T386879) (owner: 10ZhaoFJx) [21:17:59] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1121687|zhwiki: Change abusefilter-editor group name (T386879)]] [21:18:03] T386879: Create abusefilter editor group on zhwiki - https://phabricator.wikimedia.org/T386879 [21:18:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [21:19:37] ok - i snuck in one more of ZhaoFJx's patches while waiting for CI on Jon's backport [21:20:52] ZhaoFJx: 1st patch should be live, 2nd patch on mwdebug - please check + lmk [21:21:29] cjming all good, name has changed [21:21:34] !log cjming@deploy2002 cjming, zhaofjx: Backport for [[gerrit:1121687|zhwiki: Change abusefilter-editor group name (T386879)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:21:44] !log cjming@deploy2002 cjming, zhaofjx: Continuing with sync [21:22:03] (03CR) 10Jdlrobson: [C:03+1] Lazy Load Images Part 2 [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122245 
(https://phabricator.wikimedia.org/T366402) (owner: 10Jdlrobson) [21:22:09] ZhaoFJx: i'm going to switch to Jon's patch now - i'll finish your 3rd after [21:22:16] CI is now passing cjming :) [21:22:27] Of course! [21:22:33] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.015e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [21:28:24] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121687|zhwiki: Change abusefilter-editor group name (T386879)]] (duration: 10m 24s) [21:28:28] T386879: Create abusefilter editor group on zhwiki - https://phabricator.wikimedia.org/T386879 [21:28:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122245 (https://phabricator.wikimedia.org/T366402) (owner: 10Jdlrobson) [21:28:51] ZhaoFJx: 2nd patch should be live :) [21:29:18] cjming checked without k8s [21:29:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10577062 (10phaultfinder) [21:31:57] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10577074 (10KFrancis) Hi @Dzahn I confirmed the NDA on Friday... was that on another ticket? Anyway, ERIC is all set on my end. Thank you! 
[21:33:08] !log reprepro include php8.1_8.1.31-1+wmf11u3 php-apcu_5.1.23-1+wmf11u3 into component/php81 - T386006 [21:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:12] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [21:33:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:34:06] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10577094 (10Pppery) (You posted "The NDA is complete" twice on T386581. Probably one of them was meant to go here) [21:37:47] (03PS1) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122252 [21:38:23] (03Abandoned) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122252 (owner: 10LD) [21:38:51] (03Merged) 10jenkins-bot: Lazy Load Images Part 2 [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122245 (https://phabricator.wikimedia.org/T366402) (owner: 10Jdlrobson) [21:39:07] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1122245|Lazy Load Images Part 2 (T366402)]] [21:40:16] damn I wanna work on 1120152 not 1122252 but I'm not a good merger :') [21:42:11] Jdlrobson: on test servers [21:42:35] cjming: checking [21:42:44] !log cjming@deploy2002 cjming, jdlrobson: Backport for [[gerrit:1122245|Lazy Load Images Part 2 (T366402)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:44:26] I'll leave the job to an expert 8) [21:44:56] cjming: LGTM, free to sync!
[21:45:00] !log cjming@deploy2002 cjming, jdlrobson: Continuing with sync [21:45:21] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:45:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:45:37] (03PS2) 10ZhaoFJx: cowikimedia: Change the logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122095 (https://phabricator.wikimedia.org/T386872) [21:46:11] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:46:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:48:10] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10577152 (10Dzahn) a:05EPIC→03None [21:50:24] (03PS1) 10Jdlrobson: Update ext.MobileFrontend.searchOverlay.empty hook to fire after ext.MobileFrontend.searchOverlay.open [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122254 (https://phabricator.wikimedia.org/T386735) [21:51:39] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122245|Lazy Load Images Part 2 (T366402)]] (duration: 12m 32s) [21:51:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122095 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [21:51:58] Jdlrobson: should be live :) [21:52:33] (03Merged) 10jenkins-bot: cowikimedia: Change the logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122095 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [21:52:50] !log cjming@deploy2002 
Started scap sync-world: Backport for [[gerrit:1122095|cowikimedia: Change the logo (T386872)]] [21:52:54] T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872 [21:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:54:43] ZhaoFJx: your 3rd patch on test servers - lmk [21:54:53] checking [21:55:26] !log cjming@deploy2002 cjming, zhaofjx: Backport for [[gerrit:1122095|cowikimedia: Change the logo (T386872)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:56:29] cjming I just realized, silly me, I forgot to run the tox update command... [21:56:52] Can I submit a new patchset now? [21:56:55] thanks cjming for the help today! [21:56:59] np! [21:57:17] ZhaoFJx: should i not sync for now and revert? [21:57:36] cjming yes revert please [21:57:40] ok [21:57:45] the image wouldn't load correctly [21:57:50] !log cjming@deploy2002 Sync cancelled.
[21:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:58:18] (03PS1) 10TrainBranchBot: Revert "cowikimedia: Change the logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122256 [21:58:19] (03CR) 10TrainBranchBot: "cjming@deploy2002 created a revert of this change as I57a68f7b3649f28b059ef9035ec1314751f68ac4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122095 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [21:58:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122256 (owner: 10TrainBranchBot) [21:59:02] gmodena: happy to do your patch - sorry it got so late - or do you want to self-deploy? [21:59:24] (03Merged) 10jenkins-bot: Revert "cowikimedia: Change the logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122256 (owner: 10TrainBranchBot) [21:59:38] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1122256|Revert "cowikimedia: Change the logo"]] [21:59:55] cjming be my guest :). I'll need to be around for some post-deployments check, so I was counting on a late eve. No worries! [22:00:02] jouncebot: nowand next [22:00:05] Reedy, sbassett, Maryum, and manfredi: Your horoscope predicts another Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T2200). [22:00:22] cjming thanks for deployment, have a good day! [22:00:26] Hey all - we have two security patches we’d like to deploy today. One for core, one for ext:OAuth. 
[22:00:46] sbassett: can i squeeze one more config patch in? [22:02:02] gmodena: we may have to wait until after the security patches [22:02:21] cjming ack. I can stay around. [22:03:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.41s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:03:16] sbassett: i have another 3 patches i was hoping to get out today -- i'll close the window for now -- would it be ok to p/u after you all are finished? the next window after that is web team's and i think they might be done for the day [22:03:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:03:28] !log cjming@deploy2002 cjming, trainbranchbot: Backport for [[gerrit:1122256|Revert "cowikimedia: Change the logo"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:03:40] !log cjming@deploy2002 cjming, trainbranchbot: Continuing with sync [22:05:17] cjming: sure, these should go fairly quick... [22:05:55] cjming: Can you release your scap lock? 
[22:05:56] sbassett: great - just lmk when you're done - last scap backport cmd should be done here shortly (like within a minute or 2) [22:06:10] ok [22:08:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:08:46] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.199s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:10:07] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10577244 (10VRiley-WMF) [22:10:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [22:10:19] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122256|Revert "cowikimedia: Change the logo"]] (duration: 10m 41s) [22:10:21] sbassett: all yours [22:10:26] sorry about that [22:12:02] no prob, thanks [22:13:46] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.203s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:16:25] !log Deployed security patch for T385958 [22:16:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:42] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10577268 (10VRiley-WMF) Unracked and removed the following servers. However, the script has failed and returning an for both servers. @Andrew would you be... [22:21:01] !log Deployed security patch for T336113 [22:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:47] Ok, that should be it for security patches. Things seem stable in logstash. [22:23:45] sbassett asck [22:23:50] *ack [22:24:15] cjming would you still have time for backports? [22:24:39] yes! [22:24:43] we're gtg? [22:24:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10577275 (10phaultfinder) [22:24:47] thank you! [22:24:57] (03PS2) 10Gmodena: Revert "cirrus: enable mlr-2025 for select wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120534 [22:25:27] cjming all good in my end. 
Only thing is I won't be able to test properly till we sync (because a/b test randomization) [22:25:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120534 (owner: 10Gmodena) [22:26:24] gmodena: np - i'll just go ahead when the time comes [22:26:42] (03Merged) 10jenkins-bot: Revert "cirrus: enable mlr-2025 for select wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120534 (owner: 10Gmodena) [22:27:01] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1120534|Revert "cirrus: enable mlr-2025 for select wikis"]] [22:29:34] !log cjming@deploy2002 gmodena, cjming: Backport for [[gerrit:1120534|Revert "cirrus: enable mlr-2025 for select wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:29:38] !log cjming@deploy2002 gmodena, cjming: Continuing with sync [22:36:19] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120534|Revert "cirrus: enable mlr-2025 for select wikis"]] (duration: 09m 18s) [22:36:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122232 (https://phabricator.wikimedia.org/T384911) (owner: 10Clare Ming) [22:36:39] gmodena: should be live :) [22:36:50] cjming checking [22:40:09] cjming metrics and logs look good [22:40:15] yay! [22:40:27] thanks for the deployment! [22:40:27] (03CR) 10Cwhite: [C:03+2] logstash: update eqiad jobs host to logging-sd1001 [puppet] - 10https://gerrit.wikimedia.org/r/1109189 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [22:40:36] yw! 
thanks for your patience [22:40:42] i'll be around to babysit cirrus for a while [22:40:45] no worries :) [22:41:15] jouncebot: nowandnext [22:41:15] For the next 1 hour(s) and 18 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250224T2200) [22:41:15] In 1 hour(s) and 18 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T0000) [22:42:17] (03PS1) 10RLazarus: deployment_server: Add mw-script-restricted config to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1122259 (https://phabricator.wikimedia.org/T378429) [22:42:30] just fyi, I am deploying 2 more patches for DPE - should be wrapped up in the next 15ish minutes [22:42:35] (03Merged) 10jenkins-bot: Make CTR instrument pass in experiment name [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122232 (https://phabricator.wikimedia.org/T384911) (owner: 10Clare Ming) [22:42:51] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1122232|Make CTR instrument pass in experiment name (T384911)]] [22:42:55] T384911: Update CTR instrument to set the instrument name - https://phabricator.wikimedia.org/T384911 [22:43:33] (03PS2) 10Clare Ming: Start test experiment for all enrolled users.
[extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122236 (https://phabricator.wikimedia.org/T373715) [22:45:26] !log cjming@deploy2002 cjming: Backport for [[gerrit:1122232|Make CTR instrument pass in experiment name (T384911)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:45:31] !log cjming@deploy2002 cjming: Continuing with sync [22:46:49] (03CR) 10Eevans: [C:03+2] restbase2024: Upgrade to 'dev' (aka 4.1.8) as canary [puppet] - 10https://gerrit.wikimedia.org/r/1122241 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [22:48:38] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2024.codfw.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002 [22:52:09] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122232|Make CTR instrument pass in experiment name (T384911)]] (duration: 09m 17s) [22:52:12] T384911: Update CTR instrument to set the instrument name - https://phabricator.wikimedia.org/T384911 [22:52:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122236 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [22:57:27] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2024.codfw.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002 [22:58:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate grafana-labs.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:59:11] (03Merged) 10jenkins-bot: Start test experiment for all enrolled users. 
[extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122236 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [22:59:30] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1122236|Start test experiment for all enrolled users. (T373715)]] [22:59:34] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [23:02:14] !log cjming@deploy2002 cjming: Backport for [[gerrit:1122236|Start test experiment for all enrolled users. (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:02:24] !log cjming@deploy2002 cjming: Continuing with sync [23:09:03] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122236|Start test experiment for all enrolled users. (T373715)]] (duration: 09m 32s) [23:09:07] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [23:10:36] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2024.codfw.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002 [23:11:42] (03CR) 10Cwhite: [C:03+2] puppetmaster: remove use of deprecated method in logstash.rb [puppet] - 10https://gerrit.wikimedia.org/r/1115124 (https://phabricator.wikimedia.org/T385058) (owner: 10Cwhite) [23:11:53] !log end of UTC late backport window [23:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.188s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:17:16] RESOLVED: [2x] 
MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.165s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:19:17] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2024.codfw.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002 [23:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10577457 (10phaultfinder) [23:38:50] !log mass renaming conflicting usernames in wikitech to have "~labswiki" suffix (T161859) [23:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:54] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [23:52:22] FIRING: [5x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:52:33] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.005e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad