[00:00:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [00:00:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:05:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [00:05:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:15:23] (03PS1) 10Dzahn: switch security.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892579 (https://phabricator.wikimedia.org/T330090) [00:16:57] (03CR) 10Dzahn: [C: 03+2] switch security.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892579 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:18:42] 10SRE, 10MediaWiki-File-management, 10Traffic, 10MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), and 2 others: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10tstarling) > TimStarling: curious if you have thoughts on https://www.mediawiki.org/wiki/Manual:Security#Upload_securit... 
[00:24:11] (03PS1) 10Dzahn: switch sitemaps.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892580 (https://phabricator.wikimedia.org/T330090) [00:24:31] here is something else I would instead wonder if it should be kept [00:24:39] sitemaps.wikimedia.org that dont get updated [00:25:47] (03CR) 10Dzahn: [C: 03+2] switch sitemaps.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892580 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:30:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:12] (03PS1) 10Dzahn: switch tendril.wikimedia.org and dbtree.wikimedia.org to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892582 (https://phabricator.wikimedia.org/T330090) [00:33:41] (03CR) 10Dzahn: [C: 03+2] switch tendril.wikimedia.org and dbtree.wikimedia.org to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892582 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:34:22] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: run-dashboards-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:40:41] (03PS1) 10Dzahn: switch os-reports.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892583 (https://phabricator.wikimedia.org/T330090) [00:41:17] (03CR) 10Dzahn: [C: 03+2] switch os-reports.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892583 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:45:41] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. 
- https://phabricator.wikimedia.org/T330670 (10ssingh) [00:46:52] (03PS1) 10Dzahn: switch research.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892584 (https://phabricator.wikimedia.org/T330090) [00:48:11] (03CR) 10Dzahn: [C: 03+2] switch research.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892584 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:49:02] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) >>! In T330670#8649554, @Volans wrote: > @ssing > > 1) for the cookbooks all that I see is that they use the `A:dns-auth` cumin alias, so they will follow along. No... [00:52:46] (03CR) 10Ssingh: config: Add brett for router configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [00:56:48] (03PS1) 10Dzahn: define role owner for gerrit role [puppet] - 10https://gerrit.wikimedia.org/r/892587 [01:02:12] PROBLEM - Check systemd state on aphlict2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10phaultfinder) [01:09:28] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab2002.wikimedia.org.service,rsync-data-backup-gitlab2002.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:32] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:38] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The 
following units failed: adds-changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:36] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:06] PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [02:05:22] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3321 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [02:06:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:06] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 139137 bytes in 2.932 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [02:21:36] RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [02:21:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T0300) [03:07:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.25 [core] 
(wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/892612 (https://phabricator.wikimedia.org/T325588) [03:07:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.25 [core] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/892612 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [03:14:56] PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:46] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.25 [core] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/892612 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T0400) [04:01:14] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892602 (https://phabricator.wikimedia.org/T325588) [04:01:16] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892602 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [04:01:59] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892602 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [04:02:27] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.25 refs T325588 [04:02:31] T325588: 1.40.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T325588 [04:11:26] PROBLEM - Disk space on deploy1002 is CRITICAL: DISK CRITICAL - free space: /srv 14441 MB (5% inode=72%): /srv/docker/overlay2/aec0c6d10844cd16a7adb5f0a9d8ba1bd61e31c9f401c3f758b58611f4afdd87/merged 14441 MB (5% 
inode=72%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops [04:20:01] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330701 (10Papaul) 05Open→03Resolved a:03Papaul This was fixed today as well new ms-be nodes [04:35:32] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 3 (gitlab2002, ...), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:55:29] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.25 refs T325588 (duration: 53m 02s) [04:55:35] T325588: 1.40.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T325588 [04:57:53] !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.23 (duration: 02m 18s) [05:36:20] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:27:53] (03PS3) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653) [06:31:57] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653) (owner: 10Marostegui) [06:32:40] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653) (owner: 10Marostegui) [06:33:55] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:892478|ProductionServices.php: Promote pc1014 to pc1 master (T330653)]] [06:34:00] T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653 [06:35:25] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/892417 [06:35:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:36] (03CR) 10Marostegui: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892417 (owner: 10Marostegui) [06:35:46] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:892478|ProductionServices.php: Promote pc1014 to pc1 master (T330653)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [06:41:49] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:892478|ProductionServices.php: Promote pc1014 to pc1 master (T330653)]] (duration: 07m 54s) [06:41:54] T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653 [06:42:12] (03CR) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892417 (owner: 10Marostegui) [06:42:28] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892417 (owner: 10Marostegui) [06:43:11] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892417 (owner: 10Marostegui) [06:43:39] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:892417|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] [06:43:52] PROBLEM - Check systemd state on mw2314 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:26] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:892417|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] synced to the 
testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [06:46:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:51:25] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:892417|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] (duration: 07m 46s) [06:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:54:02] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Marostegui) Looks like it was slot 3 ` === RaidStatus (does not include components in optimal state) name: Adapter #0 Virtual Drive: 0 (Target Id: 0) RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0... 
[06:54:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1267 [06:54:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1267 [06:55:31] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23951 [06:56:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 23951 [06:56:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 18187 [06:56:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:57:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 18187 [06:57:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4621 [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T0700) [07:00:04] kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T0700) [07:00:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4621 [07:01:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 56099 [07:01:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56099 [07:01:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1820 [07:02:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1820 [07:04:02] !log Stop mysql on db2094 T326596 [07:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:06] T326596: Productionize db218[567] - https://phabricator.wikimedia.org/T326596 [07:07:11] (03PS1) 10Marostegui: mariadb: Productionize db2186 [puppet] - 10https://gerrit.wikimedia.org/r/892827 (https://phabricator.wikimedia.org/T326596) [07:07:32] (03CR) 10CI reject: [V: 04-1] mariadb: Productionize db2186 [puppet] - 10https://gerrit.wikimedia.org/r/892827 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [07:08:38] (03PS2) 10Marostegui: mariadb: Productionize db2186 [puppet] - 10https://gerrit.wikimedia.org/r/892827 (https://phabricator.wikimedia.org/T326596) [07:08:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:29:23] (03CR) 10Ayounsi: "Is this compatible with upstream? If so shouldn't it be sent on the Github instead?" 
[software/librenms] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: 10Filippo Giunchedi) [07:29:33] (03PS1) 10Elukey: admin_ng: refactor DSE 1.23 config and disable istio sidecars in ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/892870 (https://phabricator.wikimedia.org/T330261) [07:35:22] RECOVERY - Check systemd state on centrallog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:29] (03CR) 10Elukey: [C: 03+2] admin_ng: refactor DSE 1.23 config and disable istio sidecars in ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/892870 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [07:39:46] (03PS1) 10Jcrespo: Update sql to add newly history table files_history [software/mediabackups] - 10https://gerrit.wikimedia.org/r/892891 (https://phabricator.wikimedia.org/T327157) [07:39:51] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [07:39:54] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [07:40:47] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [07:40:48] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [07:41:16] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:52] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[07:45:33] (03PS2) 10Jcrespo: Update sql to add newly history table file_history [software/mediabackups] - 10https://gerrit.wikimedia.org/r/892891 (https://phabricator.wikimedia.org/T327157) [07:45:47] (03PS3) 10Jcrespo: Update sql to add newly history table file_history [software/mediabackups] - 10https://gerrit.wikimedia.org/r/892891 (https://phabricator.wikimedia.org/T327157) [07:51:54] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [07:52:46] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [07:52:48] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [07:52:53] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [07:53:17] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [07:56:27] (03PS2) 10Nicolas Fraison: Failover hive to standby server [dns] - 10https://gerrit.wikimedia.org/r/892460 (https://phabricator.wikimedia.org/T303168) [07:59:33] (03PS1) 10Jelto: gitlab: enable restore for replicas, disable on active_host [puppet] - 10https://gerrit.wikimedia.org/r/892892 (https://phabricator.wikimedia.org/T329931) [08:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T0800). Please do the needful. [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:03:52] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39857/console" [puppet] - 10https://gerrit.wikimedia.org/r/892892 (https://phabricator.wikimedia.org/T329931) (owner: 10Jelto) [08:06:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/892587 (owner: 10Dzahn) [08:11:05] !log installing openssl security updates on buster [08:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:54] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2186 [puppet] - 10https://gerrit.wikimedia.org/r/892827 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [08:19:51] (03CR) 10Btullis: [C: 03+1] "Thanks" [dns] - 10https://gerrit.wikimedia.org/r/892460 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [08:20:33] (03CR) 10Btullis: [C: 03+1] "Looks good. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/890810 (owner: 10Nicolas Fraison) [08:22:47] (03CR) 10Nicolas Fraison: [C: 03+2] Failover hive to standby server [dns] - 10https://gerrit.wikimedia.org/r/892460 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [08:24:57] (03PS1) 10Nicolas Fraison: Failover to primary server [dns] - 10https://gerrit.wikimedia.org/r/892893 (https://phabricator.wikimedia.org/T303168) [08:25:47] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Joe) >>! In T330651#8648604, @Joe wrote: > We should probably test that both scap works and a scap3 deployment also works (e.g. `docker-pkg`) when w... 
[08:36:42] (03PS1) 10Hashar: systemd::timer::job: fix email body indentation [puppet] - 10https://gerrit.wikimedia.org/r/892894 (https://phabricator.wikimedia.org/T330120) [08:38:14] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:37] (03CR) 10Hashar: "Follows up https://gerrit.wikimedia.org/r/c/operations/puppet/+/890800 ;) The mail we received this morning looked like:" [puppet] - 10https://gerrit.wikimedia.org/r/892894 (https://phabricator.wikimedia.org/T330120) (owner: 10Hashar) [08:40:32] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable haproxy systemd hardening in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/892484 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [08:43:51] !log enable system hardening for haproxy in ulsfo - T323944 [08:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:56] T323944: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 [08:52:25] !log restarting r/w slapd to pick up openssl security updates [08:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:05] jnuche and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T0900). 
[09:04:14] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [09:05:18] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892896 (https://phabricator.wikimedia.org/T325588) [09:05:20] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892896 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [09:06:00] (03CR) 10Clément Goubert: Switch deployment server to deploy2002.codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [09:06:04] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892896 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [09:06:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [09:09:20] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [09:09:27] o/ [09:10:47] (03CR) 10David Caro: [C: 03+2] puppet: update firewall rules for cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/892446 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [09:11:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [09:13:27] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10JMeybohm) >>! In T330095#8650233, @KFrancis wrote: > @JMeybohm Please provide Norman Schwirz's email address and I'll put the agreement together. Please send it to kfrancis@wikimedi... 
[09:13:40] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.25 refs T325588 [09:13:44] T325588: 1.40.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T325588 [09:13:51] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all [09:21:18] 10SRE, 10Wikimedia-Site-requests, 10Serbian-Sites, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444 (10Ioacc1234red) Completely resolved. [09:25:39] (03PS1) 10Marostegui: production-m5.sql.erb: New IP [puppet] - 10https://gerrit.wikimedia.org/r/892898 (https://phabricator.wikimedia.org/T330697) [09:26:07] (03CR) 10Jelto: [V: 03+1] "Before Ia1f4791fef13aebfdf8b78241be04d68b612fb78 the restore was configured depending on replica/active status of the instance. As this ma" [puppet] - 10https://gerrit.wikimedia.org/r/892892 (https://phabricator.wikimedia.org/T329931) (owner: 10Jelto) [09:26:17] (03CR) 10Muehlenhoff: Add a cookbook to restart/reboot ncredir nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/860564 (owner: 10Muehlenhoff) [09:26:19] (03PS2) 10Muehlenhoff: Add a cookbook to restart/reboot ncredir nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/860564 [09:26:42] (03PS2) 10Marostegui: production-m5.sql.erb: New IP [puppet] - 10https://gerrit.wikimedia.org/r/892898 (https://phabricator.wikimedia.org/T330697) [09:29:41] PROBLEM - Check systemd state on dumpsdata1004 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_xmldumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:43] (03CR) 10Marostegui: [C: 03+2] production-m5.sql.erb: New IP [puppet] - 10https://gerrit.wikimedia.org/r/892898 (https://phabricator.wikimedia.org/T330697) (owner: 10Marostegui) [09:32:01] (03PS1) 10Jbond: 2.5.5: Prepare new release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/892899 [09:32:33] (03PS1) 
10Marostegui: Revert "production-m5.sql.erb: New IP" [puppet] - 10https://gerrit.wikimedia.org/r/892419 [09:33:02] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) >>! In T330651#8651629, @Joe wrote: >>>! In T330651#8648604, @Joe wrote: >> We should probably test that both scap works and a scap... [09:33:23] (03CR) 10Jbond: [C: 03+2] 2.5.5: Prepare new release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/892899 (owner: 10Jbond) [09:33:50] (03CR) 10Clément Goubert: [V: 03+1 C: 04-2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39858/console" [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [09:35:08] (03CR) 10Marostegui: [C: 03+2] Revert "production-m5.sql.erb: New IP" [puppet] - 10https://gerrit.wikimedia.org/r/892419 (owner: 10Marostegui) [09:35:53] (03Merged) 10jenkins-bot: 2.5.5: Prepare new release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/892899 (owner: 10Jbond) [09:36:32] (03PS1) 10Majavah: Remove osmdb records [dns] - 10https://gerrit.wikimedia.org/r/892901 (https://phabricator.wikimedia.org/T323159) [09:36:57] (03CR) 10Nicolas Fraison: [C: 03+2] Failover to primary server [dns] - 10https://gerrit.wikimedia.org/r/892893 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [09:38:49] (03CR) 10Nicolas Fraison: [C: 03+2] presto.coordinator: reduce max heap size of coordinator [puppet] - 10https://gerrit.wikimedia.org/r/890810 (owner: 10Nicolas Fraison) [09:39:35] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/892902 [09:40:15] (03PS2) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/892902 (https://phabricator.wikimedia.org/T330484) [09:41:04] (03CR) 10Jbond: [C: 03+2] puppet_compiler: 
bump version [puppet] - 10https://gerrit.wikimedia.org/r/892902 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond) [09:45:28] (03PS1) 10Majavah: openstack: remove osmdb dns records [puppet] - 10https://gerrit.wikimedia.org/r/892903 (https://phabricator.wikimedia.org/T323159) [09:45:30] (03PS1) 10Majavah: P:wmcs: remove osmdb classes [puppet] - 10https://gerrit.wikimedia.org/r/892904 (https://phabricator.wikimedia.org/T323159) [09:45:44] (03PS2) 10Majavah: Remove osmdb records [dns] - 10https://gerrit.wikimedia.org/r/892901 (https://phabricator.wikimedia.org/T323159) [09:46:38] !log zabe@mwmaint1002:~$ mwscript createAndPromote.php --wiki azwikimedia --bureaucrat Zabe REDACTED [09:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:05] (03PS1) 10Majavah: osm: remove unuseud shapefile_import class [puppet] - 10https://gerrit.wikimedia.org/r/892905 [09:49:54] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on 16 hosts with reason: etcd cluster upgrade failed, waiting for k8s upgrade [09:49:58] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39859/console" [puppet] - 10https://gerrit.wikimedia.org/r/892902 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond) [09:50:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on 16 hosts with reason: etcd cluster upgrade failed, waiting for k8s upgrade [09:50:35] (03PS3) 10Jbond: standard_packages: also manage the rasdaemon service [puppet] - 10https://gerrit.wikimedia.org/r/892444 [09:51:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39860/console" [puppet] - 10https://gerrit.wikimedia.org/r/892444 (owner: 10Jbond) [09:52:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] standard_packages: also manage the rasdaemon service (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/892444 (owner: 10Jbond) [09:54:49] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd2001.codfw.wmnet with OS bullseye [09:55:05] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd2002.codfw.wmnet with OS bullseye [09:55:52] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd2003.codfw.wmnet with OS bullseye [09:56:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kuber [09:56:35] -codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-c [09:56:35] 64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:57:01] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kuber [09:57:01] -codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - 
kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-c [09:57:01] 64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:57:07] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve2001.codfw.wmnet, ml-serve2004.codfw.wmnet, ml-serve2002.codfw.wmnet, ml-serve2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:57:13] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve2001.codfw.wmnet, ml-serve2007.codfw.wmnet, ml-serve2002.codfw.wmnet, ml-serve2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:57:53] ahem sorry this is me [09:57:58] (KubernetesCalicoDown) firing: (10) ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:58:10] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39861/console" [puppet] - 10https://gerrit.wikimedia.org/r/892482 (https://phabricator.wikimedia.org/T330669) (owner: 10Elukey) [09:58:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:14] silenced most of the alarms, the BGP ones will stay until the cluster is 
upgraded [09:59:25] * elukey sends wikilove from the ML team [10:00:07] !log klausman@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=inference,name=codfw [10:00:21] (03PS1) 10Zabe: Add png logo for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892906 (https://phabricator.wikimedia.org/T306015) [10:00:25] I hope I didn't just depool all of codfw :-/ [10:01:24] doesn't seem so, should only be the inference service [10:01:35] you can check the status of codfw with get IIRC if you want to be sure :) [10:01:56] (03PS1) 10Vgutierrez: acme-chief: Add CN to SNI list for idm and idm-test [puppet] - 10https://gerrit.wikimedia.org/r/892908 [10:01:58] `{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=inference"}` [10:02:45] (03CR) 10Zabe: [C: 04-1] "completly wrong dimensions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892906 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [10:02:48] (03Abandoned) 10Zabe: Add png logo for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892906 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [10:02:54] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/892908 (owner: 10Vgutierrez) [10:03:21] (03PS2) 10Elukey: role::ml_k8s::{master,worker}: upgrade ml-serve-codfw to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/892482 (https://phabricator.wikimedia.org/T330669) [10:03:25] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Add CN to SNI list for idm and idm-test [puppet] - 10https://gerrit.wikimedia.org/r/892908 (owner: 10Vgutierrez) [10:03:45] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:04:05] (03CR) 10EoghanGaffney: [C: 03+1] "This looks good. 
I've also taken a quick look at the gitlab::restore class to make sure that this will be removed if not required and it l" [puppet] - 10https://gerrit.wikimedia.org/r/892892 (https://phabricator.wikimedia.org/T329931) (owner: 10Jelto) [10:06:19] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39863/console" [puppet] - 10https://gerrit.wikimedia.org/r/892482 (https://phabricator.wikimedia.org/T330669) (owner: 10Elukey) [10:06:28] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd2001.codfw.wmnet with reason: host reimage [10:06:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd2002.codfw.wmnet with reason: host reimage [10:06:38] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd2003.codfw.wmnet with reason: host reimage [10:07:38] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-etcd2001.codfw.wmnet with reason: host reimage [10:08:42] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.06529 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:08:53] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-etcd2003.codfw.wmnet with reason: host reimage [10:10:08] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-etcd2002.codfw.wmnet with reason: host reimage [10:13:17] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host ml-etcd2001.codfw.wmnet with OS bullseye [10:14:53] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host ml-etcd2003.codfw.wmnet with OS bullseye [10:15:20] * jbond looking at puppet [10:15:54] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for 
host ml-etcd2002.codfw.wmnet with OS bullseye [10:15:55] * jbond fix incoming [10:16:02] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [10:17:05] (03PS1) 10Jbond: base:standard_packages: uses ensure => running, not started [puppet] - 10https://gerrit.wikimedia.org/r/892914 [10:17:45] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [10:17:51] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/892914 (owner: 10Jbond) [10:19:23] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/860564 (owner: 10Muehlenhoff) [10:19:31] !log jnuche@deploy1002 Installing scap version "latest" for 1 hosts [10:19:44] !log jnuche@deploy1002 Installation of scap version "latest" completed for 1 hosts [10:20:04] (03CR) 10Jbond: [C: 03+2] base:standard_packages: uses ensure => running, not started [puppet] - 10https://gerrit.wikimedia.org/r/892914 (owner: 10Jbond) [10:20:29] (03CR) 10Vgutierrez: [C: 03+1] Add a cookbook to restart/reboot ncredir nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/860564 (owner: 10Muehlenhoff) [10:21:37] !log installing apr-util security updates on buster [10:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:37] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 23951 [10:24:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 23951 [10:26:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all [10:26:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:28:45] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:04] !log jnuche@deploy1002 Installing scap version "latest" for 8 hosts [10:29:12] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: enable restore for replicas, disable on active_host [puppet] - 10https://gerrit.wikimedia.org/r/892892 (https://phabricator.wikimedia.org/T329931) (owner: 10Jelto) [10:29:12] spike of logs [10:29:17] !log jnuche@deploy1002 Installation of scap version "latest" completed for 8 hosts [10:29:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:32:09] !log root@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade to k8s 1.23 [10:32:41] !log root@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade to k8s 1.23 [10:33:25] !log root@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade to k8s 1.23 [10:34:08] (03CR) 10Jelto: [V: 03+1 C: 03+2] "After forcing a puppet run on the new production instance gitlab2002, the restore timer is gone (systectl list-timers)" [puppet] - 10https://gerrit.wikimedia.org/r/892892 (https://phabricator.wikimedia.org/T329931) (owner: 10Jelto) [10:34:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:36:13] lots of NOTICE logs from api-gateway, aren't they? 
[10:36:36] looking [10:37:00] not an ongoing issue, but surprised at the volume [10:37:19] up to 1 million per minute [10:37:45] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_k8s::{master,worker}: upgrade ml-serve-codfw to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/892482 (https://phabricator.wikimedia.org/T330669) (owner: 10Elukey) [10:38:34] !log root@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-serve-ctrl2001.codfw.wmnet with OS bullseye [10:38:40] we are logging every single request, looks like [10:39:24] ouch [10:39:53] In theory we should have basic rate limits for api-gateway [10:40:12] jynus: any clear culprit? [10:40:14] maybe I am wrong, haven't looked too deep into it [10:40:25] just the volume seems high [10:41:14] it's internal traffic to the linkrecommendation service [10:41:17] but it seems unusually high yeah [10:41:19] https://logstash.wikimedia.org/goto/eb11f35387e54698ca21c2dc294dae59 [10:41:44] ^this is what led me to it, but I am not familiar enough with the services to go deeper [10:41:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:43:01] looking into it too, seems like a few spikes and then we're back?
[10:43:03] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m&from=now-30m&to=now [10:43:12] going to enable sampling for those logs [10:43:15] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0004904 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:43:28] godog: yeah, not worried about any ongoing issue [10:43:44] but the spike led me to notice maybe too much logging in general [10:43:54] that is what I wanted to ask about [10:43:58] that could be for sure yeah [10:44:25] as far as I understand it those NOTICE level logs are just access logs to the service [10:44:31] yep [10:44:34] (03CR) 10Muehlenhoff: [C: 03+2] Add a cookbook to restart/reboot ncredir nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/860564 (owner: 10Muehlenhoff) [10:44:37] they originally went to eventgate for $reasons [10:46:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:46:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:48:14] (03PS1) 10Kosta Harlan: gerrit: Add "All-Jobs-Passed" non-blocking label [puppet] - 10https://gerrit.wikimedia.org/r/892922 (https://phabricator.wikimedia.org/T330741) [10:48:34] Just a quick reminder that we're switching over services and traffic today, starting at 14:00UTC. 
If we could please freeze any merge and deployment starting around 13:00UTC I'd be very grateful :) [10:49:29] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10jbond) lgtm just some curiosity :) > After the above change, we will have three DNS boxes in the core DCs, with ns0 pointing to dns1001 in eqiad and ns1 pointing to dns2001... [10:49:42] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: sync [10:49:43] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: sync [10:50:16] (03CR) 10Kosta Harlan: [C: 04-1] gerrit: Add "All-Jobs-Passed" non-blocking label (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892922 (https://phabricator.wikimedia.org/T330741) (owner: 10Kosta Harlan) [10:50:30] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir-ulsfo [10:50:56] I'll add a scap lock at 13:00UTC by the way [10:51:32] (03CR) 10David Caro: [C: 03+2] P:toolforge::prometheus: add monitoring for cert-manager [puppet] - 10https://gerrit.wikimedia.org/r/889965 (owner: 10Majavah) [10:51:38] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl2001.codfw.wmnet with reason: host reimage [10:51:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir-ulsfo [10:52:05] (03CR) 10David Caro: [C: 03+2] P:toolforge::prometheus: deploy alert rules from GitLab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890490 (https://phabricator.wikimedia.org/T284860) (owner: 10Majavah) [10:53:31] (03PS4) 10David Caro: P:toolforge::prometheus: deploy alert rules from GitLab [puppet] - 10https://gerrit.wikimedia.org/r/890490 (https://phabricator.wikimedia.org/T284860) (owner: 10Majavah) 
[10:54:17] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl2001.codfw.wmnet with reason: host reimage [10:54:41] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [10:55:13] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir [10:55:34] (03Abandoned) 10Kosta Harlan: gerrit: Add "All-Jobs-Passed" non-blocking label [puppet] - 10https://gerrit.wikimedia.org/r/892922 (https://phabricator.wikimedia.org/T330741) (owner: 10Kosta Harlan) [10:57:46] (03PS2) 10Volans: sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/892487 [10:57:48] (03PS1) 10Volans: sre.hosts.reimage: expand help message for --os [cookbooks] - 10https://gerrit.wikimedia.org/r/892923 [10:58:48] (03PS1) 10QChris: Add .gitreview [debs/pint] - 10https://gerrit.wikimedia.org/r/892924 [10:58:50] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/pint] - 10https://gerrit.wikimedia.org/r/892924 (owner: 10QChris) [10:58:54] (03CR) 10Klausman: [C: 03+1] sre.hosts.reimage: expand help message for --os [cookbooks] - 10https://gerrit.wikimedia.org/r/892923 (owner: 10Volans) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1100) [11:00:41] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [11:03:48] (03PS1) 10Gmodena: page-content-change: add state to values file. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/892928 [11:04:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir [11:05:38] (03PS2) 10Gmodena: page-content-change: add state to values file. [deployment-charts] - 10https://gerrit.wikimedia.org/r/892928 (https://phabricator.wikimedia.org/T328569) [11:11:01] !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-serve-ctrl2001.codfw.wmnet with OS bullseye [11:11:13] !log root@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-serve-ctrl2002.codfw.wmnet with OS bullseye [11:17:57] (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/892894 (https://phabricator.wikimedia.org/T330120) (owner: 10Hashar) [11:19:17] (03PS1) 10Muehlenhoff: Add a cookbook to roll restart/reboot AQS [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 [11:20:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/892923 (owner: 10Volans) [11:21:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [11:21:41] !log Install MariaDB 11.0.1 on db1106 T330643 [11:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:46] T330643: Compile and package MariaDB 11.0.1 - https://phabricator.wikimedia.org/T330643 [11:21:48] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl2002.codfw.wmnet with reason: host reimage [11:22:49] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [11:24:35] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl2002.codfw.wmnet with reason: host reimage [11:28:28] ofc we have WikimediaCommandLineInc.php because why not [11:28:44] ACKNOWLEDGEMENT - Check systemd state on aphlict2001 is CRITICAL: CRITICAL - 
degraded: The following units failed: wmf_auto_restart_aphlict.service John Bond T330393 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:19] ACKNOWLEDGEMENT - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_exim4.service John Bond T330660 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:19] (03CR) 10Muehlenhoff: Add a cookbook to roll restart/reboot AQS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [11:31:59] (03CR) 10Stevemunene: [V: 03+1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:32:12] (03CR) 10Btullis: Update airflow conf compatibility with airflow 2.5.0 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:32:18] (03CR) 10Btullis: [C: 03+2] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:32:53] (03CR) 10Muehlenhoff: [C: 03+2] Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/892389 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff) [11:33:04] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [11:37:43] (03CR) 10Btullis: [C: 04-1] "There is a problem with the version string comparison." 
[puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:39:52] !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-serve-ctrl2002.codfw.wmnet with OS bullseye [11:40:36] (03CR) 10Volans: [C: 03+1] "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [11:41:34] (03PS3) 10Volans: apt: add new module with new AptGetHosts class [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 [11:43:20] (03CR) 10Muehlenhoff: Add a cookbook to roll restart/reboot AQS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [11:44:12] (03CR) 10Volans: "Addressed comments, added examples to docstrings." [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 (owner: 10Volans) [11:44:39] (03CR) 10Volans: [C: 03+1] Add a cookbook to roll restart/reboot AQS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [11:48:04] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2001.codfw.wmnet with OS bullseye [11:48:14] (03PS66) 10Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [11:48:38] (03CR) 10Btullis: [C: 03+1] "Really useful, many thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [11:49:58] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bullseye [11:50:48] (03CR) 10Btullis: "We decided to put the complete version string including the suffix into the parameter." 
[puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:50:48] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2003.codfw.wmnet with OS bullseye [11:51:30] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39864/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:51:53] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2004.codfw.wmnet with OS bullseye [11:54:59] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10akosiaris) >>! In T327920#8647481, @Tgr wrote: > MwHttpRequest (that is, Guzzle/php-curl) and the URLs from https://wikitech.wikimedia.org/wiki/Ana... [11:55:33] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2008.codfw.wmnet with OS bullseye [11:55:36] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2007.codfw.wmnet with OS bullseye [11:55:44] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2006.codfw.wmnet with OS bullseye [11:56:00] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2005.codfw.wmnet with OS bullseye [11:56:09] (03CR) 10Nicolas Fraison: [C: 03+1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:57:59] ACKNOWLEDGEMENT - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab2002.wikimedia.org.service,rsync-data-backup-gitlab2002.wikimedia.org.service John Bond T330744 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:01] 
jouncebot: nowandnext [12:02:01] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [12:02:01] In 1 hour(s) and 57 minute(s): Datacenter Switchover - Services (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1400) [12:02:01] In 1 hour(s) and 57 minute(s): Datacenter Switchover - Mediawiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1400) [12:04:00] (03PS2) 10Muehlenhoff: Add a cookbook to roll restart/reboot AQS [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 [12:07:07] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage [12:07:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage [12:09:47] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage [12:10:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage [12:11:34] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2006.codfw.wmnet with reason: host reimage [12:11:42] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2005.codfw.wmnet with reason: host reimage [12:11:53] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2008.codfw.wmnet with reason: host reimage [12:12:08] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2007.codfw.wmnet with reason: host reimage [12:12:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage [12:12:57] (03PS1) 10Jbond: gitlab::rsync: ensure we remove old jobs when promoting a server to active [puppet] - 10https://gerrit.wikimedia.org/r/892938 (https://phabricator.wikimedia.org/T330744) 
[12:13:18] (03PS1) 10Marostegui: install_server: Do not reimage db2186 [puppet] - 10https://gerrit.wikimedia.org/r/892939 (https://phabricator.wikimedia.org/T326596) [12:13:55] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2186 [puppet] - 10https://gerrit.wikimedia.org/r/892939 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [12:14:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39865/console" [puppet] - 10https://gerrit.wikimedia.org/r/892938 (https://phabricator.wikimedia.org/T330744) (owner: 10Jbond) [12:14:29] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [12:15:06] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2008.codfw.wmnet with reason: host reimage [12:16:34] (03CR) 10Jbond: [V: 03+1] "pcc is a bit messy, due to the amount of resources a timer adds. the key thing to notice is that we now have resources for gitlab2002 an" [puppet] - 10https://gerrit.wikimedia.org/r/892938 (https://phabricator.wikimedia.org/T330744) (owner: 10Jbond) [12:17:03] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2005.codfw.wmnet with reason: host reimage [12:17:21] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage [12:19:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage [12:21:17] jouncebot: next [12:21:18] In 1 hour(s) and 38 minute(s): Datacenter Switchover - Services (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1400) [12:21:18] In 1 hour(s) and 38 minute(s): Datacenter Switchover - Mediawiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1400) [12:21:34] o_O I don’t see a 
mediawiki switchover window today in the calendar [12:21:43] only a traffic one, one hour after the services [12:21:57] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2006.codfw.wmnet with reason: host reimage [12:21:59] Lucas_WMDE: MW is scheduled for tomorrow [12:22:08] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve2007.codfw.wmnet with reason: host reimage [12:22:15] yup, I think I just fixed it https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=2057105 [12:22:33] it was under the right day section on the wiki page, but with the wrong day in the template [12:23:40] jouncebot: refresh [12:23:41] I refreshed my knowledge about deployments. [12:23:43] jouncebot: next [12:23:43] In 1 hour(s) and 36 minute(s): Datacenter Switchover - Services (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1400) [12:23:45] yay [12:24:05] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage [12:24:10] (03CR) 10Muehlenhoff: [C: 03+2] Add a cookbook to roll restart/reboot AQS [cookbooks] - 10https://gerrit.wikimedia.org/r/892931 (owner: 10Muehlenhoff) [12:24:33] (03PS1) 10Marostegui: mariadb: Productionize db2185 [puppet] - 10https://gerrit.wikimedia.org/r/892941 (https://phabricator.wikimedia.org/T326596) [12:24:54] (03CR) 10CI reject: [V: 04-1] mariadb: Productionize db2185 [puppet] - 10https://gerrit.wikimedia.org/r/892941 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [12:25:51] PROBLEM - configured eth on ml-serve2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.78: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:25:59] (03PS2) 10Marostegui: mariadb: Productionize db2185 [puppet] - 10https://gerrit.wikimedia.org/r/892941 
(https://phabricator.wikimedia.org/T326596) [12:26:47] PROBLEM - confd service on ml-serve2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.78: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:27:25] PROBLEM - dhclient process on ml-serve2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.78: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [12:27:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2002.codfw.wmnet with OS bullseye [12:28:27] PROBLEM - MD RAID on ml-serve2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.78: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:29:17] PROBLEM - puppet last run on ml-serve2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.78: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:29:45] RECOVERY - Check systemd state on mw2314 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2003.codfw.wmnet with OS bullseye [12:30:25] RECOVERY - confd service on ml-serve2007 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:30:42] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2185 [puppet] - 10https://gerrit.wikimedia.org/r/892941 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [12:31:08] (03CR) 10Muehlenhoff: "Looks good, a few nits/typos inline." 
[software/spicerack] - https://gerrit.wikimedia.org/r/891315 (owner: Volans) [12:31:34] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2008.codfw.wmnet with OS bullseye [12:32:04] (PS1) Majavah: labs_boostrapvz: Remove class [puppet] - https://gerrit.wikimedia.org/r/892944 [12:33:07] (CR) Elukey: [C: +2] admin_ng: upgrade ml-serve-codfw's settings to k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/892483 (https://phabricator.wikimedia.org/T330669) (owner: Elukey) [12:33:53] ACKNOWLEDGEMENT - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve2001.codfw.wmnet, ml-serve2004.codfw.wmnet, ml-serve2002.codfw.wmnet, ml-serve2008.codfw.wmnet are marked down but pooled John Bond T327253 https://wikitech.wikimedia.org/wiki/PyBal [12:34:05] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2005.codfw.wmnet with OS bullseye [12:34:45] RECOVERY - puppet last run on ml-serve2007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:35:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:35:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:35:35] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:35:37] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:36:35] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:36:42] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:36:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:37:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[12:37:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:37:35] PROBLEM - Host ml-serve2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:41] RECOVERY - Host ml-serve2007 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [12:37:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:38:14] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:38:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:38:27] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2006.codfw.wmnet with OS bullseye [12:38:52] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:38:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2004.codfw.wmnet with OS bullseye [12:39:03] RECOVERY - MD RAID on ml-serve2007 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:39:04] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:39:48] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:39:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:39:57] RECOVERY - dhclient process on ml-serve2007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [12:39:57] RECOVERY - configured eth on ml-serve2007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:40:13] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2007.codfw.wmnet with OS bullseye [12:41:58] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[12:42:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:42:09] RECOVERY - puppet last run on puppetdb2003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:42:42] (PS1) Marostegui: dbprov2002.cnf.erb: Change db_inventory target [puppet] - https://gerrit.wikimedia.org/r/892948 (https://phabricator.wikimedia.org/T326596) [12:43:19] (CR) Marostegui: "jcrespo if you could test this before your holidays in order not to get the decommissioning on db2093 blocked that'd be great." [puppet] - https://gerrit.wikimedia.org/r/892948 (https://phabricator.wikimedia.org/T326596) (owner: Marostegui) [12:44:57] (PS1) Elukey: role::ml_k8s::worker: set istio-cni to 1.15 in ml-serve-codfw [puppet] - https://gerrit.wikimedia.org/r/892949 (https://phabricator.wikimedia.org/T330669) [12:45:12] (PS3) Ladsgroup: mwscript: Switch to use run.php [mediawiki-config] - https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) [12:45:12] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:45:15] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:45:30] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:45:53] (CR) Elukey: [C: +2] role::ml_k8s::worker: set istio-cni to 1.15 in ml-serve-codfw [puppet] - https://gerrit.wikimedia.org/r/892949 (https://phabricator.wikimedia.org/T330669) (owner: Elukey) [12:45:54] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:45:55] (CR) CI reject: [V: -1] mwscript: Switch to use run.php [mediawiki-config] - https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) (owner: Ladsgroup) [12:46:06] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[12:46:12] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:46:32] (CR) Klausman: [C: +1] role::ml_k8s::worker: set istio-cni to 1.15 in ml-serve-codfw [puppet] - https://gerrit.wikimedia.org/r/892949 (https://phabricator.wikimedia.org/T330669) (owner: Elukey) [12:46:43] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:46:56] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:47:16] (PS4) Ladsgroup: mwscript: Switch to use run.php [mediawiki-config] - https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) [12:47:29] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:47:31] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:48:21] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:48:28] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:48:33] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:49:00] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:49:25] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:45] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:50:23] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:51:03] RECOVERY - puppet last run on puppetdb1003 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:52:30] who wants to review my mwscript patch?
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/889259 to make it use run.php [12:52:37] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:53:16] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2001.codfw.wmnet with OS bullseye [12:53:47] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:54:09] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:55:47] !log root@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: Upgrade to k8s 1.23 [12:55:58] \o/ [12:56:31] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:56:32] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:58:02] (CR) Marostegui: [C: +2] dbprov2002.cnf.erb: Change db_inventory target [puppet] - https://gerrit.wikimedia.org/r/892948 (https://phabricator.wikimedia.org/T326596) (owner: Marostegui) [12:58:06] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:58:08] (CR) Jcrespo: [C: +1] dbprov2002.cnf.erb: Change db_inventory target [puppet] - https://gerrit.wikimedia.org/r/892948 (https://phabricator.wikimedia.org/T326596) (owner: Marostegui) [13:00:43] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:01:08] (PS1) Marostegui: common.yaml: Add db2185 [puppet] - https://gerrit.wikimedia.org/r/892953 (https://phabricator.wikimedia.org/T326596) [13:01:20] (PS4) Volans: apt: add new module with new AptGetHosts class [software/spicerack] - https://gerrit.wikimedia.org/r/891315 [13:01:22] (CR) Volans: "Thanks for the quick review, addressed comments" [software/spicerack] - https://gerrit.wikimedia.org/r/891315 (owner: Volans) [13:01:50] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [13:02:29] SRE, DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [13:02:53] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:03:43] SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) [13:03:50] Heads up Emperor, jbond, and anyone it may concern, locking scap deployments in advance of services switchover [13:04:01] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:04:13] (CR) Jcrespo: [C: +1] common.yaml: Add db2185 [puppet] - https://gerrit.wikimedia.org/r/892953 (https://phabricator.wikimedia.org/T326596) (owner: Marostegui) [13:04:15] (CR) Jelto: [C: +1] "lgtm, thanks for the refactoring!"
[puppet] - https://gerrit.wikimedia.org/r/892938 (https://phabricator.wikimedia.org/T330744) (owner: Jbond) [13:04:22] (CR) Marostegui: [C: +2] common.yaml: Add db2185 [puppet] - https://gerrit.wikimedia.org/r/892953 (https://phabricator.wikimedia.org/T326596) (owner: Marostegui) [13:04:36] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:04:38] !log Locking scap deployments for service switchover - T330651 [13:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:43] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [13:04:54] !log jnuche@deploy1002 Installing scap version "latest" for 550 hosts [13:05:12] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:05:15] jnuche: err [13:05:33] I just put a scap lock on deploy1002 [13:05:47] !log jnuche@deploy1002 Installation of scap version "latest" completed for 550 hosts [13:05:56] (CR) Jelto: [C: +2] gitlab::rsync: ensure we remove old jobs when promoting a server to active [puppet] - https://gerrit.wikimedia.org/r/892938 (https://phabricator.wikimedia.org/T330744) (owner: Jbond) [13:06:00] claime: o/ we have the inference endpoint pooled only for eqiad at the moment, we are going to return to A/A in some mins [13:06:20] elukey: ack, I'm putting the lock way in advance, switchover starts at 1400UTC [13:06:46] claime: sorry, I didn't see the lock, I won't touch scap until the switchover is done [13:06:52] SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) [13:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:07:13] jnuche: No worries, I should have communicated in releng that I'd put a lock up in advance [13:08:21] Rationale for locking 1h in advance is that it gives time to fix before the switchover window if the latest deploys go wrong or something [13:08:45] jnuche: Be advised we are switching deployment servers too after the services switchover [13:08:51] jnuche: https://phabricator.wikimedia.org/T330651 [13:08:53] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:09:26] claime: ack, thx 👍 [13:10:35] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:10:42] (PS1) Stang: srwiki: Update logo/wordmark/tagline, add latin variant version [mediawiki-config] - https://gerrit.wikimedia.org/r/892955 (https://phabricator.wikimedia.org/T324545) [13:11:09] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:11:22] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:11:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:46] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:14:48] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:02] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:17:05] (CR) Ottomata: [V: +2 C: +2] page-content-change: add state to values file. [deployment-charts] - https://gerrit.wikimedia.org/r/892928 (https://phabricator.wikimedia.org/T328569) (owner: Gmodena) [13:18:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:20:12] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [13:20:18] SRE, DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**) - Removed from Puppet and PuppetD... [13:20:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:21:26] (PS1) Zabe: MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/892966 (https://phabricator.wikimedia.org/T330746) [13:21:41] (PS1) Zabe: MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation [core] (wmf/1.40.0-wmf.25) - https://gerrit.wikimedia.org/r/892967 (https://phabricator.wikimedia.org/T330746) [13:21:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:17] (PS3) Kosta Harlan: GrowthExperiments: Enable Growth features by default on testwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892529 (https://phabricator.wikimedia.org/T330748) (owner: Urbanecm) [13:24:37] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:24:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:25:05] (CR) Kosta Harlan: GrowthExperiments: Enable Growth features by default on testwikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/892529 (https://phabricator.wikimedia.org/T330748) (owner: Urbanecm) [13:25:30] (CR) Kosta Harlan: GrowthExperiments: Enable Growth features by default on testwikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/892529 (https://phabricator.wikimedia.org/T330748) (owner: Urbanecm) [13:25:37] (CR) Kosta Harlan: [C: +1] GrowthExperiments: Enable Growth features by default on testwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892529 (https://phabricator.wikimedia.org/T330748) (owner: Urbanecm) [13:26:00] (CR) Zabe: mwscript: Switch to use run.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) (owner: Ladsgroup) [13:26:09] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:58] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:29:48] (ProbeDown) resolved: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:03] (PS5) Ladsgroup: mwscript: Switch to use run.php [mediawiki-config] - https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) [13:30:05] (CR) Ladsgroup: mwscript: Switch to use run.php
(1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) (owner: Ladsgroup) [13:31:09] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:31:33] (CR) Muehlenhoff: [C: +1] "Looks good!" [software/spicerack] - https://gerrit.wikimedia.org/r/891315 (owner: Volans) [13:31:55] (CR) Volans: [C: +2] apt: add new module with new AptGetHosts class [software/spicerack] - https://gerrit.wikimedia.org/r/891315 (owner: Volans) [13:32:10] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=codfw [13:32:21] claime: repooled! [13:32:32] elukey: Thank you <3 [13:34:48] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:35:47] (Merged) jenkins-bot: apt: add new module with new AptGetHosts class [software/spicerack] - https://gerrit.wikimedia.org/r/891315 (owner: Volans) [13:38:37] PROBLEM - Check systemd state on ml-serve2003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:31] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [13:46:57] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:49:05] (PS1) Zabe: Optimize some logos [mediawiki-config] -
https://gerrit.wikimedia.org/r/892958 [13:49:25] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [13:49:36] SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [13:51:07] (Abandoned) Zabe: Optimize some logos [mediawiki-config] - https://gerrit.wikimedia.org/r/892958 (owner: Zabe) [13:55:29] (CR) Alexandros Kosiaris: [C: +1] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/892960 (https://phabricator.wikimedia.org/T330651) (owner: Clément Goubert) [13:56:00] (CR) Clément Goubert: [C: +2] service::catalog: Remove discovery stanza for apt [puppet] - https://gerrit.wikimedia.org/r/892960 (https://phabricator.wikimedia.org/T330651) (owner: Clément Goubert) [13:56:47] (CR) Jbond: "i think this could break apt.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/892960 (https://phabricator.wikimedia.org/T330651) (owner: Clément Goubert) [13:59:42] !log Create dummy and empty enwiki.text table on db2186:3311 to test check_private_data T326596 [13:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:47] T326596: Productionize db218[567] - https://phabricator.wikimedia.org/T326596 [13:59:52] SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (SHust) **Quick update from Shopify below + Please let me know if I should send them a reply with anything else worth pointing out or the rails link shared above and I'll happily to so... [14:00:05] claime: (Dis)respected human, time to deploy Datacenter Switchover - Services (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1400). Please do the needful.
[14:00:35] (PS1) MSantos: mobileapps: bump to 2023-02-20-130053-production [deployment-charts] - https://gerrit.wikimedia.org/r/892962 [14:01:15] We are going to delay the Datacenter Switchover - Services deploy for a few minutes [14:01:44] (PS1) Jbond: apt: add discovery for apt [dns] - https://gerrit.wikimedia.org/r/892963 [14:01:49] (PS1) Clément Goubert: Revert "service::catalog: Remove discovery stanza for apt" [puppet] - https://gerrit.wikimedia.org/r/892968 [14:02:25] PROBLEM - gdnsd checkconf on dns2002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:03:29] PROBLEM - gdnsd checkconf on dns1002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:03:47] (CR) BBlack: [C: +1] "Breaks config on authdns servers, causing agent failure runs to boot" [puppet] - https://gerrit.wikimedia.org/r/892968 (owner: Clément Goubert) [14:04:03] RECOVERY - Check systemd state on ml-serve2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:55] (CR) Clément Goubert: [C: +2] Revert "service::catalog: Remove discovery stanza for apt" [puppet] - https://gerrit.wikimedia.org/r/892968 (owner: Clément Goubert) [14:05:57] (PS1) Marostegui: check_private_data.pp: Let's make it daily [puppet] - https://gerrit.wikimedia.org/r/892964 [14:06:35] PROBLEM - gdnsd checkconf on dns4004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:06:52] (PS1) Filippo Giunchedi: prometheus: refactor blackbox configuration [puppet] - https://gerrit.wikimedia.org/r/892965 (https://phabricator.wikimedia.org/T309182) [14:06:54] (PS1) Filippo Giunchedi: prometheus: add pint support [puppet] - https://gerrit.wikimedia.org/r/892986 (https://phabricator.wikimedia.org/T309182) [14:06:56] (PS1) Filippo Giunchedi: prometheus: add pint source to
ops [puppet] - https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) [14:07:55] PROBLEM - gdnsd checkconf on dns3002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:07:55] PROBLEM - gdnsd checkconf on dns5004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:07:58] (CR) Andrew Bogott: "I'm late to this but I suggest that we include a rule for all cloudcontrol nodes (e.g. profile::openstack::eqiad1::openstack_controllers) " [puppet] - https://gerrit.wikimedia.org/r/892446 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe) [14:08:25] RECOVERY - gdnsd checkconf on dns4004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:08:45] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:08:53] SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata10...
[14:08:59] RECOVERY - gdnsd checkconf on dns1002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:09:44] (CR) Jcrespo: [C: +1] check_private_data.pp: Let's make it daily [puppet] - https://gerrit.wikimedia.org/r/892964 (owner: Marostegui) [14:09:45] RECOVERY - gdnsd checkconf on dns2002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:09:47] RECOVERY - gdnsd checkconf on dns3002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:09:47] RECOVERY - gdnsd checkconf on dns5004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:11:05] (CR) Andrew Bogott: [C: +1] "looks good but let's sit on this until we actually decom the hardware involved." [puppet] - https://gerrit.wikimedia.org/r/892904 (https://phabricator.wikimedia.org/T323159) (owner: Majavah) [14:12:26] (CR) Jbond: [C: +2] apt: add discovery for apt [dns] - https://gerrit.wikimedia.org/r/892963 (owner: Jbond) [14:12:34] (CR) MSantos: [C: +2] mobileapps: bump to 2023-02-20-130053-production [deployment-charts] - https://gerrit.wikimedia.org/r/892962 (owner: MSantos) [14:14:50] (PS2) Marostegui: check_private_data.pp: Let's make it daily [puppet] - https://gerrit.wikimedia.org/r/892964 [14:16:52] (CR) Klausman: [C: +1] "One nit, otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/892904 (https://phabricator.wikimedia.org/T323159) (owner: Majavah) [14:17:19] (Merged) jenkins-bot: mobileapps: bump to 2023-02-20-130053-production [deployment-charts] - https://gerrit.wikimedia.org/r/892962 (owner: MSantos) [14:18:04] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/output/892964/39867/" [puppet] - https://gerrit.wikimedia.org/r/892964 (owner: Marostegui) [14:18:07] (CR) Marostegui: [C: +2] check_private_data.pp: Let's make it daily [puppet] - https://gerrit.wikimedia.org/r/892964 (owner:
Marostegui) [14:20:46] heads up, services being switched in a bit [14:21:05] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:21:16] hm, why was a mobileapps deployment commit just merged? [14:21:36] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:21:39] !log switching services over to codfw - T330651 [14:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:44] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [14:21:53] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 [14:22:12] SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 s... [14:22:29] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [14:22:32] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:22:52] mbsantos: you probably want to wait on deploying mobileapps [14:24:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [14:25:04] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:26:32] thanks for the heads up akosiaris I'm following the discussions [14:28:06] as an update, we are ~30% there [14:29:54] thanks, I'll skip the current window and do in the later one [14:32:54] (PS1) Filippo Giunchedi: Add debianization [debs/pint] - https://gerrit.wikimedia.org/r/892992 (https://phabricator.wikimedia.org/T309182) [14:35:04] I'll be switching over the deployment server next, so your deploy will be from deploy2002 [14:35:30] (CR) Filippo Giunchedi: Add Debian packaging for 21.3.0 (1 comment) [software/librenms] (debian/buster-wikimedia) -
https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: Filippo Giunchedi) [14:38:31] (PS1) Elukey: role::ml_k8s::{master,worker}: update ml-serve-eqiad to k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/892995 (https://phabricator.wikimedia.org/T330758) [14:42:00] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:42:02] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) depool all services in eqiad: Datacenter Switchover - T330651 [14:42:06] !log cgoubert@cumin1001 START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 [14:42:08] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [14:42:17] SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 f... [14:42:21] SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:42:40] SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 s...
[14:44:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Datacenter Switchover - T330651 [14:44:16] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 c... [14:44:26] and done [14:44:36] !log Services switched over to codfw - T329193 [14:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:41] T329193: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 [14:44:43] on to switching the deployment server now [14:44:47] (03PS1) 10Elukey: admin_ng: upgrade ml-serve-eqiad to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/892996 (https://phabricator.wikimedia.org/T330758) [14:44:50] !log oblivian@cumin1001 START - Cookbook sre.discovery.datacenter status all services in eqiad: None - None [14:44:51] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in eqiad: None - None [14:44:55] I'm going to take a 2 minutes breather [14:45:29] yup, no rush [14:45:48] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [14:47:28] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39868/console" [puppet] - 10https://gerrit.wikimedia.org/r/892995 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [14:48:12] !log dcaro@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1005.eqiad.wmnet [14:50:55] (03PS2) 10Elukey: role::ml_k8s::{master,worker}: update ml-serve-eqiad to k8s 1.23 [puppet] - 
10https://gerrit.wikimedia.org/r/892995 (https://phabricator.wikimedia.org/T330758) [14:51:08] !log Switch deployment server to deploy2002.codfw.wmnet - T330651 [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:12] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [14:51:59] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39869/console" [puppet] - 10https://gerrit.wikimedia.org/r/892995 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [14:52:09] (03PS3) 10Clément Goubert: wmnet: Switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T330651) [14:52:37] (03CR) 10Clément Goubert: [C: 03+2] wmnet: Switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [14:52:58] (03CR) 10CI reject: [V: 04-1] wmnet: Switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [14:53:27] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [14:53:44] (03PS3) 10Elukey: role::ml_k8s::{master,worker}: update ml-serve-eqiad to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/892995 (https://phabricator.wikimedia.org/T330758) [14:53:48] (03PS1) 10Giuseppe Lavagetto: sre.discovery.datacenter: support a/p state when depooled [cookbooks] - 10https://gerrit.wikimedia.org/r/892999 [14:53:50] (03PS1) 10Giuseppe Lavagetto: sre.discovery.datacenter: uniform style [cookbooks] - 10https://gerrit.wikimedia.org/r/893000 [14:53:58] <_joe_> claime: ^^ [14:54:28] (03CR) 10Vgutierrez: [C: 03+1] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T330650) (owner: 10Clément Goubert) [14:54:30] 
congrats _joe_, you got gerrit 893000 :D [14:54:46] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39870/console" [puppet] - 10https://gerrit.wikimedia.org/r/892995 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [14:55:00] <_joe_> volans: and for a commit just changing single to double quotes, your favourites [14:55:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) * dumpdata1007 has ssds raid1 virtual drive id 238, hdds raid10 virtual drive id 239, installs fine. * dumpdata1006 issues: * clear config, reboot, setup raid1 ssd... [14:55:41] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/893000 (owner: 10Giuseppe Lavagetto) [14:56:15] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:56:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata10... [14:57:10] (03CR) 10Clément Goubert: [C: 03+2] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [14:59:25] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_pagetriage_cleanup_testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:05] claime: I, the Bot under the Fountain, call upon thee, The Deployer, to do Datacenter Switchover - Traffic deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1500). 
[15:00:59] I know jouncebot but we're stuck for now :) [15:04:14] !log zabe@mwmaint1002:~$ mwscript extensions/Flow/maintenance/FlowFixInconsistentBoards.php --wiki=zhwiki --namespaceName='USER_TALK' # T330761 [15:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:19] T330761: Fix non-associated flow talk page - https://phabricator.wikimedia.org/T330761 [15:04:22] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001" [15:05:15] zabe: please hold on until the switchover is complete [15:05:28] (03PS1) 10Kosta Harlan: planet: More specific feed for kostaharlan.net [puppet] - 10https://gerrit.wikimedia.org/r/893001 [15:05:40] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001" [15:05:40] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:05:41] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudcephosd1005.eqiad.wmnet [15:05:44] ok [15:05:51] (03PS2) 10Kosta Harlan: planet: Wikimedia-specific feed for kostaharlan.net [puppet] - 10https://gerrit.wikimedia.org/r/893001 [15:06:11] (03CR) 10CI reject: [V: 04-1] planet: Wikimedia-specific feed for kostaharlan.net [puppet] - 10https://gerrit.wikimedia.org/r/893001 (owner: 10Kosta Harlan) [15:06:52] (03CR) 10Clément Goubert: [C: 03+2] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [15:10:17] (03PS3) 10Kosta Harlan: planet: Wikimedia-specific feed for kostaharlan.net [puppet] - 10https://gerrit.wikimedia.org/r/893001 [15:10:36] !log Running 
authdns-update for deployment server switch - T330651 [15:10:39] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:40] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [15:11:23] (03PS4) 10Clément Goubert: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) [15:11:34] (03CR) 10Clément Goubert: [C: 03+1] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [15:11:37] <_joe_> uhm mobileapps is overloaded in codfw? 
[15:11:53] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:25] <_joe_> no it seems ok tbh [15:13:13] no, there is some minor issue [15:13:30] 1 (at least) of the pods is close to maximum memory [15:13:43] judging by the average, it should be just 1 though [15:13:43] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [15:14:01] mateus was about to upgrade it, that should refresh them [15:14:24] (03CR) 10Clément Goubert: [C: 03+2] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [15:14:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [15:14:31] (03PS1) 10Jbond: cumin: update check_puppet_run_changes to remove inactive alerts hosts [puppet] - 10https://gerrit.wikimedia.org/r/893002 [15:16:37] !log Running puppet on all deployment servers - T330651 [15:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:42] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [15:17:47] <_joe_> claime: you should also run puppet where tcpircbot runs :) [15:18:05] _joe_: It'll be run fleet-wide right after just to be sure [15:18:28] !log Running puppet on fleet-wide - T330651 [15:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:38] is that with some batch right? [15:18:39] <_joe_> uh wait, how are you running? 
[15:18:50] sudo cumin -b 20 -p 95 '*' 'run-puppet-agent -q' [15:18:57] <_joe_> tbh I usually think it's hard for it to take less than 30 minutes [15:19:01] yes [15:19:08] usually not worth [15:19:12] <_joe_> unless you're very aggressive [15:19:19] As you wish, I can skip it [15:19:23] <_joe_> so not -b 20 [15:19:29] (03PS2) 10Jbond: cumin: update check_puppet_run_changes to remove inactive alerts hosts [puppet] - 10https://gerrit.wikimedia.org/r/893002 [15:19:30] which might kill the puppetmasters ;-) [15:19:31] (03PS1) 10Jbond: check_puppet_run_changes: use black [puppet] - 10https://gerrit.wikimedia.org/r/893005 [15:19:40] I'll just run it on alert then, that's where tcpircbot runs right? [15:19:45] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:49] cit T280622 [15:19:49] T280622: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 [15:20:09] <_joe_> I'm looking into imagecatalog [15:20:12] ack [15:20:25] !log Disregard running puppet on fleet-wide - T330651 [15:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:53] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [15:21:09] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 28 days, 0:00:00 on an-airflow1005.eqiad.wmnet with reason: new OS but some puppet stuff doesn't work yet [15:21:15] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:22] <_joe_> no it's not :P [15:21:33] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk 
page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:22:14] <_joe_> akosiaris: can you take a look at rb? I'm looking into imagecatalog (cc rzl) [15:22:25] looking at rb also [15:22:37] yeah, I was about to say, this is the second alert this hour [15:23:56] <_joe_> !log oblivian@deploy2002:~ $ sudo chown imagecatalog:imagecatalog /srv/deployment/imagecatalog/catalog.sqlite [15:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:09] akosiaris@restbase1030:/var/log/restbase$ check-restbase [15:24:09] All endpoints are healthy [15:24:13] interesting [15:24:20] <_joe_> yeah those are transient issues clearly [15:24:43] got 1 error out of 10 [15:24:50] so there is something underlying [15:25:57] !log Removing scap lock on deploy2002.codfw.wmnet [15:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:13] !log Testing scap deployment from deploy2002.codfw.wmnet - T330651 [15:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:18] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [15:27:02] errors getting responses from citoid mostly for restbase1030 [15:27:49] (03PS1) 10Elukey: ores: change monitoring for the service [puppet] - 10https://gerrit.wikimedia.org/r/893008 [15:27:54] Why is it trying to reach restbase1030? 
[15:27:56] (03CR) 10Ladsgroup: mediawiki-cache-warmup: Add POSTs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus) [15:28:06] claime: restbase1030 is reaching citoid [15:28:11] Ah [15:28:14] citoid not looking great in codfw https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid&viewPanel=15 [15:28:21] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jnuche) Scap updates on Thumbor hosts are currently failing since Scap requires Python 3.7. I would like to disable any Scap deployments until this ticket is comple... [15:28:52] this is zotero actually [15:29:02] https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid&viewPanel=44 [15:29:18] (03PS3) 10BCornwall: config: Add brett for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) [15:29:22] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [15:29:32] (03CR) 10BCornwall: config: Add brett for router configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:29:38] Tell me if rollback to eqiad is necessary for some services [15:29:40] but it is the usual rate though [15:29:47] ~4rps [15:30:10] <_joe_> zotero, sigh [15:30:15] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:23] but it does have an error rate of 22.7% currently [15:31:11] I don't see something very out of the ordinary tbh [15:31:16] lots of 
recent restarts on citoid in k8s [15:31:19] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:31:28] !log installing tiff security updates [15:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:43] (03PS2) 10Jbond: check_puppet_run_changes: use black [puppet] - 10https://gerrit.wikimedia.org/r/893005 [15:31:45] (03PS3) 10Jbond: cumin: update check_puppet_run_changes to remove inactive alerts hosts [puppet] - 10https://gerrit.wikimedia.org/r/893002 [15:32:11] claime: don't pause on our behalf btw [15:32:19] akosiaris: I'm not :p [15:33:16] yeah, that rate of 500s is common for Citoid https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=citoid&from=now-30d&to=now&viewPanel=15 [15:34:46] lmao, ~600 restarts per pod in eqiad [15:35:25] that bad ? [15:35:40] on brand for it at least, compared to codfw [15:36:06] 600 for 26d [15:36:10] lol [15:36:40] well, we are doing these switchovers to unearth things [15:36:47] we just unearthed that [15:41:40] (03PS3) 10Clément Goubert: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T330650) [15:41:45] (03CR) 10Clément Goubert: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T330650) (owner: 10Clément Goubert) [15:42:04] I think we have a Wikidata-related alert firing since the services switchover [15:42:19] it’s not urgent, because I think only the alert is wrong and not the thing it’s warning about, but when the dust has settled I’d like to talk it over with someone :) [15:44:57] (03CR) 10Clément Goubert: [C: 03+2] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T330650) 
(owner: 10Clément Goubert) [15:45:21] !log Traffic: depool eqiad from user traffic - T330650 [15:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:26] T330650: 28 February 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 [15:45:36] (03CR) 10JMeybohm: [C: 04-1] Add a spark-operator chart and helmfile configuration (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [15:45:38] !log Running authdns-update - T330650 [15:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:48] !log Traffic: eqiad depooled - T330650 [15:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:56] !log cgoubert@deploy2002 Synchronized README: check the deployment server after switchover - T330651 (duration: 20m 56s) [15:48:01] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [15:48:47] 10SRE, 10Citoid: citoid having stability issues - https://phabricator.wikimedia.org/T330768 (10hnowlan) [15:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:49:35] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:50:19] !log hnowlan@deploy2002 Started deploy [restbase/deploy@5271b8f]: New wikis: gucwp, gurwp, vewikimedia T320899 T326237 T327843 [15:50:28] T326237: Add gucwiki to RESTBase - https://phabricator.wikimedia.org/T326237 [15:50:30] T320899: Add vewikimedia to RESTBase - https://phabricator.wikimedia.org/T320899 [15:50:31] T327843: Add gurwiki to RESTBase - https://phabricator.wikimedia.org/T327843 [15:53:58] (KubernetesAPILatency) resolved: High Kubernetes API 
latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:26] ^ we can still ignore that, right? [15:55:36] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/893005 (owner: 10Jbond) [15:55:49] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 (10Clement_Goubert) [15:55:57] on the above restbase issue - as restbase is deprecated for various services we'll need to be more aware of the services themselves. We had no insight into citoid being in such poor shape [15:56:00] 10SRE, 10serviceops, 10Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [15:56:26] (03PS1) 10Muehlenhoff: Install pbuilder hook for ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/893014 (https://phabricator.wikimedia.org/T329491) [15:56:35] jynus: It's just being a bit slow, don't think there's anything to worry about [15:56:41] POST pods is for updating resources [15:56:53] (03CR) 10CI reject: [V: 04-1] Install pbuilder hook for ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/893014 (https://phabricator.wikimedia.org/T329491) (owner: 10Muehlenhoff) [15:57:23] what I mean, claime, is that almost no POST api should happen on codfw, they should all go to eqiad k8s [15:57:41] Err, yes [15:57:43] (question mark) [15:57:44] We switched over [15:57:54] oh [15:59:18] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/893002 (owner: 10Jbond) [15:59:53] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:59:56] (03PS2) 10Muehlenhoff: Install pbuilder hook for ICU67 component [puppet] - 
10https://gerrit.wikimedia.org/r/893014 (https://phabricator.wikimedia.org/T329491) [16:00:17] Lucas_WMDE: I saw your message btw [16:00:35] I'm just watching graphs rn, what's happening with wikidata ? [16:00:55] we got an alert email for https://grafana-rw.wikimedia.org/alerting/grafana/MF0FSjJ4z/view [16:01:18] somehow it got resolved on alerts.w.o (idk how), but it’s still Firing on grafana [16:01:34] and I assume it’s because it’s querying eqiad prometheus/k8s and should now query codfw prometheus/k8s [16:01:48] (assuming the job queue is now processed in coqfw?) [16:01:53] *codfw [16:02:30] hmm I don't know about what it should be querying tbh [16:02:33] godog may [16:02:37] but I haven’t felt bold enough to edit it myself yet ;) (though I seem to have the permission to) [16:02:49] and idk if that would be the wikidata team’s responsibility or someone else’s [16:03:26] mhh I'll take a look [16:03:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/893014 (https://phabricator.wikimedia.org/T329491) (owner: 10Muehlenhoff) [16:03:33] thanks both! [16:03:40] Lucas_WMDE: yes the job queue is in codfw right now [16:03:44] ah no wait, not yet [16:03:52] that's for tomorrow [16:03:57] hm [16:04:08] some things are indeed in codfw [16:04:15] but not the jobqueue [16:04:49] we definitely need a dashboard with pictures for silly people like me [16:04:50] but https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=now-6h&to=now looks like something moved from eqiad to codfw ca. 
1½h ago [16:05:19] eventgate [16:05:39] but the jobs themselves aren't being run in codfw yet [16:05:54] the events however are being routed to codfw and inserted into kafka there [16:05:57] !log hnowlan@deploy2002 Finished deploy [restbase/deploy@5271b8f]: New wikis: gucwp, gurwp, vewikimedia T320899 T326237 T327843 (duration: 15m 38s) [16:06:00] I see [16:06:07] T326237: Add gucwiki to RESTBase - https://phabricator.wikimedia.org/T326237 [16:06:08] T320899: Add vewikimedia to RESTBase - https://phabricator.wikimedia.org/T320899 [16:06:09] T327843: Add gurwiki to RESTBase - https://phabricator.wikimedia.org/T327843 [16:06:14] and I guess the alert is looking at that [16:06:52] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ores: change monitoring for the service [puppet] - 10https://gerrit.wikimedia.org/r/893008 (owner: 10Elukey) [16:07:15] (03CR) 10Urbanecm: GrowthExperiments: Enable Growth features by default on testwikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892529 (https://phabricator.wikimedia.org/T330748) (owner: 10Urbanecm) [16:08:03] Lucas_WMDE: the easiest fix I can think of right now is switching to "thanos" datasource, that'll query both k8s/eqiad and k8s/codfw (in fact it'll query all prometheus, not only k8s) [16:08:13] * Lucas_WMDE looks [16:08:43] that seems to produce an uninterrupted curve over the past 6h at least [16:08:55] with a spike where the switch happened, but that doesn’t sound wrong to me [16:09:05] that there would briefly have been a delay in job processing [16:09:24] !log Switching netbox back to eqiad - T330651 [16:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:31] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [16:09:41] I’m not familiar with editing alerts in general – is that something I should !log here? [16:09:56] or just save the change? 
[16:10:07] 10SRE, 10serviceops, 10Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Volans) Netbox was switched too to codfw as part of the discovery services switch and appears to be quite slow. This setup has not been tested properly (and the DB was n... [16:10:09] Lucas_WMDE: mind opening a task to #observability re: the "disappeared" alert ? to your question I think !log is a nice courtesy but not required before saving the change [16:10:28] will do, and thanks [16:10:50] thank you! [16:10:50] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route pool netbox in eqiad: T330651 [16:10:51] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet on all recursors [16:10:55] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet on all recursors [16:10:59] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:25] this is a race, it was probably running [16:11:26] I'll check it [16:11:57] !log changed data source of https://grafana-rw.wikimedia.org/alerting/grafana/MF0FSjJ4z/view from “eqiad prometheus/k8s” to “thanos” to query both eqiad and codfw after dc switch [16:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:27] netbox_ganeti_codfw_sync.service above {done} all good [16:12:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:31] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:26] godog: T330770 [16:14:27] T330770: Investigate DispatchChanges Normal job backlog time (mean 
avg, 15min) alert post datacenter switch - https://phabricator.wikimedia.org/T330770 [16:14:42] and thanks godog and akosiaris for the help :) [16:15:51] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) pool netbox in eqiad: T330651 [16:15:52] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool netbox in codfw: T330651 [16:15:54] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet on all recursors [16:15:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet on all recursors [16:15:58] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [16:16:05] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [16:16:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [16:16:43] 10SRE, 10Wikimedia-Site-requests, 10Serbian-Sites, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444 (10Ioacc1234red) 05Resolved→03Open Check T324545. 
[16:17:05] (ConfdResourceFailed) firing: (3) confd resource _var_lib_gdnsd_discovery-netbox.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:19:38] 10SRE, 10Wikimedia-Site-requests, 10Serbian-Sites, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444 (10Zabe) 05Open→03Resolved see T172284#8653345 [16:20:23] 10SRE, 10serviceops, 10Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [16:20:54] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool netbox in codfw: T330651 [16:21:01] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [16:21:43] 10SRE, 10Wikimedia-Site-requests, 10Serbian-Sites, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444 (10Ioacc1234red) 05Resolved→03Open Check T324545#8652803. [16:21:51] Lucas_WMDE: cheers [16:22:05] (ConfdResourceFailed) resolved: (4) confd resource _var_lib_gdnsd_discovery-netbox.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:25:50] 10SRE, 10Wikimedia-Site-requests, 10Serbian-Sites, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444 (10Urbanecm) 05Open→03Resolved Issue reported in this task has been resolved. For new issues, there are other tasks. Thanks. 
[16:27:13] (03PS2) 10Filippo Giunchedi: Add debianization [debs/pint] - 10https://gerrit.wikimedia.org/r/892992 (https://phabricator.wikimedia.org/T309182) [16:27:14] 10SRE, 10Wikimedia-Site-requests, 10Serbian-Sites, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444 (10Urbanecm) [16:33:15] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1006.eqiad.wmnet with reason: host reimage [16:33:53] (03PS3) 10RLazarus: mediawiki-cache-warmup: Add POSTs [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) [16:34:48] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:34:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) Whatever array you create first on the new 15th gen raid controller gets the higher ID, which is the opposite of 14th generation raid controller. I recalled it was... 
[16:35:40] (03PS1) 10Ladsgroup: Convert eval script to Maintenance class [core] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/892971 [16:35:50] (03CR) 10RLazarus: mediawiki-cache-warmup: Add POSTs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus) [16:36:09] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:36:23] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1006.eqiad.wmnet with reason: host reimage [16:36:33] (03PS1) 10JMeybohm: admin_ng: Add default-network-policy globally [deployment-charts] - 10https://gerrit.wikimedia.org/r/893018 (https://phabricator.wikimedia.org/T275035) [16:36:36] (03PS1) 10JMeybohm: Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) [16:38:04] !log stale discovery files wiped for netbox - T330651 [16:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:10] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [16:38:25] (03CR) 10CI reject: [V: 04-1] Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) (owner: 10JMeybohm) [16:40:45] (03CR) 10Herron: [C: 03+1] Add debianization [debs/pint] - 10https://gerrit.wikimedia.org/r/892992 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:41:17] (03PS4) 10Jbond: cumin: update check_puppet_run_changes to remove inactive alerts hosts [puppet] - 10https://gerrit.wikimedia.org/r/893002 [16:41:19] (03PS1) 10Jbond: 
check_puppet_run_changes: refactor and switch to pql [puppet] - 10https://gerrit.wikimedia.org/r/893022 [16:42:03] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover: 28 February 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 (10Clement_Goubert) 05Open→03Resolved [16:42:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [16:43:05] (03PS2) 10JMeybohm: Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) [16:43:20] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [16:43:32] 10SRE, 10serviceops, 10Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) 05Open→03Resolved Marking as Resolved for now, will reopen in a week (or whenever `restbase-async` wants us to switch it back). [16:43:49] jouncebot: nowandnext [16:43:49] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [16:43:49] In 0 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1700) [16:43:52] (03CR) 10Ladsgroup: "tested heavily in mwdebug1002, fixes everything and doesn't break calling to mwscripts." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [16:44:28] (03CR) 10Ladsgroup: [C: 03+2] Convert eval script to Maintenance class [core] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/892971 (owner: 10Ladsgroup) [16:44:31] (03CR) 10Zabe: [C: 03+1] mwscript: Switch to use run.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [16:44:33] Amir1: I'm all done with switchovers [16:44:51] awesome. Thanks. Sorry didn't see it in the logs [16:45:01] !log Traffic and Service switchovers to codfw finished - T330651 - T330650 [16:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:10] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [16:45:10] T330650: 28 February 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 [16:45:15] Amir1: No worries I didn't have time to log it because I got netsplat [16:47:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/892971 (owner: 10Ladsgroup) [16:47:45] (03CR) 10Volans: "Simpler solution proposed inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/892999 (owner: 10Giuseppe Lavagetto) [16:49:03] (03PS3) 10Jbond: check_puppet_run_changes: use black [puppet] - 10https://gerrit.wikimedia.org/r/893005 [16:49:13] (03PS5) 10Jbond: cumin: update check_puppet_run_changes to remove inactive alerts hosts [puppet] - 10https://gerrit.wikimedia.org/r/893002 [16:50:41] (03PS6) 10Jbond: cumin: update check_puppet_run_changes to remove inactive alerts hosts [puppet] - 10https://gerrit.wikimedia.org/r/893002 [16:52:15] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [16:55:20] (03PS7) 10Jbond: cumin: update check_puppet_run_changes to remove inactive alerts hosts [puppet] - 10https://gerrit.wikimedia.org/r/893002 [16:55:32] (03PS2) 10Jbond: check_puppet_run_changes: refactor and switch to pql [puppet] - 10https://gerrit.wikimedia.org/r/893022 [16:58:09] (03CR) 10Herron: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/892965 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:59:14] (03PS3) 10Jbond: check_puppet_run_changes: refactor and switch to pql [puppet] - 10https://gerrit.wikimedia.org/r/893022 [17:00:04] jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:13] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [17:00:18] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dumpsdata1006.eqiad.wmnet with OS bullseye [17:00:19] (03CR) 10Ayounsi: [C: 03+1] "lgtm, let me know if you need help rolling it out." 
[homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [17:00:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye completed: - dumpsdata1006 (**PASS*... [17:01:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [17:01:54] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: private repo deployment - perccli implementation - https://phabricator.wikimedia.org/T308027 (10RobH) [17:01:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) 05In progress→03Resolved a:05RobH→03ArielGlenn both hosts online and ready for your use! [17:02:24] cdanis: aokoth: nothing to report ( cc Emperor ) [17:03:21] (03Merged) 10jenkins-bot: Convert eval script to Maintenance class [core] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/892971 (owner: 10Ladsgroup) [17:03:56] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:892971|Convert eval script to Maintenance class]] [17:05:48] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:892971|Convert eval script to Maintenance class]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [17:07:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2070'] [17:07:30] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2070'] [17:09:12] (03CR) 10Herron: [C: 03+1] prometheus: add pint support [puppet] - 10https://gerrit.wikimedia.org/r/892986 
(https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [17:09:21] (03CR) 10Herron: [C: 03+1] prometheus: add pint source to ops [puppet] - 10https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [17:09:57] exit [17:10:00] heh :) [17:10:13] didn't work :D [17:10:25] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:43] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:892971|Convert eval script to Maintenance class]] (duration: 07m 47s) [17:13:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [17:13:52] (03Merged) 10jenkins-bot: mwscript: Switch to use run.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [17:14:17] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:889259|mwscript: Switch to use run.php (T326800)]] [17:14:24] T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800 [17:14:44] (03CR) 10Vgutierrez: [C: 03+1] ores: change monitoring for the service [puppet] - 10https://gerrit.wikimedia.org/r/893008 (owner: 10Elukey) [17:16:27] thanks jbond <3 [17:18:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2070'] [17:23:48] (03PS3) 10Volans: sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/892487 [17:23:50] (03PS2) 10Volans: sre.hosts.reimage: expand help message for --os [cookbooks] - 10https://gerrit.wikimedia.org/r/892923 [17:23:52] (03PS1) 10Volans: sre.aqs.roll-restart-reboot: fix file path [cookbooks] - 
10https://gerrit.wikimedia.org/r/893026 [17:26:29] (03CR) 10Elukey: [C: 03+1] sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/892487 (owner: 10Volans) [17:26:41] (03CR) 10Elukey: [C: 03+1] sre.hosts.reimage: expand help message for --os [cookbooks] - 10https://gerrit.wikimedia.org/r/892923 (owner: 10Volans) [17:31:31] wamp, `MWException: Error: invalid magic word 'pendingchangelevel'` on the *beta cluster*, who pushed a bad change hmm? [17:33:09] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:889259|mwscript: Switch to use run.php (T326800)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [17:33:14] (03CR) 10FNegri: [C: 03+1] "I discussed this with Andrew, let's wait a few more days before merging it." [dns] - 10https://gerrit.wikimedia.org/r/892901 (https://phabricator.wikimedia.org/T323159) (owner: 10Majavah) [17:33:16] T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800 [17:36:04] (03CR) 10Elukey: [C: 03+1] sre.aqs.roll-restart-reboot: fix file path [cookbooks] - 10https://gerrit.wikimedia.org/r/893026 (owner: 10Volans) [17:36:52] (03PS2) 10Volans: sre.aqs.roll-restart-reboot: fix file path [cookbooks] - 10https://gerrit.wikimedia.org/r/893026 [17:36:56] (03CR) 10Volans: [C: 03+2] sre.aqs.roll-restart-reboot: fix file path [cookbooks] - 10https://gerrit.wikimedia.org/r/893026 (owner: 10Volans) [17:38:22] !log ladsgroup@deploy2002 Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. [17:38:22] !log ladsgroup@deploy2002 scap failed: RuntimeError Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. 
(duration: 24m 05s) [17:38:52] (03Merged) 10jenkins-bot: sre.aqs.roll-restart-reboot: fix file path [cookbooks] - 10https://gerrit.wikimedia.org/r/893026 (owner: 10Volans) [17:39:38] ah okay, 889259 was bad? [17:40:26] (03PS3) 10Volans: sre.hosts.reimage: expand help message for --os [cookbooks] - 10https://gerrit.wikimedia.org/r/892923 [17:41:03] TheresNoTime: needs sync order I think [17:41:03] sigh [17:41:11] I need a drink [17:41:17] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:41:32] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: expand help message for --os [cookbooks] - 10https://gerrit.wikimedia.org/r/892923 (owner: 10Volans) [17:41:39] (03PS2) 10ArielGlenn: for dumpsdata1004 through 1007 use the partman recipe for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/892437 (https://phabricator.wikimedia.org/T330573) [17:41:44] it works well in mwdebug [17:41:54] (03PS1) 10Nicolas Fraison: hive: Fix max metaspace size of hiveserver2 to 512m [puppet] - 10https://gerrit.wikimedia.org/r/893029 [17:42:10] sigh, well, it doesn't [17:42:32] (03PS1) 10TrainBranchBot: Revert "mwscript: Switch to use run.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893030 [17:42:34] (03CR) 10TrainBranchBot: "ladsgroup@deploy2002 created a revert of this change as I81175e27611ce7270e27850953892eb5ef15e8db" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [17:42:44] are you sure, going to en.wikipedia.org through mwdebug currently throws "MediaWiki internal error." 
[17:43:03] zabe: so I tested mwscript in mwdebug [17:43:06] (03PS2) 10Nicolas Fraison: hive: Fix max metaspace size of hiveserver2 to 512m [puppet] - 10https://gerrit.wikimedia.org/r/893029 (https://phabricator.wikimedia.org/T303168) [17:43:15] and that worked fine [17:43:21] 🙃 [17:43:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893030 (owner: 10TrainBranchBot) [17:43:31] (03Merged) 10jenkins-bot: sre.hosts.reimage: expand help message for --os [cookbooks] - 10https://gerrit.wikimedia.org/r/892923 (owner: 10Volans) [17:43:32] I didn't see why it could break the appserver [17:43:34] yeah ok, tbh, I also don't really get where this effect is comming from [17:43:49] yeah, I need to debug [17:44:02] gonna revert the revert and start debugging in mwdebug1002 [17:44:09] (03Merged) 10jenkins-bot: Revert "mwscript: Switch to use run.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893030 (owner: 10TrainBranchBot) [17:44:26] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39874/console" [puppet] - 10https://gerrit.wikimedia.org/r/893029 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [17:44:29] noted on T330779, *how* that broke MagicWords is beyond me [17:44:29] T330779: en.wikipedia.beta.wmflabs.org — MWException: Error: invalid magic word 'pendingchangelevel' - https://phabricator.wikimedia.org/T330779 [17:44:31] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:893030|Revert "mwscript: Switch to use run.php"]] [17:47:32] (03PS1) 10ArielGlenn: Add dumpsdata1004 and dumpsdata1005 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 (https://phabricator.wikimedia.org/T330573) [17:49:00] 10ops-eqiad, 10decommission-hardware: decommission frpm1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T329752 
(10Jgreen) a:05Jgreen→03None [17:50:06] gosh why this thing is so slow again [17:50:39] * urbanecm asks that often when deploying [17:50:56] i sort of miss the times when scap sync-file took less than a minute. scap backport's much more convenient though. [17:51:33] urbanecm: sync-file would be slow as well, the source of slowness right now is building k8s images [17:51:50] indeed. that's why i said that i miss those times :) [17:52:43] took 35 minutes just to reach the canaries (from the patch being merged) [17:52:55] akosiaris: claime I think this thing broke again [17:53:54] so now we have a partial outage for the next 35 min, I guess [17:54:01] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frpm1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T329752 (10Jgreen) a:03Jclark-ctr Frpm1001 is powered off and all set for decommissioning. Please note, per prior discussion with Willy I did not run the cookbook stuff since this is a f... [17:54:44] pybal should depool them, doesn't? [17:54:50] sigh [17:55:00] (it's just canaries) [17:55:27] no it doesn't, we a have a nice error rate of ~300 per min [17:56:11] Amir1: what broke ? [17:56:17] scap ? [17:56:21] yup [17:56:29] 35 minutes just to reach canaries [17:56:53] lemme have a look [17:57:06] and now I'm syncing a revert and it's taken 11 minutes already and not even testservers yet [17:57:35] can we depool the canaries manually in the meantime? [17:57:48] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:57:53] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:58:17] https://phabricator.wikimedia.org/P44904 These are [17:59:13] Looks like l10n files have been updated. [17:59:24] Always a recipe for a long deployemtn [17:59:27] *deployment [18:00:00] dancy: does php array files make it faster instead of cdb? 
[18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1800) [18:00:06] I can't say for sure, maybe [18:00:49] I would support depooling canaries, they make up something between 5 and 10 percent of all appservers, so we basically have a 5-10% outage at the moment [18:00:51] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2071'] [18:01:22] okay, let me see [18:03:05] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1449.eqiad.wmnet [18:03:13] depooled one [18:03:24] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1414.eqiad.wmnet [18:03:33] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1450.eqiad.wmnet [18:03:45] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1447.eqiad.wmnet [18:03:51] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1448.eqiad.wmnet [18:03:59] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1415.eqiad.wmnet [18:04:06] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1416.eqiad.wmnet [18:04:13] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1417.eqiad.wmnet [18:04:20] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1418.eqiad.wmnet [18:04:21] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) [18:04:25] all depooled [18:05:01] Amir1: The CDB files are not drastically larger than their PHP counterparts. PHP l10n files might rsync more efficiently though. Regardless of rsync efficiency, if all of the l10n files change (which can happen if just a single English source l10n file changes), that's several gigabytes of data to store in the container image and deploy. 
[18:06:09] I am not clear on why the revert is taking a long time [18:06:19] and why depooling the canaries helps [18:06:35] <_joe_> I am not clear either on why it's taking a long time [18:06:47] the problematic patch got stopped at canaries [18:06:52] <_joe_> we already deployed from deploy2002 so any change to the mw image should be small [18:06:55] didn't roll out further [18:06:57] <_joe_> ok [18:06:59] Scap doesn't keep an old copy, so a revert is a resync of older content. [18:07:05] <_joe_> why is the rollback slow then? [18:07:14] <_joe_> what is being slow [18:07:42] It started at :42 [18:07:55] <_joe_> is it the image builds? [18:07:57] <_joe_> or what? [18:07:58] the revert started at :42, it's :07 now [18:08:08] let me grab logs [18:08:24] am i needed or can i stay in hibernation? [18:08:28] <_joe_> dancy: are you taking a look too? [18:08:31] <_joe_> claime: go away [18:08:34] _joe_: yeah [18:08:40] Error creating: pods "mediawiki-main-5d88c88555-hd9b8" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=8250m, used: limits.cpu=123750m, limited: limits.cpu=125 [18:08:44] _joe_: ^ [18:08:50] ok I got something from mw-web [18:08:51] <_joe_> uh this is new [18:08:52] _joe_: ok [18:08:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2071'] [18:09:05] <_joe_> akosiaris: are we doing anything with the k8s nodes? 
[18:09:07] but also right before the deploy I deployed a backport [18:09:24] <_joe_> because this doesn't make much sense [18:09:27] https://phabricator.wikimedia.org/P44905 [18:09:37] logs of all deploys [18:09:49] that's eqiad btw [18:09:55] I don't see anything similar in codfw [18:10:19] but I do get some weird error [18:10:22] errors* [18:10:47] no wait, "warnings" [18:10:52] mw-api-int 33m Warning FailedMount pod/mediawiki-canary-5496b45666-jc29z MountVolume.SetUp failed for volume "mediawiki-canary-httpd-sites" : failed to sync configmap cache: timed out waiting for the condition [18:10:52] mw-api-int 33m Warning FailedMount pod/mediawiki-canary-5496b45666-jc29z MountVolume.SetUp failed for volume "mediawiki-canary-mcrouter" : failed to sync configmap cache: timed out waiting for the condition [18:11:08] I depooled canaries, we are still getting errors it seems [18:11:09] (03PS1) 10Andrew Bogott: OpenStack: rename 'user' role to 'member' [puppet] - 10https://gerrit.wikimedia.org/r/893036 (https://phabricator.wikimedia.org/T330759) [18:11:16] I don't know why but change 889259 resulted in l10n rebuild.. .and so did reverting it [18:11:24] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&viewPanel=63&from=now-3h&to=now [18:11:24] <_joe_> sigh ok [18:11:31] <_joe_> now we're past the test servers, right? [18:11:43] not yet [18:11:54] still at sync-testservers [18:12:20] <_joe_> ... 
[18:12:21] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2072'] [18:12:39] just jumped 25% [18:12:44] 75% [18:12:53] I hope it reaches canaries soon [18:13:04] !log ladsgroup@deploy2002 trainbranchbot and ladsgroup: Backport for [[gerrit:893030|Revert "mwscript: Switch to use run.php"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [18:14:22] no finally at sync-canaries [18:14:29] 0% though [18:14:56] <_joe_> I would assume it's some issue with deploy2002 at this point [18:15:06] <_joe_> unless we're syncing a ton of stuff [18:15:21] <_joe_> which might be what dancy was saying, we're syncing all l10n files [18:15:50] it really doesn't make sense, I backported a change like one minute before it [18:16:11] <_joe_> why did this change trigger a full rebuild of l10n is puzzling [18:16:18] very puzzling.. [18:16:23] no l10n source files were affected [18:17:10] <_joe_> but also, it's 7.15 pm. I have to prepare dinner. [18:17:21] why do you think there was a l10n rebuild, in Amirs logs it says "0 languages rebuilt out of 477" and l10n-update finished within 4 seconds [18:17:39] sorry [18:17:42] wrong part [18:17:46] I'm silent [18:18:31] (03PS6) 10Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) [18:19:05] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10Papaul) @ayounsi @cmooney this task was for Inbound interface errors on vcp-255/0/48 - vcp-255/0/48 (asw-b-codfw) and got rename today when we did have another Inbound interface errors for ge-6/0/6 on asw-c6-codfw. it supposed... 
[18:20:14] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2072'] [18:20:26] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2073'] [18:20:44] okay we are in a normal state, I repool them, I don't know what I did wrong that the depool didn't work either [18:21:08] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1418.eqiad.wmnet [18:21:18] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:21:38] <_joe_> Amir1: FWIW, I think the error rate was from pybal checks [18:21:46] <_joe_> so the servers were already depooled from traffic [18:22:00] aaah [18:22:02] nice [18:22:19] (03CR) 10BCornwall: [C: 03+2] config: Add brett for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [18:22:33] <_joe_> see https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-site=All&var-cluster=appserver&var-node=mw1418&var-php_version=proxy:unix:%2Frun%2Fphp%2Ffpm-www.%2A&viewPanel=92 [18:22:33] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1449.eqiad.wmnet [18:22:46] (03CR) 10BCornwall: [V: 03+2 C: 03+2] config: Add brett for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [18:22:47] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1414.eqiad.wmnet [18:22:49] <_joe_> so our safety checks saved us in this case it seems :) [18:22:55] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1450.eqiad.wmnet [18:23:08] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; 
selector: name=mw1447.eqiad.wmnet [18:23:14] <_joe_> now I really need to go afk [18:23:16] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1448.eqiad.wmnet [18:23:21] See ya joe [18:23:26] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1415.eqiad.wmnet [18:23:34] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1416.eqiad.wmnet [18:23:39] I'll spend some time seeing if I can figure out why there was unexpected l10n rebuild [18:23:42] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1417.eqiad.wmnet [18:23:50] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: name=mw1418.eqiad.wmnet [18:24:54] thank you <3 [18:25:40] why it broke itself is also baffling to me, the code only touches CLI stuff and that worked well. I think it might be sync order [18:26:06] Order of what? [18:26:35] two files have changes, one depends on the other, order of sync is random [18:26:43] it has more outages that I can count [18:26:51] l10n rebuild happens before sync is involved. 
[18:26:52] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:893030|Revert "mwscript: Switch to use run.php"]] (duration: 42m 21s) [18:27:17] 42m 21s 😭 [18:27:44] dancy: sorry to confuse you, I mean the original patch I was deployed that broke canaries [18:28:01] ah, gotcha [18:28:18] Amir: Butchering English language as long as he remembers [18:28:53] anyway, I go rest a bit, will check it later [18:31:09] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:33:29] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:34:48] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:25] topranks: Looks like https://gerrit.wikimedia.org/r/c/operations/homer/public/+/889635 hasn't been applied via homer yet, and I'm getting concerning changes I don't want to apply :) https://paste.debian.net/plain/1272463 [18:40:06] It looks like previously-disabled interfaces are getting set up because they're not defined in the templating. Is this intended, and should I be applying these changes? 
[18:40:39] or rather, netbox changes outside the homer change is causing these interface diffs [18:45:25] (03PS3) 10Bas dehaan: Added extended confirmed on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) [18:49:01] (03CR) 10Bas dehaan: Added extended confirmed on nlwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [18:49:10] (03CR) 10BryanDavis: [C: 03+1] Drop Tomcat support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/888296 (https://phabricator.wikimedia.org/T141396) (owner: 10Majavah) [18:55:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2073'] [18:58:22] (03CR) 10Bas dehaan: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [19:01:09] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:48] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:08:49] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10KFrancis) Hi all, the agreement is out for signatures. I'll confirm when it's complete. Thanks! 
[19:10:40] (03CR) 10Majavah: [C: 03+2] Drop Tomcat support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/888296 (https://phabricator.wikimedia.org/T141396) (owner: 10Majavah) [19:11:54] (03Merged) 10jenkins-bot: Drop Tomcat support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/888296 (https://phabricator.wikimedia.org/T141396) (owner: 10Majavah) [19:16:31] brett: hey [19:16:40] that patch you list was merged, so should be applied everywhere [19:17:33] topranks: I'm concerned about the out-of-band changes via netbox. Are you saying that we should just accept them when they come up with the assumption that they're safe? [19:17:54] Those changes are because papaul added that device to Netbox a short time ago [19:17:55] https://www.irccloud.com/irc/libera.chat/channel/wikimedia-operations [19:18:07] sry, wrong link: https://netbox.wikimedia.org/dcim/devices/4630/changelog/ [19:18:47] Usual pattern is dc-ops would add the new server to Netbox (and specify switch port) after cabling it up physically [19:19:02] They then run Homer to configure the switch, but seems like that hasn't happened in this case [19:19:17] It's safe to apply. [19:19:24] gracias [19:19:28] What changes are you making on asw-b-codfw ? 
[19:19:46] I'm adding myself as a user so I can do some "depooling" during reimaging [19:20:15] ah yeah that makes sense - the user add [19:20:23] (CR) Ryan Kemper: [C: +2] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167) (owner: Ryan Kemper) [19:20:30] I seen the discussion on that, I think the changes you will need to make are just on _routers_ [19:20:55] But you should run Homer against all devices, so makes sense you need to modify asw-b-codfw [19:21:16] !log [WDQS] (The following was ~20 hours ago, forgot to press enter) T301167 Transferred `/srv/wdqs/categories.jnl` from `wdqs2001` (in-service host) to `wdqs20[09-12]` (new hosts being brought into service) [19:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:22] brett: anyhow safe to proceed with that, let me know if any problems [19:21:23] T301167: Service implementation for wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T301167 [19:21:51] !log [WDQS] (Current time) T301167 Re-enabled icinga notifications for `wdqs20[09-12]` [19:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2073'] [19:28:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2073'] [19:29:41] Amir1, _joe_: I added notes: https://phabricator.wikimedia.org/P44905#182569 [19:30:06] & zabe [19:30:19] <_joe_> dancy: daaaamn [19:30:42] <_joe_> good catch [19:31:16] (PS4) Ryan Kemper: wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) [19:35:16] (PS5) Ryan Kemper: wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606
(https://phabricator.wikimedia.org/T323064) [19:35:32] (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/893047 [19:36:13] (CR) Ahmon Dancy: [C: +2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/893047 (owner: Ahmon Dancy) [19:37:00] (Merged) jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/893047 (owner: Ahmon Dancy) [19:51:22] !log dancy@deploy2002 Installing scap version "latest" for 550 hosts [19:51:48] (CR) Gehel: [C: +1] "LGTM, tested on grafana" [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [19:52:11] (CR) Ryan Kemper: "Tested here: https://grafana.wikimedia.org/dashboard/snapshot/pbpobbTGLIm1wOp2IbXla8xnH1qM6rJp?orgId=1" [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [19:54:25] !log dancy@deploy2002 Started scap: testing [19:58:22] (PS6) Ryan Kemper: wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) [20:00:37] (PS7) Ryan Kemper: wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) [20:02:16] (CR) Ryan Kemper: "Actually I'd accidentally rebased not onto the origin so hadn't really rebased at all.
New preview: https://grafana.wikimedia.org/dashboar" [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [20:02:29] (CR) Ryan Kemper: [C: +2] wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [20:02:35] (CR) Ryan Kemper: [V: +2 C: +2] wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [20:08:23] SRE-OnFire, Discovery-Search (Current work), Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (RKemper) [20:10:09] SRE-OnFire, Discovery-Search (Current work), Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (RKemper) [20:10:54] SRE-OnFire, Discovery-Search (Current work), Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (RKemper) [20:14:53] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:15:10] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:16:01] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:16:16] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:19:07] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:20:02] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:20:03] (PS1) ArielGlenn: delay start of the March xml dump rn unti the evening [puppet] - https://gerrit.wikimedia.org/r/893055 (https://phabricator.wikimedia.org/T330573) [20:25:27] jouncebot: nowandnext [20:25:28] No deployments scheduled for the next 0 hour(s) and 34 minute(s) [20:25:28] In 0 hour(s) and 34 minute(s):
UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T2100) [20:25:38] RECOVERY - Host ms-fe2013 is UP: PING OK - Packet loss = 0%, RTA = 31.54 ms [20:25:44] (CR) Zabe: [C: +2] MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation [core] (wmf/1.40.0-wmf.25) - https://gerrit.wikimedia.org/r/892967 (https://phabricator.wikimedia.org/T330746) (owner: Zabe) [20:25:50] (CR) Zabe: [C: +2] MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/892966 (https://phabricator.wikimedia.org/T330746) (owner: Zabe) [20:37:06] (CR) Ryan Kemper: [V: +2 C: +2] "Thanks for all the help!" [grafana-grizzly] - https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [20:37:29] (Abandoned) Ryan Kemper: wdqs: fix request request error ratio sli pane [grafana-grizzly] - https://gerrit.wikimedia.org/r/867695 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [20:43:17] (Merged) jenkins-bot: MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation [core] (wmf/1.40.0-wmf.25) - https://gerrit.wikimedia.org/r/892967 (https://phabricator.wikimedia.org/T330746) (owner: Zabe) [20:43:23] (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy2002 using scap backport" [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/892966 (https://phabricator.wikimedia.org/T330746) (owner: Zabe) [20:43:29] (Merged) jenkins-bot: MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/892966 (https://phabricator.wikimedia.org/T330746) (owner: Zabe) [20:43:54] !log zabe@deploy2002 Started scap: Backport for [[gerrit:892967|MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation (T330746 T321881)]], [[gerrit:892966|MessagesGuc: Remove trailing space from
NS_TEMPLATE_TALK translation (T330746 T321881)]] [20:44:03] T321881: Add namespace translations in Wayuu - https://phabricator.wikimedia.org/T321881 [20:44:03] T330746: MediaWiki\Page\PageAssertionException: The given PageIdentity {pageIdentity} does not represent a proper page - https://phabricator.wikimedia.org/T330746 [20:45:22] (CR) Superpes15: "Recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: Bas dehaan) [20:45:51] !log zabe@deploy2002 zabe: Backport for [[gerrit:892967|MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation (T330746 T321881)]], [[gerrit:892966|MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation (T330746 T321881)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:50:12] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:27] (PS1) Zabe: close testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/893058 (https://phabricator.wikimedia.org/T213295) [20:54:00] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:892967|MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation (T330746 T321881)]], [[gerrit:892966|MessagesGuc: Remove trailing space from NS_TEMPLATE_TALK translation (T330746 T321881)]] (duration: 10m 06s) [20:54:08] T321881: Add namespace translations in Wayuu - https://phabricator.wikimedia.org/T321881 [20:54:09] T330746: MediaWiki\Page\PageAssertionException: The given PageIdentity {pageIdentity} does not represent a proper page - https://phabricator.wikimedia.org/T330746 [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T2100). [21:00:05] edsanders and subbu: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:24] * TheresNoTime can deploy [21:00:34] o/ [21:00:49] zabe: are you still deploying? [21:00:59] nope [21:01:11] (PS2) Samtar: Disable VectorPromoteAddTopic on production wikis initially [mediawiki-config] - https://gerrit.wikimedia.org/r/891836 (https://phabricator.wikimedia.org/T267444) (owner: Esanders) [21:02:15] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/891836 (https://phabricator.wikimedia.org/T267444) (owner: Esanders) [21:02:58] (Merged) jenkins-bot: Disable VectorPromoteAddTopic on production wikis initially [mediawiki-config] - https://gerrit.wikimedia.org/r/891836 (https://phabricator.wikimedia.org/T267444) (owner: Esanders) [21:03:25] !log samtar@deploy2002 Started scap: Backport for [[gerrit:891836|Disable VectorPromoteAddTopic on production wikis initially (T267444)]] [21:03:32] T267444: Make the affordance(s) for adding a new topic easier to identify and access (Vector 2022) - https://phabricator.wikimedia.org/T267444 [21:03:44] SRE, Infrastructure-Foundations, serviceops-collab, CAS-SSO, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (demon) From the looks of it, we can add OIDC as a second [omniauth provider](https://docs.gitlab.com/ee/integration/omniauth.html). We...
[21:05:20] !log samtar@deploy2002 esanders and samtar: Backport for [[gerrit:891836|Disable VectorPromoteAddTopic on production wikis initially (T267444)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:05:31] subbu: that's live on mwdebug, are you able to test? [21:05:58] And, TheresNoTime ed says: "it's a new feature that hasn't been enabled before" ... and this config patch simply sets the default to false so that the new feature isn't enabled anywhere. So, I am not sure there is a place to test ... although this week the train rolled out to group0 already .. so let me see if this disables anything on testwiki. [21:07:37] ack, is there anything you want to check before I sync? [21:07:54] actually, yes, it works ... on https://test.wikipedia.org/wiki/Talk:Testesda .. on mwdebug on codfw, the 'add topic' button disappears after this patch. [21:07:56] so, good to go. [21:08:08] syncing [21:08:16] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:14:01] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:891836|Disable VectorPromoteAddTopic on production wikis initially (T267444)]] (duration: 10m 36s) [21:14:08] T267444: Make the affordance(s) for adding a new topic easier to identify and access (Vector 2022) - https://phabricator.wikimedia.org/T267444 [21:14:08] subbu: that's live :) [21:14:40] thanks! [21:16:08] TheresNoTime: are you still deploying? 
[21:16:15] I'd like to push sth out if not [21:16:20] urbanecm: nope, go ahead :) [21:16:23] ty [21:17:15] (PS4) Urbanecm: GrowthExperiments: Enable Growth features by default on testwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892529 (https://phabricator.wikimedia.org/T330748) [21:18:00] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/892529 (https://phabricator.wikimedia.org/T330748) (owner: Urbanecm) [21:18:42] (Merged) jenkins-bot: GrowthExperiments: Enable Growth features by default on testwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892529 (https://phabricator.wikimedia.org/T330748) (owner: Urbanecm) [21:19:07] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:892529|GrowthExperiments: Enable Growth features by default on testwikis (T330748)]] [21:19:14] T330748: Enable Growth features by default on test and test2.wikipedia.org - https://phabricator.wikimedia.org/T330748 [21:26:50] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:892529|GrowthExperiments: Enable Growth features by default on testwikis (T330748)]] (duration: 07m 43s) [21:26:57] T330748: Enable Growth features by default on test and test2.wikipedia.org - https://phabricator.wikimedia.org/T330748 [21:32:27] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@fc4e023]: Deploying section_image_recommendations DAG to platform_eng Airflow instance [21:32:48] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@fc4e023]: Deploying section_image_recommendations DAG to platform_eng Airflow instance (duration: 00m 21s) [21:42:39] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:47:28] SRE, ops-eqiad, decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (Jclark-ctr) [21:47:36] SRE, ops-eqiad, decommission-hardware:
decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (Jclark-ctr) Open→Resolved [21:48:22] SRE, ops-eqiad, decommission-hardware: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171 (Jclark-ctr) [21:48:31] SRE, ops-eqiad, decommission-hardware: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171 (Jclark-ctr) Open→Resolved [21:49:12] SRE, ops-eqiad, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q3): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (Jclark-ctr) [21:50:29] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:50:35] SRE, ops-eqiad, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q3): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (Jclark-ctr) Open→Resolved [21:51:44] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:53:27] SRE, ops-eqiad, decommission-hardware, User-fgiunchedi: decommission graphite1004.eqiad.wmnet - https://phabricator.wikimedia.org/T324089 (Jclark-ctr) Open→Resolved [21:53:34] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:55:16] SRE, ops-eqiad, decommission-hardware, Patch-For-Review: decommission restbase-dev100{4,5,6} - https://phabricator.wikimedia.org/T325387 (Jclark-ctr) Open→Resolved [21:55:54] SRE, ops-eqiad, decommission-hardware: decommission db1096.eqiad.wmnet - https://phabricator.wikimedia.org/T329147 (Jclark-ctr) Open→Resolved [21:57:50] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [22:00:22] !log jclark@cumin1001 END (FAIL) - Cookbook
sre.dns.netbox (exit_code=99) [22:01:40] !log started rsync from dumpsdata1001 to dumpsdata1004 of /data/otherdumps, running in ariel screen session, no bandwidth cap [22:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:03] SRE, ops-eqiad, DC-Ops, serviceops-radar: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (Jclark-ctr) [22:21:18] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump-s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:46] (CR) Zabe: [C: +2] close testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/893058 (https://phabricator.wikimedia.org/T213295) (owner: Zabe) [22:23:41] (Merged) jenkins-bot: close testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/893058 (https://phabricator.wikimedia.org/T213295) (owner: Zabe) [22:23:49] !log brennen@deploy2002 Started deploy [phabricator/deployment@3f2dd1b]: debug deploy to aphlict2001 [22:24:26] !log brennen@deploy2002 Finished deploy [phabricator/deployment@3f2dd1b]: debug deploy to aphlict2001 (duration: 00m 37s) [22:25:51] SRE, ops-eqiad, DC-Ops, serviceops-radar: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (Jclark-ctr) Open→Resolved [22:25:56] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Jclark-ctr) [22:31:01] !log zabe@deploy2002 Synchronized dblists/: close testcommonswiki T213295 (duration: 06m 40s) [22:31:08] T213295: Close and delete TestCommons from production - https://phabricator.wikimedia.org/T213295 [22:42:58] (PS1) Zabe: Drop custom testcommonswiki groups [mediawiki-config] - https://gerrit.wikimedia.org/r/893066 (https://phabricator.wikimedia.org/T213295) [22:46:43] !log zabe@deploy2002 Synchronized dblists/: close testcommonswiki T213295 (duration:
07m 11s) [22:46:50] T213295: Close and delete TestCommons from production - https://phabricator.wikimedia.org/T213295 [22:46:54] (CR) Krinkle: [C: +1] Drop custom testcommonswiki groups [mediawiki-config] - https://gerrit.wikimedia.org/r/893066 (https://phabricator.wikimedia.org/T213295) (owner: Zabe) [22:49:40] (PS1) JHathaway: gitignore: ignore vendor/bundle [deployment-charts] - https://gerrit.wikimedia.org/r/893068 (https://phabricator.wikimedia.org/T320554) [22:53:37] (PS1) Dzahn: switch static-codereview.wikimedia.org from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/893069 (https://phabricator.wikimedia.org/T330090) [22:53:58] (CR) Dzahn: [C: +2] "read-only HTML version of MediaWiki's SVN CodeReview system." [puppet] - https://gerrit.wikimedia.org/r/892584 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn) [22:55:18] (CR) Dzahn: [C: +2] "read-only HTML version of MediaWiki's SVN CodeReview system." [puppet] - https://gerrit.wikimedia.org/r/893069 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn) [22:55:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:59:57] (PS1) JHathaway: jaeger: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/893071 (https://phabricator.wikimedia.org/T320554) [23:00:31] (CR) JHathaway: [C: +2] gitignore: ignore vendor/bundle [deployment-charts] - https://gerrit.wikimedia.org/r/893068 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [23:06:09] (PS1) Dzahn: switch transparency-archive.wikimedia.org from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/893072 (https://phabricator.wikimedia.org/T330090) [23:06:49] (CR) Dzahn: [C: +2] switch transparency-archive.wikimedia.org from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/893072
(https://phabricator.wikimedia.org/T330090) (owner: Dzahn) [23:09:11] (Merged) jenkins-bot: gitignore: ignore vendor/bundle [deployment-charts] - https://gerrit.wikimedia.org/r/893068 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [23:09:31] (CR) JHathaway: [C: +2] jaeger: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/893071 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [23:11:06] (CR) Zabe: [C: +2] Drop custom testcommonswiki groups [mediawiki-config] - https://gerrit.wikimedia.org/r/893066 (https://phabricator.wikimedia.org/T213295) (owner: Zabe) [23:12:00] (Merged) jenkins-bot: Drop custom testcommonswiki groups [mediawiki-config] - https://gerrit.wikimedia.org/r/893066 (https://phabricator.wikimedia.org/T213295) (owner: Zabe) [23:12:05] (PS1) Dzahn: switch transparency.wikimedia.org from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/893073 (https://phabricator.wikimedia.org/T330090) [23:12:33] !log zabe@deploy2002 Started scap: Backport for [[gerrit:893066|Drop custom testcommonswiki groups (T213295)]] [23:12:40] T213295: Close and delete TestCommons from production - https://phabricator.wikimedia.org/T213295 [23:12:49] (CR) Dzahn: [C: +2] switch transparency.wikimedia.org from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/893073 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn) [23:13:41] (CR) Krinkle: Start using the ClusterConfig class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/756016 (owner: Giuseppe Lavagetto) [23:14:01] (CR) Krinkle: [C: -1] "I believe this will fail in prod in logging.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/756016 (owner: Giuseppe Lavagetto) [23:14:17] !log zabe@deploy2002 zabe: Backport for [[gerrit:893066|Drop custom testcommonswiki groups (T213295)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet,
mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [23:18:03] (PS6) JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) [23:18:05] (PS1) JHathaway: Run kubeconform on supported versions of charts & envs [deployment-charts] - https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) [23:18:07] (PS1) JHathaway: jaeger: add fixture [deployment-charts] - https://gerrit.wikimedia.org/r/893076 (https://phabricator.wikimedia.org/T320554) [23:18:20] (CR) CI reject: [V: -1] Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [23:18:22] (CR) CI reject: [V: -1] Run kubeconform on supported versions of charts & envs [deployment-charts] - https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [23:18:30] (CR) CI reject: [V: -1] jaeger: add fixture [deployment-charts] - https://gerrit.wikimedia.org/r/893076 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [23:20:30] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:893066|Drop custom testcommonswiki groups (T213295)]] (duration: 07m 57s) [23:20:34] (CR) JHathaway: "kindly review" [deployment-charts] - https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [23:20:37] T213295: Close and delete TestCommons from production - https://phabricator.wikimedia.org/T213295 [23:24:16] !log miscweb2002 rm -rf /srv/org/wikimedia/design/blog/ - this has moved to /srv/org/wikimedia/design-blog but was not deleted in codfw - bringing both to the same state before switching design.wikimedia.org over T330090 [23:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:23] T330090: Switchover static miscweb services to codfw -
https://phabricator.wikimedia.org/T330090 [23:28:47] (PS1) Dzahn: switch design.wikimedia.org from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/893078 (https://phabricator.wikimedia.org/T330090) [23:33:26] (PS1) Zabe: testcommonswiki: Remove some settings which are no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/893080 (https://phabricator.wikimedia.org/T213295) [23:35:40] (CR) Zabe: [C: +2] testcommonswiki: Remove some settings which are no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/893080 (https://phabricator.wikimedia.org/T213295) (owner: Zabe) [23:36:26] (Merged) jenkins-bot: testcommonswiki: Remove some settings which are no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/893080 (https://phabricator.wikimedia.org/T213295) (owner: Zabe) [23:43:54] !log zabe@deploy2002 Synchronized wmf-config/InitialiseSettings.php: T213295 (duration: 06m 56s) [23:44:02] T213295: Close and delete TestCommons from production - https://phabricator.wikimedia.org/T213295 [23:47:44] (CR) Dzahn: [C: +2] switch design.wikimedia.org from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/893078 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn) [23:48:49] (PS5) Zabe: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:48:57] (CR) CI reject: [V: -1] [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:50:29] (PS6) Zabe: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:50:37] (CR) CI reject: [V: -1] [Beta Cluster] Remove deploymentwiki site
configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:50:43] (PS7) Zabe: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:51:29] (CR) CI reject: [V: -1] [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:52:27] (PS1) Krinkle: Remove config for former Rdbms logging channels [mediawiki-config] - https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873) [23:53:08] (PS8) Zabe: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:53:51] (CR) CI reject: [V: -1] [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:54:54] (PS9) Zabe: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:55:35] (CR) CI reject: [V: -1] [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:56:22] (PS10) Zabe: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:57:31] (CR) Zabe: [C: +2] [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609
(https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:58:13] (Merged) jenkins-bot: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: Majavah) [23:58:37] !log zabe@deploy2002 Started scap: T198673 [23:58:44] T198673: Remove deployment.wikimedia.beta.wmflabs.org wiki (deploymentwiki) - https://phabricator.wikimedia.org/T198673