[00:01:19] (03PS2) 10Zabe: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) [00:01:55] 10SRE, 10DNS, 10Fundraising-Backlog, 10Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10BCornwall) p:05Medium→03Low [00:04:21] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01036 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:06:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:06:26] I see no recent merge that caused this. It's just that special cases added up to be slightly over the threshold. [00:06:33] re: puppet failure [00:07:05] tries puppet though on random stat host because there are multiple of those in the list [00:07:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:07:44] and yea, it has errors but nothing global [00:08:21] something about missing packages and scap targets that doesn't seem urgent [00:09:26] maybe I should just make a single "failed puppet runs" ticket with checkboxes for different teams to check [00:09:48] because we need to get under the alerting threshold again.. that's kind of the point of the alert anyways [00:10:32] then on the other hand.. I did this before and afair people don't like tickets that need to be passed on to multiple teams.. but also we do exactly that for reboots..
shrug [00:12:44] (03PS3) 10Zabe: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) [00:15:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:16:59] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:17:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:17:26] (03PS1) 10Zabe: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891379 [00:17:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:18:33] (03PS2) 10Zabe: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891379 [00:19:33] (03CR) 10Zabe: [C: 03+2] build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891379 (owner: 10Zabe) [00:19:45] 10SRE: too many puppet failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10Dzahn) [00:20:12] (03Merged) 10jenkins-bot: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891379 (owner: 10Zabe) [00:24:59] 10SRE: too many puppet failures (puppet errors on logstash hosts) - https://phabricator.wikimedia.org/T330361 (10Dzahn) [00:25:43] (03PS1) 10Zabe: throttle: Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891380 
[00:26:02] (03CR) 10Zabe: [C: 03+2] throttle: Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891380 (owner: 10Zabe) [00:26:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891380 (owner: 10Zabe) [00:27:12] (03Merged) 10jenkins-bot: throttle: Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891380 (owner: 10Zabe) [00:27:35] !log zabe@deploy1002 Started scap: Backport for [[gerrit:891380|throttle: Remove expired rules]] [00:28:37] (03CR) 10Andrew Bogott: "@Taavi, adding you because there's some explanation here about why cumin didn't work on all the canaries." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868803 (owner: 10Andrew Bogott) [00:29:33] !log zabe@deploy1002 zabe: Backport for [[gerrit:891380|throttle: Remove expired rules]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [00:30:01] (03CR) 10Andrew Bogott: "@Volans we were bit by this issue again -- cumin is largely broken for cloud-vps currently. 
Can we please get this or a variant merged soo" [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [00:32:35] (03PS1) 10Dzahn: switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) [00:32:50] (03PS2) 10Dzahn: switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) [00:33:40] (03CR) 10CI reject: [V: 04-1] switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [00:34:17] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:11] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:891380|throttle: Remove expired rules]] (duration: 08m 36s) [00:36:34] (03CR) 10Dzahn: [V: 04-1] "oh wow, DNS lint points out mistakes from the past. we replaced people2001 with people2002 but did not change the commented out line. 
16:" [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [00:37:09] (03PS2) 10Zabe: Fix interwiki prefix for generic wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886949 (https://phabricator.wikimedia.org/T327575) (owner: 10Aklapper) [00:37:12] (03CR) 10Zabe: [C: 03+2] Fix interwiki prefix for generic wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886949 (https://phabricator.wikimedia.org/T327575) (owner: 10Aklapper) [00:37:52] (03PS3) 10Dzahn: switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) [00:37:54] (03Merged) 10jenkins-bot: Fix interwiki prefix for generic wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886949 (https://phabricator.wikimedia.org/T327575) (owner: 10Aklapper) [00:38:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886949 (https://phabricator.wikimedia.org/T327575) (owner: 10Aklapper) [00:39:03] !log zabe@deploy1002 Started scap: Backport for [[gerrit:886949|Fix interwiki prefix for generic wikimaniawiki (T327575)]] [00:39:07] T327575: Broken interwikis in translation notifications for Wikimania wiki - https://phabricator.wikimedia.org/T327575 [00:40:53] !log zabe@deploy1002 aklapper and zabe: Backport for [[gerrit:886949|Fix interwiki prefix for generic wikimaniawiki (T327575)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [00:41:13] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [00:45:34] (03PS1) 10Dzahn: peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/891382 (https://phabricator.wikimedia.org/T330091) [00:47:36] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:886949|Fix 
interwiki prefix for generic wikimaniawiki (T327575)]] (duration: 08m 33s) [00:47:41] T327575: Broken interwikis in translation notifications for Wikimania wiki - https://phabricator.wikimedia.org/T327575 [00:51:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:56:59] (03CR) 10Dzahn: "as opposed to planet this does need a puppet change as well, but a harmless one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/8913" [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [00:57:17] (03PS1) 10Dzahn: re-introduce webserver-misc-static.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/891384 (https://phabricator.wikimedia.org/T330090) [01:05:25] (03PS1) 10Dzahn: switch annual.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/891406 (https://phabricator.wikimedia.org/T330090) [01:07:36] (03PS2) 10Dzahn: re-introduce webserver-misc-static.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/891384 (https://phabricator.wikimedia.org/T330090) [01:29:25] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [01:31:29] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new wdqs nodes - pt1979@cumin2002" [01:32:06] (03CR) 10Krinkle: [C: 04-1] Added extended confirmed on nlwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [01:32:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new wdqs nodes - pt1979@cumin2002" [01:32:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:33:23] 
PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2013.mgmt.codfw.wmnet with reboot policy FORCED [01:35:13] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:35:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2014.mgmt.codfw.wmnet with reboot policy FORCED [01:35:59] (03PS5) 10Winston Sung: Update $wgTranslateDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) [01:37:29] (03CR) 10Winston Sung: Update $wgTranslateDisabledTargetLanguages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [01:37:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2015.mgmt.codfw.wmnet with reboot policy FORCED [01:39:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2016.mgmt.codfw.wmnet with reboot policy FORCED [01:39:27] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [01:45:17] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [01:48:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2013.mgmt.codfw.wmnet with reboot policy FORCED [01:48:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wdqs2014.mgmt.codfw.wmnet with reboot policy FORCED [01:48:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2015.mgmt.codfw.wmnet with reboot policy FORCED [01:51:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2016.mgmt.codfw.wmnet with reboot policy FORCED [01:52:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2017.mgmt.codfw.wmnet with reboot policy FORCED [01:53:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2018.mgmt.codfw.wmnet with reboot policy FORCED [01:55:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2019.mgmt.codfw.wmnet with reboot policy FORCED [01:57:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2020.mgmt.codfw.wmnet with reboot policy FORCED [01:59:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2018.mgmt.codfw.wmnet with reboot policy FORCED [01:59:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2017.mgmt.codfw.wmnet with reboot policy FORCED [02:00:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2021.mgmt.codfw.wmnet with reboot policy FORCED [02:01:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2019.mgmt.codfw.wmnet with reboot policy FORCED [02:04:50] 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10leila) [02:05:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2020.mgmt.codfw.wmnet with reboot policy FORCED [02:06:30] 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10leila) This request is approved on my end. 
The access expiry date (in-line with the contract expiry date) is June 30, 2023. Thanks for your work on this in adva... [02:06:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2021.mgmt.codfw.wmnet with reboot policy FORCED [02:16:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2022.mgmt.codfw.wmnet with reboot policy FORCED [02:21:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:41] (03PS1) 10Stang: trwiki: Restrict ContentTranslation to autoreview/patroller/sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891408 (https://phabricator.wikimedia.org/T330363) [02:31:03] (03PS2) 10Stang: trwiki: Restrict ContentTranslation to autoreview/patroller/sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891408 (https://phabricator.wikimedia.org/T330363) [02:36:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2022.mgmt.codfw.wmnet with reboot policy FORCED [02:36:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) [02:42:04] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10Papaul) @ayounsi will look into it when i am back tomorrow on site. 
Thanks [02:46:44] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2070 [02:48:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2070 [02:48:46] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330343 (10Papaul) 05Open→03Resolved a:03Papaul This was one of the new ms-be node it is now fixed [02:49:30] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [02:51:50] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS ms-be2070 - pt1979@cumin2002" [02:57:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS ms-be2070 - pt1979@cumin2002" [02:57:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:02:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) [03:11:08] (03PS54) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:11:30] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:12:23] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39807/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:20:25] (03CR) 10Ottomata: [V: 03+1] "Okay! Check out the latest patchset and the PCC output and let me know what you think." 
[puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:21:11] (03PS55) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:21:33] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:22:21] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39808/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:23:16] (03PS56) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:23:38] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:24:20] (03PS57) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:24:42] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:27:52] (03PS58) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:28:13] (03CR) 10CI reject: [V: 
04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:28:22] (03PS59) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:28:44] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:38:07] (03PS60) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:38:29] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:39:20] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39809/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:40:44] (03PS61) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:41:08] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:41:58] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39810/console" [puppet] - 
10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:43:55] (03CR) 10Ottomata: [V: 03+1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:44:28] (03PS62) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:47:33] (03PS63) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [03:47:48] (03PS64) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [04:51:29] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:37:37] (03PS2) 10BCornwall: utils: Add SPDX Apache-2.0 license to utils [dns] - 10https://gerrit.wikimedia.org/r/890016 (https://phabricator.wikimedia.org/T291323) [05:42:05] (03CR) 10BCornwall: [C: 03+2] utils: Add SPDX Apache-2.0 license to utils [dns] - 10https://gerrit.wikimedia.org/r/890016 (https://phabricator.wikimedia.org/T291323) (owner: 10BCornwall) [05:44:27] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review, 10Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (10BCornwall) 05In progress→03Resolved [06:50:34] 10ops-codfw: Port with no description on access switch - 
https://phabricator.wikimedia.org/T330372 (10phaultfinder) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T0700) [07:00:05] kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T0700). nyaa~ [07:03:55] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [07:05:37] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [07:49:03] !log operations/mediawiki-config will now run `tox` to verify logos | T329231 | https://gerrit.wikimedia.org/r/c/integration/config/+/891317 [07:49:04] 10SRE, 10Infrastructure-Foundations: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10SLyngshede-WMF) We end up here, resulting in an empty list of command, which are then parsed to _icinga_host.run_sync, which fails because ther... [07:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:08] T329231: CI should ensure that wmf-config/logos.php matches logos/config.yaml - https://phabricator.wikimedia.org/T329231 [08:00:05] Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:11] morning. there are no trainees signed up today and I see two patches from our stalwart self-deployer, kart_ (are you self-deploying today? if so, proceed when ready :-) ) [08:00:33] Yeah, will go ahead. Pretty long CI time ahead :) [08:00:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890864 (https://phabricator.wikimedia.org/T329893) (owner: 10KartikMistry) [08:00:51] heh ok [08:01:26] `08:01:01 Retrying (Retry(total=9, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /r/changes/890864/detail?o=COMMIT_FOOTERS&o=CURRENT_REVISION` [08:01:35] Anyone know what is this ^ [08:05:08] (03CR) 10Muehlenhoff: [C: 03+2] idm::jobs: Adapt auto restart to only run of idm-rq is active/present [puppet] - 10https://gerrit.wikimedia.org/r/891318 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [08:06:03] RECOVERY - Check systemd state on idm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:18] hi apergos and kart_ , I have updated CI to trigger `tox` for operations/mediawiki-config but the job should pass just fine [08:07:23] in case you have a config change this morning :-] [08:07:39] thanks for the heads up :-) [08:07:48] note you can CR+2 the changes ahead of the backport window [08:07:52] that saves a bit of time [08:08:28] hashar: cool. I'll do that for the second change. 
[08:08:43] the wmf-quibble-* jobs can potentially be speeded up by moving Wikibase tests to be run more or less standalone [08:08:57] but well that needs a little bit of investigation and time to accomplish :-\ [08:09:05] (03CR) 10KartikMistry: [C: 03+2] Fix contribution menu entrypoint in vector-2022 skin [extensions/ContentTranslation] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/890863 (https://phabricator.wikimedia.org/T329893) (owner: 10KartikMistry) [08:13:46] that will be nice when it happens though [08:14:58] (03PS5) 10Muehlenhoff: Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) [08:15:29] ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: idm-sync-permissions.service,rq-idm.service,wmf_auto_restart_rq-idm.service Slyngshede Test https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:51] (03Merged) 10jenkins-bot: Fix contribution menu entrypoint in vector-2022 skin [extensions/ContentTranslation] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890864 (https://phabricator.wikimedia.org/T329893) (owner: 10KartikMistry) [08:17:19] !log kartik@deploy1002 Started scap: Backport for [[gerrit:890864|Fix contribution menu entrypoint in vector-2022 skin (T329893)]] [08:17:24] T329893: User menu in Vector 2022 shows links on hover (logged-in users) - https://phabricator.wikimedia.org/T329893 [08:19:08] !log kartik@deploy1002 kartik: Backport for [[gerrit:890864|Fix contribution menu entrypoint in vector-2022 skin (T329893)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:19:56] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic-Icebox: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188 (10hashar) The task is from 2016, and I most probably filed 
it as a placeholder to track effort to enhance test coverage on various repositories. We were on a rampage to... [08:25:31] (03Merged) 10jenkins-bot: Fix contribution menu entrypoint in vector-2022 skin [extensions/ContentTranslation] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/890863 (https://phabricator.wikimedia.org/T329893) (owner: 10KartikMistry) [08:29:27] (03CR) 10Muehlenhoff: [C: 03+2] Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [08:30:27] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:890864|Fix contribution menu entrypoint in vector-2022 skin (T329893)]] (duration: 13m 08s) [08:30:32] T329893: User menu in Vector 2022 shows links on hover (logged-in users) - https://phabricator.wikimedia.org/T329893 [08:31:49] !log kartik@deploy1002 Started scap: Backport for [[gerrit:890863|Fix contribution menu entrypoint in vector-2022 skin (T329893)]] [08:31:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] Revert "mw-on-k8s: reduce codfw replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048) (owner: 10Alexandros Kosiaris) [08:32:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "mw-on-k8s: reduce codfw replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048) (owner: 10Alexandros Kosiaris) [08:33:43] !log kartik@deploy1002 kartik: Backport for [[gerrit:890863|Fix contribution menu entrypoint in vector-2022 skin (T329893)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:35:07] (03CR) 10Muehlenhoff: [C: 03+2] Adjust monitoring for KDC processes if worker threads are in use [puppet] - 10https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [08:36:20] !log 
dcaro@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1003.eqiad.wmnet [08:36:47] RECOVERY - Kerberos KDC daemon on krb1001 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [08:36:55] (03Merged) 10jenkins-bot: Revert "mw-on-k8s: reduce codfw replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048) (owner: 10Alexandros Kosiaris) [08:38:17] RECOVERY - puppet last run on puppetdb1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:39:19] RECOVERY - Kerberos KDC daemon on krb2001 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [08:39:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch role::puppetdb to Nginx custom flavour [puppet] - 10https://gerrit.wikimedia.org/r/890439 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [08:42:17] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:890863|Fix contribution menu entrypoint in vector-2022 skin (T329893)]] (duration: 10m 27s) [08:42:21] T329893: User menu in Vector 2022 shows links on hover (logged-in users) - https://phabricator.wikimedia.org/T329893 [08:42:40] (03PS1) 10Muehlenhoff: Only set profile::nginx::variant to custom for the new bookworm nodes [puppet] - 10https://gerrit.wikimedia.org/r/891487 (https://phabricator.wikimedia.org/T321783) [08:43:54] I'm done with backport. [08:44:04] (03CR) 10Vgutierrez: [C: 03+2] sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [08:44:19] apergos: It seems no other patches in the window. 
[08:44:39] (03CR) 10Muehlenhoff: [C: 03+2] Only set profile::nginx::variant to custom for the new bookworm nodes [puppet] - 10https://gerrit.wikimedia.org/r/891487 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [08:45:10] i will do the train in fifteen minutes [08:45:14] in that case we are done for today, thanks for coming! [08:45:28] !log UTC morning backport and config training done [08:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:01] (03Merged) 10jenkins-bot: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [08:51:29] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:53:21] 10SRE, 10Traffic, 10Patch-For-Review, 10Sustainability (Incident Followup): Provide a cookbook to perform HAProxy upgrades on CDN nodes - https://phabricator.wikimedia.org/T330272 (10Vgutierrez) 05Open→03Resolved [08:55:05] (03PS1) 10Muehlenhoff: nginx: Drop require for the nginx package [puppet] - 10https://gerrit.wikimedia.org/r/891489 (https://phabricator.wikimedia.org/T321783) [08:55:26] (03CR) 10CI reject: [V: 04-1] nginx: Drop require for the nginx package [puppet] - 10https://gerrit.wikimedia.org/r/891489 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [08:59:52] (03PS1) 10Muehlenhoff: Switch puppetdb to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/891491 (https://phabricator.wikimedia.org/T264178) [09:00:04] hashar and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T0900). [09:00:56] (03PS2) 10Muehlenhoff: nginx: Drop require for the nginx package [puppet] - 10https://gerrit.wikimedia.org/r/891489 (https://phabricator.wikimedia.org/T321783) [09:01:49] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891492 (https://phabricator.wikimedia.org/T325587) [09:01:51] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891492 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot) [09:01:54] lets goo [09:02:41] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891492 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot) [09:09:51] (03PS2) 10Muehlenhoff: Switch puppetdb to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/891491 (https://phabricator.wikimedia.org/T264178) [09:09:51] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [09:09:54] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.24 refs T325587 [09:09:59] T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587 [09:10:09] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:47] http fails but that is from cumin? 
[09:11:14] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:11:29] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:11:47] 10SRE, 10Infrastructure-Foundations: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10Volans) The timing on this was funny, as we just released a change on the icinga module, but AFAICT, this has nothing to do with it. So my cur... [09:12:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891491 (https://phabricator.wikimedia.org/T264178) (owner: 10Muehlenhoff) [09:12:18] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001" [09:13:06] (03PS4) 10Clément Goubert: Exclude traindev from tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 [09:14:42] (03CR) 10Clément Goubert: Exclude traindev from tests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 (owner: 10Clément Goubert) [09:15:33] (03PS2) 10Elukey: admin_ng: upgrade the DSE cluster to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891284 (https://phabricator.wikimedia.org/T330261) [09:16:06] this train is boring, no new errors showing up on the PHP backend ;] [09:16:53] (03PS3) 10Vgutierrez: acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) [09:17:05] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: 
CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:17:13] (03CR) 10CI reject: [V: 04-1] acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [09:18:04] (03CR) 10Volans: [C: 03+1] sre.switchdc.mediawiki: Set both datacenters to rw (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (https://phabricator.wikimedia.org/T330300) (owner: 10Clément Goubert) [09:18:26] (03PS4) 10Vgutierrez: acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) [09:19:41] hashar: The httpbb tests run on cumin [09:20:13] They probably shouldn't but they do [09:20:41] (03PS1) 10Slyngshede: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) [09:22:22] 10SRE, 10observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10fgiunchedi) I think you are correct in the sense that this should be fixed once centrallog1001 is decom, cc @andrea.denisse [09:23:15] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001" [09:23:15] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:23:16] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd1003.eqiad.wmnet [09:23:43] !log dcaro@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1004.eqiad.wmnet [09:24:10] (03CR) 10CI reject: [V: 04-1] Icinga: Handle edge case where status is not 
optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [09:26:19] (03PS5) 10Vgutierrez: acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) [09:31:50] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10Volans) @ssingh did you complete manually all the steps missed by the reimage because of this failure? [09:32:44] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [09:33:45] (03PS6) 10Vgutierrez: acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) [09:35:16] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39814/console" [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [09:36:32] (03PS4) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 [09:37:30] (03PS5) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 [09:37:33] (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (owner: 10Clément Goubert) [09:37:54] !log powercycle thumbor1005 - OEM event for DIMM B1 detected in `getsel`, no tty available via mgmt console [09:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:03] hnowlan: o/ --^ [09:38:16] I didn't find anything about thumbor in SAL/phabricator [09:38:46] (03PS6) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - 
10https://gerrit.wikimedia.org/r/891266 [09:40:39] RECOVERY - Host thumbor1005 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [09:40:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891491 (https://phabricator.wikimedia.org/T264178) (owner: 10Muehlenhoff) [09:45:50] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10Volans) Chatting with Luca I got that over 5 reimages he got 2 failures too, so this looked more like a race condition bu... [09:47:27] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (owner: 10Clément Goubert) [09:47:36] !log uploaded php7.4 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 to component/php74 T323358 [09:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:46] (03PS1) 10Slyngshede: Icinga: Service should also be marked as failed on warnings. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/891498 (https://phabricator.wikimedia.org/T330318) [09:50:04] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001" [09:51:15] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001" [09:51:15] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:51:16] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd1004.eqiad.wmnet [09:51:53] !log uploaded php7.4 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 to component/php74 T330270 [09:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:03] (03CR) 10CI reject: [V: 04-1] Icinga: Service should also be marked as failed on warnings. [software/spicerack] - 10https://gerrit.wikimedia.org/r/891498 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [09:53:31] (03CR) 10Volans: [C: 03+1] "LGTM!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/891498 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [09:54:01] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:14] (03PS2) 10Slyngshede: Icinga: Service should also be marked as failed on warnings. [software/spicerack] - 10https://gerrit.wikimedia.org/r/891498 (https://phabricator.wikimedia.org/T330318) [09:58:12] (03CR) 10Volans: [C: 03+1] "Ship it!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/891498 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [10:01:26] (03CR) 10Slyngshede: [C: 03+2] Icinga: Service should also be marked as failed on warnings. [software/spicerack] - 10https://gerrit.wikimedia.org/r/891498 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [10:01:59] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:04:52] (03Merged) 10jenkins-bot: Icinga: Service should also be marked as failed on warnings. [software/spicerack] - 10https://gerrit.wikimedia.org/r/891498 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [10:06:05] (03PS2) 10Slyngshede: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) [10:07:15] (03CR) 10Vgutierrez: [C: 03+1] "looks good. What's the level field standardized name? level or levelname? 
I'd like to get rid of one of them in varnish logs as it's clear" [puppet] - 10https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [10:09:37] (03CR) 10CI reject: [V: 04-1] Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [10:19:50] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10Volans) p:05Triage→03High [10:20:13] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891346 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [10:22:24] (03PS1) 10Ilias Sarantopoulos: ml-services: Deploy nsfw model with debian bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/891501 (https://phabricator.wikimedia.org/T329612) [10:23:13] (03CR) 10Elukey: [C: 03+2] Add K8s DSE intermediate PKI configs and public certs [puppet] - 10https://gerrit.wikimedia.org/r/891346 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [10:26:28] (03PS3) 10Slyngshede: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) [10:27:29] (03PS4) 10Slyngshede: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) [10:28:10] (03PS1) 10Alexandros Kosiaris: developer-portal: Switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/891502 (https://phabricator.wikimedia.org/T297140) [10:28:29] (03PS1) 10Urbanecm: cswiki: Remove changetags from users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891503 (https://phabricator.wikimedia.org/T330383) [10:29:21] (03PS1) 10Jbond: add acme account [labs/private] - 
10https://gerrit.wikimedia.org/r/891504 [10:29:38] (03CR) 10Jbond: [V: 03+2 C: 03+2] add acme account [labs/private] - 10https://gerrit.wikimedia.org/r/891504 (owner: 10Jbond) [10:30:01] elukey: ack, thanks! [10:30:02] (03PS2) 10Urbanecm: cswiki: Grant changetags only to bots/sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891503 (https://phabricator.wikimedia.org/T330383) [10:31:00] (03PS2) 10Volans: apt: add new module with new AptGetHosts class [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 [10:32:59] (03CR) 10Volans: add domain param to openstack backend (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [10:35:36] (03PS1) 10Urbanecm: [DNM] Testing CI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891505 (https://phabricator.wikimedia.org/T329231) [10:36:20] (03CR) 10CI reject: [V: 04-1] [DNM] Testing CI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891505 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [10:36:37] awesome :) [10:36:44] (03Abandoned) 10Urbanecm: [DNM] Testing CI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891505 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [10:38:48] (03CR) 10Volans: "LGTM, small nits/missing tests inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [10:41:33] (03CR) 10Vgutierrez: [V: 03+1] "I've added support for both a single host provided as a string, or even the empty string that I've seen being used in some WMCS projects l" [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [10:42:54] (03CR) 10Volans: [C: 03+1] "The change looks ok to me, a question inline for what will be the workflow." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [10:49:26] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf/ops LDAP groups for Kavitha Appakayala - https://phabricator.wikimedia.org/T327403 (10jbond) @Kappakayala for no i have removed you from the ops group. We currently have a consistency check to ensure everyone in the ldap ops groups is also in the unix ops gro... [10:49:27] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:50:35] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [10:53:01] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:56:20] (03PS1) 10Elukey: istio: fix environment variable in proxyv2 Docker file. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/891506 [10:57:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] istio: fix environment variable in proxyv2 Docker file. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/891506 (owner: 10Elukey) [10:58:04] (03CR) 10Elukey: [V: 03+2 C: 03+2] istio: fix environment variable in proxyv2 Docker file. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/891506 (owner: 10Elukey) [10:58:50] (03CR) 10Clément Goubert: [C: 03+2] sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (owner: 10Clément Goubert) [10:59:10] !log updating mw canaries to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270 [10:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T1100). [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T1100) [11:00:49] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (owner: 10Clément Goubert) [11:01:14] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:02:09] (03PS5) 10Jbond: redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 [11:05:42] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [11:06:14] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:19] (03CR) 10Volans: "This didn't get merged..." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/890825 (owner: 10Jbond) [11:06:22] (03PS4) 10Volans: sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 (owner: 10Jbond) [11:08:13] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:39] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add rest gateway configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/890012 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [11:10:11] 10SRE: too many puppet failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10fgiunchedi) I've noticed some scap deploy-local failures on logstash hosts too for phatality, investigating [11:10:28] (03CR) 10Volans: "LGTM as approach, nits inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 (owner: 10Jbond) [11:11:30] (03PS1) 10Vgutierrez: hiera: Get rid of digicert-2021 mentions on traffic-cache-atstext-buster [puppet] - 10https://gerrit.wikimedia.org/r/891507 [11:11:57] (03CR) 10Vgutierrez: [C: 03+2] hiera: Get rid of digicert-2021 mentions on traffic-cache-atstext-buster [puppet] - 10https://gerrit.wikimedia.org/r/891507 (owner: 10Vgutierrez) [11:12:37] 10SRE: too many puppet failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10fgiunchedi) I got this when running as `deploy-service`: ` deploy-service@logstash1032:~$ scap deploy-local --repo releng/phatality -D log_json:False Traceback (most recent call last): File "/usr/bin/... 
[11:14:18] (03CR) 10Jbond: [C: 03+1] "lgtm optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/891489 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:14:52] (03Merged) 10jenkins-bot: api-gateway: add rest gateway configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/890012 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [11:14:53] 10SRE: too many puppet failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10MoritzMuehlenhoff) Same issue as https://phabricator.wikimedia.org/T326668 I suppose? [11:16:10] (03PS1) 10Elukey: custom.d: update istio configs for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891508 (https://phabricator.wikimedia.org/T329664) [11:16:50] (03CR) 10Klausman: [C: 03+1] role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [11:16:52] (03CR) 10Muehlenhoff: nginx: Drop require for the nginx package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891489 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:17:40] (03CR) 10Jbond: redfish: add update commands using the patch method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [11:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P44744 and previous config saved to /var/cache/conftool/dbconfig/20230223-111935-root.json [11:19:43] (03PS6) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (https://phabricator.wikimedia.org/T328593) [11:20:07] (03CR) 10Muehlenhoff: [C: 03+2] nginx: Drop require for the nginx package [puppet] - 10https://gerrit.wikimedia.org/r/891489 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:21:55] 
(03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: switch to using _upload_session (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [11:22:06] (03CR) 10Volans: "question inline, LGTM otherwise" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [11:22:54] 10SRE: too many puppet failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10fgiunchedi) I am getting the same `ModuleNotFoundError` on arclamp1001 (where scap predictably fails too) [11:23:58] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [11:25:38] (03PS1) 10Hnowlan: service, k8s: Add service definitions for rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) [11:25:39] 10SRE: too many puppet failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10fgiunchedi) >>! In T330360#8640332, @MoritzMuehlenhoff wrote: > Same issue as https://phabricator.wikimedia.org/T326668 I suppose? Yes that's right, I'll followup there! 
thank you [11:26:35] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10fgiunchedi) Adding #sre for visibility [11:27:28] (03PS1) 10Superpes15: [shnwiktionary] Create 8 new namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891512 (https://phabricator.wikimedia.org/T330376) [11:28:16] (03CR) 10Volans: [C: 03+1] redfish: add update commands using the patch method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [11:29:52] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10fgiunchedi) For logstash and arclamp hosts the `scap` module doesn't seem to be found/loadable: ` deploy-service@logstash1032:~$ scap deploy-local --re... [11:34:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44746 and previous config saved to /var/cache/conftool/dbconfig/20230223-113440-root.json [11:39:50] (03PS1) 10Muehlenhoff: Deal with variant/custom mismatches in more places [puppet] - 10https://gerrit.wikimedia.org/r/891515 (https://phabricator.wikimedia.org/T329529) [11:40:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891515 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [11:49:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44747 and previous config saved to /var/cache/conftool/dbconfig/20230223-114944-root.json [11:56:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] custom.d: update istio configs for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891508 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [11:57:24] (03CR) 10Jbond: redfish: add update 
commands using the patch method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [12:01:19] (03CR) 10Jbond: redfish: add update commands using the patch method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [12:02:02] (03Merged) 10jenkins-bot: custom.d: update istio configs for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891508 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [12:03:25] (03CR) 10Jbond: [C: 03+1] Deal with variant/custom mismatches in more places [puppet] - 10https://gerrit.wikimedia.org/r/891515 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [12:04:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44749 and previous config saved to /var/cache/conftool/dbconfig/20230223-120449-root.json [12:10:05] (03CR) 10Muehlenhoff: [C: 03+2] Deal with variant/custom mismatches in more places [puppet] - 10https://gerrit.wikimedia.org/r/891515 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [12:11:14] (03CR) 10Clément Goubert: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/891520 (https://phabricator.wikimedia.org/T288867) (owner: 10Clément Goubert) [12:12:52] (03PS3) 10Muehlenhoff: Switch puppetdb to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/891491 (https://phabricator.wikimedia.org/T264178) [12:20:56] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10jnuche) @fgiunchedi as a quickfix I manually installed scap's latest version on those hosts (arclamp and logstash): ` scap@logstash1032:~$ scap version... 
[12:22:45] (03PS4) 10Zabe: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) [12:28:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891491 (https://phabricator.wikimedia.org/T264178) (owner: 10Muehlenhoff) [12:32:22] (03CR) 10Atieno: [C: 03+2] imagemagick: use JSON output from exiftool [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/883564 (https://phabricator.wikimedia.org/T327887) (owner: 10Hnowlan) [12:33:40] (03CR) 10Atieno: [C: 03+1] Remove vendored thumbor-community-core [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888210 (owner: 10Hnowlan) [12:37:41] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10jbond) >>! In T326668#8640118, @jnuche wrote: > @jbond, that seems a different issue while installing Scap3 services. This ticket's issue is about a prob... 
[12:39:03] RECOVERY - puppet last run on puppetdb2003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:40:09] (03Merged) 10jenkins-bot: imagemagick: use JSON output from exiftool [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/883564 (https://phabricator.wikimedia.org/T327887) (owner: 10Hnowlan) [12:40:26] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [12:41:16] 10SRE, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10RobH) This seems to have stalled with an inconclusive assumption that the NIC firmware update solved it, but we have no confirmation of that (see @ssingh's last comment above).... [12:41:29] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005923 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:49:33] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [12:50:24] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (10Clement_Goubert) 05Open→03Resolved [12:50:28] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) [12:50:36] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert) [12:53:57] (03CR) 10Jbond: [C: 
03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891491 (https://phabricator.wikimedia.org/T264178) (owner: 10Muehlenhoff) [13:01:19] (03PS6) 10Winston Sung: Update $wgTranslateDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) [13:06:32] (03CR) 10Muehlenhoff: [C: 03+2] Switch puppetdb to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/891491 (https://phabricator.wikimedia.org/T264178) (owner: 10Muehlenhoff) [13:25:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891330 (owner: 10Alexandros Kosiaris) [13:26:06] (03PS6) 10Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) [13:37:56] (03PS1) 10Alexandros Kosiaris: tegola: Remove annotations from the CronJob spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/891542 [13:41:36] (03CR) 10Nikerabbit: Update $wgTranslateDisabledTargetLanguages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [13:42:36] (03PS1) 10Volans: wmf-update-known-hosts-production: fix CNAMEs [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891544 [13:44:01] (03CR) 10Muehlenhoff: "Looks good, a few comments here and there." [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 (owner: 10Volans) [13:45:08] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Radar): git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) The same issue occurs on the deployment target as reported by T330394: ` nfraison@st... [13:45:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891330 (owner: 10Alexandros Kosiaris) [13:46:20] (03CR) 10Muehlenhoff: [C: 03+1] "When merged, I'll push a new deb out, there were also some other recent changes." [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891330 (owner: 10Alexandros Kosiaris) [13:48:16] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 (owner: 10Volans) [13:48:38] (03PS5) 10Slyngshede: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) [13:48:56] (03CR) 10Slyngshede: Icinga: Handle edge case where status is not optimal (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [13:49:15] (03CR) 10Majavah: "Most of cloudvirt-canaries are not padded. Would replacing the few that are solve the problem instead?" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868803 (owner: 10Andrew Bogott) [13:51:32] (03PS6) 10Slyngshede: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) [13:53:44] (03PS2) 10Jbond: wmf-update-known-hosts-production: handle multiple algorithems [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891326 [13:54:25] (03CR) 10Jbond: [C: 03+1] "i went for an alternate approach but im happy with either one" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891544 (owner: 10Volans) [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T1400). [14:00:04] cirno and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] o/ [14:00:47] Ciao :) [14:01:47] (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Remove annotations from the CronJob spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/891542 (owner: 10Alexandros Kosiaris) [14:02:37] (03CR) 10Hashar: [C: 04-1] "So that is a bit more complicated. We have the same issue on a deployment target (stat1004) which is reported as T330394, with the newer " [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar) [14:03:20] I’m in a meeting… if no one else is available, I could deploy around 14:30 UTC if someone pings me :) [14:06:28] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [14:08:29] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [14:11:12] (03CR) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [14:11:15] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [14:11:18] Testing is currently broken in CI (just as a heads up) [14:13:04] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/891551 [14:13:35] (03CR) 10Volans: 
[C: 03+2] CHANGELOG: add changelogs for release v6.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/891551 (owner: 10Volans) [14:13:37] !log upgrade istio in wikikube codfw, staging-eqiad, staging-codfw to 1.15.3-2 to re-enable istio metrics [14:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:51] (03CR) 10Jbond: scap: disable git safe.directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar) [14:14:16] (03PS1) 10Clément Goubert: db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) [14:14:35] (03PS7) 10Slyngshede: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) [14:15:40] (03CR) 10Clément Goubert: [C: 04-2] "Preparation for 2023-03-01 datacenter switchover." [dns] - 10https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [14:15:54] am I being a terrible coder or CI is failing for mw [14:16:09] Superpes: and cirno I can take a look [14:16:54] Hi Amir1 Thanks! 
you can start with cirno :P [14:17:15] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v6.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/891551 (owner: 10Volans) [14:18:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891408 (https://phabricator.wikimedia.org/T330363) (owner: 10Stang) [14:18:33] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2013'] [14:19:21] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2014'] [14:19:23] (03Merged) 10jenkins-bot: trwiki: Restrict ContentTranslation to autoreview/patroller/sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891408 (https://phabricator.wikimedia.org/T330363) (owner: 10Stang) [14:19:36] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:891408|trwiki: Restrict ContentTranslation to autoreview/patroller/sysop (T330363)]] [14:19:36] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2015'] [14:19:40] T330363: Allow using ContentTranslation only for autoreview, patroller, and sysop at trwiki - https://phabricator.wikimedia.org/T330363 [14:20:02] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2016'] [14:21:38] !log ladsgroup@deploy1002 stang and ladsgroup: Backport for [[gerrit:891408|trwiki: Restrict ContentTranslation to autoreview/patroller/sysop (T330363)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:21:44] (03PS3) 10David Caro: wmcs.ceph: move cloudcephosd1003/1004 to e4/f4 [puppet] - 10https://gerrit.wikimedia.org/r/888660 (https://phabricator.wikimedia.org/T329502) [14:22:18] cirno: live in mwdebug, please test [14:22:24] looking [14:22:41] * Lucas_WMDE around if needed 
[14:22:53] (03CR) 10Clément Goubert: [C: 04-2] "Preparation for 2023-02-28 datacenter traffic switchover." [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [14:23:35] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:17] (03PS1) 10Nicolas Fraison: provider_scap3: update the query to execute as the deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/891555 [14:24:23] (03PS1) 10Volans: Upstream release v6.2.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/891556 [14:24:38] Amir1, it works now: I'm not in any of the groups mentioned in the commit msg, and when I try to use CX I get a notice saying "Publishing only allowed to experienced users" [14:24:51] (03CR) 10Muehlenhoff: apt: add new module with new AptGetHosts class (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 (owner: 10Volans) [14:24:51] awesome [14:24:57] (03PS2) 10Nicolas Fraison: provider_scap3: update the query to execute as the deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/891555 [14:25:36] (03PS2) 10Volans: Upstream release v6.2.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/891556 [14:25:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2013'] [14:26:28] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2014'] [14:27:02] (03CR) 10CI reject: [V: 04-1] provider_scap3: update the query to execute as the deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/891555 (owner: 10Nicolas Fraison) [14:27:04] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39817/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/891555 (owner: 10Nicolas Fraison) [14:27:05] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017'] [14:27:22] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2018'] [14:27:27] (03PS3) 10Nicolas Fraison: provider_scap3: update the query to execute as the deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/891555 (https://phabricator.wikimedia.org/T330394) [14:27:32] (03CR) 10Volans: apt: add new module with new AptGetHosts class (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 (owner: 10Volans) [14:27:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2015'] [14:27:47] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2019'] [14:28:11] (03PS1) 10Jbond: scap - provider: update scap provider to run git with correct user [puppet] - 10https://gerrit.wikimedia.org/r/891557 (https://phabricator.wikimedia.org/T330394) [14:28:13] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/891556 (owner: 10Volans) [14:28:39] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:28:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2016'] [14:29:13] (03CR) 10Volans: [C: 03+2] Upstream release v6.2.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/891556 (owner: 10Volans) [14:29:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2020'] [14:29:20] (03CR) 10CI reject: [V: 04-1] provider_scap3: update the query to execute as the deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/891555 (https://phabricator.wikimedia.org/T330394) (owner: 10Nicolas Fraison) [14:29:30] !log 
bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [14:29:34] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:30:14] (03CR) 10CI reject: [V: 04-1] scap - provider: update scap provider to run git with correct user [puppet] - 10https://gerrit.wikimedia.org/r/891557 (https://phabricator.wikimedia.org/T330394) (owner: 10Jbond) [14:30:46] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:891408|trwiki: Restrict ContentTranslation to autoreview/patroller/sysop (T330363)]] (duration: 11m 10s) [14:30:50] T330363: Allow using ContentTranslation only for autoreview, patroller, and sysop at trwiki - https://phabricator.wikimedia.org/T330363 [14:31:12] cirno: done, moving to Superpes [14:31:20] Yup :P [14:31:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891261 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15) [14:32:15] (03PS4) 10Ladsgroup: [sysop_itwiki] Change the logo, the favicon, and add a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891261 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15) [14:32:22] (03PS1) 10Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: extend CNAME message [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891558 [14:32:39] (03CR) 10TrainBranchBot: "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891261 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15) [14:32:49] (03CR) 10Lucas Werkmeister (WMDE): "At least, I was confused by the lack of deployment.eqiad.wmnet in the output :)" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891558 (owner: 10Lucas Werkmeister (WMDE)) [14:33:02] (03Merged) 10jenkins-bot: Upstream release v6.2.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/891556 (owner: 10Volans) [14:33:23] (03Merged) 
10jenkins-bot: [sysop_itwiki] Change the logo, the favicon, and add a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891261 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15) [14:33:36] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:891261|[sysop_itwiki] Change the logo, the favicon, and add a wordmark (T330279)]] [14:33:40] T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279 [14:33:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [14:34:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Deployment window for this one is: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230228T1500" [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [14:34:35] (03CR) 10Effie Mouzeli: [C: 03+2] tegola: Remove annotations from the CronJob spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/891542 (owner: 10Alexandros Kosiaris) [14:35:40] !log ladsgroup@deploy1002 ladsgroup and superpes: Backport for [[gerrit:891261|[sysop_itwiki] Change the logo, the favicon, and add a wordmark (T330279)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:35:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2017'] [14:36:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2018'] [14:36:16] Looking [14:36:18] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10ssingh) Thanks all for the quick 
response to this task, everyone! >>! In T330318#8640112, @Volans wrote: > @ssingh did y... [14:36:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2019'] [14:37:04] (03PS1) 10Andrew Bogott: cloud-vps: purge leaked VM images, daily [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) [14:37:26] (03CR) 10CI reject: [V: 04-1] cloud-vps: purge leaked VM images, daily [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [14:37:31] Amir1 Everything is fine! From both vector 2022/mobile and vector legacy [14:37:42] awesome [14:38:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2020'] [14:38:06] !log uploaded spicerack_6.2.2 to apt.wikimedia.org bullseye-wikimedia [14:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:09] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10Volans) @ssingh just run the Netbox script https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetD... 
[14:38:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 71): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39818/console" [puppet] - 10https://gerrit.wikimedia.org/r/891557 (https://phabricator.wikimedia.org/T330394) (owner: 10Jbond) [14:38:31] (03PS4) 10Nicolas Fraison: provider_scap3: update the query to execute as the deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/891555 (https://phabricator.wikimedia.org/T330394) [14:39:01] 10SRE, 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Mvolz) [14:40:03] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10ssingh) >>! In T330318#8640963, @Volans wrote: > @ssingh just run the Netbox script https://netbox.wikimedia.org/extras/s... [14:40:11] (03Merged) 10jenkins-bot: tegola: Remove annotations from the CronJob spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/891542 (owner: 10Alexandros Kosiaris) [14:40:17] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10fgiunchedi) >>! In T326668#8640506, @jnuche wrote: > @fgiunchedi as a quickfix I manually installed scap's latest version on those hosts (arclamp and log... 
[14:40:54] (03PS2) 10Andrew Bogott: cloud-vps: purge leaked VM images, daily [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) [14:41:16] (03CR) 10CI reject: [V: 04-1] cloud-vps: purge leaked VM images, daily [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [14:41:42] !log installed spicerack 6.2.2 to cumin hosts [14:41:44] (03Abandoned) 10Jbond: scap - provider: update scap provider to run git with correct user [puppet] - 10https://gerrit.wikimedia.org/r/891557 (https://phabricator.wikimedia.org/T330394) (owner: 10Jbond) [14:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:26] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10Volans) 05Open→03Resolved a:03Volans This is now deployed, you can resume your reimages. Sorry for the trouble. 
[14:42:29] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [14:43:00] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [14:43:34] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:891261|[sysop_itwiki] Change the logo, the favicon, and add a wordmark (T330279)]] (duration: 09m 58s) [14:43:38] T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279 [14:43:42] deployed [14:43:53] (03PS3) 10Andrew Bogott: cloud-vps: purge leaked VM images, daily [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) [14:44:11] (03PS4) 10Slyngshede: SUL account linking [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) [14:44:13] Wonderful :D [14:44:24] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [14:44:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891512 (https://phabricator.wikimedia.org/T330376) (owner: 10Superpes15) [14:44:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] wmf-update-known-hosts-production: fix CNAMEs [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891544 (owner: 10Volans) [14:45:03] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [14:45:07] (03CR) 10Slyngshede: Icinga: Handle edge case where status is not optimal (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [14:45:43] (03Merged) 10jenkins-bot: [shnwiktionary] Create 8 new namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891512 (https://phabricator.wikimedia.org/T330376) (owner: 10Superpes15) [14:45:57] !log ladsgroup@deploy1002 Started scap: Backport for 
[[gerrit:891512|[shnwiktionary] Create 8 new namespaces (T330376)]] [14:46:01] T330376: Create additional namespaces on shn.wiktionary - https://phabricator.wikimedia.org/T330376 [14:46:14] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:47:02] (03CR) 10Winston Sung: Update $wgTranslateDisabledTargetLanguages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [14:48:00] !log ladsgroup@deploy1002 ladsgroup and superpes: Backport for [[gerrit:891512|[shnwiktionary] Create 8 new namespaces (T330376)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [14:48:03] (03PS1) 10Muehlenhoff: Update comment [dns] - 10https://gerrit.wikimedia.org/r/891564 [14:48:06] Testing [14:49:24] Amir1 All right! 
:) [14:49:29] awesome [14:49:53] (03CR) 10Andrew Bogott: [C: 04-1] "this won't work because the env isn't set up" [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [14:50:22] (03CR) 10Ssingh: [C: 03+1] Update comment [dns] - 10https://gerrit.wikimedia.org/r/891564 (owner: 10Muehlenhoff) [14:51:45] (03CR) 10Muehlenhoff: [C: 03+2] Update comment [dns] - 10https://gerrit.wikimedia.org/r/891564 (owner: 10Muehlenhoff) [14:52:15] (03PS7) 10Winston Sung: Update $wgTranslateDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) [14:52:42] (03PS8) 10Winston Sung: Update $wgTranslateDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) [14:53:22] (03PS9) 10Winston Sung: Update $wgTranslateDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) [14:53:28] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: move cloudcephosd1003/1004 to e4/f4 [puppet] - 10https://gerrit.wikimedia.org/r/888660 (https://phabricator.wikimedia.org/T329502) (owner: 10David Caro) [14:53:43] (03CR) 10Jbond: "LGTM lets see how the tests go" [puppet] - 10https://gerrit.wikimedia.org/r/891555 (https://phabricator.wikimedia.org/T330394) (owner: 10Nicolas Fraison) [14:53:50] (03CR) 10David Caro: [C: 03+2] "The name of the interfaces is a guess for now, but will fix if needed after reimaging." 
[puppet] - 10https://gerrit.wikimedia.org/r/888660 (https://phabricator.wikimedia.org/T329502) (owner: 10David Caro) [14:55:06] (03PS1) 10Ladsgroup: Move more ns-related config out of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891565 (https://phabricator.wikimedia.org/T308932) [14:55:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] service, k8s: Add service definitions for rest-gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [14:55:19] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:891512|[shnwiktionary] Create 8 new namespaces (T330376)]] (duration: 09m 21s) [14:55:24] T330376: Create additional namespaces on shn.wiktionary - https://phabricator.wikimedia.org/T330376 [14:55:29] (03PS4) 10Andrew Bogott: cloud-vps: purge leaked VM images, daily [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) [14:55:32] (03Abandoned) 10Jbond: wmf-update-known-hosts-production: extend CNAME message [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891558 (owner: 10Lucas Werkmeister (WMDE)) [14:55:50] (03CR) 10CI reject: [V: 04-1] Move more ns-related config out of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891565 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [14:56:12] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. [14:56:12] Wonderful :) [14:56:14] Amir1 Thanks for your time :3 [14:56:27] oh no worries, thank you for making the patches! 
[14:56:29] (03Restored) 10Jbond: wmf-update-known-hosts-production: extend CNAME message [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891558 (owner: 10Lucas Werkmeister (WMDE)) [14:56:56] (03CR) 10Jbond: [C: 03+1] "restoring i thought this was the same as riccrdos" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891558 (owner: 10Lucas Werkmeister (WMDE)) [14:57:42] (03Abandoned) 10Jbond: wmf-update-known-hosts-production: handle multiple algorithems [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891326 (owner: 10Jbond) [14:58:02] (03CR) 10Jbond: [V: 03+2 C: 03+2] wmf-update-known-hosts-production: fix CNAMEs [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891544 (owner: 10Volans) [14:58:13] (03CR) 10Jbond: [V: 03+2 C: 03+2] wmf-update-known-hosts-production: extend CNAME message [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891558 (owner: 10Lucas Werkmeister (WMDE)) [14:59:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] "lgtm merging" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770 (owner: 10Clément Goubert) [14:59:46] (03CR) 10Jbond: [V: 03+2 C: 03+2] "merging" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891330 (owner: 10Alexandros Kosiaris) [15:00:30] (03PS8) 10Volans: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [15:00:34] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [15:00:48] (03PS2) 10Ladsgroup: Move more ns-related config out of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891565 (https://phabricator.wikimedia.org/T308932) [15:01:00] (03CR) 10David Caro: [C: 03+2] "Thanks for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro) [15:01:49] RECOVERY - Check systemd state on cp6004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:16] (03CR) 10Ladsgroup: [C: 03+2] Move more ns-related config out of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891565 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [15:05:59] (03Merged) 10jenkins-bot: Move more ns-related config out of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891565 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [15:10:43] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/891390 [15:10:57] (03PS1) 10David Caro: cloud: refactor the tests to use the super() abstraction [puppet] - 10https://gerrit.wikimedia.org/r/891567 [15:11:40] (03PS1) 10Jbond: ssh config: Add ControlPath and ControlPersist parameters [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891568 [15:12:56] (03PS1) 10Ssingh: ntp/ulsfo: set to dns4004 [dns] - 10https://gerrit.wikimedia.org/r/891569 (https://phabricator.wikimedia.org/T321309) [15:14:23] (03CR) 10Ssingh: [C: 03+2] ntp/ulsfo: set to dns4004 [dns] - 10https://gerrit.wikimedia.org/r/891569 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:15:00] !log ladsgroup@deploy1002 Synchronized wmf-config/core-Namespaces.php: Move more ns-related config out of InitialiseSettings, part I (T308932) (duration: 07m 01s) [15:15:00] !log running authdns-update for CR 891569 [15:15:05] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [15:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES 
codfw cluster: Roll restart of ORES's daemons. [15:20:04] Lucas_WMDE: I'm not sure about the status of unconnected pages backend refactor but maybe we can get rid of wmgWikibaseClientUnconnectedPageMigrationStage and other configs now? [15:20:28] probably, yeah [15:20:33] pretty sure nothing related to that is still ongoing [15:20:34] let me check [15:21:20] we (Wikidata team) are generally not always great at cleaning up the config, I think [15:21:21] (03CR) 10Nicolas Fraison: [C: 03+2] provider_scap3: update the query to execute as the deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/891555 (https://phabricator.wikimedia.org/T330394) (owner: 10Nicolas Fraison) [15:21:50] (03PS10) 10Winston Sung: Update $wgTranslateDisabledTargetLanguages for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T330409) [15:22:38] !log klausman@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. 
[15:26:52] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused Wikibase config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891571 (https://phabricator.wikimedia.org/T330410) [15:27:06] Amir1: T330410 [15:27:06] T330410: Clean up last remnants of Special:UnconnectedPages / unexpectedUnconnectedPageProp migration - https://phabricator.wikimedia.org/T330410 [15:27:13] thanks [15:27:39] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move more ns-related config out of InitialiseSettings, part II (T308932) (duration: 06m 35s) [15:27:43] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [15:27:46] (03CR) 10Lucas Werkmeister (WMDE): "I think in theory this can be deployed at any time, but I suggest waiting until I73e5d289e0 is rolled out with the train, so we’re absolut" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891571 (https://phabricator.wikimedia.org/T330410) (owner: 10Lucas Werkmeister (WMDE)) [15:31:01] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10nfraison) [15:34:31] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/891567 (owner: 10David Caro) [15:36:49] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [15:38:03] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:12] (03CR) 10Slyngshede: [C: 03+2] Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [15:42:04] !log klausman@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. 
[15:44:58] (03Merged) 10jenkins-bot: Icinga: Handle edge case where status is not optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/891494 (https://phabricator.wikimedia.org/T330318) (owner: 10Slyngshede) [15:45:05] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [15:51:17] (03PS1) 10Bking: dse-k8s: raise memory for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) [15:51:50] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved cloudcephosd1003/1004 to new racks - dcaro@cumin1001" [15:54:33] (03CR) 10David Caro: [C: 03+2] cloud: refactor the tests to use the super() abstraction [puppet] - 10https://gerrit.wikimedia.org/r/891567 (owner: 10David Caro) [15:54:52] (03CR) 10Ottomata: [C: 03+1] "Fine with me but I think you might need luca to agree?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [15:57:56] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved cloudcephosd1003/1004 to new racks - dcaro@cumin1001" [15:57:56] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:03] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [16:00:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891568 (owner: 10Jbond) [16:00:41] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:03:02] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved cloudcephosd1003/1004 to new racks - dcaro@cumin1001" [16:03:06] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1001 [16:03:08] 
!log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1001 [16:03:13] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1002 [16:03:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1002 [16:03:23] !log installing c-ares security updates [16:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:07] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved cloudcephosd1003/1004 to new racks - dcaro@cumin1001" [16:04:07] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:35] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2020'] [16:04:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2020'] [16:07:01] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2021'] [16:07:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2021'] [16:07:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2021'] [16:07:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2021'] [16:10:22] (03PS1) 10Cwhite: varnish: ignore levelname field [puppet] - 10https://gerrit.wikimedia.org/r/891391 (https://phabricator.wikimedia.org/T330267) [16:11:37] (03CR) 10Cwhite: logstash: remove SEVERITY_LABEL from syslog messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [16:13:02] !log jbond@cumin2002 
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [16:13:18] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet [16:14:12] !log codfw: roll-restarting swift frontends and thumbor hosts for key rotation [16:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:35] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) Incident doc relating the minor editing incident due to {T330300} https://wikitech.wikimedia.org/wiki/Incidents/2023-02-22_read_only [16:15:03] !log hnowlan@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [16:17:07] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Radar): git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10MoritzMuehlenhoff) >>! In T325128#8640737, @hashar wrote: > The same issue occurs on the dep... 
[16:18:03] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [16:19:24] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: drop version parsing for bios and idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/891582 [16:21:37] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/891582 (owner: 10Jbond) [16:22:19] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: drop version parsing for bios and idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/891582 [16:23:00] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/891582 (owner: 10Jbond) [16:23:20] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: drop version parsing for bios and idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/891582 (owner: 10Jbond) [16:25:51] PROBLEM - IPMI Sensor Status on db2125 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:26:12] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [16:27:54] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10eoghan) [16:29:21] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2021'] [16:29:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2021'] [16:41:50] !log eqiad: roll-restarting swift frontends and thumbor hosts for key rotation [16:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:07] !log hnowlan@cumin1001 START - Cookbook 
sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [16:43:06] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) [16:43:35] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10MatthewVernon) [16:44:10] (03CR) 10Vgutierrez: [C: 03+1] varnish: ignore levelname field [puppet] - 10https://gerrit.wikimedia.org/r/891391 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [16:44:14] 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Jhancock.wm) replacement received today. Tracking: 1Z7AF3880355021907 [16:49:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10MatthewVernon) Hi @Jclark-ctr could you give me an update on timescales for getting this hardware ready to go, please? From an operational perspective... [16:49:19] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10MatthewVernon) Hi @papaul could you give me an update on timescales for getting this hardware ready to go, please? From an operational perspective, it... 
[16:52:16] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [16:52:23] !log replacing fpc2 to fpc1 DAC cable [16:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:48] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) Hi, sorry I've been on leave, and now we're approaching the switchover. Can we do it after that, say Thursday 9th March, at whatever is the earliest comfortable time of day for you? [16:54:43] (03PS1) 10Cwhite: profile: move phatality resources from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/891392 [16:56:12] (03PS1) 10Vgutierrez: varnish: Limit ESI depth to 1 [puppet] - 10https://gerrit.wikimedia.org/r/891586 (https://phabricator.wikimedia.org/T308799) [16:58:18] !log replacing fpc2 to fpc1 DAC cable complete [16:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:43] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove SEVERITY_LABEL from syslog messages [puppet] - 10https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [16:59:55] (03CR) 10Filippo Giunchedi: [C: 03+1] varnish: ignore levelname field [puppet] - 10https://gerrit.wikimedia.org/r/891391 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [17:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T1700). [17:00:05] Urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:02:17] I can merge in just a sec [17:02:19] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10Papaul) DAC cable replace, statistic clear on the vcp no more errors show for now. [17:02:44] Urbanecm: will you want me to kick off a test run? [17:03:08] hi rzl, not needed, it's for an in-development feature :) [17:03:42] sounds good 👍 [17:03:45] (03PS2) 10Cwhite: profile: move phatality resources from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/891392 [17:04:41] (03PS1) 10Bking: analytics: rename postgres DB user for search platform [puppet] - 10https://gerrit.wikimedia.org/r/891587 (https://phabricator.wikimedia.org/T327970) [17:06:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bullseye [17:06:30] (03CR) 10RLazarus: [C: 03+2] growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [17:06:34] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS bullseye [17:06:36] some DNS and BGP alerts expected (in cr*-ulsfo): please ignore thank you [17:07:22] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [17:08:06] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/output/891392/39820/" [puppet] - 10https://gerrit.wikimedia.org/r/891392 (owner: 10Cwhite) [17:09:17] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:11:15] 
PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:11:17] (03CR) 10BBlack: [C: 03+1] "Looks like a good thing to me!" [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [17:11:19] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: move phatality resources from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/891392 (owner: 10Cwhite) [17:11:27] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:11:39] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:11:53] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10BBlack) Looks good to me, and appropriate at the Varnish layer in this case. 
[17:12:03] all set, puppet request window complete [17:12:10] thanks rzl [17:12:17] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:14:14] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:14:17] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:15:35] PROBLEM - Host 2620:0:863:1:198:35:26:7 is DOWN: PING CRITICAL - Packet loss = 100% [17:17:17] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:17:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [17:18:00] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:18:14] Here [17:18:49] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-3h&orgId=1&to=now&var-datasource=eqiad%20prometheus%2Fops [17:18:50] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:19:06] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:19:32] (03CR) 10Nicolas Fraison: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:19:51] (03CR) 10Hashar: [C: 04-1] scap: disable git safe.directory 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar) [17:19:54] Seeing a lot of SAL stuff, anyone doing things? [17:20:51] PROBLEM - Recursive DNS on 198.35.26.7 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:21:31] thumbor stuff shouldn't have impacted anything appserver-related [17:22:02] hnowlan: that was just staging cluster, right? [17:22:17] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:22:26] the DNS and BGP alerts in ulsfo are expected, that's sukhe doing the dns4003 reimage [17:22:26] cdanis: I did prod earlier also, but thumbor looks healthy and also shouldn't trigger anything in the mw path really [17:22:32] hnowlan: yeah... [17:22:36] yep expected and unrelated [17:22:50] brett: resolved? [17:23:02] > Resolved by: SYSTEM [17:23:17] RECOVERY - Host 2620:0:863:1:198:35:26:7 is UP: PING OK - Packet loss = 0%, RTA = 70.93 ms [17:23:18] thanks SYSTEM! [17:23:25] Still seeing 5XXs? 
[17:23:54] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:24:17] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:24:22] brett: btw the thing that paged wasn't actually the 5xx, it was a worker saturation alert: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&var-site=All&var-cluster=api_appserver&var-method=GET&var-code=200&var-php_version=All&from=now-1h&to=now&viewPanel=64 [17:24:26] and it's about to--- yeah [17:24:45] my guess is that this is related to incoming traffic and not a production change, based on SAL [17:24:49] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:24:49] Yeah, I don't know why since there are ~11k idle workers? [17:25:16] brett: if you restrict to eqiad they're almost 100% saturated [17:25:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [17:25:20] eqiad + api_appserver [17:25:22] ohhhh [17:25:25] seems all s7 [17:25:28] and eqiad appserver is about half-saturated [17:26:13] what hnowlan said [17:26:15] Does this warrant an incident/IC at this point? [17:26:17] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-3h&to=now [17:26:34] brett: I think so, we're not *actually* serving errors *yet*, but we need to figure out what's causing this and get it to stop [17:26:43] (well, correction, many errors) [17:27:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[17:27:18] eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:27:18] okay, I'll become IC [17:27:19] (03CR) 10Ottomata: analytics: rename postgres DB user for search platform (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891587 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [17:27:33] RECOVERY - IPMI Sensor Status on db2125 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:27:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [17:27:54] brett: https://noc.wikimedia.org/db.php#tabs-s7 btw -- here's the wikis in db shard s7 [17:28:31] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:7 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:29:05] brett: judging by https://w.wiki/6NJ3 it looks like db1127 is a bit out to lunch [17:29:33] if you wanted to attempt lowering its lb weight -- https://wikitech.wikimedia.org/wiki/Dbctl#Changing_weights_for_a_host -- then great, otherwise I can do it [17:30:04] cdanis: I took IC so I think you should [17:30:08] 👍 [17:30:26] https://docs.google.com/document/d/1rUGVHyV7MYBOViTYaPeOdqN1ypcXMJfJUX_4ppbGnok/edit# [17:31:05] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [17:31:28] !log cdanis@cumin1001 dbctl commit (dc=all): 'db1127 running very hot', diff saved to https://phabricator.wikimedia.org/P44752 and previous config saved to /var/cache/conftool/dbconfig/20230223-173127-cdanis.json [17:32:08] if it 
doesn't improve shortly I'll depool it entirely [17:32:18] (MediaWikiLatencyExceeded) firing: (2) Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:32:24] I am sorry, I had trouble getting online myself. here now [17:32:31] long story. can I do something? [17:32:41] btw -- the places where the data is missing on that graph are where mysql was so overloaded on that host it couldn't reply in time to the stats monitoring queries [17:32:45] 🙃 [17:33:00] mutante: I think it's mostly under control probably [17:33:37] cdanis: phew, glad to hear that [17:34:17] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:35:21] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1127&var-datasource=thanos&var-cluster=mysql&from=now-1h&to=now [17:35:27] completely pegged on cpu heh [17:35:33] yeah something is up [17:35:35] 5 minutes and no improvement, I'm depooling db1127 entirely [17:35:39] cdanis: might depool? 
[17:35:39] ok [17:36:08] !log cdanis@cumin1001 dbctl commit (dc=all): 'so hot right now', diff saved to https://phabricator.wikimedia.org/P44753 and previous config saved to /var/cache/conftool/dbconfig/20230223-173608-cdanis.json [17:36:13] it's the [17:36:21] mysql process [17:36:57] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved cloudcephosd1003/1004 to new racks - dcaro@cumin1001" [17:37:17] (MediaWikiLatencyExceeded) firing: (2) Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:37:37] RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:7 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [17:38:10] cdanis: despite the weight adjustment it looks like it's still serving errors. Any idea why that is? [17:38:25] * The application is serving errors [17:38:42] brett: takes ... 
~10 seconds for the change to propagate from etcd to all mediawikis, then, the database servers are selected by the LB code at the *start* of a query, so [17:38:56] we'll be seeing errors being served for at least another minute or two past that (as queries time out) [17:39:06] I see [17:39:09] RECOVERY - Recursive DNS on 198.35.26.7 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [17:39:11] and then we'll be seeing mariadb reporting errors without the application reporting errors for slightly longer than that [17:39:14] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:39:17] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:39:17] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:39:38] the latency and saturation metrics are in a much better place though [17:39:45] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:40:35] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:41:01] !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [17:41:56] I think 5xx has 
recovered as well [17:42:06] looking good! [17:42:17] (MediaWikiLatencyExceeded) resolved: (2) Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:43:00] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved cloudcephosd1003/1004 to new racks - dcaro@cumin1001" [17:43:00] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:43:23] btw, I think T265386 has work that would have prevented this [17:43:24] T265386: Make LoadMonitor server states more up-to-date and respond to outages more quickly - https://phabricator.wikimedia.org/T265386 [17:43:35] ack [17:43:58] db1127 is still wedged heh [17:44:16] brett: can you file a prio: high task against #DBA to have them check the server (and then repool)? [17:44:17] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:44:33] !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 03m 32s) [17:44:33] sure [17:44:41] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:44:42] !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [17:44:45] (03PS1) 10Ssingh: Revert "ntp/ulsfo: set to dns4004" [dns] - 10https://gerrit.wikimedia.org/r/891304 [17:44:47] cdanis: Not unbreak now? 
[17:44:54] nah, it can wait until their business hours [17:45:09] in theory we have more than enough capacity in each db shard to handle one replica being depooled [17:45:10] !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 27s) [17:46:06] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye [17:46:19] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:46:57] brett: things look stable now. I need to go afk for a bit but I can take a pass over the doc and/or answer questions/talk more later :) thanks! [17:47:09] cdanis: Thanks so much for your help [17:47:41] np! I'm just glad this was straightforward and not a fun and engaging exercise of "find the query of death" [17:49:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4003.wikimedia.org with OS bullseye [17:50:01] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS bullseye completed: - dns4003 (**PASS**) - Downtimed on Icinga/Al... 
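(Editor's note on the 17:38 explanation above: errors kept being served after the dbctl commit because the new config takes roughly 10 seconds to reach the MediaWiki hosts, and a replica is chosen at the *start* of each query, so queries already routed to db1127 still had to time out. The sketch below is a minimal illustration of that reasoning, not MediaWiki's actual load balancer code. The replica names other than db1127, the weights, and the 60-second query timeout are all hypothetical; only the ~10 second propagation figure comes from the log.)

```python
import random


def pick_replica(weights, rng):
    """Weighted random choice among replicas; weight 0 means depooled.

    `weights` maps host name -> load-balancer weight (hypothetical values).
    """
    hosts = [h for h, w in weights.items() if w > 0]
    if not hosts:
        raise RuntimeError("no pooled replicas")
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]


def last_error_time(change_time, propagation_delay, query_timeout):
    """Latest moment a query sent to the bad host can still fail.

    A query that starts just before the new config arrives still uses the
    old replica choice, and only reports an error once it times out.
    """
    return change_time + propagation_delay + query_timeout


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical weights after depooling db1127 (weight 0).
    after = {"db1127": 0, "replica-a": 100, "replica-b": 100}
    picks = {pick_replica(after, rng) for _ in range(1000)}
    print("db1127" in picks)  # False: new queries never land on it

    # ~10 s propagation (from the log) plus an assumed 60 s query timeout.
    print(last_error_time(0, 10, 60))  # 70
```

(Under these assumptions the error tail ends about 70 seconds after the commit, which matches the "another minute or two" estimate given in the channel.)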
[17:51:11] (03CR) 10Herron: [C: 03+1] profile: move phatality resources from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/891392 (owner: 10Cwhite) [17:52:10] (03CR) 10Herron: [C: 03+1] logstash: remove SEVERITY_LABEL from syslog messages [puppet] - 10https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [17:53:32] (03CR) 10Ssingh: [C: 03+2] Revert "ntp/ulsfo: set to dns4004" [dns] - 10https://gerrit.wikimedia.org/r/891304 (owner: 10Ssingh) [17:54:56] 10SRE, 10Infrastructure-Foundations: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10ssingh) Confirming that I did a new reimage and it completed successfully. Thanks everyone who worked on this to resolve it so quickly. [17:55:57] !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [17:56:08] !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 10s) [18:00:04] bd808: Your horoscope predicts another unfortunate Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T1800). 
[18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T1800) [18:00:33] * bd808 should have one or two things to push out [18:08:52] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [18:09:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:11:08] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:14:24] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1002.eqiad.wmnet [18:14:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns5003.wikimedia.org with OS bullseye [18:15:06] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS bullseye [18:15:24] some DNS and BGP alerts expected (in cr*-eqsin): please ignore thank you [18:15:25] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2021'] [18:18:39] (03PS1) 10BryanDavis: toolhub: Bump container version to 2023-02-23-121711-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/891592 [18:19:39] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:20:31] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:09] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10SHust) **Latest Shopify update after escalation -->** Thanks, Sandra, I have a 
good direction to go on this with my team. Looks like we do have a hard no on the "includeSubDomains;... [18:21:37] (Nonwrite HTTP requests with primary DB writes alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [18:21:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:23:04] https://phabricator.wikimedia.org/T330422 created for dba to review [18:24:40] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2021'] [18:25:38] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2023-02-23-121711-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/891592 (owner: 10BryanDavis) [18:26:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:28:07] (03PS1) 10David Caro: cloud: add tests for >buster os [puppet] - 10https://gerrit.wikimedia.org/r/891593 [18:28:13] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:28:22] ^ expected [18:30:35] (03CR) 10Majavah: [C: 03+1] "Minor nit inline, although as a temporary state this is ok to me as is too." 
[puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [18:30:50] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2023-02-23-121711-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/891592 (owner: 10BryanDavis) [18:32:12] (03CR) 10David Caro: cloud: add tests for >buster os (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891593 (owner: 10David Caro) [18:34:21] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye [18:34:33] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [18:35:14] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [18:35:28] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [18:36:20] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:36:38] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [18:36:49] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [18:37:43] jinxer-wm: thanks [18:38:07] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [18:38:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2022'] [18:40:35] PROBLEM - puppet last run on puppetdb2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:41:37] (Nonwrite HTTP requests with primary DB writes alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [18:41:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:43:11] puppetdb2003 is certainly interesting [18:43:15] agent not explicitly disabled [18:43:23] not a new host, as per https://phabricator.wikimedia.org/T317894 [18:45:24] (03CR) 10BCornwall: [C: 03+1] acme_chief: support several passive hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [18:45:32] puppetdb role applied on Feb 10 as per a5b3212f6 [18:45:44] https://puppetboard.wikimedia.org/node/puppetdb2003.codfw.wmnet ?? [18:45:51] iirc that's one of the puppet(db) 7 test hosts [18:46:20] yeah it's running bookworm/sid [18:46:29] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:47:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2022'] [18:48:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5003.wikimedia.org with reason: host reimage [18:49:18] !log run puppet agent on puppetdb2003 [18:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:33] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:51:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5003.wikimedia.org with reason: host reimage [18:52:25] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:53:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [18:53:41] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:54:43] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out 
https://wikitech.wikimedia.org/wiki/DNS [18:54:54] ^ expected [18:59:09] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [19:00:05] hashar and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T1900). [19:03:15] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:03:23] ^ expected [19:04:36] (03PS1) 10Cwhite: opensearch_dashboards: restart every 7days [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) [19:08:22] (03PS5) 10Andrew Bogott: cloud-vps: purge leaked VM images, daily [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) [19:08:37] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:09:17] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:10:37] (03CR) 10Cwhite: "RuntimeRandomizedExtraSec was introduced in systemd 250 which would give us the ability to splay the restarts, but bullseye is systemd 247" [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: 10Cwhite) [19:11:03] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:11:21] RECOVERY - puppet last run on puppetdb2003 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:11:23] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:12:05] PROBLEM - puppet last run on puppetdb1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:13:14] 10SRE: too many puppet failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10Dzahn) [19:13:18] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10Dzahn) [19:13:24] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-02-22-090459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/891601 [19:14:45] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10Dzahn) [19:14:49] 10SRE: too many puppet failures (puppet errors on logstash hosts) - https://phabricator.wikimedia.org/T330361 (10Dzahn) [19:16:01] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:16:23] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:16:46] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: purge leaked VM images, daily [puppet] - 10https://gerrit.wikimedia.org/r/891560 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [19:17:29] 10SRE: too many puppet failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10Dzahn) Another duplicate at T330394 :) [19:17:47] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:18:07] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:18] 10SRE: too many puppet 
failures (puppet errors on stat hosts) - https://phabricator.wikimedia.org/T330360 (10Dzahn) [19:19:38] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [19:20:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5003.wikimedia.org with OS bullseye [19:21:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS bullseye completed: - dns5003 (**PASS**) - Downtimed on Icinga/Al... [19:21:24] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/891392 (owner: 10Cwhite) [19:21:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [19:22:11] 10SRE: too many puppet failures (puppet errors on logstash hosts) - https://phabricator.wikimedia.org/T330361 (10Dzahn) Looks like it was solved in T326668. 
[19:23:57] 10SRE: too many puppet failures (puppet errors on logstash hosts) - https://phabricator.wikimedia.org/T330361 (10Dzahn) [19:24:18] 10SRE, 10Scap, 10Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (10Dzahn) [19:30:31] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-02-22-090459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/891601 (owner: 10BryanDavis) [19:35:29] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-02-22-090459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/891601 (owner: 10BryanDavis) [19:45:04] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [19:45:17] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [19:45:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:45:24] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [19:45:44] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [19:45:49] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [19:46:26] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [19:46:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49566 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:50:19] !log aphlict2001 - manually created /etc/phabricator/config.yaml - empty file owned by root:phab-deploy to debug for T330393 T322369 [19:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:25] T330393: puppet/scap: failing on aphlict2001 - https://phabricator.wikimedia.org/T330393 [19:50:26] T322369: 
create aphlict2001 (Phabricator realtime notifications codfw) - https://phabricator.wikimedia.org/T322369 [19:50:34] !log brennen@deploy1002 Started deploy [phabricator/deployment@3f2dd1b]: test deploy to aphlict2001 [19:51:12] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host relforge1003.eqiad.wmnet [19:51:45] !log brennen@deploy1002 Finished deploy [phabricator/deployment@3f2dd1b]: test deploy to aphlict2001 (duration: 01m 10s) [19:52:15] (03PS1) 10BCornwall: ntp/eqsin: set to dns5003 [dns] - 10https://gerrit.wikimedia.org/r/891612 [19:53:56] !log brennen@deploy1002 Started deploy [phabricator/deployment@3f2dd1b]: test deploy to aphlict2001, take 2 [19:54:32] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply JRE updates - bking@cumin1001 - T329957 [19:54:36] T329957: Restart Elastic/Blazegraph services to pick up JRE updates - https://phabricator.wikimedia.org/T329957 [19:55:00] !log brennen@deploy1002 Finished deploy [phabricator/deployment@3f2dd1b]: test deploy to aphlict2001, take 2 (duration: 01m 04s) [19:55:18] (03CR) 10Ssingh: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/891612 (owner: 10BCornwall) [19:55:37] (03CR) 10BCornwall: [C: 03+2] ntp/eqsin: set to dns5003 [dns] - 10https://gerrit.wikimedia.org/r/891612 (owner: 10BCornwall) [19:58:47] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host relforge1003.eqiad.wmnet [20:05:39] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [20:05:39] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [20:06:42] !log brennen@deploy1002 Started deploy [phabricator/deployment@3f2dd1b]: test deploy to aphlict2001, take 3 [20:08:09] 10SRE, 10observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10andrea.denisse) Hi Daniel, Filippo is correct. Progress of the decommission is tracked on [[ https://phabricator.wikimedia.org/T328803 | T328803 ]].... [20:08:26] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [20:08:36] 10SRE, 10observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10andrea.denisse) [20:09:55] !log brennen@deploy1002 Finished deploy [phabricator/deployment@3f2dd1b]: test deploy to aphlict2001, take 3 (duration: 03m 13s) [20:10:05] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host relforge1004.eqiad.wmnet [20:10:28] 10SRE, 10observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10andrea.denisse) a:03andrea.denisse [20:16:47] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host relforge1004.eqiad.wmnet [20:18:05] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [20:20:46] (03PS1) 10Ryan Kemper: Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 
[20:21:05] (03CR) 10CI reject: [V: 04-1] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (owner: 10Ryan Kemper) [20:21:10] (03PS2) 10Ryan Kemper: [WIP] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 [20:21:21] (03CR) 10CI reject: [V: 04-1] [WIP] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (owner: 10Ryan Kemper) [20:21:29] (03CR) 10Ryan Kemper: [C: 04-1] "Sticking a -1 on here until these hosts are ready" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (owner: 10Ryan Kemper) [20:21:38] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [20:21:59] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:35] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [20:26:17] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [20:35:39] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [20:38:02] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10BCornwall) Carrying over a conversation from https://gerrit.wikimedia.org/r/c/802592 in which @tstarling says: > Ideally I would like there to be... 
[20:41:34] (03PS3) 10Bking: wdqs: make depool the default behavior [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: 10Ryan Kemper) [20:42:12] (03PS4) 10Bking: wdqs: make depool the default behavior [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: 10Ryan Kemper) [20:43:08] (03PS5) 10Ryan Kemper: wdqs: make depool the default behavior [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) [20:43:27] (03PS6) 10Ryan Kemper: wdqs: make depool the default behavior [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) [20:45:55] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [20:49:47] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2022'] [20:50:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2022'] [20:55:22] (03CR) 10Vivian Rook: [C: 03+2] Update dns for paws prometheus [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) (owner: 10Vivian Rook) [20:56:48] 10SRE, 10observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10Dzahn) @andrea.denisse Don't worry about silences. I was just looking specifically at a dashboard to see only failed attempts from blackbox::http be... 
[20:57:05] (03PS5) 10Zabe: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) [20:57:33] 10SRE, 10observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10Dzahn) p:05Triage→03Low [20:57:38] (03PS6) 10Zabe: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) [20:57:42] (03CR) 10CI reject: [V: 04-1] Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [20:59:35] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) What a mess. So if I'm understanding this correctly: * They refuse to include `includeSubDomains` unless it "makes sense" (what does that even mean?) ** Which means that p... [20:59:41] (03PS7) 10Zabe: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) [21:00:05] brennen and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T2100). [21:00:51] Nothing in the window :) [21:00:58] no doggy? [21:01:39] -_- that took me a moment.. [21:01:41] how much is that scappy in the window? [21:02:31] Why didn't jouncebot say there was nothing scheduled? [21:03:07] Slacking 😌 [21:04:35] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BBlack) I assume "makes sense" here is probably cases where shopify knows of or has configured actual subdomains of the domain in question, or something like that. 
In either case, ye... [21:07:08] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [21:09:09] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BBlack) Maybe worth pointing out (I had an old stale link to this years ago earlier in the ticket), if nothing else because it may cause whomever at shopify to actually reach out to a... [21:12:09] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS bullseye [21:12:20] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns5004.wikimedia.org with OS bullseye [21:13:31] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) My interpretation of the response was that they also refuse to set the `preload` value. [21:14:48] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10TheDJ) > check for the header with curl during install, and warn the user if it is not present. I guess we can do this for upgrades, but for fresh... 
[21:14:56] (03PS7) 10Ryan Kemper: wdqs: make depool the default behavior [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) [21:15:30] (03PS8) 10Zabe: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) [21:15:40] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2013'] [21:15:55] PROBLEM - Host 2001:df2:e500:1:103:102:166:8 is DOWN: PING CRITICAL - Packet loss = 100% [21:16:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['wdqs2013'] [21:16:59] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:17:37] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [21:17:45] (03PS2) 10Dzahn: Revert "ci::firewall: allow http monitoring from prometheus hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891291 [21:18:10] sukhe: dns5004 -known ? 
[21:18:29] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:18:43] expected, brett is reimaging :) [21:18:48] ack:) [21:18:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) [21:19:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:19:52] (03PS3) 10Dzahn: Revert "ci::firewall: allow http monitoring from prometheus hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891291 [21:20:18] jouncebot: nowandnext [21:20:18] For the next 0 hour(s) and 39 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230223T2100) [21:20:19] In 9 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230224T0700) [21:21:37] (03CR) 10Zabe: [C: 03+2] Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [21:22:25] (03Merged) 10jenkins-bot: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [21:25:05] Hi TheresNoTime, would you mind running a maint script for task T327114? 
[21:25:06] T327114: Create New Page Reviewer user right in Nepali Wikipedia - https://phabricator.wikimedia.org/T327114 [21:25:15] cirno: sure [21:25:53] !log create Azerbaijani Wikimedians User Group wiki # T306015 [21:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:59] T306015: Create a wiki for Azerbaijani Wikimedians User Group - https://phabricator.wikimedia.org/T306015 [21:26:19] cirno: one moment [21:27:33] (03CR) 10Dzahn: [C: 03+2] "as taavi correctly pointed out this should not be needed. the "connection refused" was just because on contint* port 443 is not used but 1" [puppet] - 10https://gerrit.wikimedia.org/r/891291 (owner: 10Dzahn) [21:28:34] (03PS1) 10Papaul: Add new wdqs node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891651 (https://phabricator.wikimedia.org/T326689) [21:28:43] cirno: to confirm, I'm doing this for the group `reviewer` ? [21:28:51] yep [21:29:03] (03CR) 10Dzahn: [C: 03+2] "ultimate test will be to watch the dashboard for new errors after merge - https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f" [puppet] - 10https://gerrit.wikimedia.org/r/891291 (owner: 10Dzahn) [21:29:05] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10TheDJ) Ugh.. ok. i see that the envCheckUploadsDirectory of the installer is not even checking wgUploadPath... and wgUploadBaseUrl either. 
[21:29:11] https://ne.wikipedia.org/w/index.php?title=Special:Listusers&group=reviewer [21:29:12] !log zabe@deploy1002 Started scap: create azwikimedia T306015 [21:29:51] !log `[samtar@mwmaint1002 ~]$ mwscript maintenance/emptyUserGroup.php --wiki newiki reviewer` for T327114 [21:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:05] (03CR) 10Papaul: [C: 03+2] Add new wdqs node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891651 (https://phabricator.wikimedia.org/T326689) (owner: 10Papaul) [21:30:45] cirno: done :) [21:30:56] ty! [21:31:48] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:32:31] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply JRE updates - bking@cumin1001 - T329957 [21:32:35] T329957: Restart Elastic/Blazegraph services to pick up JRE updates - https://phabricator.wikimedia.org/T329957 [21:36:00] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:37:06] !log zabe@deploy1002 Finished scap: create azwikimedia T306015 (duration: 07m 54s) [21:37:10] T306015: Create a wiki for Azerbaijani Wikimedians User Group - https://phabricator.wikimedia.org/T306015 [21:38:02] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:42:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2013.codfw.wmnet with OS bullseye [21:42:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2013.codfw.wmnet with OS bullseye [21:43:57] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:44:18] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:44:45] (JobUnavailable) 
resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:44:52] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:45:00] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:45:36] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [21:45:44] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:46:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:46:40] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:47:08] (03PS1) 10Zabe: Enable Translate extension on azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891655 (https://phabricator.wikimedia.org/T306015) [21:47:33] (03CR) 10Zabe: [C: 03+2] Enable Translate extension on azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891655 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [21:48:22] (03Merged) 10jenkins-bot: Enable Translate extension on azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891655 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [21:48:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [21:53:39] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [21:53:54] zabe: how did the wiki creation go? [21:54:56] pretty flawless [21:55:17] good :). 
it's somewhat an extraordinary thing for new wiki creations :)) [21:55:50] yeah, I hope I didn't mess up something within the initial config [21:55:54] and it wasn't just any wiki either, it's an affiliate that had to go through affcom :) [21:56:08] and that, special wikis are more difficult to create than non-special :) [21:56:11] seems like we're missing the DBA task though [21:56:14] zabe: could you create it? [21:56:14] yea, thanks zabe [21:56:21] !log zabe@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T306015 (duration: 06m 49s) [21:56:21] ^^ [21:56:27] T306015: Create a wiki for Azerbaijani Wikimedians User Group - https://phabricator.wikimedia.org/T306015 [21:56:41] sure [21:56:48] ty [21:58:28] the initial config looks good on first sight [22:01:53] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891395 [22:01:55] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891395 (owner: 10Zabe) [22:02:01] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [22:02:37] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891395 (owner: 10Zabe) [22:06:25] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [22:06:31] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [22:10:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2013.codfw.wmnet with reason: host reimage [22:11:13] !log zabe@deploy1002 Synchronized wmf-config/interwiki.php: [[gerrit:891395]] (duration: 07m 11s) [22:12:09] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:12:37] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:13:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2013.codfw.wmnet with reason: host reimage [22:14:24] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:17:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5004.wikimedia.org with OS bullseye [22:17:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns5004.wikimedia.org with OS bullseye completed: - dns5004 (**PASS**) - Downtimed on Icinga/Al... 
[22:20:07] (PS1) BCornwall: Revert "ntp/eqsin: set to dns5003" [dns] - https://gerrit.wikimedia.org/r/891627
[22:22:27] (CR) Ssingh: [C: +2] Revert "ntp/eqsin: set to dns5003" [dns] - https://gerrit.wikimedia.org/r/891627 (owner: BCornwall)
[22:28:02] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[22:29:10] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:30:03] (PS1) Jforrester: build: Pin PHPUnit to 9.5.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/891672
[22:32:14] (CR) Ryan Kemper: [C: +2] wdqs: make depool the default behavior [cookbooks] - https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: Ryan Kemper)
[22:32:43] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns6002.wikimedia.org with OS bullseye
[22:32:53] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns6002.wikimedia.org with OS bullseye
[22:34:23] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply JRE updates - bking@cumin1001 - T329957
[22:34:27] T329957: Restart Elastic/Blazegraph services to pick up JRE updates - https://phabricator.wikimedia.org/T329957
[22:34:29] (Merged) jenkins-bot: wdqs: make depool the default behavior [cookbooks] - https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: Ryan Kemper)
[22:36:28] PROBLEM - Host 2a02:ec80:600:2:185:15:58:37 is DOWN: CRITICAL - Destination Unreachable (2a02:ec80:600:2:185:15:58:37)
[22:36:38] ^nothing to see here
[22:37:28] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:37:36] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:38:43] (CR) Reedy: [C: +2] build: Pin PHPUnit to 9.5.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/891672 (owner: Jforrester)
[22:38:58] PROBLEM - Recursive DNS on 185.15.58.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[22:39:20] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:40:14] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:40:24] (Merged) jenkins-bot: build: Pin PHPUnit to 9.5.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/891672 (owner: Jforrester)
[22:42:25] (PS1) Dzahn: alertmanager: add missing route for serviceops-collab severity task [puppet] - https://gerrit.wikimedia.org/r/891690 (https://phabricator.wikimedia.org/T329587)
[22:42:41] (CR) Hoo man: [C: +1] Remove unused Wikibase config variables [mediawiki-config] - https://gerrit.wikimedia.org/r/891571 (https://phabricator.wikimedia.org/T330410) (owner: Lucas Werkmeister (WMDE))
[22:43:13] SRE, MediaWiki-File-management, Traffic, Patch-For-Review, Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (TheDJ) whipped up a beginning for an installer check, but requires more work. (i'm about to go on vacation, so might be a while before i can get bac...
[22:43:54] (CR) Dzahn: [C: +2] alertmanager: add missing route for serviceops-collab severity task [puppet] - https://gerrit.wikimedia.org/r/891690 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[22:45:57] (PS1) Zabe: add amical.wm.o for Amical Wikimedia wiki [dns] - https://gerrit.wikimedia.org/r/891698 (https://phabricator.wikimedia.org/T330390)
[22:50:29] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:51:09] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:52:38] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns6002.wikimedia.org with reason: host reimage
[22:53:50] (CR) Dzahn: [C: +2] add amical.wm.o for Amical Wikimedia wiki [dns] - https://gerrit.wikimedia.org/r/891698 (https://phabricator.wikimedia.org/T330390) (owner: Zabe)
[22:55:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns6002.wikimedia.org with reason: host reimage
[22:55:50] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:58:15] thanks mutante
[22:58:52] PROBLEM - Recursive DNS on 185.15.58.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[23:00:22] zabe: you are welcome. the part I am not sure about is why they really need a private wiki instead of fishbowl. "internal working" is vague and sounds more like "default to private" instead of default to public
[23:00:32] but that is of course just wiki config later
[23:01:04] I did like how you linked right to WHY they are eligible. that was perfect
[23:01:08] laters
[23:02:07] well, they even attached a config file. so that's a good base for discussion
[23:07:00] RECOVERY - Recursive DNS on 185.15.58.37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[23:07:24] (CR) BryanDavis: [C: +1] developer-portal: Switch state to production [puppet] - https://gerrit.wikimedia.org/r/891502 (https://phabricator.wikimedia.org/T297140) (owner: Alexandros Kosiaris)
[23:15:20] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:15:24] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:17:39] SRE, serviceops, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Quiddity)
[23:18:00] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:19:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns6002.wikimedia.org with OS bullseye
[23:20:04] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns6002.wikimedia.org with OS bullseye completed: - dns6002 (**PASS**) - Downtimed on Icinga/Al...
[23:21:28] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[23:25:40] !log mwscript namespaceDupes.php shnwiktionary --fix # T330456
[23:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:45] T330456: Lost pages after deployed addtional namespaces on shn.wiktionary - https://phabricator.wikimedia.org/T330456
[23:26:59] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:27:15] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:46:30] (CR) Dzahn: [C: +2] switch planet from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:46:34] (PS4) Dzahn: switch planet from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091)