[00:00:16] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:00:52] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2165.mgmt.codfw.wmnet with reboot policy FORCED
[00:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2163.mgmt.codfw.wmnet with reboot policy FORCED
[00:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2166.mgmt.codfw.wmnet with reboot policy FORCED
[00:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:02] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[00:23:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2161.codfw.wmnet with OS bullseye
[00:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:20] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2161.codfw.wmnet with OS bullseye
[00:26:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2165.mgmt.codfw.wmnet with reboot policy FORCED
[00:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:41] !log pt1979@cumin2002 END (PASS) - Cookbook
sre.hosts.provision (exit_code=0) for host db2166.mgmt.codfw.wmnet with reboot policy FORCED
[00:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2167.mgmt.codfw.wmnet with reboot policy FORCED
[00:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2168.mgmt.codfw.wmnet with reboot policy FORCED
[00:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2162.codfw.wmnet with OS bullseye
[00:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:54] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2162.codfw.wmnet with OS bullseye
[00:42:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2161.codfw.wmnet with reason: host reimage
[00:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2161.codfw.wmnet with reason: host reimage
[00:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2167.mgmt.codfw.wmnet with reboot policy FORCED
[00:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2168.mgmt.codfw.wmnet with reboot policy FORCED
[00:51:11] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:30] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[00:57:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2162.codfw.wmnet with reason: host reimage
[00:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2161.codfw.wmnet with OS bullseye
[01:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:19] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2161.codfw.wmnet with OS bullseye completed: - db2...
[01:00:28] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:02:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2162.codfw.wmnet with reason: host reimage
[01:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:49] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:52] (PS6) Krinkle: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604
[01:12:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2163.codfw.wmnet with OS bullseye
[01:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:20] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD)
rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2163.codfw.wmnet with OS bullseye
[01:15:41] (CR) Krinkle: [C: +2] build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604 (owner: Krinkle)
[01:16:42] (Merged) jenkins-bot: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604 (owner: Krinkle)
[01:17:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2162.codfw.wmnet with OS bullseye
[01:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:24] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2162.codfw.wmnet with OS bullseye completed: - db2...
[01:19:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:20:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:44] (PS7) Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605
[01:23:07] !log krinkle@deploy1002 Synchronized tests/: I796f38d0f04600c (1/3) (duration: 03m 41s)
[01:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:32] (CR) Krinkle: [C: +2] build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[01:26:04] (PS11) Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[01:26:19] (Merged) jenkins-bot: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[01:26:59] !log krinkle@deploy1002 Synchronized multiversion/: I796f38d0f04600c (2/3) (duration: 03m 32s)
[01:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:17] (PS12) Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352
(https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[01:30:48] !log krinkle@deploy1002 Synchronized src/: I796f38d0f04600c (3/3) (duration: 03m 24s)
[01:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2163.codfw.wmnet with reason: host reimage
[01:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2163.codfw.wmnet with reason: host reimage
[01:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:42] (PS1) Mary Yang: DO-NOT-SUBMIT(Under local test, not yet ready for review): Add puppet profile and role files for wikifunctions. [puppet] - https://gerrit.wikimedia.org/r/810146
[01:36:48] (CR) CI reject: [V: -1] DO-NOT-SUBMIT(Under local test, not yet ready for review): Add puppet profile and role files for wikifunctions.
[puppet] - https://gerrit.wikimedia.org/r/810146 (owner: Mary Yang)
[01:39:47] !log krinkle@deploy1002 Synchronized tests/: I60edfb0f60 (1/3) (duration: 03m 32s)
[01:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2163.codfw.wmnet with OS bullseye
[01:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:55] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2163.codfw.wmnet with OS bullseye completed: - db2...
[01:54:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2165.codfw.wmnet with OS bullseye
[01:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:11] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2165.codfw.wmnet with OS bullseye
[02:01:37] !log krinkle@deploy1002 Synchronized multiversion/: I60edfb0f60 (2/3) (duration: 03m 34s)
[02:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:23] (CR) Tim Starling: Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[02:03:28] (PS13) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[02:06:32] !log krinkle@deploy1002 Synchronized wmf-config/: I60edfb0f60 (3/3) (duration: 03m 31s)
[02:06:35] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:13:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2165.codfw.wmnet with reason: host reimage
[02:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:16:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2165.codfw.wmnet with reason: host reimage
[02:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:12] (PS13) Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[02:30:13] (PS1) Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821)
[02:30:16] (PS1) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[02:31:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2165.codfw.wmnet with OS bullseye
[02:31:15] (CR) CI reject: [V: -1] multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[02:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:20] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet -
https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2165.codfw.wmnet with OS bullseye completed: - db2...
[02:31:22] (CR) CI reject: [V: -1] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (owner: Krinkle)
[03:01:50] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:07:28] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:11:51] (PS2) Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821)
[03:11:53] (PS2) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[03:12:45] (CR) CI reject: [V: -1] multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[03:12:48] (CR) CI reject: [V: -1] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (owner: Krinkle)
[03:14:34] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:17:34] (CR) Tim Starling: Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[03:18:09] (PS3) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[03:18:26] (CR) Tim Starling: Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[03:18:51] (PS3) Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821)
[03:18:53] (PS4) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[03:19:13] (CR) CI reject: [V: -1] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (owner: Krinkle)
[03:20:00] (CR) CI reject: [V: -1] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (owner: Krinkle)
[03:20:48] (PS5) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[03:21:45] (PS6) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (https://phabricator.wikimedia.org/T169821)
[03:39:13] (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:20:38] (PS14) Krinkle:
noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[04:20:40] (PS4) Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821)
[04:20:42] (PS7) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (https://phabricator.wikimedia.org/T169821)
[04:23:54] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 1859 MB (3% inode=97%): /tmp 1859 MB (3% inode=97%): /var/tmp 1859 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[05:29:39] (PS1) Marostegui: mariadb: Add db2153 to s1 [puppet] - https://gerrit.wikimedia.org/r/810149 (https://phabricator.wikimedia.org/T311493)
[05:30:11] (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/810033 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:30:18] (PS2) Muehlenhoff: certspotter: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810033 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:30:32] (PS2) Marostegui: mariadb: Add db2153 to s1 [puppet] - https://gerrit.wikimedia.org/r/810149 (https://phabricator.wikimedia.org/T311493)
[05:31:24] (CR) Marostegui: [C: +2] mariadb: Add db2153 to s1 [puppet] - https://gerrit.wikimedia.org/r/810149 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:35:55] (CR) Muehlenhoff: "Could you please also add the header to modules/cumin/files/reboot-host?"
[puppet] - https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:39:39] (PS1) Marostegui: instances.yaml: Remove db2092 from dbctl [puppet] - https://gerrit.wikimedia.org/r/810150 (https://phabricator.wikimedia.org/T311802)
[05:40:26] (CR) Marostegui: [C: +2] instances.yaml: Remove db2092 from dbctl [puppet] - https://gerrit.wikimedia.org/r/810150 (https://phabricator.wikimedia.org/T311802) (owner: Marostegui)
[05:41:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2092 from dbctl T311802', diff saved to https://phabricator.wikimedia.org/P30701 and previous config saved to /var/cache/conftool/dbconfig/20220701-054102-marostegui.json
[05:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:08] T311802: decommission db2092 - https://phabricator.wikimedia.org/T311802
[05:41:47] (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/810035 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:41:54] (PS2) Muehlenhoff: pdns_server: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810035 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:41:56] (PS1) Marostegui: db2092: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810151
[05:43:25] (CR) Marostegui: [C: +2] db2092: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810151 (owner: Marostegui)
[05:50:37] (PS1) Marostegui: mariadb: Add db2154 to s8 [puppet] - https://gerrit.wikimedia.org/r/810152 (https://phabricator.wikimedia.org/T311493)
[05:51:24] (CR) Marostegui: [C: +2] mariadb: Add db2154 to s8 [puppet] - https://gerrit.wikimedia.org/r/810152 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:58:49] (PS1) Marostegui: instances.yaml: Remove db2091 from dbctl [puppet] - https://gerrit.wikimedia.org/r/810153
(https://phabricator.wikimedia.org/T311803)
[05:59:36] (CR) Marostegui: [C: +2] instances.yaml: Remove db2091 from dbctl [puppet] - https://gerrit.wikimedia.org/r/810153 (https://phabricator.wikimedia.org/T311803) (owner: Marostegui)
[06:00:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2091 from dbctl T311803', diff saved to https://phabricator.wikimedia.org/P30703 and previous config saved to /var/cache/conftool/dbconfig/20220701-060000-marostegui.json
[06:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] T311803: decommission db2091 - https://phabricator.wikimedia.org/T311803
[06:04:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:05:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:13:27] (PS1) Marostegui: db2091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810155 (https://phabricator.wikimedia.org/T311803)
[06:14:10] (CR) Marostegui: [C: +2] db2091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810155 (https://phabricator.wikimedia.org/T311803) (owner: Marostegui)
[06:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:18:05] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:19:59] RECOVERY - k8s API server
requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:25:39] (CR) Muehlenhoff: [C: +2] uwsgi: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810036 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:25:49] (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/810036 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:25:57] (PS2) Muehlenhoff: uwsgi: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810036 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:33:55] (CR) Muehlenhoff: [C: +2] Drop references to puppet source files [puppet] - https://gerrit.wikimedia.org/r/810014 (owner: Muehlenhoff)
[06:36:31] (CR) Muehlenhoff: [C: +2] jupyterhub: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/809628 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:36:37] (PS2) Muehlenhoff: jupyterhub: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/809628 (https://phabricator.wikimedia.org/T308013)
[06:51:20] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:18] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:57:30] (CR) Majavah: [C: +1] "looks correct" [puppet] - https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: Andrew Bogott)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220701T0700)
[07:12:22] RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[07:15:17] (PS2) Zabe: cumin: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013)
[07:16:03] (CR) Zabe: cumin: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:18:08] (PS1) Filippo Giunchedi: swift: turn off uwsgi request logging [puppet] - https://gerrit.wikimedia.org/r/810276 (https://phabricator.wikimedia.org/T297959)
[07:18:36] (CR) Samwilson: [C: -1] Enable edit-in-sequence on Beta Wikisource for testing (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[07:18:39] (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:20:55] (PS2) Filippo Giunchedi: swift: turn off uwsgi request logging [puppet] - https://gerrit.wikimedia.org/r/810276 (https://phabricator.wikimedia.org/T297959)
[07:24:35] (CR) Filippo Giunchedi: [C: +2] "LGTM, thank you Majavah!"
[puppet] - https://gerrit.wikimedia.org/r/810039 (owner: Majavah)
[07:24:37] (PS1) Marostegui: site.pp: Remove db2153-2154 from insetup role [puppet] - https://gerrit.wikimedia.org/r/810278 (https://phabricator.wikimedia.org/T306927)
[07:25:24] (CR) Marostegui: [C: +2] site.pp: Remove db2153-2154 from insetup role [puppet] - https://gerrit.wikimedia.org/r/810278 (https://phabricator.wikimedia.org/T306927) (owner: Marostegui)
[07:27:56] (PS1) Marostegui: instances.yaml: Add db2153 to dbctl [puppet] - https://gerrit.wikimedia.org/r/810279 (https://phabricator.wikimedia.org/T311493)
[07:30:59] (CR) Marostegui: [C: +2] instances.yaml: Add db2153 to dbctl [puppet] - https://gerrit.wikimedia.org/r/810279 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:32:13] (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: BCornwall)
[07:35:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2153 to s1 T311493', diff saved to https://phabricator.wikimedia.org/P30704 and previous config saved to /var/cache/conftool/dbconfig/20220701-073512-marostegui.json
[07:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:17] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[07:39:13] (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:40:14] (PS1) Marostegui: instances.yaml: Add db2154 to dbctl [puppet] - https://gerrit.wikimedia.org/r/810281 (https://phabricator.wikimedia.org/T311493)
[07:41:07] (CR) Marostegui: [C: +2] instances.yaml: Add db2154 to dbctl [puppet] - https://gerrit.wikimedia.org/r/810281
(https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:44:26] (PS2) Slyngshede: P:aptrepo::wikimedia move private repo to nginx and uninstall apache [puppet] - https://gerrit.wikimedia.org/r/809969
[07:46:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2154 to s8 T311493', diff saved to https://phabricator.wikimedia.org/P30705 and previous config saved to /var/cache/conftool/dbconfig/20220701-074607-marostegui.json
[07:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:12] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[07:47:36] !log kubemaster2001, restart rsyslog
[07:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:58] (KubernetesRsyslogDown) resolved: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:49:45] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36154/console" [puppet] - https://gerrit.wikimedia.org/r/809969 (owner: Slyngshede)
[08:07:24] SRE, Icinga, Observability-Alerting, observability, and 3 others: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (fgiunchedi) Open→Resolved a:fgiunchedi SRE does get paged nowadays when there's a "low" (FSVO low) availability (i.e....
[08:08:52] SRE, Traffic, Patch-For-Review: Create vm.max_map_count metrics for Prometheus - https://phabricator.wikimedia.org/T311445 (fgiunchedi)
[08:11:13] SRE, Observability-Alerting, Traffic, Patch-For-Review, User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (fgiunchedi)
[08:11:32] (PS1) David Caro: wmcs: Parse enums at argparse level [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810285
[08:12:56] (PS2) Sohom Datta: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098)
[08:13:09] (CR) David Caro: wmcs: vps: create_instance_with_prefix: unbreak (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[08:13:42] (CR) David Caro: "btw. I'm working on fixing the tests (the errors come from master I think), you might have to rebase the patch again, and your local code" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[08:14:47] (CR) Sohom Datta: "Made the changes" [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[08:15:46] (CR) CI reject: [V: -1] wmcs: Parse enums at argparse level [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810285 (owner: David Caro)
[08:16:09] (CR) Vgutierrez: prometheus: Add custom vm.max_map_count metric (3 comments) [puppet] - https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: BCornwall)
[08:16:41] (CR) Majavah: [C: -1] "I don't think this will work as expected, as the debian version check is done on the host being checked and not the one running blackbox-e" [puppet] - https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:17:55] (CR)
10Vgutierrez: [C: 03+1] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [08:18:31] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) I think we're in a good shape wrt spicerack and alertmanager support, is there a... [08:19:31] (03PS2) 10Majavah: keyholder::monitoring: drop nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810040 [08:20:58] (03CR) 10Muehlenhoff: puppet: add wrapper command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808877 (owner: 10Jbond) [08:22:18] 10SRE, 10Dumps-Generation, 10Infrastructure-Foundations (FY2021/2022-Q4), 10Security: Remaining data engineering host security restarts - https://phabricator.wikimedia.org/T307055 (10fgiunchedi) [08:24:59] (03CR) 10DannyS712: "what does "" mean here?" 
[core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 (owner: 10Thcipriani) [08:25:35] (03CR) 10Zabe: Revert "RecentChange: Straight join to actor table when needed" (031 comment) [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 (owner: 10Thcipriani) [08:27:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:27:38] (03CR) 10Filippo Giunchedi: prometheus: adjust check::http params based on distro (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:28:11] (03CR) 10Filippo Giunchedi: prometheus: adjust check::http params based on distro (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:29:01] (03CR) 10Filippo Giunchedi: [C: 03+2] keyholder::monitoring: drop nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810040 (owner: 10Majavah) [08:33:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/809969 (owner: 10Slyngshede) [08:33:45] (03PS1) 10David Caro: sre.ganeti.makevm: Format with black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/810288 [08:35:15] !log Stop mysql on db2073 for cloning db2155 [08:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:40] (03PS1) 10Marostegui: mariadb: Productionize db2155 [puppet] - 10https://gerrit.wikimedia.org/r/810289 (https://phabricator.wikimedia.org/T311493) [08:36:25] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2155 [puppet] - 10https://gerrit.wikimedia.org/r/810289 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [08:44:07] (03PS2) 10David Caro: wmcs: Parse enums at argparse level [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810285 [08:49:37] PROBLEM - etcd request 
latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:51:09] (03CR) 10David Caro: [C: 04-1] "I just rebased the wmcs branch on top of master (and added a fix for some tests), please rebase again!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott) [08:51:49] (03CR) 10David Caro: "Sorted out the errors on the master branch, and rebased wmcs on top of it, please fetch and rebase this again, cheers!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [08:52:01] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:52:30] (03PS2) 10David Caro: wmcs: network: tests: include docs reference [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/738881 (owner: 10Arturo Borrero Gonzalez) [08:52:32] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10Volans) Yes, the whole phase 2 mentioned in T293209#7698301 is still a TODO: * Although we... 
[08:52:59] (03PS2) 10Majavah: wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 [08:53:11] (03PS8) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 [08:54:23] (03CR) 10David Caro: "I'd recommend creating a different cookbook for many instances that just calls this in a loop, that prevents adding complexity to this one" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707 (owner: 10Arturo Borrero Gonzalez) [08:55:00] (03PS3) 10David Caro: wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez) [08:55:28] (03CR) 10CI reject: [V: 04-1] wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez) [08:55:51] (03PS2) 10David Caro: wmcs: vps: remove_instance: add support for puppet deactivation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [08:56:00] (03PS2) 10David Caro: wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [08:59:49] (03CR) 10CI reject: [V: 04-1] wmcs: vps: remove_instance: add support for puppet deactivation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [09:00:42] (03CR) 10CI reject: [V: 04-1] wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [09:03:00] (03CR) 10David Caro: [C: 03+2] wmcs: network: tests: include docs reference [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/738881 (owner: 10Arturo Borrero Gonzalez) [09:07:16] 10SRE, 10Dumps-Generation, 10Infrastructure-Foundations 
(FY2021/2022-Q4), 10Security: Remaining data engineering host security restarts - https://phabricator.wikimedia.org/T307055 (10BTullis) 05Open→03Resolved a:03BTullis [09:08:23] (03PS2) 10Majavah: keyholder::monitoring: drop absented resources [puppet] - 10https://gerrit.wikimedia.org/r/810041 [09:08:25] (03PS1) 10Majavah: keyholder::monitoring: remove source for absent nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810294 [09:08:36] (03CR) 10David Caro: [C: 03+2] labstore: update monitoring for nrpe changes [puppet] - 10https://gerrit.wikimedia.org/r/799318 (owner: 10Majavah) [09:08:57] (03Merged) 10jenkins-bot: wmcs: network: tests: include docs reference [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/738881 (owner: 10Arturo Borrero Gonzalez) [09:09:05] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:23] (03CR) 10David Caro: [C: 03+2] openstack: remove unused check_ssl_certfile [puppet] - 10https://gerrit.wikimedia.org/r/793423 (owner: 10Majavah) [09:10:34] (03CR) 10Vgutierrez: [C: 03+1] keyholder::monitoring: remove source for absent nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810294 (owner: 10Majavah) [09:10:54] (03PS1) 10David Caro: Add mypy tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 [09:12:52] (03CR) 10David Caro: [C: 03+2] wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [09:14:40] (03CR) 10David Caro: [C: 03+2] wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah) [09:19:38] (03CR) 10CI reject: [V: 04-1] Add mypy tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 (owner: 10David Caro) [09:20:31] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:34] (03CR) 10Volans: [C: 04-1] "See inline for all the details, AFAICT just removing the double space on line 186 is enough to fix the reported issue." [cookbooks] - 10https://gerrit.wikimedia.org/r/810288 (owner: 10David Caro) [09:38:35] (03PS12) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [09:39:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [09:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [09:39:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance [09:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:49] (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [09:39:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance [09:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:11] (03Abandoned) 10David Caro: sre.ganeti.makevm: Format with black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/810288 (owner: 10David Caro) [09:49:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:22] 
!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P30708 and previous config saved to /var/cache/conftool/dbconfig/20220701-094927-ladsgroup.json [09:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:31] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [09:56:31] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 51.82 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [10:05:46] (03CR) 10Ladsgroup: [C: 03+1] noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [10:13:15] (03CR) 10LSobanski: [C: 03+1] vtrs: add promtheus blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) (owner: 10Dzahn) [10:14:33] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:16:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P30709 and previous config saved to /var/cache/conftool/dbconfig/20220701-101602-ladsgroup.json [10:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:07] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [10:17:33] (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:20:43] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks! This looks great, I'll merge on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/795380 (owner: 10Majavah) [10:22:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:22:57] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [10:27:38] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) Thank you @Volans, the items all make sense to me. >>! In T293209#8043294, @Vol... 
[10:27:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [10:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [10:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T309311)', diff saved to https://phabricator.wikimedia.org/P30710 and previous config saved to /var/cache/conftool/dbconfig/20220701-102810-ladsgroup.json [10:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:15] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [10:28:36] (03CR) 10Filippo Giunchedi: [C: 03+2] keyholder::monitoring: remove source for absent nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810294 (owner: 10Majavah) [10:31:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P30711 and previous config saved to /var/cache/conftool/dbconfig/20220701-103107-ladsgroup.json [10:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:43] (03PS1) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [10:44:07] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [10:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:16] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s) [10:44:17] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:32] (03CR) 10CI reject: [V: 04-1] define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [10:45:05] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [10:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:14] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [10:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P30712 and previous config saved to /var/cache/conftool/dbconfig/20220701-104612-ladsgroup.json [10:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:41] (03PS1) 10Btullis: Update the partman configuration for the new presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/810305 
(https://phabricator.wikimedia.org/T306835) [10:48:40] (03PS1) 10Muehlenhoff: graphite: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810306 [10:49:15] (03CR) 10CI reject: [V: 04-1] graphite: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810306 (owner: 10Muehlenhoff) [10:50:13] (03CR) 10Btullis: "It's quite possible that this custom/kafka-jumbo.cfg recipe will become more of a standard configuration, wherever we use hardware RAID an" [puppet] - 10https://gerrit.wikimedia.org/r/810305 (https://phabricator.wikimedia.org/T306835) (owner: 10Btullis) [10:50:26] (03CR) 10Btullis: [C: 03+2] Update the partman configuration for the new presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/810305 (https://phabricator.wikimedia.org/T306835) (owner: 10Btullis) [10:54:17] (CirrusSearchHighOldGCFrequency) firing: (4) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:55:58] (03PS1) 10Majavah: P:toolforge: cleanup support for .wmflabs hostnames [puppet] - 10https://gerrit.wikimedia.org/r/810307 [10:56:03] (03PS2) 10Muehlenhoff: graphite: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810306 [10:56:14] (03PS1) 10Muehlenhoff: Remove obsolete profile::base::linux419 [puppet] - 10https://gerrit.wikimedia.org/r/810308 [10:56:53] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36157/console" [puppet] - 10https://gerrit.wikimedia.org/r/810307 (owner: 10Majavah) [10:57:14] (03CR) 10Majavah: Remove obsolete profile::base::linux419 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810308 (owner: 10Muehlenhoff) [10:57:33] (PuppetFailure) firing: (2) Puppet has failed on 
alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:58:02] (03PS2) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [10:58:38] (03CR) 10CI reject: [V: 04-1] define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:00:27] (03PS3) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:01:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P30713 and previous config saved to /var/cache/conftool/dbconfig/20220701-110117-ladsgroup.json [11:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:23] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:01:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P30714 and previous config saved to /var/cache/conftool/dbconfig/20220701-110204-ladsgroup.json [11:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:33] 
(PuppetFailure) resolved: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:02:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) @Cmjohnson I think this should be good to go now. We've identified an additional step that we need to carry out... [11:04:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/810306 (owner: 10Muehlenhoff) [11:06:50] (03PS1) 10Muehlenhoff: profile::parsoid::testing: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810309 [11:07:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10BTullis) >>! In T299466#8040485, @Ottomata wrote: > We will have to rebuild hadoop for bullsye, eh? {T310643} Yep, looks that way. 
[11:08:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P30715 and previous config saved to /var/cache/conftool/dbconfig/20220701-110859-ladsgroup.json [11:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:04] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:10:01] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow forcing the backend for blank page on wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/810312 (https://phabricator.wikimedia.org/T311386) [11:10:03] (03PS1) 10Giuseppe Lavagetto: lvs: check php 7.4 too on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386) [11:10:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10BTullis) @Cmjohnson I think that this should now work if you tweak the RAID controller configuration as described here: T297913#8041258 Let me know if it doesn't b... [11:13:03] (03PS4) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:13:38] (03PS5) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:15:23] (03CR) 10Muehlenhoff: Remove obsolete profile::base::linux419 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810308 (owner: 10Muehlenhoff) [11:16:00] (03PS2) 10Muehlenhoff: Remove obsolete profile::base::linux419 [puppet] - 10https://gerrit.wikimedia.org/r/810308 [11:16:41] (03CR) 10CI reject: [V: 04-1] define osm::planet_sync move from cron to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:19:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) I have manually moved all home directories from `/home` to `/srv/home` and created a symlink. This matches the configuration of all of the other sta... [11:19:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) [11:19:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) 05Open→03Resolved [11:20:56] (03PS6) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:21:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T309311)', diff saved to https://phabricator.wikimedia.org/P30716 and previous config saved to /var/cache/conftool/dbconfig/20220701-112121-ladsgroup.json [11:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:29] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:22:25] (03PS1) 10Muehlenhoff: prometheus::postgres_exporter: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810318 [11:23:35] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:24:01] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 
[11:24:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P30717 and previous config saved to /var/cache/conftool/dbconfig/20220701-112404-ladsgroup.json [11:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:59] (03PS1) 10Muehlenhoff: snapshot: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810319 [11:26:06] (03PS7) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:32:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/810318 (owner: 10Muehlenhoff) [11:34:05] (03PS8) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:34:17] (CirrusSearchHighOldGCFrequency) firing: (4) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:34:51] (03PS1) 10Muehlenhoff: uwsgi: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810321 [11:35:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36161/console" [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:35:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/810321 (owner: 10Muehlenhoff) [11:36:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36162/console" [puppet] - 
10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:36:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P30718 and previous config saved to /var/cache/conftool/dbconfig/20220701-113626-ladsgroup.json [11:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:11] (03PS13) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [11:38:27] (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [11:38:28] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [11:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:37] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [11:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P30719 and previous config saved to /var/cache/conftool/dbconfig/20220701-113909-ladsgroup.json [11:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:12] (03PS1) 10Muehlenhoff: librenms: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810323 [11:41:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/810323 (owner: 10Muehlenhoff) [11:43:39] (03PS1) 10Muehlenhoff: bacula::director: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810325 [11:44:56] (03CR) 10CI reject: 
[V: 04-1] bacula::director: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810325 (owner: 10Muehlenhoff) [11:48:16] (03PS2) 10Muehlenhoff: bacula::director: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810325 [11:51:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P30720 and previous config saved to /var/cache/conftool/dbconfig/20220701-115131-ladsgroup.json [11:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/810306 (owner: 10Muehlenhoff) [11:54:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P30721 and previous config saved to /var/cache/conftool/dbconfig/20220701-115414-ladsgroup.json [11:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:17] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:54:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, though clouddb-wikilabels-01.clouddb-services.eqiad.wmflabs shows up as diff in PCC" [puppet] - 10https://gerrit.wikimedia.org/r/810318 (owner: 10Muehlenhoff) [12:00:19] (03CR) 10Muehlenhoff: prometheus::postgres_exporter: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810318 (owner: 10Muehlenhoff) [12:02:10] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:18] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [12:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:42] 
(03PS3) 10Slyngshede: P:aptrepo::wikimedia move private repo to nginx and uninstall apache [puppet] - 10https://gerrit.wikimedia.org/r/809969 [12:04:57] (03CR) 10Slyngshede: P:aptrepo::wikimedia move private repo to nginx and uninstall apache (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809969 (owner: 10Slyngshede) [12:06:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T309311)', diff saved to https://phabricator.wikimedia.org/P30722 and previous config saved to /var/cache/conftool/dbconfig/20220701-120636-ladsgroup.json [12:06:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [12:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:44] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [12:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [12:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T309311)', diff saved to https://phabricator.wikimedia.org/P30723 and previous config saved to /var/cache/conftool/dbconfig/20220701-120657-ladsgroup.json [12:07:00] (03PS1) 10Filippo Giunchedi: prometheus: split blackbox-exporter logs into a file [puppet] - 10https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) [12:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:55] (03CR) 10CI reject: [V: 04-1] prometheus: split blackbox-exporter logs into a file [puppet] - 10https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) (owner: 10Filippo Giunchedi) [12:09:48] !log bmansurov@deploy1002 Started 
deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:56] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [12:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:17] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10BTullis) @Andrew - I'm starting work on the bigtop build for bullseye now. I hope to have an update for you soon. [12:11:27] (03PS2) 10Filippo Giunchedi: prometheus: split blackbox-exporter logs into a file [puppet] - 10https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) [12:12:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/810318 (owner: 10Muehlenhoff) [12:14:43] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36163/console" [puppet] - 10https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) (owner: 10Filippo Giunchedi) [12:16:08] I'm seeking reviewers for ^ should be straightforward (we do the same for prometheus server itself) [12:19:14] looking [12:19:34] cheers moritzm [12:24:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/810318 (owner: 10Muehlenhoff) [12:24:53] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:26:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) (owner: 10Filippo Giunchedi) [12:26:40] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks Moritz!" 
[puppet] - 10https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) (owner: 10Filippo Giunchedi) [12:34:17] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:37:50] !log uploaded rsyslog 8.2102.0-2+deb11u1+wmf2 to component/rsyslog-k8s (backport of latest security fixes on top of the rsyslog with mmkubernetes plugin) [12:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:05] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:14] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [12:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:45] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:50:37] (03PS1) 10Marostegui: db2073: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/810336 (https://phabricator.wikimedia.org/T311837) [12:52:29] (03CR) 10Marostegui: [C: 03+2] db2073: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/810336 (https://phabricator.wikimedia.org/T311837) (owner: 10Marostegui) [12:52:41] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check 
--job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [12:53:15] (03CR) 10David Caro: [C: 03+2] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/810307 (owner: 10Majavah) [12:57:39] (03PS1) 10David Caro: sre.ganeti.makevm: Remove duplicated space [cookbooks] - 10https://gerrit.wikimedia.org/r/810338 [12:58:36] (03PS1) 10Marostegui: instances.yaml: Add db2155 [puppet] - 10https://gerrit.wikimedia.org/r/810339 (https://phabricator.wikimedia.org/T311493) [12:59:23] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2155 [puppet] - 10https://gerrit.wikimedia.org/r/810339 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [13:01:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2155 to s4 T311493', diff saved to https://phabricator.wikimedia.org/P30724 and previous config saved to /var/cache/conftool/dbconfig/20220701-130106-marostegui.json [13:01:09] (03PS3) 10David Caro: wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah) [13:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:11] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [13:03:14] (03CR) 10David Caro: [C: 03+2] Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:04:18] (03PS1) 10Zabe: RecentChange: Make join to comment table also straight [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810138 (https://phabricator.wikimedia.org/T311360) [13:05:06] (03PS4) 10David Caro: Use our own alert managing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789) [13:05:08] (03PS5) 10David Caro: wmcs: added vm_console runbook [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930) [13:05:10] (03PS5) 10David Caro: wmcs.ceph: don't use sre upgrade-and-reboot [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805327 (https://phabricator.wikimedia.org/T309786) [13:05:12] (03PS3) 10David Caro: wmcs: move alerting code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805376 (https://phabricator.wikimedia.org/T309786) [13:05:14] (03PS3) 10David Caro: wmcs.ceph.upgrade*: add sal logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805377 (https://phabricator.wikimedia.org/T309786) [13:05:16] (03PS4) 10David Caro: wmcs.ceph: move core code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786) [13:05:18] (03PS4) 10David Caro: wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805742 [13:05:21] (03PS2) 10David Caro: wmcs.openstaack: Add runbook to increase the quotas [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/806429 (https://phabricator.wikimedia.org/T297606) [13:05:22] (03PS2) 10David Caro: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 [13:05:24] (03PS1) 10Zabe: Revert "Revert "RecentChange: Straight join to actor table when needed"" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810139 (https://phabricator.wikimedia.org/T311360) [13:08:38] (03PS3) 10Jcrespo: bacula::director: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810325 (owner: 10Muehlenhoff) [13:08:40] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:49] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:08:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:49] (03PS3) 10David Caro: wmcs: Parse enums at argparse level [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810285 [13:11:13] (03CR) 10Jcrespo: "Thank you very much for rising this. If we get a +1 verified, this noop should be ready to deploy. However, allow me to delay deployment f" [puppet] - 10https://gerrit.wikimedia.org/r/810325 (owner: 10Muehlenhoff) [13:11:28] (03CR) 10Jcrespo: [C: 03+1] bacula::director: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810325 (owner: 10Muehlenhoff) [13:12:37] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:45] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T309311)', diff saved to https://phabricator.wikimedia.org/P30725 and previous config saved to /var/cache/conftool/dbconfig/20220701-131316-ladsgroup.json [13:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:21] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [13:19:17] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:19:33] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:19:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:42] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:28] (03PS14) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [13:23:12] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:21] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s) [13:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Cmjohnson) [13:24:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Cmjohnson) [13:24:43] (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [13:28:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P30726 and previous config saved to /var/cache/conftool/dbconfig/20220701-132821-ladsgroup.json [13:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:05] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:13] !log 
bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (10Jgreen) [13:42:23] (03CR) 10EllenR: [C: 03+1] "sorry for delay aok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: 10Eigyan) [13:43:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (10Jgreen) [13:43:25] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:43:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P30727 and previous config saved to /var/cache/conftool/dbconfig/20220701-134326-ladsgroup.json [13:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:33] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 07s) [13:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:57] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:11] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:50:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (10Jgreen) I renamed what was originally "frdev1003" on this task to "frdb1006" because that better describes the server's role. [13:50:19] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:40] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [13:56:12] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) @Marostegui 61,62,63, and 65 are ready as well [13:56:46] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Marostegui) Thank you!! 
[13:57:57] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:58:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T309311)', diff saved to https://phabricator.wikimedia.org/P30728 and previous config saved to /var/cache/conftool/dbconfig/20220701-135831-ladsgroup.json [13:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:39] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:04:59] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:26] (03PS1) 10Jgreen: add frdb1005 and frdb1006 [dns] - 10https://gerrit.wikimedia.org/r/810345 (https://phabricator.wikimedia.org/T306935) [14:12:44] (03CR) 10Jgreen: [C: 03+2] add frdb1005 and frdb1006 [dns] - 10https://gerrit.wikimedia.org/r/810345 (https://phabricator.wikimedia.org/T306935) (owner: 10Jgreen) [14:13:12] Is there anyone that can look at restbase2018? It's been down for a while now. 
[14:14:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (10Jgreen) [14:18:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10BTullis) If it's of any help, our team has just had some success with a similar kind of partman recipe that creates a big LVM vo... [14:26:43] PROBLEM - Host clouddumps1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:03] RECOVERY - Host clouddumps1001 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [14:34:17] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [14:39:36] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudstore[1008-1009] [14:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2166.codfw.wmnet with OS bullseye [14:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:28] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2166.codfw.wmnet with OS bullseye [14:43:29] (03CR) 10Nskaggs: Add dumps mapping to cache_upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: 10BBlack) [14:44:49] !log pt1979@cumin2002 
START - Cookbook sre.hosts.reimage for host db2167.codfw.wmnet with OS bullseye [14:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:56] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2167.codfw.wmnet with OS bullseye [14:46:29] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow forcing the backend for blank page on wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/810312 (https://phabricator.wikimedia.org/T311386) [14:46:31] (03PS2) 10Giuseppe Lavagetto: lvs: check php 7.4 too on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386) [14:46:33] (03PS1) 10Giuseppe Lavagetto: jobrunner: allow selecting explicitly the backend when performing health checks. [puppet] - 10https://gerrit.wikimedia.org/r/810348 (https://phabricator.wikimedia.org/T311386) [14:48:43] (03PS1) 10Andrew Bogott: Remove references to cloudstore100[89] [puppet] - 10https://gerrit.wikimedia.org/r/810349 (https://phabricator.wikimedia.org/T311844) [14:48:45] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [14:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [14:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2166.codfw.wmnet with reason: host reimage [14:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:59:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [14:59:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T309311)', diff saved to https://phabricator.wikimedia.org/P30729 and previous config saved to /var/cache/conftool/dbconfig/20220701-145937-ladsgroup.json [14:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:41] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:01:33] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudstore[1008-1009] [15:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:14] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [15:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:23] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [15:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2166.codfw.wmnet with reason: host reimage [15:02:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2167.codfw.wmnet with reason: host reimage [15:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2167.codfw.wmnet with reason: host reimage [15:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:01] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:10:34] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10Andrew) a:05Andrew→03Cmjohnson Because these are HP boxes (I think?) the decom script was unable to actually shut them down. They are n... 
[15:10:39] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudstore100[89] [puppet] - 10https://gerrit.wikimedia.org/r/810349 (https://phabricator.wikimedia.org/T311844) (owner: 10Andrew Bogott) [15:13:45] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:14:04] (03PS1) 10Andrew Bogott: Remove cloudstore100[89] IPs from the dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/810351 (https://phabricator.wikimedia.org/T311844) [15:15:43] (03PS3) 10BCornwall: prometheus: Add custom vm.max_map_count metric [puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) [15:16:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2166.codfw.wmnet with OS bullseye [15:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:01] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2166.codfw.wmnet with OS bullseye completed: - db2... 
[15:22:15] (03PS2) 10Andrew Bogott: Change formatting of a few openstack calls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 [15:22:17] (03PS9) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [15:22:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2167.codfw.wmnet with OS bullseye [15:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2167.codfw.wmnet with OS bullseye completed: - db2... [15:23:39] (03CR) 10BCornwall: prometheus: Add custom vm.max_map_count metric (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: 10BCornwall) [15:24:39] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@cloudelastic-chi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Andrew) public vlan, just like the existing cloudcontrols please. All disks in hardware raid10, and then partman recipe 'pa... 
[15:26:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Andrew) [15:26:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:27:23] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f227d370280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wiki [15:27:23] imedia.org/wiki/Search%23Administration [15:27:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Andrew) raid10-4dev.cfg for partman please! [15:28:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Andrew) [15:30:49] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: 10Andrew Bogott) [15:31:43] (03CR) 10Andrew Bogott: [C: 03+2] galera-nodecheck: turn logging way, way down [puppet] - 10https://gerrit.wikimedia.org/r/806458 (owner: 10Andrew Bogott) [15:31:54] (03CR) 10Andrew Bogott: [C: 03+2] remove nodecheck.sh. 
It was replaced with nodecheck.py [puppet] - 10https://gerrit.wikimedia.org/r/806457 (https://phabricator.wikimedia.org/T310664) (owner: 10Andrew Bogott) [15:31:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:32:19] (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::base::galera::node: remove old absented files [puppet] - 10https://gerrit.wikimedia.org/r/806450 (https://phabricator.wikimedia.org/T310664) (owner: 10Andrew Bogott) [15:36:45] (03PS1) 10David Caro: alerts: add a default duration of 1h [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810367 [15:36:47] (03PS1) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 [15:37:52] (03CR) 10Andrew Bogott: Change formatting of a few openstack calls (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott) [15:38:10] (03PS4) 10Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027 [15:38:12] (03PS3) 10Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/810031 [15:38:14] (03PS2) 10Giuseppe Lavagetto: scap: drop unused parameters from the configuration [puppet] - 10https://gerrit.wikimedia.org/r/810048 [15:38:16] (03Abandoned) 10Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/802220 (owner: 10Andrew Bogott) [15:39:29] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: Use openstack cli for creating new glance image [puppet] - 10https://gerrit.wikimedia.org/r/802605 (owner: 10Andrew Bogott) [15:40:38] (03PS4) 10BCornwall: prometheus: Add custom vm.max_map_count metric 
[puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) [15:41:42] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:44:58] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:04] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:53:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2168.codfw.wmnet with OS bullseye [15:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:01] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2168.codfw.wmnet with OS bullseye [15:54:35] (03CR) 10CI reject: [V: 04-1] cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro) [15:59:24] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:53] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 
10Kosta Harlan) [16:05:01] (03PS3) 10Kosta Harlan: [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) [16:05:19] (03PS5) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [16:08:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T309311)', diff saved to https://phabricator.wikimedia.org/P30730 and previous config saved to /var/cache/conftool/dbconfig/20220701-160831-ladsgroup.json [16:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:44] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:10:54] (03PS6) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [16:12:04] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [16:13:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2168.codfw.wmnet with reason: host reimage [16:13:04] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2042 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:15] (03CR) 10CI reject: [V: 04-1] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [16:14:08] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:14:41] (03PS1) 10Kosta Harlan: SuggestedEdits: Adjust thumbnailSource logic [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810142 (https://phabricator.wikimedia.org/T311789) [16:16:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2168.codfw.wmnet with reason: host reimage [16:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P30731 and previous config saved to /var/cache/conftool/dbconfig/20220701-162337-ladsgroup.json [16:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:09] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 758, active_shards: 1519, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight [16:28:09] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.86850756081526 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:29:00] (03CR) 10Ahmon Dancy: mediawiki: add scap restarts script (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/810027 (owner: 10Giuseppe Lavagetto) [16:30:10] (03CR) 10Ahmon Dancy: [C: 04-1] scap: use the new script to restart php-fpm (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810031 (owner: 10Giuseppe Lavagetto) [16:30:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2168.codfw.wmnet with OS bullseye [16:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:47] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2168.codfw.wmnet with OS bullseye completed: - db2... [16:34:49] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:38:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P30732 and previous config saved to /var/cache/conftool/dbconfig/20220701-163842-ladsgroup.json [16:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:51] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T309311)', diff saved to https://phabricator.wikimedia.org/P30733 and previous config saved to /var/cache/conftool/dbconfig/20220701-165347-ladsgroup.json [16:53:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:52] T309311: Make user_editcount unsigned in production - 
https://phabricator.wikimedia.org/T309311 [16:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30734 and previous config saved to /var/cache/conftool/dbconfig/20220701-165407-ladsgroup.json [16:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:42] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [16:56:17] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:02:23] PROBLEM - DNS on cloudstore1009.mgmt is CRITICAL: Domain cloudstore1009.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:06:25] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:17:02] (03CR) 10Andrew Bogott: "A couple of comments inline. 
I'm concerned that the __init__ rename/refactor is going to break other things that rely on it (both due to t" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro) [17:24:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:29:53] (03PS3) 10Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/810106 [17:34:40] (03PS6) 10Andrea Denisse: Add role::netmon to the netmon1003 instance. [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) [17:35:36] (03PS7) 10Andrea Denisse: Add role::netmon to the netmon1003 instance. [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) [17:39:07] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:47:00] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [17:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:08] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [17:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30735 and previous config 
saved to /var/cache/conftool/dbconfig/20220701-174929-ladsgroup.json [17:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:34] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:55:53] PROBLEM - DNS on cloudstore1008.mgmt is CRITICAL: Domain cloudstore1008.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:58:46] 10SRE, 10CirrusSearch, 10Discovery-Search, 10Infrastructure-Foundations, and 5 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (10Aklapper) 05Open→03Resolved No replies by anyone, boldly closing - shrug [18:04:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P30736 and previous config saved to /var/cache/conftool/dbconfig/20220701-180434-ladsgroup.json [18:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:49] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:08:11] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:12:58] (03CR) 10Dzahn: vtrs: add promtheus blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) (owner: 10Dzahn) [18:14:23] (03PS3) 10Dzahn: vrts: add promtheus blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) [18:19:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to 
https://phabricator.wikimedia.org/P30737 and previous config saved to /var/cache/conftool/dbconfig/20220701-181939-ladsgroup.json [18:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:33] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [18:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30738 and previous config saved to /var/cache/conftool/dbconfig/20220701-183444-ladsgroup.json [18:34:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [18:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:49] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [18:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30739 and previous config saved to /var/cache/conftool/dbconfig/20220701-183504-ladsgroup.json [18:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:05] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:38:25] (03PS2) 10Mary Yang: DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 [18:39:01] (03CR) 10CI reject: [V: 04-1] DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (owner: 10Mary Yang) [18:39:17] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:40:37] (03PS3) 10Mary Yang: DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [18:41:12] (03CR) 10CI reject: [V: 04-1] DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [18:42:25] (03PS1) 10Andrew Bogott: nfs-mounts: move math project to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/810393 (https://phabricator.wikimedia.org/T301280) [18:50:02] (03PS4) 10Mary Yang: Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [18:50:37] (03CR) 10CI reject: [V: 04-1] Add puppet profile and role files for wikifunctions. 
[puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [18:56:41] (03CR) 10Mary Yang: "Looks like autoloader requires an init.pp file also, but I am not sure what to put in there.." [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [19:01:37] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: move math project to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/810393 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [19:04:17] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:08:13] (03PS1) 10Andrew Bogott: nfs-mounts: fix c/p error with 'math' nfs path [puppet] - 10https://gerrit.wikimedia.org/r/810395 (https://phabricator.wikimedia.org/T301280) [19:08:52] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: fix c/p error with 'math' nfs path [puppet] - 10https://gerrit.wikimedia.org/r/810395 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [19:18:16] (03CR) 10Urbanecm: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [19:25:28] (03CR) 10Urbanecm: [C: 04-1] "other than the proxy bit (and jerkins's -1), LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [19:41:57] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:44:03] (03PS1) 10Andrew Bogott: nfs-mounts: move video project to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/810398 (https://phabricator.wikimedia.org/T301280) [19:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30740 and previous config saved to /var/cache/conftool/dbconfig/20220701-194716-ladsgroup.json [19:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:22] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [19:47:54] (03PS3) 10Dzahn: hieradata: switchover doc to doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [19:48:22] (03Restored) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [19:50:35] (03PS3) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) [19:51:22] (03CR) 10CI reject: [V: 04-1] switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [19:51:24] (03CR) 10Dzahn: [C: 04-1] "eh, surprise after rebase.. 
hold on" [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [19:53:13] (03PS4) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) [19:59:41] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#7982883, @Krinkle wrote: > 1. [change 744763 (puppet)](https://g... [20:00:16] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) a:05hashar→03Dzahn [20:02:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P30741 and previous config saved to /var/cache/conftool/dbconfig/20220701-200221-ladsgroup.json [20:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:40] (03PS1) 10Dzahn: doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653) [20:10:21] (03PS1) 10Dzahn: site/DHCP: decom doc1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/810400 (https://phabricator.wikimedia.org/T247653) [20:12:49] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#7982883, @Krinkle wrote: > I propose the following rollout: add... 
[20:14:00] (03PS7) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [20:15:45] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [20:17:13] (03PS1) 10Dzahn: doc: remove support for stretch / PHP7.0 [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) [20:17:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P30742 and previous config saved to /var/cache/conftool/dbconfig/20220701-201726-ladsgroup.json [20:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:27] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:55] (03CR) 10Dzahn: [C: 03+2] "yep, thanks - https://puppet-compiler.wmflabs.org/pcc-worker1001/36165/" [puppet] - 10https://gerrit.wikimedia.org/r/810309 (owner: 10Muehlenhoff) [20:20:31] (03CR) 10CI reject: [V: 04-1] doc: remove support for stretch / PHP7.0 [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [20:21:17] (03CR) 10Dzahn: "lol @ " error during compilation: Evaluation Error: Error while evaluating a Function Call, profile not supported by stretch " from jenkin" [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [20:21:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (10Cmjohnson) 05Open→03Resolved @Jgreen updated frdev1003 to frdb1006, thank you for 
fixing dns. The on-site work has been completed, i... [20:22:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:23] (03PS2) 10Dzahn: doc: remove support for stretch, add support for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) [20:24:59] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: move video project to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/810398 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [20:26:35] (03PS1) 10Dzahn: typos: add "vtrs" [puppet] - 10https://gerrit.wikimedia.org/r/810403 [20:27:29] (03PS2) 10Dzahn: typos: add "vtrs" [puppet] - 10https://gerrit.wikimedia.org/r/810403 [20:27:37] (03CR) 10CI reject: [V: 04-1] doc: remove support for stretch, add support for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [20:29:21] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30743 and previous config saved to /var/cache/conftool/dbconfig/20220701-203231-ladsgroup.json [20:32:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [20:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:37] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [20:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: 
Maintenance [20:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T309311)', diff saved to https://phabricator.wikimedia.org/P30744 and previous config saved to /var/cache/conftool/dbconfig/20220701-203251-ladsgroup.json [20:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:15] (03PS5) 10Dzahn: Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [20:33:50] (03CR) 10CI reject: [V: 04-1] Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [20:33:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:27] (03CR) 10Dzahn: "made some minor changes to fix the CI / "in auto-layout" issue and some brackets. you can look at the diff between PS4 and PS5: https://" [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [20:35:50] (03CR) 10Dzahn: Add puppet profile and role files for wikifunctions. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [20:36:09] 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10Cmjohnson) a:03Cmjohnson Ack this task, will take care of next week [20:37:25] (03PS1) 10Clare Ming: Remove Table of Contents config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810405 (https://phabricator.wikimedia.org/T310527) [20:37:52] (03CR) 10Dzahn: Add puppet profile and role files for wikifunctions. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [20:39:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Cmjohnson) @Jgreen i don't seem to have the template directory or 10.in file in my DNS repo to make changes for you. If you can update frlog1002's dns then you... [20:39:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:41:39] (03CR) 10Dzahn: "the part that CI doesn't like has now changed to just "following are missing a SPDX licence header"." [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [20:58:40] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#7982883, @Krinkle wrote: > I propose the following rollout: I s... 
[21:04:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS bullseye [21:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye [21:08:54] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [software/varnish/libvmod-querysort] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/810409 [21:09:00] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [software/varnish/libvmod-querysort] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/810409 (owner: 10QChris) [21:09:01] !log https://doc.wikimedia.org - scheduled maintenance period - switching to buster backend doc1002 (T247653) [21:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:05] T247653: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 [21:09:29] (03PS1) 10QChris: Import done. Revoke import grants [software/varnish/libvmod-querysort] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/810410 [21:09:32] (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. 
Revoke import grants [software/varnish/libvmod-querysort] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/810410 (owner: 10QChris) [21:13:42] Hallo qchris :) [21:13:52] (03CR) 10Dzahn: [C: 03+2] hieradata: switchover doc to doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [21:13:59] Hi hauskatze :) [21:17:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1006.eqiad.wmnet with reason: host reimage [21:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host stat1009.eqiad.wmnet with OS bullseye [21:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye [21:20:33] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1006.eqiad.wmnet with reason: host reimage [21:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:54] (03CR) 10Krinkle: [C: 03+1] doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [21:21:05] (03CR) 10Krinkle: [C: 03+1] switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [21:24:09] mutante: would you have a few minutes to look at restbase2018? It's down -including SSH- and has been for a while now. 
[21:24:31] urandom: sorry, I am in the middle of a maintenance window, maybe after that [21:24:37] ok [21:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T309311)', diff saved to https://phabricator.wikimedia.org/P30745 and previous config saved to /var/cache/conftool/dbconfig/20220701-212903-ladsgroup.json [21:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:08] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:30:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stat1009.eqiad.wmnet with reason: host reimage [21:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:02] (03CR) 10Dzahn: [C: 03+2] doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [21:31:09] (03PS2) 10Dzahn: doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653) [21:31:25] (03CR) 10Dzahn: [V: 03+2] doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [21:33:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat1009.eqiad.wmnet with reason: host reimage [21:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:41] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:34:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1006.eqiad.wmnet with OS bullseye [21:34:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye co... [21:36:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1014.eqiad.wmnet with OS bullseye [21:36:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1012.eqiad.wmnet with OS bullseye [21:36:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1009.eqiad.wmnet with OS bullseye [21:36:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye [21:36:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1013.eqiad.wmnet with OS bullseye [21:36:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1008.eqiad.wmnet with OS bullseye [21:36:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1011.eqiad.wmnet with OS bullseye [21:36:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1010.eqiad.wmnet with OS bullseye [21:37:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1015.eqiad.wmnet with OS bullseye [21:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye [21:37:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install 
an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1014.eqiad.wmnet with OS bullseye [21:37:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye [21:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye [21:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1013.eqiad.wmnet with OS bullseye [21:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1008.eqiad.wmnet with OS bullseye [21:37:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was 
started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye [21:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1010.eqiad.wmnet with OS bullseye [21:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1015.eqiad.wmnet with OS bullseye [21:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P30746 and previous config saved to /var/cache/conftool/dbconfig/20220701-214408-ladsgroup.json [21:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:29] (03PS5) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) [21:45:57] (03CR) 10Dzahn: [C: 03+2] switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [21:48:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stat1009.eqiad.wmnet with OS bullseye [21:48:39] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye completed: - stat1009 (**PASS... [21:48:45] !log https://doc.wikimedia.org switched to doc1002 backend on buster T247653 [21:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:49] T247653: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 [21:49:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1014.eqiad.wmnet with reason: host reimage [21:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage [21:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1010.eqiad.wmnet with reason: host reimage [21:49:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage [21:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1013.eqiad.wmnet with reason: host reimage [21:49:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1008.eqiad.wmnet with reason: host reimage [21:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:34] !log 
cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage [21:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:50] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage [21:50:50] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage [21:50:50] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1013.eqiad.wmnet with reason: host reimage [21:50:50] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1008.eqiad.wmnet with reason: host reimage [21:50:50] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1010.eqiad.wmnet with reason: host reimage [21:50:50] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1014.eqiad.wmnet with reason: host reimage [21:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:51] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage [21:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:54] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1012.eqiad.wmnet with OS bullseye [21:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) 
rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye ex... [21:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:05] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1015.eqiad.wmnet with OS bullseye [21:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1015.eqiad.wmnet with OS bullseye ex... [21:57:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1007.eqiad.wmnet with OS bullseye [21:57:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1011.eqiad.wmnet with OS bullseye [21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye ex... 
[21:57:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye ex... [21:57:49] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1009.eqiad.wmnet with OS bullseye [21:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye ex... [21:59:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P30747 and previous config saved to /var/cache/conftool/dbconfig/20220701-215913-ladsgroup.json [21:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1015.eqiad.wmnet with OS bullseye [22:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1015.eqiad.wmnet with OS bullseye [22:02:55] urandom: restbase2018 is running, I can see it on mgmt. so it's "just" cable or switch port. 
we will just have to ask dcops via ticket [22:02:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1012.eqiad.wmnet with OS bullseye [22:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye [22:03:05] urandom: it's properly depooled? no problem right now? [22:03:50] it's not depooled per se, but it's not creating an outage if that's what you mean [22:04:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10Cmjohnson) [22:04:10] yea, whatever needs to be done so that it does not get traffic or cause issues while it's down [22:04:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:04:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10Cmjohnson) 05Open→03Resolved resolved [22:04:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1010.eqiad.wmnet with OS bullseye [22:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1010.eqiad.wmnet with OS bullseye co... [22:05:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1008.eqiad.wmnet with OS bullseye [22:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1008.eqiad.wmnet with OS bullseye co... [22:05:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1013.eqiad.wmnet with OS bullseye [22:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1013.eqiad.wmnet with OS bullseye co... [22:08:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1014.eqiad.wmnet with OS bullseye [22:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1014.eqiad.wmnet with OS bullseye co... 
[22:10:52] 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) [22:11:13] urandom: I made a ticket ^ [22:12:38] !log restbase2018 - attempting power cycle via mgmt - /admin1-> racadm serveraction powercycle (T311890) [22:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:42] T311890: restbase2018 down - https://phabricator.wikimedia.org/T311890 [22:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T309311)', diff saved to https://phabricator.wikimedia.org/P30748 and previous config saved to /var/cache/conftool/dbconfig/20220701-221418-ladsgroup.json [22:14:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [22:14:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1015.eqiad.wmnet with reason: host reimage [22:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:23] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [22:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [22:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:37] mutante: sorry, I should have opened one [22:14:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T309311)', diff saved to https://phabricator.wikimedia.org/P30749 and previous config saved to /var/cache/conftool/dbconfig/20220701-221438-ladsgroup.json [22:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:23] urandom: no worries, let's try this one powercycle [22:15:30] (03CR) 
10Andrew Bogott: [C: 03+2] haproxy/nova-api-metadata use the /healthcheck endpoint for health check [puppet] - 10https://gerrit.wikimedia.org/r/806350 (owner: 10Andrew Bogott) [22:15:31] you can't get on mgmt.. so... [22:15:32] PROBLEM - cassandra-c service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:15:39] (03PS2) 10Andrew Bogott: haproxy/nova-api-metadata use the /healthcheck endpoint for health check [puppet] - 10https://gerrit.wikimedia.org/r/806350 [22:15:44] PROBLEM - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:15:44] PROBLEM - cassandra-b SSL 10.192.48.125:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:15:45] PROBLEM - cassandra-a SSL 10.192.48.124:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:16:00] PROBLEM - cassandra-a service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:16:02] urandom: well.. those alerts kind of sound like it wasn't actually down? [22:16:06] PROBLEM - cassandra-b service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:16:10] or somehow in limbo [22:16:22] PROBLEM - puppet last run on restbase2018 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:16:32] 1 day ago? 
[22:16:38] RECOVERY - Restbase root url on restbase2018 is OK: HTTP OK: HTTP/1.1 200 - 17235 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/RESTBase [22:17:04] urandom: try ssh now. it works again [22:17:28] mutante: I mean, it was definitely in some sort of limbo/broken state, if not actually totally down [22:17:29] [restbase2018:~] $ uptime 22:17:20 up 3 min, [22:17:30] RECOVERY - cassandra-b service on restbase2018 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:17:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1015.eqiad.wmnet with reason: host reimage [22:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:03] urandom: ACK, yea. and nothing in hardware fail log [22:18:10] weird. [22:18:21] anyway, yeah, seems Ok now [22:18:26] RECOVERY - SSH on restbase2018 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:18:36] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1001.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:29] mutante: thank you! [22:20:23] urandom: no problem. I am just not sure what to do with the ticket. 
probably nothing though :) [22:20:32] RECOVERY - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.124 port 9042 https://phabricator.wikimedia.org/T93886 [22:20:36] I glanced at syslog as well [22:21:30] there is a separate syslog just for restbase too, but: [22:21:31] May 17 14:08:38 restbase2018 restbase[27229]: #033]0;firejail /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml #007Child process initialized in 98.93 ms [22:21:34] RECOVERY - puppet last run on restbase2018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:21:35] Jul 1 22:14:34 restbase2018 restbase[937]: Reading profile /etc/firejail/default.profile [22:22:39] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1012.eqiad.wmnet with OS bullseye [22:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye ex... [22:22:50] 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) powercycling via mgmt brought it back as if nothing happened. Nothing obvious in syslog or restbase/syslog. 
[22:23:18] 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) 05Open→03Resolved a:03Dzahn feel free to reopen if you see any issue with this again [22:23:46] 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) ` 22:20 <+icinga-wm> RECOVERY - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.124 port 9042 https://phabricator.wikimedia.org/T93886 22:20 < mutante> I glance... [22:27:10] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:14] (03PS1) 10BryanDavis: striker: Open firewall for Docker-managed service [puppet] - 10https://gerrit.wikimedia.org/r/810413 (https://phabricator.wikimedia.org/T306469) [22:31:16] (03PS1) 10BryanDavis: striker: Bump container version to 2022-07-01-210101-production [puppet] - 10https://gerrit.wikimedia.org/r/810414 (https://phabricator.wikimedia.org/T306469) [22:31:32] RECOVERY - cassandra-a SSL 10.192.48.124:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-a valid until 2022-10-08 10:54:06 +0000 (expires in 98 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:31:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1012.eqiad.wmnet with OS bullseye [22:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye [22:32:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1015.eqiad.wmnet with OS bullseye 
[22:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1015.eqiad.wmnet with OS bullseye co... [22:33:48] RECOVERY - cassandra-a service on restbase2018 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:34:44] (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/36166/" [puppet] - 10https://gerrit.wikimedia.org/r/810413 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [22:36:02] (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36167/" [puppet] - 10https://gerrit.wikimedia.org/r/810414 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [22:36:26] RECOVERY - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is OK: TCP OK - 0.037 second response time on 10.192.48.125 port 9042 https://phabricator.wikimedia.org/T93886 [22:38:44] RECOVERY - cassandra-b SSL 10.192.48.125:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-b valid until 2022-10-08 10:54:09 +0000 (expires in 98 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:41:10] RECOVERY - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.126 port 9042 https://phabricator.wikimedia.org/T93886 [22:43:38] RECOVERY - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-c valid until 2022-10-08 10:54:12 +0000 (expires in 98 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:43:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1012.eqiad.wmnet 
with reason: host reimage [22:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:17] 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) ` 22:36 <+icinga-wm> RECOVERY - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is OK: TCP OK - 0.037 second response time on 10.192.48.125 port 9042 https://phabricator.wikimedia.org/T93886 22:38 <+icinga-wm> RECOVE... [22:46:02] RECOVERY - cassandra-c service on restbase2018 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:47:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1012.eqiad.wmnet with reason: host reimage [22:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:11] 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [23:02:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1012.eqiad.wmnet with OS bullseye [23:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye co... [23:02:44] 10SRE, 10Observability-Logging, 10SRE Observability (FY2021/2022-Q4): apifeatureusage hosts hanging on shutdown - https://phabricator.wikimedia.org/T305403 (10colewhite) 05Open→03Resolved a:03herron There hasn't been a need to test if the patch above fixed the issue, but I think we can close it and cir... 
[23:04:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:10:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T309311)', diff saved to https://phabricator.wikimedia.org/P30750 and previous config saved to /var/cache/conftool/dbconfig/20220701-231009-ladsgroup.json [23:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:14] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [23:25:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P30751 and previous config saved to /var/cache/conftool/dbconfig/20220701-232514-ladsgroup.json [23:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P30752 and previous config saved to /var/cache/conftool/dbconfig/20220701-234019-ladsgroup.json [23:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T309311)', diff saved to https://phabricator.wikimedia.org/P30753 and previous config saved to /var/cache/conftool/dbconfig/20220701-235524-ladsgroup.json [23:55:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [23:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:29] T309311: Make user_editcount unsigned in production - 
https://phabricator.wikimedia.org/T309311 [23:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [23:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:54] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:58:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase