[00:00:16] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:00:52] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:35] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2165.mgmt.codfw.wmnet with reboot policy FORCED
[00:01:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2163.mgmt.codfw.wmnet with reboot policy FORCED
[00:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2166.mgmt.codfw.wmnet with reboot policy FORCED
[00:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:02] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[00:23:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2161.codfw.wmnet with OS bullseye
[00:23:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:20] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2161.codfw.wmnet with OS bullseye
[00:26:01] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2165.mgmt.codfw.wmnet with reboot policy FORCED
[00:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2166.mgmt.codfw.wmnet with reboot policy FORCED
[00:27:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2167.mgmt.codfw.wmnet with reboot policy FORCED
[00:28:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2168.mgmt.codfw.wmnet with reboot policy FORCED
[00:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2162.codfw.wmnet with OS bullseye
[00:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:54] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2162.codfw.wmnet with OS bullseye
[00:42:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2161.codfw.wmnet with reason: host reimage
[00:42:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2161.codfw.wmnet with reason: host reimage
[00:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2167.mgmt.codfw.wmnet with reboot policy FORCED
[00:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2168.mgmt.codfw.wmnet with reboot policy FORCED
[00:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:30] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[00:57:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2162.codfw.wmnet with reason: host reimage
[00:57:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2161.codfw.wmnet with OS bullseye
[01:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:19] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2161.codfw.wmnet with OS bullseye completed: - db2...
[01:00:28] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:02:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2162.codfw.wmnet with reason: host reimage
[01:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:49] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:52] <wikibugs>	 (PS6) Krinkle: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604
[01:12:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2163.codfw.wmnet with OS bullseye
[01:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:20] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2163.codfw.wmnet with OS bullseye
[01:15:41] <wikibugs>	 (CR) Krinkle: [C: +2] build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604 (owner: Krinkle)
[01:16:42] <wikibugs>	 (Merged) jenkins-bot: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604 (owner: Krinkle)
[01:17:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2162.codfw.wmnet with OS bullseye
[01:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:24] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2162.codfw.wmnet with OS bullseye completed: - db2...
[01:19:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:19:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:20:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:20:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:44] <wikibugs>	 (PS7) Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605
[01:23:07] <logmsgbot>	 !log krinkle@deploy1002 Synchronized tests/: I796f38d0f04600c (1/3) (duration: 03m 41s)
[01:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:32] <wikibugs>	 (CR) Krinkle: [C: +2] build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[01:26:04] <wikibugs>	 (PS11) Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[01:26:19] <wikibugs>	 (Merged) jenkins-bot: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[01:26:59] <logmsgbot>	 !log krinkle@deploy1002 Synchronized multiversion/: I796f38d0f04600c (2/3) (duration: 03m 32s)
[01:27:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:17] <wikibugs>	 (PS12) Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[01:30:48] <logmsgbot>	 !log krinkle@deploy1002 Synchronized src/: I796f38d0f04600c (3/3) (duration: 03m 24s)
[01:30:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:31:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2163.codfw.wmnet with reason: host reimage
[01:31:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:11] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2163.codfw.wmnet with reason: host reimage
[01:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:42] <wikibugs>	 (PS1) Mary Yang: DO-NOT-SUBMIT(Under local test, not yet ready for review): Add puppet profile and role files for wikifunctions. [puppet] - https://gerrit.wikimedia.org/r/810146
[01:36:48] <wikibugs>	 (CR) CI reject: [V: -1] DO-NOT-SUBMIT(Under local test, not yet ready for review): Add puppet profile and role files for wikifunctions. [puppet] - https://gerrit.wikimedia.org/r/810146 (owner: Mary Yang)
[01:39:47] <logmsgbot>	 !log krinkle@deploy1002 Synchronized tests/: I60edfb0f60 (1/3) (duration: 03m 32s)
[01:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2163.codfw.wmnet with OS bullseye
[01:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:55] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2163.codfw.wmnet with OS bullseye completed: - db2...
[01:54:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2165.codfw.wmnet with OS bullseye
[01:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:11] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2165.codfw.wmnet with OS bullseye
[02:01:37] <logmsgbot>	 !log krinkle@deploy1002 Synchronized multiversion/: I60edfb0f60 (2/3) (duration: 03m 34s)
[02:01:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:23] <wikibugs>	 (CR) Tim Starling: Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[02:03:28] <wikibugs>	 (PS13) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[02:06:32] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/: I60edfb0f60 (3/3) (duration: 03m 31s)
[02:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:13:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2165.codfw.wmnet with reason: host reimage
[02:13:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:16:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2165.codfw.wmnet with reason: host reimage
[02:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:12] <wikibugs>	 (PS13) Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[02:30:13] <wikibugs>	 (PS1) Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821)
[02:30:16] <wikibugs>	 (PS1) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[02:31:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2165.codfw.wmnet with OS bullseye
[02:31:15] <wikibugs>	 (CR) CI reject: [V: -1] multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[02:31:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:20] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2165.codfw.wmnet with OS bullseye completed: - db2...
[02:31:22] <wikibugs>	 (CR) CI reject: [V: -1] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (owner: Krinkle)
[03:01:50] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:07:28] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:11:51] <wikibugs>	 (PS2) Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821)
[03:11:53] <wikibugs>	 (PS2) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[03:12:45] <wikibugs>	 (CR) CI reject: [V: -1] multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[03:12:48] <wikibugs>	 (CR) CI reject: [V: -1] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (owner: Krinkle)
[03:14:34] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:17:34] <wikibugs>	 (CR) Tim Starling: Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[03:18:09] <wikibugs>	 (PS3) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[03:18:26] <wikibugs>	 (CR) Tim Starling: Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[03:18:51] <wikibugs>	 (PS3) Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821)
[03:18:53] <wikibugs>	 (PS4) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[03:19:13] <wikibugs>	 (CR) CI reject: [V: -1] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (owner: Krinkle)
[03:20:00] <wikibugs>	 (CR) CI reject: [V: -1] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (owner: Krinkle)
[03:20:48] <wikibugs>	 (PS5) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148
[03:21:45] <wikibugs>	 (PS6) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (https://phabricator.wikimedia.org/T169821)
[03:39:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:20:38] <wikibugs>	 (PS14) Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[04:20:40] <wikibugs>	 (PS4) Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821)
[04:20:42] <wikibugs>	 (PS7) Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/810148 (https://phabricator.wikimedia.org/T169821)
[04:23:54] <icinga-wm>	 PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 1859 MB (3% inode=97%): /tmp 1859 MB (3% inode=97%): /var/tmp 1859 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[05:29:39] <wikibugs>	 (PS1) Marostegui: mariadb: Add db2153 to s1 [puppet] - https://gerrit.wikimedia.org/r/810149 (https://phabricator.wikimedia.org/T311493)
[05:30:11] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/810033 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:30:18] <wikibugs>	 (PS2) Muehlenhoff: certspotter: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810033 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:30:32] <wikibugs>	 (PS2) Marostegui: mariadb: Add db2153 to s1 [puppet] - https://gerrit.wikimedia.org/r/810149 (https://phabricator.wikimedia.org/T311493)
[05:31:24] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Add db2153 to s1 [puppet] - https://gerrit.wikimedia.org/r/810149 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:35:55] <wikibugs>	 (CR) Muehlenhoff: "Could you please also add the header to modules/cumin/files/reboot-host?" [puppet] - https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:39:39] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db2092 from dbctl [puppet] - https://gerrit.wikimedia.org/r/810150 (https://phabricator.wikimedia.org/T311802)
[05:40:26] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db2092 from dbctl [puppet] - https://gerrit.wikimedia.org/r/810150 (https://phabricator.wikimedia.org/T311802) (owner: Marostegui)
[05:41:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2092 from dbctl T311802', diff saved to https://phabricator.wikimedia.org/P30701 and previous config saved to /var/cache/conftool/dbconfig/20220701-054102-marostegui.json
[05:41:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:08] <stashbot>	 T311802: decommission db2092 - https://phabricator.wikimedia.org/T311802
[05:41:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/810035 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:41:54] <wikibugs>	 (PS2) Muehlenhoff: pdns_server: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810035 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:41:56] <wikibugs>	 (PS1) Marostegui: db2092: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810151
[05:43:25] <wikibugs>	 (CR) Marostegui: [C: +2] db2092: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810151 (owner: Marostegui)
[05:50:37] <wikibugs>	 (PS1) Marostegui: mariadb: Add db2154 to s8 [puppet] - https://gerrit.wikimedia.org/r/810152 (https://phabricator.wikimedia.org/T311493)
[05:51:24] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Add db2154 to s8 [puppet] - https://gerrit.wikimedia.org/r/810152 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:58:49] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db2091 from dbctl [puppet] - https://gerrit.wikimedia.org/r/810153 (https://phabricator.wikimedia.org/T311803)
[05:59:36] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db2091 from dbctl [puppet] - https://gerrit.wikimedia.org/r/810153 (https://phabricator.wikimedia.org/T311803) (owner: Marostegui)
[06:00:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2091 from dbctl T311803', diff saved to https://phabricator.wikimedia.org/P30703 and previous config saved to /var/cache/conftool/dbconfig/20220701-060000-marostegui.json
[06:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] <stashbot>	 T311803: decommission db2091 - https://phabricator.wikimedia.org/T311803
[06:04:53] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:05:47] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:13:27] <wikibugs>	 (PS1) Marostegui: db2091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810155 (https://phabricator.wikimedia.org/T311803)
[06:14:10] <wikibugs>	 (CR) Marostegui: [C: +2] db2091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810155 (https://phabricator.wikimedia.org/T311803) (owner: Marostegui)
[06:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:18:05] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:19:59] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:25:39] <wikibugs>	 (CR) Muehlenhoff: [C: +2] uwsgi: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810036 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:25:49] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/810036 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:25:57] <wikibugs>	 (PS2) Muehlenhoff: uwsgi: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810036 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:33:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Drop references to puppet source files [puppet] - https://gerrit.wikimedia.org/r/810014 (owner: Muehlenhoff)
[06:36:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] jupyterhub: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/809628 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:36:37] <wikibugs>	 (PS2) Muehlenhoff: jupyterhub: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/809628 (https://phabricator.wikimedia.org/T308013)
[06:51:20] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:18] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:57:30] <wikibugs>	 (CR) Majavah: [C: +1] "looks correct" [puppet] - https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: Andrew Bogott)
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220701T0700)
[07:12:22] <icinga-wm>	 RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[07:15:17] <wikibugs>	 (PS2) Zabe: cumin: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013)
[07:16:03] <wikibugs>	 (CR) Zabe: cumin: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:18:08] <wikibugs>	 (PS1) Filippo Giunchedi: swift: turn off uwsgi request logging [puppet] - https://gerrit.wikimedia.org/r/810276 (https://phabricator.wikimedia.org/T297959)
[07:18:36] <wikibugs>	 (CR) Samwilson: [C: -1] Enable edit-in-sequence on Beta Wikisource for testing (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[07:18:39] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:20:55] <wikibugs>	 (PS2) Filippo Giunchedi: swift: turn off uwsgi request logging [puppet] - https://gerrit.wikimedia.org/r/810276 (https://phabricator.wikimedia.org/T297959)
[07:24:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "LGTM, thank you Majavah!" [puppet] - https://gerrit.wikimedia.org/r/810039 (owner: Majavah)
[07:24:37] <wikibugs>	 (PS1) Marostegui: site.pp: Remove db2153-2154 from insetup role [puppet] - https://gerrit.wikimedia.org/r/810278 (https://phabricator.wikimedia.org/T306927)
[07:25:24] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove db2153-2154 from insetup role [puppet] - https://gerrit.wikimedia.org/r/810278 (https://phabricator.wikimedia.org/T306927) (owner: Marostegui)
[07:27:56] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2153 to dbctl [puppet] - https://gerrit.wikimedia.org/r/810279 (https://phabricator.wikimedia.org/T311493)
[07:30:59] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2153 to dbctl [puppet] - https://gerrit.wikimedia.org/r/810279 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:32:13] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: BCornwall)
[07:35:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2153 to s1 T311493', diff saved to https://phabricator.wikimedia.org/P30704 and previous config saved to /var/cache/conftool/dbconfig/20220701-073512-marostegui.json
[07:35:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:17] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[07:39:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:40:14] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add db2154 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/810281 (https://phabricator.wikimedia.org/T311493)
[07:41:07] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2154 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/810281 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[07:44:26] <wikibugs>	 (03PS2) 10Slyngshede: P:aptrepo::wikimedia move private repo to nginx and uninstall apache [puppet] - 10https://gerrit.wikimedia.org/r/809969
[07:46:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2154 to s8 T311493', diff saved to https://phabricator.wikimedia.org/P30705 and previous config saved to /var/cache/conftool/dbconfig/20220701-074607-marostegui.json
[07:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:12] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[07:47:36] <mmandere>	 !log kubemaster2001, restart rsyslog 
[07:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:49:45] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36154/console" [puppet] - 10https://gerrit.wikimedia.org/r/809969 (owner: 10Slyngshede)
[08:07:24] <wikibugs>	 10SRE, 10Icinga, 10Observability-Alerting, 10observability, and 3 others: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi SRE does get paged nowadays when there's a "low" (FSVO low) availability (i.e....
[08:08:52] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Create vm.max_map_count metrics for Prometheus - https://phabricator.wikimedia.org/T311445 (10fgiunchedi)
[08:11:13] <wikibugs>	 10SRE, 10Observability-Alerting, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10fgiunchedi)
[08:11:32] <wikibugs>	 (03PS1) 10David Caro: wmcs: Parse enums at argparse level [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810285
[08:12:56] <wikibugs>	 (03PS2) 10Sohom Datta: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098)
[08:13:09] <wikibugs>	 (03CR) 10David Caro: wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah)
[08:13:42] <wikibugs>	 (03CR) 10David Caro: "btw. I'm working on fixing the tests (the errors come from master I think), you might have to rebase the patch again, and your local code" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah)
[08:14:47] <wikibugs>	 (03CR) 10Sohom Datta: "Made the changes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta)
[08:15:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: Parse enums at argparse level [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810285 (owner: 10David Caro)
[08:16:09] <wikibugs>	 (03CR) 10Vgutierrez: prometheus: Add custom vm.max_map_count metric (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: 10BCornwall)
[08:16:41] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "I don't think this will work as expected, as the debian version check is done on the host being checked and not the one running blackbox-e" [puppet] - 10https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:17:55] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling)
[08:18:31] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) I think we're in a good shape wrt spicerack and alertmanager support, is there a...
[08:19:31] <wikibugs>	 (03PS2) 10Majavah: keyholder::monitoring: drop nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810040
[08:20:58] <wikibugs>	 (03CR) 10Muehlenhoff: puppet: add wrapper command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808877 (owner: 10Jbond)
[08:22:18] <wikibugs>	 10SRE, 10Dumps-Generation, 10Infrastructure-Foundations (FY2021/2022-Q4), 10Security: Remaining data engineering host security restarts - https://phabricator.wikimedia.org/T307055 (10fgiunchedi)
[08:24:59] <wikibugs>	 (03CR) 10DannyS712: "what does "<koi>" mean here?" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 (owner: 10Thcipriani)
[08:25:35] <wikibugs>	 (03CR) 10Zabe: Revert "RecentChange: Straight join to actor table when needed" (031 comment) [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 (owner: 10Thcipriani)
[08:27:18] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:27:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: adjust check::http params based on distro (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:28:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: adjust check::http params based on distro (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:29:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] keyholder::monitoring: drop nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810040 (owner: 10Majavah)
[08:33:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/809969 (owner: 10Slyngshede)
[08:33:45] <wikibugs>	 (03PS1) 10David Caro: sre.ganeti.makevm: Format with black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/810288
[08:35:15] <marostegui>	 !log Stop mysql on db2073 for cloning db2155 
[08:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:40] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db2155 [puppet] - 10https://gerrit.wikimedia.org/r/810289 (https://phabricator.wikimedia.org/T311493)
[08:36:25] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2155 [puppet] - 10https://gerrit.wikimedia.org/r/810289 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[08:44:07] <wikibugs>	 (03PS2) 10David Caro: wmcs: Parse enums at argparse level [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810285
[08:49:37] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:51:09] <wikibugs>	 (03CR) 10David Caro: [C: 04-1] "I just rebased the wmcs branch on top of master (and added a fix for some tests), please rebase again!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott)
[08:51:49] <wikibugs>	 (03CR) 10David Caro: "Sorted out the errors on the master branch, and rebased wmcs on top of it, please fetch and rebase this again, cheers!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah)
[08:52:01] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:52:30] <wikibugs>	 (03PS2) 10David Caro: wmcs: network: tests: include docs reference [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/738881 (owner: 10Arturo Borrero Gonzalez)
[08:52:32] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10Volans) Yes, the whole phase 2 mentioned in T293209#7698301 is still a TODO:  * Although we...
[08:52:59] <wikibugs>	 (03PS2) 10Majavah: wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067
[08:53:11] <wikibugs>	 (03PS8) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170
[08:54:23] <wikibugs>	 (03CR) 10David Caro: "I'd recommend creating a different cookbook for many instances that just calls this in a loop, that prevents adding complexity to this one" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707 (owner: 10Arturo Borrero Gonzalez)
[08:55:00] <wikibugs>	 (03PS3) 10David Caro: wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez)
[08:55:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez)
[08:55:51] <wikibugs>	 (03PS2) 10David Caro: wmcs: vps: remove_instance: add support for puppet deactivation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah)
[08:56:00] <wikibugs>	 (03PS2) 10David Caro: wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah)
[08:59:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: vps: remove_instance: add support for puppet deactivation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah)
[09:00:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah)
[09:03:00] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: network: tests: include docs reference [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/738881 (owner: 10Arturo Borrero Gonzalez)
[09:07:16] <wikibugs>	 10SRE, 10Dumps-Generation, 10Infrastructure-Foundations (FY2021/2022-Q4), 10Security: Remaining data engineering host security restarts - https://phabricator.wikimedia.org/T307055 (10BTullis) 05Open→03Resolved a:03BTullis
[09:08:23] <wikibugs>	 (03PS2) 10Majavah: keyholder::monitoring: drop absented resources [puppet] - 10https://gerrit.wikimedia.org/r/810041
[09:08:25] <wikibugs>	 (03PS1) 10Majavah: keyholder::monitoring: remove source for absent nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810294
[09:08:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] labstore: update monitoring for nrpe changes [puppet] - 10https://gerrit.wikimedia.org/r/799318 (owner: 10Majavah)
[09:08:57] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs: network: tests: include docs reference [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/738881 (owner: 10Arturo Borrero Gonzalez)
[09:09:05] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:23] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] openstack: remove unused check_ssl_certfile [puppet] - 10https://gerrit.wikimedia.org/r/793423 (owner: 10Majavah)
[09:10:34] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] keyholder::monitoring: remove source for absent nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810294 (owner: 10Majavah)
[09:10:54] <wikibugs>	 (03PS1) 10David Caro: Add mypy tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295
[09:12:52] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah)
[09:14:40] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah)
[09:19:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add mypy tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 (owner: 10David Caro)
[09:20:31] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:34] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "See inline for all the details, AFAICT just removing the double space on line 186 is enough to fix the reported issue." [cookbooks] - 10https://gerrit.wikimedia.org/r/810288 (owner: 10David Caro)
[09:38:35] <wikibugs>	 (03PS12) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[09:39:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[09:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[09:39:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance
[09:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik)
[09:39:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance
[09:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:11] <wikibugs>	 (03Abandoned) 10David Caro: sre.ganeti.makevm: Format with black and isort [cookbooks] - 10https://gerrit.wikimedia.org/r/810288 (owner: 10David Caro)
[09:49:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:49:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P30708 and previous config saved to /var/cache/conftool/dbconfig/20220701-094927-ladsgroup.json
[09:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:31] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[09:56:31] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 51.82 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[10:05:46] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup)
[10:13:15] <wikibugs_>	 (03CR) 10LSobanski: [C: 03+1] vtrs: add promtheus blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) (owner: 10Dzahn)
[10:14:33] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[10:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:16:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P30709 and previous config saved to /var/cache/conftool/dbconfig/20220701-101602-ladsgroup.json
[10:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:07] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[10:17:33] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:20:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Thanks! This looks great, I'll merge on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/795380 (owner: 10Majavah)
[10:22:33] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:22:57] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[10:27:38] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) Thank you @Volans, the items all make sense to me.  >>! In T293209#8043294, @Vol...
[10:27:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[10:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[10:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T309311)', diff saved to https://phabricator.wikimedia.org/P30710 and previous config saved to /var/cache/conftool/dbconfig/20220701-102810-ladsgroup.json
[10:28:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:15] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[10:28:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] keyholder::monitoring: remove source for absent nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810294 (owner: 10Majavah)
[10:31:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P30711 and previous config saved to /var/cache/conftool/dbconfig/20220701-103107-ladsgroup.json
[10:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:43] <wikibugs>	 (03PS1) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[10:44:07] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:44:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:16] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s)
[10:44:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[10:45:05] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:45:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:14] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[10:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P30712 and previous config saved to /var/cache/conftool/dbconfig/20220701-104612-ladsgroup.json
[10:46:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:41] <wikibugs>	 (03PS1) 10Btullis: Update the partman configuration for the new presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/810305 (https://phabricator.wikimedia.org/T306835)
[10:48:40] <wikibugs>	 (03PS1) 10Muehlenhoff: graphite: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810306
[10:49:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] graphite: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810306 (owner: 10Muehlenhoff)
[10:50:13] <wikibugs>	 (03CR) 10Btullis: "It's quite possible that this custom/kafka-jumbo.cfg recipe will become more of a standard configuration, wherever we use hardware RAID an" [puppet] - 10https://gerrit.wikimedia.org/r/810305 (https://phabricator.wikimedia.org/T306835) (owner: 10Btullis)
[10:50:26] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the partman configuration for the new presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/810305 (https://phabricator.wikimedia.org/T306835) (owner: 10Btullis)
[10:54:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (4) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:55:58] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: cleanup support for .wmflabs hostnames [puppet] - 10https://gerrit.wikimedia.org/r/810307
[10:56:03] <wikibugs>	 (03PS2) 10Muehlenhoff: graphite: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810306
[10:56:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete profile::base::linux419 [puppet] - 10https://gerrit.wikimedia.org/r/810308
[10:56:53] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36157/console" [puppet] - 10https://gerrit.wikimedia.org/r/810307 (owner: 10Majavah)
[10:57:14] <wikibugs>	 (03CR) 10Majavah: Remove obsolete profile::base::linux419 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810308 (owner: 10Muehlenhoff)
[10:57:33] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:58:02] <wikibugs>	 (03PS2) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[10:58:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:00:27] <wikibugs>	 (03PS3) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:01:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P30713 and previous config saved to /var/cache/conftool/dbconfig/20220701-110117-ladsgroup.json
[11:01:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:23] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[11:01:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[11:01:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[11:02:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P30714 and previous config saved to /var/cache/conftool/dbconfig/20220701-110204-ladsgroup.json
[11:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:33] <jinxer-wm>	 (PuppetFailure) resolved: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:02:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) @Cmjohnson I think this should be good to go now. We've identified an additional step that we need to carry out...
[11:04:14] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/810306 (owner: 10Muehlenhoff)
[11:06:50] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::parsoid::testing: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810309
[11:07:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10BTullis) >>! In T299466#8040485, @Ottomata wrote: > We will have to rebuild hadoop for bullsye, eh?  {T310643}  Yep, looks that way.
[11:08:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P30715 and previous config saved to /var/cache/conftool/dbconfig/20220701-110859-ladsgroup.json
[11:09:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:04] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[11:10:01] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: allow forcing the backend for blank page on wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/810312 (https://phabricator.wikimedia.org/T311386)
[11:10:03] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: lvs: check php 7.4 too on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386)
[11:10:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10BTullis) @Cmjohnson I think that this should now work if you tweak the RAID controller configuration as described here: T297913#8041258  Let me know if it doesn't b...
[11:13:03] <wikibugs>	 (03PS4) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:13:38] <wikibugs>	 (03PS5) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:15:23] <wikibugs>	 (03CR) 10Muehlenhoff: Remove obsolete profile::base::linux419 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810308 (owner: 10Muehlenhoff)
[11:16:00] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove obsolete profile::base::linux419 [puppet] - 10https://gerrit.wikimedia.org/r/810308
[11:16:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:19:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) I have manually moved all home directories from `/home` to `/srv/home` and created a symlink. This matches the configuration of all of the other sta...
[11:19:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis)
[11:19:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) 05Open→03Resolved
[11:20:56] <wikibugs>	 (03PS6) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:21:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T309311)', diff saved to https://phabricator.wikimedia.org/P30716 and previous config saved to /var/cache/conftool/dbconfig/20220701-112121-ladsgroup.json
[11:21:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:29] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[11:22:25] <wikibugs>	 (PS1) Muehlenhoff: prometheus::postgres_exporter: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810318
[11:23:35] <icinga-wm>	 PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:24:01] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:24:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P30717 and previous config saved to /var/cache/conftool/dbconfig/20220701-112404-ladsgroup.json
[11:24:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:59] <wikibugs>	 (PS1) Muehlenhoff: snapshot: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810319
[11:26:06] <wikibugs>	 (PS7) Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:32:19] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/810318 (owner: Muehlenhoff)
[11:34:05] <wikibugs>	 (PS8) Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:34:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (4) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:34:51] <wikibugs>	 (PS1) Muehlenhoff: uwsgi: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810321
[11:35:23] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36161/console" [puppet] - https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[11:35:43] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/810321 (owner: Muehlenhoff)
[11:36:25] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36162/console" [puppet] - https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[11:36:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P30718 and previous config saved to /var/cache/conftool/dbconfig/20220701-113626-ladsgroup.json
[11:36:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:11] <wikibugs>	 (PS13) Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[11:38:27] <wikibugs>	 (CR) CI reject: [V: -1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik)
[11:38:28] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[11:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:37] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[11:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P30719 and previous config saved to /var/cache/conftool/dbconfig/20220701-113909-ladsgroup.json
[11:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:12] <wikibugs>	 (PS1) Muehlenhoff: librenms: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810323
[11:41:03] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/810323 (owner: Muehlenhoff)
[11:43:39] <wikibugs>	 (PS1) Muehlenhoff: bacula::director: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810325
[11:44:56] <wikibugs>	 (CR) CI reject: [V: -1] bacula::director: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810325 (owner: Muehlenhoff)
[11:48:16] <wikibugs>	 (PS2) Muehlenhoff: bacula::director: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810325
[11:51:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P30720 and previous config saved to /var/cache/conftool/dbconfig/20220701-115131-ladsgroup.json
[11:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:24] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/810306 (owner: Muehlenhoff)
[11:54:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P30721 and previous config saved to /var/cache/conftool/dbconfig/20220701-115414-ladsgroup.json
[11:54:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:17] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[11:54:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, though clouddb-wikilabels-01.clouddb-services.eqiad.wmflabs shows up as diff in PCC" [puppet] - https://gerrit.wikimedia.org/r/810318 (owner: Muehlenhoff)
[12:00:19] <wikibugs>	 (CR) Muehlenhoff: prometheus::postgres_exporter: Remove support for stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810318 (owner: Muehlenhoff)
[12:02:10] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[12:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:18] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[12:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:42] <wikibugs>	 (PS3) Slyngshede: P:aptrepo::wikimedia move private repo to nginx and uninstall apache [puppet] - https://gerrit.wikimedia.org/r/809969
[12:04:57] <wikibugs>	 (CR) Slyngshede: P:aptrepo::wikimedia move private repo to nginx and uninstall apache (2 comments) [puppet] - https://gerrit.wikimedia.org/r/809969 (owner: Slyngshede)
[12:06:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T309311)', diff saved to https://phabricator.wikimedia.org/P30722 and previous config saved to /var/cache/conftool/dbconfig/20220701-120636-ladsgroup.json
[12:06:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[12:06:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:44] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[12:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[12:06:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T309311)', diff saved to https://phabricator.wikimedia.org/P30723 and previous config saved to /var/cache/conftool/dbconfig/20220701-120657-ladsgroup.json
[12:07:00] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: split blackbox-exporter logs into a file [puppet] - https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833)
[12:07:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:55] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: split blackbox-exporter logs into a file [puppet] - https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) (owner: Filippo Giunchedi)
[12:09:48] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[12:09:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:56] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[12:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:17] <wikibugs>	 SRE, Infrastructure-Foundations, cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (BTullis) @Andrew - I'm starting work on the bigtop build for bullseye now. I hope to have an update for you soon.
[12:11:27] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: split blackbox-exporter logs into a file [puppet] - https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833)
[12:12:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Ship it!" [puppet] - https://gerrit.wikimedia.org/r/810318 (owner: Muehlenhoff)
[12:14:43] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36163/console" [puppet] - https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) (owner: Filippo Giunchedi)
[12:16:08] <godog>	 I'm seeking reviewers for ^ should be straightforward (we do the same for prometheus server itself)
[12:19:14] <moritzm>	 looking
[12:19:34] <godog>	 cheers moritzm 
[12:24:17] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/810318 (owner: Muehlenhoff)
[12:24:53] <icinga-wm>	 RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:26:24] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) (owner: Filippo Giunchedi)
[12:26:40] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Thanks Moritz!" [puppet] - https://gerrit.wikimedia.org/r/810328 (https://phabricator.wikimedia.org/T311833) (owner: Filippo Giunchedi)
[12:34:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:37:50] <moritzm>	 !log uploaded rsyslog 8.2102.0-2+deb11u1+wmf2 to component/rsyslog-k8s (backport of latest security fixes on top of the rsyslog with mmkubernetes plugin)
[12:37:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:05] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[12:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:14] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[12:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:45] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:50:37] <wikibugs>	 (PS1) Marostegui: db2073: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810336 (https://phabricator.wikimedia.org/T311837)
[12:52:29] <wikibugs>	 (CR) Marostegui: [C: +2] db2073: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/810336 (https://phabricator.wikimedia.org/T311837) (owner: Marostegui)
[12:52:41] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[12:53:15] <wikibugs>	 (CR) David Caro: [C: +2] "🎉" [puppet] - https://gerrit.wikimedia.org/r/810307 (owner: Majavah)
[12:57:39] <wikibugs>	 (PS1) David Caro: sre.ganeti.makevm: Remove duplicated space [cookbooks] - https://gerrit.wikimedia.org/r/810338
[12:58:36] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2155 [puppet] - https://gerrit.wikimedia.org/r/810339 (https://phabricator.wikimedia.org/T311493)
[12:59:23] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2155 [puppet] - https://gerrit.wikimedia.org/r/810339 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[13:01:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2155 to s4 T311493', diff saved to https://phabricator.wikimedia.org/P30724 and previous config saved to /var/cache/conftool/dbconfig/20220701-130106-marostegui.json
[13:01:09] <wikibugs>	 (PS3) David Caro: wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810067 (owner: Majavah)
[13:01:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:11] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[13:03:14] <wikibugs>	 (CR) David Caro: [C: +2] Create REST api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[13:04:18] <wikibugs>	 (PS1) Zabe: RecentChange: Make join to comment table also straight [core] (wmf/1.39.0-wmf.18) - https://gerrit.wikimedia.org/r/810138 (https://phabricator.wikimedia.org/T311360)
[13:05:06] <wikibugs>	 (PS4) David Caro: Use our own alert managing [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789)
[13:05:08] <wikibugs>	 (PS5) David Caro: wmcs: added vm_console runbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930)
[13:05:10] <wikibugs>	 (PS5) David Caro: wmcs.ceph: don't use sre upgrade-and-reboot [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805327 (https://phabricator.wikimedia.org/T309786)
[13:05:12] <wikibugs>	 (PS3) David Caro: wmcs: move alerting code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805376 (https://phabricator.wikimedia.org/T309786)
[13:05:14] <wikibugs>	 (PS3) David Caro: wmcs.ceph.upgrade*: add sal logs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805377 (https://phabricator.wikimedia.org/T309786)
[13:05:16] <wikibugs>	 (PS4) David Caro: wmcs.ceph: move core code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786)
[13:05:18] <wikibugs>	 (PS4) David Caro: wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805742
[13:05:21] <wikibugs>	 (PS2) David Caro: wmcs.openstaack: Add runbook to increase the quotas [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/806429 (https://phabricator.wikimedia.org/T297606)
[13:05:22] <wikibugs>	 (PS2) David Caro: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/809543
[13:05:24] <wikibugs>	 (PS1) Zabe: Revert "Revert "RecentChange: Straight join to actor table when needed"" [core] (wmf/1.39.0-wmf.18) - https://gerrit.wikimedia.org/r/810139 (https://phabricator.wikimedia.org/T311360)
[13:08:38] <wikibugs>	 (PS3) Jcrespo: bacula::director: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810325 (owner: Muehlenhoff)
[13:08:40] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:49] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[13:08:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:49] <wikibugs>	 (PS3) David Caro: wmcs: Parse enums at argparse level [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810285
[13:11:13] <wikibugs>	 (CR) Jcrespo: "Thank you very much for rising this. If we get a +1 verified, this noop should be ready to deploy. However, allow me to delay deployment f" [puppet] - https://gerrit.wikimedia.org/r/810325 (owner: Muehlenhoff)
[13:11:28] <wikibugs>	 (CR) Jcrespo: [C: +1] bacula::director: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810325 (owner: Muehlenhoff)
[13:12:37] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:45] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[13:12:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T309311)', diff saved to https://phabricator.wikimedia.org/P30725 and previous config saved to /var/cache/conftool/dbconfig/20220701-131316-ladsgroup.json
[13:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:21] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[13:19:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:19:33] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:42] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[13:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:28] <wikibugs>	 (PS14) Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[13:23:12] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:21] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s)
[13:23:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (Cmjohnson)
[13:24:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (Cmjohnson)
[13:24:43] <wikibugs>	 (CR) CI reject: [V: -1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik)
[13:28:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P30726 and previous config saved to /var/cache/conftool/dbconfig/20220701-132821-ladsgroup.json
[13:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:05] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:13] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[13:36:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (Jgreen)
[13:42:23] <wikibugs>	 (CR) EllenR: [C: +1] "sorry for delay aok" [mediawiki-config] - https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: Eigyan)
[13:43:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (Jgreen)
[13:43:25] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:43:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P30727 and previous config saved to /var/cache/conftool/dbconfig/20220701-134326-ladsgroup.json
[13:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:33] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 07s)
[13:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:57] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[13:43:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:02] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:47:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:11] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (Jgreen) I renamed what was originally "frdev1003" on this task to "frdb1006" because that better describes the server's role.
[13:50:19] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[13:50:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:40] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[13:56:12] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul) @Marostegui 61,62,63, and 65 are ready as well
[13:56:46] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Marostegui) Thank you!!
[13:57:57] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:58:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T309311)', diff saved to https://phabricator.wikimedia.org/P30728 and previous config saved to /var/cache/conftool/dbconfig/20220701-135831-ladsgroup.json
[13:58:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:39] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[14:04:59] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[14:05:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:08] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[14:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:26] <wikibugs>	 (PS1) Jgreen: add frdb1005 and frdb1006 [dns] - https://gerrit.wikimedia.org/r/810345 (https://phabricator.wikimedia.org/T306935)
[14:12:44] <wikibugs>	 (CR) Jgreen: [C: +2] add frdb1005 and frdb1006 [dns] - https://gerrit.wikimedia.org/r/810345 (https://phabricator.wikimedia.org/T306935) (owner: Jgreen)
[14:13:12] <urandom>	 Is there anyone that can look at restbase2018?  It's been down for a while now.
[14:14:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops, Patch-For-Review: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (Jgreen)
[14:18:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (BTullis) If it's of any help, our team has just had some success with a similar kind of partman recipe that creates a big LVM vo...
[14:26:43] <icinga-wm>	 PROBLEM - Host clouddumps1001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:28:03] <icinga-wm>	 RECOVERY - Host clouddumps1001 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms
[14:34:17] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:39:36] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudstore[1008-1009]
[14:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2166.codfw.wmnet with OS bullseye
[14:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:28] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2166.codfw.wmnet with OS bullseye
[14:43:29] <wikibugs>	 (CR) Nskaggs: Add dumps mapping to cache_upload (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: BBlack)
[14:44:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2167.codfw.wmnet with OS bullseye
[14:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:56] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2167.codfw.wmnet with OS bullseye
[14:46:29] <wikibugs>	 (PS2) Giuseppe Lavagetto: mediawiki: allow forcing the backend for blank page on wikipedia [puppet] - https://gerrit.wikimedia.org/r/810312 (https://phabricator.wikimedia.org/T311386)
[14:46:31] <wikibugs>	 (PS2) Giuseppe Lavagetto: lvs: check php 7.4 too on all appservers [puppet] - https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386)
[14:46:33] <wikibugs>	 (PS1) Giuseppe Lavagetto: jobrunner: allow selecting explicitly the backend when performing health checks. [puppet] - https://gerrit.wikimedia.org/r/810348 (https://phabricator.wikimedia.org/T311386)
[14:48:43] <wikibugs>	 (PS1) Andrew Bogott: Remove references to cloudstore100[89] [puppet] - https://gerrit.wikimedia.org/r/810349 (https://phabricator.wikimedia.org/T311844)
[14:48:45] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[14:48:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:44] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[14:59:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2166.codfw.wmnet with reason: host reimage
[14:59:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[14:59:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:59:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T309311)', diff saved to https://phabricator.wikimedia.org/P30729 and previous config saved to /var/cache/conftool/dbconfig/20220701-145937-ladsgroup.json
[14:59:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:41] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[15:01:33] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudstore[1008-1009]
[15:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:14] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[15:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:23] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[15:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2166.codfw.wmnet with reason: host reimage
[15:02:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2167.codfw.wmnet with reason: host reimage
[15:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2167.codfw.wmnet with reason: host reimage
[15:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:01] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:10:34] <wikibugs>	 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10Andrew) a:05Andrew→03Cmjohnson Because these are HP boxes (I think?) the decom script was unable to actually shut them down. They are n...
[15:10:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudstore100[89] [puppet] - 10https://gerrit.wikimedia.org/r/810349 (https://phabricator.wikimedia.org/T311844) (owner: 10Andrew Bogott)
[15:13:45] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:14:04] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove cloudstore100[89] IPs from the dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/810351 (https://phabricator.wikimedia.org/T311844)
[15:15:43] <wikibugs>	 (03PS3) 10BCornwall: prometheus: Add custom vm.max_map_count metric [puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445)
[15:16:54] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2166.codfw.wmnet with OS bullseye
[15:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:01] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2166.codfw.wmnet with OS bullseye completed: - db2...
[15:22:15] <wikibugs>	 (03PS2) 10Andrew Bogott: Change formatting of a few openstack calls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107
[15:22:17] <wikibugs>	 (03PS9) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah)
[15:22:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2167.codfw.wmnet with OS bullseye
[15:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:32] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2167.codfw.wmnet with OS bullseye completed: - db2...
[15:23:39] <wikibugs>	 (03CR) 10BCornwall: prometheus: Add custom vm.max_map_count metric (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: 10BCornwall)
[15:24:39] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@cloudelastic-chi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Andrew) public vlan, just like the existing cloudcontrols please.  All disks in hardware raid10, and then partman recipe 'pa...
[15:26:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Andrew)
[15:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:27:23] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f227d370280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:27:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Andrew) raid10-4dev.cfg for partman please!
[15:28:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Andrew)
[15:30:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: 10Andrew Bogott)
[15:31:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] galera-nodecheck: turn logging way, way down [puppet] - 10https://gerrit.wikimedia.org/r/806458 (owner: 10Andrew Bogott)
[15:31:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] remove nodecheck.sh. It was replaced with nodecheck.py [puppet] - 10https://gerrit.wikimedia.org/r/806457 (https://phabricator.wikimedia.org/T310664) (owner: 10Andrew Bogott)
[15:31:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:32:19] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::base::galera::node: remove old absented files [puppet] - 10https://gerrit.wikimedia.org/r/806450 (https://phabricator.wikimedia.org/T310664) (owner: 10Andrew Bogott)
[15:36:45] <wikibugs>	 (03PS1) 10David Caro: alerts: add a default duration of 1h [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810367
[15:36:47] <wikibugs>	 (03PS1) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368
[15:37:52] <wikibugs>	 (03CR) 10Andrew Bogott: Change formatting of a few openstack calls (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott)
[15:38:10] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027
[15:38:12] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/810031
[15:38:14] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: scap: drop unused parameters from the configuration [puppet] - 10https://gerrit.wikimedia.org/r/810048
[15:38:16] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/802220 (owner: 10Andrew Bogott)
[15:39:29] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: Use openstack cli for creating new glance image [puppet] - 10https://gerrit.wikimedia.org/r/802605 (owner: 10Andrew Bogott)
[15:40:38] <wikibugs>	 (03PS4) 10BCornwall: prometheus: Add custom vm.max_map_count metric [puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445)
[15:41:42] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:44:58] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:04] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:53:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2168.codfw.wmnet with OS bullseye
[15:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:01] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2168.codfw.wmnet with OS bullseye
[15:54:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro)
[15:59:24] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:53] <wikibugs>	 (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan)
[16:05:01] <wikibugs>	 (03PS3) 10Kosta Harlan: [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032)
[16:05:19] <wikibugs>	 (03PS5) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[16:08:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T309311)', diff saved to https://phabricator.wikimedia.org/P30730 and previous config saved to /var/cache/conftool/dbconfig/20220701-160831-ladsgroup.json
[16:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:44] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[16:10:54] <wikibugs>	 (03PS6) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[16:12:04] <wikibugs>	 (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan)
[16:13:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2168.codfw.wmnet with reason: host reimage
[16:13:04] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2042 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan)
[16:14:08] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:14:41] <wikibugs>	 (03PS1) 10Kosta Harlan: SuggestedEdits: Adjust thumbnailSource logic [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810142 (https://phabricator.wikimedia.org/T311789)
[16:16:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2168.codfw.wmnet with reason: host reimage
[16:16:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P30731 and previous config saved to /var/cache/conftool/dbconfig/20220701-162337-ladsgroup.json
[16:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:09] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 758, active_shards: 1519, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight
[16:28:09] <icinga-wm>	 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.86850756081526 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:29:00] <wikibugs>	 (03CR) 10Ahmon Dancy: mediawiki: add scap restarts script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810027 (owner: 10Giuseppe Lavagetto)
[16:30:10] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 04-1] scap: use the new script to restart php-fpm (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810031 (owner: 10Giuseppe Lavagetto)
[16:30:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2168.codfw.wmnet with OS bullseye
[16:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:47] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2168.codfw.wmnet with OS bullseye completed: - db2...
[16:34:49] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:38:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P30732 and previous config saved to /var/cache/conftool/dbconfig/20220701-163842-ladsgroup.json
[16:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:51] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T309311)', diff saved to https://phabricator.wikimedia.org/P30733 and previous config saved to /var/cache/conftool/dbconfig/20220701-165347-ladsgroup.json
[16:53:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[16:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:52] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[16:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[16:54:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30734 and previous config saved to /var/cache/conftool/dbconfig/20220701-165407-ladsgroup.json
[16:54:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:42] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul)
[16:56:17] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:02:23] <icinga-wm>	 PROBLEM - DNS on cloudstore1009.mgmt is CRITICAL: Domain cloudstore1009.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:06:25] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:17:02] <wikibugs>	 (03CR) 10Andrew Bogott: "A couple of comments inline. I'm concerned that the __init__ rename/refactor is going to break other things that rely on it (both due to t" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro)
[17:24:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:29:53] <wikibugs>	 (03PS3) 10Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/810106
[17:34:40] <wikibugs>	 (03PS6) 10Andrea Denisse: Add role::netmon to the netmon1003 instance. [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074)
[17:35:36] <wikibugs>	 (03PS7) 10Andrea Denisse: Add role::netmon to the netmon1003 instance. [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074)
[17:39:07] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[17:47:00] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[17:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:08] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[17:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30735 and previous config saved to /var/cache/conftool/dbconfig/20220701-174929-ladsgroup.json
[17:49:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:34] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[17:55:53] <icinga-wm>	 PROBLEM - DNS on cloudstore1008.mgmt is CRITICAL: Domain cloudstore1008.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:58:46] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10Infrastructure-Foundations, and 5 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (10Aklapper) 05Open→03Resolved No replies by anyone, boldly closing - shrug
[18:04:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P30736 and previous config saved to /var/cache/conftool/dbconfig/20220701-180434-ladsgroup.json
[18:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:49] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:08:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:12:58] <wikibugs>	 (03CR) 10Dzahn: vtrs: add promtheus blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) (owner: 10Dzahn)
[18:14:23] <wikibugs>	 (03PS3) 10Dzahn: vrts: add promtheus blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190)
[18:19:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P30737 and previous config saved to /var/cache/conftool/dbconfig/20220701-181939-ladsgroup.json
[18:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:33] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[18:34:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30738 and previous config saved to /var/cache/conftool/dbconfig/20220701-183444-ladsgroup.json
[18:34:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[18:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:49] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[18:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[18:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30739 and previous config saved to /var/cache/conftool/dbconfig/20220701-183504-ladsgroup.json
[18:35:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:05] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:38:25] <wikibugs>	 (03PS2) 10Mary Yang: DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146
[18:39:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (owner: 10Mary Yang)
[18:39:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:40:37] <wikibugs>	 (03PS3) 10Mary Yang: DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457)
[18:41:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[18:42:25] <wikibugs>	 (03PS1) 10Andrew Bogott: nfs-mounts: move math project to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/810393 (https://phabricator.wikimedia.org/T301280)
[18:50:02] <wikibugs>	 (03PS4) 10Mary Yang: Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457)
[18:50:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[18:56:41] <wikibugs>	 (03CR) 10Mary Yang: "Looks like autoloader requires an init.pp file also, but I am not sure what to put in there.." [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[19:01:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: move math project to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/810393 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott)
[19:04:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:08:13] <wikibugs>	 (03PS1) 10Andrew Bogott: nfs-mounts: fix c/p error with 'math' nfs path [puppet] - 10https://gerrit.wikimedia.org/r/810395 (https://phabricator.wikimedia.org/T301280)
[19:08:52] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: fix c/p error with 'math' nfs path [puppet] - 10https://gerrit.wikimedia.org/r/810395 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott)
[19:18:16] <wikibugs>	 (03CR) 10Urbanecm: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan)
[19:25:28] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "other than the proxy bit (and jerkins's -1), LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan)
[19:41:57] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:44:03] <wikibugs>	 (03PS1) 10Andrew Bogott: nfs-mounts: move video project to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/810398 (https://phabricator.wikimedia.org/T301280)
[19:47:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30740 and previous config saved to /var/cache/conftool/dbconfig/20220701-194716-ladsgroup.json
[19:47:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:22] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[19:47:54] <wikibugs>	 (03PS3) 10Dzahn: hieradata: switchover doc to doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[19:48:22] <wikibugs>	 (03Restored) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[19:50:35] <wikibugs>	 (03PS3) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653)
[19:51:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[19:51:24] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "eh, surprise after rebase.. hold on" [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[19:53:13] <wikibugs>	 (03PS4) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653)
[19:59:41] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#7982883, @Krinkle wrote: > 1. [change 744763 (puppet)](https://g...
[20:00:16] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) a:05hashar→03Dzahn
[20:02:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P30741 and previous config saved to /var/cache/conftool/dbconfig/20220701-200221-ladsgroup.json
[20:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:40] <wikibugs>	 (03PS1) 10Dzahn: doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653)
[20:10:21] <wikibugs>	 (03PS1) 10Dzahn: site/DHCP: decom doc1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/810400 (https://phabricator.wikimedia.org/T247653)
[20:12:49] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#7982883, @Krinkle wrote: > I propose the following rollout:  add...
[20:14:00] <wikibugs>	 (03PS7) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[20:15:45] <wikibugs>	 (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan)
[20:17:13] <wikibugs>	 (03PS1) 10Dzahn: doc: remove support for stretch / PHP7.0 [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653)
[20:17:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P30742 and previous config saved to /var/cache/conftool/dbconfig/20220701-201726-ladsgroup.json
[20:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:27] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:19:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "yep, thanks - https://puppet-compiler.wmflabs.org/pcc-worker1001/36165/" [puppet] - 10https://gerrit.wikimedia.org/r/810309 (owner: 10Muehlenhoff)
[20:20:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] doc: remove support for stretch / PHP7.0 [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[20:21:17] <wikibugs>	 (03CR) 10Dzahn: "lol @ " error during compilation: Evaluation Error: Error while evaluating a Function Call, profile not supported by stretch " from jenkin" [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[20:21:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (10Cmjohnson) 05Open→03Resolved @Jgreen  updated frdev1003 to frdb1006, thank you for fixing dns.  The on-site work has been completed, i...
[20:22:35] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:23] <wikibugs>	 (03PS2) 10Dzahn: doc: remove support for stretch, add support for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653)
[20:24:59] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: move video project to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/810398 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott)
[20:26:35] <wikibugs>	 (03PS1) 10Dzahn: typos: add "vtrs" [puppet] - 10https://gerrit.wikimedia.org/r/810403
[20:27:29] <wikibugs>	 (03PS2) 10Dzahn: typos: add "vtrs" [puppet] - 10https://gerrit.wikimedia.org/r/810403
[20:27:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] doc: remove support for stretch, add support for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[20:29:21] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T309311)', diff saved to https://phabricator.wikimedia.org/P30743 and previous config saved to /var/cache/conftool/dbconfig/20220701-203231-ladsgroup.json
[20:32:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[20:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:37] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[20:32:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[20:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T309311)', diff saved to https://phabricator.wikimedia.org/P30744 and previous config saved to /var/cache/conftool/dbconfig/20220701-203251-ladsgroup.json
[20:32:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:15] <wikibugs>	 (03PS5) 10Dzahn: Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[20:33:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[20:33:51] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:27] <wikibugs>	 (03CR) 10Dzahn: "made some minor changes to fix the CI / "in auto-layout" issue and some brackets.  you can look at the diff between PS4 and PS5:  https://" [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[20:35:50] <wikibugs>	 (03CR) 10Dzahn: Add puppet profile and role files for wikifunctions. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[20:36:09] <wikibugs>	 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10Cmjohnson) a:03Cmjohnson Ack this task, will take care of next week
[20:37:25] <wikibugs>	 (03PS1) 10Clare Ming: Remove Table of Contents config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810405 (https://phabricator.wikimedia.org/T310527)
[20:37:52] <wikibugs>	 (03CR) 10Dzahn: Add puppet profile and role files for wikifunctions. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[20:39:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Cmjohnson) @Jgreen i don't seem to have the template directory or 10.in file in my DNS repo to make changes for you. If you can update frlog1002's dns then you...
[20:39:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:41:39] <wikibugs>	 (03CR) 10Dzahn: "the part that CI doesn't like has now changed to just "following are missing a SPDX licence header"." [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[20:58:40] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#7982883, @Krinkle wrote: > I propose the following rollout:  I s...
[21:04:48] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS bullseye
[21:04:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye
[21:08:54] <wikibugs>	 (03PS1) 10QChris: Allow “Gerrit Managers” to import history [software/varnish/libvmod-querysort] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/810409
[21:09:00] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [software/varnish/libvmod-querysort] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/810409 (owner: 10QChris)
[21:09:01] <mutante>	 !log https://doc.wikimedia.org - scheduled maintenance period - switching to buster backend doc1002 (T247653)
[21:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:05] <stashbot>	 T247653: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653
[21:09:29] <wikibugs>	 (03PS1) 10QChris: Import done. Revoke import grants [software/varnish/libvmod-querysort] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/810410
[21:09:32] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [software/varnish/libvmod-querysort] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/810410 (owner: 10QChris)
[21:13:42] <hauskatze>	 Hallo qchris :)
[21:13:52] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] hieradata: switchover doc to doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[21:13:59] <qchris>	 Hi hauskatze :)
[21:17:03] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1006.eqiad.wmnet with reason: host reimage
[21:17:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:11] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host stat1009.eqiad.wmnet with OS bullseye
[21:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye
[21:20:33] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1006.eqiad.wmnet with reason: host reimage
[21:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:54] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[21:21:05] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[21:24:09] <urandom>	 mutante: would you have a few minutes to look at restbase2018?  It's down -including SSH- and has been for a while now.
[21:24:31] <mutante>	 urandom: sorry, I am in the middle of a maintenance window, maybe after that
[21:24:37] <urandom>	 ok
[21:29:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T309311)', diff saved to https://phabricator.wikimedia.org/P30745 and previous config saved to /var/cache/conftool/dbconfig/20220701-212903-ladsgroup.json
[21:29:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:08] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[21:30:14] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stat1009.eqiad.wmnet with reason: host reimage
[21:30:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[21:31:09] <wikibugs>	 (03PS2) 10Dzahn: doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653)
[21:31:25] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2] doc: remove doc1001 from doc::all_hosts and scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/810399 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[21:33:52] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat1009.eqiad.wmnet with reason: host reimage
[21:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:41] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:34:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1006.eqiad.wmnet with OS bullseye
[21:34:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye co...
[21:36:57] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1014.eqiad.wmnet with OS bullseye
[21:36:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1012.eqiad.wmnet with OS bullseye
[21:36:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1009.eqiad.wmnet with OS bullseye
[21:36:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye
[21:36:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1013.eqiad.wmnet with OS bullseye
[21:36:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1008.eqiad.wmnet with OS bullseye
[21:36:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1011.eqiad.wmnet with OS bullseye
[21:36:59] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1010.eqiad.wmnet with OS bullseye
[21:37:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1015.eqiad.wmnet with OS bullseye
[21:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye
[21:37:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1014.eqiad.wmnet with OS bullseye
[21:37:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye
[21:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye
[21:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1013.eqiad.wmnet with OS bullseye
[21:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1008.eqiad.wmnet with OS bullseye
[21:37:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye
[21:37:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1010.eqiad.wmnet with OS bullseye
[21:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1015.eqiad.wmnet with OS bullseye
[21:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P30746 and previous config saved to /var/cache/conftool/dbconfig/20220701-214408-ladsgroup.json
[21:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:29] <wikibugs>	 (03PS5) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653)
[21:45:57] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[21:48:37] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stat1009.eqiad.wmnet with OS bullseye
[21:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye completed: - stat1009 (**PASS...
[21:48:45] <mutante>	 !log https://doc.wikimedia.org switched to doc1002 backend on buster T247653
[21:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:49] <stashbot>	 T247653: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653
[21:49:19] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1014.eqiad.wmnet with reason: host reimage
[21:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:22] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage
[21:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:26] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1010.eqiad.wmnet with reason: host reimage
[21:49:26] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage
[21:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1013.eqiad.wmnet with reason: host reimage
[21:49:29] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1008.eqiad.wmnet with reason: host reimage
[21:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:34] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage
[21:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage
[21:50:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage
[21:50:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1013.eqiad.wmnet with reason: host reimage
[21:50:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1008.eqiad.wmnet with reason: host reimage
[21:50:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1010.eqiad.wmnet with reason: host reimage
[21:50:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1014.eqiad.wmnet with reason: host reimage
[21:50:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:51] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage
[21:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:54] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1012.eqiad.wmnet with OS bullseye
[21:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye ex...
[21:51:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1015.eqiad.wmnet with OS bullseye
[21:52:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1015.eqiad.wmnet with OS bullseye ex...
[21:57:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1007.eqiad.wmnet with OS bullseye
[21:57:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1011.eqiad.wmnet with OS bullseye
[21:57:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye ex...
[21:57:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye ex...
[21:57:49] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1009.eqiad.wmnet with OS bullseye
[21:57:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye ex...
[21:59:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P30747 and previous config saved to /var/cache/conftool/dbconfig/20220701-215913-ladsgroup.json
[21:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:06] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1015.eqiad.wmnet with OS bullseye
[22:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1015.eqiad.wmnet with OS bullseye
[22:02:55] <mutante>	 urandom: restbase2018 is running, I can see it on mgmt. so it's "just" cable or switch port. we will just have to ask dcops via ticket
[22:02:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1012.eqiad.wmnet with OS bullseye
[22:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye
[22:03:05] <mutante>	 urandom: it's properly depooled? no problem right now?
[22:03:50] <urandom>	 it's not depooled per se, but it's not creating an outage if that's what you mean
[22:04:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10Cmjohnson)
[22:04:10] <mutante>	 yea, whatever needs to be done so that it does not get traffic or cause issues while it's down
[22:04:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:04:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10Cmjohnson) 05Open→03Resolved resolved
[22:04:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1010.eqiad.wmnet with OS bullseye
[22:04:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1010.eqiad.wmnet with OS bullseye co...
[22:05:24] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1008.eqiad.wmnet with OS bullseye
[22:05:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1008.eqiad.wmnet with OS bullseye co...
[22:05:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1013.eqiad.wmnet with OS bullseye
[22:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1013.eqiad.wmnet with OS bullseye co...
[22:08:58] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1014.eqiad.wmnet with OS bullseye
[22:09:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1014.eqiad.wmnet with OS bullseye co...
[22:10:52] <wikibugs>	 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn)
[22:11:13] <mutante>	 urandom: I made a ticket ^
[22:12:38] <mutante>	 !log restbase2018 - attempting power cycle via mgmt - /admin1-> racadm serveraction powercycle  (T311890)
[22:12:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:42] <stashbot>	 T311890: restbase2018 down  - https://phabricator.wikimedia.org/T311890
[22:14:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T309311)', diff saved to https://phabricator.wikimedia.org/P30748 and previous config saved to /var/cache/conftool/dbconfig/20220701-221418-ladsgroup.json
[22:14:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[22:14:21] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1015.eqiad.wmnet with reason: host reimage
[22:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:23] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[22:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[22:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:37] <urandom>	 mutante: sorry, I should have opened one
[22:14:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T309311)', diff saved to https://phabricator.wikimedia.org/P30749 and previous config saved to /var/cache/conftool/dbconfig/20220701-221438-ladsgroup.json
[22:14:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:23] <mutante>	 urandom: no worries, let's try this one powercycle
[22:15:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] haproxy/nova-api-metadata use the /healthcheck endpoint for health check [puppet] - 10https://gerrit.wikimedia.org/r/806350 (owner: 10Andrew Bogott)
[22:15:31] <mutante>	 you can't get on mgmt.. so...
[22:15:32] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:15:39] <wikibugs>	 (03PS2) 10Andrew Bogott: haproxy/nova-api-metadata use the /healthcheck endpoint for health check [puppet] - 10https://gerrit.wikimedia.org/r/806350
[22:15:44] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:15:44] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.48.125:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:15:45] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.48.124:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:16:00] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:16:02] <mutante>	 urandom: well.. those alerts kind of sound like it wasn't actually down?
[22:16:06] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:16:10] <mutante>	 or somehow in limbo
[22:16:22] <icinga-wm>	 PROBLEM - puppet last run on restbase2018 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:16:32] <mutante>	 1 day ago? 
[22:16:38] <icinga-wm>	 RECOVERY - Restbase root url on restbase2018 is OK: HTTP OK: HTTP/1.1 200 - 17235 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[22:17:04] <mutante>	 urandom: try ssh now. it works again
[22:17:28] <urandom>	 mutante: I mean, it was definitely in some sort of limbo/broken state, if not actually totally down
[22:17:29] <mutante>	 [restbase2018:~] $ uptime 22:17:20 up 3 min,
[22:17:30] <icinga-wm>	 RECOVERY - cassandra-b service on restbase2018 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:17:51] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1015.eqiad.wmnet with reason: host reimage
[22:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:03] <mutante>	 urandom: ACK, yea. and nothing in hardware fail log
[22:18:10] <urandom>	 weird.
[22:18:21] <urandom>	 anyway, yeah, seems Ok now
[22:18:26] <icinga-wm>	 RECOVERY - SSH on restbase2018 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:18:36] <icinga-wm>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1001.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:19:29] <urandom>	 mutante: thank you!
[22:20:23] <mutante>	 urandom: no problem. I am just not sure what to do with the ticket. probably nothing though :)
[22:20:32] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.124 port 9042 https://phabricator.wikimedia.org/T93886
[22:20:36] <mutante>	 I glanced at syslog as well
[22:21:30] <mutante>	 there is a separate syslog just for restbase too, but:
[22:21:31] <mutante>	 May 17 14:08:38 restbase2018 restbase[27229]: #033]0;firejail /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml #007Child process initialized in 98.93 ms
[22:21:34] <icinga-wm>	 RECOVERY - puppet last run on restbase2018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:21:35] <mutante>	 Jul  1 22:14:34 restbase2018 restbase[937]: Reading profile /etc/firejail/default.profile
[22:22:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1012.eqiad.wmnet with OS bullseye
[22:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye ex...
[22:22:50] <wikibugs>	 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) powercycling via mgmt brought it back as if nothing happened; nothing obvious in syslog, or restbase/syslog.
[22:23:18] <wikibugs>	 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) 05Open→03Resolved a:03Dzahn feel free to reopen if you see any issue with this again
[22:23:46] <wikibugs>	 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) ` 22:20 <+icinga-wm> RECOVERY - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.124 port 9042 https://phabricator.wikimedia.org/T93886 22:20 < mutante> I glance...
[22:27:10] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:14] <wikibugs>	 (03PS1) 10BryanDavis: striker: Open firewall for Docker-managed service [puppet] - 10https://gerrit.wikimedia.org/r/810413 (https://phabricator.wikimedia.org/T306469)
[22:31:16] <wikibugs>	 (03PS1) 10BryanDavis: striker: Bump container version to 2022-07-01-210101-production [puppet] - 10https://gerrit.wikimedia.org/r/810414 (https://phabricator.wikimedia.org/T306469)
[22:31:32] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.48.124:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-a valid until 2022-10-08 10:54:06 +0000 (expires in 98 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:31:34] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1012.eqiad.wmnet with OS bullseye
[22:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye
[22:32:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1015.eqiad.wmnet with OS bullseye
[22:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1015.eqiad.wmnet with OS bullseye co...
[22:33:48] <icinga-wm>	 RECOVERY - cassandra-a service on restbase2018 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:34:44] <wikibugs>	 (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/36166/" [puppet] - 10https://gerrit.wikimedia.org/r/810413 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis)
[22:36:02] <wikibugs>	 (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36167/" [puppet] - 10https://gerrit.wikimedia.org/r/810414 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis)
[22:36:26] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is OK: TCP OK - 0.037 second response time on 10.192.48.125 port 9042 https://phabricator.wikimedia.org/T93886
[22:38:44] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.48.125:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-b valid until 2022-10-08 10:54:09 +0000 (expires in 98 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:41:10] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.126 port 9042 https://phabricator.wikimedia.org/T93886
[22:43:38] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-c valid until 2022-10-08 10:54:12 +0000 (expires in 98 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:43:46] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1012.eqiad.wmnet with reason: host reimage
[22:43:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:17] <wikibugs>	 10SRE, 10ops-codfw: restbase2018 down - https://phabricator.wikimedia.org/T311890 (10Dzahn) ` 22:36 <+icinga-wm> RECOVERY - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is OK: TCP OK - 0.037 second response time on 10.192.48.125 port 9042 https://phabricator.wikimedia.org/T93886 22:38 <+icinga-wm> RECOVE...
[22:46:02] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2018 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:47:22] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1012.eqiad.wmnet with reason: host reimage
[22:47:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:11] <wikibugs>	 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite)
[23:02:31] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1012.eqiad.wmnet with OS bullseye
[23:02:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1012.eqiad.wmnet with OS bullseye co...
[23:02:44] <wikibugs>	 10SRE, 10Observability-Logging, 10SRE Observability (FY2021/2022-Q4): apifeatureusage hosts hanging on shutdown - https://phabricator.wikimedia.org/T305403 (10colewhite) 05Open→03Resolved a:03herron There hasn't been a need to test if the patch above fixed the issue, but I think we can close it and cir...
[23:04:17] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:10:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T309311)', diff saved to https://phabricator.wikimedia.org/P30750 and previous config saved to /var/cache/conftool/dbconfig/20220701-231009-ladsgroup.json
[23:10:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:14] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[23:25:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P30751 and previous config saved to /var/cache/conftool/dbconfig/20220701-232514-ladsgroup.json
[23:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P30752 and previous config saved to /var/cache/conftool/dbconfig/20220701-234019-ladsgroup.json
[23:40:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T309311)', diff saved to https://phabricator.wikimedia.org/P30753 and previous config saved to /var/cache/conftool/dbconfig/20220701-235524-ladsgroup.json
[23:55:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[23:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:29] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[23:55:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[23:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:58:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase