[00:04:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS stretch
[00:04:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:23] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch
[00:09:57] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[00:10:07] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:10:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Logstash Access for  Ammarpad - https://phabricator.wikimedia.org/T302250 (10Dzahn) Hey @Ammarpad your user page and Wikitech / LDAP user don't show your realname but @KFrancis from Legal needs it to go through the NDA process with you.  Could you please shoot her an email (https...
[00:15:14] <wikibugs>	 (03PS1) 10Ahmon Dancy: mediawiki: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919
[00:23:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[00:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:51] <roy649>	 Today's enwiki featured image is failing to display.  I opened https://phabricator.wikimedia.org/T302357 for it.
[00:26:53] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[00:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:47] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:24] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:44:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:29] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:51:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:37] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage
[00:52:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:58] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage
[00:56:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:53] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:40] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[00:59:40] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:59:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:50] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage
[01:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:19] <Jdlrobson>	 beta cluster doesn't seem to be updating today
[01:03:14] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage
[01:03:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS stretch
[01:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:30] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch completed: - ms-be2067 (*...
[01:08:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS stretch
[01:09:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:02] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch
[01:13:14] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] restbase: add deployment-restbase04 [puppet] - 10https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan)
[01:18:23] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1004.wikimedia.org with OS bullseye
[01:18:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS stretch
[01:20:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:12] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch
[01:27:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[01:27:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[01:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:31] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:38:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[01:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:36] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:41:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[01:41:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:56] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2066.codfw.wmnet with OS stretch
[01:51:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:01] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors: - m...
[01:51:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS stretch
[01:51:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch
[01:54:31] <wikibugs>	 (03CR) 10Cwhite: "There are more instances than just eqiad.  Do we need to provide proxy for those as well?" [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[01:59:19] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Logstash Access for  Ammarpad - https://phabricator.wikimedia.org/T302250 (10KFrancis) Thanks all.  I'm processing this now.
[02:02:44] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10Reedy)
[02:06:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[02:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[02:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:56] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) @fgiunchedi puppet has failed on ms-be2067 and ms-be2068 with the error below. If you are back online, can you please check? Thanks  ` Error: 'parted --script...
[02:19:04] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul)
[02:20:39] <wikibugs>	 (03PS1) 10Andrew Bogott: nfs-mounts.yaml.erb: remove nfs mounts for wikipathways [puppet] - 10https://gerrit.wikimedia.org/r/764930 (https://phabricator.wikimedia.org/T301298)
[02:27:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-statsd-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:35:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts.yaml.erb: remove nfs mounts for wikipathways [puppet] - 10https://gerrit.wikimedia.org/r/764930 (https://phabricator.wikimedia.org/T301298) (owner: 10Andrew Bogott)
[02:40:11] <wikibugs>	 (03CR) 10Herron: [C: 03+2] prometheus: sketch out proxied prometheus web with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[02:40:26] <wikibugs>	 (03CR) 10Herron: prometheus: sketch out proxied prometheus web with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[02:49:43] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2066.codfw.wmnet with OS stretch
[02:49:45] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS stretch
[02:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:48] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors: - m...
[02:49:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:53] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch executed with errors: - m...
[02:51:18] <wikibugs>	 (03CR) 10Herron: prometheus: sketch out proxied prometheus web with IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[03:04:03] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:37] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:42:51] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-statsd-exporter.service,wmf_auto_restart_prometheus-statsd-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:33] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:01:55] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcontrol1003, ms-be2067, prometheus1006, ms-be2068, cloudcontrol1005 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:02:57] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:14:23] <wikibugs>	 (03PS1) 10Ladsgroup: ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764837 (https://phabricator.wikimedia.org/T283029)
[04:14:37] <wikibugs>	 (03PS1) 10Ladsgroup: ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764838 (https://phabricator.wikimedia.org/T283029)
[04:14:45] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764837 (https://phabricator.wikimedia.org/T283029) (owner: 10Ladsgroup)
[04:14:49] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764838 (https://phabricator.wikimedia.org/T283029) (owner: 10Ladsgroup)
[04:27:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[04:27:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[04:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T302363)', diff saved to https://phabricator.wikimedia.org/P21322 and previous config saved to /var/cache/conftool/dbconfig/20220223-042802-ladsgroup.json
[04:28:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:11] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[04:28:56] <wikibugs>	 (03Merged) 10jenkins-bot: ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764837 (https://phabricator.wikimedia.org/T283029) (owner: 10Ladsgroup)
[04:29:01] <wikibugs>	 (03Merged) 10jenkins-bot: ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764838 (https://phabricator.wikimedia.org/T283029) (owner: 10Ladsgroup)
[04:31:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2150.codfw.wmnet with OS bullseye
[04:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:33:51] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.23/includes/page/ParserOutputAccess.php: Backport: [[gerrit:764837|ParserOutputAccess: Check for latest revision when checking for cache (T283029)]] (duration: 00m 51s)
[04:33:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:33:57] <stashbot>	 T283029: FlaggableWikiPage::preloadPreparedEdit() does not actually carry over the parser output, leading to double parses on save - https://phabricator.wikimedia.org/T283029
[04:35:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:35:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:16] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.22/includes/page/ParserOutputAccess.php: Backport: [[gerrit:764838|ParserOutputAccess: Check for latest revision when checking for cache (T283029)]] (duration: 00m 50s)
[04:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:36:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:37:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:43:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2150.codfw.wmnet with reason: host reimage
[04:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2150.codfw.wmnet with reason: host reimage
[04:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2150.codfw.wmnet with OS bullseye
[05:04:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T302363)', diff saved to https://phabricator.wikimedia.org/P21323 and previous config saved to /var/cache/conftool/dbconfig/20220223-051026-ladsgroup.json
[05:10:28] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:10:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:33] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[05:11:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[05:11:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[05:11:22] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T302363)', diff saved to https://phabricator.wikimedia.org/P21324 and previous config saved to /var/cache/conftool/dbconfig/20220223-051125-ladsgroup.json
[05:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:30] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:13:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2122.codfw.wmnet with OS bullseye
[05:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2122.codfw.wmnet with reason: host reimage
[05:29:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2122.codfw.wmnet with reason: host reimage
[05:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2122.codfw.wmnet with OS bullseye
[05:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T302363)', diff saved to https://phabricator.wikimedia.org/P21325 and previous config saved to /var/cache/conftool/dbconfig/20220223-055416-ladsgroup.json
[05:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:22] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[05:55:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[05:55:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[05:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T302363)', diff saved to https://phabricator.wikimedia.org/P21326 and previous config saved to /var/cache/conftool/dbconfig/20220223-055534-ladsgroup.json
[05:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:20] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-statsd-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:58:08] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:58:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2120.codfw.wmnet with OS bullseye
[05:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:28] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:12:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2120.codfw.wmnet with reason: host reimage
[06:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2120.codfw.wmnet with reason: host reimage
[06:15:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2120.codfw.wmnet with OS bullseye
[06:30:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T302363)', diff saved to https://phabricator.wikimedia.org/P21327 and previous config saved to /var/cache/conftool/dbconfig/20220223-063625-ladsgroup.json
[06:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:31] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[06:37:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:37:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2118 (T302363)', diff saved to https://phabricator.wikimedia.org/P21328 and previous config saved to /var/cache/conftool/dbconfig/20220223-063733-ladsgroup.json
[06:37:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2118.codfw.wmnet with OS bullseye
[06:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:27] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (10Joe)
[06:43:39] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, and 2 others: The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Joe) 05Open→03Resolved a:03Joe
[06:53:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2118.codfw.wmnet with reason: host reimage
[06:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:54:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:54:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:51] <Amir1>	 !log dbmaint on s2@codfw (T300992)
[06:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:57] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[06:56:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance
[06:56:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance
[06:56:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2118.codfw.wmnet with reason: host reimage
[06:56:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance
[06:59:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance
[06:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:02:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:02:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[07:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[07:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:13] <wikibugs>	 10SRE, 10Traffic: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 (10elukey) For varnishkafka, this is the problem:  ` elukey@apt1001:/srv/wikimedia$ sudo reprepro lsbycomponent varnishkafka varnishkafka | 1.0.13-1 | stretch-wikimedia |               main | amd64, source varnis...
[07:03:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[07:03:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[07:03:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance
[07:06:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance
[07:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2088.codfw.wmnet with reason: Maintenance
[07:09:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2088.codfw.wmnet with reason: Maintenance
[07:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[07:10:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[07:10:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21329 and previous config saved to /var/cache/conftool/dbconfig/20220223-071038-ladsgroup.json
[07:10:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:47] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[07:11:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10elukey) acked the alerts in icinga for elastic1093 :)
[07:12:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2118.codfw.wmnet with OS bullseye
[07:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21330 and previous config saved to /var/cache/conftool/dbconfig/20220223-071404-ladsgroup.json
[07:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:09] <wikibugs>	 (03PS1) 10Ayounsi: drmrs: add HE peers [homer/public] - 10https://gerrit.wikimedia.org/r/765190
[07:29:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P21331 and previous config saved to /var/cache/conftool/dbconfig/20220223-072909-ladsgroup.json
[07:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:41] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:33:09] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:33:43] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:40:49] <wikibugs>	 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10Joe) Without knowing more about the type of data and your access patterns, it's hard to provide a good suggestion around this. But, more in genera...
[07:42:28] <wikibugs>	 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10Joe) >>! In T301428#7716398, @Mstyles wrote: > Thanks @Joe, is there a hard limit on file sizes that can be stored inside the container? We might...
[07:44:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P21332 and previous config saved to /var/cache/conftool/dbconfig/20220223-074413-ladsgroup.json
[07:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] mjolnir: Restore prometheus_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) (owner: 10Ebernhardson)
[07:48:17] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:19] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: ReverseChronologicalPager: Fix displaying date headers for non-revisions [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764841 (https://phabricator.wikimedia.org/T302343)
[07:49:46] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764842 (https://phabricator.wikimedia.org/T302326)
[07:50:10] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Enable mobile DT at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) (owner: 10Esanders)
[07:51:53] <wikibugs>	 (03PS1) 10Elukey: Split the revscoring-editquality ml-serve settings in three [labs/private] - 10https://gerrit.wikimedia.org/r/765193 (https://phabricator.wikimedia.org/T301415)
[07:52:33] <icinga-wm>	 PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[07:52:36] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Split the revscoring-editquality ml-serve settings in three [labs/private] - 10https://gerrit.wikimedia.org/r/765193 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[07:52:58] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "o11y: temp relax of LogstashIndexingFailures" [alerts] - 10https://gerrit.wikimedia.org/r/765194 (https://phabricator.wikimedia.org/T288549)
[07:54:39] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256)
[07:55:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "o11y: temp relax of LogstashIndexingFailures" [alerts] - 10https://gerrit.wikimedia.org/r/765194 (https://phabricator.wikimedia.org/T288549) (owner: 10Filippo Giunchedi)
[07:56:16] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Enable mobile DiscussionTools at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) (owner: 10Esanders)
[07:59:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21333 and previous config saved to /var/cache/conftool/dbconfig/20220223-075918-ladsgroup.json
[07:59:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:59:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:25] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[07:59:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T300992)', diff saved to https://phabricator.wikimedia.org/P21334 and previous config saved to /var/cache/conftool/dbconfig/20220223-075926-ladsgroup.json
[07:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] <jouncebot>	 Amir1, awight, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T0800).
[08:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:25] <urbanecm>	 o/
[08:00:27] <urbanecm>	 i can deploy today
[08:00:30] <MatmaRex>	 hi
[08:00:34] <urbanecm>	 hi MatmaRex!
[08:01:10] <urbanecm>	 MatmaRex: do the config patches depend on the backports, please?
[08:01:55] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764842 (https://phabricator.wikimedia.org/T302326) (owner: 10Bartosz Dziewoński)
[08:02:00] <MatmaRex>	 urbanecm: yeah
[08:02:11] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] ReverseChronologicalPager: Fix displaying date headers for non-revisions [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764841 (https://phabricator.wikimedia.org/T302343) (owner: 10Bartosz Dziewoński)
[08:02:25] <urbanecm>	 okay, then we have to wait for CI to process the backports now :)
[08:04:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300992)', diff saved to https://phabricator.wikimedia.org/P21335 and previous config saved to /var/cache/conftool/dbconfig/20220223-080424-ladsgroup.json
[08:04:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:30] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[08:05:32] <wikibugs>	 (03Abandoned) 10Urbanecm: DNM: Testing patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764901 (owner: 10Urbanecm)
[08:05:44] <wikibugs>	 (03Merged) 10jenkins-bot: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764842 (https://phabricator.wikimedia.org/T302326) (owner: 10Bartosz Dziewoński)
[08:06:02] <wikibugs>	 (03PS1) 10Elukey: profile::kubernetes::deployment_server: split revscoring-ediquality [puppet] - 10https://gerrit.wikimedia.org/r/765196 (https://phabricator.wikimedia.org/T301415)
[08:06:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T302363)', diff saved to https://phabricator.wikimedia.org/P21336 and previous config saved to /var/cache/conftool/dbconfig/20220223-080609-ladsgroup.json
[08:06:12] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256)
[08:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:16] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[08:06:27] <urbanecm>	 that was quick
[08:07:19] <urbanecm>	 MatmaRex: first backport is at mwdebug1001, can you test it please?
[08:07:45] <MatmaRex>	 looking
[08:07:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::kubernetes::deployment_server: split revscoring-ediquality [puppet] - 10https://gerrit.wikimedia.org/r/765196 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[08:08:07] <MatmaRex>	 urbanecm: oh, we don't have that enabled anywhere yet :/ i can only test that with the config patch
[08:08:11] <urbanecm>	 oh
[08:08:19] <urbanecm>	 so should i pull one of the config patches there too?
[08:08:24] <urbanecm>	 (i assume https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/764868?)
[08:08:30] <MatmaRex>	 yeah. the mobile one
[08:08:31] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[08:08:34] <MatmaRex>	 yes, thanks
[08:08:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable mobile DiscussionTools at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) (owner: 10Esanders)
[08:08:47] <urbanecm>	 okay, give me a sec :)
[08:09:30] <wikibugs>	 (03Merged) 10jenkins-bot: Enable mobile DiscussionTools at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) (owner: 10Esanders)
[08:09:45] <urbanecm>	 MatmaRex: the config patch is at mwdebug1001 together with the backport now
[08:10:16] <MatmaRex>	 urbanecm: thanks. looks good on https://ht.m.wikipedia.org/wiki/Diskite:Paj_Prensipal
[08:10:22] <urbanecm>	 great! syncing
[08:10:29] <urbanecm>	 (backport first, then config)
[08:10:46] <urbanecm>	 actually...
[08:10:55] <urbanecm>	 never mind, htwiki is group0
[08:11:18] <urbanecm>	 no, it's not, my screen confused me at https://versions.toolforge.org/
[08:11:32] <MatmaRex>	 oh, hm
[08:11:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:44] <MatmaRex>	 i should backport to wmf.22 as well, right?
[08:13:03] <urbanecm>	 yeah, if you want the code to be live at htwiki
[08:13:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:13:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:17] <urbanecm>	 you can also wait for Thursday (when train deploys wmf.23 there)
[08:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[08:13:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[08:13:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T302363)', diff saved to https://phabricator.wikimedia.org/P21337 and previous config saved to /var/cache/conftool/dbconfig/20220223-081338-ladsgroup.json
[08:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:43] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764843 (https://phabricator.wikimedia.org/T302326)
[08:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:46] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[08:13:48] <MatmaRex>	 let's do it, if you don't mind
[08:13:52] <urbanecm>	 not at all
[08:14:03] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764843 (https://phabricator.wikimedia.org/T302326) (owner: 10Bartosz Dziewoński)
[08:14:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:14:32] <MatmaRex>	 (i was testing it wrong, i didn't realize that i was seeing the new tools because i had them enabled in preferences)
[08:14:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:55] <urbanecm>	 makes sense :)
[08:17:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2108.codfw.wmnet with OS bullseye
[08:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:27] <wikibugs>	 (03Merged) 10jenkins-bot: ReverseChronologicalPager: Fix displaying date headers for non-revisions [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764841 (https://phabricator.wikimedia.org/T302343) (owner: 10Bartosz Dziewoński)
[08:17:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: sketch out proxied prometheus web with IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[08:18:42] <wikibugs>	 (03Merged) 10jenkins-bot: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764843 (https://phabricator.wikimedia.org/T302326) (owner: 10Bartosz Dziewoński)
[08:19:19] <urbanecm>	 MatmaRex: all backports are now at mwdebug1001 (together with the mobile config patch)
[08:19:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P21338 and previous config saved to /var/cache/conftool/dbconfig/20220223-081929-ladsgroup.json
[08:19:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:07] <MatmaRex>	 urbanecm: thanks. now it's really as expected on https://ht.m.wikipedia.org/wiki/Diskite:Paj_Prensipal, while logged out too
[08:20:16] <urbanecm>	 great!
[08:20:17] <urbanecm>	 syncing :)
[08:20:41] <MatmaRex>	 and for the other backport, https://www.mediawiki.org/wiki/Special:Contributions/Matma_Rex looks fixed as well
[08:20:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:20:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:20:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:00] <urbanecm>	 excellent, will sync it too
[08:21:46] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.22/extensions/DiscussionTools/: b82e4eb: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions (T302326) (duration: 00m 52s)
[08:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:51] <stashbot>	 T302326: Enable reply and new topic tools unconditionally when Discussion Tools mobile is enabled - https://phabricator.wikimedia.org/T302326
[08:21:56] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256)
[08:22:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:22:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:35] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/DiscussionTools/: 269dcfd: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions (T302326) (duration: 00m 50s)
[08:23:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:17] <wikibugs>	 (03PS1) 10Elukey: admin_ng: add new namespaces for revscoring-editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/765198 (https://phabricator.wikimedia.org/T301415)
[08:24:19] <wikibugs>	 (03PS1) 10Elukey: ml-services: add helmfile config for the new revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415)
[08:24:23] <wikibugs>	 (03PS1) 10MMandere: varnish: change the default archive component for varnish [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301)
[08:24:26] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d9e8861: Enable mobile DiscussionTools at ht.wiki (T302259) (duration: 00m 50s)
[08:24:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:32] <stashbot>	 T302259: [Config Change] Offer mobile Reply and New Discussion Tools at ht.wiki - https://phabricator.wikimedia.org/T302259
[08:24:42] <urbanecm>	 MatmaRex: so, htwiki stuff is live now.  Can you advise a good sync order for the core backport?
[08:25:32] <urbanecm>	 I'm thinking about HistoryPager, ContribsPager, MergeHistoryPager and then the other two files, but I'm not sure about that
[08:25:43] <MatmaRex>	 oh, hm
[08:25:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti2029/ganeti2030 Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/765201 (https://phabricator.wikimedia.org/T298998)
[08:26:04] <wikibugs>	 (03PS2) 10Muehlenhoff: Make ganeti2029/ganeti2030 Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/765201 (https://phabricator.wikimedia.org/T298998)
[08:26:19] <MatmaRex>	 urbanecm: IndexPager last, the rest is whatever
[08:26:43] <urbanecm>	 okay
[08:26:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ml-services: add helmfile config for the new revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[08:27:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:27:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:28:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:53] <urbanecm>	 started
[08:29:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: add new namespaces for revscoring-editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/765198 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[08:29:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:29:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:40] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/actions/pagers/HistoryPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 1/5) (duration: 00m 49s)
[08:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:46] <stashbot>	 T302343: Date headings on Special:Contributions don't work well for Flow edits - https://phabricator.wikimedia.org/T302343
[08:29:47] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256) (owner: 10Bartosz Dziewoński)
[08:30:29] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/specials/pagers/ContribsPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 2/5) (duration: 00m 49s)
[08:30:31] <wikibugs>	 (03Merged) 10jenkins-bot: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256) (owner: 10Bartosz Dziewoński)
[08:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2108.codfw.wmnet with reason: host reimage
[08:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:19] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/specials/pagers/MergeHistoryPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 3/5) (duration: 00m 49s)
[08:31:20] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10fgiunchedi) >>! In T299468#7730659, @Papaul wrote: > @fgiunchedi puppet is failed on ms-be2067, ms-be2068 with the error below. if you back online can you pl...
[08:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:31:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:13] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/pager/ReverseChronologicalPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 4/5) (duration: 00m 53s)
[08:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:02] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/pager/IndexPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 5/5) (duration: 00m 48s)
[08:33:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:10] <urbanecm>	 MatmaRex: core backport should be live now
[08:33:27] <urbanecm>	 and the last config patch is at mwdebug1001 now. MatmaRex, can you test please?
[08:34:05] <MatmaRex>	 looking
[08:34:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:34:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2108.codfw.wmnet with reason: host reimage
[08:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P21339 and previous config saved to /var/cache/conftool/dbconfig/20220223-083433-ladsgroup.json
[08:34:35] <MatmaRex>	 urbanecm: yeah, looks good on https://www.mediawiki.org/wiki/Talk:Talk_pages_project/Usability
[08:34:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:41] <urbanecm>	 great, syncing!
[08:35:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:35:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:51] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 10cb05a: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org (T302256) (duration: 00m 49s)
[08:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:56] <stashbot>	 T302256: Config Change: offer Reply Tool, New Discussion Tool, Topic Subscriptions as Opt-Out at mediawiki.org - https://phabricator.wikimedia.org/T302256
[08:36:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:37] <urbanecm>	 MatmaRex: and all should be done now
[08:37:39] <urbanecm>	 anything else?
[08:37:45] <urbanecm>	 (sorry, had a short network issue here)
[08:37:49] <MatmaRex>	 thanks
[08:38:00] <MatmaRex>	 no more, that's enough patches ;)
[08:38:06] <urbanecm>	 fair enough :)
[08:38:16] <urbanecm>	 !log UTC morning B&C window done
[08:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:28] <apergos>	 speaking of patches, tomorrow we have a trainee for the morning slot!
[08:38:38] <urbanecm>	 good luck to them :)
[08:38:42] <apergos>	 I oughta know, I got her to sign up :-D
[08:39:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Disable cluster rebalances temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/765202 (https://phabricator.wikimedia.org/T284811)
[08:39:24] <apergos>	 I hope more than just me will be around tomorrow morning (please)
[08:39:57] <urbanecm>	 will try to :)
[08:40:00] <wikibugs>	 (03PS2) 10MMandere: varnish: change the default archive component for varnish [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301)
[08:40:03] <apergos>	 cool!
[08:40:05] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/765202 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[08:40:22] <urbanecm>	 apergos: just out of curiosity, how does one know when there are trainees or not?
[08:40:24] <apergos>	 first, I don't really wanna deploy by myself, and second, I want the norm to be that people learning have other people to rely on
[08:40:38] <apergos>	 and that doesn't happen if there's only one person here!
[08:40:58] <apergos>	 oh, if it's not a special case like this one where I said "go make a task", I go check the board:
[08:41:10] <apergos>	 https://phabricator.wikimedia.org/project/view/5265/
[08:41:26] <apergos>	 oh wow TWO trainees
[08:41:43] <urbanecm>	 even better :)
[08:42:02] <apergos>	 yeah we definitely need a couple people around, maybe one person can share screen while they deploy and the other can actually discuss what's going on and give all the links and so on
[08:42:21] <apergos>	 wow so exciting :-)
[08:42:29] * urbanecm is happy to play whichever role he's assigned
[08:42:38] <apergos>	 cool! thanks for just showing up 
[08:45:13] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:59] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/765202 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[08:49:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2108.codfw.wmnet with OS bullseye
[08:49:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300992)', diff saved to https://phabricator.wikimedia.org/P21340 and previous config saved to /var/cache/conftool/dbconfig/20220223-084938-ladsgroup.json
[08:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[08:49:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[08:49:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:49:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:45] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[08:49:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:49:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T300992)', diff saved to https://phabricator.wikimedia.org/P21341 and previous config saved to /var/cache/conftool/dbconfig/20220223-084951-ladsgroup.json
[08:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:28] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10SRE Observability: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 (10fgiunchedi)
[08:50:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Revert "Disable cluster rebalances temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/765202 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff)
[08:51:39] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: fix quantile config value type [puppet] - 10https://gerrit.wikimedia.org/r/765203 (https://phabricator.wikimedia.org/T302372)
[08:52:45] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: ms-be2068, cloudcontrol1005, cloudcontrol1003, ms-be2066, ms-be2067, cloudcontrol1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:52:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33944/console" [puppet] - 10https://gerrit.wikimedia.org/r/765203 (https://phabricator.wikimedia.org/T302372) (owner: 10Filippo Giunchedi)
[08:53:53] <godog>	 seeking reviewer for an easy but kinda urgent one ^
[08:54:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300992)', diff saved to https://phabricator.wikimedia.org/P21342 and previous config saved to /var/cache/conftool/dbconfig/20220223-085411-ladsgroup.json
[08:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] prometheus: fix quantile config value type [puppet] - 10https://gerrit.wikimedia.org/r/765203 (https://phabricator.wikimedia.org/T302372) (owner: 10Filippo Giunchedi)
[08:56:09] <godog>	 thank you elukey 
[08:56:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: fix quantile config value type [puppet] - 10https://gerrit.wikimedia.org/r/765203 (https://phabricator.wikimedia.org/T302372) (owner: 10Filippo Giunchedi)
[08:57:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T302363)', diff saved to https://phabricator.wikimedia.org/P21343 and previous config saved to /var/cache/conftool/dbconfig/20220223-085755-ladsgroup.json
[08:58:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:01] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[09:00:05] <jouncebot>	 dduvall and hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T0900).
[09:01:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2077.codfw.wmnet with reason: Maintenance
[09:01:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2077.codfw.wmnet with reason: Maintenance
[09:01:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[09:01:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[09:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2077 (T302363)', diff saved to https://phabricator.wikimedia.org/P21345 and previous config saved to /var/cache/conftool/dbconfig/20220223-090109-ladsgroup.json
[09:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:44] <godog>	 !log bounce prometheus-statsd-exporter on C:prometheus::statsd_exporter - T302372
[09:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:50] <stashbot>	 T302372: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372
[09:03:28] <wikibugs>	 (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[09:03:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2077.codfw.wmnet with OS bullseye
[09:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff)
[09:08:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The update is complete
[09:09:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P21346 and previous config saved to /var/cache/conftool/dbconfig/20220223-090916-ladsgroup.json
[09:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Both main Ganeti cluster have been upgraded to Buster.
[09:12:31] <wikibugs>	 (03PS1) 10Majavah: os-reports: add clouddb2001-dev task [puppet] - 10https://gerrit.wikimedia.org/r/765204
[09:14:21] <dcausse>	 !log restarting blazegraph on wdqs1007 (jvm stuck for 11 hours)
[09:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] "Thanks, merging" [puppet] - 10https://gerrit.wikimedia.org/r/765204 (owner: 10Majavah)
[09:17:48] <wikibugs>	 (03PS4) 10Muehlenhoff: ganeti: Retire ganeti216 option [puppet] - 10https://gerrit.wikimedia.org/r/764363
[09:18:23] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2077.codfw.wmnet with reason: host reimage
[09:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:08] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] drmrs: add HE peers [homer/public] - 10https://gerrit.wikimedia.org/r/765190 (owner: 10Ayounsi)
[09:21:41] <wikibugs>	 (03Merged) 10jenkins-bot: drmrs: add HE peers [homer/public] - 10https://gerrit.wikimedia.org/r/765190 (owner: 10Ayounsi)
[09:23:23] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10SRE Observability: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done, followup at {T302373}
[09:24:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P21347 and previous config saved to /var/cache/conftool/dbconfig/20220223-092421-ladsgroup.json
[09:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2077.codfw.wmnet with reason: host reimage
[09:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:45] <wikibugs>	 (03CR) 10Vgutierrez: "looks good, make sure that varnish6 is already available on the main component before merging as PCC (https://puppet-compiler.wmflabs.org/" [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere)
[09:33:47] <wikibugs>	 (03PS1) 10Ayounsi: drmrs: use BGP_aggregate_contributors for main prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/765205
[09:36:26] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: ml-services: add helmfile config for the new revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[09:36:28] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Rakefile: check existence of fixtures directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/765226
[09:37:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Retire ganeti216 option [puppet] - 10https://gerrit.wikimedia.org/r/764363 (owner: 10Muehlenhoff)
[09:38:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2077.codfw.wmnet with OS bullseye
[09:38:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300992)', diff saved to https://phabricator.wikimedia.org/P21348 and previous config saved to /var/cache/conftool/dbconfig/20220223-093925-ladsgroup.json
[09:39:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[09:39:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[09:39:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:31] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[09:39:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21349 and previous config saved to /var/cache/conftool/dbconfig/20220223-093933-ladsgroup.json
[09:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:43] <wikibugs>	 (03CR) 10MMandere: varnish: change the default archive component for varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere)
[09:41:47] <wikibugs>	 (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[09:43:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765226 (owner: 10Giuseppe Lavagetto)
[09:43:09] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add helmfile config for the new revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[09:44:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21350 and previous config saved to /var/cache/conftool/dbconfig/20220223-094405-ladsgroup.json
[09:44:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:29] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 71 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:46:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2077 (T302363)', diff saved to https://phabricator.wikimedia.org/P21351 and previous config saved to /var/cache/conftool/dbconfig/20220223-094655-ladsgroup.json
[09:47:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:01] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[09:47:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) Prometheus doesn't run on VMs in eqiad/codfw (not sure if this fact was...
[09:49:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:49:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:10] <wikibugs>	 (03PS1) 10Elukey: ml-services: move reverted models to their new namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/765228 (https://phabricator.wikimedia.org/T301415)
[09:59:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P21352 and previous config saved to /var/cache/conftool/dbconfig/20220223-095909-ladsgroup.json
[09:59:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:37] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[10:06:55] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[10:08:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: move reverted models to their new namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/765228 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[10:10:21] <wikibugs>	 (03PS1) 10Ladsgroup: cumin: Avoid creating alias for tendril [puppet] - 10https://gerrit.wikimedia.org/r/765234
[10:11:47] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[10:11:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:44] <wikibugs>	 (03CR) 10Ladsgroup: "We have a happy PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/33947/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/765234 (owner: 10Ladsgroup)
[10:14:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P21353 and previous config saved to /var/cache/conftool/dbconfig/20220223-101414-ladsgroup.json
[10:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:03] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765234 (owner: 10Ladsgroup)
[10:16:18] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] cumin: Avoid creating alias for tendril [puppet] - 10https://gerrit.wikimedia.org/r/765234 (owner: 10Ladsgroup)
[10:19:44] <wikibugs>	 (03PS1) 10Elukey: ml-services: move goodfaith/damaging models to the new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765235 (https://phabricator.wikimedia.org/T301415)
[10:23:07] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:24:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: move goodfaith/damaging models to the new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765235 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[10:24:49] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: move goodfaith/damaging models to the new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765235 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey)
[10:26:02] <wikibugs>	 (03PS1) 10Ladsgroup: db1181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765236 (https://phabricator.wikimedia.org/T302363)
[10:26:57] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:27:45] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765236 (https://phabricator.wikimedia.org/T302363) (owner: 10Ladsgroup)
[10:29:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21354 and previous config saved to /var/cache/conftool/dbconfig/20220223-102919-ladsgroup.json
[10:29:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[10:29:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[10:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:26] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[10:29:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:36] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar)
[10:31:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[10:31:58] <kormat>	 !log running schema change against s3 T300774
[10:32:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[10:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T300992)', diff saved to https://phabricator.wikimedia.org/P21355 and previous config saved to /var/cache/conftool/dbconfig/20220223-103204-ladsgroup.json
[10:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:08] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[10:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:15] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[10:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:16] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[10:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:25] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is CRITICAL: 13 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[10:32:59] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[10:33:13] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[10:38:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[10:38:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:45] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:45:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:45:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:38] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[10:46:40] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[10:46:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:44] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T300774)', diff saved to https://phabricator.wikimedia.org/P21356 and previous config saved to /var/cache/conftool/dbconfig/20220223-104644-kormat.json
[10:46:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:51] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[10:46:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:47:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300992)', diff saved to https://phabricator.wikimedia.org/P21357 and previous config saved to /var/cache/conftool/dbconfig/20220223-104704-ladsgroup.json
[10:47:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:17] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[10:48:31] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:49:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[10:49:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:15] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[10:51:51] <wikibugs>	 (03PS1) 10Ayounsi: Export POPs aggregates and private prefixes over BGP [homer/public] - 10https://gerrit.wikimedia.org/r/765240
[10:56:44] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[11:02:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P21358 and previous config saved to /var/cache/conftool/dbconfig/20220223-110209-ladsgroup.json
[11:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:05:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21359 and previous config saved to /var/cache/conftool/dbconfig/20220223-110540-ladsgroup.json
[11:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:49] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[11:06:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[11:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[11:07:02] <wikibugs>	 (03CR) 10Volans: "I think at this point this could be squashed with the other CR that introduces reposync." [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 (owner: 10Jbond)
[11:07:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[11:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[11:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[11:12:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P21360 and previous config saved to /var/cache/conftool/dbconfig/20220223-111714-ladsgroup.json
[11:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1181.eqiad.wmnet with OS bullseye
[11:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:30] <wikibugs>	 (PS1) Elukey: ml-services: deprecate the revscoring-editquality ns [deployment-charts] - https://gerrit.wikimedia.org/r/765242 (https://phabricator.wikimedia.org/T301415)
[11:22:28] <wikibugs>	 (PS1) Elukey: Remove references to revscoring-editquality [labs/private] - https://gerrit.wikimedia.org/r/765243 (https://phabricator.wikimedia.org/T301415)
[11:23:59] <wikibugs>	 (PS1) Elukey: Remove references of revscoring-editquality for ml-serve [puppet] - https://gerrit.wikimedia.org/r/765244 (https://phabricator.wikimedia.org/T301415)
[11:25:22] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300774)', diff saved to https://phabricator.wikimedia.org/P21361 and previous config saved to /var/cache/conftool/dbconfig/20220223-112522-kormat.json
[11:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:29] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[11:25:57] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: deprecate the revscoring-editquality ns [deployment-charts] - https://gerrit.wikimedia.org/r/765242 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[11:26:04] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: deprecate the revscoring-editquality ns [deployment-charts] - https://gerrit.wikimedia.org/r/765242 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[11:26:17] <wikibugs>	 (CR) Klausman: [C: +1] Remove references to revscoring-editquality [labs/private] - https://gerrit.wikimedia.org/r/765243 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[11:26:26] <wikibugs>	 (CR) Klausman: [C: +1] Remove references of revscoring-editquality for ml-serve [puppet] - https://gerrit.wikimedia.org/r/765244 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[11:26:56] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:28:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:28:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:28:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[11:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[11:28:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: host reimage
[11:29:12] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] Remove references to revscoring-editquality [labs/private] - https://gerrit.wikimedia.org/r/765243 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[11:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:26] <wikibugs>	 (CR) Elukey: [C: +2] Remove references of revscoring-editquality for ml-serve [puppet] - https://gerrit.wikimedia.org/r/765244 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[11:32:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300992)', diff saved to https://phabricator.wikimedia.org/P21362 and previous config saved to /var/cache/conftool/dbconfig/20220223-113219-ladsgroup.json
[11:32:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[11:32:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[11:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:25] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[11:32:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21363 and previous config saved to /var/cache/conftool/dbconfig/20220223-113226-ladsgroup.json
[11:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: host reimage
[11:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:04] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[11:34:10] <wikibugs>	 (CR) JMeybohm: [C: +2] Enable ingress and cert-manager in wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/764723 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[11:35:44] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:36:54] <wikibugs>	 (PS1) Aklapper: MFA Phab accounts email: Fix incorrect SQL query; misc improvements [puppet] - https://gerrit.wikimedia.org/r/765245 (https://phabricator.wikimedia.org/T302385)
[11:37:41] <wikibugs>	 (Merged) jenkins-bot: Enable ingress and cert-manager in wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/764723 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[11:40:27] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P21364 and previous config saved to /var/cache/conftool/dbconfig/20220223-114026-kormat.json
[11:40:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:26] <wikibugs>	 (PS7) Jbond: spicerack: switch to push model [software/spicerack] - https://gerrit.wikimedia.org/r/764782
[11:41:44] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 73 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:42:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:42:16] <wikibugs>	 (PS38) Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397)
[11:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[11:42:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[11:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:32] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:48:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1181.eqiad.wmnet with OS bullseye
[11:48:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:28] <wikibugs>	 (Abandoned) Jbond: spicerack: switch to push model [software/spicerack] - https://gerrit.wikimedia.org/r/764782 (owner: Jbond)
[11:52:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21365 and previous config saved to /var/cache/conftool/dbconfig/20220223-115233-ladsgroup.json
[11:52:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:40] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[11:53:52] <wikibugs>	 (CR) JMeybohm: [C: +2] miscweb: Enable ingress for all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/764749 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[11:55:32] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P21366 and previous config saved to /var/cache/conftool/dbconfig/20220223-115531-kormat.json
[11:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:37] <wikibugs>	 (Merged) jenkins-bot: miscweb: Enable ingress for all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/764749 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[12:02:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[12:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:03] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[12:04:57] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[12:05:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:07:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21367 and previous config saved to /var/cache/conftool/dbconfig/20220223-120738-ladsgroup.json
[12:07:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[12:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:13] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[12:08:58] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[12:09:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:36] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300774)', diff saved to https://phabricator.wikimedia.org/P21368 and previous config saved to /var/cache/conftool/dbconfig/20220223-121036-kormat.json
[12:10:38] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:10:39] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:44] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[12:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:53] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[12:12:11] <wikibugs>	 (PS1) Jbond: C:package_builder: install tools to build node packages [puppet] - https://gerrit.wikimedia.org/r/765250
[12:12:55] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[12:14:16] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33949/console" [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond)
[12:14:22] <wikibugs>	 (CR) Muehlenhoff: "node-babel7 is only in bullseye, this will need an os_release condition since deneb is still around for a few weeks." [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond)
[12:15:41] <wikibugs>	 (CR) Jbond: "The errors you see in CI are due to the fact that our puppet-lint plug-in expects this define to exists in all roles.  We would first need" [puppet] - https://gerrit.wikimedia.org/r/764884 (owner: JHathaway)
[12:15:59] <wikibugs>	 (CR) Muehlenhoff: C:package_builder: install tools to build node packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond)
[12:17:30] <wikibugs>	 (CR) Jbond: "further i think we could maybe add this functionality to  profile::base with and use the $::_role variable" [puppet] - https://gerrit.wikimedia.org/r/764884 (owner: JHathaway)
[12:20:24] <wikibugs>	 (PS1) Vgutierrez: aptrepo: Add thirdparty/haproxy24 component [puppet] - https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005)
[12:21:47] <wikibugs>	 (CR) Muehlenhoff: C:package_builder: install tools to build node packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond)
[12:22:25] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[12:22:37] <wikibugs>	 (CR) Muehlenhoff: aptrepo: Add thirdparty/haproxy24 component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[12:22:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21369 and previous config saved to /var/cache/conftool/dbconfig/20220223-122242-ladsgroup.json
[12:22:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:16] <wikibugs>	 (PS2) Vgutierrez: aptrepo: Add thirdparty/haproxy24 component [puppet] - https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005)
[12:24:26] <wikibugs>	 (CR) Vgutierrez: aptrepo: Add thirdparty/haproxy24 component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[12:24:43] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[12:24:45] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[12:24:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:50] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T300774)', diff saved to https://phabricator.wikimedia.org/P21370 and previous config saved to /var/cache/conftool/dbconfig/20220223-122449-kormat.json
[12:24:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:57] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[12:25:01] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[12:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:13] <wikibugs>	 (PS1) Ladsgroup: Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/765210
[12:26:00] <wikibugs>	 (PS2) Ladsgroup: Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/765210
[12:26:04] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/765210 (owner: Ladsgroup)
[12:26:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[12:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:25] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:27] <wikibugs>	 (PS1) Kevin Bazira: ml-services: add hrwiki, huwiki, idwiki & iswiki editquality isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/765254 (https://phabricator.wikimedia.org/T301415)
[12:27:46] <wikibugs>	 (PS1) Ladsgroup: db1174: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/765255 (https://phabricator.wikimedia.org/T302363)
[12:28:39] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] db1174: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/765255 (https://phabricator.wikimedia.org/T302363) (owner: Ladsgroup)
[12:30:17] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300774)', diff saved to https://phabricator.wikimedia.org/P21372 and previous config saved to /var/cache/conftool/dbconfig/20220223-123017-kormat.json
[12:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:23] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[12:32:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21373 and previous config saved to /var/cache/conftool/dbconfig/20220223-123246-ladsgroup.json
[12:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:53] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[12:34:49] <wikibugs>	 (PS1) Jbond: P:base::production: move system::role to profile::base::production [puppet] - https://gerrit.wikimedia.org/r/765257
[12:35:08] <wikibugs>	 (CR) Muehlenhoff: aptrepo: Add thirdparty/haproxy24 component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[12:35:55] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::production: move system::role to profile::base::production [puppet] - https://gerrit.wikimedia.org/r/765257 (owner: Jbond)
[12:36:26] <wikibugs>	 (CR) Jbond: [V: +1] "thanks see inline" [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond)
[12:37:18] <wikibugs>	 (PS2) Jbond: P:base::production: move system::role to profile::base::production [puppet] - https://gerrit.wikimedia.org/r/765257
[12:37:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21374 and previous config saved to /var/cache/conftool/dbconfig/20220223-123747-ladsgroup.json
[12:37:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:54] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[12:38:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::production: move system::role to profile::base::production [puppet] - https://gerrit.wikimedia.org/r/765257 (owner: Jbond)
[12:38:05] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33951/console" [puppet] - https://gerrit.wikimedia.org/r/765257 (owner: Jbond)
[12:40:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[12:40:22] <hashar>	 I am switching the operations-puppet-tests-buster-docker Jenkins job to a new instance (Stretch > Bullseye)
[12:40:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[12:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21375 and previous config saved to /var/cache/conftool/dbconfig/20220223-124027-ladsgroup.json
[12:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:34] <hashar>	 which in practice should be almost a noop since everything runs inside a Docker container
[12:40:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:53] <wikibugs>	 (PS4) JMeybohm: Add k8s-ingress-wikikube discovery record [dns] - https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966)
[12:40:58] <wikibugs>	 (PS3) Jbond: P:base::production: move system::role to profile::base::production [puppet] - https://gerrit.wikimedia.org/r/765257
[12:41:14] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond)
[12:41:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::production: move system::role to profile::base::production [puppet] - https://gerrit.wikimedia.org/r/765257 (owner: Jbond)
[12:44:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1174.eqiad.wmnet with OS bullseye
[12:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:22] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P21376 and previous config saved to /var/cache/conftool/dbconfig/20220223-124521-kormat.json
[12:45:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:47] <wikibugs>	 (PS1) Elukey: kserve-inference: dry model config for revscoring_inference_services [deployment-charts] - https://gerrit.wikimedia.org/r/765260 (https://phabricator.wikimedia.org/T301415)
[12:45:55] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21377 and previous config saved to /var/cache/conftool/dbconfig/20220223-124751-ladsgroup.json
[12:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:32] <wikibugs>	 (CR) Kevin Bazira: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/765260 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[12:55:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1174.eqiad.wmnet with reason: host reimage
[12:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:11] <wikibugs>	 (PS4) Jbond: P:base::production: move system::role to profile::base::production [puppet] - https://gerrit.wikimedia.org/r/765257
[12:59:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1174.eqiad.wmnet with reason: host reimage
[12:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:08] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33954/console" [puppet] - https://gerrit.wikimedia.org/r/765257 (owner: Jbond)
[13:00:27] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P21378 and previous config saved to /var/cache/conftool/dbconfig/20220223-130026-kormat.json
[13:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:51] <wikibugs>	 (CR) Elukey: [C: +2] kserve-inference: dry model config for revscoring_inference_services [deployment-charts] - https://gerrit.wikimedia.org/r/765260 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[13:02:23] <wikibugs>	 (CR) Hashar: ci: Qemu image and snapshot creation (6 comments) [puppet] - https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: Hashar)
[13:02:31] <wikibugs>	 (PS18) Hashar: ci: Qemu image and snapshot creation [puppet] - https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774)
[13:02:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21379 and previous config saved to /var/cache/conftool/dbconfig/20220223-130255-ladsgroup.json
[13:03:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:07] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcontrol1004, ms-be2066, cloudcontrol1003, ms-be2068, cloudcontrol1005 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[13:04:35] <wikibugs>	 (CR) Hashar: "I have cherry picked PS18 on integration-puppetmaster-02 . On integration-agent-qemu-1003 I have deleted /srv/vm-images/*qcow2 and I am ru" [puppet] - https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: Hashar)
[13:06:54] <wikibugs>	 (CR) JMeybohm: [C: +2] Add LVS servie k8s-ingress-wikikube [puppet] - https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[13:09:37] <wikibugs>	 (PS1) Elukey: kserve-inference: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/765264
[13:11:18] <wikibugs>	 (PS5) Jbond: P:base::production: move system::role to profile::base::production [puppet] - https://gerrit.wikimedia.org/r/765257
[13:11:20] <wikibugs>	 (PS1) Jbond: motd::message: add new define for simple motd entries [puppet] - https://gerrit.wikimedia.org/r/765265
[13:12:02] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33955/console" [puppet] - https://gerrit.wikimedia.org/r/765257 (owner: Jbond)
[13:14:22] <wikibugs>	 (CR) Elukey: [C: +2] kserve-inference: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/765264 (owner: Elukey)
[13:14:24] <wikibugs>	 (CR) Jbond: Rename system::role to base::add_motd_role (2 comments) [puppet] - https://gerrit.wikimedia.org/r/764884 (owner: JHathaway)
[13:14:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1174.eqiad.wmnet with OS bullseye
[13:14:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:47] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] C:package_builder: install tools to build node packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond)
[13:15:07] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (Volans)
[13:15:31] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300774)', diff saved to https://phabricator.wikimedia.org/P21380 and previous config saved to /var/cache/conftool/dbconfig/20220223-131531-kormat.json
[13:15:33] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:15:34] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:15:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:37] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[13:15:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21381 and previous config saved to /var/cache/conftool/dbconfig/20220223-131801-ladsgroup.json
[13:18:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:18:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:18:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:08] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[13:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:22] <wikibugs>	 (PS1) Bartosz Dziewoński: Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/765211 (https://phabricator.wikimedia.org/T302388)
[13:18:30] <wikibugs>	 (PS1) Bartosz Dziewoński: Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/765212 (https://phabricator.wikimedia.org/T302388)
[13:19:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Make ganeti2029/ganeti2030 Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/765201 (https://phabricator.wikimedia.org/T298998) (owner: Muehlenhoff)
[13:19:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:27] <wikibugs>	 (PS1) Jbond: C:package_builder: only install node-babel7 on bullseye [puppet] - https://gerrit.wikimedia.org/r/765267
[13:23:33] <Krinkle>	 !log debugging on mwdebug1002
[13:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:37] <Krinkle>	 err. didn't mean to log
[13:23:41] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33959/console" [puppet] - https://gerrit.wikimedia.org/r/765267 (owner: Jbond)
[13:23:43] <wikibugs>	 (CR) Jbond: "FYI there were dependency issues on buster so I have moved to bullseye and will build on build2001.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/765267 (owner: Jbond)
[13:23:52] <wikibugs>	 (CR) Jbond: [C: +2] C:package_builder: only install node-babel7 on bullseye [puppet] - https://gerrit.wikimedia.org/r/765267 (owner: Jbond)
[13:25:16] <wikibugs>	 (PS2) Kevin Bazira: ml-services: add hrwiki, huwiki, idwiki & iswiki editquality isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/765254 (https://phabricator.wikimedia.org/T301415)
[13:29:35] <wikibugs>	 (CR) Phuedx: [C: +1] Update Event Stream for IPInfo events (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[13:30:21] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[13:30:22] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[13:30:23] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:27] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:32] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T300774)', diff saved to https://phabricator.wikimedia.org/P21383 and previous config saved to /var/cache/conftool/dbconfig/20220223-133031-kormat.json
[13:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:45] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[13:30:49] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: add hrwiki, huwiki, idwiki & iswiki editquality isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/765254 (https://phabricator.wikimedia.org/T301415) (owner: Kevin Bazira)
[13:32:00] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add hrwiki, huwiki, idwiki & iswiki editquality isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/765254 (https://phabricator.wikimedia.org/T301415) (owner: Kevin Bazira)
[13:32:51] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:54] <wikibugs>	 (CR) Hashar: [C: +1] "Tested and it works. I have confirmed the CI job works with the new image as well :)" [puppet] - https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: Hashar)
[13:35:59] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300774)', diff saved to https://phabricator.wikimedia.org/P21384 and previous config saved to /var/cache/conftool/dbconfig/20220223-133559-kormat.json
[13:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:05] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[13:36:38] <Lucas_WMDE>	 jouncebot: nowandnext
[13:36:38] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[13:36:38] <jouncebot>	 In 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T1400)
[13:37:39] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:37:42] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:25] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21385 and previous config saved to /var/cache/conftool/dbconfig/20220223-133858-ladsgroup.json
[13:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:06] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[13:39:40] <mmandere>	 !log import libvmod-netmapper_1.9-1.dsc and libvmod-netmapper_1.9-1_amd64.deb to main component - T302301
[13:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:46] <stashbot>	 T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301
[13:41:24] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:41:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:57] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:12] <wikibugs>	 (PS1) Ayounsi: Prepend AS to anycast prefixes learned on the core routers [homer/public] - https://gerrit.wikimedia.org/r/765268 (https://phabricator.wikimedia.org/T302315)
[13:45:20] <Lucas_WMDE>	 !log Deployed patch for T302192
[13:45:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:41] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:45:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:03] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:46:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:01] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:04] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P21386 and previous config saved to /var/cache/conftool/dbconfig/20220223-135103-kormat.json
[13:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:40] <mmandere>	 !log import libvmod-re2_1.5.3-1.dsc and libvmod-re2_1.5.3-1_amd64.deb to main component - T302301
[13:52:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:46] <stashbot>	 T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301
[13:54:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21387 and previous config saved to /var/cache/conftool/dbconfig/20220223-135404-ladsgroup.json
[13:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:55:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:56:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:56:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T1400).
[14:00:04] <jouncebot>	 MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:08] <taavi>	 o/
[14:00:10] <Lucas_WMDE>	 o/
[14:00:16] <urbanecm>	 I can deploy today!
[14:00:20] <MatmaRex>	 hi
[14:00:29] <urbanecm>	 hi MatmaRex
[14:00:42] <wikibugs>	 (CR) Urbanecm: [C: +2] Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/765211 (https://phabricator.wikimedia.org/T302388) (owner: Bartosz Dziewoński)
[14:00:44] <wikibugs>	 (CR) Urbanecm: [C: +2] Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/765212 (https://phabricator.wikimedia.org/T302388) (owner: Bartosz Dziewoński)
[14:02:01] <MatmaRex>	 i might also want to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/765213
[14:03:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (MatthewVernon)
[14:03:22] <MatmaRex>	 actually, i think i don't want to, until someone reviews it
[14:03:42] <MatmaRex>	 if i do just a revert, then it has localisation changes, which are annoying to backport (right?)
[14:03:48] <urbanecm>	 indeed
[14:03:48] <MatmaRex>	 and if i make other changes, then i'd prefer a review
[14:04:07] <wikibugs>	 (PS2) JMeybohm: Move k8s-ingress-wikikube to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/764734 (https://phabricator.wikimedia.org/T290966)
[14:04:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (MatthewVernon)
[14:04:49] <wikibugs>	 (Merged) jenkins-bot: Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/765211 (https://phabricator.wikimedia.org/T302388) (owner: Bartosz Dziewoński)
[14:04:52] <urbanecm>	 but we can do i18n changes too if the reason for the revert is an urgent problem :)
[14:05:00] <wikibugs>	 (CR) JMeybohm: [C: +2] Move k8s-ingress-wikikube to state: lvs_setup [puppet] - https://gerrit.wikimedia.org/r/764734 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[14:05:06] <mmandere>	 !log import varnish_6.0.10-1wm1.dsc, varnish_6.0.10-1wm1_amd64.deb, varnish-dbg_6.0.6-1wm1_amd64.deb, varnish-dbgsym_6.0.10-1wm1_amd64.deb, varnish-doc_6.0.10-1wm1_all.deb to main component - T302301
[14:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:12] <MatmaRex>	 i mean, it's Special:ApiSandbox
[14:05:13] <stashbot>	 T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301
[14:05:21] <MatmaRex>	 so probably not that urgent
[14:05:24] <wikibugs>	 (Merged) jenkins-bot: Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/765212 (https://phabricator.wikimedia.org/T302388) (owner: Bartosz Dziewoński)
[14:05:26] <wikibugs>	 (PS2) JMeybohm: Move k8s-ingress-wikikube to state: monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/764735 (https://phabricator.wikimedia.org/T290966)
[14:05:30] <urbanecm>	 your call :)
[14:06:07] <urbanecm>	 MatmaRex: both backports are at mwdebug1001 now, can you test?
[14:06:09] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P21388 and previous config saved to /var/cache/conftool/dbconfig/20220223-140608-kormat.json
[14:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:22] <MatmaRex>	 yeah. looking
[14:06:35] <wikibugs>	 (PS1) Ssingh: test_dns: update EDNS client subnet test for IPv6 [software/knead-wikidough] - https://gerrit.wikimedia.org/r/765270
[14:08:21] <jayme>	 !log restarting pybal on lvs1020,lvs2010 - T290966
[14:08:21] <MatmaRex>	 urbanecm: seems good
[14:08:25] <wikibugs>	 (CR) Ssingh: [C: +2] test_dns: update EDNS client subnet test for IPv6 [software/knead-wikidough] - https://gerrit.wikimedia.org/r/765270 (owner: Ssingh)
[14:08:25] <urbanecm>	 syncing
[14:08:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:26] <stashbot>	 T290966: Implement POC for istio ingress - https://phabricator.wikimedia.org/T290966
[14:09:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21389 and previous config saved to /var/cache/conftool/dbconfig/20220223-140908-ladsgroup.json
[14:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:21] <taavi>	 urbanecm: ping me when done please?
[14:09:26] <urbanecm>	 sure thing
[14:09:46] <urbanecm>	 unless MatmaRex wants me to deploy anything else, should be just two syncs
[14:10:01] <MatmaRex>	 that is all
[14:10:07] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.22/extensions/DiscussionTools/includes/Hooks/HookUtils.php: 815b3d1: Fix check for enabling features on mobile (T302388) (duration: 00m 50s)
[14:10:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:12] <stashbot>	 T302388: Discussion Tools features are unexpectedly enabled on mobile ([reply] links, "Add discussion" button, [subscribe] links) - https://phabricator.wikimedia.org/T302388
[14:11:20] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/DiscussionTools/includes/Hooks/HookUtils.php: 78f0d9d: Fix check for enabling features on mobile (T302388) (duration: 00m 49s)
[14:11:24] <urbanecm>	 MatmaRex: should be live
[14:11:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:30] <urbanecm>	 taavi: the floor is yours
[14:11:36] <taavi>	 thanks
[14:11:50] <MatmaRex>	 thanks
[14:11:52] <taavi>	 deploying the updated patch for https://phabricator.wikimedia.org/T302248
[14:11:52] <mmandere>	 !log import libvarnishapi2_6.0.10-1wm1_amd64.deb, libvarnishapi2-dbgsym_6.0.10-1wm1_amd64.deb, libvarnishapi-dev_6.0.10-1wm1_amd64.deb to main component - T302301
[14:11:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:58] <stashbot>	 T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301
[14:12:27] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.70:30443]) https://wikitech.wikimedia.org/wiki/PyBal
[14:12:45] <jayme>	 !log restarting pybal on lvs1019,lvs2009 - T290966
[14:12:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:04] <wikibugs>	 (PS3) Vgutierrez: aptrepo: Add thirdparty/haproxy24 component [puppet] - https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005)
[14:13:14] <wikibugs>	 (CR) Vgutierrez: aptrepo: Add thirdparty/haproxy24 component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[14:13:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:13:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:39] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:14:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[14:14:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:37] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:47] <wikibugs>	 (CR) JMeybohm: [C: +2] Move k8s-ingress-wikikube to state: monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/764735 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[14:16:05] <wikibugs>	 (PS2) JMeybohm: Move k8s-ingress-wikikube to state: production [puppet] - https://gerrit.wikimedia.org/r/764736 (https://phabricator.wikimedia.org/T290966)
[14:16:12] <taavi>	 syncing
[14:17:35] <taavi>	 !log deploy second patch for T302248
[14:17:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:48] <taavi>	 anyone have anything else to deploy?
[14:18:04] <urbanecm>	 i don't think so
[14:18:22] <taavi>	 !log UTC afternoon deploys done
[14:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:51] <mmandere>	 !log import varnish-modules_0.15.0-1+wmf1.dsc, varnish-modules-dbgsym_0.15.0-1+wmf1_amd64.deb, varnish-modules_0.15.0-1+wmf1_amd64.deb to main component - T302301
[14:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:57] <stashbot>	 T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301
[14:19:25] <Krinkle>	 me done testing on mwdebug1002
[14:19:38] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1031.eqiad.wmnet with OS buster
[14:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase, Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin...
[14:19:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[14:19:55] <wikibugs>	 (CR) David Caro: [C: +2] discovery_dashboards: remove unused profiles/roles [puppet] - https://gerrit.wikimedia.org/r/763792 (https://phabricator.wikimedia.org/T227782) (owner: Bearloga)
[14:20:02] * urbanecm didn't know Krinkle was testing. Hopefully the B&C deploys didn't interfere (much) :)
[14:20:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:14] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300774)', diff saved to https://phabricator.wikimedia.org/P21390 and previous config saved to /var/cache/conftool/dbconfig/20220223-142113-kormat.json
[14:21:15] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[14:21:17] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[14:21:18] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[14:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:19] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[14:21:21] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T300774)', diff saved to https://phabricator.wikimedia.org/P21391 and previous config saved to /var/cache/conftool/dbconfig/20220223-142121-kormat.json
[14:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:24] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro) Open→Resolved a:dcaro I think this is ready to be closed! \o/ There's some related patches pending, but those are not directly these anymore.
[14:22:26] <wikibugs>	 Puppet, Infrastructure-Foundations, User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (dcaro)
[14:22:28] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (Papaul) @fgiunchedi thanks will check and see why the drive is missing.
[14:24:09] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns5001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:24:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21392 and previous config saved to /var/cache/conftool/dbconfig/20220223-142413-ladsgroup.json
[14:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:19] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[14:25:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:25:06] <wikibugs>	 (CR) JMeybohm: [C: +2] Move k8s-ingress-wikikube to state: production [puppet] - https://gerrit.wikimedia.org/r/764736 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[14:25:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:25:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:48] <mmandere>	 !log import varnishkafka_1.1.0-1_amd64.deb, varnishkafka_1.1.0-1.dsc, varnishkafka-dbg_1.1.0-1_amd64.deb to main component - T302301
[14:26:54] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300774)', diff saved to https://phabricator.wikimedia.org/P21393 and previous config saved to /var/cache/conftool/dbconfig/20220223-142652-kormat.json
[14:26:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:55] <stashbot>	 T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301
[14:26:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:27:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:03] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[14:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:16] <mmandere>	  !log import varnishkafka_1.1.0-1_amd64.deb, varnishkafka_1.1.0-1.dsc, varnishkafka-dbg_1.1.0-1_amd64.deb to main component - T300164
[14:29:16] <stashbot>	 T300164: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164
[14:29:51] <icinga-wm>	 PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100%
[14:30:46] <wikibugs>	 SRE, Infrastructure-Foundations, netops, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) > As far as this task goes to me it still remains a mystery why it looks l...
[14:31:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:59] <icinga-wm>	 RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[14:33:03] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on authdns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:33:05] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns5002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:33:11] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:33:13] <jayme>	 that's me
[14:33:50] <wikibugs>	 (PS1) MVernon: admin: add mhay, krb & analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/765272 (https://phabricator.wikimedia.org/T301782)
[14:34:22] <wikibugs>	 (PS1) JMeybohm: Add k8s-ingress-wikikube to conftool-data [puppet] - https://gerrit.wikimedia.org/r/765273 (https://phabricator.wikimedia.org/T300740)
[14:34:39] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[14:35:39] <wikibugs>	 (CR) JMeybohm: [C: +2] Add k8s-ingress-wikikube to conftool-data [puppet] - https://gerrit.wikimedia.org/r/765273 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm)
[14:36:33] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:36:33] <icinga-wm>	 PROBLEM - Host ms-be2068 is DOWN: PING CRITICAL - Packet loss = 100%
[14:36:37] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns5002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:36:45] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-wikikube
[14:36:45] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:36:49] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns5001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:02] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add k8s-ingress-wikikube discovery record [dns] - 10https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm)
[14:38:11] <icinga-wm>	 PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100%
[14:39:09] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1031.eqiad.wmnet with reason: host reimage
[14:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:45] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1032.eqiad.wmnet with OS buster
[14:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin...
[14:40:39] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1033.eqiad.wmnet with OS buster
[14:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin...
[14:41:15] <icinga-wm>	 RECOVERY - Host ms-be2068 is UP: PING WARNING - Packet loss = 33%, RTA = 33.56 ms
[14:41:35] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] admin: add mhay, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765272 (https://phabricator.wikimedia.org/T301782) (owner: 10MVernon)
[14:41:58] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P21394 and previous config saved to /var/cache/conftool/dbconfig/20220223-144158-kormat.json
[14:42:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:31] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1031.eqiad.wmnet with reason: host reimage
[14:42:33] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add k8s-ingress-wikikube to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/764739 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm)
[14:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:26] <icinga-wm>	 RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms
[14:46:36] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:15] <papaul>	 !log power down ms-be2068 for re-image
[14:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:33] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] admin: add mhay, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765272 (https://phabricator.wikimedia.org/T301782) (owner: 10MVernon)
[14:48:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS stretch
[14:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:46] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch
[14:50:52] <wikibugs>	 (03CR) 10MMandere: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33960/console" [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere)
[14:53:58] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10SRE Observability: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 (10jhathaway) @fgiunchedi very sorry about the breakage, I wish I would have caught that in the review.
[14:55:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[14:56:12] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] aptrepo: Add thirdparty/haproxy24 component [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[14:56:32] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage
[14:56:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:03] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P21395 and previous config saved to /var/cache/conftool/dbconfig/20220223-145703-kormat.json
[14:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:16] <wikibugs>	 (03PS1) 10MVernon: admin: add skyenet, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765280 (https://phabricator.wikimedia.org/T301581)
[14:58:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "it looks good" [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere)
[14:59:42] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1031.eqiad.wmnet with OS buster
[14:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001...
[14:59:59] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage
[15:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:47] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10MatthewVernon) 05In progress→03Resolved a:03MatthewVernon Done.
[15:03:48] <moritzm>	 !log installing expat security updates
[15:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[15:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:07:14] <dcausse>	 this is me testing ^ (wdqs@codfw is depooled)
[15:07:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[15:07:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:08] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300774)', diff saved to https://phabricator.wikimedia.org/P21396 and previous config saved to /var/cache/conftool/dbconfig/20220223-151207-kormat.json
[15:12:09] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1033.eqiad.wmnet with OS buster
[15:12:12] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[15:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:13] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[15:12:14] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance
[15:12:15] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[15:12:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:20] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance
[15:12:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001...
[15:12:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:15:11] <wikibugs>	 10SRE, 10Discovery, 10Infrastructure-Foundations, 10netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) Per Cathal's feedback above, we are closing this ticket as he correctly stated "it represents significant risk for what seems to be scant benefit....
[15:15:48] <wikibugs>	 10SRE, 10Discovery, 10Infrastructure-Foundations, 10netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) 05Open→03Resolved
[15:17:49] <moritzm>	 !log rolling restart of FPM and Apache on mediawiki canaries to pick up expat security updates
[15:17:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:19:40] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1032.eqiad.wmnet with OS buster
[15:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001...
[15:21:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:23:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:26:06] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) epic task! kudos for finishing it
[15:26:37] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS stretch
[15:26:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:42] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch executed with errors: - m...
[15:28:19] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:28:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:29:01] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:29:46] <wikibugs>	 (03CR) 10Bearloga: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/764318 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro)
[15:30:39] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[15:30:40] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[15:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:45] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T300774)', diff saved to https://phabricator.wikimedia.org/P21397 and previous config saved to /var/cache/conftool/dbconfig/20220223-153044-kormat.json
[15:30:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgrm" [puppet] - 10https://gerrit.wikimedia.org/r/765280 (https://phabricator.wikimedia.org/T301581) (owner: 10MVernon)
[15:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:53] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[15:30:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765280 (https://phabricator.wikimedia.org/T301581) (owner: 10MVernon)
[15:31:32] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] admin: add skyenet, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765280 (https://phabricator.wikimedia.org/T301581) (owner: 10MVernon)
[15:33:49] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:36:12] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300774)', diff saved to https://phabricator.wikimedia.org/P21398 and previous config saved to /var/cache/conftool/dbconfig/20220223-153611-kormat.json
[15:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:18] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[15:36:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS stretch
[15:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:45] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch
[15:38:17] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10MatthewVernon) 05In progress→03Resolved a:03MatthewVernon Done.
[15:42:25] <wikibugs>	 (03CR) 10Jbond: "lgtm see nits" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar)
[15:43:19] <wikibugs>	 (03PS2) 10Jbond: O:netbox::standalone: remove netboxdb2001 as replica [puppet] - 10https://gerrit.wikimedia.org/r/764438
[15:43:19] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[15:44:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:netbox::standalone: remove netboxdb2001 as replica [puppet] - 10https://gerrit.wikimedia.org/r/764438 (owner: 10Jbond)
[15:49:45] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:51:16] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P21399 and previous config saved to /var/cache/conftool/dbconfig/20220223-155116-kormat.json
[15:51:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[15:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:18] <wikibugs>	 (03CR) 10JHathaway: Rename system::role to base::add_motd_role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway)
[15:55:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[15:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:32] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2001 is CRITICAL: 6.988e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:56:58] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2002 is CRITICAL: 6.91e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:57:32] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2003 is CRITICAL: 6.88e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:00:06] <vgutierrez>	 !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia - T290005
[16:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:12] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[16:01:32] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2007 is CRITICAL: 6.368e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:01:52] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10SRE Observability: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 (10fgiunchedi) No worries @jhathaway ! It was a combination of factors that meant deployment would fail silently too :( i.e. no puppe...
[16:03:12] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 6.196e+07 ge 3.6e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:04:12] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:04:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:04:44] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:35] <wikibugs>	 (03PS1) 10Vgutierrez: cache::haproxy: Use HAProxy 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005)
[16:05:44] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 26 Apr 2022 08:09:10 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:54] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 26 Apr 2022 08:09:10 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:06:21] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P21400 and previous config saved to /var/cache/conftool/dbconfig/20220223-160621-kormat.json
[16:06:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:34] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:08:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[16:09:55] <wikibugs>	 (03PS2) 10Vgutierrez: cache::haproxy: Use HAProxy 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005)
[16:13:08] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33963/console" [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[16:14:32] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2008 is CRITICAL: 4.911e+07 ge 3.6e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:21:26] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300774)', diff saved to https://phabricator.wikimedia.org/P21401 and previous config saved to /var/cache/conftool/dbconfig/20220223-162125-kormat.json
[16:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:32] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[16:23:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2068.codfw.wmnet with OS stretch
[16:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:05] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch completed: - ms-be2068 (*...
[16:25:45] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2006 is CRITICAL: 3.186e+07 ge 3.6e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:27:51] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2001 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 1.999e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:30:18] <wikibugs>	 (03CR) 10Hnowlan: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)
[16:30:35] <icinga-wm>	 PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:12] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765221
[16:31:19] <wikibugs>	 (03PS2) 10Ladsgroup: Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765221
[16:31:43] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2003 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.094e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:32:28] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765221 (owner: 10Ladsgroup)
[16:32:33] <wikibugs>	 (03CR) 10JHathaway: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)
[16:34:02] <wikibugs>	 10SRE, 10Gerrit, 10serviceops: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10hashar) `gerrit2001.wikimedia.org` is a replica and can also be used as a spare to switch the primary service.  It also serves repos over `gerrit-replica.wikimedia.org` which is used by various scripts an...
[16:34:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10BBlack) >>! In T302265#7731305, @fgiunchedi wrote:  > The current pings from promet...
[16:35:39] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2002 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 4.793e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:36:20] <wikibugs>	 (03PS1) 10Ladsgroup: db1127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765308 (https://phabricator.wikimedia.org/T302363)
[16:38:22] <wikibugs>	 (03CR) 10Hnowlan: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)
[16:38:47] <wikibugs>	 (03CR) 10Jbond: "see comments inline, have also added Moritz who may have a view.  Also regardless of the inline comments im also happy to go with just the" [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway)
[16:39:02] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765308 (https://phabricator.wikimedia.org/T302363) (owner: 10Ladsgroup)
[16:41:19] <icinga-wm>	 RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms
[16:42:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS stretch
[16:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:14] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch
[16:43:19] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2008 is OK: (C)3.6e+06 ge (W)1.2e+06 ge 7.68e+05 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:43:49] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!  In terms of eventual re-use I replicated this (in templates/asw/policy-options.conf) for the LSWs as they need it, but I didn't wan" [homer/public] - 10https://gerrit.wikimedia.org/r/765268 (https://phabricator.wikimedia.org/T302315) (owner: 10Ayounsi)
[16:44:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[16:44:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[16:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21403 and previous config saved to /var/cache/conftool/dbconfig/20220223-164453-ladsgroup.json
[16:44:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:02] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[16:45:28] <wikibugs>	 (03CR) 10Jbond: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)
[16:46:41] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3.6e+06 ge (W)1.2e+06 ge 1.046e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:47:31] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2006 is OK: (C)3.6e+06 ge (W)1.2e+06 ge 2.43e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:48:18] <wikibugs>	 (03CR) 10Majavah: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)
[16:48:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1127.eqiad.wmnet with OS bullseye
[16:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF)
[16:49:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF)
[16:50:25] <wikibugs>	 (03CR) 10JHathaway: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)
[16:50:41] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:54:17] <wikibugs>	 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Jdforrester-WMF)
[16:55:07] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309
[16:55:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF)
[16:56:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF)
[16:57:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 (owner: 10Muehlenhoff)
[16:58:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1127.eqiad.wmnet with reason: host reimage
[16:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:37] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2007 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.437e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:00:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1127.eqiad.wmnet with reason: host reimage
[17:00:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:08] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:06:17] <icinga-wm>	 PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:14:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1127.eqiad.wmnet with OS bullseye
[17:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:41] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2066.codfw.wmnet with OS stretch
[17:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:46] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors: - m...
[17:19:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:20:33] <wikibugs>	 (03PS1) 10BBlack: eqiad lvs: add interfaces and IPs for rows E and F [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419)
[17:21:42] <wikibugs>	 (03CR) 10BBlack: "Note, I've already reserved .17-.20 in all 8 of the vlans in netbox, too.  Seemed the simplest scheme for now, given there's already a few" [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack)
[17:21:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS stretch
[17:21:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:57] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch
[17:22:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21404 and previous config saved to /var/cache/conftool/dbconfig/20220223-172206-ladsgroup.json
[17:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:13] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[17:23:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[17:23:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org
[17:26:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[17:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:23] <wikibugs>	 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10JoKalliauer)
[17:30:58] <wikibugs>	 (03PS1) 10Hnowlan: restbase: disable redundant jmx config [puppet] - 10https://gerrit.wikimedia.org/r/765313 (https://phabricator.wikimedia.org/T295375)
[17:35:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[17:35:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[17:37:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21406 and previous config saved to /var/cache/conftool/dbconfig/20220223-173711-ladsgroup.json
[17:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:44] <icinga-wm>	 RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:40:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[17:44:33] <wikibugs>	 (03CR) 10JHathaway: Add nagios_core & mailalias_core modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway)
[17:44:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org
[17:44:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:45:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2066.codfw.wmnet with OS stretch
[17:45:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10TheresNoTime) Hey all, sorry for the delay, I tested positive for COVID on Sunday and its been a little rough! Thank you //all// for the comments—I absolutely respect and appreciate tho...
[17:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:33] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch completed: - ms-be2066 (*...
[17:45:48] <wikibugs>	 (03PS2) 10Hnowlan: restbase: add deployment-restbase04 [puppet] - 10https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375)
[17:45:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[17:46:21] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@17a70a0]: (no justification provided)
[17:46:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:29] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@17a70a0]: (no justification provided) (duration: 00m 07s)
[17:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:41] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul)
[17:49:37] <wikibugs>	 (03CR) 10Btullis: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[17:49:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org
[17:52:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21407 and previous config saved to /var/cache/conftool/dbconfig/20220223-175217-ladsgroup.json
[17:52:18] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 52.26 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:44] <icinga-wm>	 PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:53:26] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:53:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS stretch
[17:53:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:48] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch
[17:56:02] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:50] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:32] <icinga-wm>	 RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:03:02] <wikibugs>	 (03PS3) 10Hnowlan: restbase: add deployment-restbase04 [puppet] - 10https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375)
[18:05:18] <wikibugs>	 (03PS2) 10Ahmon Dancy: mediawiki: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919
[18:05:20] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] restbase: add deployment-restbase04 [puppet] - 10https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan)
[18:07:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21408 and previous config saved to /var/cache/conftool/dbconfig/20220223-180722-ladsgroup.json
[18:07:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:29] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[18:11:15] <wikibugs>	 (03PS3) 10Ahmon Dancy: mediawiki: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919
[18:11:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney)
[18:11:55] <wikibugs>	 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (10cmooney) 05Open→03Resolved Thanks to John and Chris for the help on this, all done with the testing now.  I've set the 3 servers back to the status they'd have been after regular provision, so they can be image...
[18:12:03] <wikibugs>	 (03PS1) 10Ladsgroup: db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765316 (https://phabricator.wikimedia.org/T302363)
[18:12:41] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765316 (https://phabricator.wikimedia.org/T302363) (owner: 10Ladsgroup)
[18:13:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[18:13:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[18:13:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[18:13:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[18:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21409 and previous config saved to /var/cache/conftool/dbconfig/20220223-181350-ladsgroup.json
[18:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:02] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[18:15:55] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10RobH)
[18:16:44] <wikibugs>	 (03PS1) 10Majavah: service: generate config yaml in puppet instead of via templates [puppet] - 10https://gerrit.wikimedia.org/r/765317
[18:17:48] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33964/console" [puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah)
[18:18:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1158.eqiad.wmnet with OS bullseye
[18:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:01] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33965/console" [puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah)
[18:19:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (10RobH)
[18:19:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10RobH)
[18:19:58] <wikibugs>	 (03PS2) 10Majavah: service: generate config yaml in puppet instead of via templates [puppet] - 10https://gerrit.wikimedia.org/r/765317
[18:20:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q3:Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10RobH)
[18:20:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH)
[18:20:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10RobH)
[18:20:47] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33966/console" [puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah)
[18:20:49] <taavi>	 pro tip: remember to push your updated patch to gerrit before running pcc on it, otherwise you're going to be very confused
[18:20:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10RobH)
[18:21:06] <dancy>	 Good advice
[18:21:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install clouddumps100[12] - https://phabricator.wikimedia.org/T299610 (10RobH)
[18:21:44] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10RobH)
[18:23:39] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2069.codfw.wmnet with OS stretch
[18:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:44] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch executed with errors: - m...
[18:24:18] <wikibugs>	 (03CR) 10Majavah: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)
[18:25:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10Dzahn) @RobH (cc: @Jelto )    gitlab1002  has existed as a VM in the past, when contractors used it but the...
[18:29:42] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks Brandon.  IP reservations in Netbox seem good also." [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack)
[18:29:57] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Wikipedia-l list needs owners - https://phabricator.wikimedia.org/T295244 (10Quiddity) 05Open→03Resolved This was done.  ZI_Jony (added, and listed on info-page) and others offered to help, Plus I set the list to "reject with bounce" non-members to deal with the large wave...
[18:30:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage
[18:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:13] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Change is fine +1.  But I'm wondering why it's needed?  Without the "aggregate" there the routes sent by the ASW should be propagated anyw" [homer/public] - 10https://gerrit.wikimedia.org/r/765240 (owner: 10Ayounsi)
[18:33:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage
[18:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:58] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah)
[18:34:34] <wikibugs>	 (03CR) 10JHathaway: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)
[18:35:06] <taavi>	 jhathaway: do you want someone else to review that too? I don't have merge rights on the puppet repo
[18:35:37] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Makes sense to me but I'm no expert on puppetcode.  Logic seems good +1." [puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi)
[18:36:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10RobH) a:05Jclark-ctr→03LSobanski @lsobanski: Is it ok to shift these hostnames from gitlab100[23] to gi...
[18:36:01] <jhathaway>	 taavi: yes, I would defer to hnowlan, as I don't have any knowledge of that service
[18:36:20] <taavi>	 ack, makes sense
[18:36:45] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:38:08] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, I should have set msw1-eqiad as parent for LSWs too I realize, will add." [puppet] - 10https://gerrit.wikimedia.org/r/764725 (owner: 10Ayounsi)
[18:39:56] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10Aklapper) assuming this is about #puppet
[18:41:57] <wikibugs>	 (03PS2) 10Cathal Mooney: Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758)
[18:42:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:43:45] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10Dzahn) To start with I would just like to add a bit of info that we have a history of using git submodules inside the puppet repo and not liking them and then moving away from them again, whic...
[18:44:37] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10Proc) Specifically in the case of T302047, I would prefer that active contributors on primarily single Wikipedias //not// be deploying those patches. For example, and without getting in...
[18:44:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[18:45:06] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jhathaway) >>! In T302423#7733059, @Dzahn wrote: > To start with I would just like to add a bit of info that we have a history of using git submodules inside the puppet repo and not liking the...
[18:45:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[18:46:48] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond) @jhathaway thanks for writing this up just a few quick comments.  Before commenting i would say that in my mind we have  [[ https://phabricator.wikimedia.org/T265138#7041244 | four type...
[18:49:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1158.eqiad.wmnet with OS bullseye
[18:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[18:51:59] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond) p:05Triage→03Medium
[18:52:28] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond)
[18:52:31] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond)
[18:54:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[18:57:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21410 and previous config saved to /var/cache/conftool/dbconfig/20220223-185740-ladsgroup.json
[18:57:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:47] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[18:58:58] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10Dzahn) >>! In T302423#7733064, @jhathaway wrote: >>>! In T302423#7733059, @Dzahn wrote: >> To start with I would just like to add a bit of info that we have a history of using git submodules i...
[19:00:04] <jouncebot>	 dduvall and hashar: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T1900).
[19:00:04] <jouncebot>	 dduvall and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T1900).
[19:06:32] <James_F>	 Amir1: Do we rotate primary DBs for the OS upgrades, or will finishing the work be stalled on the next DC switch-over?
[19:06:47] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10Volans)
[19:07:09] <Amir1>	 James_F: for most I think we will do a switchover, s6 is already planned
[19:07:23] <Amir1>	 T300471
[19:07:23] <stashbot>	 T300471: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T300471
[19:07:38] <wikibugs>	 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) >>! In T301428#7730915, @Joe wrote: > Without knowing more about the type of data and your access patterns, it's hard to provide a good...
[19:08:10] <Amir1>	 but only core dbs are left; es, m, pc, and x have already had switchovers
[19:08:20] <James_F>	 Right. s7 going RO for a few minutes isn't terrible though.
[19:08:32] * James_F nods.
[19:09:08] <James_F>	 s8 and s4 would be the hard ones, I guess.
[19:09:33] <Amir1>	 yeah
[19:09:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[19:10:07] <James_F>	 But single wikis, so they could cope if it's 15 minutes not 5.
[19:10:22] <Amir1>	 there are a lot of schema changes pending for primary switchover as well. See the list T301312
[19:10:22] <stashbot>	 T301312: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T301312
[19:10:24] <James_F>	 Whereas s3 going down for 15 minutes would make a bunch of people whine
[19:10:26] <Amir1>	 T300402 T300992 T300381 T298554
[19:10:27] <stashbot>	 T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[19:10:27] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[19:10:28] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[19:10:28] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[19:10:31] * James_F nods.
[19:11:36] <Amir1>	 the RO time is around a minute these days
[19:11:59] <James_F>	 Yeah, I'm just being pessimistic if something goes wrong.
[19:12:12] <Amir1>	 honestly if we can automate it a bit, it should be done fully automatically and unannounced if you ask me :D
[19:12:21] <Amir1>	 yeah
[19:12:26] <James_F>	 Right.
[19:12:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21411 and previous config saved to /var/cache/conftool/dbconfig/20220223-191245-ladsgroup.json
[19:12:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:50] <James_F>	 Just in time for k8s for everything else, so the scale value won't be high. :-)
[19:13:33] <James_F>	 Automated tools that make everything very easy are great when we have 2000 boxes, but a bit dull when we have 100 boxes plus 2000 k8s pods.
[19:14:38] <Amir1>	 dbs won't be in k8s (did I misunderstand you?)
[19:14:59] <wikibugs>	 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10dr0ptp4kt) Thanks all. @MatthewVernon I'm delegating responsibility on research and response on this to my direct report, @SCherukuwada (Senior Engineering Manager, Web team), who i...
[19:15:22] <Amir1>	 generally stateful services should not go to containers 
[19:15:32] <James_F>	 Yeah, the DBs will still be the 100.
[19:15:57] <Amir1>	 it's around 300 these days :P
[19:16:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:16:38] <James_F>	 Hah. Ouch.
[19:16:39] <Amir1>	 we have to do schema changes 100 times because for codfw we just run them on primary and it gets replicated
[19:17:02] <James_F>	 Right.
[19:18:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:19:51] <wikibugs>	 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD insto centrallog1001 - https://phabricator.wikimedia.org/T302437 (10RobH)
[19:20:13] <wikibugs>	 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD insto centrallog1001 - https://phabricator.wikimedia.org/T302437 (10RobH)
[19:20:52] <dancy>	 Testing scap mods on deploy server for a few minutes
[19:22:29] <Amir1>	 🍿
[19:26:35] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing
[19:26:36] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2066 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sdx1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:27] <logmsgbot>	 !log dancy@deploy1002 scap failed: CalledProcessError Command 'make -f Makefile build-and-push-all-images GIT_BASE=https://gerrit.wikimedia.org/r/ BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver' returned non-zero exit status 2. (duration: 00m 51s)
[19:27:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21413 and previous config saved to /var/cache/conftool/dbconfig/20220223-192749-ladsgroup.json
[19:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:39] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "db1127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765222
[19:30:47] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765223
[19:30:54] <wikibugs>	 (03PS2) 10Ladsgroup: Revert "db1127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765222
[19:30:59] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765222 (owner: 10Ladsgroup)
[19:31:13] <wikibugs>	 (03PS2) 10Ladsgroup: Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765223
[19:31:17] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765223 (owner: 10Ladsgroup)
[19:32:30] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing scap container image building
[19:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:41] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing scap container image building
[19:33:45] <logmsgbot>	 !log dancy@deploy1002 scap failed: CalledProcessError Command 'make -f Makefile build-and-push-all-images GIT_BASE=https://gerrit.wikimedia.org/r/ BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver' returned non-zero exit status 2. (duration: 00m 03s)
[19:33:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:45] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing scap container image building
[19:35:48] <logmsgbot>	 !log dancy@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/bin/make -C /srv/mwbuilder/release/make-container-image -f Makefile build-and-push-all-images GIT_BASE=https://gerrit.wikimedia.org/r/ BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver' returned non-zero exit status 2. (duration: 00m 03s)
[19:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:44] <dancy>	 Done testing for now.
[19:42:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21414 and previous config saved to /var/cache/conftool/dbconfig/20220223-194254-ladsgroup.json
[19:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:01] <stashbot>	 T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[20:02:21] <wikibugs>	 (03PS1) 10Andrew Bogott: nfs-mounts: remove wikilink project [puppet] - 10https://gerrit.wikimedia.org/r/765331 (https://phabricator.wikimedia.org/T301646)
[20:03:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: remove wikilink project [puppet] - 10https://gerrit.wikimedia.org/r/765331 (https://phabricator.wikimedia.org/T301646) (owner: 10Andrew Bogott)
[20:06:04] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:07:08] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:08:33] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:14:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD into centrallog1001 - https://phabricator.wikimedia.org/T302437 (10Reedy)
[20:24:45] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jhathaway) >>! In T302423#7733067, @jbond wrote: > Before commenting i would say that in my mind we have  [[ https://phabricator.wikimedia.org/T265138#7041244 | four types types of modules ]]...
[20:34:34] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond) >>! In T302423#7733421, @jhathaway wrote:  > According to [[ https://puppet.com/docs/puppet/6/type.html#puppet-60-type-changes | puppet's docs ]] and my own inspection of Puppet's 6.26...
[20:44:34] <taavi>	 !log run CentralAuthUser::importLocalNames for FuzzyBot T302399
[20:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:40] <stashbot>	 T302399: FuzzyBot account is not attached to global account on many Wikimedia wikis - https://phabricator.wikimedia.org/T302399
[20:52:21] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Thanks for working on this! Confirmed NOOP on other hosts as expected." [puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi)
[20:56:48] <wikibugs>	 (03PS1) 10Dduvall: group1 wikis to 1.38.0-wmf.23  refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765334
[20:56:50] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.38.0-wmf.23  refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765334 (owner: 10Dduvall)
[20:57:49] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.23  refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765334 (owner: 10Dduvall)
[21:00:05] <jouncebot>	 RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T2100).
[21:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:00:18] <urbanecm>	 indeed, nothing to do
[21:01:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:02:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:03:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:18] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:08:24] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:08:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:08:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:09:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:13] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.23  refs T300199
[21:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:19] <stashbot>	 T300199: 1.38.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T300199
[21:10:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:10:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:46] <logmsgbot>	 !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.23  refs T300199 (duration: 01m 31s)
[21:11:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:23] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on doh[6001-6002].wikimedia.org with reason: bird6 errors expected, not serving any traffic
[21:17:25] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on doh[6001-6002].wikimedia.org with reason: bird6 errors expected, not serving any traffic
[21:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:56] <wikibugs>	 (03PS1) 10Andrew Bogott: nfs-mounts: remove account-creation-assistance project [puppet] - 10https://gerrit.wikimedia.org/r/765339 (https://phabricator.wikimedia.org/T301294)
[21:54:52] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: remove account-creation-assistance project [puppet] - 10https://gerrit.wikimedia.org/r/765339 (https://phabricator.wikimedia.org/T301294) (owner: 10Andrew Bogott)
[21:57:01] <wikibugs>	 (03PS1) 10Reedy: Add table and script for UCoC ratification vote [extensions/SecurePoll] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765225 (https://phabricator.wikimedia.org/T302433)
[21:57:08] <Reedy>	 jouncebot: nowandnext
[21:57:09] <jouncebot>	 For the next 0 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T2100)
[21:57:09] <jouncebot>	 In 3 hour(s) and 2 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T0100)
[21:57:53] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Add table and script for UCoC ratification vote [extensions/SecurePoll] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765225 (https://phabricator.wikimedia.org/T302433) (owner: 10Reedy)
[22:00:16] <wikibugs>	 (03Merged) 10jenkins-bot: Add table and script for UCoC ratification vote [extensions/SecurePoll] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765225 (https://phabricator.wikimedia.org/T302433) (owner: 10Reedy)
[22:01:30] <wikibugs>	 (03PS1) 10Jbond: (WIP) bolt:  Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342
[22:02:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) bolt:  Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond)
[22:03:53] <wikibugs>	 (03PS2) 10Jbond: (WIP) bolt:  Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342
[22:04:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) bolt:  Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond)
[22:06:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:06:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:07:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:07:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:59] <logmsgbot>	 !log reedy@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/SecurePoll/cli/wm-scripts/ucoc/: (no justification provided) (duration: 00m 50s)
[22:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:44] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "-1 this as it requires puppet > 6" [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond)
[22:11:45] <wikibugs>	 (03PS3) 10Jbond: (WIP) bolt:  Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342
[22:12:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) bolt:  Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond)
[22:16:09] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] (WIP) bolt:  Add bolt rake tasks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond)
[22:19:02] <wikibugs>	 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) 05Open→03Resolved I closed out the ticket and this is now resolved.
[22:19:18] <wikibugs>	 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH)
[22:32:51] <mutante>	 !quip "svn->git migration is not completely trivial, due to the free-form nature of SVN repos --valhallasw in 2014 on T60801
[22:32:52] <stashbot>	 T60801: Copy contents of https://svn.toolserver.org/ to Wikimedia Diffusion - https://phabricator.wikimedia.org/T60801
[22:33:04] <mutante>	 !quip help
[22:33:59] <mutante>	 !bash "svn->git migration is not completely trivial, due to the free-form nature of SVN repos --valhallasw in 2014 on T60801
[22:33:59] <stashbot>	 mutante: Stored quip at https://bash.toolforge.org/quip/Gem4KH8Ba_6PSCT9RHkp
[22:35:11] <mutante>	 quip help is https://bash.toolforge.org/help
[22:37:13] <mutante>	 !log phabricator - disabling repository dibyaduttabook
[22:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:37] <wikibugs>	 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10bking) Confirmed the following:    - Known-good ES startup script `(shasum:2a11d1b38f6712e4898a383bf68c7ed5937ba0a1)` is from Elastic's 6.5.4 release    - Known-bad ES startu...
[22:42:25] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10RLazarus) This came up again in T301507.
[22:50:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS stretch
[22:51:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:01] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch
[22:51:04] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org
[22:51:13] <mutante>	 !log phabricator - disabled empty but active repos: dibyaduttabook and xtools-H (T296022)
[22:51:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:18] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[22:55:04] <wikibugs>	 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10bking) Ran a diff against good and bad, the bad has the following inserted in 23-29:  `# If the quote-aware filesystem plugin is installed, then we need to pass extra # flags...
[22:57:35] <wikibugs>	 (03PS1) 10MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029)
[22:58:09] <mutante>	 !log phabricator - disabled empty but active repo: wikidata-query-LDFServer (WQLD) created in 2018 by qchris (T296022)
[22:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:15] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[22:58:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: 10MewOphaswongse)
[23:04:11] <wikibugs>	 (03PS4) 10Jbond: (WIP) bolt:  Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342
[23:05:18] <wikibugs>	 (03PS2) 10MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029)
[23:09:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage
[23:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage
[23:13:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2069.codfw.wmnet with OS stretch
[23:39:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:13] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch completed: - ms-be2069 (*...
[23:47:20] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul)
[23:47:29] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:52:40] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) 05Open→03Resolved @fgiunchedi this is complete after long hours of workaround because puppet wasn't  happy at  ` mkfs on /dev/sdc1  ` hopefully w...
[23:54:22] <wikibugs>	 10ops-codfw, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q3): Decom centrallog2001 - https://phabricator.wikimedia.org/T298994 (10Papaul)
[23:55:30] <wikibugs>	 10ops-codfw, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q3): Decom centrallog2001 - https://phabricator.wikimedia.org/T298994 (10Papaul)
[23:59:51] <wikibugs>	 (03PS1) 10Krinkle: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465)