[00:04:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS stretch
[00:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:23] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch
[00:09:57] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[00:10:07] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:10:23] SRE, LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (Dzahn) Hey @Ammarpad your user page and Wikitech / LDAP user don't show your realname but @KFrancis from Legal needs it to go through the NDA process with you. Could you please shoot her an email (https...
[00:15:14] (PS1) Ahmon Dancy: mediawiki: Add mw.localmemcached.enabled value [deployment-charts] - https://gerrit.wikimedia.org/r/764919
[00:23:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[00:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:51] Today's enwiki featured image is failing to display. I opened https://phabricator.wikimedia.org/T302357 for it.
[00:26:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[00:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:47] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:24] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:29] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:37] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage
[00:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage
[00:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:53] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:40] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[00:59:40] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye
[00:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:50] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage
[01:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:19] beta cluster doesn't seem to be updating today
[01:03:14] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage
[01:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS stretch
[01:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:30] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch completed: - ms-be2067 (*...
[01:08:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS stretch
[01:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:02] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch
[01:13:14] (CR) Eevans: [C: +1] restbase: add deployment-restbase04 [puppet] - https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375) (owner: Hnowlan)
[01:18:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1004.wikimedia.org with OS bullseye
[01:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS stretch
[01:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:12] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch
[01:27:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[01:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[01:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:31] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:38:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[01:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:41:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[01:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:56] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2066.codfw.wmnet with OS stretch
[01:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:01] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors: - m...
[01:51:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS stretch
[01:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:19] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch
[01:54:31] (CR) Cwhite: "There are more instances than just eqiad. Do we need to provide proxy for those as well?" [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[01:59:19] SRE, LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (KFrancis) Thanks all. I'm processing this now.
[02:02:44] SRE, LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (Reedy)
[02:06:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[02:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[02:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:56] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (Papaul) @fgiunchedi puppet is failed on ms-be2067, ms-be2068 with the error below. if you back online can you please check? thanks ` Error: 'parted --script...
[02:19:04] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (Papaul)
[02:20:39] (PS1) Andrew Bogott: nfs-mounts.yaml.erb: remove nfs mounts for wikipathways [puppet] - https://gerrit.wikimedia.org/r/764930 (https://phabricator.wikimedia.org/T301298)
[02:27:57] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-statsd-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:35:32] (CR) Andrew Bogott: [C: +2] nfs-mounts.yaml.erb: remove nfs mounts for wikipathways [puppet] - https://gerrit.wikimedia.org/r/764930 (https://phabricator.wikimedia.org/T301298) (owner: Andrew Bogott)
[02:40:11] (CR) Herron: [C: +2] prometheus: sketch out proxied prometheus web with IDP [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[02:40:26] (CR) Herron: prometheus: sketch out proxied prometheus web with IDP [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[02:49:43] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2066.codfw.wmnet with OS stretch
[02:49:45] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS stretch
[02:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:48] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors: - m...
[02:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:53] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch executed with errors: - m...
[02:51:18] (CR) Herron: prometheus: sketch out proxied prometheus web with IDP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[03:04:03] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:37] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:42:51] PROBLEM - Check systemd state on ms-be2068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-statsd-exporter.service,wmf_auto_restart_prometheus-statsd-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:33] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:01:55] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcontrol1003, ms-be2067, prometheus1006, ms-be2068, cloudcontrol1005 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:02:57] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:14:23] (PS1) Ladsgroup: ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/764837 (https://phabricator.wikimedia.org/T283029)
[04:14:37] (PS1) Ladsgroup: ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/764838 (https://phabricator.wikimedia.org/T283029)
[04:14:45] (CR) Ladsgroup: [C: +2] ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/764837 (https://phabricator.wikimedia.org/T283029) (owner: Ladsgroup)
[04:14:49] (CR) Ladsgroup: [C: +2] ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/764838 (https://phabricator.wikimedia.org/T283029) (owner: Ladsgroup)
[04:27:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[04:27:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[04:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T302363)', diff saved to https://phabricator.wikimedia.org/P21322 and previous config saved to /var/cache/conftool/dbconfig/20220223-042802-ladsgroup.json
[04:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:11] T302363: Upgraded s7 to bullseye - https://phabricator.wikimedia.org/T302363
[04:28:56] (Merged) jenkins-bot: ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/764837 (https://phabricator.wikimedia.org/T283029) (owner: Ladsgroup)
[04:29:01] (Merged) jenkins-bot: ParserOutputAccess: Check for latest revision when checking for cache [core] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/764838 (https://phabricator.wikimedia.org/T283029) (owner: Ladsgroup)
[04:31:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2150.codfw.wmnet with OS bullseye
[04:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:33:51] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.23/includes/page/ParserOutputAccess.php: Backport: [[gerrit:764837|ParserOutputAccess: Check for latest revision when checking for cache (T283029)]] (duration: 00m 51s)
[04:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:33:57] T283029: FlaggableWikiPage::preloadPreparedEdit() does not actually carry over the parser output, leading to double parses on save - https://phabricator.wikimedia.org/T283029
[04:35:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:16] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.22/includes/page/ParserOutputAccess.php: Backport: [[gerrit:764838|ParserOutputAccess: Check for latest revision when checking for cache (T283029)]] (duration: 00m 50s)
[04:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:36:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:43:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2150.codfw.wmnet with reason: host reimage
[04:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2150.codfw.wmnet with reason: host reimage
[04:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2150.codfw.wmnet with OS bullseye
[05:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T302363)', diff saved to https://phabricator.wikimedia.org/P21323 and previous config saved to /var/cache/conftool/dbconfig/20220223-051026-ladsgroup.json
[05:10:28] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:33] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[05:11:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[05:11:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[05:11:22] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T302363)', diff saved to https://phabricator.wikimedia.org/P21324 and previous config saved to /var/cache/conftool/dbconfig/20220223-051125-ladsgroup.json
[05:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:30] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:13:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2122.codfw.wmnet with OS bullseye
[05:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2122.codfw.wmnet with reason: host reimage
[05:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2122.codfw.wmnet with reason: host reimage
[05:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2122.codfw.wmnet with OS bullseye
[05:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T302363)', diff saved to https://phabricator.wikimedia.org/P21325 and previous config saved to /var/cache/conftool/dbconfig/20220223-055416-ladsgroup.json
[05:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:22] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[05:55:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[05:55:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[05:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T302363)', diff saved to https://phabricator.wikimedia.org/P21326 and previous config saved to /var/cache/conftool/dbconfig/20220223-055534-ladsgroup.json
[05:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:20] PROBLEM - Check systemd state on ms-be2066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-statsd-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:58:08] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:58:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2120.codfw.wmnet with OS bullseye
[05:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:28] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:12:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2120.codfw.wmnet with reason: host reimage
[06:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2120.codfw.wmnet with reason: host reimage
[06:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2120.codfw.wmnet with OS bullseye
[06:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T302363)', diff saved to https://phabricator.wikimedia.org/P21327 and previous config saved to /var/cache/conftool/dbconfig/20220223-063625-ladsgroup.json
[06:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:31] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363
[06:37:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:37:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2118 (T302363)', diff saved to https://phabricator.wikimedia.org/P21328 and previous config saved to /var/cache/conftool/dbconfig/20220223-063733-ladsgroup.json
[06:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2118.codfw.wmnet with OS bullseye
[06:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:27] SRE, MW-on-K8s, serviceops, Patch-For-Review: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (Joe)
[06:43:39] SRE, MW-on-K8s, serviceops, Patch-For-Review, and 2 others: The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (Joe) Open→Resolved a:Joe
[06:53:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2118.codfw.wmnet with reason: host reimage
[06:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:54:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:51] !log dbmaint on s2@codfw (T300992)
[06:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:57] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[06:56:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance
[06:56:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance
[06:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2118.codfw.wmnet with reason: host reimage
[06:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance
[06:59:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance
[06:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:02:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:02:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[07:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[07:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:13] SRE, Traffic: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 (elukey) For varnishkafka, this is the problem: ` elukey@apt1001:/srv/wikimedia$ sudo reprepro lsbycomponent varnishkafka varnishkafka | 1.0.13-1 | stretch-wikimedia | main | amd64, source varnis...
[07:03:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[07:03:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[07:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance
[07:06:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance
[07:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2088.codfw.wmnet with reason: Maintenance
[07:09:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2088.codfw.wmnet with reason: Maintenance
[07:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[07:10:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[07:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21329 and previous config saved to /var/cache/conftool/dbconfig/20220223-071038-ladsgroup.json
[07:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:47] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[07:11:17] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (elukey) acked the alerts in icinga for elastic1093 :)
[07:12:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2118.codfw.wmnet with OS bullseye
[07:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21330 and previous config saved to /var/cache/conftool/dbconfig/20220223-071404-ladsgroup.json
[07:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:09] (PS1) Ayounsi: drmrs: add HE peers [homer/public] - https://gerrit.wikimedia.org/r/765190
[07:29:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P21331 and previous config saved to /var/cache/conftool/dbconfig/20220223-072909-ladsgroup.json
[07:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:41] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:33:09] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:33:43] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:40:49] SRE, Security-Team, Performance-Team (Radar), SecTeam-Processed, Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (Joe) Without knowing more about the type of data and your access patterns, it's hard to provide a good suggestion around this. But, more in genera...
[07:42:28] SRE, Security-Team, Performance-Team (Radar), SecTeam-Processed, Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (Joe) >>! In T301428#7716398, @Mstyles wrote: > Thanks @Joe, is there a hard limit on file sizes that can be stored inside the container? We might...
[07:44:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P21332 and previous config saved to /var/cache/conftool/dbconfig/20220223-074413-ladsgroup.json [07:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:37] (03CR) 10Filippo Giunchedi: [C: 03+1] mjolnir: Restore prometheus_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) (owner: 10Ebernhardson) [07:48:17] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:19] (03PS1) 10Bartosz Dziewoński: ReverseChronologicalPager: Fix displaying date headers for non-revisions [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764841 (https://phabricator.wikimedia.org/T302343) [07:49:46] (03PS1) 10Bartosz Dziewoński: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764842 (https://phabricator.wikimedia.org/T302326) [07:50:10] (03PS2) 10Bartosz Dziewoński: Enable mobile DT at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) (owner: 10Esanders) [07:51:53] (03PS1) 10Elukey: Split the revscoring-editquality ml-serve settings in three [labs/private] - 10https://gerrit.wikimedia.org/r/765193 (https://phabricator.wikimedia.org/T301415) [07:52:33] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [07:52:36] 
(03CR) 10Elukey: [V: 03+2 C: 03+2] Split the revscoring-editquality ml-serve settings in three [labs/private] - 10https://gerrit.wikimedia.org/r/765193 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [07:52:58] (03PS1) 10Filippo Giunchedi: Revert "o11y: temp relax of LogstashIndexingFailures" [alerts] - 10https://gerrit.wikimedia.org/r/765194 (https://phabricator.wikimedia.org/T288549) [07:54:39] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256) [07:55:17] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "o11y: temp relax of LogstashIndexingFailures" [alerts] - 10https://gerrit.wikimedia.org/r/765194 (https://phabricator.wikimedia.org/T288549) (owner: 10Filippo Giunchedi) [07:56:16] (03PS3) 10Bartosz Dziewoński: Enable mobile DiscussionTools at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) (owner: 10Esanders) [07:59:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21333 and previous config saved to /var/cache/conftool/dbconfig/20220223-075918-ladsgroup.json [07:59:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:59:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:25] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:59:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T300992)', diff saved to https://phabricator.wikimedia.org/P21334 and previous config saved to 
/var/cache/conftool/dbconfig/20220223-075926-ladsgroup.json [07:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Amir1, awight, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T0800). [08:00:05] MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:25] o/ [08:00:27] i can deploy today [08:00:30] hi [08:00:34] hi MatmaRex! [08:01:10] MatmaRex: do the config patches depend on the backports, please? [08:01:55] (03CR) 10Urbanecm: [C: 03+2] Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764842 (https://phabricator.wikimedia.org/T302326) (owner: 10Bartosz Dziewoński) [08:02:00] urbanecm: yeah [08:02:11] (03CR) 10Urbanecm: [C: 03+2] ReverseChronologicalPager: Fix displaying date headers for non-revisions [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764841 (https://phabricator.wikimedia.org/T302343) (owner: 10Bartosz Dziewoński) [08:02:25] okay, then we have to wait for CI to process the backports now :) [08:04:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300992)', diff saved to https://phabricator.wikimedia.org/P21335 and previous config saved to /var/cache/conftool/dbconfig/20220223-080424-ladsgroup.json [08:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:30] T300992: Add linter_template and linter_tag columns to the Linter table - 
https://phabricator.wikimedia.org/T300992 [08:05:32] (03Abandoned) 10Urbanecm: DNM: Testing patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764901 (owner: 10Urbanecm) [08:05:44] (03Merged) 10jenkins-bot: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764842 (https://phabricator.wikimedia.org/T302326) (owner: 10Bartosz Dziewoński) [08:06:02] (03PS1) 10Elukey: profile::kubernetes::deployment_server: split revscoring-ediquality [puppet] - 10https://gerrit.wikimedia.org/r/765196 (https://phabricator.wikimedia.org/T301415) [08:06:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T302363)', diff saved to https://phabricator.wikimedia.org/P21336 and previous config saved to /var/cache/conftool/dbconfig/20220223-080609-ladsgroup.json [08:06:12] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256) [08:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:16] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [08:06:27] that was quick [08:07:19] MatmaRex: first backport is at mwdebug1001, can you test it please? [08:07:45] looking [08:07:49] (03CR) 10Elukey: [C: 03+2] profile::kubernetes::deployment_server: split revscoring-ediquality [puppet] - 10https://gerrit.wikimedia.org/r/765196 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [08:08:07] urbanecm: oh, we don't have that enabled anywhere yet :/ i can only test that with the config patch [08:08:11] oh [08:08:19] so should i pull one of the config patches there too? [08:08:24] (i assume https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/764868?) [08:08:30] yeah. 
the mobile one [08:08:31] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [08:08:34] yes, thanks [08:08:36] (03CR) 10Urbanecm: [C: 03+2] Enable mobile DiscussionTools at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) (owner: 10Esanders) [08:08:47] okay, give me a sec :) [08:09:30] (03Merged) 10jenkins-bot: Enable mobile DiscussionTools at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) (owner: 10Esanders) [08:09:45] MatmaRex: the config patch is at mwdebug1001 together with the backport now [08:10:16] urbanecm: thanks. looks good on https://ht.m.wikipedia.org/wiki/Diskite:Paj_Prensipal [08:10:22] great! syncing [08:10:29] (backport first, then config) [08:10:46] actually... [08:10:55] never mind, htwiki is group0 [08:11:18] no, it's not, my screen confused me at https://versions.toolforge.org/ [08:11:32] oh, hm [08:11:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:44] i should backport to wmf.22 as well, right? 
[08:13:03] yeah, if you want the code to be live at htwiki [08:13:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:13:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:17] you can also wait for Thursday (when train deploys wmf.23 there) [08:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [08:13:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [08:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T302363)', diff saved to https://phabricator.wikimedia.org/P21337 and previous config saved to /var/cache/conftool/dbconfig/20220223-081338-ladsgroup.json [08:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:43] (03PS1) 10Bartosz Dziewoński: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764843 (https://phabricator.wikimedia.org/T302326) [08:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:46] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [08:13:48] let's do it, if you don't mind [08:13:52] not at all [08:14:03] (03CR) 10Urbanecm: [C: 03+2] Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764843 (https://phabricator.wikimedia.org/T302326) (owner: 10Bartosz Dziewoński) 
[08:14:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:14:32] (i was testing it wrong, i didn't realize that i was seeing the new tools because i had them enabled in preferences) [08:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:55] makes sense :) [08:17:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2108.codfw.wmnet with OS bullseye [08:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:27] (03Merged) 10jenkins-bot: ReverseChronologicalPager: Fix displaying date headers for non-revisions [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764841 (https://phabricator.wikimedia.org/T302343) (owner: 10Bartosz Dziewoński) [08:17:33] (03CR) 10Filippo Giunchedi: prometheus: sketch out proxied prometheus web with IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [08:18:42] (03Merged) 10jenkins-bot: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764843 (https://phabricator.wikimedia.org/T302326) (owner: 10Bartosz Dziewoński) [08:19:19] MatmaRex: all backports are now at mwdebug1001 (together with the mobile config patch) [08:19:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P21338 and previous config saved to /var/cache/conftool/dbconfig/20220223-081929-ladsgroup.json [08:19:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:07] urbanecm: thanks. 
now it's really as expected on https://ht.m.wikipedia.org/wiki/Diskite:Paj_Prensipal, while logged out too [08:20:16] great! [08:20:17] syncing :) [08:20:41] and for the other backport, https://www.mediawiki.org/wiki/Special:Contributions/Matma_Rex looks fixed as well [08:20:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:20:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:00] excellent, will sync it too [08:21:46] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.22/extensions/DiscussionTools/: b82e4eb: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions (T302326) (duration: 00m 52s) [08:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:51] T302326: Enable reply and new topic tools unconditionally when Discussion Tools mobile is enabled - https://phabricator.wikimedia.org/T302326 [08:21:56] (03PS3) 10Bartosz Dziewoński: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256) [08:22:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:35] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/DiscussionTools/: 269dcfd: Mobile config: Always enable reply/newtopic tools on mobile, disable subscriptions (T302326) (duration: 00m 50s) [08:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:17] (03PS1) 10Elukey: admin_ng: add new namespaces for revscoring-editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/765198 
(https://phabricator.wikimedia.org/T301415) [08:24:19] (03PS1) 10Elukey: ml-services: add helmfile config for the new revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) [08:24:23] (03PS1) 10MMandere: varnish: change the default archive component for varnish [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) [08:24:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d9e8861: Enable mobile DiscussionTools at ht.wiki (T302259) (duration: 00m 50s) [08:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:32] T302259: [Config Change] Offer mobile Reply and New Discussion Tools at ht.wiki - https://phabricator.wikimedia.org/T302259 [08:24:42] MatmaRex: so, htwiki stuff is live now. Can you advise a good sync order for the core backport? [08:25:32] I'm thinking about HistoryPager, ContribsPager, MergeHistoryPager and then the other two files, but I'm not sure about that [08:25:43] oh, hm [08:25:54] (03PS1) 10Muehlenhoff: Make ganeti2029/ganeti2030 Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/765201 (https://phabricator.wikimedia.org/T298998) [08:26:04] (03PS2) 10Muehlenhoff: Make ganeti2029/ganeti2030 Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/765201 (https://phabricator.wikimedia.org/T298998) [08:26:19] urbanecm: IndexPager last, the rest is whatever [08:26:43] okay [08:26:47] (03CR) 10jerkins-bot: [V: 04-1] ml-services: add helmfile config for the new revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [08:27:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [08:28:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:53] started [08:29:00] (03CR) 10Elukey: [C: 03+2] admin_ng: add new namespaces for revscoring-editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/765198 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [08:29:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:40] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/actions/pagers/HistoryPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 1/5) (duration: 00m 49s) [08:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:46] T302343: Date headings on Special:Contributions don't work well for Flow edits - https://phabricator.wikimedia.org/T302343 [08:29:47] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256) (owner: 10Bartosz Dziewoński) [08:30:29] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/specials/pagers/ContribsPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 2/5) (duration: 00m 49s) [08:30:31] (03Merged) 10jenkins-bot: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765195 (https://phabricator.wikimedia.org/T302256) (owner: 10Bartosz Dziewoński) [08:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:04] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2108.codfw.wmnet with reason: host reimage [08:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:19] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/specials/pagers/MergeHistoryPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 3/5) (duration: 00m 49s) [08:31:20] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10fgiunchedi) >>! In T299468#7730659, @Papaul wrote: > @fgiunchedi puppet is failed on ms-be2067, ms-be2068 with the error below. if you back online can you pl... [08:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:31] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:31:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:13] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/pager/ReverseChronologicalPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 4/5) (duration: 00m 53s) [08:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:02] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/pager/IndexPager.php: 38f33d3: ReverseChronologicalPager: Fix displaying date headers for non-revisions (T302343; 5/5) (duration: 00m 48s) [08:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:10] MatmaRex: core backport should be live now [08:33:27] and the last config patch is at mwdebug1001 now. MatmaRex, can you test please? 
[08:34:05] looking [08:34:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:34:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2108.codfw.wmnet with reason: host reimage [08:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P21339 and previous config saved to /var/cache/conftool/dbconfig/20220223-083433-ladsgroup.json [08:34:35] urbanecm: yeah, looks good on https://www.mediawiki.org/wiki/Talk:Talk_pages_project/Usability [08:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:41] great, syncing! [08:35:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:35:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 10cb05a: Enable DiscussionTools newtopictool, topicsubscription on MediaWiki.org (T302256) (duration: 00m 49s) [08:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:56] T302256: Config Change: offer Reply Tool, New Discussion Tool, Topic Subscriptions as Opt-Out at mediawiki.org - https://phabricator.wikimedia.org/T302256 [08:36:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:37] MatmaRex: and all should be done now [08:37:39] anything else? 
[08:37:45] (sorry, had a short network issue here) [08:37:49] thanks [08:38:00] no more, that's enough patches ;) [08:38:06] fair enough :) [08:38:16] !log UTC morning B&C window done [08:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:28] speaking of patches, tomorrow we have a trainee for the morning slot! [08:38:38] good luck to them :) [08:38:42] I oughta know, I got her to sign up :-D [08:39:11] (03PS1) 10Muehlenhoff: Revert "Disable cluster rebalances temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/765202 (https://phabricator.wikimedia.org/T284811) [08:39:24] I hope more than just me will be around tomorrow morning (please) [08:39:57] will try to :) [08:40:00] (03PS2) 10MMandere: varnish: change the default archive component for varnish [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) [08:40:03] cool! [08:40:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/765202 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [08:40:22] apergos: just out of curiosity, how does one know when there are trainees or not? [08:40:24] first, I don't really wanna deploy by myself, and second, I want the norm to be that people learning have other people to rely on [08:40:38] and that doesn't happen if there's only one person here! [08:40:58] oh, if it's not a special case like this one where I said "go make a task", I go check the board: [08:41:10] https://phabricator.wikimedia.org/project/view/5265/ [08:41:26] oh wow TWO trainees [08:41:43] even better :) [08:42:02] yeah we definitely need a couple people around, maybe one person can share screen while they deploy and the other can actually discuss what's going on and give all the links and so on [08:42:21] wow so exciting :-) [08:42:29] * urbanecm is happy to play whichever role he's assigned [08:42:38] cool! 
thanks for just showing up [08:45:13] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/765202 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [08:49:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2108.codfw.wmnet with OS bullseye [08:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300992)', diff saved to https://phabricator.wikimedia.org/P21340 and previous config saved to /var/cache/conftool/dbconfig/20220223-084938-ladsgroup.json [08:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [08:49:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [08:49:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:45] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [08:49:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T300992)', diff saved to https://phabricator.wikimedia.org/P21341 and previous config saved to 
/var/cache/conftool/dbconfig/20220223-084951-ladsgroup.json [08:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:28] 10Puppet, 10Infrastructure-Foundations, 10SRE Observability: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 (10fgiunchedi) [08:50:35] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Disable cluster rebalances temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/765202 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [08:51:39] (03PS1) 10Filippo Giunchedi: prometheus: fix quantile config value type [puppet] - 10https://gerrit.wikimedia.org/r/765203 (https://phabricator.wikimedia.org/T302372) [08:52:45] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: ms-be2068, cloudcontrol1005, cloudcontrol1003, ms-be2066, ms-be2067, cloudcontrol1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:52:48] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33944/console" [puppet] - 10https://gerrit.wikimedia.org/r/765203 (https://phabricator.wikimedia.org/T302372) (owner: 10Filippo Giunchedi) [08:53:53] seeking reviewer for an easy but kinda urgent one ^ [08:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300992)', diff saved to https://phabricator.wikimedia.org/P21342 and previous config saved to /var/cache/conftool/dbconfig/20220223-085411-ladsgroup.json [08:54:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:34] (03CR) 10Elukey: [C: 03+1] prometheus: fix quantile config value type [puppet] - 10https://gerrit.wikimedia.org/r/765203 (https://phabricator.wikimedia.org/T302372) (owner: 10Filippo Giunchedi) [08:56:09] thank you elukey [08:56:18] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: fix quantile config value type [puppet] - 10https://gerrit.wikimedia.org/r/765203 (https://phabricator.wikimedia.org/T302372) (owner: 10Filippo Giunchedi) [08:57:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T302363)', diff saved to https://phabricator.wikimedia.org/P21343 and previous config saved to /var/cache/conftool/dbconfig/20220223-085755-ladsgroup.json [08:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:01] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [09:00:05] dduvall and hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T0900). 
[09:01:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2077.codfw.wmnet with reason: Maintenance [09:01:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2077.codfw.wmnet with reason: Maintenance [09:01:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [09:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [09:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2077 (T302363)', diff saved to https://phabricator.wikimedia.org/P21345 and previous config saved to /var/cache/conftool/dbconfig/20220223-090109-ladsgroup.json [09:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:44] !log bounce prometheus-statsd-exporter on C:prometheus::statsd_exporter - T302372 [09:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:50] T302372: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 [09:03:28] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [09:03:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2077.codfw.wmnet with OS bullseye [09:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:44] 10SRE, 10Infrastructure-Foundations, 
10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) [09:08:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The update is complete [09:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P21346 and previous config saved to /var/cache/conftool/dbconfig/20220223-090916-ladsgroup.json [09:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:52] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Both main Ganeti clusters have been upgraded to Buster. [09:12:31] (03PS1) 10Majavah: os-reports: add clouddb2001-dev task [puppet] - 10https://gerrit.wikimedia.org/r/765204 [09:14:21] !log restarting blazegraph on wdqs1007 (jvm stuck for 11 hours) [09:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:09] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks, merging" [puppet] - 10https://gerrit.wikimedia.org/r/765204 (owner: 10Majavah) [09:17:48] (03PS4) 10Muehlenhoff: ganeti: Retire ganeti216 option [puppet] - 10https://gerrit.wikimedia.org/r/764363 [09:18:23] RECOVERY - Check systemd state on ms-be2066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2077.codfw.wmnet with reason: host reimage [09:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:08] (03CR) 10Ayounsi: [C: 03+2] drmrs: add HE peers [homer/public] - 
10https://gerrit.wikimedia.org/r/765190 (owner: 10Ayounsi) [09:21:41] (03Merged) 10jenkins-bot: drmrs: add HE peers [homer/public] - 10https://gerrit.wikimedia.org/r/765190 (owner: 10Ayounsi) [09:23:23] 10Puppet, 10Infrastructure-Foundations, 10SRE Observability: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done, followup at {T302373} [09:24:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P21347 and previous config saved to /var/cache/conftool/dbconfig/20220223-092421-ladsgroup.json [09:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2077.codfw.wmnet with reason: host reimage [09:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:45] (03CR) 10Vgutierrez: "looks good, make sure that varnish6 is already available on the main component before merging as PCC (https://puppet-compiler.wmflabs.org/" [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere) [09:33:47] (03PS1) 10Ayounsi: drmrs: use BGP_aggregate_contributors for main prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/765205 [09:36:26] (03PS2) 10Giuseppe Lavagetto: ml-services: add helmfile config for the new revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [09:36:28] (03PS1) 10Giuseppe Lavagetto: Rakefile: check existence of fixtures directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/765226 [09:37:21] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Retire ganeti216 option [puppet] - 10https://gerrit.wikimedia.org/r/764363 (owner: 10Muehlenhoff) 
[09:38:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2077.codfw.wmnet with OS bullseye [09:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300992)', diff saved to https://phabricator.wikimedia.org/P21348 and previous config saved to /var/cache/conftool/dbconfig/20220223-093925-ladsgroup.json [09:39:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:39:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:31] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:39:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21349 and previous config saved to /var/cache/conftool/dbconfig/20220223-093933-ladsgroup.json [09:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:43] (03CR) 10MMandere: varnish: change the default archive component for varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere) [09:41:47] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [09:43:03] (03CR) 10Elukey: [C: 03+2] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/765226 (owner: 10Giuseppe Lavagetto) [09:43:09] (03CR) 10Elukey: [C: 03+2] ml-services: add helmfile config for the new revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765199 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [09:44:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21350 and previous config saved to /var/cache/conftool/dbconfig/20220223-094405-ladsgroup.json [09:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:29] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 71 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:46:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2077 (T302363)', diff saved to https://phabricator.wikimedia.org/P21351 and previous config saved to /var/cache/conftool/dbconfig/20220223-094655-ladsgroup.json [09:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:01] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [09:47:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) Prometheus doesn't run on VMs in eqiad/codfw (not sure if this fact was... [09:49:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[09:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:10] (03PS1) 10Elukey: ml-services: move reverted models to their new namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/765228 (https://phabricator.wikimedia.org/T301415) [09:59:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P21352 and previous config saved to /var/cache/conftool/dbconfig/20220223-095909-ladsgroup.json [09:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:37] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [10:06:55] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [10:08:03] (03CR) 10Elukey: [C: 03+2] ml-services: move reverted models to their new namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/765228 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [10:10:21] (03PS1) 10Ladsgroup: cumin: Avoid creating alias for tendril [puppet] - 10https://gerrit.wikimedia.org/r/765234 [10:11:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 
Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [10:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:44] (03CR) 10Ladsgroup: "We have a happy PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/33947/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/765234 (owner: 10Ladsgroup) [10:14:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P21353 and previous config saved to /var/cache/conftool/dbconfig/20220223-101414-ladsgroup.json [10:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:03] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765234 (owner: 10Ladsgroup) [10:16:18] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] cumin: Avoid creating alias for tendril [puppet] - 10https://gerrit.wikimedia.org/r/765234 (owner: 10Ladsgroup) [10:19:44] (03PS1) 10Elukey: ml-services: move goodfaith/damaging models to the new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765235 (https://phabricator.wikimedia.org/T301415) [10:23:07] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:24:31] (03CR) 10Elukey: [C: 03+2] ml-services: move goodfaith/damaging models to the new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765235 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [10:24:49] (03CR) 10Klausman: [C: 03+1] ml-services: move 
goodfaith/damaging models to the new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765235 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [10:26:02] (03PS1) 10Ladsgroup: db1181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765236 (https://phabricator.wikimedia.org/T302363) [10:26:57] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:27:45] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765236 (https://phabricator.wikimedia.org/T302363) (owner: 10Ladsgroup) [10:29:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21354 and previous config saved to /var/cache/conftool/dbconfig/20220223-102919-ladsgroup.json [10:29:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:29:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:26] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:36] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [10:31:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:31:58] !log running schema change against s3 T300774 [10:32:00] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T300992)', diff saved to https://phabricator.wikimedia.org/P21355 and previous config saved to /var/cache/conftool/dbconfig/20220223-103204-ladsgroup.json [10:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:08] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:25] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is CRITICAL: 13 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [10:32:59] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [10:33:13] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent 
[10:38:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [10:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:45] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:45:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:46:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:44] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T300774)', diff saved to https://phabricator.wikimedia.org/P21356 and previous config saved to /var/cache/conftool/dbconfig/20220223-104644-kormat.json [10:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:51] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:46:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300992)', diff saved to https://phabricator.wikimedia.org/P21357 and previous config saved to /var/cache/conftool/dbconfig/20220223-104704-ladsgroup.json [10:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:17] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:48:31] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:49:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [10:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:15] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [10:51:51] (03PS1) 10Ayounsi: Export POPs aggregates and private prefixes over BGP [homer/public] - 10https://gerrit.wikimedia.org/r/765240 [10:56:44] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [11:02:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P21358 and previous config saved to /var/cache/conftool/dbconfig/20220223-110209-ladsgroup.json [11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1181.eqiad.wmnet with reason: Maintenance [11:05:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21359 and previous config saved to /var/cache/conftool/dbconfig/20220223-110540-ladsgroup.json [11:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:49] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [11:06:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:59] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [11:07:02] (03CR) 10Volans: "I think at this point this could be squashed with the other CR that introduces reposync." [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 (owner: 10Jbond) [11:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [11:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [11:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:46] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[11:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P21360 and previous config saved to /var/cache/conftool/dbconfig/20220223-111714-ladsgroup.json [11:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1181.eqiad.wmnet with OS bullseye [11:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:30] (03PS1) 10Elukey: ml-services: deprecate the revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765242 (https://phabricator.wikimedia.org/T301415) [11:22:28] (03PS1) 10Elukey: Remove references to revscoring-editquality [labs/private] - 10https://gerrit.wikimedia.org/r/765243 (https://phabricator.wikimedia.org/T301415) [11:23:59] (03PS1) 10Elukey: Remove references of revscoring-editquality for ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/765244 (https://phabricator.wikimedia.org/T301415) [11:25:22] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300774)', diff saved to https://phabricator.wikimedia.org/P21361 and previous config saved to /var/cache/conftool/dbconfig/20220223-112522-kormat.json [11:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:29] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:25:57] (03CR) 10Elukey: [C: 03+2] ml-services: deprecate the revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765242 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [11:26:04] (03CR) 10Klausman: [C: 03+1] ml-services: deprecate the revscoring-editquality ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/765242 (https://phabricator.wikimedia.org/T301415) (owner: 
10Elukey) [11:26:17] (03CR) 10Klausman: [C: 03+1] Remove references to revscoring-editquality [labs/private] - 10https://gerrit.wikimedia.org/r/765243 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [11:26:26] (03CR) 10Klausman: [C: 03+1] Remove references of revscoring-editquality for ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/765244 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [11:26:56] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:28:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:28:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:36] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:46] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[11:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: host reimage [11:29:12] (03CR) 10Elukey: [V: 03+2 C: 03+2] Remove references to revscoring-editquality [labs/private] - 10https://gerrit.wikimedia.org/r/765243 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [11:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:26] (03CR) 10Elukey: [C: 03+2] Remove references of revscoring-editquality for ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/765244 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [11:32:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300992)', diff saved to https://phabricator.wikimedia.org/P21362 and previous config saved to /var/cache/conftool/dbconfig/20220223-113219-ladsgroup.json [11:32:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:32:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:25] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:32:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21363 and previous config saved to /var/cache/conftool/dbconfig/20220223-113226-ladsgroup.json [11:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:40] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: host reimage [11:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:04] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [11:34:10] (03CR) 10JMeybohm: [C: 03+2] Enable ingress and cert-manager in wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/764723 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [11:35:44] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:36:54] (03PS1) 10Aklapper: MFA Phab accounts email: Fix incorrect SQL query; misc improvements [puppet] - 10https://gerrit.wikimedia.org/r/765245 (https://phabricator.wikimedia.org/T302385) [11:37:41] (03Merged) 10jenkins-bot: Enable ingress and cert-manager in wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/764723 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [11:40:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P21364 and previous config saved to /var/cache/conftool/dbconfig/20220223-114026-kormat.json [11:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:26] (03PS7) 10Jbond: spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 [11:41:44] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 73 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:42:06] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:13] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:42:16] (03PS38) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [11:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:33] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [11:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:08] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [11:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:32] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:48:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1181.eqiad.wmnet with OS bullseye [11:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:28] (03Abandoned) 10Jbond: spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 (owner: 10Jbond) [11:52:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21365 and previous config saved to /var/cache/conftool/dbconfig/20220223-115233-ladsgroup.json [11:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:40] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 
[11:53:52] (03CR) 10JMeybohm: [C: 03+2] miscweb: Enable ingress for all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/764749 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [11:55:32] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P21366 and previous config saved to /var/cache/conftool/dbconfig/20220223-115531-kormat.json [11:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:37] (03Merged) 10jenkins-bot: miscweb: Enable ingress for all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/764749 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:02:13] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [12:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:03] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:04:57] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [12:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:28] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:07:32] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21367 and previous config saved to /var/cache/conftool/dbconfig/20220223-120738-ladsgroup.json [12:07:39] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. 
[12:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:13] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is CRITICAL: 14 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:08:58] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:36] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300774)', diff saved to https://phabricator.wikimedia.org/P21368 and previous config saved to /var/cache/conftool/dbconfig/20220223-121036-kormat.json [12:10:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:10:39] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:44] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:53] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:12:11] (03PS1) 10Jbond: C:package_builder: install tools to build node packages [puppet] - 10https://gerrit.wikimedia.org/r/765250 [12:12:55] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is OK: 3 snaps in the admin project 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:14:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33949/console" [puppet] - 10https://gerrit.wikimedia.org/r/765250 (owner: 10Jbond) [12:14:22] (03CR) 10Muehlenhoff: "node-babel7 is only in bullseye, this will need an os_release condition since deneb is still around for a few weeks." [puppet] - 10https://gerrit.wikimedia.org/r/765250 (owner: 10Jbond) [12:15:41] (03CR) 10Jbond: "The errors you see in CI are due to the fact that our puppet-lint plug-in expects this define to exists in all roles. We would first need" [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [12:15:59] (03CR) 10Muehlenhoff: C:package_builder: install tools to build node packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765250 (owner: 10Jbond) [12:17:30] (03CR) 10Jbond: "further i think we could maybe add this functionality to profile::base with and use the $::_role variable" [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [12:20:24] (03PS1) 10Vgutierrez: aptrepo: Add thirdparty/haproxy24 component [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) [12:21:47] (03CR) 10Muehlenhoff: C:package_builder: install tools to build node packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765250 (owner: 10Jbond) [12:22:25] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [12:22:37] (03CR) 10Muehlenhoff: aptrepo: Add thirdparty/haproxy24 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:22:43] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21369 and previous config saved to /var/cache/conftool/dbconfig/20220223-122242-ladsgroup.json [12:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:16] (03PS2) 10Vgutierrez: aptrepo: Add thirdparty/haproxy24 component [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) [12:24:26] (03CR) 10Vgutierrez: aptrepo: Add thirdparty/haproxy24 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:24:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:24:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T300774)', diff saved to https://phabricator.wikimedia.org/P21370 and previous config saved to /var/cache/conftool/dbconfig/20220223-122449-kormat.json [12:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:57] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:25:01] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [12:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:13] (03PS1) 10Ladsgroup: Revert "db1181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765210 [12:26:00] (03PS2) 10Ladsgroup: Revert "db1181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765210 [12:26:04] (03CR) 10Ladsgroup: [V: 03+2 C: 
03+2] Revert "db1181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765210 (owner: 10Ladsgroup) [12:26:36] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [12:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:25] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:27] (03PS1) 10Kevin Bazira: ml-services: add hrwiki, huwiki, idwiki & iswiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/765254 (https://phabricator.wikimedia.org/T301415) [12:27:46] (03PS1) 10Ladsgroup: db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765255 (https://phabricator.wikimedia.org/T302363) [12:28:39] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765255 (https://phabricator.wikimedia.org/T302363) (owner: 10Ladsgroup) [12:30:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300774)', diff saved to https://phabricator.wikimedia.org/P21372 and previous config saved to /var/cache/conftool/dbconfig/20220223-123017-kormat.json [12:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:23] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:32:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21373 and previous config saved to /var/cache/conftool/dbconfig/20220223-123246-ladsgroup.json [12:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:53] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:34:49] (03PS1) 10Jbond: 
P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 [12:35:08] (03CR) 10Muehlenhoff: aptrepo: Add thirdparty/haproxy24 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:35:55] (03CR) 10jerkins-bot: [V: 04-1] P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [12:36:26] (03CR) 10Jbond: [V: 03+1] "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/765250 (owner: 10Jbond) [12:37:18] (03PS2) 10Jbond: P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 [12:37:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21374 and previous config saved to /var/cache/conftool/dbconfig/20220223-123747-ladsgroup.json [12:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:54] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [12:38:00] (03CR) 10jerkins-bot: [V: 04-1] P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [12:38:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33951/console" [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [12:40:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:40:22] I am switching the operations-puppet-tests-buster-docker Jenkins job to a new instance (Stretch > Bullseye) [12:40:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: 
Maintenance [12:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21375 and previous config saved to /var/cache/conftool/dbconfig/20220223-124027-ladsgroup.json [12:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:34] which in practice should be almost a noop since everything runs inside a Docker container [12:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:53] (03PS4) 10JMeybohm: Add k8s-ingress-wikikube discovery record [dns] - 10https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966) [12:40:58] (03PS3) 10Jbond: P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 [12:41:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765250 (owner: 10Jbond) [12:41:38] (03CR) 10jerkins-bot: [V: 04-1] P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [12:44:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1174.eqiad.wmnet with OS bullseye [12:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:22] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P21376 and previous config saved to /var/cache/conftool/dbconfig/20220223-124521-kormat.json [12:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:47] (03PS1) 10Elukey: kserve-inference: dry model config for revscoring_inference_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/765260 (https://phabricator.wikimedia.org/T301415) [12:45:55] RECOVERY - Check systemd state on netbox1001 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21377 and previous config saved to /var/cache/conftool/dbconfig/20220223-124751-ladsgroup.json [12:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:32] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765260 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) [12:55:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1174.eqiad.wmnet with reason: host reimage [12:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:11] (03PS4) 10Jbond: P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 [12:59:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1174.eqiad.wmnet with reason: host reimage [12:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33954/console" [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [13:00:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P21378 and previous config saved to /var/cache/conftool/dbconfig/20220223-130026-kormat.json [13:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:51] (03CR) 10Elukey: [C: 03+2] kserve-inference: dry model config for revscoring_inference_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/765260 (https://phabricator.wikimedia.org/T301415) (owner: 10Elukey) 
[13:02:23] (03CR) 10Hashar: ci: Qemu image and snapshot creation (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [13:02:31] (03PS18) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [13:02:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21379 and previous config saved to /var/cache/conftool/dbconfig/20220223-130255-ladsgroup.json [13:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:07] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcontrol1004, ms-be2066, cloudcontrol1003, ms-be2068, cloudcontrol1005 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [13:04:35] (03CR) 10Hashar: "I have cherry picked PS18 on integration-puppetmaster-02 . 
On integration-agent-qemu-1003 I have deleted /srv/vm-images/*qcow2 and I am ru" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [13:06:54] (03CR) 10JMeybohm: [C: 03+2] Add LVS servie k8s-ingress-wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [13:09:37] (03PS1) 10Elukey: kserve-inference: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/765264 [13:11:18] (03PS5) 10Jbond: P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 [13:11:20] (03PS1) 10Jbond: motd::message: add new define for simple motd entries [puppet] - 10https://gerrit.wikimedia.org/r/765265 [13:12:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33955/console" [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [13:14:22] (03CR) 10Elukey: [C: 03+2] kserve-inference: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/765264 (owner: 10Elukey) [13:14:24] (03CR) 10Jbond: Rename system::role to base::add_motd_role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [13:14:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1174.eqiad.wmnet with OS bullseye [13:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:47] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:package_builder: install tools to build node packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765250 (owner: 10Jbond) [13:15:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10Volans) [13:15:31] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1179 (T300774)', diff saved to https://phabricator.wikimedia.org/P21380 and previous config saved to /var/cache/conftool/dbconfig/20220223-131531-kormat.json [13:15:33] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:15:34] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:37] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300992)', diff saved to https://phabricator.wikimedia.org/P21381 and previous config saved to /var/cache/conftool/dbconfig/20220223-131801-ladsgroup.json [13:18:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:18:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:08] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:22] (03PS1) 10Bartosz Dziewoński: Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/765211 (https://phabricator.wikimedia.org/T302388) [13:18:30] (03PS1) 10Bartosz Dziewoński: Fix check for enabling features on 
mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765212 (https://phabricator.wikimedia.org/T302388) [13:19:37] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti2029/ganeti2030 Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/765201 (https://phabricator.wikimedia.org/T298998) (owner: 10Muehlenhoff) [13:19:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:27] (03PS1) 10Jbond: C:package_builder: only install node-babel7 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/765267 [13:23:33] !log debugging on mwdebug1002 [13:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:37] err. didn't mean to log [13:23:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33959/console" [puppet] - 10https://gerrit.wikimedia.org/r/765267 (owner: 10Jbond) [13:23:43] (03CR) 10Jbond: "FYI there where dependency issues on buster so i have moved to bullseye and will build on build2001.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/765267 (owner: 10Jbond) [13:23:52] (03CR) 10Jbond: [C: 03+2] C:package_builder: only install node-babel7 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/765267 (owner: 10Jbond) [13:25:16] (03PS2) 10Kevin Bazira: ml-services: add hrwiki, huwiki, idwiki & iswiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/765254 (https://phabricator.wikimedia.org/T301415) [13:29:35] (03CR) 10Phuedx: [C: 03+1] Update Event Stream for IPInfo events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [13:30:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with 
reason: Maintenance [13:30:22] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [13:30:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:27] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:32] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T300774)', diff saved to https://phabricator.wikimedia.org/P21383 and previous config saved to /var/cache/conftool/dbconfig/20220223-133031-kormat.json [13:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:45] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:30:49] (03CR) 10Klausman: [C: 03+1] ml-services: add hrwiki, huwiki, idwiki & iswiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/765254 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [13:32:00] (03CR) 10Elukey: [C: 03+2] ml-services: add hrwiki, huwiki, idwiki & iswiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/765254 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [13:32:51] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:54] (03CR) 10Hashar: [C: 03+1] "Tested and it 
works. I have confirmed the CI job works with the new image as well :)" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [13:35:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300774)', diff saved to https://phabricator.wikimedia.org/P21384 and previous config saved to /var/cache/conftool/dbconfig/20220223-133559-kormat.json [13:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:05] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:36:38] jouncebot: nowandnext [13:36:38] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [13:36:38] In 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T1400) [13:37:39] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:37:42] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:25] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21385 and previous config saved to /var/cache/conftool/dbconfig/20220223-133858-ladsgroup.json [13:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:06] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [13:39:40] !log import libvmod-netmapper_1.9-1.dsc and libvmod-netmapper_1.9-1_amd64.deb to main component - T302301 [13:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:46] T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 [13:41:24] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:57] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:12] (03PS1) 10Ayounsi: Prepend AS to anycast prefixes learned on the core routers [homer/public] - 10https://gerrit.wikimedia.org/r/765268 (https://phabricator.wikimedia.org/T302315) [13:45:20] !log Deployed patch for T302192 [13:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:41] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:03] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[13:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:01] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:04] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P21386 and previous config saved to /var/cache/conftool/dbconfig/20220223-135103-kormat.json [13:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:40] !log import libvmod-re2_1.5.3-1.dsc and libvmod-re2_1.5.3-1_amd64.deb to main component - T302301 [13:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:46] T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 [13:54:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21387 and previous config saved to /var/cache/conftool/dbconfig/20220223-135404-ladsgroup.json [13:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:56:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] RoanKattouw, Lucas_WMDE, and 
Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T1400). [14:00:04] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] o/ [14:00:10] o/ [14:00:16] I can deploy today! [14:00:20] hi [14:00:29] hi MatmaRex [14:00:42] (03CR) 10Urbanecm: [C: 03+2] Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/765211 (https://phabricator.wikimedia.org/T302388) (owner: 10Bartosz Dziewoński) [14:00:44] (03CR) 10Urbanecm: [C: 03+2] Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765212 (https://phabricator.wikimedia.org/T302388) (owner: 10Bartosz Dziewoński) [14:02:01] i might also want to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/765213 [14:03:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10MatthewVernon) [14:03:22] actually, i think i don't want to, until someone reviews it [14:03:42] if i do just a revert, then it has localisation changes, which are annoying to backport (right?) 
[14:03:48] indeed [14:03:48] and if i make other changes, then i' prefer a review [14:03:54] i'd* [14:04:07] (03PS2) 10JMeybohm: Move k8s-ingress-wikikube to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/764734 (https://phabricator.wikimedia.org/T290966) [14:04:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10MatthewVernon) [14:04:49] (03Merged) 10jenkins-bot: Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/765211 (https://phabricator.wikimedia.org/T302388) (owner: 10Bartosz Dziewoński) [14:04:52] but we can do i18n changes too if the reason for the revert is an urgent problem :) [14:05:00] (03CR) 10JMeybohm: [C: 03+2] Move k8s-ingress-wikikube to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/764734 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:05:06] !log import varnish_6.0.10-1wm1.dsc, varnish_6.0.10-1wm1_amd64.deb, varnish-dbg_6.0.6-1wm1_amd64.deb, varnish-dbgsym_6.0.10-1wm1_amd64.deb, varnish-doc_6.0.10-1wm1_all.deb to main component - T302301 [14:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:12] i mean, it's Special:ApiSandbox [14:05:13] T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 [14:05:21] so probably not that urgent [14:05:24] (03Merged) 10jenkins-bot: Fix check for enabling features on mobile [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765212 (https://phabricator.wikimedia.org/T302388) (owner: 10Bartosz Dziewoński) [14:05:26] (03PS2) 10JMeybohm: Move k8s-ingress-wikikube to state: monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/764735 (https://phabricator.wikimedia.org/T290966) [14:05:30] your call :) [14:06:07] MatmaRex: both backports are at mwdebug1001 now, can you test? 
[14:06:09] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P21388 and previous config saved to /var/cache/conftool/dbconfig/20220223-140608-kormat.json [14:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:22] yeah. looking [14:06:35] (03PS1) 10Ssingh: test_dns: update EDNS client subnet test for IPv6 [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/765270 [14:08:21] !log restarting pybal on lvs1020,lvs2010 - T290966 [14:08:21] urbanecm: seems good [14:08:25] (03CR) 10Ssingh: [C: 03+2] test_dns: update EDNS client subnet test for IPv6 [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/765270 (owner: 10Ssingh) [14:08:25] syncing [14:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:26] T290966: Implement POC for istio ingress - https://phabricator.wikimedia.org/T290966 [14:09:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21389 and previous config saved to /var/cache/conftool/dbconfig/20220223-140908-ladsgroup.json [14:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:21] urbanecm: ping me when done please? 
[14:09:26] sure thing [14:09:46] unless MatmaRex wants me to deploy anything else, should be just two syncs [14:10:01] that is all [14:10:07] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.22/extensions/DiscussionTools/includes/Hooks/HookUtils.php: 815b3d1: Fix check for enabling features on mobile (T302388) (duration: 00m 50s) [14:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] T302388: Discussion Tools features are unexpectedly enabled on mobile ([reply] links, "Add discussion" button, [subscribe] links) - https://phabricator.wikimedia.org/T302388 [14:11:20] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/DiscussionTools/includes/Hooks/HookUtils.php: 78f0d9d: Fix check for enabling features on mobile (T302388) (duration: 00m 49s) [14:11:24] MatmaRex: should be live [14:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:30] taavi: the floor is yours [14:11:36] thanks [14:11:50] thanks [14:11:52] deploying the updated patch for https://phabricator.wikimedia.org/T302248 [14:11:52] !log import libvarnishapi2_6.0.10-1wm1_amd64.deb, libvarnishapi2-dbgsym_6.0.10-1wm1_amd64.deb, libvarnishapi-dev_6.0.10-1wm1_amd64.deb to main component - T302301 [14:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:58] T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 [14:12:27] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.70:30443]) https://wikitech.wikimedia.org/wiki/PyBal [14:12:45] !log restarting pybal on lvs1019,lvs2009 - T290966 [14:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:04] (03PS3) 
10Vgutierrez: aptrepo: Add thirdparty/haproxy24 component [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) [14:13:14] (03CR) 10Vgutierrez: aptrepo: Add thirdparty/haproxy24 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:13:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:13:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:39] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:14:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:14:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:37] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:47] (03CR) 10JMeybohm: [C: 03+2] Move k8s-ingress-wikikube to state: monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/764735 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:16:05] (03PS2) 10JMeybohm: Move k8s-ingress-wikikube to state: production [puppet] - 10https://gerrit.wikimedia.org/r/764736 (https://phabricator.wikimedia.org/T290966) [14:16:12] syncing [14:17:35] !log deploy second patch for 
T302248 [14:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:48] anyone have anything else to deploy? [14:18:04] i don't think so [14:18:22] !log UTC afternoon deploys done [14:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:51] !log import varnish-modules_0.15.0-1+wmf1.dsc, varnish-modules-dbgsym_0.15.0-1+wmf1_amd64.deb, varnish-modules_0.15.0-1+wmf1_amd64.deb to main component - T302301 [14:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:57] T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 [14:19:25] me done testing on mwdebug1002 [14:19:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1031.eqiad.wmnet with OS buster [14:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... [14:19:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:19:55] (03CR) 10David Caro: [C: 03+2] discovery_dashboards: remove unused profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/763792 (https://phabricator.wikimedia.org/T227782) (owner: 10Bearloga) [14:20:02] * urbanecm didn't know Krinkle was testing. 
Hopefully the B&C deploys didn't interfere (much) :) [14:20:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:14] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300774)', diff saved to https://phabricator.wikimedia.org/P21390 and previous config saved to /var/cache/conftool/dbconfig/20220223-142113-kormat.json [14:21:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:21:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:21:18] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:19] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:21:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T300774)', diff saved to https://phabricator.wikimedia.org/P21391 and previous config saved to /var/cache/conftool/dbconfig/20220223-142121-kormat.json [14:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) 05Open→03Resolved a:03dcaro I think this is ready to be closed! \o/ There's some related patches pending, but those are not directly these anymore. 
[14:22:26] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10dcaro) [14:22:28] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) @fgiunchedi thanks will check and see why the drive is missing. [14:24:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns5001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:24:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21392 and previous config saved to /var/cache/conftool/dbconfig/20220223-142413-ladsgroup.json [14:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:19] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [14:25:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:25:06] (03CR) 10JMeybohm: [C: 03+2] Move k8s-ingress-wikikube to state: production [puppet] - 10https://gerrit.wikimedia.org/r/764736 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:25:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:48] !log import varnishkafka_1.1.0-1_amd64.deb, varnishkafka_1.1.0-1.dsc, varnishkafka-dbg_1.1.0-1_amd64.deb to main component - T302301 [14:26:54] !log kormat@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1123 (T300774)', diff saved to https://phabricator.wikimedia.org/P21393 and previous config saved to /var/cache/conftool/dbconfig/20220223-142652-kormat.json [14:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:55] T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 [14:26:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:03] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:16] !log import varnishkafka_1.1.0-1_amd64.deb, varnishkafka_1.1.0-1.dsc, varnishkafka-dbg_1.1.0-1_amd64.deb to main component - T300164 [14:29:16] T300164: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 [14:29:51] PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) > As far as this task goes to me it still remains a mystery why it looks l... 
[14:31:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:59] RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [14:33:03] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on authdns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:33:05] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns5002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:33:11] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:33:13] thats me [14:33:50] (03PS1) 10MVernon: admin: add mhay, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765272 (https://phabricator.wikimedia.org/T301782) [14:34:22] (03PS1) 10JMeybohm: Add k8s-ingress-wikikube to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/765273 (https://phabricator.wikimedia.org/T300740) [14:34:39] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:35:39] (03CR) 10JMeybohm: [C: 03+2] Add k8s-ingress-wikikube to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/765273 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [14:36:33] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:36:33] PROBLEM - Host ms-be2068 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:37] 
RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns5002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:36:45] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-wikikube [14:36:45] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:36:49] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube.state on dns5001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:02] (03CR) 10JMeybohm: [C: 03+2] Add k8s-ingress-wikikube discovery record [dns] - 10https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:38:11] PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1031.eqiad.wmnet with reason: host reimage [14:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1032.eqiad.wmnet with OS buster [14:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... 
[14:40:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1033.eqiad.wmnet with OS buster [14:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... [14:41:15] RECOVERY - Host ms-be2068 is UP: PING WARNING - Packet loss = 33%, RTA = 33.56 ms [14:41:35] (03CR) 10Ssingh: [C: 03+1] admin: add mhay, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765272 (https://phabricator.wikimedia.org/T301782) (owner: 10MVernon) [14:41:58] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P21394 and previous config saved to /var/cache/conftool/dbconfig/20220223-144158-kormat.json [14:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1031.eqiad.wmnet with reason: host reimage [14:42:33] (03CR) 10JMeybohm: [C: 03+2] Add k8s-ingress-wikikube to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/764739 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:26] RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [14:46:36] RECOVERY - Check systemd state on ms-be2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:15] !log power down ms-be2068 for re-image [14:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:19] 
[14:48:33] (03CR) 10MVernon: [C: 03+2] admin: add mhay, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765272 (https://phabricator.wikimedia.org/T301782) (owner: 10MVernon) [14:48:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS stretch [14:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:46] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch [14:50:52] (03CR) 10MMandere: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33960/console" [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere) [14:53:58] 10Puppet, 10Infrastructure-Foundations, 10SRE Observability: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 (10jhathaway) @fgiunchedi very sorry about the breakage, I wish I would have caught that in the review. 
[14:55:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:56:12] (03CR) 10Vgutierrez: [C: 03+2] aptrepo: Add thirdparty/haproxy24 component [puppet] - 10https://gerrit.wikimedia.org/r/765253 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:56:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage [14:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P21395 and previous config saved to /var/cache/conftool/dbconfig/20220223-145703-kormat.json [14:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:16] (03PS1) 10MVernon: admin: add skyenet, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765280 (https://phabricator.wikimedia.org/T301581) [14:58:48] (03CR) 10Vgutierrez: [C: 03+1] "it looks good" [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere) [14:59:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1031.eqiad.wmnet with OS buster [14:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... 
[14:59:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage [15:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10MatthewVernon) 05In progress→03Resolved a:03MatthewVernon Done. [15:03:48] !log installing expat security updates [15:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [15:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:07:14] this is me testing ^ (wdqs@codfw is depooled) [15:07:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [15:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:00] /win 5 [15:12:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300774)', diff saved to https://phabricator.wikimedia.org/P21396 and previous config saved to /var/cache/conftool/dbconfig/20220223-151207-kormat.json [15:12:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1033.eqiad.wmnet with OS buster [15:12:12] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [15:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:13] !log kormat@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [15:12:14] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [15:12:15] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [15:12:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... [15:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:15:11] 10SRE, 10Discovery, 10Infrastructure-Foundations, 10netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) Per Cathal's feedback above, we are closing this ticket as he correctly stated "it represents significant risk for what seems to be scant benefit.... 
[15:15:48] 10SRE, 10Discovery, 10Infrastructure-Foundations, 10netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) 05Open→03Resolved [15:17:49] !log rolling restart of FPM and Apache on mediawiki canaries to pick up expat security updates [15:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:19:40] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1032.eqiad.wmnet with OS buster [15:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... [15:21:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:23:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:26:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) epic task! 
kudos for finishing it [15:26:37] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2068.codfw.wmnet with OS stretch [15:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:42] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch executed with errors: - m... [15:28:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:28:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:29:01] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:46] (03CR) 10Bearloga: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/764318 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:30:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:30:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:45] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T300774)', diff saved to https://phabricator.wikimedia.org/P21397 and previous config saved to /var/cache/conftool/dbconfig/20220223-153044-kormat.json [15:30:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:50] (03CR) 10Jbond: [C: 03+1] "lgrm" [puppet] - 10https://gerrit.wikimedia.org/r/765280 (https://phabricator.wikimedia.org/T301581) (owner: 10MVernon) [15:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:53] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:30:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765280 (https://phabricator.wikimedia.org/T301581) (owner: 10MVernon) [15:31:32] (03CR) 10MVernon: [C: 03+2] admin: add skyenet, krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/765280 (https://phabricator.wikimedia.org/T301581) (owner: 10MVernon) [15:33:49] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:36:12] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300774)', diff saved to https://phabricator.wikimedia.org/P21398 and previous config saved to /var/cache/conftool/dbconfig/20220223-153611-kormat.json [15:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:18] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:36:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS stretch [15:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch [15:38:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10MatthewVernon) 05In progress→03Resolved a:03MatthewVernon Done. [15:42:25] (03CR) 10Jbond: "lgtm see nits" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [15:43:19] (03PS2) 10Jbond: O:netbox::standalone: remove netboxdb2001 as replica [puppet] - 10https://gerrit.wikimedia.org/r/764438 [15:43:19] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [15:44:10] (03CR) 10Jbond: [C: 03+2] O:netbox::standalone: remove netboxdb2001 as replica [puppet] - 10https://gerrit.wikimedia.org/r/764438 (owner: 10Jbond) [15:49:45] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:16] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P21399 and previous config saved to /var/cache/conftool/dbconfig/20220223-155116-kormat.json [15:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [15:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:18] (03CR) 10JHathaway: Rename system::role to base::add_motd_role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [15:55:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [15:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:32] PROBLEM - WDQS high update lag on wdqs2001 is CRITICAL: 6.988e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:56:58] PROBLEM - WDQS high update lag on wdqs2002 is CRITICAL: 6.91e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:57:32] PROBLEM - WDQS high update lag on wdqs2003 is CRITICAL: 6.88e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:00:06] !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia - T290005 [16:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:12] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [16:01:32] PROBLEM - WDQS high update lag on wdqs2007 is CRITICAL: 6.368e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:01:52] 10Puppet, 10Infrastructure-Foundations, 10SRE Observability: prometheus-statsd-exporter failure to start due to invalid yaml config - https://phabricator.wikimedia.org/T302372 (10fgiunchedi) No worries @jhathaway ! It was a combination of factors that meant deployment would fail silently too :( i.e. no puppe... 
[16:03:12] PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 6.196e+07 ge 3.6e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:04:12] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:04:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:04:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:35] (03PS1) 10Vgutierrez: cache::haproxy: Use HAProxy 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) [16:05:44] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 26 Apr 2022 08:09:10 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:54] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 26 Apr 2022 08:09:10 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P21400 and previous config saved to /var/cache/conftool/dbconfig/20220223-160621-kormat.json [16:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:34] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:08:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:09:55] (03PS2) 10Vgutierrez: cache::haproxy: Use HAProxy 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) [16:13:08] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33963/console" [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:14:32] PROBLEM - WDQS high update lag on wdqs2008 is CRITICAL: 4.911e+07 ge 3.6e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:21:26] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300774)', diff saved to https://phabricator.wikimedia.org/P21401 and previous config saved to /var/cache/conftool/dbconfig/20220223-162125-kormat.json [16:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:32] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [16:23:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2068.codfw.wmnet with OS stretch [16:23:05] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch completed: - ms-be2068 (*... [16:25:45] PROBLEM - WDQS high update lag on wdqs2006 is CRITICAL: 3.186e+07 ge 3.6e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:27:51] RECOVERY - WDQS high update lag on wdqs2001 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 1.999e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:30:18] (03CR) 10Hnowlan: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [16:30:35] PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:12] (03PS1) 10Ladsgroup: Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765221 [16:31:19] (03PS2) 10Ladsgroup: Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765221 [16:31:43] RECOVERY - WDQS high update lag on wdqs2003 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.094e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:32:28] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765221 (owner: 10Ladsgroup) [16:32:33] (03CR) 10JHathaway: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [16:34:02] 10SRE, 10Gerrit, 
10serviceops: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10hashar) `gerrit2001.wikimedia.org` is a replica and can also be used as a spare to switch the primary service. It also serves repos over `gerrit-replica.wikimedia.org` which is used by various scripts an... [16:34:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10BBlack) >>! In T302265#7731305, @fgiunchedi wrote: > The current pings from promet... [16:35:39] RECOVERY - WDQS high update lag on wdqs2002 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 4.793e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:36:20] (03PS1) 10Ladsgroup: db1127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765308 (https://phabricator.wikimedia.org/T302363) [16:38:22] (03CR) 10Hnowlan: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [16:38:47] (03CR) 10Jbond: "see comments inline, have also added Moritz who may have a view. 
Also regardless of the inline comments im also happy to go with just the" [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [16:39:02] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765308 (https://phabricator.wikimedia.org/T302363) (owner: 10Ladsgroup) [16:41:19] RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [16:42:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS stretch [16:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch [16:43:19] RECOVERY - WDQS high update lag on wdqs2008 is OK: (C)3.6e+06 ge (W)1.2e+06 ge 7.68e+05 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:43:49] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
In terms of eventual re-use I replicated this (in templates/asw/policy-options.conf) for the LSWs as they need it, but I didn't wan" [homer/public] - 10https://gerrit.wikimedia.org/r/765268 (https://phabricator.wikimedia.org/T302315) (owner: 10Ayounsi) [16:44:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [16:44:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [16:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21403 and previous config saved to /var/cache/conftool/dbconfig/20220223-164453-ladsgroup.json [16:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:02] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [16:45:28] (03CR) 10Jbond: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [16:46:41] RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3.6e+06 ge (W)1.2e+06 ge 1.046e+06 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:47:31] RECOVERY - WDQS high update lag on wdqs2006 is OK: (C)3.6e+06 ge (W)1.2e+06 ge 2.43e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:48:18] (03CR) 10Majavah: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [16:48:35] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.reimage for host db1127.eqiad.wmnet with OS bullseye [16:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:32] 10SRE, 10Infrastructure-Foundations: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF) [16:49:49] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF) [16:50:25] (03CR) 10JHathaway: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [16:50:41] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:54:17] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Jdforrester-WMF) [16:55:07] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 [16:55:38] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF) [16:56:02] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF) [16:57:51] (03CR) 10jerkins-bot: [V: 04-1] sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 (owner: 10Muehlenhoff) [16:58:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1127.eqiad.wmnet with reason: host reimage [16:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:37] RECOVERY - WDQS high update lag on wdqs2007 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.437e+04 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:00:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1127.eqiad.wmnet with reason: host reimage [17:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:08] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:06:17] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:14:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1127.eqiad.wmnet with OS bullseye [17:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2066.codfw.wmnet with OS stretch [17:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:46] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors: - m... 
[17:19:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [17:20:33] (03PS1) 10BBlack: eqiad lvs: add interfaces and IPs for rows E and F [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) [17:21:42] (03CR) 10BBlack: "Note, I've already reserved .17-.20 in all 8 of the vlans in netbox, too. Seemed the simplest scheme for now, given there's already a few" [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack) [17:21:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS stretch [17:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:57] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch [17:22:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21404 and previous config saved to /var/cache/conftool/dbconfig/20220223-172206-ladsgroup.json [17:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:13] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [17:23:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage [17:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - 
https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [17:26:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage [17:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:23] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10JoKalliauer) [17:30:58] (03PS1) 10Hnowlan: restbase: disable redundant jmx config [puppet] - 10https://gerrit.wikimedia.org/r/765313 (https://phabricator.wikimedia.org/T295375) [17:35:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:35:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21406 and previous config saved to /var/cache/conftool/dbconfig/20220223-173711-ladsgroup.json [17:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:44] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:40:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:44:33] (03CR) 10JHathaway: Add nagios_core & mailalias_core modules (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [17:44:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [17:44:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [17:45:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2066.codfw.wmnet with OS stretch [17:45:29] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10TheresNoTime) Hey all, sorry for the delay, I tested positive for COVID on Sunday and its been a little rough! Thank you //all// for the comments—I absolutely respect and appreciate tho... [17:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch completed: - ms-be2066 (*... 
[17:45:48] (03PS2) 10Hnowlan: restbase: add deployment-restbase04 [puppet] - 10https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375) [17:45:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:46:21] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@17a70a0]: (no justification provided) [17:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:29] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@17a70a0]: (no justification provided) (duration: 00m 07s) [17:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:41] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [17:49:37] (03CR) 10Btullis: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [17:49:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [17:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21407 and previous config saved to /var/cache/conftool/dbconfig/20220223-175217-ladsgroup.json [17:52:18] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 52.26 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:44] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:53:26] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:53:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS stretch [17:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:48] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch [17:56:02] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:50] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:32] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:03:02] (03PS3) 10Hnowlan: restbase: add deployment-restbase04 [puppet] - 10https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375) [18:05:18] (03PS2) 10Ahmon Dancy: mediawiki: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 [18:05:20] (03CR) 10Hnowlan: [C: 03+2] restbase: add deployment-restbase04 [puppet] - 10https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [18:07:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21408 and previous config saved to 
/var/cache/conftool/dbconfig/20220223-180722-ladsgroup.json [18:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:29] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [18:11:15] (03PS3) 10Ahmon Dancy: mediawiki: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 [18:11:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney) [18:11:55] 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (10cmooney) 05Open→03Resolved Thanks to John and Chris for the help on this, all done with the testing now. I've set the 3 servers back to the status they'd have been after regular provision, so they can be image... [18:12:03] (03PS1) 10Ladsgroup: db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765316 (https://phabricator.wikimedia.org/T302363) [18:12:41] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765316 (https://phabricator.wikimedia.org/T302363) (owner: 10Ladsgroup) [18:13:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [18:13:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [18:13:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:13:48] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21409 and previous config saved to /var/cache/conftool/dbconfig/20220223-181350-ladsgroup.json [18:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:02] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [18:15:55] 10SRE, 10ops-eqiad, 10Patch-For-Review: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10RobH) [18:16:44] (03PS1) 10Majavah: service: generate config yaml in puppet instead of via templates [puppet] - 10https://gerrit.wikimedia.org/r/765317 [18:17:48] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33964/console" [puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah) [18:18:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1158.eqiad.wmnet with OS bullseye [18:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:01] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33965/console" [puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah) [18:19:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (10RobH) [18:19:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10RobH) [18:19:58] (03PS2) 10Majavah: 
service: generate config yaml in puppet instead of via templates [puppet] - 10https://gerrit.wikimedia.org/r/765317 [18:20:10] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10RobH) [18:20:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [18:20:44] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10RobH) [18:20:47] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33966/console" [puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah) [18:20:49] pro tip: remember to push your updated patch to gerrit before running pcc on it, otherwise you're going to be very confused [18:20:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10RobH) [18:21:06] Good advice [18:21:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install clouddumps100[12] - https://phabricator.wikimedia.org/T299610 (10RobH) [18:21:44] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10RobH) [18:23:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2069.codfw.wmnet with OS stretch [18:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:44] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host 
ms-be2069.codfw.wmnet with OS stretch executed with errors: - m... [18:24:18] (03CR) 10Majavah: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [18:25:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10Dzahn) @RobH (cc: @Jelto ) gitlab1002 has existed as a VM in the past, when contractors used it but the... [18:29:42] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks Brandon. IP reservations in Netbox seem good also." [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack) [18:29:57] 10SRE, 10Wikimedia-Mailing-lists: Wikipedia-l list needs owners - https://phabricator.wikimedia.org/T295244 (10Quiddity) 05Open→03Resolved This was done. ZI_Jony (added, and listed on info-page) and others offered to help, Plus I set the list to "reject with bounce" non-members to deal with the large wave... [18:30:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage [18:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:13] (03CR) 10Cathal Mooney: [C: 03+1] "Change is fine +1. But I'm wondering why it's needed? 
Without the "aggregate" there the routes sent by the ASW should be propagated anyw" [homer/public] - 10https://gerrit.wikimedia.org/r/765240 (owner: 10Ayounsi) [18:33:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage [18:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:58] (03CR) 10JHathaway: [C: 03+1] "looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah) [18:34:34] (03CR) 10JHathaway: Remove ordered_yaml function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [18:35:06] jhathaway: do you want someone else to review that too? I don't have merge rights on the puppet repo [18:35:37] (03CR) 10Cathal Mooney: [C: 03+1] "Makes sense to me but I'm no expert on puppetcode. Logic seems good +1." [puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi) [18:36:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10RobH) a:05Jclark-ctr→03LSobanski @lsobanski: Is it ok to shift these hostnames from gitlab100[23] to gi... [18:36:01] taavi: yes, I would defer to hnowlan, as I don't have any knowledge of that service [18:36:20] ack, makes sense [18:36:45] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:38:08] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, I should have set msw1-eqiad as parent for LSWs too I realize, will add." [puppet] - 10https://gerrit.wikimedia.org/r/764725 (owner: 10Ayounsi) [18:39:56] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? 
- https://phabricator.wikimedia.org/T302423 (10Aklapper) assuming this is about #puppet [18:41:57] (03PS2) 10Cathal Mooney: Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) [18:42:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:43:45] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10Dzahn) To start with I would just like to add a bit of info that we have a history of using git submodules inside the puppet repo and not liking them and then moving away from them again, whic... [18:44:37] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10Proc) Specifically in the case of T302047, I would prefer that active contributors on primarily single Wikipedias //not// be deploying those patches. For example, and without getting in... [18:44:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [18:45:06] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jhathaway) >>! In T302423#7733059, @Dzahn wrote: > To start with I would just like to add a bit of info that we have a history of using git submodules inside the puppet repo and not liking the... 
[18:45:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [18:46:48] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond) @jhathaway thanks for writing this up just a few quick comments. Before commenting i would say that in my mind we have [[ https://phabricator.wikimedia.org/T265138#7041244 | four type... [18:49:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1158.eqiad.wmnet with OS bullseye [18:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [18:51:59] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond) p:05Triage→03Medium [18:52:28] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? 
- https://phabricator.wikimedia.org/T302423 (10jbond) [18:52:31] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [18:54:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [18:57:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21410 and previous config saved to /var/cache/conftool/dbconfig/20220223-185740-ladsgroup.json [18:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:47] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [18:58:58] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10Dzahn) >>! In T302423#7733064, @jhathaway wrote: >>>! In T302423#7733059, @Dzahn wrote: >> To start with I would just like to add a bit of info that we have a history of using git submodules i... [19:00:04] dduvall and hashar: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T1900). [19:00:04] dduvall and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T1900). [19:06:32] Amir1: Do we rotate primary DBs for the OS upgrades, or will finishing the work be stalled on the next DC switch-over? 
[19:06:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10Volans) [19:07:09] James_F: for most I think we will do a switchover, s6 is already planned [19:07:23] T300471 [19:07:23] T300471: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T300471 [19:07:38] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) >>! In T301428#7730915, @Joe wrote: > Without knowing more about the type of data and your access patterns, it's hard to provide a good... [19:08:10] but only core dbs left, es, m, pc, and x are already had swichovers [19:08:20] Right. s7 going RO for a few minutes isn't terrible though. [19:08:32] * James_F nods. [19:09:08] s8 and s4 would be the hard ones, I guess. [19:09:33] yeah [19:09:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [19:10:07] But single wikis, so they could cope if it's 15 minutes not 5. [19:10:22] there are a lot of schema changes pending for primary switchover as well. 
See the list T301312 [19:10:22] T301312: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T301312 [19:10:24] Whereas s3 going down for 15 minutes would make a bunch of people whine [19:10:26] T300402 T300992 T300381 T298554 [19:10:27] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [19:10:27] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [19:10:28] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [19:10:28] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [19:10:31] * James_F nods. [19:11:36] the RO time is around a minute these days [19:11:59] Yeah, I'm just being pessimistic if something goes wrong. [19:12:12] honestly if we can automate it a bit, it should be done fully automatically and unannounced if you ask me :D [19:12:21] yeah [19:12:26] Right. [19:12:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21411 and previous config saved to /var/cache/conftool/dbconfig/20220223-191245-ladsgroup.json [19:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:50] Just in time for k8s for everything else, so the scale value won't be high. :-) [19:13:33] Automated tools that make everything very easy are great when we have 2000 boxes, but a bit dull when we have 100 boxes plus 2000 k8s pods. [19:14:38] dbs won't be in k8s (did I misunderstand you?) [19:14:59] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10dr0ptp4kt) Thanks all. @MatthewVernon I'm delegating responsibility on research and response on this to my direct report, @SCherukuwada (Senior Engineering Manager, Web team), who i... 
[19:15:22] generally stateful services should not go to containers [19:15:32] Yeah, the DBs will still be the 100. [19:15:57] it's around 300 these days :P [19:16:08] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:16:38] Hah. Ouch. [19:16:39] we have to do schema changes 100 times because for codfw we just run them on primary and it gets replicated [19:17:02] Right. [19:18:10] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:19:51] 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD insto centrallog1001 - https://phabricator.wikimedia.org/T302437 (10RobH) [19:20:13] 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD insto centrallog1001 - https://phabricator.wikimedia.org/T302437 (10RobH) [19:20:52] Testing scap mods on deploy server for a few minutes [19:22:29] 🍿 [19:26:35] !log dancy@deploy1002 Started scap: testing [19:26:36] PROBLEM - Check systemd state on ms-be2066 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sdx1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:27] !log dancy@deploy1002 scap failed: CalledProcessError Command 'make -f Makefile build-and-push-all-images GIT_BASE=https://gerrit.wikimedia.org/r/ BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver' returned non-zero exit status 2. 
(duration: 00m 51s) [19:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21413 and previous config saved to /var/cache/conftool/dbconfig/20220223-192749-ladsgroup.json [19:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:39] (03PS1) 10Ladsgroup: Revert "db1127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765222 [19:30:47] (03PS1) 10Ladsgroup: Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765223 [19:30:54] (03PS2) 10Ladsgroup: Revert "db1127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765222 [19:30:59] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765222 (owner: 10Ladsgroup) [19:31:13] (03PS2) 10Ladsgroup: Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765223 [19:31:17] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765223 (owner: 10Ladsgroup) [19:32:30] !log dancy@deploy1002 Started scap: testing scap container image building [19:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:41] !log dancy@deploy1002 Started scap: testing scap container image building [19:33:45] !log dancy@deploy1002 scap failed: CalledProcessError Command 'make -f Makefile build-and-push-all-images GIT_BASE=https://gerrit.wikimedia.org/r/ BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver' returned non-zero exit status 2. 
(duration: 00m 03s) [19:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:45] !log dancy@deploy1002 Started scap: testing scap container image building [19:35:48] !log dancy@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/bin/make -C /srv/mwbuilder/release/make-container-image -f Makefile build-and-push-all-images GIT_BASE=https://gerrit.wikimedia.org/r/ BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawik [19:35:48] i-webserver' returned non-zero exit status 2. (duration: 00m 03s) [19:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:44] Done testing for now. [19:42:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21414 and previous config saved to /var/cache/conftool/dbconfig/20220223-194254-ladsgroup.json [19:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:01] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [20:02:21] (03PS1) 10Andrew Bogott: nfs-mounts: remove wikilink project [puppet] - 10https://gerrit.wikimedia.org/r/765331 (https://phabricator.wikimedia.org/T301646) [20:03:36] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: remove wikilink project [puppet] - 10https://gerrit.wikimedia.org/r/765331 (https://phabricator.wikimedia.org/T301646) (owner: 10Andrew Bogott) [20:06:04] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:07:08] PROBLEM - SSH on 
mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:08:33] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:14:28] 10SRE, 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD into centrallog1001 - https://phabricator.wikimedia.org/T302437 (10Reedy) [20:24:45] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jhathaway) >>! In T302423#7733067, @jbond wrote: > Before commenting i would say that in my mind we have [[ https://phabricator.wikimedia.org/T265138#7041244 | four types types of modules ]]... [20:34:34] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond) >>! In T302423#7733421, @jhathaway wrote: > According to [[ https://puppet.com/docs/puppet/6/type.html#puppet-60-type-changes | puppet's docs ]] and my own inspection of Puppet's 6.26... [20:44:34] !log run CentralAuthUser::importLocalNames for FuzzyBot T302399 [20:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:40] T302399: FuzzyBot account is not attached to global account on many Wikimedia wikis - https://phabricator.wikimedia.org/T302399 [20:52:21] (03CR) 10Ssingh: [C: 03+1] "Thanks for working on this! Confirmed NOOP on other hosts as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi) [20:56:48] (03PS1) 10Dduvall: group1 wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765334 [20:56:50] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765334 (owner: 10Dduvall) [20:57:49] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765334 (owner: 10Dduvall) [21:00:05] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. [21:00:18] indeed, nothing to do [21:01:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:02:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:18] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:08:24] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:08:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [21:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:09:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:13] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.23 refs T300199 [21:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:19] T300199: 1.38.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T300199 [21:10:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:46] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.23 refs T300199 (duration: 01m 31s) [21:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:23] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on doh[6001-6002].wikimedia.org with reason: bird6 errors expected, not serving any traffic [21:17:25] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on doh[6001-6002].wikimedia.org with reason: bird6 errors expected, not serving any traffic [21:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:56] (03PS1) 10Andrew Bogott: nfs-mounts: remove account-creation-assistance project [puppet] - 10https://gerrit.wikimedia.org/r/765339 (https://phabricator.wikimedia.org/T301294) [21:54:52] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts: remove 
account-creation-assistance project [puppet] - 10https://gerrit.wikimedia.org/r/765339 (https://phabricator.wikimedia.org/T301294) (owner: 10Andrew Bogott) [21:57:01] (03PS1) 10Reedy: Add table and script for UCoC ratification vote [extensions/SecurePoll] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765225 (https://phabricator.wikimedia.org/T302433) [21:57:08] jouncebot: nowandnext [21:57:09] For the next 0 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220223T2100) [21:57:09] In 3 hour(s) and 2 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T0100) [21:57:53] (03CR) 10Reedy: [C: 03+2] Add table and script for UCoC ratification vote [extensions/SecurePoll] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765225 (https://phabricator.wikimedia.org/T302433) (owner: 10Reedy) [22:00:16] (03Merged) 10jenkins-bot: Add table and script for UCoC ratification vote [extensions/SecurePoll] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765225 (https://phabricator.wikimedia.org/T302433) (owner: 10Reedy) [22:01:30] (03PS1) 10Jbond: (WIP) bolt: Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 [22:02:09] (03CR) 10jerkins-bot: [V: 04-1] (WIP) bolt: Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond) [22:03:53] (03PS2) 10Jbond: (WIP) bolt: Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 [22:04:31] (03CR) 10jerkins-bot: [V: 04-1] (WIP) bolt: Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond) [22:06:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:07:45] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [22:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:59] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/SecurePoll/cli/wm-scripts/ucoc/: (no justification provided) (duration: 00m 50s) [22:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:44] (03CR) 10Jbond: [C: 04-1] "-1 this as it requires puppet > 6" [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond) [22:11:45] (03PS3) 10Jbond: (WIP) bolt: Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 [22:12:24] (03CR) 10jerkins-bot: [V: 04-1] (WIP) bolt: Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond) [22:16:09] (03CR) 10Jbond: [C: 04-1] (WIP) bolt: Add bolt rake tasks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond) [22:19:02] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) 05Open→03Resolved I closed out the ticket and this is now resolved. 
[22:19:18] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) [22:32:51] !quip "svn->git migration is not completely trivial, due to the free-form nature of SVN repos --valhallasw in 2014 on T60801 [22:32:52] T60801: Copy contents of https://svn.toolserver.org/ to Wikimedia Diffusion - https://phabricator.wikimedia.org/T60801 [22:33:04] !quip help [22:33:59] !bash "svn->git migration is not completely trivial, due to the free-form nature of SVN repos --valhallasw in 2014 on T60801 [22:33:59] mutante: Stored quip at https://bash.toolforge.org/quip/Gem4KH8Ba_6PSCT9RHkp [22:35:11] quip help is https://bash.toolforge.org/help [22:37:13] !log phabricator - disabling repository dibyaduttabook [22:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:37] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10bking) Confirmed the following: - Known-good ES startup script `(shasum:2a11d1b38f6712e4898a383bf68c7ed5937ba0a1)` is from Elastic's 6.5.4 release - Known-bad ES startu... [22:42:25] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10RLazarus) This came up again in T301507. 
[22:50:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS stretch [22:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:01] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch [22:51:04] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [22:51:13] !log phabricator - disabled empty but active repos: dibyaduttabook and xtools-H (T296022) [22:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:18] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [22:55:04] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10bking) Ran a diff against good and bad, the bad has the following inserted in 23-29: `# If the quote-aware filesystem plugin is installed, then we need to pass extra # flags... 
[22:57:35] (03PS1) 10MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) [22:58:09] !log phabricator - disabled empty but active repo: wikidata-query-LDFServer (WQLD) created in 2018 by qchris (T296022) [22:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:15] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [22:58:31] (03CR) 10jerkins-bot: [V: 04-1] GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: 10MewOphaswongse) [23:04:11] (03PS4) 10Jbond: (WIP) bolt: Add bolt rake tasks [puppet] - 10https://gerrit.wikimedia.org/r/765342 [23:05:18] (03PS2) 10MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) [23:09:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [23:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [23:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2069.codfw.wmnet with OS stretch [23:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:13] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch completed: - ms-be2069 (*... [23:47:20] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [23:47:29] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:52:40] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) 05Open→03Resolved @fgiunchedi this is complete after long hours of workaround because puppet wasn't happy at ` mkfs on /dev/sdc1 ` hopefully w... [23:54:22] 10ops-codfw, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q3): Decom centrallog2001 - https://phabricator.wikimedia.org/T298994 (10Papaul) [23:55:30] 10ops-codfw, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q3): Decom centrallog2001 - https://phabricator.wikimedia.org/T298994 (10Papaul) [23:59:51] (03PS1) 10Krinkle: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465)