[00:09:43] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:09:43] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:43] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:43] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:41] Puppet, Release-Engineering-Team, Patch-For-Review: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277 (hashar) a:hashar
[11:01:48] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (ayounsi) Yes, that would be possible even though there is no documented way on how to do this and what is supported or not. The two main options I see is either via a...
[11:30:50] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetmaster1005` - puppetmaster1005 (**WARN**) - Downtimed host on Icinga...
[11:30:59] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin2002 for hosts: `puppetmaster2005` - puppetmaster2005 (**WARN**) - Downtimed host on Icinga...
[12:13:40] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) After fixing the `redirect_uri` I'm able to login successfully to the admin interface (https://gitlab.wikimedia.org/admin) using...
[12:13:46] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm
[12:27:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (ayounsi)
[12:37:22] Puppet, Cloud-VPS, cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (Andrew) Open→Resolved
[12:51:10] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm executed with errors: - puppet...
[14:05:49] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm
[14:10:18] netops, Commons, Infrastructure-Foundations, Traffic, WMF-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) Today, using First World-grade Internet connections, I could still very simply reproduce the bug....
[14:24:56] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm executed with errors: - puppet...
[14:25:55] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm
[14:38:26] netbox, Infrastructure-Foundations, Puppet-Core: Make netbox the source of truth for cloudceph networks - https://phabricator.wikimedia.org/T338329 (jbond) p:Triage→Medium
[14:54:58] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm completed: - puppetserver1001...
[15:08:20] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host puppetserver2001.codfw.wmnet with OS bookworm
[15:10:05] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[15:17:11] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @cmooney all those connections are no longer on the old switch we can delete those. thanks
[15:17:32] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host puppetserver2001.codfw.wmnet with OS bookworm executed with errors: - puppet...
[15:17:53] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host puppetserver2001.codfw.wmnet with OS bookworm
[15:25:51] SRE-tools, netops, Infrastructure-Foundations, SRE: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Papaul) @Volans on lsw1-a1 which is a new switch, after running the cookbook it did PASS . However no configuration was done on the switch itsel...
[15:27:56] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) @papaul thanks I'll remove them from netbox cheers.
[15:37:35] SRE-tools, netops, Infrastructure-Foundations, SRE: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) It didn't complete successfully, it failed to check the uptime of the switch and asked the operator what to do, and when it was answered...
[16:12:40] volans (not urgent, can answer tomorrow): How confident are you in the reboot-cluster cookbook? We're in need of restarting the whole traffic fleet and are considering using that once we're comfortable with the workflow of single nodes
[16:39:48] brett: I'm not the owner of *all* the cookbooks ;) that one is not mine and is from 2020 and as such predates the batch classes. My suggestion would be to write one for the cp cluster using the batch LB class as only there you could fine-tune the logic that you need
[16:40:24] volans: My apologies, I was just assigned to check with you about your confidence in that cookbook before using it. I should have used a git blame.
[16:40:33] for example you could speed up the process doing up to one upload and one text host a time per dc in the most "quick" version of it
[16:41:19] I see last runs were on january: https://sal.toolforge.org/production?p=0&q=%22sre.hosts.reboot-cluster%22&d=
[16:41:54] the other key point is if you have to do anything special before/after the reboot
[16:42:29] like managing some specific service that a simple reboot would not manage well
[18:14:43] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:43] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:03] Puppet, Cloud-VPS, cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (MoritzMuehlenhoff) >>! In T338195#8906984, @Andrew wrote: >> >> I think this is a missing dependency in the package. > > Indeed, installing 'ruby-sorted-set' fix...