[02:38:23] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:23] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:39] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[08:46:02] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:13:00] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:43:21] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10elukey)
[09:59:50] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:23] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:34:32] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:13:23] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:04] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[13:02:06] volans: I just ran a reimage after changing the role of a node from a puppet5 to a puppet7 one, which seems to make the migrate-host cookbook fail
[13:02:20] SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, puppet7 is only avalible for bullseye
[13:02:30] sudo cookbook sre.hosts.reimage -t T351074 --os bullseye -p 7 mw2420
[13:03:01] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:03:37] jbond: ^^^
[13:04:09] I'm at lunch, I can check in a bit, but I think we might call the migrate-host *before* the reimage right now
[13:04:38] and maybe we should move it to after d-i, before the first puppet run
[13:05:58] yeah, it's definitely running before
[13:06:56] and it does not know about the changed role, asking me to add a hieradata/hosts patch
[13:07:09] ofc
[13:07:30] probably an option to skip running it at all would be enough
[13:07:43] jayme: so to double check, were you trying to use the reimage cookbook to change from buster -> bullseye and migrate from puppet5 -> puppet7 in one go?
[13:08:11] jbond: and move from puppet role A (puppet5) to B (puppet7), yes
[13:08:30] yeah, this is for a reimage of mw/buster to k8s/bullseye (which has been enabled for P7 at the role level)
[13:10:22] oh wow, i didn't realise that mw servers can now also be kubernetes::worker's :/
[13:10:45] the missing rename part is just an implementation detail, we might fix it or not ;)
[13:11:05] but they are repurposed slowly from mw hosts to k8s to support mw-on-k8s
[13:11:33] sure sure, i don't think it makes much difference, it would have still failed with a rename i think
[13:11:45] yep
[13:12:50] so there are a few things here. first, it's saying to add the hiera even though it's already there. i don't think it's worth fixing this and tbh it's not an easy thing to fix
[13:13:50] jbond: I would assume this is kind of an edge-case, right?
[13:14:18] a bit, but not too edgy
[13:14:22] jayme: well this is definitely an edge case on a few points
[13:14:33] i think others will hit the hiera thing mentioned above
[13:14:46] but either way i still think it's not worth the fix
[13:15:20] i think the thing that makes it trickier is the changing roles
[13:15:23] but couldn't we just add a --skip-puppet-migration flag?
[13:15:57] and skip all the puppet version checks as well as running migrate-host
[13:16:15] jayme: i think to unblock you it would be better to reimage to puppet 5 & bullseye
[13:16:20] then run the migrate cookbook
[13:16:39] if you add force_puppet7: false to the host it should keep you on puppet5
[13:16:54] hm.. that's yet another step for 300+ hosts
[13:17:19] that's 2 more commits, not doable for the migration, but for a one-off test it might be
[13:17:19] like i said, just to unblock you, will think on a better solution
[13:17:58] I fail to understand why it would be required to run migrate-host in this case
[13:18:29] jayme: happy to review a patch
[13:18:34] as the target role is migrated, the node should be automatically migrated after the reimage - or am I wrong?
[13:18:39] but otherwise i'm looking at the code now
[13:18:58] we could also re-use the -p/--puppet: 5, 7, ignore
[13:20:23] to me it seems like the reimage cookbook should use the current puppet version to perform the pre-reimage cleanup, and the given version only when signing the certs for the new installation
[13:21:56] * jbond grabbing some food
[13:24:48] jbond: volans: I was thinking: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/976732
[13:26:06] as-is would not work
[13:26:52] we need to detect the current puppet version to do the cleanup on the proper puppetmaster/server and know/detect the version of puppet after the reimage to know where to sign the CSR
[13:27:46] is that cleanup part of the migrate-puppet cookbook?
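A minimal sketch of the `-p/--puppet: 5, 7, ignore` idea floated above, where "ignore" would skip the version checks and the migrate-host call, as an alternative to a dedicated --skip-puppet-migration flag. The option name and choices come from the discussion; the parser structure, defaults, and other arguments are assumptions for illustration, not the actual sre.hosts.reimage code:

    import argparse

    def argument_parser() -> argparse.ArgumentParser:
        """Build an illustrative subset of the reimage cookbook's CLI (sketch only)."""
        parser = argparse.ArgumentParser(prog='sre.hosts.reimage (sketch)')
        parser.add_argument('-t', '--task-id', help='Phabricator task ID, e.g. T351074')
        parser.add_argument('--os', required=True, help='Debian codename to install, e.g. bullseye')
        # 'ignore' would skip the puppet version checks and the migrate-host call
        # entirely, as suggested above; '5' and '7' would behave as today.
        parser.add_argument('-p', '--puppet', choices=('5', '7', 'ignore'), default='5',
                            help='target Puppet infrastructure, or "ignore" to skip migration logic')
        parser.add_argument('host', help='short hostname to reimage, e.g. mw2420')
        return parser

    if __name__ == '__main__':
        args = argument_parser().parse_args(
            ['-t', 'T351074', '--os', 'bullseye', '-p', 'ignore', 'mw2420'])
        print(args.puppet)  # -> ignore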
[13:27:53] so it needs some refactoring
[13:28:13] no, also the reimage itself clears puppetdb and the cert at every reimage
[13:28:23] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:03] but let me check the current code with the migrate logic integrated
[13:31:29] hmm, I did not see any other use of get_puppet_version() there
[13:31:50] no, but there are uses of self.puppet_server, which is currently either one or the other
[13:31:56] I'm doing a local refactor
[13:32:39] that's right... but I did not touch that
[13:33:15] if you run with -p5 you'll get puppet 5, if you run with -p7 you get puppet 7 (from _get_puppet_server())
[13:36:03] hmm volans i think that patch could work, as we do:
[13:36:04] self.puppet_server.delete(self.fqdn)
[13:36:04] >> if self.args.puppet_version == 7:  # Ensure we delete the old certificate from the Puppet 5 infra
[13:36:07] self.spicerack.puppet_master().delete(self.fqdn)
[13:36:20] if self.args.puppet_version == 7:
[13:36:25] self.spicerack.puppet_master().delete(self.fqdn)
[13:36:44] in fact i'm now wondering if we even need to call the migrate cookbook from reimage
[13:36:53] anyway i'll wait for your refactor
[13:37:35] ofc in this case we don't need to call the migrate-host
[13:37:54] and in the general case I think you did it to avoid duplication of code
[13:38:02] but we could also not do that
[13:38:43] yes, i think that is what i did. however i am now wondering if it is needed in the general case.
[13:39:03] ultimately all we need to do, unless i'm missing something, is delete the old cert and sign the new one.
[13:39:14] and i think the reimage cookbook already does that
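For context, a self-contained sketch of the logic being discussed: delete the host's certificate from its Puppet infrastructure before the reimage and, when the target is Puppet 7, also clear any leftover certificate on the Puppet 5 master. The attribute and method names (puppet_server, spicerack.puppet_master(), args.puppet_version, fqdn) follow the snippet pasted above; the class wrapper and the way puppet_server is chosen are assumptions, not the actual cookbook:

    class ReimageCertSketch:
        """Sketch of the cert handling split discussed above; not the real cookbook."""

        def __init__(self, spicerack, args, fqdn):
            self.spicerack = spicerack
            self.args = args
            self.fqdn = fqdn
            # Hypothetical: pick the Puppet infra API matching the *target* version.
            self.puppet_server = (spicerack.puppet_server() if args.puppet_version == 7
                                  else spicerack.puppet_master())

        def remove_old_cert(self):
            """Delete the host's old certificate before the reimage."""
            self.puppet_server.delete(self.fqdn)
            if self.args.puppet_version == 7:
                # Ensure we also delete the old certificate from the Puppet 5
                # infra, covering a host migrating from 5 to 7 (per the quoted patch).
                self.spicerack.puppet_master().delete(self.fqdn)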
[13:39:43] yes, the problem is the auto-discovery of the version IMHO
[13:39:54] if I forget to set -p in this case I end up with detecting puppet 5
[13:40:07] installing puppet5
[13:40:14] but then having hiera set for puppet 7
[13:40:30] and it's too easy to get into that situation and I'd like to avoid it
[13:40:43] but yes, i think we have that problem regardless of whether we call the migrate cookbook or not
[13:40:50] yes
[13:40:58] FYI this is one of the reasons i wanted to add the hiera_lookup method
[13:41:13] but somehow convinced myself i didn't need it
[13:41:21] lol
[13:41:25] we can re-add it
[13:41:33] oh yes, we could just look the same information up via puppetdb, so a puppetdb module would probably be better but more work
[13:41:45] puppetdb wouldn't work in this case :d
[13:41:55] it has puppet5 data and we need the info before the new puppet run
[13:42:38] from puppetdb we can see what the last force_puppet7 value was, which would give us the answer
[13:43:13] it would be pretty much the same result as using the hiera lookup
[13:43:15] I guess people merge hiera while reimaging, so puppet might not have run to populate puppetdb
[13:43:25] fyi the hiera lookup patch is ready to merge so we can do that: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/972459
[13:43:50] volans: yes, that's fair
[13:45:01] yeah, it's probably the cleanest way as there is no way from the current cookbook to know if the host is changing puppet version
[13:45:01] volans: so yes, for that bit i think the best idea would be to update get_puppet_version to use puppet_server.hiera_lookup
[13:46:16] +1
[13:46:33] ok, i'll merge that above change but it will need a spicerack release
[13:47:02] sure
[13:48:32] lmk if I can be of any help :)
[13:49:03] jayme: not repurposing mw hosts into k8s ones :D
[13:49:14] no can do :)
[13:49:48] btw IMHO I think we should automate the renaming and rename those, it hurts me having a mwXXXX being a k8s host
[13:59:29] jbond: sorry, I cancelled your +2 to make sure it works as expected both on puppetmaster and puppetserver, as I'm failing to make it work on the hosts running the same command
[14:00:05] also that method is inherited by PuppetMaster() too, so it should either work there too or we should override it to raise
[14:01:19] volans: the command is working as expected. it should also work on puppetmaster
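A sketch of the version auto-detection being proposed: look up the host's force_puppet7 hiera key and fall back to Puppet 5 when it is unset, so the cookbook no longer has to guess. hiera_lookup is the spicerack method referenced in the CR above, but its exact signature and return format here are assumptions for illustration:

    def get_puppet_version(puppet_server, fqdn: str) -> int:
        """Detect a host's Puppet infra from its hiera data (illustrative sketch).

        Assumes a hiera_lookup(fqdn, key) method as in the spicerack change linked
        above; the real signature and return format may differ.
        """
        value = puppet_server.hiera_lookup(fqdn, 'force_puppet7')
        # A true-ish lookup result means "already on Puppet 7"; anything else
        # (false, unset) means the host is still on Puppet 5.
        return 7 if str(value).strip().lower() == 'true' else 5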
[14:02:00] I've put the error I'm getting on the CR (for the others)
[14:02:25] ftr /govol
[14:28:23] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:50] (SystemdUnitFailed) firing: (7) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:50] (SystemdUnitFailed) firing: (8) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:50] (SystemdUnitFailed) firing: (7) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:08:23] (SystemdUnitFailed) firing: (7) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:34:50] (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:44:50] (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:45:53] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[15:49:40] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[15:59:50] (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:33] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[16:16:29] jayme: to keep you updated, john and I have updated and released spicerack with the function that we can now use to make the reimage support your use case
[16:16:49] we now have to refactor it a bit to use that; not sure if it will be today, I've a meeting in 45 and still stuff to do
[16:17:59] cool, thanks. I'm not in a real rush. Tomorrow would be super fine
[16:23:59] (PuppetZeroResources) firing: Puppet has failed generate resources on aux-k8s-worker1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:25:45] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[16:38:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on aux-k8s-worker1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:53:59] (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:29:13] (DiskSpace) firing: Disk space build2001:9100:/ 4.955% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:38:15] moritzm: FYI ^^^ I'm having a look at the disk space
[17:46:10] the big chunk is pbuilder, I pinged the top /home users in -sre
[17:46:30] is there any procedure to clean up old cruft from pbuilder in a safe way?
[17:50:58] sorry, the actual top user is docker
[17:52:01] in particular /var/lib/docker/overlay2
[17:57:05] for example starting with "docker builder prune", it should be safe AFAICT
[17:59:13] (DiskSpace) resolved: Disk space build2001:9100:/ 4.001% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:59:49] thanks to b.en we got some GB back, but we should still perform some cleanup (and probably put it in a timer)
[18:02:06] I see we have 2 timers, a daily one with "docker system prune --force" and a weekly one with "docker system prune --all --volumes --force"
[18:02:22] I wonder if we should add the builder prune too, or if that's included in system prune
[18:02:49] from the docs it is unclear if system prune removes the builder cache too
[18:03:44] as it's not critical I'm not doing anything for now, waiting for feedback
[18:17:14] (DiskSpace) firing: Disk space build2001:9100:/ 5.283% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:33:04] I've run systemctl start docker-system-prune-dangling.service
[18:33:11] Total reclaimed space: 33.62GB
[18:33:21] it should be ok for a bit
[18:37:13] (DiskSpace) resolved: Disk space build2001:9100:/ 2.914% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:48:23] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
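A sketch of the cleanup being discussed: the daily/weekly system prune plus an explicit builder prune, since the docs leave it unclear whether "system prune" covers the build cache. The docker CLI flags are the ones quoted above; wrapping them in a Python helper (rather than the existing systemd timers) is purely illustrative:

    import subprocess

    def docker_cleanup(weekly: bool = False) -> None:
        """Reclaim docker disk space (illustrative; the real hosts use systemd timers)."""
        # Daily: remove stopped containers, dangling images, unused networks.
        cmd = ['docker', 'system', 'prune', '--force']
        if weekly:
            # Weekly: also remove all unused (not just dangling) images and
            # volumes, matching the weekly timer quoted above.
            cmd = ['docker', 'system', 'prune', '--all', '--volumes', '--force']
        subprocess.run(cmd, check=True)
        # Explicitly prune the build cache too, since it's unclear from the docs
        # whether "system prune" covers it on all docker versions.
        subprocess.run(['docker', 'builder', 'prune', '--force'], check=True)

    if __name__ == '__main__':
        docker_cleanup(weekly=False)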
[19:18:23] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:48:23] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:18:23] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:53:59] (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:08:59] (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:18:23] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:23:23] (SystemdUnitFailed) firing: (3) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:29:51] (SystemdUnitFailed) firing: (4) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:33:24] (SystemdUnitFailed) firing: (5) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:50] (SystemdUnitFailed) firing: (6) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:23] (SystemdUnitFailed) firing: (7) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:39:50] (SystemdUnitFailed) firing: (8) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:43:23] (SystemdUnitFailed) firing: (9) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:44:50] (SystemdUnitFailed) firing: (12) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:48:23] (SystemdUnitFailed) firing: (13) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:53:23] (SystemdUnitFailed) firing: (14) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed