[01:41:32] bd808, ryankemper, I found that specifying --puppet on the cli for the reimage script worked but waiting for it to prompt for puppet version and then selecting 5 or 7 did not. Worth a try at least!
[01:41:52] (sorry about slow response, just saw the ping)
[02:43:57] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:07:24] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
[06:43:57] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:57] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:23:28] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:48:29] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:54:23] sukhe: as top.ranks attended the office hours, we ended up chatting a bit about it. You can read the summary in the meeting notes doc (linked in the calendar event). More or less what we said here, just with some additional details
[10:02:03] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:02:42] moritzm: do you have an open screen on cumin2002 with a sre.puppet.migrate-role thanos::backend cookbook run?
[10:03:10] ah sorry, just created, I misread the date
[10:03:33] yeah, that's mine
[10:03:43] for the spicerack upgrade?
[10:03:49] I'll ping you when it's completed
[10:04:01] ack thx
[10:04:03] yes
[10:05:06] will do
[10:05:10] ca. 10-15m
[10:13:04] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:15:04] volans: done, you can proceed
[10:15:10] ack thx
[10:18:37] moritzm: cumin2002 done, tested, all good. for cumin1001... I'm tempted to upgrade without waiting, there are the very long cookbooks there that will take forever and the only code change was in the puppet module that should be already in memory
[10:19:01] sounds good to me!
[10:21:53] moritzm: all done
[10:28:57] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:30:21] thanks
[10:31:21] was this the last blocker to migrate cumin hosts to p7?
[10:33:00] TBBOMK yes, but let's doublecheck with jbond when he's around
[10:33:14] if so, we could initially move cumin2002
[10:37:06] taavi migrated cloudcumin2001 to Puppet 7 as well earlier in the day, could you also roll out the updated spicerack there?
[10:43:35] volans: moritzm: yes afaik that is the only blocker
[10:46:20] moritzm: ack I'll tell francesco as they usually self-manage upgrades there, I just notify them of new releases
[10:46:57] at the same time they don't use that function so it's not a real blocker
[10:48:37] ok
[11:08:33] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:15:03] fyi cloudcumin upgraded too
[11:17:34] ack
[11:19:46] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:44:59] netops, Infrastructure-Foundations, SRE: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (cmooney) Open→Resolved Patches merged, all looking ok. For example on dns5004 this was situation before, server using TTL 2, CR using 193: ` 19:27:22.338917 IP (...
[11:54:51] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:56:32] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (taavi)
[12:16:42] I'll migrate cumin2002 to Puppet 7 in a bit
[12:17:30] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[12:22:04] ack
[12:37:32] cumin2002 has been moved, I typically run all cookbook/cumin/debdeploy things from 2002, so if there are any issues, we'll notice. I'll do 1001 next week unless there are any issues
[12:38:13] +1 thanks
[12:38:15] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:46:21] netops, Infrastructure-Foundations, SRE: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (cmooney) I'm gonna close this one for now, if we see an issue again we should get a better error message which should point us to what PuppetDB data triggered i...
[12:46:29] netops, Infrastructure-Foundations, SRE: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (cmooney) Open→Resolved
[12:47:04] netops, Infrastructure-Foundations, SRE: Use default BGP multihop TTL between CRs and servers - https://phabricator.wikimedia.org/T350488 (cmooney)
[12:53:57] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:15:15] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[13:35:20] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[13:53:57] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:06:19] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:19:08] moritzm: fyi puppet seems to be running slowly on some hosts. possibly related to load. for now i have disabled puppet on puppet7 agents to do a bit of debugging
[14:19:27] ok
[14:20:00] I'll re-enable on kafkamon* though, since those are currently in the process of running the cookbook
[14:20:07] ack
[14:22:36] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:42:17] volans: thanks! will read the notes!
[14:42:28] yw
[14:42:49] moritzm: fyi going to try and re-enable now
[14:44:07] ok!
[14:45:33] jbond: does the puppetserver caching seem to help?
[14:45:56] jhathaway: definitely helps with one host, will have to wait 30 mins to see what it's like with all of them
[14:46:11] nod
[14:46:57] what was the issue? did puppet agent times spike or so
[14:47:42] agent runs were sporadically taking a long time, with some very high CPU spikes on the puppetservers
[14:48:31] where were you noticing, puppetboard?
[14:49:05] specific hosts: after migrating i asked people to test and they reported that hosts that used to complete in ~30secs were taking 5 mins
[14:49:33] disabling puppet fleet wide so there was no load on the servers and running puppet again i saw the times go back to normal
[14:49:54] when looking at the debug logs for one slow host it was taking a very long time to fetch files
[14:49:59] (PuppetFailure) firing: Puppet has failed on puppetdb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:51:48] ack, thanks
[14:52:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on ganeti1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:53:29] jbond: do we export any metrics anywhere for puppet agent run times?
[14:53:59] (PuppetFailure) firing: Puppet has failed on ganeti2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:54:49] jhathaway: it's definitely in logstash but i think there is also a board
[14:54:59] (PuppetFailure) firing: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:55:13] thanks, I'll take a look around
[14:57:18] jhathaway: looking in /var/lib/prometheus/node.d/puppet_agent.prom it doesn't look like we have the applied time
[14:57:54] so i'd say logstash or some pql is the best thing for now
[14:58:55] nod, thanks
[14:59:57] will need to wait a bit longer but so far it seems like your patch has helped relieve cpu and memory pressure quite a bit https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=puppetserver1001&var-datasource=thanos&var-cluster=misc
[14:59:59] (PuppetFailure) resolved: Puppet has failed on puppetdb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:00:03] * jbond suspects mem will grow again
[15:03:59] (PuppetFailure) resolved: Puppet has failed on ganeti2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:04:59] (PuppetFailure) resolved: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:07:59] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on ganeti1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:52:34] jbond, jhathaway: are we good to resume adding hosts or rather pause for some observation/stabilisation?
[15:53:12] moritzm: yes you should be good to continue
[15:53:58] ok, it seems puppetserver simply attempts to consume all avail memory, even a week ago when we had hardly any servers in P7 it used 59G
[15:54:10] the joys of Java
[15:54:43] yes that's correct, it's java
[15:59:47] seems to have reduced the number of "procs running" as well, not sure why
[16:00:04] maybe less java threads doing IO?
[17:34:05] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[17:54:27] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:57:17] netops, Infrastructure-Foundations, SRE: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (cmooney) p:Triage→Low
[20:41:22] Puppet, MediaModeration (MediaModeration 2.0), Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (kostajh)
[21:54:56] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:58:21] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:09:47] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:10:31] netops, Infrastructure-Foundations, SRE, ops-codfw: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c937612c-c0eb-4c9e-a245-9810a56c0a33) set by cmooney@cu...