[00:02:43] anyone around to give eyes on https://gerrit.wikimedia.org/r/c/operations/puppet/+/976878
[00:03:33] mutante any chance you can take a quick look at ^
[00:38:34] mutante: puppet-merge seems to be working as expected now, stepping away from the computer, i'll check back in on things in a bit
[00:41:17] thanks again for your help
[00:57:17] jhathaway: I still have some issues with my VM, but that should be a separate problem. I will also try that again at a later time. cheers
[00:57:46] confirmed the generalized puppet issue is gone :) cya
[09:03:38] volans: freed some old builds in my home too
[09:04:01] thx :D
[09:27:05] any recent changes to smart-data-dump / the raid.rb fact? it's randomly timing out at 120 seconds on a few traffic instances
[09:27:16] git log --grep doesn't show anything obvious
[09:28:31] oh...
[09:28:32] T251293
[09:28:33] T251293: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293
[09:28:45] I'll ping c.white there
[09:29:17] I've seen that happening sometimes
[09:29:44] try resetting the ipmi interface, sometimes it gets slow, it may help?
[09:30:49] https://www.irccloud.com/pastebin/LQ8zjJ78/
[09:31:10] jynus: seems like an intermittent issue here
[09:32:40] ah ok, then
[09:36:34] jynus: https://phabricator.wikimedia.org/T251293#9354533
[09:36:54] seems like it's impacting at least ~5% of the fleet
[09:39:24] I also added a comment to this old ticket about smart-data-dump yesterday: https://phabricator.wikimedia.org/T320636#9352341 - we're also seeing a lot of timeouts from this.
[09:42:09] vgutierrez: interesting, I had seen it at times on d.p.-owned boxes, but I didn't know it was so widespread
[09:46:28] btullis: added some extra info on https://phabricator.wikimedia.org/T320636#9354548
[09:47:06] vgutierrez: Many thanks.
[11:09:55] ~.
[13:13:25] puppet-merge takes ages, is it just me?
[13:14:40] Looks like it is super slow on the "Starting run on puppetserver1001.eqiad.wmnet" step
[13:16:10] I think there was a change yesterday to avoid race conditions that uses symlinks instead of in-place updates, could be related
[13:17:09] (directory symlink)
[13:17:36] cc jhathaway
[13:19:15] I've done quite a few merges today and it wasn't slower than usual
[13:19:40] yeah same here, it was fine during the morning
[13:19:48] Emperor: how was your last merge?
[13:25:35] the merges on the puppet 7 puppetservers do feel significantly slower compared to the puppet 5 puppetmasters to me too
[13:26:01] maybe the script could update all of the puppetmasters in parallel?
[13:26:12] heads-up: I've just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/974998, which generates the dhcpd.conf config file from our hiera network data. If anything looks fishy, please shout
[13:27:45] brouberol: great! please also notify -dcops, as they are also potentially affected
[13:27:56] yep, on it
[13:28:51] the puppet-merge change was a hotfix, AFAIUI, to prevent too many failures on the puppetservers; the longer-term fix IMHO is to make it a cookbook (SAL logging, locking, and parallelization for free)
[13:29:41] +1 on making this a cookbook
[13:33:03] puppet-merge on the puppetservers does do a bit more work. on the puppetmasters we just do a git pull.
[13:33:20] but on the puppetservers we do, I guess, the equivalent of
[13:33:35] git -C staging_dir pull
[13:34:08] correction
[13:34:10] git -C git_dir pull
[13:34:23] cp git_dir staging_dir
[13:34:33] ln -sv staging_dir live_dir
[13:34:40] rm -rf old_staging_dir
[13:34:54] so as you can see, the additional cp and rm will increase the time
[13:43:38] I'm seeing timeouts related to puppet7 when running run-puppet-agent on install1004.wikimedia.org: Error: Connection to https://puppetserver1001.eqiad.wmnet:8140/puppet/v3 failed, trying next route: Request to https://puppetserver1001.eqiad.wmnet:8140/puppet/v3 timed out connect operation after 60.005 seconds
[13:44:08] it ended up running anyway, but I thought I'd mention it
[13:48:37] I've just had something similar from a couple of swift frontends too
[13:48:49] see the ticket vgutierrez mentioned some time ago
[13:49:34] there seems to be a regression with RAID + puppet 7 (I may not be exact, the details are on the ticket)
[13:49:41] ack thanks
[13:49:45] marostegui: my puppet-merge was fine speed-wise
[13:50:19] brouberol: T320636
[13:50:20] T320636: smart-data-dump fails occasionally due to facter timeouts - https://phabricator.wikimedia.org/T320636
[13:50:33] I think it is that, but it could be something else
[13:50:35] thanks jynus
[13:50:56] I instead saw timeouts connecting to a puppet server (but the puppet agent ran in the end)
[13:51:12] which seems different from facter being slow?
[13:51:17] yeah
[13:51:56] I guess it could be related to the ongoing migration
[13:52:22] but yeah, a different kind of timeout
[13:53:45] I am going to add the puppet 7 tag there
[13:53:56] and that way we can track ongoing known issues
[16:01:46] hi all, sorry for not circling back, but we did see some issues with puppetserver1001, and a restart of the puppetserver daemon seemed to make things better
[16:02:53] notice the spike in CPU and drop in network for about 2 hours starting at 12:30 UTC: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=puppetserver1001&var-datasource=thanos&var-cluster=misc&from=now-6h&to=now
[16:32:06] !oncall-now
[16:32:06] Oncall now for team SRE, rotation business_hours:
[16:32:06] k.amila_, c.laime
[16:33:49] (oh, turkeys, it all makes sense)
[16:43:31] yep
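
Editor's note: a rough way to narrow down the facter timeouts discussed above (T251293, T320636) on an affected host, and to try the IPMI reset suggested at 09:29:44. This is only a sketch, not the team's documented procedure: the smart-data-dump path and the facter -p flag are assumptions, and a BMC cold reset briefly takes the management controller offline.

    # Time the custom raid fact on its own (-p loads custom facts such as raid.rb;
    # if -p is not available on this facter version, `puppet facts` can be used instead)
    time sudo facter -p raid

    # Time the SMART exporter separately, to see whether the slowness is in the
    # smartctl/controller calls rather than in facter itself
    # (path is an assumption; adjust to wherever smart-data-dump is installed)
    sudo time /usr/local/sbin/smart-data-dump

    # If the slowness tracks the management controller, a cold reset of the BMC
    # sometimes helps; this is the "reset the ipmi interface" suggestion above
    sudo ipmitool mc reset cold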
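
Editor's note: the git_dir / staging_dir / live_dir commands at 13:33-13:34 are shorthand; written out, the puppetserver side of puppet-merge is roughly the following. All paths are placeholders from the chat and the real script differs, but it shows why the extra cp and rm make merges slower than the plain git pull done on the puppetmasters.

    #!/bin/bash
    # Rough approximation of the puppetserver merge flow described above.
    # Paths are placeholders; the real puppet-merge script is different.
    set -euo pipefail

    git_dir=/srv/git_dir                       # canonical clone (placeholder path)
    staging_new="/srv/staging-$(date +%s)"     # fresh staging copy per merge (placeholder)
    live_link=/srv/live_dir                    # symlink the server actually serves from (placeholder)

    old_target=$(readlink -f "$live_link" || true)

    git -C "$git_dir" pull --ff-only           # the same step a puppetmaster does
    cp -a "$git_dir" "$staging_new"            # full copy of the repo: extra work vs. a puppetmaster
    ln -sfn "$staging_new" "$live_link"        # swap the symlink to the new copy
    if [ -n "$old_target" ] && [ "$old_target" != "$staging_new" ]; then
        rm -rf "$old_target"                   # remove the previous copy: the other extra step
    fi

The symlink swap is what the 13:16:10 message refers to: the served directory is switched via a symlink instead of being updated in place, at the cost of copying and deleting a full checkout on every merge.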
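
Editor's note: for the 16:01:46 follow-up, a quick sanity check before reaching for the restart that reportedly helped on puppetserver1001. Assumptions: the daemon runs as a systemd unit named puppetserver and logs to the journal; adjust if the hosts are set up differently.

    # How long has the JVM been up, and is it pinning a CPU?
    systemctl status puppetserver
    top -b -n 1 | head -n 20

    # Any errors in the window of the CPU spike (roughly 12:30 UTC onwards)?
    sudo journalctl -u puppetserver --utc --since "12:30" | tail -n 50

    # The fix that was applied on puppetserver1001
    sudo systemctl restart puppetserver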