[07:40:11] hashar: o/ puppet on contint2001 seems to be having trouble with the git repo /srv/dev-images, do you mind checking later on?
[07:40:33] it seems to be a chmod issue but I don't want to make a mess :)
[07:40:42] err chown
[07:41:31] you say that like it's not apparently already in a mess :D
[07:47:29] Reedy: you'd be surprised how I can make things worse! :D
[08:24:27] Amir1: sorry, commented on your page with the wrong account /o\
[08:27:49] elukey: I will check
[08:33:18] in short, the Puppet define `git::clone` learned to "smartly" update the remote origin when the upstream URL is changed
[08:33:24] to do so it invokes `git remote set-url`
[08:34:42] since /srv/dev-images is shared between users, the .git directory is currently owned by b.rennen, but Puppet runs the git command as user root
[08:35:03] that causes newer git versions to bail out because of the user mismatch (that is the `safe.directory` feature)
[08:35:30] tldr: we need to make the repository owned by a shared system user, since git safe.directory does not support group-shared repos.
[08:35:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/927975 is the patch
[13:13:53] hi all, just a heads up: I have added puppetserver1001 (the new puppet7 compiler) to puppet-merge. You shouldn't really notice this other than puppet-merge now going to 9 servers instead of 8, but if you do see any issues please ping me and/or T340635
[13:13:53] T340635: puppet-merge: add new puppetserveres to puppet merge - https://phabricator.wikimedia.org/T340635
[13:15:05] claime, effie: there was another massive backlog issue on the job queue last night... did it resolve on its own, or did you do anything to fix it?
[13:15:29] duesen: we talked about it a bit on serviceops and I commented on the ticket
[13:15:30] Looks like at some point we were backed up by 2:40!
[13:15:52] It resolved on its own, but we're probably going to throw a few more servers at the jobrunners cluster
[13:16:07] (just to avoid hitting a complete saturation of jobrunner workers)
[13:16:41] Doesn't change our plan for the rest of the wikis, since we've already done it for the biggest
[13:19:35] claime: ok, thank you for looking into it!
[13:19:50] Is there anything we can do to make the concurrency graph less misleading?
[13:20:00] I don't actually understand how that query works, I just copied it...
[13:21:02] Change the metric type and the query: it's a Prometheus summary when it should be a histogram (summaries are non-aggregatable), and the query makes little sense as a result
[13:21:41] But since it's a global query for all cp-jobqueue jobs (iiuc), we need to investigate a bit what the buckets should be, etc.
[13:22:22] For now we're operating on the rule of thumb that too much backlog is either a concurrency issue or the jobrunners being a bit overloaded
[13:22:54] If you figure out how to improve the metrics, please let me know!
[13:26:18] #fyi all, I'll be reverting the puppet-merge change shortly, as we need to somehow sync the new ssh fingerprint with the puppetmasters
[13:37:07] jbond: can I merge patches at the moment?
[13:37:31] jynus: I just merged, if that helps, and it went fine
[13:37:37] thanks
[13:37:40] sorry
[13:38:07] no problem, just asking just in case
[14:41:33] Unless there are objections, I'm going to work on upgrading the sessionstore eqiad nodes to Bullseye (within the half hour). I'm not planning to depool this time; it'll just be one host going down at a time.
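For reference on the `git::clone` / `safe.directory` exchange above, here is a minimal shell sketch of the two workarounds being weighed. It is illustrative only: the owning user and group below are assumptions, and the real fix is the linked Gerrit patch, not this snippet.

```sh
# Puppet runs git as root, but the checkout belongs to an individual user,
# so newer git refuses to operate on it (the safe.directory check).
stat -c '%U %G' /srv/dev-images/.git    # currently shows b.rennen

# Workaround discussed above: hand the checkout to a single shared system
# user, since safe.directory keys on the owner of the path, not on group
# membership. ("dev-images:wikidev" is a hypothetical owner/group pair.)
chown -R dev-images:wikidev /srv/dev-images

# Alternative escape hatch: explicitly mark the path as safe for root.
git config --system --add safe.directory /srv/dev-images
```

Either way, the underlying point matches the chat: safe.directory has no notion of a group-shared repository, so a single owning system user is the cleaner option.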
[16:14:23] Hey, I'm trying to do a service first-deploy to staging but the helmfile command seems to hang and `kubectl get pods` says no resources found yet.
[16:18:15] The diff shown by helmfile is wrong; it's out of date compared to the git log. Is there a second cron job / manual update that needs to run?
[16:20:27] James_F: There is a timer that pulls deployment charts to chartmuseum every two minutes. https://wikitech.wikimedia.org/wiki/ChartMuseum#Interacting_with_ChartMuseum
[16:20:47] btullis: Ah, that's different from the timer to update the git repo on the deployment server?
[16:21:26] Yep, two independent timers I believe. I've sometimes been caught out by that, but there shouldn't be any other manual steps.
[16:21:33] Ack.
[16:21:48] It's been ~10 minutes now, should I just keep waiting and re-trying until the chart is right?
[16:21:59] Or can I force helm to refresh or something?
[16:26:53] James_F: Is your chart correctly published here? https://helm-charts.wikimedia.org/api/stable/charts
[16:28:41] btullis: It is, but perhaps I should have updated the version in Chart.yaml?
[16:29:18] claime, effie: can we go ahead with switching off parsoid pc writes today, or should we wait until the additional servers have been moved into the jobrunner cluster?
[16:29:29] akosiaris: --^
[16:29:54] James_F: Ah yes, that'll be it. You said `first-deploy` so I didn't twig that there was a version update to the chart.
[16:30:03] btullis: Aha, yeah, thanks!
[16:50:15] duesen: it is 19:47 here, I suggest we put a pin in it and discuss tomorrow
[17:10:04] There are some small changes pending from `sre.puppet.sync-netbox-hiera` relating to `xhgui2002` - can I take it this is OK to apply?
[17:14:47] btullis: should be fine, the xhgui hosts are new WIP replacements for the current xhgui* VMs
[17:16:03] moritzm: Ack, thanks. TIL about xhgui and https://performance.wikimedia.org/xhgui/ :-)
[17:20:44] btullis: Yes, thank you in advance.
[17:20:59] I'm working with the XHGUI hosts right now. :)
[17:21:03] Thanks!!
[17:21:45] * btullis denisse: Ack. Looks good.
[17:52:06] duesen: let's do that tomorrow. claime and I chatted: with the big wikis done already, we don't expect any huge difference from the rest. Adding more servers is more of a precautionary measure down the line, not a blocker
[18:36:26] akosiaris: yea, I don't expect the config change to actually add much load in normal operation. But it has the potential to increase the impact of template edits.
[18:37:22] Previously, when the job queue got long, this would have caused the update requests coming from restbase to win the race, so parsing would have happened on the parsoid cluster, and execution of the prewarm jobs would have been fast.
[18:38:09] Now, prewarm jobs have to do the actual parsing even if they are executed 20 minutes too late. So it takes much longer to resolve the backlog.
[18:39:40] akosiaris: my takeaway is that in normal operation, job runners can take the load of parsoid prewarm parsing. But the cluster struggles to handle spikes (presumably caused by template edits, but they may also be bursts of bot edits / purges)
[18:39:57] If it gets no help from the parsoid cluster, that makes things quite a bit worse when they are already bad.
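Tying the ChartMuseum thread above together, a rough sketch of the sequence for getting a new chart version deployed, under the assumptions stated in the chat; the chart name and the exact helmfile invocation below are placeholders, not the actual service being deployed.

```sh
# 1. Bump the chart version so the ~2-minute publish timer pushes a new
#    release to ChartMuseum (edit charts/<your-chart>/Chart.yaml: version).

# 2. Once the timer has run, confirm the new version is listed
#    (endpoint taken from the chat; "your-chart" is a placeholder):
curl -s https://helm-charts.wikimedia.org/api/stable/charts \
  | jq '.["your-chart"][].version'

# 3. When the new version shows up, re-run the deploy from the deployment server:
helmfile -e staging diff
helmfile -e staging apply
```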