[03:31:20] andrewbogott: Thank you for helping. Also, if by local ‘master’ server you meant the `production` branch on `toolsbeta-puppetserver-1.toolsbeta.eqiad1.wikimedia.cloud` (git repo `/srv/git/operations/puppet`), then no, it doesn’t explain what I am experiencing. It’s not that it doesn’t deploy from the `production` branch, it’s that even when you have a commit rebased on top of production, it only deploys the previously rebased commit.
[03:35:29] For example, I have 3 commits A, B and C. Commit A is the latest that is merged to production. I want to test commits B and C. If I rebase commit B onto production (so production now points to commit B) and ssh into the other server to run `puppet agent --test`, I get the changes I want applied alright. But if I undo commit B and rebase commit C on top of production (so the production branch now points to commit C), then switch over to the target server and run `puppet agent --test`, it keeps applying commit B, even though the production branch on the puppet server now points to commit C.
[03:36:16] ---
[03:36:22] Testing a theory
[03:40:59] Yup, for some reason I don’t know yet, you need to rebase the commit on top of `production`, then go away for some time, and when you come back the commit will be applied on the target host. Attempting to ssh to the target host and run `puppet agent --test` seems to do nothing. It’s almost like something is caching the previously applied commit.
[03:41:41] Happy to pair on this with someone, maybe tomorrow, to show exactly what I mean. It sounds pretty unreal but that’s what I’m seeing on my terminal right now.
[03:42:32] I went to bed and came back to see the commit applied. But I tried this multiple times before sleeping.
[09:28:32] Raymond_Ndibe: did you use pgit instead of git? If not, it might have changed permissions on the git tree and failed to run the hook that actually applies the changes, on commit, to the directory that the puppet server publishes to agents
[11:51:18] topranks: I drained a cloudcephosd last night. The graphs look good to me (switch traffic stayed out of the orange and osd traffic stayed out of the red). Do you agree?
[11:53:53] andrewbogott: yes, just looking there it all looks very good
[11:54:14] the total traffic generated was below "total saturation", but we can still see things are working as expected and the QoS is helping
[11:54:22] andrewbogott: is that related to the alert `Node cloudcephosd1021 is down`?
[11:54:26] - We have no drops anywhere for queue 4 that I can see
[11:54:37] (that's the heartbeat/keepalive traffic)
[11:55:22] - We have some drops in queue 0 (normal), but they are all "RED" drops (i.e. pre-emptive, to manage the flow rate rather than due to total resource exhaustion)
[11:55:59] - The number of drops in queue 3 (Ceph bulk data) is higher, which is what we want, and again they are all RED drops, which means those flows are being managed much more sensibly
[11:56:04] RED is this btw - https://en.wikipedia.org/wiki/Random_early_detection
[11:56:13] overall all looks good!
[11:57:46] I'm reasonably confident that if we had a few more go down/come up at one time - saturating the 40G links between switches - we'd successfully mitigate the impact to the rest of the stack as things rebalanced
[11:58:03] great! I have more to drain later today, want me to try a higher load?
[12:00:51] yeah, no reason to hold back I think
[12:01:19] ok!
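(Editor's note: a minimal sketch of the test loop described above at 03:31-03:40, for reference. Hostnames and paths are taken from the log; `<commit-B>` is a placeholder, and this reflects the sequence as described, not necessarily the recommended procedure.)

```bash
# On the puppetserver, put the commit under test on top of the production branch.
# Host and repo path are from the log; <commit-B> is a placeholder commit.
ssh toolsbeta-puppetserver-1.toolsbeta.eqiad1.wikimedia.cloud
cd /srv/git/operations/puppet
git checkout production
git cherry-pick <commit-B>      # "rebase commit B on top of production"

# Then on the host being tested, run the agent and check which changes land:
sudo puppet agent --test
```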
[13:06:54] arturo: I changed the KernelErrors alert a bit while you were away (T382961)
[13:06:55] T382961: Kernel error metrics have overlapping definitions - https://phabricator.wikimedia.org/T382961
[13:07:22] they are still a bit spammy, but they are at least grouped together
[13:07:58] excellent
[13:08:26] I think the next step would be to add a way to ignore certain kernel messages, based on a regexp or whatever
[13:08:46] yes, or another option could be ignoring the ones after a host reboot, but without ignoring kernel panics causing an expected reboot
[13:08:47] for example, ignore `ACPI: .*`
[13:09:28] but adding regexes would also reduce the number of alerts a lot
[13:09:59] one thing I'm not understanding is why I'm not getting emails for those alerts
[13:10:34] there was a problem with the mail server config that taavi fixed last week, and I'm now getting many more alert emails
[13:10:42] but still no email with "KernelErrors" alerts
[13:11:03] I have no idea. Do you expect them at the cloud-admin-feed@l.w.o address?
[13:11:17] yep I think so, but not 100% sure
[13:12:42] the receiver in alertmanager is called "wmcs-taskircmail"
[13:13:20] task and irc are getting the notification, but the email is lost somewhere
[13:13:51] to: 'cloud-admin-feed@<%= @facts["networking"]["domain"] %>'
[13:17:06] shouldn't that be hardcoded to `lists.wikimedia.org`?
[13:17:16] I think the emails I get are all from prometheus-wmcloud, the missing ones are from prometheus-eqiad
[13:19:00] the @facts expands to "cloud-admin-feed@wikimedia.org"
[13:19:13] which seems to be missing "lists."
[13:19:23] yeah
[13:20:04] I'll send a patch
[13:21:32] arturo: the new replacement cloudgw boxes are ready to be put into service (this is T382356). I'm guessing there's more to it than just applying puppet changes... would you like to take over, or coach me through it?
[13:21:33] T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356
[13:22:12] andrewbogott: yes, sure. Ideally we would need to schedule the operation
[13:23:38] Yep. Maybe a week from today? Is an hour long enough?
[13:24:18] andrewbogott: works for me! 1 hour should be enough
[13:24:47] dcaro: I used pgit, but that was after the normal process refused to work
[13:25:24] No idea why; it feels like I’m the only one experiencing this
[13:26:27] Raymond_Ndibe: would you like me to take a look?
[13:27:05] arturo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114723
[13:27:11] Raymond_Ndibe: I doubt you're the only one experiencing it, I suspect I'm just more used to the crappy experience. I'm still curious about what's specifically happening though.
[13:27:53] arturo: I sent a calendar invite, if you accept then I can send a cloud-announce email (best case: network resets; worst case: a few minutes of network downtime. Is that right?)
[13:28:04] dhinus: +1'd
[13:28:12] thx
[13:28:25] andrewbogott: yeah, sounds accurate
[13:28:34] Raymond_Ndibe: the new process is flakier, yep. If it was already refusing to work, pgit would not help iirc, as the permissions in the git repo would already be borked; you can try chowning it to gitpuppet (or whatever user it needs)
[13:29:28] arturo: yes, that’d be super helpful
[13:30:15] Raymond_Ndibe: ok, will look soon.
[13:30:17] dcaro: thanks for the pointers. I think a permission issue is more likely to be the cause
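(Editor's note: on the missing alert emails discussed at 13:13-13:20, a hypothetical check on the host that renders that receiver template, showing why the address comes out wrong. The fact value shown is inferred from the log, not captured output.)

```bash
# The ERB template builds the recipient from the networking.domain fact:
#   to: 'cloud-admin-feed@<%= @facts["networking"]["domain"] %>'
facter networking.domain
# wikimedia.org   (per the log, the fact expands to plain wikimedia.org)
# ...so the rendered address is cloud-admin-feed@wikimedia.org, while the list
# actually lives at cloud-admin-feed@lists.wikimedia.org; hence the patch to
# hardcode it (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114723).
```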
[13:32:52] arturo: to reproduce, you just need to apply the puppet patches https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114007/4 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1113871 ([toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2) one after the other. The first apply will probably work, but subsequent `puppet agent --test` runs will keep applying the commit you applied first, regardless of what is checked out on the puppet server.
[13:33:49] Raymond_Ndibe: what do you mean by apply?
[13:39:58] Raymond_Ndibe: could you send me all the steps you are doing for patching the repo?
[13:58:31] I too see something weird, I cannot remove the last cherry-pick somehow: https://gitlab.wikimedia.org/-/snippets/209
[14:08:30] 1. cd /srv/git/operations/puppet
[14:16:33] https://www.irccloud.com/pastebin/SgPr4SNp
[14:16:43] arturo: here
[14:20:29] arturo: Yeaaa I think your test in the gitlab snippet already captured the problem. After resetting the puppet server to have the production branch point to a different commit, `puppet agent --test` on the target server keeps applying the previously checked out commit
[14:38:50] Raymond_Ndibe: have you tried '/usr/local/bin/puppetserver-deploy-code' on the puppetserver before testing on the client? It might be that your particular sequence isn't hooking that at the right time
[14:48:05] Nope, I didn’t use /usr/local/bin/puppetserver-deploy-code. This is probably going to be the source of my problem 🤦🏽
[14:48:11] Let me test that
[14:56:47] Different kind of error this time. Something about dubious ownership. The repository is owned by gitpuppet already though. Tried with both my user and root:
[14:56:54] https://www.irccloud.com/pastebin/p6TeR1y7
[15:04:24] I think you need to run the command as gitpuppet
[15:24:59] Figured out where the whole issue comes from. When people say `git` here they mean `pgit`. Found the command `alias git=pgit` in the command history. So `git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/48/1114748/2 && git cherry-pick FETCH_HEAD` is really `pgit` and not `git`.
[15:25:59] Either that or `sudo -i -u gitpuppet` and `sudo puppetserver-deploy-code`. Thanks Andrew!
[15:27:34] hey guys, is d.caro around today?
[15:30:11] topranks: no, he is on PTO
[15:30:32] ok np
[15:30:48] it's no big deal, I am making some changes to our gnmi-based stats pipeline
[15:30:54] I think he built some graphs based on that
[15:31:08] unfortunately it means the device tag will change from "target" to "source"
[15:31:37] unfortunately unavoidable, but it brings a big performance boost - I'll feed back on the task
[15:55:37] 👍
[16:47:55] papaul is going to swap drives in a few ceph OSDs, 1002[1-3]. So those are down, nothing to worry about.
[16:56:19] I was just reminded we have some reboots pending T384946
[17:02:22] dhinus: we may need to approve the sender in the mailing software https://usercontent.irccloud-cdn.com/file/bYNVOrLu/image.png
[17:02:39] * arturo offline
[17:02:59] thanks
[17:03:07] I'm not an admin of that list, maybe andrewbogott?
[17:04:31] I do see the emails in my inbox though, that was probably already sorted out
[17:07:43] (context: I fixed the email address for alerts coming from prod to wmcs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114723)
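(Editor's note: pulling the resolution from 14:38-15:25 together, a sketch of the sequence that ends up working. Commands, paths and the gerrit ref are the ones quoted in the log; the exact invocations are as reported there, not verified independently.)

```bash
# On the puppetserver, operate on the repo as gitpuppet so file ownership stays
# correct and git's "dubious ownership" check passes; per the log, this is what
# the pgit wrapper (alias git=pgit) is for.
sudo -i -u gitpuppet
cd /srv/git/operations/puppet
git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/48/1114748/2 \
  && git cherry-pick FETCH_HEAD   # ref quoted from the log, as an example
exit

# Publish the checked-out tree to what the puppet server actually serves to
# agents; this is the step that was missing from the loop sketched earlier.
sudo /usr/local/bin/puppetserver-deploy-code

# On the target host, the agent should now pick up the new commit.
sudo puppet agent --test
```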