[08:24:00] volans: Amir1: Heads up, I'm about to redirect 0.5% of global traffic to mw-on-k8s
[08:24:14] wohoo
[08:24:15] I've added you as reviewers of the revert patch in case of problem
[08:24:17] claime: ack thx for the notice!
[08:24:21] thanks
[08:24:23] I'll send an email to ops as well
[09:32:11] 🍿
[09:36:30] Krinkle: ack, I see denisse already assisted (thank you!)
[09:40:34] vgutierrez: lol
[09:41:00] This is like 30rps to mw-api-ext and 15rps to mw-web :P
[09:48:43] claime: slow eating popcorns then ;P
[09:48:51] :D
[09:49:22] But it does validate that we can change our default traffic spread, and exclude some domains
[09:49:28] So that's good data :)
[09:56:05] yup :)
[11:03:30] I could use a puppet patch to be merged to fix up the way we run docker-pkg to publish Docker dev images. The git repo is currently shared by multiple users and that causes various problems (umask, different unix groups, git safe.directory bailing out any git operation). So the fix is to use a system user shared by all deployers :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/927975
[11:03:45] which should only affect the contint server :]
[11:16:02] hashar: on it, I suppose you need me to run the post-merge commands that are in the commit message ?
[11:16:24] claime: yeah I will do them :)
[11:16:30] ack
[11:16:45] it is a bit tricky and I am not entirely sure I have the proper commands :)
[11:17:02] * hashar grabs a coffee
[11:17:36] hashar: merged
[11:19:20] merci!
[11:20:59] Git::Systemconfig[safe.directory-srv-dev-images]: has no parameter named 'ensure' ..
[11:21:18] ~_~
[11:28:05] claime: https://gerrit.wikimedia.org/r/c/operations/puppet/+/934268 contint: rm git safe.directory for dev-images
[11:28:21] I screwed up cause git::systemconfig does not have an ensure parameter and the PCC did not catch that :]
[11:28:32] I will clean it up manually
[11:30:41] I need someone to +2 it though
[11:33:34] that was "removed" in the previous change
[11:33:35] :)
[11:33:58] ik
[11:34:08] and not needed anymore since the previous change switched the repo to be owned by the same user that will run the command ;]
[11:34:12] claime: I'm happy to do that. Seems fine to me.
[11:34:31] btullis: s'ok I just +2'd it myself
[11:34:45] I'm a rebel.
[11:34:49] then to be fair, git safe.directory is broken when a repo is shared between users of the same group. The use case has not been taken into account
[11:34:54] 👍 👍
[11:35:14] hashar: merged
[11:35:28] internally the code only compares the file uid ownership vs user effective uid but when the repo is shared it should compare the gid instead
[11:35:58] in the end I blame the catalogue compiler :-]
[11:55:30] claime: that is working all fine. Thank you!
[14:24:21] <_joe_> volans, Amir1 https://logstash.wikimedia.org/app/dashboards#/view/74557260-a88f-11ed-96bb-4b4732aa077a
[14:24:47] _joe_: what's up?
[14:24:49] * volans opening
[14:25:03] <_joe_> sorry, I should've given context
[14:25:13] <_joe_> it's the new slow log dashboard for mw on k8s
[14:25:22] "slow" being 5s
[14:25:33] oh nice!
[14:25:42] I was worried something wrong was going on :D
[14:28:37] let me see
[14:30:10] cool
[14:30:19] that looks useful
[14:33:59] https://logstash.wikimedia.org/app/dashboards#/view/018bde90-a08d-11ed-8137-c3b9b9c0225e
[14:34:04] That's the accesslog
[14:47:51] heads-up: I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/933473, which is a NOOP but a big change that affects all hosts since it changes how we generate the NTP peers
[14:48:07] I am on on-call as well if that helps but if you see something breaking, please let me know
[14:48:10] thanks
[14:48:18] <_joe_> volans, Amir1, sukhe: I would like to merge https://gerrit.wikimedia.org/r/c/operations/alerts/+/884039
[14:48:33] <_joe_> it's a paging alert, which will mostly go off when we have our surges of swift errors
[14:48:47] <_joe_> if you want to review it, be my guests
[14:49:09] <_joe_> oh denisse as well is oncall, just not in the topic yet :)
[14:49:11] _joe_: go for it!
[14:49:28] as long as it has a runbook link :D
[14:55:38] <_joe_> volans: err it's pretty hard for this specific alert
[14:55:42] <_joe_> "go look at the backend"
[14:56:07] yeah I figured
[14:58:00] have fun
[14:58:54] <_joe_> Amir1: no, YOU have fun!
[14:59:51] :****
[15:00:00] Thank you for all the fun and joy you bring to me
[15:01:44] https://gerrit.wikimedia.org/r/c/operations/puppet/+/933473 is now merged on a few hosts, no issues so far
[15:01:55] I have more dangerous DNS-related change today and then I will be done
[15:02:02] if you see something breaking, please shout
[15:04:34] sukhe: wait for an hour before deploying the rest so I would be out of oncall
[15:04:41] lol
[15:05:41] Amir1: I promise to collect all the pieces if I break it! but done
[15:05:48] :D
[15:06:12] :P Thanks <3
[15:10:03] <_joe_> the alert is live. if you all get paged you can blame me
[15:10:32] I won't blame you because I know you will respond to it :*
[15:10:45] so far it looks fine :D
[15:10:47] don't worry we blame you also without the alert
[15:10:52] haha
[15:23:35] If we do have another batch of swift failures when v.gutierrez and I are around, we'd quite like to do a bit of debugging before restarting the frontends. I realise this means it'll only ever happen when we're not...
[15:24:31] <_joe_> Emperor: the good news is - now we'll be notified when it happens
[15:28:45] maybe some trigger could be executed based on that alert?
[15:29:19] preparing some commands and running them à la pt-stalk
[16:31:31] denisse: I'm around for xhgui deploy, any time is fine :)
[18:12:55] denisse: hi! I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/933497. if there is a page, I will take care of it but thought you should be aware
[18:12:58] thanks!
[18:13:09] sukhe: ACK, thanks. :)
[18:39:51] denisse: all done :)
[18:41:39] sukhe: Awesome!! Glad it worked out. :D
[18:42:05] yes, indeed! while I like excitement, I don't want to break all DNS servers in one go!
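
Note on the 11:34:49 and 11:35:28 messages: the ownership rule described there (compare the repository's owning uid with the caller's effective uid) can be sketched as below, next to the group-aware variant the message suggests. This is a rough illustration only, not git's actual code, which handles further cases. The function names, the sample uids and gids, and the idea of accepting any group the caller belongs to are all assumptions made for the example.

```python
import os


def owned_by_uid(file_uid, effective_uid):
    """uid-only rule as described in the log: the owner must be the caller."""
    return file_uid == effective_uid


def owned_by_uid_or_group(file_uid, file_gid, effective_uid, group_ids):
    """Hypothetical relaxed rule: also accept a repo owned by one of the caller's groups."""
    return file_uid == effective_uid or file_gid in group_ids


def check_path(path):
    """Apply both rules to a real path for the current process (POSIX only)."""
    st = os.stat(path)
    groups = set(os.getgroups()) | {os.getegid()}
    return (
        owned_by_uid(st.st_uid, os.geteuid()),
        owned_by_uid_or_group(st.st_uid, st.st_gid, os.geteuid(), groups),
    )


if __name__ == "__main__":
    # Made-up ids: repo owned by uid 1001 / gid 3000, caller is uid 1002 in group 3000.
    print(owned_by_uid(1001, 1002))                               # uid-only rule rejects
    print(owned_by_uid_or_group(1001, 3000, 1002, {1002, 3000}))  # group-aware rule accepts
```

With a single system user owning the repository and running the commands, as the merged patch does, the uid-only rule passes and no safe.directory entry is needed.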