[05:12:26] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10Joe) Hey @thcipriani that would be correct, although I need to do i...
[09:29:59] <_joe_> hnowlan: I'm now starting to work on your CI issue, sorry, I hoped I'd get to it earlier :/
[09:32:33] no worries!
[09:34:13] <_joe_> to unblock you
[09:34:29] <_joe_> the issue is currently with the addition of both the deployment and the chart in the same change
[09:34:58] <_joe_> that causes issues for the deployment stuff because I think we assumed when diffing that we'd have the chart in the previous change
[09:35:42] <_joe_> so if you split your change in two, I'm confident the issue will go away.
[09:43:48] aha, I think I hit this before
[09:43:56] that should be fine
[09:44:24] <_joe_> ah nevermind
[09:44:35] <_joe_> found an error in the helmfile
[09:45:43] <_joe_> anyways, I strongly suggest you use rake run_locally
[09:45:57] <_joe_> and rake run_locally[check_{deployments,charts}]
[09:46:06] <_joe_> to check what works and what does not
[09:48:24] ah, will do
[09:48:42] <_joe_> hnowlan: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/789876/3..4 this is the fix
[09:49:48] <_joe_> but yes, the error message should be better; I need to wrap errors in diffing so that the output of lint appears in a full run
[09:49:58] <_joe_> that told me what was actually broken
[09:51:45] oh, well that's embarrassing :D
[09:52:18] It'd be neat to wrap the writing of the helmfile.d stuff into the service scaffolding; we're kinda halfway there already, given there's the _example_ helmfile.d folder
[10:03:33] <_joe_> yeah we need to rewrite scaffolding as a whole IMO
[10:04:07] <_joe_> as in, we should make a better-partitioned library of common templates and build the scaffolded chart as a collection of those, better tailored to your specific needs
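A quick aside on the two commands _joe_ recommends above: they are rake tasks in the operations/deployment-charts repository, and the `check_{deployments,charts}` form is just shell brace expansion over the two task arguments. Below is a minimal sketch of running them locally; the task names are the ones quoted in the chat, while the clone URL and directory are assumptions.

```sh
# Sketch: run the deployment-charts CI checks locally before pushing to Gerrit.
# The rake task names are taken from the chat above; the clone URL is an assumption.
git clone "https://gerrit.wikimedia.org/r/operations/deployment-charts"
cd deployment-charts

# Run the full set of local checks (lint plus chart/deployment diffing).
rake run_locally

# Or run only the individual checks mentioned above; quoting stops the shell
# from treating the square brackets as a glob pattern.
rake 'run_locally[check_charts]'
rake 'run_locally[check_deployments]'
```

Per the discussion above, it was the lint output from these checks that pointed at the actual helmfile error fixed in the Gerrit change linked at 09:48:42.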
[11:31:08] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[11:32:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Move away from system:node RBAC role - https://phabricator.wikimedia.org/T299236 (10JMeybohm)
[11:32:34] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[11:35:45] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[12:21:32] 10serviceops, 10MW-on-K8s, 10Kubernetes, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 (10elukey) Hi folks! In T307927 I am trying to figure out why ml-team d...
[12:24:57] 10serviceops: move mw241[2-9].codfw.wmnet into production - https://phabricator.wikimedia.org/T307255 (10Jelto) After yesterday's incident `mw2412` got depooled again to restore the state before the incident (see [SAL](https://sal.toolforge.org/log/5pf5p4ABa_6PSCT9wW7X)). I'm going to adjust this and pool `mw2412...
[13:20:29] 10serviceops, 10MW-on-K8s, 10Kubernetes, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 (10JMeybohm) >>! In T305729#7917199, @elukey wrote: > Hi folks! In T307...
[13:31:07] 10serviceops, 10MW-on-K8s, 10Kubernetes, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 (10Joe) >>! In T305729#7917199, @elukey wrote: > Hi folks! In T307927 I...
[15:43:10] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10thcipriani) >>! In T303857#7916168, @Joe wrote: > Hey @thcipriani t...
[15:59:15] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10jcrespo) If it helps- we have daily /srv backups of the deployment...
[16:30:56] btullis, ottomata: I see in https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&viewPanel=37&editPanel=37&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos that while we have a somewhat better view of things, they remain not really clear. Seeing values of 5s for /robots.txt is insane.
[16:32:25] I am gonna suggest increasing replicas and see how it affects things.
[16:33:14] since these don't seem to be CPU throttled, my only guess currently is that somehow the nodejs event loop becomes hugely backlogged
[16:34:34] increasing replicas should alleviate this, especially in peak hours like right now
[16:35:12] if I am right, we should also see quantiles (p90, p99) for latencies decreasing at ~05:00 UTC
[16:35:21] that would be the other telltale sign
[16:35:27] akosiaris: Thanks for that. Yes, it's like the path/route is irrelevant as regards how long the request takes, at least for those p90, p99 values.
[16:36:22] Do you want me to prepare a CR to increase the replicas?
[16:36:43] btullis: I'd say you can deploy it too if you feel like it.
[16:36:59] it can also wait for tomorrow ofc
[16:37:14] I am wondering what a good number would be
[16:37:27] OK, might be tomorrow morning. What's your recommendation for the increase? Ah, snap. :-)
[16:37:27] I see 12 replicas for e-a-e (eventgate-analytics-external)
[16:37:42] say... we double it for the experiment?
[16:38:06] Yep, happy to do that. We recently doubled it from 6 after an incident.
[16:38:59] figuring out a good number from p90/p99 isn't easy I guess. I wonder if there even is some methodology tbh, I'll have to research that.
[16:39:35] Here is the incident I mentioned: https://docs.google.com/document/d/1xYYzFlJcAP9pckqBWyiXUbs7HN5iThg85lkjv_RUh_o/edit#heading=h.vg6rb6x2eccy
[16:41:20] thanks
[16:42:27] ottomata: similarly, you might want to bump from 5 eventgate-main replicas to 10 or more?
[16:42:43] heck, I'd say you can keep on adding capacity until p99, p90 look ok to both of you.
[16:43:20] assuming ofc that I am correct and adding capacity actually makes a difference
[16:49:19] Well, this will be a good experiment. We can always reduce the replica count if it doesn't work as we hope.
[16:49:43] yup
[17:13:06] ok great, let's see what happens with the e-g-a increase, and if it reduces p90+ then do the same for others
[17:13:10] thanks akosiaris
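For readers reconstructing the change being discussed: doubling eventgate-analytics-external from 12 to 24 replicas is a small values change in deployment-charts followed by a helmfile diff/apply. The sketch below is illustrative only; the values file path, the exact replicas key, and the environment name are assumptions, not taken from the log.

```sh
# Rough sketch of the replica bump discussed above (12 -> 24 for
# eventgate-analytics-external). The directory layout, values key and
# environment name are assumptions; check the chart's values schema.
cd deployment-charts/helmfile.d/services/eventgate-analytics-external

# Edit the replica count in the service's values file, e.g. something like:
#   resources:
#     replicas: 24
"${EDITOR:-vi}" values.yaml

# Preview the rendered change for one environment, then apply it.
helmfile -e eqiad diff
helmfile -e eqiad apply
```

In practice the values change would go through Gerrit review first (the CR btullis offers to prepare above), and the replica count can be dialed back the same way if the experiment does not help.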
[18:55:17] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10cloud-services-team (Kanban): Assess GitLab-provided docker container registry as a default for docker-in-docker build processes - https://phabricator.wikimedia.org/T307537 (10Dzahn)
[18:58:32] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10cloud-services-team (Kanban): Assess GitLab-provided docker container registry as a default for docker-in-docker build processes - https://phabricator.wikimedia.org/T307537 (10dancy)
[19:43:14] 10serviceops, 10GitLab: import subversion repos from Phabricator into Gitlab - https://phabricator.wikimedia.org/T308061 (10Dzahn)
[19:43:37] 10serviceops, 10GitLab: import subversion repos from Phabricator into Gitlab - https://phabricator.wikimedia.org/T308061 (10Dzahn)
[20:08:54] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10cloud-services-team (Kanban): Assess GitLab-provided docker container registry as a default for docker-in-docker build processes - https://phabricator.wikimedia.org/T307537 (10Dzahn) For "docker-registry.wikimedia.org...