[05:12:26] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10Joe) Hey @thcipriani that would be correct, although I need to do i...
[09:29:59] <_joe_> hnowlan: I'm now starting to work on your CI issue, sorry, I hoped I'd get to it earlier :/
[09:32:33] no worries!
[09:34:13] <_joe_> to unblock you
[09:34:29] <_joe_> the issue is currently with the addition of both the deployment and the chart in the same change
[09:34:58] <_joe_> that causes issues for the deployment stuff because I think we assumed when diffing that we'd have the chart in the previous change
[09:35:42] <_joe_> so if you split your change in two, I'm confident the issue will go away.
[09:43:48] aha, I think I hit this before
[09:43:56] that should be fine
[09:44:24] <_joe_> ah nevermind
[09:44:35] <_joe_> found an error in the helmfile
[09:45:43] <_joe_> anyways, I strongly suggest you use rake run_locally
[09:45:57] <_joe_> and rake run_locally[check_{deployments,charts}]
[09:46:06] <_joe_> to check what works and what does not
[09:48:24] ah, will do
[09:48:42] <_joe_> hnowlan: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/789876/3..4 this is the fix
[09:49:48] <_joe_> but yes, the error message should be better; I need to wrap errors in diffing so that the output of lint appears in a full run
[09:49:58] <_joe_> that told me what was actually broken
[09:51:45] oh, well that's embarrassing :D
[09:52:18] It'd be neat to wrap the writing of the helmfile.d stuff into the service scaffolding; we're kinda halfway there already, given there's the _example_ helmfile.d folder
[10:03:33] <_joe_> yeah we need to rewrite scaffolding as a whole IMO
[10:04:07] <_joe_> as in, we should make a better-partitioned library of common templates and build the scaffolded chart as a collection of those, better tailored to your specific needs
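A quick aside on the two commands _joe_ recommends above: they are rake tasks in the operations/deployment-charts repository, and the `check_{deployments,charts}` form is just shell brace expansion over the two task arguments. Below is a minimal sketch of running them locally; the task names are the ones quoted in the chat, while the clone URL and directory are assumptions.

```sh
# Sketch: run the deployment-charts CI checks locally before pushing to Gerrit.
# The rake task names are taken from the chat above; the clone URL is an assumption.
git clone "https://gerrit.wikimedia.org/r/operations/deployment-charts"
cd deployment-charts

# Run the full set of local checks (lint plus chart/deployment diffing).
rake run_locally

# Or run only the individual checks mentioned above; quoting stops the shell
# from treating the square brackets as a glob pattern.
rake 'run_locally[check_charts]'
rake 'run_locally[check_deployments]'
```

Per the discussion above, it was the lint output from these checks that pointed at the actual helmfile error fixed in the Gerrit change linked at 09:48:42.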
[11:31:08] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[11:32:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Move away from system:node RBAC role - https://phabricator.wikimedia.org/T299236 (10JMeybohm)
[11:32:34] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[11:35:45] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[12:21:32] 10serviceops, 10MW-on-K8s, 10Kubernetes, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 (10elukey) Hi folks! In T307927 I am trying to figure out why ml-team d...
[12:24:57] 10serviceops: move mw241[2-9].codfw.wmnet into production - https://phabricator.wikimedia.org/T307255 (10Jelto) After yesterday's incident `mw2412` got depooled again to restore the state before the incident (see [SAL](https://sal.toolforge.org/log/5pf5p4ABa_6PSCT9wW7X)). I'm going to adjust this and pool `mw2412...
[13:20:29] 10serviceops, 10MW-on-K8s, 10Kubernetes, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 (10JMeybohm) >>! In T305729#7917199, @elukey wrote: > Hi folks! In T307...
[13:31:07] 10serviceops, 10MW-on-K8s, 10Kubernetes, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 (10Joe) >>! In T305729#7917199, @elukey wrote: > Hi folks! In T307927 I...
[15:43:10] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10thcipriani) >>! In T303857#7916168, @Joe wrote: > Hey @thcipriani t...
[15:59:15] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10jcrespo) If it helps- we have daily /srv backups of the deployment...
[16:30:56] btullis, ottomata: I see in https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&viewPanel=37&editPanel=37&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos that while we have a somewhat better view of things, they remain not really clear. Seeing values of 5s for /robots.txt is insane.
[16:32:25] I am gonna suggest increasing replicas and see how it affects things.
[16:33:14] since these don't seem to be CPU throttled, my only guess currently is that somehow the nodejs event loop becomes hugely backlogged
[16:34:34] increasing replicas should alleviate this, especially in peak hours like right now
[16:35:12] if I am right, we should also see quantiles (p90, p99) for latencies decreasing at ~05:00 UTC
[16:35:21] that would be the other telltale sign
[16:35:27] akosiaris: Thanks for that. Yes, it's like the path/route is irrelevant as regards how long the request takes, at least for those p90, p99 values.
[16:36:22] Do you want me to prepare a CR to increase the replicas?
[16:36:43] btullis: I'd say you can deploy it too if you feel like it.
[16:36:59] it can also wait for tomorrow ofc
[16:37:14] I am wondering what a good number would be
[16:37:27] OK, might be tomorrow morning. What's your recommendation for the increase? Ah, snap. :-)
[16:37:27] I see 12 replicas for e-a-e (eventgate-analytics-external)
[16:37:42] say... we double it for the experiment?
[16:38:06] Yep, happy to do that. We recently doubled it from 6 after an incident.
[16:38:59] figuring out a good number from p90/p99 isn't easy I guess. I wonder if there even is some methodology tbh, I'll have to research that.
[16:39:35] Here is the incident I mentioned: https://docs.google.com/document/d/1xYYzFlJcAP9pckqBWyiXUbs7HN5iThg85lkjv_RUh_o/edit#heading=h.vg6rb6x2eccy
[16:41:20] thanks
[16:42:27] ottomata: similarly, you might want to bump from 5 eventgate-main replicas to 10 or more?
[16:42:43] heck, I'd say you can keep on adding capacity until p99, p90 look ok to both of you.
[16:43:20] assuming ofc that I am correct and adding capacity actually makes a difference
[16:49:19] Well, this will be a good experiment. We can always reduce the replica count if it doesn't work as we hope.
[16:49:43] yup
[17:13:06] ok great, let's see what happens with the e-g-a increase, and if it reduces p90+ then do the same for others
[17:13:10] thanks akosiaris
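For readers reconstructing the change being discussed: doubling eventgate-analytics-external from 12 to 24 replicas is a small values change in deployment-charts followed by a helmfile diff/apply. The sketch below is illustrative only; the values file path, the exact replicas key, and the environment name are assumptions, not taken from the log.

```sh
# Rough sketch of the replica bump discussed above (12 -> 24 for
# eventgate-analytics-external). The directory layout, values key and
# environment name are assumptions; check the chart's values schema.
cd deployment-charts/helmfile.d/services/eventgate-analytics-external

# Edit the replica count in the service's values file, e.g. something like:
#   resources:
#     replicas: 24
"${EDITOR:-vi}" values.yaml

# Preview the rendered change for one environment, then apply it.
helmfile -e eqiad diff
helmfile -e eqiad apply
```

In practice the values change would go through Gerrit review first (the CR btullis offers to prepare above), and the replica count can be dialed back the same way if the experiment does not help.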
[18:55:17] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10cloud-services-team (Kanban): Assess GitLab-provided docker container registry as a default for docker-in-docker build processes - https://phabricator.wikimedia.org/T307537 (10Dzahn)
[18:58:32] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10cloud-services-team (Kanban): Assess GitLab-provided docker container registry as a default for docker-in-docker build processes - https://phabricator.wikimedia.org/T307537 (10dancy)
[19:43:14] 10serviceops, 10GitLab: import subversion repos from Phabricator into Gitlab - https://phabricator.wikimedia.org/T308061 (10Dzahn)
[19:43:37] 10serviceops, 10GitLab: import subversion repos from Phabricator into Gitlab - https://phabricator.wikimedia.org/T308061 (10Dzahn)
[20:08:54] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10cloud-services-team (Kanban): Assess GitLab-provided docker container registry as a default for docker-in-docker build processes - https://phabricator.wikimedia.org/T307537 (10Dzahn) For "docker-registry.wikimedia.org...