[07:29:01] <wikibugs>	 serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (elukey) >>! In T288198#9035228, @dancy wrote: > Another case of failing to push a large image: T342084.  Is it possible to configure NGIN...
[09:16:09] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[09:16:20] <wikibugs>	 serviceops, Data Engineering and Event Platform Team, MW-on-K8s, SRE, Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (Joe) In progress→Resolved Now rdf-streaming-updater uses the read-only endpoint on k8s, meaning it...
[09:26:07] <wikibugs>	 serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (akosiaris) >>! In T288198#9035228, @dancy wrote: > Another case of failing to push a large image: T342084.  Is it possible to configure N...
[09:44:23] <wikibugs>	 serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (akosiaris) >>! In T288198#9036874, @elukey wrote: >>>! In T288198#9035228, @dancy wrote: >> Another case of failing to push a large image...
[09:55:36] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (Clement_Goubert) We'll first make the move to 2% of traffic, then ramp up from there during the week.
[10:02:07] <wikibugs>	 serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (JMeybohm) >>! In T297314#9035364, @cmassaro wro...
[11:11:26] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) >>! In T341463#9014217, @Quiddity wrote: > Thanks for the draft, appreciated! I've [[https://meta.wikimedia.org/wiki/Tech/News/2023/29#Tech_News:_...
[11:11:38] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:11:50] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) In progress→Resolved
[12:12:26] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jclark-ctr)
[12:15:31] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install kubernetes10[25-54] - https://phabricator.wikimedia.org/T342533 (RobH)
[12:15:58] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install kubernetes10[25-54] - https://phabricator.wikimedia.org/T342533 (RobH)
[12:21:25] <wikibugs>	 serviceops, DC-Ops, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (RobH)
[12:21:52] <wikibugs>	 serviceops, DC-Ops, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (RobH)
[12:31:31] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye
[12:36:43] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye
[12:39:53] <wikibugs>	 serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (CodeReviewBot) jforrester opened https://gitlab...
[12:41:19] <wikibugs>	 serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) >>! In T297314#9037120, @JMeyb...
[13:22:26] <inflatador>	 Does anyone know if we support/use partial clone on gitlab? re: https://docs.gitlab.com/ee/topics/git/partial_clone.html
[13:28:38] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**)   - Remov...
[13:28:41] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**)   - Remov...
[13:30:45] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye
[13:30:50] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye
[13:44:56] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**)   - Remov...
[13:44:58] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**)   - Remov...
[13:48:52] <elukey>	 hi folks
[13:49:39] <elukey>	 something interesting - I am trying to debug some 503s returned by the ores-legacy's tls proxy, and after setting logging to debug I noticed some connect timeouts before the 503s (why they don't clearly indicate that in the response is not clear to me, but..)
[13:50:08] <elukey>	 I checked the config_dump and for "inference" I see a connect_timeout set to 0.250s
[13:50:28] <elukey>	      "cluster": {
[13:50:28] <elukey>	       "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
[13:50:31] <elukey>	       "name": "inference",
[13:50:33] <elukey>	       "type": "STRICT_DNS",
[13:50:36] <elukey>	       "connect_timeout": "0.250s",
[13:50:47] <elukey>	 is that supposed to be so tight? Or is there a way to tune it?
[13:51:41] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye
[13:51:47] <elukey>	 ah no I see it is hardcoded in the mesh.configuration._cluster bit, if I see it correctly
[13:52:38] <elukey>	 I'd need to possibly set it a little higher, anything against me adding a tunable for it?
[13:53:59] <jayme>	 hmm...we even have a 1s connect timeout in the admin interface cluster config
[13:54:43] <elukey>	 it may be a pebcak on my side
[13:55:14] <_joe_>	 0.25 s connect timeout doesn't seem like "tight" to me for internal services
[13:55:54] <jayme>	 Just to be sure: This is a config from the tls-proxy of something having the "inference" listener enabled, right?
[13:55:59] <wikibugs>	 serviceops, Data Products, RESTbase Sunsetting, Code-Health-Objective, Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (Jgiannelos)
[13:56:04] <_joe_>	 it's unreasonable for a tcp connection to take 0.25 seconds to be established inside a single datacenter
[13:56:27] <_joe_>	 if it happens commonly, we need to understand why
[13:56:33] <elukey>	 we need to mock some calls that ores makes, in which a user can request multiple rev-ids to be scored at once - if they are too many, ores-legacy will have to call lift wing multiple times
[13:56:49] <elukey>	 and the more requests the higher the possibility of a delay
[13:56:58] <elukey>	 this is why I'd like to have a less tight timeout
[13:57:06] <elukey>	 or at least configurable
[13:59:56] <elukey>	 jayme: yes sorry exactly
[14:00:11] <elukey>	 I checked the config via the envoy proxy's localhost port
[14:00:11] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye
[14:00:14] <elukey>	 via nsenter
[14:00:20] <elukey>	 ("config_dump")
[14:02:54] <jayme>	 elukey: what I don't get is why more requests means more delay in the *tcp connection* to inference
[14:03:14] <jayme>	 maybe I don't get which is which :D but lift wing == inference, right?
[14:03:43] <elukey>	 exactly yes, but sometimes we have ores-like requests from ores-legacy that can ask for 100 scores
[14:03:46] <wikibugs>	 serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (CodeReviewBot) apine merged https://gitlab.wiki...
[14:04:06] <elukey>	 that translates into a lot of calls to Lift Wing, since we have a different model than ORES 
[14:04:26] <elukey>	 and the ores models on lift wing are not super fast, let's say :)
[14:04:39] <elukey>	 so my read is that connections pile up and they are delayed
[14:06:04] <elukey>	 we are going to verify if our lift wing code can accept multiple rev-ids in the same request, and possibly open a single request to the mw api for all of them
[14:06:29] <elukey>	 this could probably improve the issue, but there are a lot of weird calls that ores-legacy needs to mock
[14:07:34] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**)   - Remov...
[14:11:27] <jayme>	 I would have expected that the tls-proxy pools/re-uses connections to inference
[14:11:49] <jayme>	 (but it feels like I'm missing something obvious :))
[14:12:33] <elukey>	 jayme: nono it should happen but there is an idle time in theory, connections are not kept alive for a long time
[14:12:51] <elukey>	 so I guess that the pool needs to be created when we fire 100-like requests
[14:14:41] <akosiaris>	 elukey: 0.25s of a connect timeout isn't tight as Giuseppe says. We do have 1 case where we witnessed it being reached https://phabricator.wikimedia.org/T292663 but it happened in the p99 of latency buckets and we worked around it with a retry
[14:15:22] <akosiaris>	 if you notice a lot of connect timeouts from service mesh, we need to debug it more I 'd say
[14:16:58] <akosiaris>	 btw, this panel https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?forceLogin&from=1684424094488&orgId=1&to=1684516973613&var-datasource=eqiad%20prometheus%2Fops&var-destination=shellbox-syntaxhighlight&var-origin=parsoid&var-origin_instance=All&viewPanel=24 will tell you if you have issues like that
[14:17:06] <jayme>	 which application were you debugging? It should in theory be possible to spot connectivity issues on the https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s 
[14:17:22] <akosiaris>	 and the k8s one that jayme points out ofc
[14:17:29] <akosiaris>	 we should at some point have those be a single one
[14:17:42] <akosiaris>	 it's happening ofc via mw-on-k8s :-)
[14:19:49] <elukey>	 I think it is a little tight, but I do see that there is not a lot of space for a debate so I'll find different ways
[14:20:18] <elukey>	 using the retry may help but it is masking an issue, imho
[14:20:38] <akosiaris>	 why do you feel it's tight though ? 
[14:21:36] <elukey>	 because of the use case above, we need to mock a one-to-many (a lot) request via ores-legacy, and we know that the model servers are not that fast
[14:22:06] <elukey>	 it is a little tight for my specific use case, but I agree that it is a good default
[14:22:10] <akosiaris>	 to accept a TCP handshake? We aren't talking about serving the request, just doing the 3way handshake
[14:22:17] <akosiaris>	 it's just a connect timeout 
[14:22:43] <akosiaris>	 this is before even a single line of HTTP protocol is sent. 
[14:22:53] <elukey>	 yep I know :)
[14:23:08] <akosiaris>	 and the servers aren't fast enough for the 3 way handshake? 
[14:23:43] <akosiaris>	 ok, I need some more info in this, I am struggling a bit 
[14:23:56] <elukey>	 for 100 in parallel, or more, it may be delayed a bit, but again I'll find another way
[14:24:06] <elukey>	 thanks for the brainbounce
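
A small sketch of the distinction discussed above: a connect timeout bounds only TCP connection establishment, not how long the server takes to answer a request. The helper name `connect_with_retry`, the single retry, and the local test listener are assumptions made for illustration; this uses plain Python sockets and is not how the Envoy-based mesh implements its `connect_timeout` or its retries.

```python
import socket
import time

CONNECT_TIMEOUT_S = 0.25  # mirrors the "0.250s" value quoted from the config dump


def connect_with_retry(host: str, port: int, retries: int = 1) -> float:
    """Open a TCP connection, retrying on failure.

    Returns the seconds spent on the attempt that succeeded.
    """
    last_error = None
    for _ in range(retries + 1):
        started = time.monotonic()
        try:
            # The timeout bounds the connect() call, i.e. the handshake.
            sock = socket.create_connection((host, port), timeout=CONNECT_TIMEOUT_S)
        except OSError as exc:  # covers timeouts and refused connections
            last_error = exc
            continue
        elapsed = time.monotonic() - started
        sock.close()
        return elapsed
    raise ConnectionError(f"could not connect to {host}:{port}") from last_error


def demo() -> float:
    # A listener that never calls accept() and never sends a byte: the
    # kernel still completes the handshake, so the connect succeeds even
    # though the "application" behind the socket is not responding.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(8)
    port = server.getsockname()[1]
    try:
        return connect_with_retry("127.0.0.1", port)
    finally:
        server.close()


if __name__ == "__main__":
    print(f"handshake took {demo():.6f}s")
```

On a local listener the handshake completes well inside the timeout even though nothing ever reads from the socket, which is the point made in the discussion: slow request handling alone should not trip a connect timeout.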
[14:46:36] <James_F>	 For Wikifunctions services, both the orchestrator and evaluator prod images have wmf-certificates in them now.
[14:47:23] <James_F>	 Perhaps claime it's a local routing issue (do we have the right port?) or similar?
[14:52:10] <jayme>	 I'll take that ping for c.laime :-)
[14:52:46] <jayme>	 James_F: am I assuming correctly that the error message posted in phab is from the orchestrator calling the evaluator?
[14:52:54] <James_F>	 Yes.
[14:53:12] <James_F>	 Though apine is in shell right now and I'm elsewhere.
[14:53:57] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**)   - Remov...
[14:54:14] <jayme>	 Then it makes ofc no difference that the evaluator now has the certs (but I suppose it will need them at some point anyways as IIRC it will make calls to mw-api)
[14:55:04] <taavi>	 sorry if this is a stupid question: why are those calls not routed via envoy? (in which case envoy would deal with TLS cert verification)
[14:55:27] <James_F>	 Those should always be from the orchestrator, the evaluator should never make outbound requests except to the orchestrator (and not for the initial launch).
[14:58:39] <jayme>	 taavi: not at all. With how it currently works that would require the evaluator to be set up in e.g. the service-catalog as a single entity which is not really required/desired as it is in fact tightly coupled with its orchestrator instance
[15:00:02] <jayme>	 James_F: ah, okay. I'm about to go into meetings for about 2h - so I might not be able to take a closer look today but please make sure your http client uses the proper ca-certs for validation
[15:03:32] <James_F>	 Ack.
[15:04:17] <James_F>	 Can one of you merge/deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/940245 (and https://gerrit.wikimedia.org/r/c/operations/puppet/+/940246 though that's not needed until tomorrow)?
[15:07:30] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jclark-ctr)
[15:30:20] <James_F>	 And of course we need https://gerrit.wikimedia.org/r/c/operations/puppet/+/940152 deployed. ;-)
[15:49:31] <wikibugs>	 serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (TheresNoTime) >>! In T340087#9027766, @akosiaris wrote: >>>! In T340087#9027714, @TheresNoTime wrote: >>>>! In T340087#9008668, @akosiaris wrote: >>> [....
[15:58:55] <wikibugs>	 serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (akosiaris) OK, scheduling for tomorrow then, https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_July_25.
[17:00:18] <wikibugs>	 serviceops, Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (akosiaris) Open→Resolved a:akosiaris This is apparently stopped happening [yesterday, the 23rd of July ~9:00am](https://grafana.wikimedia.org/...
[17:23:51] <jayme>	 James_F: / apine: any news on the certificate validation side?
[17:24:27] <James_F>	 jayme: No. :-( Perhaps we've got the wrong exit port/service domain?
[17:28:31] <jayme>	 I don't think so. The one in the error message is correct and it validates fine
[17:29:52] <apine>	 I see that the heartbeat probe is working fine. How are we hitting that endpoint?
[17:30:00] <apine>	 (heartbeat for the evaluator)
[17:30:47] <jayme>	 if you're talking about the healthcheck configured in k8s you're accessing your application directly, without TLS
[17:31:51] <jayme>	 I just double checked the certificate your service is serving and it is correct. I would still assume your http client does not have the proper ca loaded
[17:32:57] <James_F>	 Right.
[17:33:19] <James_F>	 And we can't just use HTTP?
[17:33:22] <jayme>	 AIUI it's nodejs and that seems to be "special" https://phabricator.wikimedia.org/T249633
[17:33:48] <James_F>	 Eurgh, lovely.
[17:34:05] <jayme>	 James_F: Traffic is not supposed to be unencrypted, no.
[17:34:32] <James_F>	 This isn't traffic, it's intentionally PII-stripped so we could run this in AWS/etc. in the future, but ah well.
[17:35:38] <akosiaris>	 what part of the traffic flow is failing again?
[17:35:51] <jayme>	 akosiaris: orchestrator -> evaluator
[17:36:06] <James_F>	 Which is effectively a localhost callback?
[17:36:14] <jayme>	 nope
[17:36:22] <James_F>	 Oh, sorry, "local" cluster.
[17:36:26] <James_F>	 Not host.
[17:36:33] <jayme>	 it might be a different machine somewhere indeed
[17:36:38] * James_F nods.
[17:36:42] <akosiaris>	 I was about to ask. This isn't via localhost cause no service-mesh? 
[17:36:57] <jayme>	 akosiaris: yeah, exactly
[17:37:11] <akosiaris>	 and no service mesh cause no service::catalog, right ? 
[17:37:34] <James_F>	 Yeah.
[17:37:40] <jayme>	 yes. And ultimately no real requirement for that except for node being...node
[17:37:40] <akosiaris>	 maybe we should revisit that ? Overall apps shouldn't need to know how to speak TLS 
[17:37:44] <James_F>	 Should we just put it in the service catalogue?
[17:38:00] <akosiaris>	 that's what we got the service mesh for, so that we don't have to fight node
[17:38:09] <akosiaris>	 or python or whatever
[17:38:56] <jayme>	 yep...then we'll have to have a second ingress plus two service-catalog entries (one for orchestrator and one for evaluator)
[17:39:31] <akosiaris>	 sigh
[17:39:33] <jayme>	 effectively adding a hop (the ingress)
[17:39:55] <jayme>	 or find a way to support clusterip services as listeners in the mesh
[17:40:12] <jayme>	 (which would be the "right thing" I suppose)
[17:40:18] <akosiaris>	 yes, that's the clean solution
[17:40:36] <James_F>	 But hard to get done tonight.
[17:40:47] <akosiaris>	 btw, jayme: https://gerrit.wikimedia.org/r/#/c/940152/ updated, addressing your comments
[17:42:22] <akosiaris>	 jayme: it's late, let's reconvene tomorrow EU morning to figure out how to address this best
[17:42:41] <akosiaris>	 in the meantime, James_F: I guess the NODE_EXTRA_CA_CERTS trick should get you covered for the launch?
[17:42:44] <jayme>	 I'd say the quickest solution is to provide the proper certs to nodejs as done in the ticket for changeprop ... 
[17:42:50] <jayme>	 yes, that
[17:43:02] <James_F>	 akosiaris: I'm not sure how to do that, but I'll give it a go.
[17:43:45] <akosiaris>	 James_F: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/589570/ should be a good guide 
[17:44:14] <James_F>	 Ack.
[17:44:50] <James_F>	 akosiaris: Also if you could +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/940245 (or tell us it's not OK) that'd be great.
[17:46:15] <jayme>	 that is quite comprehensive. You should be gtg by just adding something to the values.yaml like
[17:46:23] <jayme>	 config:
[17:46:24] <jayme>	   public:
[17:46:26] <jayme>	     NODE_EXTRA_CA_CERTS: /etc/ssl/certs/wmf-ca-certificates.crt
[17:46:38] <James_F>	 And add to the template?
[17:46:46] <jayme>	 nothing
[17:46:57] <jayme>	 that's basic functionality of the container templating
[17:46:58] <akosiaris>	 ah, the files exist already, right ?
[17:47:14] <James_F>	 Ah, neat.
[17:47:16] <akosiaris>	 yeah what jayme says will work in that case
[17:47:17] <jayme>	 yes. wmf-certificates package manages the file
[17:48:42] <James_F>	 Ack.
[17:48:52] <apine>	 To be clear, we'd need to do this just in the orchestrator, correct? 
[17:48:57] <akosiaris>	 James_F: reviewed, PCCed and merged
[17:49:03] <James_F>	 akosiaris: Awesome, thank you!
[17:49:06] <jayme>	 apine: yes
[17:49:13] <jayme>	 well, no
[17:49:22] <James_F>	 evaluator not orchestrator
[17:49:25] <James_F>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940977
[17:49:28] <jayme>	 you said the evaluator will call back to the orchestrator?
[17:50:03] <James_F>	 Oh, so both?
[17:50:12] <James_F>	 Eval won't call orch yet, that's weeks/months away.
[17:50:12] <jayme>	 if your thing calls some wmf infrastructure, it will need to trust our CAs
[17:50:45] <jayme>	 then "not yet" is probably correct :)
[17:51:10] <apine>	 Ah, okay. The calling back will be via websocket, so maybe we don't need both?
[17:52:14] <James_F>	 It probably won't hurt to put it in both.
[17:53:21] <akosiaris>	 James_F: I 've also run PCC on https://gerrit.wikimedia.org/r/c/operations/puppet/+/940246 and I 'll +1 but I 'd rather deploy tomorrow
[17:53:30] <James_F>	 akosiaris: Totally.
[17:53:34] <jayme>	 I -1ed the change you sent, see comment in gerrit
[17:53:43] <James_F>	 jayme: Thanks.
[17:54:24] * akosiaris off
[17:54:35] <James_F>	 Thanks again.
[18:04:46] <jayme>	 James_F: LGTM - with that I'm off as well for today. Please add anything that might come up (or that works now :)) to the task and we'll catch up tomorrow EU morning
[18:06:31] <James_F>	 <3
[18:07:37] <James_F>	 Bingo, the evaluator worked.
[18:08:34] <jayme>	 nice - o/
[18:12:13] <wikibugs>	 serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF)
[18:13:01] <wikibugs>	 serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (Jdforrester-WMF)
[20:52:53] <wikibugs>	 serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) Open→In progress a:...
[20:53:24] <wikibugs>	 serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) a:Jdforrester-WMF→cmass...
[20:58:35] <wikibugs>	 serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (Jdforrester-WMF) OK, so situation as I understand it right now at 2023-07-24Z20:55 is: * SRE to deploy ** [[https://gerrit.wiki...
[22:42:30] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[23:12:30] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**)   - Rem...
[23:35:46] <wikibugs>	 serviceops, Abstract Wikipedia team, Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (Jdforrester-WMF)