[07:29:01] serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (elukey) >>! In T288198#9035228, @dancy wrote: > Another case of failing to push a large image: T342084. Is it possible to configure NGIN...
[09:16:09] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[09:16:20] serviceops, Data Engineering and Event Platform Team, MW-on-K8s, SRE, Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (Joe) In progress→Resolved Now rdf-streaming-updater uses the read-only endpoint on k8s, meaning it...
[09:26:07] serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (akosiaris) >>! In T288198#9035228, @dancy wrote: > Another case of failing to push a large image: T342084. Is it possible to configure N...
[09:44:23] serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (akosiaris) >>! In T288198#9036874, @elukey wrote: >>>! In T288198#9035228, @dancy wrote: >> Another case of failing to push a large image...
[09:55:36] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (Clement_Goubert) We'll first make the move to 2% of traffic, then ramp up from there during the week.
[10:02:07] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (JMeybohm) >>! In T297314#9035364, @cmassaro wro...
[11:11:26] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) >>! In T341463#9014217, @Quiddity wrote: > Thanks for the draft, appreciated! I've [[https://meta.wikimedia.org/wiki/Tech/News/2023/29#Tech_News:_...
[11:11:38] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:11:50] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) In progress→Resolved
[12:12:26] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jclark-ctr)
[12:15:31] serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install kubernetes10[25-54] - https://phabricator.wikimedia.org/T342533 (RobH)
[12:15:58] serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install kubernetes10[25-54] - https://phabricator.wikimedia.org/T342533 (RobH)
[12:21:25] serviceops, DC-Ops, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (RobH)
[12:21:52] serviceops, DC-Ops, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (RobH)
[12:31:31] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye
[12:36:43] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye
[12:39:53] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (CodeReviewBot) jforrester opened https://gitlab...
[12:41:19] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) >>! In T297314#9037120, @JMeyb...
[13:22:26] Does anyone know if we support/use partial clone on gitlab? re: https://docs.gitlab.com/ee/topics/git/partial_clone.html
[13:28:38] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**) - Remov...
[13:28:41] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Remov...
[13:30:45] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye
[13:30:50] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye
[13:44:56] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Remov...
[13:44:58] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**) - Remov...
[13:48:52] hi folks
[13:49:39] something interesting - I am trying to debug some 503s returned by the ores-legacy's tls proxy, and after setting logging to debug I noticed some connect timeouts before 503s (why don't they clearly indicate that in the response is not clear to me but..)
[13:50:08] I checked the config_dump and for "inference" I see a connect_timeout set to 0.250s
[13:50:28] "cluster": {
[13:50:28] "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
[13:50:31] "name": "inference",
[13:50:33] "type": "STRICT_DNS",
[13:50:36] "connect_timeout": "0.250s",
[13:50:47] is that supposed to be so tight? Or is there a way to tune ot?
[13:50:52] *it?
[13:51:41] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye
[13:51:47] ah no I see it is hardcoded in the mesh.configuration._cluster bit, if I see it correctly
[13:52:38] I'd need to possibly set it a little higher, anything against me adding a tunable for it?
[13:53:59] hmm...we even have a 1s connect timeout in the admin interface cluster config
[13:54:43] it may be a pebcak on my side
[13:55:14] <_joe_> 0.25 s connect timeout doesn't seem "tight" to me for internal services
[13:55:54] Just to be sure: This is a config from the tls-proxy of something having the "inference" listener enabled, right?
[13:55:59] serviceops, Data Products, RESTbase Sunsetting, Code-Health-Objective, Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (Jgiannelos)
[13:56:04] <_joe_> it's unreasonable for a tcp connection to take 0.25 seconds to be established inside a single datacenter
[13:56:27] <_joe_> if it happens commonly, we need to understand why
[13:56:33] we need to mock some calls that ores makes, in which a user can request multiple rev-ids to be scored at once - if they are too many, ores-legacy will have to call lift wing multiple times
[13:56:49] and the more requests the higher the possibility of a delay
[13:56:58] this is why I'd like to have a less tight timeout
[13:57:06] or at least configurable
[13:59:56] jayme: yes sorry exactly
[14:00:11] I checked the config via the envoy proxy's localhost port
[14:00:11] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye
[14:00:14] via nsenter
[14:00:20] ("config_dump")
[14:02:54] elukey: what I don't get is why more requests means more delay in the *tcp connection* to inference
[14:03:14] maybe I don't get which is which :D but lift wing == inference, right?
[14:03:43] exactly yes, but sometimes we have ores-like requests from ores-legacy that can ask for 100 scores
[14:03:46] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (CodeReviewBot) apine merged https://gitlab.wiki...
[14:04:06] that translates into a lot of calls to Lift Wing, since we have a different model than ORES
[14:04:26] and the ores models on lift wing are not super fast, let's say :)
[14:04:39] so my read is that connections pile up and they are delayed
[14:06:04] we are going to verify if our lift wing code can accept multiple rev-ids in the same request, and possibly open a single request to the mw api for all of them
[14:06:29] this could probably improve the issue, but there are a lot of weird calls that ores-legacy needs to mock
[14:07:34] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Remov...
[14:11:27] I would have expected that the tls-proxy pools/re-uses connections to inference
[14:11:49] (but it feels like I'm missing something obvious :))
[14:12:33] jayme: nono it should happen but there is an idle time in theory, connections are not kept alive for a long time
[14:12:51] so I guess that the pool needs to be created when we fire 100-like requests
[14:14:41] elukey: 0.25s of a connect timeout isn't tight as Giuseppe says. We do have 1 case where we witnessed it being reached https://phabricator.wikimedia.org/T292663 but it happened in the p99 of latency buckets and we worked around it with a retry
[14:15:22] if you notice a lot of connect timeouts from service mesh, we need to debug it more I'd say
[14:16:58] btw, this panel https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?forceLogin&from=1684424094488&orgId=1&to=1684516973613&var-datasource=eqiad%20prometheus%2Fops&var-destination=shellbox-syntaxhighlight&var-origin=parsoid&var-origin_instance=All&viewPanel=24 will tell you if you have issues like that
[14:17:06] which application were you debugging? It should in theory be possible to spot connectivity issues on the https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s
[14:17:22] and the k8s one that jayme points out ofc
[14:17:29] we should at some point have those be a single one
[14:17:42] it's happening ofc via mw-on-k8s :-)
[14:19:49] I think it is a little tight, but I do see that there is not a lot of space for a debate so I'll find different ways
[14:20:18] using the retry may help but it is masking an issue, imho
[14:20:38] why do you feel it's tight though?
[14:21:36] because of the use case above, we need to mock a one-to-many (a lot) request via ores-legacy, and we know that the model servers are not that fast
[14:22:06] it is a little tight for my specific use case, but I agree that it is a good default
[14:22:10] to accept a TCP handshake? We aren't talking about serving the request, just doing the 3-way handshake
[14:22:17] it's just a connect timeout
[14:22:43] this is before even a single line of HTTP protocol is sent.
[14:22:53] yep I know :)
[14:23:08] and the servers aren't fast enough for the 3-way handshake?
[14:23:43] ok, I need some more info on this, I am struggling a bit
[14:23:56] for 100 in parallel, or more, it may be delayed a bit, but again I'll find another way
[14:24:06] thanks for the brainbounce
[14:46:36] For Wikifunctions services, both the orchestrator and evaluator prod images have wmf-certificates in them now.
[14:47:23] Perhaps claime it's a local routing issue (do we have the right port?) or similar?
[14:52:10] I'll take that ping for c.laime :-)
[14:52:46] James_F: am I assuming correctly that the error message posted in phab is from the orchestrator calling the evaluator?
[14:52:54] Yes.
[14:53:12] Though apine is in shell right now and I'm elsewhere.
[14:53:57] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**) - Remov...
[14:54:14] Then ofc it makes no difference that the evaluator now has the certs (but I suppose it will need them at some point anyways as IIRC it will make calls to mw-api)
[14:55:04] sorry if this is a stupid question: why are those calls not routed via envoy? (in which case envoy would deal with TLS cert verification)
[14:55:27] Those should always be from the orchestrator, the evaluator should never make outbound requests except to the orchestrator (and not for the initial launch).
[14:58:39] taavi: not at all. With how it currently works that would require the evaluator to be set up in e.g. the service-catalog as a single entity which is not really required/desired as it is in fact tightly coupled with its orchestrator instance
[15:00:02] James_F: ah, okay. I'm about to go into meetings for about 2h - so I might not be able to take a closer look today but please make sure your http client uses the proper ca-certs for validation
[15:03:32] Ack.
[15:04:17] Can one of you merge/deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/940245 (and https://gerrit.wikimedia.org/r/c/operations/puppet/+/940246 though that's not needed until tomorrow)?
[15:07:30] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jclark-ctr)
[15:30:20] And of course we need https://gerrit.wikimedia.org/r/c/operations/puppet/+/940152 deployed. ;-)
[15:49:31] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (TheresNoTime) >>! In T340087#9027766, @akosiaris wrote: >>>! In T340087#9027714, @TheresNoTime wrote: >>>>! In T340087#9008668, @akosiaris wrote: >>> [....
[15:58:55] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (akosiaris) OK, scheduling for tomorrow then, https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_July_25.
[17:00:18] serviceops, Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (akosiaris) Open→Resolved a:akosiaris This has apparently stopped happening [yesterday, the 23rd of July ~9:00am](https://grafana.wikimedia.org/...
[17:23:51] James_F: / apine: any news on the certificate validation side?
[17:24:27] jayme: No. :-( Perhaps we've got the wrong exit port/service domain?
[17:28:31] I don't think so. The one in the error message is correct and it validates fine
[17:29:52] I see that the heartbeat probe is working fine. How are we hitting that endpoint?
[17:30:00] (heartbeat for the evaluator)
[17:30:47] if you're talking about the healthcheck configured in k8s you're accessing your application directly, without TLS
[17:31:51] I just double checked the certificate your service is serving and it is correct. I would still assume your http client does not have the proper ca loaded
[17:32:57] Right.
[17:33:19] And we can't just use HTTP?
[17:33:22] AIUI it's nodejs and that seems to be "special" https://phabricator.wikimedia.org/T249633
[17:33:48] Eurgh, lovely.
[17:34:05] James_F: Traffic is not supposed to be unencrypted, no.
[17:34:32] This isn't traffic, it's intentionally PII-stripped so we could run this in AWS/etc. in the future, but ah well.
[17:35:38] what part of the traffic flow is failing again?
[17:35:51] akosiaris: orchestrator -> evaluator
[17:36:06] Which is effectively a localhost callback?
[17:36:14] nope
[17:36:22] Oh, sorry, "local" cluster.
[17:36:26] Not host.
[17:36:33] it might be a different machine somewhere indeed
[17:36:38] * James_F nods.
[17:36:42] I was about to ask. This isn't via localhost cause no service-mesh?
[17:36:57] akosiaris: yeah, exactly
[17:37:11] and no service mesh cause no service::catalog, right?
[17:37:34] Yeah.
[17:37:40] yes. And ultimately no real requirement for that except for node being...node
[17:37:40] maybe we should revisit that? Overall apps shouldn't need to know how to speak TLS
[17:37:44] Should we just put it in the service catalogue?
[17:38:00] that's what we got the service mesh for, so that we don't have to fight node
[17:38:09] or python or whatever
[17:38:56] yep...then we'll have to have a second ingress plus two service-catalog entries (one for orchestrator and one for evaluator)
[17:39:31] sigh
[17:39:33] effectively adding a hop (the ingress)
[17:39:55] or find a way to support clusterip services as listeners in the mesh
[17:40:12] (which would be the "right thing" I suppose)
[17:40:18] yes, that's the clean solution
[17:40:36] But hard to get done tonight.
[17:40:47] btw, jayme: https://gerrit.wikimedia.org/r/#/c/940152/ updated with your comments
[17:40:57] updated addressing your comments
[17:42:22] jayme: it's late, let's reconvene tomorrow EU morning to figure out how to address this best
[17:42:41] in the meantime, James_F: I guess the NODE_EXTRA_CA_CERTS trick should get you covered for the launch?
[17:42:44] I'd say the quickest solution is to provide the proper certs to nodejs as done in the ticket for changeprop ...
[17:42:50] yes, that
[17:43:02] akosiaris: I'm not sure how to do that, but I'll give it a go.
[17:43:45] James_F: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/589570/ should be a good guide
[17:44:14] Ack.
[17:44:50] akosiaris: Also if you could +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/940245 (or tell us it's not OK) that'd be great.
[17:46:15] that is quite comprehensive. You should be gtg by just adding something to the values.yaml like
[17:46:23] config:
[17:46:24]   public:
[17:46:26]     NODE_EXTRA_CA_CERTS: /etc/ssl/certs/wmf-ca-certificates.crt
[17:46:38] And add to the template?
[17:46:46] nothing
[17:46:57] that's basic functionality of the container templating
[17:46:58] ah, the files exist already, right?
[17:47:14] Ah, neat.
[17:47:16] yeah what jayme says will work in that case
[17:47:17] yes. wmf-certificates package manages the file
[17:48:42] Ack.
[17:48:52] To be clear, we'd need to do this just in the orchestrator, correct?
[17:48:57] James_F: reviewed, PCCed and merged
[17:49:03] akosiaris: Awesome, thank you!
[17:49:06] apine: yes
[17:49:13] well, no
[17:49:22] evaluator not orchestrator
[17:49:25] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940977
[17:49:28] you said the evaluator will call back to the orchestrator?
[17:50:03] Oh, so both?
[17:50:12] Eval won't call orch yet, that's weeks/months away.
[17:50:12] if your thing calls some wmf infrastructure, it will need to trust our CAs
[17:50:45] then "not yet" is probably correct :)
[17:51:10] Ah, okay. The calling back will be via websocket, so maybe we don't need both?
[17:52:14] It probably won't hurt to put it in both.
[17:53:21] James_F: I've also run PCC on https://gerrit.wikimedia.org/r/c/operations/puppet/+/940246 and I'll +1 but I'd rather deploy tomorrow
[17:53:30] akosiaris: Totally.
[17:53:34] I -1ed the change you sent, see comment in gerrit
[17:53:43] jayme: Thanks.
[17:54:24] * akosiaris off
[17:54:35] Thanks again.
[18:04:46] James_F: LGTM - with that I'm off as well for today. Please add anything that might come up (or that works now :)) to the task and we'll catch up tomorrow EU morning
[18:06:31] <3
[18:07:37] Bingo, the evaluator worked.
[18:08:34] nice - o/
[18:12:13] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF)
[18:13:01] serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (Jdforrester-WMF)
[20:52:53] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) Open→In progress a:...
[20:53:24] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) a:Jdforrester-WMF→cmass...
[20:58:35] serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (Jdforrester-WMF) OK, so situation as I understand it right now at 2023-07-24Z20:55 is: * SRE to deploy ** [[https://gerrit.wiki...
[22:42:30] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[23:12:30] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Rem...
[23:35:46] serviceops, Abstract Wikipedia team, Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (Jdforrester-WMF)