[07:28:42] good news folks, I am finally able to use the istio-proxy sidecar in one of the ORES models
[07:28:58] all test setup, now I need to productionize everything, but something seems working
[07:29:37] in the meantime, I am going to drain + reimage kubernetes2005 with the new recipe
[07:30:36] <_joe_> elukey: so we can start moving ores models away?
[07:41:48] _joe_ we have already started loading models to lift wing, but then we almost finished the svc ips and now we need to re-init both clusters with a bigger pool (in the meantime, I was testing the istio sidecar)
[07:41:58] once we do that, the MVP will be way closer
[07:45:44] if you are wondering why the svc ips are almost finished, https://phabricator.wikimedia.org/T302701
[07:48:16] <_joe_> I'm not sure I understand
[07:48:31] <_joe_> why do you need a svc ip per revision *and* per pod
[07:48:40] <_joe_> a svc ip per revision, I understand
[07:49:11] nono one for the pod, and many for revisions
[07:49:25] maybe I have not explained it correctly
[07:49:40] <_joe_> "but just to support ORES models we'll have to allocate ~ 100 pods, that may translate into 300 svc IP allocations very easily."
[07:50:00] <_joe_> do you mean you'll need 100 *deployments*?
[07:50:09] <_joe_> do we have 100 ores models?!?
[07:50:21] yeah :)
[07:50:52] there is a project called Phoenix, in collaboration with Research, to re-create those models in a more flexible way
[07:50:59] but it is a little far in the future
[07:52:07] every model is a kserve pod, that can have up to 2 knative past revisions (this is configurable, the default is unlimited)
[07:52:25] and then if we want to use things like canary traffic split etc..
other revisions )
[07:52:28] :)
[07:52:52] <_joe_> yeah no sorry, "pod" is the wrong term here
[07:53:10] <_joe_> I'd use "deployment" or "function" or "model"
[07:53:20] <_joe_> else one thinks it's about every single pod
[07:53:30] <_joe_> not deployments
[07:53:55] in practical terms it is a pod, that corresponds to a specific kserve resource (InferenceService)
[07:54:17] <_joe_> it can be many pods doing that work
[07:54:22] <_joe_> I hope
[07:54:26] yes true, I'll use deployment
[07:54:42] (knative can autoscale based on rps etc..)
[07:54:49] (without the need of a metric server)
[08:00:59] serviceops, Release-Engineering-Team, SRE, SRE-Access-Requests, Patch-For-Review: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (Joe) Open→Resolved
[08:11:18] the recipe for kubernetes2005 doesn't work, I see the swap partition
[08:23:20] trying a new version
[08:37:06] elukey: I think the new partman recipe we could also use for reimaging the masters, (https://phabricator.wikimedia.org/T299634) right?
[08:38:00] jayme: I think so
[10:23:40] so the partman testing is not going well, for some reason without lvm the swap partition is always added, even if I add the usual config to avoid it
[10:23:52] I tried various configs, very weird
[10:31:13] hmm
[10:32:43] unfortunately I've no idea about partman...but I heard k.ormat has mastered it :p
[10:32:55] 🚌
[10:35:00] it seems that the recipe that creates the lvm volume works as intended, namely no swpa
[10:35:03] *swap
[10:35:37] jayme: is it ok to leave kubernetes2005 down for more hours (so I can keep debugging) or should I wrap up and just use the root lvm volume for the moment?
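Note on the svc IP arithmetic from the 07:48-07:52 exchange: a minimal Python sketch of why ~100 models can turn into ~300 svc IP allocations. It assumes one svc IP per deployment plus one per retained past revision, which is only a reading of the figures quoted in chat. The two CIDR blocks are made up for illustration and are not the real cluster ranges; the usable count in a real Kubernetes service range is also slightly lower than the raw address count.

```python
import ipaddress

# Figures quoted in the chat: ~100 ORES models, each a kserve deployment
# that can keep up to 2 past knative revisions.
MODELS = 100
PAST_REVISIONS = 2


def svc_ips_needed(models: int, past_revisions: int) -> int:
    """One svc IP for the deployment plus one per retained past revision."""
    return models * (1 + past_revisions)


def pool_size(cidr: str) -> int:
    """Total addresses in a CIDR block (upper bound on allocatable svc IPs)."""
    return ipaddress.ip_network(cidr).num_addresses


# ASSUMPTION: both CIDRs are hypothetical, chosen only to show the comparison.
SMALL_POOL = "10.0.0.0/24"
BIGGER_POOL = "10.0.0.0/20"

needed = svc_ips_needed(MODELS, PAST_REVISIONS)
for cidr in (SMALL_POOL, BIGGER_POOL):
    size = pool_size(cidr)
    verdict = "fits" if needed <= size else "does not fit"
    print(f"{cidr}: {size} addresses, {needed} needed -> {verdict}")
```

Canary traffic splits would add further revisions on top of this, so the estimate is a lower bound under the stated assumptions.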
[10:46:00] otherwise we just use the recipe with lvm
[10:46:05] (the current one basically)
[11:01:42] elukey: it's fine to keep it down for a while
[11:20:25] ack so I'll keep working on it for a few hours then
[11:20:31] (after the lunch break)
[11:33:20] so after a chat with Filippo, we found the bug, digging into the install logs
[11:33:46] the flat.cfg recipe, that I copied, has . . instead of . in the first field of the expert recipe
[11:34:02] that makes the rest completely useless
[11:34:09] and partman uses its defaults, namely swap
[11:36:10] kudos to Filippo for the intuition, my soul is really in pain
[13:02:21] serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (JMeybohm) p:Triage→Medium
[13:32:35] serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (fgiunchedi) Thanks Janis for kickstarting the discussion. I more or less guessed the thresholds for critical/warning, def...
[13:32:46] serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (jbond) > I don't know where the 96h come from (maybe that's the cfssl default if nothing is configured on the profile lev...
[13:40:52] serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (JMeybohm) >>! In T303932#7781820, @jbond wrote: >> I don't know where the 96h come from (maybe that's the cfssl default i...
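Note on the partman bug from 11:33-11:34: in a partman-auto expert recipe every partition stanza is closed by a lone "." token, so a doubled ". ." is easy to miss by eye. The Python sketch below flags that pattern. The helper name and both sample recipes are hypothetical (the actual flat.cfg contents are not in this log), and this is not a full recipe parser.

```python
def find_double_terminators(recipe: str) -> list[int]:
    """Return token indices where a '.' terminator directly follows another '.'."""
    # Treat preseed line-continuation backslashes as plain whitespace.
    tokens = recipe.replace("\\", " ").split()
    return [
        i
        for i in range(1, len(tokens))
        if tokens[i] == "." and tokens[i - 1] == "."
    ]


# ASSUMPTION: illustrative recipes only, not the real flat.cfg.
BROKEN = "flat :: 300 300 300 ext4 $primary{ } . . 1000 1000 -1 ext4 method{ format } ."
FIXED = "flat :: 300 300 300 ext4 $primary{ } . 1000 1000 -1 ext4 method{ format } ."

for name, recipe in (("broken", BROKEN), ("fixed", FIXED)):
    hits = find_double_terminators(recipe)
    print(f"{name}: {'double terminator at token(s) ' + str(hits) if hits else 'ok'}")
```

A check like this could run against recipe files before a reimage, since per the chat the symptom (partman silently falling back to its defaults, including swap) only shows up in the install logs.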
[13:40:54] serviceops, Performance-Team, Patch-For-Review, Technical-Debt: Deprecate "/static/current" at WMF in favour of similar long-cache unversioned /w/ URLs - https://phabricator.wikimedia.org/T302465 (Krinkle) @dancy @joe I'd like to run a thought by you. For much of our frontend handling in Resource...
[13:47:08] serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (jbond) > 264h I don't see in hiera - what is that used for? ok so i told a white lie its actually [[ https://github.com/w...
[13:58:53] serviceops, Performance-Team, Patch-For-Review, Technical-Debt: Deprecate "/static/current" at WMF in favour of similar long-cache unversioned /w/ URLs - https://phabricator.wikimedia.org/T302465 (Krinkle)
[14:02:15] serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (JMeybohm) Ah, I see :-) Would we be fine with icinga/alertmanager set to warn at 9 days and critical at 7?
[14:07:56] serviceops, CFSSL-PKI, Infrastructure-Foundations, observability, Patch-For-Review: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (jbond) >>! In T303932#7781970, @JMeybohm wrote: > Would we be fine with icinga/alertmanager set to...
[14:13:25] ok https://gerrit.wikimedia.org/r/c/operations/puppet/+/771355 and next should be the correct recipes for the kubernetes vms :D
[14:32:26] I am testing the regular flat.cfg (with the fix) on kubernetes2005 to be sure it works fine, then I'll try the new noswap recipe
[14:32:31] if nobody disagrees :)
[14:56:01] serviceops, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (cmooney) FYI I don't believe there is any reason E/F would be ruled out for these, if space/power is tight in the existing rows.
[15:06:01] serviceops, Release-Engineering-Team, SRE, SRE-Access-Requests: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (dancy) I verified that I can run docker commands now. Thanks @Joe!
[15:11:54] serviceops, Data-Catalog, Data-Engineering, SRE, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) I have created deployment users and tokens in `profile::kubernetes::infrastructure_users:` key in the private repo, as well as corresponding dummy valu...
[16:04:15] jayme: kubernetes2005 is ready for a check before uncordon (if you have a moment)
[16:04:28] elukey: looking
[16:04:30] the new recipe is what we discussed, no lvm/swap/etc..
[16:07:20] elukey: I'd have expected an efi partition - am I wrong?
[16:08:14] looks like I am, never mind
[16:09:49] elukey: LGTM :)
[16:10:10] \o/
[16:10:37] uncordoned
[16:11:22] jayme: ok if I drain + reimage 2006 ?
[16:11:37] elukey: sure!
[16:14:01] then tomorrow I should be able to do 2015 and 16, to complete the cluster
[16:14:20] ❤️
[16:14:34] then for eqiad we'll wait your manager :P :P :P
[16:15:38] I'm pretty sure he has an IRC highlight for the word manager in this channel by now :-p
[16:18:33] aaahahh
[16:32:16] <_joe_> ahahahaha
[16:32:37] <_joe_> jayme: he prefers "boss", just fyi
[16:33:01] oh, good to know!
[16:33:11] :P
[16:33:26] but yeah, I'll start with the next kube hosts tomorrow
[16:33:30] new*
[16:33:41] I'll reach out with questions!
[16:58:56] ack!
[16:59:10] jayme: 2006 ready to be uncordoned
[17:06:55] (uncordoned)
[17:51:57] serviceops, Performance-Team, Patch-For-Review, Technical-Debt: Deprecate "/static/current" at WMF in favour of similar long-cache unversioned /w/ URLs - https://phabricator.wikimedia.org/T302465 (Krinkle)
[17:52:07] serviceops, Performance-Team, Patch-For-Review, Technical-Debt: Deprecate "/static/current" at WMF in favour of similar long-cache unversioned /w/ URLs - https://phabricator.wikimedia.org/T302465 (Krinkle)
[18:07:52] hey I've got exactly one cp host (cp6011 in drmrs) with agent disabled for ~10h with the message:
[18:08:00] {"disabled_message":"dangerous change ahead --joe"}
[18:08:20] probably there was a timing issue with reimage work today or something, and the removal got missed while it was rebooting or something
[18:08:27] but I have no context, so I don't want to just blindly remove it
[18:08:35] any idea what that was?
[18:12:24] bblack: I suspect https://gerrit.wikimedia.org/r/c/operations/puppet/+/770905/, timing seems to match
[18:13:01] yeah was about to say
[18:13:27] only mostly sure it was that patch specifically, but it was definitely work on T302471
[18:13:46] should be okay to re-enable, your reboot theory sounds right
[18:14:41] ok, thanks!
[19:47:42] serviceops, Wikimedia-Etherpad: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (Zapipedia-WMF)
[20:13:27] serviceops, SRE, Wikimedia-Etherpad: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (RLazarus) From the time sliders it looks like the issue is that all or part of the pad gets deleted and replaced by a character, at these revisions respectively: - https://etherpad.wikimedia.org/p/T...
[21:26:41] serviceops, Infrastructure-Foundations, Release-Engineering-Team, Scap, Patch-For-Review: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (jbond) Sorry for the slow response on this, there is already a function, wmflib::role_hosts, which does almost what...