[07:28:42] <elukey>	 good news folks, I am finally able to use the istio-proxy sidecar in one of the ORES models
[07:28:58] <elukey>	 it is all a test setup, now I need to productionize everything, but something seems to be working
[07:29:37] <elukey>	 in the meantime, I am going to drain + reimage kubernetes2005 with the new recipe
[07:30:36] <_joe_>	 elukey: so we can start moving ores models away?
[07:41:48] <elukey>	 _joe_ we have already started loading models to lift wing, but then we almost ran out of svc ips and now we need to re-init both clusters with a bigger pool (in the meantime, I was testing the istio sidecar)
[07:41:58] <elukey>	 once we do that, the MVP will be way closer
[07:45:44] <elukey>	 if you are wondering why the svc ips are almost exhausted, https://phabricator.wikimedia.org/T302701
[07:48:16] <_joe_>	 I'm not sure I understand
[07:48:31] <_joe_>	 why do you need a svc ip per revision *and* per pod
[07:48:40] <_joe_>	 a svc ip per revision, I understand
[07:49:11] <elukey>	 nono one for the pod, and many for revisions
[07:49:25] <elukey>	 maybe I have not explained it correctly
[07:49:40] <_joe_>	 "but just to support ORES models we'll have to allocate ~ 100 pods, that may translate into 300 svc IP allocations very easily."
[07:50:00] <_joe_>	 do you mean you'll need 100 *deployments*?
[07:50:09] <_joe_>	 do we have 100 ores models?!?
[07:50:21] <elukey>	 yeah :)
[07:50:52] <elukey>	 there is a project called Phoenix, in collaboration with Research, to re-create those models in a more flexible way
[07:50:59] <elukey>	 but it is a little far in the future
[07:52:07] <elukey>	 every model is a kserve pod, that can have up to 2 past knative revisions (this is configurable, the default is unlimited)
[07:52:25] <elukey>	 and then, if we want to use things like canary traffic split etc., other revisions
[07:52:28] <elukey>	 :)
[07:52:52] <_joe_>	 yeah no sorry, "pod" is the wrong term here
[07:53:10] <_joe_>	 I'd use "deployment" or "function" or "model"
[07:53:20] <_joe_>	 else one thinks it's about every single pod
[07:53:30] <_joe_>	 not deployments
[07:53:55] <elukey>	 in practical terms it is a pod, that corresponds to a specific kserve resource (InferenceService)
[07:54:17] <_joe_>	 it can be many pods doing that work
[07:54:22] <_joe_>	 I hope
[07:54:26] <elukey>	 yes true, I'll use deployment
[07:54:42] <elukey>	 (knative can autoscale based on rps etc..)
[07:54:49] <elukey>	 (without the need of a metric server)
[08:00:59] <wikibugs>	 serviceops, Release-Engineering-Team, SRE, SRE-Access-Requests, Patch-For-Review: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (Joe) Open→Resolved
[08:11:18] <elukey>	 the recipe for kubernetes2005 doesn't work, I see the swap partition
[08:23:20] <elukey>	 trying a new version
[08:37:06] <jayme>	 elukey: I think the new partman recipe we could also use for reimaging the masters, (https://phabricator.wikimedia.org/T299634) right?
[08:38:00] <elukey>	 jayme: I think so 
[10:23:40] <elukey>	 so the partman testing is not going well, for some reason without lvm the swap partition is always added, even if I add the usual config to avoid it
[10:23:52] <elukey>	 I tried various configs, very weird
[10:31:13] <jayme>	 hmm
[10:32:43] <jayme>	 unfortunately I've no idea about partman...but I heard k.ormat has mastered it :p
[10:32:55] <jayme>	 🚌
[10:35:00] <elukey>	 it seems that the recipe that creates the lvm volume works as intended, namely no swap
[10:35:37] <elukey>	 jayme: is it ok to leave kubernetes2005 down for a few more hours (so I can keep debugging) or should I wrap up and just use the root lvm volume for the moment?
[10:46:00] <elukey>	 otherwise we just use the recipe with lvm
[10:46:05] <elukey>	 (the current one basically)
[11:01:42] <jayme>	 elukey: it's fine to keep it down for a while
[11:20:25] <elukey>	 ack so I'll keep working on it for a few hours then
[11:20:31] <elukey>	 (after the lunch break)
[11:33:20] <elukey>	 so after a chat with Filippo, we found the bug, digging into the install logs
[11:33:46] <elukey>	 the flat.cfg recipe, that I copied, has . . instead of . in the first field of the expert recipe
[11:34:02] <elukey>	 that makes the rest completely useless
[11:34:09] <elukey>	 and partman uses its defaults, namely swap
[11:36:10] <elukey>	 kudos to Filippo for the intuition, my soul is really in pain
[13:02:21] <wikibugs>	 serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (JMeybohm) p:Triage→Medium
[13:32:35] <wikibugs>	 serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (fgiunchedi) Thanks Janis for kickstarting the discussion. I more or less guessed the thresholds for critical/warning, def...
[13:32:46] <wikibugs>	 serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (jbond) > I don't know where the 96h come from (maybe that's the cfssl default if nothing is configured on the profile lev...
[13:40:52] <wikibugs>	 serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (JMeybohm) >>! In T303932#7781820, @jbond wrote: >> I don't know where the 96h come from (maybe that's the cfssl default i...
[13:40:54] <wikibugs>	 serviceops, Performance-Team, Patch-For-Review, Technical-Debt: Deprecate "/static/current" at WMF in favour of similar long-cache unversioned /w/ URLs - https://phabricator.wikimedia.org/T302465 (Krinkle) @dancy @joe I'd like to run a thought by you. For much of our frontend handling in Resource...
[13:47:08] <wikibugs>	 serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (jbond) > 264h I don't see in hiera - what is that used for? ok so i told a white lie its actually [[ https://github.com/w...
[13:58:53] <wikibugs>	 serviceops, Performance-Team, Patch-For-Review, Technical-Debt: Deprecate "/static/current" at WMF in favour of similar long-cache unversioned /w/ URLs - https://phabricator.wikimedia.org/T302465 (Krinkle)
[14:02:15] <wikibugs>	 serviceops, CFSSL-PKI, Infrastructure-Foundations, observability: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (JMeybohm) Ah, I see :-) Would we be fine with icinga/alertmanager set to warn at 9 days and critical at 7?
[14:07:56] <wikibugs>	 serviceops, CFSSL-PKI, Infrastructure-Foundations, observability, Patch-For-Review: CertAlmostExpired firing regularly for cert-manager certificates - https://phabricator.wikimedia.org/T303932 (jbond) >>! In T303932#7781970, @JMeybohm wrote: > Would we be fine with icinga/alertmanager set to...
[14:13:25] <elukey>	 ok https://gerrit.wikimedia.org/r/c/operations/puppet/+/771355 and next should be the correct recipes for the kubernetes vms :D
[14:32:26] <elukey>	 I am testing the regular flat.cfg (with the fix) on kubernetes2005 to be sure it works fine, then I'll try the new noswap recipe
[14:32:31] <elukey>	 if nobody disagrees :)
[14:56:01] <wikibugs>	 serviceops, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (cmooney) FYI I don't believe there is any reason E/F would be ruled out for these, if space/power is tight in the existing rows.
[15:06:01] <wikibugs>	 serviceops, Release-Engineering-Team, SRE, SRE-Access-Requests: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (dancy) I verified that I can run docker commands now. Thanks @Joe!
[15:11:54] <wikibugs>	 serviceops, Data-Catalog, Data-Engineering, SRE, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) I have created deployment users and tokens in `profile::kubernetes::infrastructure_users:` key in the private repo, as well as corresponding dummy valu...
[16:04:15] <elukey>	 jayme: kubernetes2005 is ready for a check before uncordon (if you have a moment)
[16:04:28] <jayme>	 elukey: looking
[16:04:30] <elukey>	 the new recipe is what we discussed, no lvm/swap/etc..
[16:07:20] <jayme>	 elukey: I'd have expected an efi partition - am I wrong?
[16:08:14] <jayme>	 looks like I am, never mind
[16:09:49] <jayme>	 elukey: LGTM :)
[16:10:10] <elukey>	 \o/
[16:10:37] <elukey>	 uncordoned
[16:11:22] <elukey>	 jayme: ok if I drain + reimage 2006 ?
[16:11:37] <jayme>	 elukey: sure!
[16:14:01] <elukey>	 then tomorrow I should be able to do 2015 and 16, to complete the cluster
[16:14:20] <jayme>	 ❤️
[16:14:34] <elukey>	 then for eqiad we'll wait for your manager :P :P :P
[16:15:38] <jayme>	 I'm pretty sure he has an IRC highlight for the word manager in this channel by now :-p
[16:18:33] <elukey>	 aaahahh
[16:32:16] <_joe_>	 ahahahaha
[16:32:37] <_joe_>	 jayme: he prefers "boss", just fyi
[16:33:01] <jayme>	 oh, good to know!
[16:33:11] <akosiaris>	 :P
[16:33:26] <akosiaris>	 but yeah, I'll start with the next kube hosts tomorrow
[16:33:30] <akosiaris>	 new*
[16:33:41] <akosiaris>	 I'll reach out with questions!
[16:58:56] <elukey>	 ack!
[16:59:10] <elukey>	 jayme: 2006 ready to be uncordoned
[17:06:55] <elukey>	 (uncordoned)
[17:51:57] <wikibugs>	 serviceops, Performance-Team, Patch-For-Review, Technical-Debt: Deprecate "/static/current" at WMF in favour of similar long-cache unversioned /w/ URLs - https://phabricator.wikimedia.org/T302465 (Krinkle)
[17:52:07] <wikibugs>	 serviceops, Performance-Team, Patch-For-Review, Technical-Debt: Deprecate "/static/current" at WMF in favour of similar long-cache unversioned /w/ URLs - https://phabricator.wikimedia.org/T302465 (Krinkle)
[18:07:52] <bblack>	 hey I've got exactly one cp host (cp6011 in drmrs) with agent disabled for ~10h with the message:
[18:08:00] <bblack>	 {"disabled_message":"dangerous change ahead --joe"}
[18:08:20] <bblack>	 probably there was a timing issue with reimage work today or something, and the removal got missed while it was rebooting or something
[18:08:27] <bblack>	 but I have no context, so I don't want to just blindly remove it
[18:08:35] <bblack>	 any idea what that was?
[18:12:24] <elukey>	 bblack: I suspect https://gerrit.wikimedia.org/r/c/operations/puppet/+/770905/, timing seems to match
[18:13:01] <rzl>	 yeah was about to say
[18:13:27] <rzl>	 only mostly sure it was that patch specifically, but it was definitely work on T302471
[18:13:46] <rzl>	 should be okay to re-enable, your reboot theory sounds right
[18:14:41] <bblack>	 ok, thanks!
[19:47:42] <wikibugs>	 serviceops, Wikimedia-Etherpad: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (Zapipedia-WMF)
[20:13:27] <wikibugs>	 serviceops, SRE, Wikimedia-Etherpad: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (RLazarus) From the time sliders it looks like the issue is that all or part of the pad gets deleted and replaced by a character, at these revisions respectively: - https://etherpad.wikimedia.org/p/T...
[21:26:41] <wikibugs>	 serviceops, Infrastructure-Foundations, Release-Engineering-Team, Scap, Patch-For-Review: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (jbond) Sorry for the slow response on this, there is already a function, wmflib::role_hosts, which does almost what...