[06:45:30] hello folks!
[08:31:29] Hi elukey!
[09:04:20] o/
[09:36:40] Morning everyone
[09:45:59] Morning Tobias :)
[09:46:30] \o
[09:47:43] aiko: btw, does the UK switch to DST as well?
[09:49:02] morning
[09:49:05] !
[09:56:48] klausman: yes the UK does change to DST, but I'm in the Netherlands, so I already changed time forward 1 hr
[09:57:30] And probably again in two weeks?
[09:58:57] klausman: yeah that's right 🙃
[10:00:32] Ah DST. One of my least favorite things about timekeeping
[10:00:45] Timezones I get, there's a point there. But DST? Nope.
[10:01:03] Even the leap second, as annoying as it is for SREs and sysadmins, has a point.
[10:04:22] going to reboot the codfw cluster nodes for kernel upgrades
[10:05:35] Including control plane?
[10:11:48] klausman: only the bullseye nodes, https://phabricator.wikimedia.org/T303179 (there are the new ml-staging-ctrl nodes as well)
[10:12:37] Ah, right
[10:12:45] I can do the staging machines
[10:13:57] super
[10:17:48] ctrl is done, now on to nodes
[10:21:34] (CR) Elukey: [C: +1] draftquality: remove transformer code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768784 (https://phabricator.wikimedia.org/T294419) (owner: Accraze)
[10:22:13] (CR) Elukey: [C: +1] draftquality: add http error response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768807 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[10:23:15] (CR) Elukey: [C: +1] articlequality: add http error handling [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768812 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[10:37:24] All done.
[10:37:56] elukey: if you want help with the remaining ml machines in codfw, lmk
[10:39:32] klausman: thanks, should be fine! I am doing some experiments with the eqiad cluster so I'll probably reboot it this afternoon
[10:39:46] klausman: can you reboot the ml-cache nodes too?
[10:40:18] Yes, captain
[10:57:42] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (elukey) I tried to do the following experiment on ml-serve-eqiad: * cordon all worker nodes except ml-serve1001 * copy (via scp) the istio-cni and istio-...
[10:57:56] I added some info about my experiments with the cni plugin in --^
[10:58:08] I think that istio requires the install-cni daemonset, sigh
[10:58:20] at least if we want to keep our sanity with istioctl
[11:14:37] Cache hosts all done. Will give that ticket a read
[11:42:14] ack thanks
[11:42:21] codfw cluster restarted, going to lunch!
[11:42:49] same :)
[11:44:03] (CR) Kevin Bazira: [C: +2] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768807 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[11:47:51] (CR) Kevin Bazira: [C: +2] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768784 (https://phabricator.wikimedia.org/T294419) (owner: Accraze)
[11:49:06] (CR) Kevin Bazira: [C: +2] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768812 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[11:55:48] (Merged) jenkins-bot: draftquality: remove transformer code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768784 (https://phabricator.wikimedia.org/T294419) (owner: Accraze)
[11:55:50] (Merged) jenkins-bot: draftquality: add http error response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768807 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[13:44:47] eqiad cluster rebooted
[13:49:04] I had a chat with Joseph earlier on and the Cassandra idea seems to be something doable
[13:49:19] we'll need to figure out the data model (the most efficient one) but it is worth a test
[14:21:50] Morning all!
[14:22:03] Sounds like a good test
[14:29:06] morning!
[14:29:21] I summarized the use case in the task and asked Eric
[14:29:35] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (elukey) @Eevans Hi! I'd need some help with the initial config of the Cassandra ML cluster if you have time :) We have three nodes in eqiad and three in codfw (2x2TB SSDs,...
[14:58:23] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (JMeybohm) I'm still not completely sold, sorry. :) AIUI `istio_cni` is not strictly required for the service mesh to work (because there is this other way...
[15:08:32] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (elukey) >>! In T297612#7760212, @JMeybohm wrote: > I'm still not completely sold, sorry. :) No need, if we can find a solution that doesn't require the i...
[15:10:12] Morning, Chris
[15:10:25] elukey: you say eqiad. What about the nodes in codfw? Are they done?
[15:11:01] (if not, I can do them later today)
[15:11:26] klausman: it was done before lunch (see msg below)
[15:11:37] Roger
[15:21:22] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (akosiaris) >>! In T302701#7756193, @ayounsi wrote: > @elukey detailed me the situation over IRC, thanks! > > @akosiaris those reserved prefixes make sen...
[15:29:28] Machine-Learning-Team, serviceops: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (elukey)
[15:29:39] interesting weirdness --^
[15:45:15] (PS2) AikoChou: editquality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/767494 (https://phabricator.wikimedia.org/T301766)
[15:46:23] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (Jclark-ctr) @Cmjohnson These are using sfp-t adapter and are only 1g name rack Unit Port CableID ms-be1068 e2 25u 25 2013339101799 ms-be1069 e2 25u 25...
[15:47:00] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[15:47:15] (CR) jerkins-bot: [V: -1] editquality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/767494 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[15:47:55] (PS3) AikoChou: editquality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/767494 (https://phabricator.wikimedia.org/T301766)
[15:49:36] (CR) jerkins-bot: [V: -1] editquality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/767494 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[16:08:02] (PS2) Accraze: articlequality: add http error handling [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768812 (https://phabricator.wikimedia.org/T300270)
[16:08:41] klausman: IIRC you mentioned that the ip subnets were discussed the other week during the team meeting when I was not there, but we never synced about what was discussed :)
[16:08:57] we can get a /18 from https://netbox.wikimedia.org/ipam/prefixes/378/
[16:09:10] Oh, we didn't straight up discuss it, I only gave an overview of our concerns at that time
[16:09:24] ah ok
[16:09:24] Neato. /18 should last us a while
[16:09:48] so /21 for pods and /20 for svc as proposed in the task?
[16:09:59] or more?
[16:12:43] Machine-Learning-Team, artificial-intelligence, editquality-modeling, Hindi-Sites, Patch-For-Review: Train and test editquality models for Hindi Wikipedia - https://phabricator.wikimedia.org/T252581 (Halfak) Open→Resolved a:Halfak Looks like this is resolved.
[16:13:57] Let me do a little back-of-the-envelope thinking
[16:14:29] it is ~2k ips for pods and ~4k for svcs
[16:14:44] if we reach 2k pods on the cluster I think that we may be in trouble :D
[16:14:52] I think you're right
[16:15:38] So that's two /20s and two /21s to cover eqiad and codfw. What do we do with staging?
[16:16:14] ah nono that one is only for eqiad
[16:16:39] But I'd like to keep codfw as similar as possible
[16:16:42] for staging we can maybe use smaller pools (from the original k8s pool)
[16:16:56] yeah there is another /18 for codfw
[16:17:14] Yeah, using the original pool for staging sounds like a good idea. We'll never need the same amount of service capacity there
[16:17:53] exactly, I think we can safely deploy fewer pods in staging, one or two for each kind
[16:19:06] Then I think going with one /18 each for codfw and eqiad serving, and using the normal pool for codfw staging is the right approach
[16:20:57] Machine-Learning-Team, artificial-intelligence, Growth community maintenance, editquality-modeling, Hindi-Sites: Enable ORES in RecentChanges for Hindi Wikipedia - https://phabricator.wikimedia.org/T303293 (Halfak)
[16:20:57] ack thanks, I'll try to reserve two /18 and get back to the task
[16:21:23] Machine-Learning-Team, artificial-intelligence, Edit-Review-Improvements-RC-Page, Growth community maintenance, and 3 others: Enable ORES in RecentChanges for Hindi Wikipedia - https://phabricator.wikimedia.org/T303293 (Halfak)
[16:26:33] (CR) Klausman: [C: +1] articlequality: add http error handling [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/768812 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[16:30:27] * elukey bbiab
[21:25:26] (PS1) Accraze: topic: revert to non-transformer architecture [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/769121 (https://phabricator.wikimedia.org/T294419)
[21:40:26] (PS1) Accraze: topic: add http error handling [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/769122 (https://phabricator.wikimedia.org/T300270)
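
Note on the subnet sizing discussed between 16:09 and 16:19: the "~2k ips for pods and ~4k for svcs" figures follow directly from the prefix lengths (/21 and /20) carved out of a /18. A minimal sketch to check the arithmetic with Python's standard `ipaddress` module is below. The `10.0.0.0/18` prefix and the `carve` helper are hypothetical, chosen only for illustration; the real prefixes are the ones reserved in Netbox, and the actual layout inside the /18 was not specified in the chat.

```python
import ipaddress

# Hypothetical /18 used only for illustration; the real prefix lives in Netbox.
cluster_pool = ipaddress.ip_network("10.0.0.0/18")


def carve(pool, pod_prefixlen=21, svc_prefixlen=20):
    """Take the first /20 for services and the first non-overlapping /21 for pods.

    This placement is an assumption made for the example, not the team's plan.
    """
    svc_net = next(pool.subnets(new_prefix=svc_prefixlen))
    pod_net = next(
        net
        for net in pool.subnets(new_prefix=pod_prefixlen)
        if not net.overlaps(svc_net)
    )
    return pod_net, svc_net


pod_net, svc_net = carve(cluster_pool)
spare = cluster_pool.num_addresses - pod_net.num_addresses - svc_net.num_addresses

print(f"pods: {pod_net} -> {pod_net.num_addresses} addresses")
print(f"svcs: {svc_net} -> {svc_net.num_addresses} addresses")
print(f"left in the /18 for future growth: {spare} addresses")
```

With one /18 per serving cluster, as agreed at 16:19, roughly 10k addresses remain unallocated in each pool after the pod and service ranges are taken.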