[07:11:56] hello folks
[07:12:00] created https://github.com/wikimedia/ores/pull/357 for the logging error
[07:12:38] elukey: o/
[07:13:09] when you get a minute please help review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/766565/ it's been sitting for a while.
[07:13:13] thanks
[07:16:09] hey kevinbazira o/
[07:16:49] so I think that we should pause loading models for a bit due to https://phabricator.wikimedia.org/T302701
[07:17:00] we have to re-init the clusters :(
[07:17:07] I still have no idea how to do it
[07:17:23] but if we keep loading models we'll likely saturate svc ips
[07:18:58] Oh ok ... I understand.
[07:19:07] please remember to give me a green light whenever we are ready to proceed. thanks
[07:19:28] definitely yes
[07:19:55] I think that we should try to figure out a rough estimate of #pods #svcs that we'll want in the future
[07:20:06] so we'll be able to size the ip pool accordingly
[07:20:13] but it is not an easy task
[07:22:50] I hear you
[07:32:03] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) @Ciell thanks a lot! I created http...
[07:47:25] Machine-Learning-Team: revscoring feature extraction error for Wikidata - https://phabricator.wikimedia.org/T302851 (elukey)
[07:49:33] Machine-Learning-Team: revscoring feature extraction error for Wikidata - https://phabricator.wikimedia.org/T302851 (elukey)
[07:49:52] kevinbazira: I opened --^ now that we have more logging, I see a lot of feature extraction errors for wikidata
[07:50:06] it seems all related to itemquality, not sure if it will affect lift wing or not
[07:50:13] (are we going to load itemquality models?)
[07:51:30] I am not sure I have seen itemquality models before
[07:57:05] the closest reference I could find to item quality models is here: https://github.com/wikimedia/articlequality/blob/master/CHANGELOG.md#added-1 in the articlequality change log
[08:00:06] there is one item quality model in the articlequality repo https://github.com/wikimedia/articlequality/blob/master/models/wikidatawiki.item_quality.gradient_boosting.model and Amir was the last person to work on it.
[08:02:06] elukey do you think Amir or Aaron would be able to help with the errors in #T302851?
[08:04:12] my current understanding was that all ORES models would be moved to LW unless the team agrees not to move specific ones.
[08:07:23] yeah I added Aaron to the task, let's see if he answers
[08:10:04] Machine-Learning-Team, artificial-intelligence, editquality-modeling, Hindi-Sites, Patch-For-Review: Train and test editquality models for Hindi Wikipedia - https://phabricator.wikimedia.org/T252581 (elukey)
[08:10:18] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) In progress→Resolved a:el...
[08:11:53] Machine-Learning-Team, ORES: revscoring feature extraction error for Wikidata - https://phabricator.wikimedia.org/T302851 (elukey)
[08:21:05] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create etcd cluster for ml-serve-staging k8s - https://phabricator.wikimedia.org/T302197 (elukey)
[08:21:27] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create etcd cluster for ml-serve-staging k8s - https://phabricator.wikimedia.org/T302197 (elukey) p:Triage→Medium a:klausman
[08:22:07] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create ml-serve-staging k8s's control plane VMs - https://phabricator.wikimedia.org/T302198 (elukey) p:Triage→Medium a:klausman
[08:22:26] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (elukey) p:Triage→Medium a:klausman
[08:22:48] klausman: o/ I have assigned https://phabricator.wikimedia.org/T302195 and subtasks to you, so we can keep track of progress in there
[08:31:17] Lift-Wing, Machine-Learning-Team (Active Tasks): Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (elukey) @achou I think that we can start with doing the same and using that trim function, IIUC we use revscoring extensively in our model.py code s...
[08:35:21] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: Add editquality isvc configurations to ml-services helmfile - https://phabricator.wikimedia.org/T301415 (kevinbazira) This task has been paused because if we continue loading mode...
[09:55:25] elukey: ack
[09:55:59] elukey: continuing with VM-making as we speak. Once I have ctrl200[12], I'll set up etcd, then the k8s ctrl VMs and so on.
[09:57:32] klausman: sure, if you create new tasks make sure to add them as subtasks of the ones that we created (so we have an easy way to find what was done in the future)
[09:58:04] elukey: Aye. Also, is there a downside to creating the insetup part of the site.pp config before the VMs are actually there?
[09:58:08] i.e. alerting etc
[09:58:31] Otherwise I'll do the MACs+Partman config+site.pp in one change
[09:59:01] in theory no, it shouldn't cause troubles IIRC
[09:59:23] Alright, code review incoming in 5-10m (once I have the second MAC)
[10:03:32] elukey: oh, I just realized something. Do we care that the VMs are all still on Buster?
[10:04:19] klausman: IIRC we discussed using bullseye for them, and adjusting puppet if needed (there will probably be apt packages to copy to bullseye-wikimedia)
[10:04:45] Hrm.
[10:05:17] I presume there is no Puppetized upgrade path for Buster->Bullseye? I'll have to re-image them?
[10:12:10] yep exactly, we are going to transition to Bullseye soon so we'll have to do it for staging sooner or later
[10:12:25] Where is the distro for imaging even configured?
[10:12:25] we can wait if you want but soon it will be needed
[10:12:38] Nah, I'd rather fix it now
[10:12:56] in the dhcp config, I didn't think about it.. bullseye is not the default yet, buster is
[10:13:12] modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 ?
[10:13:49] `option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/bullseye-installer/";` looks like it's what I need
[10:14:26] exactly yes
[10:18:50] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (klausman)
[10:20:40] 767478 is ready for review
[10:23:40] klausman: for etcd we may need to wait, there is no cluster on Bullseye yet..
[10:24:34] and I think it is a little more complicated, we can use staging later as test env to migrate in theory
[10:24:48] I don't recall exactly how the etcd packages are created etc..
[10:25:08] but if you want to quickly check and see let's do it, we'll be the first to adjust puppet if needed
[10:28:01] "Yay" :)
[10:28:36] Maybe just the ctrl plane ones as Bullseye, leave etcd on buster for now, yeah. I'll adjust the patch
[10:30:31] (also saves me from having to do the reimage dance for the etcd VMs right now)
[10:31:08] +1
[10:37:52] Ok, ready for your scathing critique
[11:03:17] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) Thanks a lot Alex, I'll open a task to reinit both clusters :( (no idea how to do it, will document myself) The use case of kserve is that every...
[11:04:08] added some thoughts about ip pools in --^
[11:04:25] once we decide what to do, we can think about re-init the clusters (sigh)
[11:07:54] reading up on comments
[11:13:15] Yeah, I think separating-out our IP ranges from the "normal" k8s ones might be a good idea.
[11:15:14] I wonder to what extent IP pools need to be unique _outside_ of k8s
[11:15:35] I know that madness lies that (IP duplication) way, but y'know, rocks and hard places.
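
Editor's sketch on the IP pool sizing question raised at 07:19 and again at 11:13: a rough way to go from an estimated number of services to a pool size, and to check that a candidate range does not collide with ranges already in use. Everything here is an assumption for illustration only: the service count, the headroom factor, the `10.0.x.x` ranges and the helper names are hypothetical and do not come from T302701 or from the actual ml-serve configuration.

```python
import ipaddress
from typing import Iterable


def smallest_prefix_for(count: int, headroom: float = 2.0) -> int:
    """Return the longest IPv4 prefix whose block holds count * headroom addresses.

    headroom is a hypothetical safety factor, not a value from the task.
    """
    needed = max(1, int(count * headroom))
    # (needed - 1).bit_length() is the number of host bits required to
    # address `needed` distinct values, using integer math only.
    host_bits = (needed - 1).bit_length()
    return 32 - host_bits


def overlaps_any(candidate: str, existing: Iterable[str]) -> bool:
    """True if the candidate CIDR shares any address with an existing CIDR."""
    cand = ipaddress.ip_network(candidate)
    return any(cand.overlaps(ipaddress.ip_network(net)) for net in existing)


# Hypothetical estimate: 600 services, doubled for headroom.
prefix = smallest_prefix_for(600)
pool = ipaddress.ip_network(f"10.0.0.0/{prefix}")
print(pool, pool.num_addresses)

# Hypothetical ranges already allocated elsewhere.
in_use = ["10.0.8.0/21", "10.1.0.0/16"]
print(overlaps_any(str(pool), in_use))
```

The integer `bit_length()` approach avoids floating-point rounding near exact powers of two. The overlap check only answers whether two ranges collide on paper; whether a duplicate range outside k8s is actually harmful is the open question from the log and is not settled by this sketch.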
[11:19:57] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (klausman)
[11:20:55] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create ml-serve-staging k8s's control plane VMs - https://phabricator.wikimedia.org/T302198 (klausman)
[11:21:42] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create ml-serve-staging k8s's control plane VMs - https://phabricator.wikimedia.org/T302198 (klausman) Open→Resolved
[11:21:44] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (klausman)
[11:23:16] * klausman lunch (and then groceries)
[11:25:52] * elukey lunch
[12:39:14] Lift-Wing, Machine-Learning-Team (Active Tasks): Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (achou) @elukey yes, that is what I was asking. Thanks for pointing out trim() is available in model.py :)
[12:45:26] (PS1) AikoChou: Add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/767494 (https://phabricator.wikimedia.org/T301766)
[12:52:15] (CR) jerkins-bot: [V: -1] Add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/767494 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[13:14:37] o/ I can't recall if i asked this before, but, what is the complete input to score a revision?
[13:14:45] wikitext content? anything else?
[13:27:16] o/
[13:27:32] yeah we fetch wikitext content from the mw api
[13:27:42] (behind the scenes)
[13:28:05] for some models just the revision is fine, for others we need the whole wikitext (IIUC)
[13:28:46] so, revision_id, page_id, wikitext?
[13:28:47] anything else?
[13:28:58] user id user text?
[13:29:01] timestamps?
[13:29:58] not that I know
[13:30:39] revscoring is a bit complicated and I don't have a lot of experience with it though
[13:49:59] I really like istio's sane defaults
[13:50:01] "circuitBreakers": {
[13:50:01]   "thresholds": [
[13:50:01]     {
[13:50:01]       "maxConnections": 4294967295,
[13:50:03]       "maxPendingRequests": 4294967295,
[13:50:06]       "maxRequests": 4294967295,
[13:50:08]       "maxRetries": 4294967295
[13:50:11]     }
[13:50:13]   ]
[13:50:16] },
[13:50:19] (this is the egress gw)
[14:12:40] That number looks oddly familiar :D
[14:26:35] I am not sure if it is maxint or similar
[14:26:54] (maxint for istio I mean)
[14:26:55] looks like 2^32-1 to me
[14:27:03] so max (unsigned) int
[14:27:49] $ python3 -c 'print(2**32-1)'
[14:27:51] 4294967295
[14:28:19] accraze: o/ I had a chat with aiko a moment ago and she'd need to find a way to test https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/767494/, either locally (maybe blubber + docker build + docker run with some horrible mounts?) or on the ml-sandbox
[14:28:34] that is not easy I know but we can start writing the docs about it :D
[14:28:47] klausman: makes sense yes
[15:27:56] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (akosiaris) >>! In T302701#7746934, @elukey wrote: > Thanks a lot Alex, I'll open a task to reinit both clusters :( (no idea how to do it, will document m...
[17:02:23] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (RonnieV) Hi all, @Ciell told me yesterday about the new model that is implemented. She suggested to implement links lik...
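
Editor's sketch on the circuit-breaker values pasted at 13:50: the one-liner at 14:27 confirms the arithmetic, and the snippet below extends it by parsing the pasted fragment and listing which thresholds sit at the largest unsigned 32-bit value. The fragment is wrapped in outer braces so it parses as JSON. The helper name `thresholds_at_uint32_max` is hypothetical, and reading these values as "effectively no limit" is an interpretation that the log does not confirm.

```python
import json

# Largest value representable in an unsigned 32-bit integer.
UINT32_MAX = 2**32 - 1

# Fragment from the log, wrapped in braces to make it a valid JSON document.
SNIPPET = """
{
  "circuitBreakers": {
    "thresholds": [
      {
        "maxConnections": 4294967295,
        "maxPendingRequests": 4294967295,
        "maxRequests": 4294967295,
        "maxRetries": 4294967295
      }
    ]
  }
}
"""


def thresholds_at_uint32_max(config: dict) -> list:
    """Return the names of circuit-breaker thresholds equal to UINT32_MAX."""
    names = []
    for threshold in config["circuitBreakers"]["thresholds"]:
        for name, value in threshold.items():
            if value == UINT32_MAX:
                names.append(name)
    return names


config = json.loads(SNIPPET)
print(UINT32_MAX == 0xFFFFFFFF)
print(thresholds_at_uint32_max(config))
```

Run against a real config dump, the same function would flag every limit that nobody has tuned away from this value, which may be a useful first pass before deciding on actual limits for the egress gateway.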
[18:50:04] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Jclark-ctr) a:Jclark-ctr→Cmjohnson