[07:09:59] good morning :)
[07:14:07] just deployed et/fa-wiki models
[08:06:30] kevinbazira: o/ if you want to deploy go ahead any time :)
[08:07:47] elukey: o/ thanks for the merge, let me deploy right now.
[08:12:49] both eqiad and codfw deployments have been completed successfully.
[08:12:55] checking pods now ...
[08:13:53] ack!
[08:15:05] all 8 new pods are up and running:
[08:15:07] NAME READY STATUS RESTARTS AGE
[08:15:07] etwiki-damaging-predictor-default-vn2q6-deployment-7df5f84x5cwj 2/2 Running 0 62m
[08:15:07] etwiki-goodfaith-predictor-default-7lnqv-deployment-766475chql6 2/2 Running 0 62m
[08:15:07] fawiki-damaging-predictor-default-xqc62-deployment-69f7696rwjbr 2/2 Running 0 62m
[08:15:07] fawiki-goodfaith-predictor-default-plls6-deployment-c967dcmzwv7 2/2 Running 0 62m
[08:15:08] fiwiki-damaging-predictor-default-rdkrs-deployment-5fdb845bxscj 2/2 Running 0 2m17s
[08:15:12] fiwiki-goodfaith-predictor-default-2g999-deployment-c4c4992cv6x 2/2 Running 0 2m16s
[08:15:14] frwiki-damaging-predictor-default-qnmjx-deployment-5745654kpmx9 2/2 Running 0 2m14s
[08:15:16] frwiki-goodfaith-predictor-default-556wn-deployment-79bb6bwx9zn 2/2 Running 0 2m13s
[08:23:12] nice :)
[08:23:19] kevinbazira: I created https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#revscoring_inference_services to document the new format
[08:35:18] thank you for updating the docs elukey.
[08:35:19] the clarification on overriding the WIKI_HOST variable is key. 👌
[08:36:05] thanks for the review! With the new location it should be easier to keep it updated as we go
[08:36:13] (and as we improve charts etc..)
[08:38:06] yep
[08:55:02] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (elukey)
[09:13:19] good morning!
[09:14:31] Morning Aiko!
[09:16:11] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create etcd cluster for ml-serve-staging k8s - https://phabricator.wikimedia.org/T302197 (elukey)
[09:18:04] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create ml-serve-staging k8s's control plane VMs - https://phabricator.wikimedia.org/T302198 (elukey)
[09:38:08] updated the technology onboarding checklists for ml team https://office.wikimedia.org/wiki/Technology/Onboarding/Checklists/Template#Machine_Learning
[09:38:43] and linked it to our onboarding guide :)
[09:43:57] aiko: nice!
[09:46:32] going afk for some mins, bbiab!
[09:49:43] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (fgiunchedi) This is showing up periodically (at deploy time?) at temporary spikes of indexing errors (also triggering alerts) ` {"type"=>"mapper_parsing_exception",...
[09:50:46] Machine-Learning-Team, Observability-Logging, SRE: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (fgiunchedi) +SRE for visibility
[10:55:13] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (elukey) ml-staging200x nodes reimaged with bullseye!
[11:08:28] Machine-Learning-Team, Observability-Logging, SRE, Patch-For-Review: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (fgiunchedi) I've bandaided the issue for now, though we should go back to a short `for` clause once the root cause is fixed
[11:37:50] * elukey lunch!
[15:40:37] I am upgrading os + docker storage on ml-serve200[1-4]
[15:40:44] it will likely take two days
[15:40:58] I'll also try to do ml-serve100[1-4]
[15:49:46] Lift-Wing: Implement an online feature store - https://phabricator.wikimedia.org/T294434 (elukey) One important point to investigate - IIUC (needs to be verified) Feast's config assumes a single Redis endpoint (host + port combination), so we'll have to figure out how to use our three per-DC hosts (ml-cache)...
[15:53:18] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (elukey)
[16:40:04] elukey: I guess the multi-file delete change supersedes the 2003 one? No point in editing a file and then deleting it :)
[16:41:58] just answered, yes yes
[16:43:07] in theory it should be possible to do it
[16:43:14] SGTM
[16:43:18] the per-dc config I mean, checking via pcc
[16:43:21] yeah no op
[17:44:11] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (elukey) One simple architecture could be something like the following: ==== online feature store ==== Since we'll likely use Feast, we'll need to run a python client in A...
[17:50:11] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (elukey) How the clients are going to reach the various caches is something to discuss. I added a similar note in T294434#7725954 Feast (IIUC) allows to specify a single ho...
[17:50:34] added some ideas about how to set up the Redis nodes in --^
[17:50:46] we have the codfw cluster ready, the eqiad one needs to be racked
[17:50:54] err sorry, configured (it is already racked)
[17:52:38] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (elukey) As reference for the score cache: ORES currently uses a master/replica set up for each datacenter. There is a Redis instance acting as master, and one replicating d...
[18:16:49] ml-serve-codfw on bullseye + overlay! (8 worker nodes in total)
[18:21:37] * elukey afk!