[08:19:23] hello folks
[08:26:48] good morning!
[08:29:35] (CR) Elukey: "I left one improvement for get_embeddings.sh, the rest looks really good, thanks a lot for this work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[08:31:17] (CR) Elukey: [C: +1] revertrisk: upgrade to multilingual revertrisk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/861434 (https://phabricator.wikimedia.org/T325218) (owner: AikoChou)
[08:37:32] (CR) Elukey: outlink: fix mwapi session host headers (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868131 (https://phabricator.wikimedia.org/T325199) (owner: AikoChou)
[08:56:32] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2229 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/ORES
[08:58:08] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 1012 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/ORES
[08:58:12] ah snap this is probably reboots of rdb nodes
[08:58:12] sigh
[09:00:36] (ores works in this way - every worker node has a uwsgi daemon to serve HTTP traffic and a celery worker that picks up jobs from a Redis queue. The Redis servers are on rdb nodes that SRE is rebooting)
[09:06:08] we had a little outage
[09:06:09] https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1671094279394&to=1671094818210
[09:34:07] (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/861434 (https://phabricator.wikimedia.org/T325218) (owner: AikoChou)
[09:46:05] I am watching https://www.youtube.com/watch?v=FX6naJLaq2Y&ab_channel=CNCF%5BCloudNativeComputingFoundation%5D
[09:46:18] it seems a nice intro to Kserve, I added it to the Kserve wikitech page
[10:01:01] https://github.com/kserve/modelmesh-serving seems now more stable, and it is indicated for use case like ours with revscoring models
[10:01:17] (instead of one isvc for each model)
[10:01:39] could be interesting to test, it may reduce our footprint on the k8s cluster
[10:04:42] seems nice!
[10:10:32] could be an interesting thing to apply, but it also seems very complicated (at first sight)
[10:10:43] anyway, need to run a little errand, ttl!
[10:30:42] (PS17) Ilias Sarantopoulos: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586)
[10:31:17] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[10:32:54] ouuf, finally figured out how to add multiple builders for blubber 😅
[10:45:16] (PS2) AikoChou: outlink: fix mwapi session host headers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868131 (https://phabricator.wikimedia.org/T325199)
[10:46:45] (CR) AikoChou: outlink: fix mwapi session host headers (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868131 (https://phabricator.wikimedia.org/T325199) (owner: AikoChou)
[10:58:10] (CR) AikoChou: [C: +1] "Really nice work! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[11:04:06] \o
[11:04:37] elukey: I think if we want to switch the CT prod instance to use the new image, we need to do it today. Friday seems a bit too risky for that.
[11:29:30] (PS5) AikoChou: revertrisk: upgrade to multilingual revertrisk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/861434 (https://phabricator.wikimedia.org/T325218)
[11:59:45] klausman: sure! After lunch?
[11:59:54] sounds good
[12:00:05] ack
[12:00:22] There's two different ways we can do it, a quick and simple one or a thorough one, but we can discuss the specifics later
[12:32:16] (CR) Kevin Bazira: [C: +1] "Thank you for digging into this, Ilias. LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[12:39:18] (CR) Kevin Bazira: "LGTM, besides the comment I left about the buster image version." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/861434 (https://phabricator.wikimedia.org/T325218) (owner: AikoChou)
[13:05:34] klausman: ack anytime lemme know :)
[13:06:02] gimme 5m to make a pot of tee and I'll be there
[13:06:14] elukey: lemme know if my patch is ok to merge so I can try it out
[13:06:31] 🙏 thanks
[13:06:49] (CR) Elukey: [C: +1] "You rock, thanks for the patience!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[13:08:37] * isaranto going for lunch!
[13:15:29] (CR) Ilias Sarantopoulos: [C: +2] blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[13:21:33] (Merged) jenkins-bot: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[14:12:43] isaranto: green light to deploy the new image :)
[14:13:03] once done I'll clean up the old ones from the registry (there is a special ctl command to use on build2001 IIRC)
[14:13:14] cool lets hope it works!
[14:13:23] klausman: will add the commands to a task for a sanity check review --^
[14:13:48] ack!
[14:13:51] I'll also create a patch to remove their declarations from inference services
[14:14:53] super
[14:15:01] and also integration_config
[14:17:44] Machine-Learning-Team: Enrich revertrisk image tag with model's package version - https://phabricator.wikimedia.org/T325295 (isarantopoulos)
[14:19:56] Machine-Learning-Team: Enrich revertrisk image tag with model's package version - https://phabricator.wikimedia.org/T325295 (isarantopoulos)
[15:12:33] * elukey errand for a bit
[15:53:27] Successfully deployed the new revscoring image to production 🎉
[15:54:51] I didn't deploy it to all staging ones cause it would also increase the resources of the pods (which we don't need at the moment)
[15:55:56] (PS1) Ilias Sarantopoulos: revscoring: delete individual revscoring images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868433 (https://phabricator.wikimedia.org/T323586)
[15:56:01] (PS6) AikoChou: revertrisk: upgrade to multilingual revertrisk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/861434 (https://phabricator.wikimedia.org/T325218)
[15:57:04] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/861434 (https://phabricator.wikimedia.org/T325218) (owner: AikoChou)
[15:57:22] (PS2) Ilias Sarantopoulos: revscoring: delete individual revscoring images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868433 (https://phabricator.wikimedia.org/T323586)
[16:03:12] (Merged) jenkins-bot: revertrisk: upgrade to multilingual revertrisk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/861434 (https://phabricator.wikimedia.org/T325218) (owner: AikoChou)
[16:03:38] (CR) CI reject: [V: -1] revscoring: delete individual revscoring images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868433 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[16:06:47] (CR) Ilias Sarantopoulos: "CI will fail until the patch on integration/config is merged https://gerrit.wikimedia.org/r/c/integration/config/+/868437" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868433 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[16:09:56] isaranto: niceeeeee
[16:10:29] (PS3) AikoChou: outlink: fix mwapi session host headers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868131 (https://phabricator.wikimedia.org/T325199)
[16:11:12] (PS4) AikoChou: Create a test folder and add lua scripts for wrk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/866360 (https://phabricator.wikimedia.org/T323613)
[16:13:16] (PS3) Ilias Sarantopoulos: revscoring: delete individual revscoring images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868433 (https://phabricator.wikimedia.org/T323586)
[16:14:34] (CR) Elukey: [C: +1] revscoring: delete individual revscoring images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868433 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[16:15:01] thanks Luca, I have good mentors in the team 😀
[16:15:52] Logging off folks, cu tomorrow
[16:15:56] o/
[16:18:49] (CR) CI reject: [V: -1] revscoring: delete individual revscoring images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868433 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[17:08:52] (CR) Elukey: [C: +1] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868433 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[17:11:00] https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/866360
[17:11:30] --^ I don't know why it isn't merged, I +2'd
[17:11:50] Machine-Learning-Team, Research-Backlog, Section-Level-Image-Suggestions, Structured-Data-Backlog: Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (mfossati)
[17:18:24] aiko: maybe CI is a little slow, checking
[17:18:39] usually I check https://integration.wikimedia.org/zuul/ and look for "inference"
[17:19:11] ah these are the new lua scripts
[17:19:20] I think that CI is not configured to run for test
[17:19:38] so you can probably self V:+2 and merge
[17:19:41] aiko: --^
[17:19:55] elukey: ohhh
[17:20:25] (CR) AikoChou: [V: +2] Create a test folder and add lua scripts for wrk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/866360 (https://phabricator.wikimedia.org/T323613) (owner: AikoChou)
[17:22:16] going afk for this evening folks
[17:22:19] o/
[17:23:33] Have a nice evening Luca! o/
[18:33:09] Machine-Learning-Team, Research-Backlog, Section-Level-Image-Suggestions, Structured-Data-Backlog: Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (CBogen)
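
Note on the ORES setup described at 09:00:36 (an HTTP daemon that hands jobs to a worker through a Redis queue): the snippet below is only a rough sketch of that pattern, not ORES code. Everything in it is an assumption made for illustration: `FakeQueueBackend` stands in for Redis on the rdb nodes, `handle_request` for the uwsgi side, `worker` for the celery worker, and the score value is made up. It shows why requests can turn into HTTP 500s while the queue backend is unreachable and recover once it is back.

```python
import queue
import threading


class FakeQueueBackend:
    """Hypothetical stand-in for the Redis queue (not a real Redis client)."""

    def __init__(self):
        self.available = True
        self._jobs = queue.Queue()

    def put(self, job):
        # Simulate the backend being down, e.g. during a node reboot.
        if not self.available:
            raise ConnectionError("queue backend unreachable")
        self._jobs.put(job)

    def get(self):
        return self._jobs.get()


def handle_request(backend, results, rev_id, timeout=1.0):
    """HTTP-side role: enqueue a scoring job and wait for the worker's result."""
    done = threading.Event()
    try:
        backend.put((rev_id, done))
    except ConnectionError:
        return 500, None  # the job cannot be enqueued, so the request fails
    if not done.wait(timeout):
        return 500, None  # nobody processed the job in time
    return 200, results.pop(rev_id)


def worker(backend, results):
    """Worker role: pick jobs from the queue and store a (made-up) score."""
    while True:
        job = backend.get()
        if job is None:  # sentinel used to stop the worker
            break
        rev_id, done = job
        results[rev_id] = {"rev_id": rev_id, "score": 0.5}  # hypothetical score
        done.set()


def demo():
    backend = FakeQueueBackend()
    results = {}
    thread = threading.Thread(target=worker, args=(backend, results), daemon=True)
    thread.start()

    ok = handle_request(backend, results, 12345)

    backend.available = False  # backend goes away
    failed = handle_request(backend, results, 67890)

    backend.available = True  # backend comes back
    recovered = handle_request(backend, results, 67890)

    backend.put(None)  # stop the worker
    thread.join()
    return ok, failed, recovered
```

Calling `demo()` returns three `(status, body)` pairs: a 200 while the backend is up, a 500 while it is marked unavailable, and a 200 again afterwards. The real services differ in many ways (separate processes, network timeouts, retries), so this is an analogy at most.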