[06:30:10] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[06:43:04] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[06:49:17] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[06:49:56] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 8 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi)
[06:50:22] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 8 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi)
[06:51:38] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[06:53:01] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) Adding Jaime for the backup related hosts
[07:28:29] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[07:35:22] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[07:45:24] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:04:04] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:21:49] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:22:49] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:23:39] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:37:21] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:38:23] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:40:12] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:44:33] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:05:15] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:05:53] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:16:13] (PS6) Elukey: Move revscoring model server to fastapi and adapt events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/883541
[09:21:58] isaranto: o/
[09:22:01] morning folks
[09:22:08] I am going to merge --^ if people are ok
[09:22:17] so we can test the revscoring images in staging with changeprop etc..
[09:22:53] I am not 100% sure about outlink since testing locally with docker is not super trivial, error propagation may need some though
[09:22:56] *thoughts
[09:23:29] for example - if we have transformer -> predictor, and the predictor returns a 500 with message "XYZ", then it is the transformer that talks with the client
[09:23:37] and afaics it returns a simple 500, without any msg
[09:23:55] so the XYZ msg is only saved in the pod's logs, but the client doesn't see it
[09:24:13] there is probably a way to propagate the error, but imho we can do it later
[09:24:50] hey! o/ ok, green light by me
[09:25:33] we can check how we deal with transformer errors later and have a practice we follow in general
[09:25:50] would be interesting to see what the community does. I can ask in the MLOps community as well
[09:26:52] yeah I am 100% sure there is a trick
[09:27:13] but I think that the new fastapi stuff is way better, so probably it is easier to test when we are on 0.10
[09:27:30] (CR) Elukey: [C: +2] Move revscoring model server to fastapi and adapt events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/883541 (owner: Elukey)
[09:30:23] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Vgutierrez)
[09:33:36] isaranto: https://docs.openvino.ai/2022.2/ovms_docs_rest_api_kfs.html seems also interesting
[09:33:45] (Merged) jenkins-bot: Move revscoring model server to fastapi and adapt events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/883541 (owner: Elukey)
[09:34:24] afaics that framework allows non-python model servers as well
[09:35:02] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:37:01] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:38:04] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jcrespo)
[09:38:07] I doubt that anybody would be using another language/framework to train a model
[09:39:04] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jcrespo)
[09:40:11] will take a look though 👀 . it is good to know what are the options in case we need it
[09:48:45] going afk to check a coworking, be back in a few!
[09:54:04] hope it is a good one! 🤞
[10:07:27] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MoritzMuehlenhoff) We can't migrate the puppetdb2002 VM (it's being moved to baremetal, but that is unlikely completed by then), so we'll need t...
[10:45:16] (PS1) Ilias Sarantopoulos: docs: add pre-commit info in README.md [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/883856 (https://phabricator.wikimedia.org/T325198)
[10:46:33] (PS2) Ilias Sarantopoulos: docs: add pre-commit info in README.md [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/883856 (https://phabricator.wikimedia.org/T325198)
[10:50:37] Machine-Learning-Team, Patch-For-Review: Create a pre-commit hook for inference-services repo - https://phabricator.wikimedia.org/T325198 (isarantopoulos) **Summary**: a set of pre-commit hooks have been added to the inference-services repository. The same hooks are run in CI through Jenkins in all the t...
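The transformer -> predictor issue raised at 09:23 (the predictor's 500 with message "XYZ" only reaches the pod logs, while the client gets a bare 500) could be handled roughly as follows. This is a minimal, framework-agnostic sketch and not KServe's API: `build_client_error` is a hypothetical helper, and the `{"error": ...}` body shape of the predictor's response is an assumption.

```python
import json
from typing import Tuple


def build_client_error(predictor_status: int, predictor_body: bytes) -> Tuple[int, dict]:
    """Map a failed predictor response to the response the transformer returns.

    Hypothetical helper. Assumption: the predictor reports errors as JSON like
    {"error": "XYZ"}. If the body is not JSON (or not a dict), fall back to the
    raw text so the message still reaches the client, not only the pod logs.
    """
    text = predictor_body.decode("utf-8", errors="replace")
    try:
        payload = json.loads(text)
        message = payload.get("error", text) if isinstance(payload, dict) else text
    except json.JSONDecodeError:
        message = text
    # Keep the predictor's status code and surface its message in the body.
    return predictor_status, {"error": f"predictor returned {predictor_status}: {message}"}
```

The design choice here is simply to forward the upstream status and message instead of replacing them with a generic 500; how this is wired into the transformer depends on the KServe version in use.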
[11:30:20] the coworking looks good! Took more than expected, will have lunch with Filippo and then I'll be online again!
[11:31:43] (CR) Kevin Bazira: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/883856 (https://phabricator.wikimedia.org/T325198) (owner: Ilias Sarantopoulos)
[12:39:27] * isaranto afk lunch
[12:58:40] (CR) Elukey: [C: +1] docs: add pre-commit info in README.md [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/883856 (https://phabricator.wikimedia.org/T325198) (owner: Ilias Sarantopoulos)
[13:11:30] deployed the new images to staging's goodfaith namespace, the event module works now :)
[13:11:34] going to test change prop
[13:21:26] (CR) Ilias Sarantopoulos: [V: +2 C: +2] docs: add pre-commit info in README.md [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/883856 (https://phabricator.wikimedia.org/T325198) (owner: Ilias Sarantopoulos)
[13:22:36] Machine-Learning-Team: Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp - https://phabricator.wikimedia.org/T327302 (elukey) New error: ` "Hostname/IP does not match certificate's altnames: Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikime...
[13:25:58] so the previous changeprop error is gone, but now we have another one
[13:31:40] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui) Adding Jaime for the backup hosts.
[13:36:21] that is basically https://github.com/nodejs/node/issues/37104
[13:36:26] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[13:36:47] mmm I am not 100% sure why this is not happening in api-gateway though, that IIRC is written in node as well
[13:38:15] ah no of course the api-gateway is envoy based
[13:38:16] ufff
[13:38:56] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[13:40:32] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[13:43:11] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[13:44:42] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[14:08:52] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[15:02:35] the only compromise that I can think of is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/883964
[15:07:45] aa ok, now I started to understand the problem...
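The ChangeProp error above ("Hostname/IP does not match certificate's altnames") is reported when the requested host is not covered by any Subject Alternative Name (SAN) in the server certificate. The sketch below is a simplified, hypothetical illustration of that comparison, assuming RFC 6125-style wildcards that match exactly one left-most label; it is not Node's `tls.checkServerIdentity`, it ignores IP SANs, and the hostnames are made up rather than taken from the log.

```python
from typing import Iterable


def hostname_matches_san(hostname: str, san: str) -> bool:
    """Simplified SAN check (assumption: RFC 6125-style wildcards).

    A '*' is only honoured as the entire left-most label and matches
    exactly one label; comparison is case-insensitive.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if len(host_labels) != len(san_labels):
        # A wildcard never spans more than one label, so label counts must agree.
        return False
    first, *rest = san_labels
    if first != "*" and first != host_labels[0]:
        return False
    return rest == host_labels[1:]


def hostname_matches_any(hostname: str, sans: Iterable[str]) -> bool:
    """True if at least one SAN entry covers the hostname."""
    return any(hostname_matches_san(hostname, san) for san in sans)
```

Under these rules a `*.example.org` entry does not cover a two-level name such as `model.ns.example.org`, which is one way a strict client can reject a certificate that a more lenient one accepts, and why adding more specific SANs can resolve this class of error.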
[15:08:00] was meaning to ask but was in the middle of something
[15:08:19] it is very annoying, only nodejs does that
[15:50:39] * elukey afk for a walk
[16:14:10] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[16:39:06] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (colewhite)
[16:45:06] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jcrespo)
[16:52:44] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (herron)
[17:10:34] wowww change prop works!
[17:10:46] I just got a trace in the kserve's goodfaith access log from it
[17:11:16] the access log in kserve 0.10 is a little verbose and not very useful, we'll probably need to change it
[17:11:27] but the new SANs work :)
[17:13:09] Machine-Learning-Team, Patch-For-Review: Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp - https://phabricator.wikimedia.org/T327302 (elukey) Finally I managed to make the ChangeProp -> LiftWing connection work! I'll do more tests and then we shoul...
[17:14:48] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Eevans)
[17:17:42] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Eevans)
[17:23:17] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Eevans)
[17:23:41] wow, just saw it, great work Luca!
[17:24:28] I almost made httpbb work too, but there may be an issue with the POST's parameters
[17:24:31] I'll check tomorrow
[17:24:38] I managed to run a script on deployment server that queries all the model servers in staging and awaits a 200 response
[17:24:44] going afk for the evening folks! Have a nice one
[17:24:46] all good over there!
[17:24:52] ah super
[17:24:55] me2 cya tomorrow folks!
[18:16:44] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (herron)
[18:17:21] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (herron)
[20:40:29] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (RKemper)
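The smoke test mentioned at 17:24:38 (a script on the deployment server that queries every staging model server and expects a 200) could look roughly like this. It is a hypothetical sketch, not the script that was actually run: the endpoint names, URLs and request payload are assumptions, and `post` is injectable so the loop can be exercised without network access.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def post_json(url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST a JSON payload and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses raise HTTPError; report their status code instead.
        return err.code


def smoke_test(
    endpoints: Dict[str, str],
    payload: dict,
    post: Callable[[str, dict], int] = post_json,
) -> List[str]:
    """Return the names of model servers that did not answer with HTTP 200."""
    failures = []
    for name, url in endpoints.items():
        try:
            status = post(url, payload)
        except OSError:
            # Connection errors, timeouts and TLS failures are all OSError subclasses.
            status = -1
        if status != 200:
            failures.append(name)
    return failures
```

For example, `smoke_test({"enwiki-goodfaith": "https://<staging-host>/v1/models/enwiki-goodfaith:predict"}, {"rev_id": 12345})` would return an empty list if that endpoint answered 200. The `/v1/models/<name>:predict` path follows the KServe v1 inference protocol, while the host placeholder and the `rev_id` value are illustrative only.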