[06:25:12] (CR) Kevin Bazira: [C: +1] "Besides the merge conflict, everything else LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884292 (https://phabricator.wikimedia.org/T327787) (owner: Ilias Sarantopoulos)
[07:55:54] goog morning fols :)
[07:56:00] *folks
[07:56:10] (can't even write correctly, good start of the week :D)
[08:00:56] morning Luca! :D
[08:01:16] helllooooooo aiko!!!!!!!!
[08:01:24] welcome back! :)
[08:27:54] https://github.com/litl/backoff/issues/69 - uffff
[08:29:24] ok I am testing locally the code, now I get why I wasn't seeing any issue when testing in staging
[08:29:34] I expected an exception that wasn't raised
[08:29:37] * elukey cries in a corner
[08:38:22] (PS1) Elukey: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528)
[08:41:53] (PS2) AikoChou: revertrisk: import correct module for multilingual model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/879868 (https://phabricator.wikimedia.org/T325295)
[08:43:08] (CR) AikoChou: [C: +2] revertrisk: import correct module for multilingual model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/879868 (https://phabricator.wikimedia.org/T325295) (owner: AikoChou)
[08:48:15] (Merged) jenkins-bot: revertrisk: import correct module for multilingual model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/879868 (https://phabricator.wikimedia.org/T325295) (owner: AikoChou)
[09:04:06] ahhhhh I ended up in the same problem as https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/868131
[09:04:10] uffff
[09:04:22] so now we share, in the revscoring models, the http session with aiohttp
[09:04:31] between eventgate and api-ro
[09:05:04] so the host-header is not changed, the api-ro one (en.wikipedia.org) ends up in the call to eventgate and istio-proxy blackholes it
[09:10:09] it still bugs me why aiohttp reuses the host header in the same session though
[09:14:05] I think that we should move away from a single client session though, it was probably the wrong idea
[09:14:19] we already have connection pooling etc.. at the istio-sidecar level
[09:14:37] and I am worried that other headers etc.. might be automatically reused
[09:14:39] or shared
[09:15:59] yeah from what I am reading we should have different sessions for every endpoint
[09:21:47] (PS2) Elukey: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528)
[09:21:52] like --^
[09:22:04] if we like the idea we'll probably need to apply it also to other model servers
[09:27:25] (fixing the code of course @property is not good now)
[09:28:45] (PS3) Elukey: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528)
[09:28:52] ok this one works --^
[09:31:22] * isaranto commuting - afk for 30'
[09:31:22] (going to add a minimum of pydoc)
[09:32:22] (PS4) Elukey: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528)
[09:37:46] (CR) Elukey: events: fix handling of error responses from eventgate (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[09:37:57] (CR) Klausman: [C: +1] events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[09:37:58] I left a comment for the reviewers to decide what's best
[09:59:59] after the code change we should in theory have change prop finally work :D
[10:00:40] (CR) Klausman: Deployment script examples (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/881899 (owner: Ilias Sarantopoulos)
[10:03:53] (CR) Klausman: [C: +1] events: fix handling of error responses from eventgate (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[10:04:35] commented re status Exceptions (not 100% sure I understand the question/decision quite right, so lmk if I'm off the mark)
[10:04:40] Back! Will check the reviews in a bit
[10:07:21] (CR) Elukey: events: fix handling of error responses from eventgate (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[10:16:44] Machine-Learning-Team, Data-Engineering-Planning, Research, Patch-For-Review: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (dcausse) >>! In T317768#8565463, @Isaac wrote: > @dcau...
[10:22:03] (PS5) Ilias Sarantopoulos: test: liftwing manual testing on deployment server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884292 (https://phabricator.wikimedia.org/T327787)
[10:49:39] (CR) Ilias Sarantopoulos: [C: +1] "Left a minor comment (up2u) otherwise LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[10:51:48] Machine-Learning-Team, Infrastructure-Foundations, SRE-tools: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (elukey)
[10:51:59] created --^ as well for httpbb
[10:57:59] (CR) Ilias Sarantopoulos: [C: +1] events: fix handling of error responses from eventgate (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[10:58:29] (CR) Elukey: "Left a couple of comments, the config files look really nice and tidy, nice! My only concern is that we are replicating httpbb, and we may " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884292 (https://phabricator.wikimedia.org/T327787) (owner: Ilias Sarantopoulos)
[10:58:36] elukey: I added a suggestion
[10:59:35] isaranto: thanks! Going to add it, really nice
[11:01:44] (PS5) Elukey: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528)
[11:02:17] I am also looking at your comments. I totally agree. We should go with httpbb. I just put the python script in the repo as a reference. I will write this on the ticket as well so that it is clear
[11:03:01] regarding the configs, ideally I would like to read the helm chart values so we don't have to maintain two sets of configuration
[11:03:19] (CR) Elukey: events: fix handling of error responses from eventgate (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[11:03:28] (PS6) Elukey: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528)
[11:03:59] isaranto: agreed yes
[11:07:08] (PS7) Elukey: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528)
[11:07:17] ok decided to enable raise_for_status by default :)
[11:08:13] (CR) Elukey: events: fix handling of error responses from eventgate (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[11:08:20] (PS8) Elukey: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528)
[11:08:30] perfect last version in
[11:09:52] this is my doubt though
[11:09:53] https://github.com/mediawiki-utilities/python-mwapi/blob/master/mwapi/async_session.py#L91
[11:10:23] if we add the aiohttp's raise_for_status=True in the client session then if mwapi returns a non 200 response we don't get to this point
[11:11:01] that is probably fine in our case (model-server's view I mean)
[11:11:03] aiko: --^
[11:11:07] thoughts?
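The doubt raised above is about ordering: if the HTTP session itself raises on error statuses, a wrapper library sitting on top of it never gets to run its own handling of non-200 responses. The toy simulation below shows only that ordering in plain Python. It does not use aiohttp or mwapi, and `HTTPStatusError`, `session_request` and `wrapper_call` are hypothetical stand-ins, not names from either library. The `status >= 400` threshold mirrors how aiohttp's `raise_for_status` is documented to behave.

```python
from __future__ import annotations


class HTTPStatusError(Exception):
    """Hypothetical stand-in for the error an HTTP client raises on 4xx/5xx."""


def session_request(status: int, body: str, raise_for_status: bool) -> tuple[int, str]:
    # Stand-in for the HTTP session layer: with the flag on, error statuses
    # become exceptions before the caller ever sees the response.
    if raise_for_status and status >= 400:
        raise HTTPStatusError(f"upstream returned {status}")
    return status, body


def wrapper_call(status: int, body: str, raise_for_status: bool) -> str:
    # Stand-in for a wrapper library that inspects the response itself.
    status, body = session_request(status, body, raise_for_status)
    # This point is reached only if the session layer did not raise.
    if status != 200:
        return f"wrapper handled status {status}"
    return body
```

With the flag off, `wrapper_call(503, "", False)` goes through the wrapper's own branch; with the flag on, the same call raises `HTTPStatusError` and the caller (here, the model server) has to handle it.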
[11:14:52] (CR) Ilias Sarantopoulos: [C: +1] events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[11:16:00] it is ok, since not reaching mwapi is a blocker for the server by design.
[11:16:18] I mean blocker for serving the model etc
[11:20:37] ack let's see how it goes
[11:20:47] (CR) Elukey: [C: +2] events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[11:28:42] (Merged) jenkins-bot: events: fix handling of error responses from eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884822 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[11:31:19] I am trying to debug httpbb, any way I can get the full error message or read httpbb's logs?
[11:34:58] I created https://phabricator.wikimedia.org/T328280 with some details
[11:35:19] I am not 100% sure if httpbb supports json payloads
[11:35:25] maybe we are the first use case
[11:38:38] yes I saw the ticket, I get the exact same thing. I will check what and how they support
[11:49:16] isaranto: the code seems to use only a dictionary for "data", but in theory it seems not enough for json
[11:49:38] IIUC the "json" field takes care of serializing everything in a json string before issuing the post
[11:56:45] (PS1) Elukey: events: don't use json.dumps when issuing a post to Eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884859 (https://phabricator.wikimedia.org/T325528)
[11:56:59] of course I've read the examples in the wrong way
[11:57:31] https://github.com/aio-libs/aiohttp/issues/1726 seemed to indicate json.dumps but it was just a proposal, the real implementation doesn't use it
[11:57:34] :(
[11:58:11] * elukey afk for lunch!
[11:58:56] regarding httpbb I think you're right. it tries to make a request with just a dict instead of a json string
[12:48:49] * klausman lunch
[12:49:52] (CR) Klausman: [C: +1] events: don't use json.dumps when issuing a post to Eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884859 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[13:04:26] (CR) Ilias Sarantopoulos: [C: +1] events: don't use json.dumps when issuing a post to Eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884859 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[13:34:46] thanks for the review/patience folks :(
[13:35:06] (CR) Elukey: [C: +2] events: don't use json.dumps when issuing a post to Eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884859 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[13:36:51] (Merged) jenkins-bot: events: don't use json.dumps when issuing a post to Eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884859 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[13:54:23] events are now working :)
[13:54:27] lemme try with changeprop
[13:55:56] wooorkssss \o/
[13:56:06] * elukey dances
[13:56:12] nice work!
[13:56:21] I'll send a code review in a bit to update all the revscoring images
[13:56:22] :)
[13:56:33] so I think that we demonstrated that the changeprop road works
[13:58:21] greeeeat
[13:58:46] I am preparing a patch for httpbb to support json payloads in POST
[14:08:04] wow nice!
[14:28:39] I am confused big on how to do it but I'll submit a patch and we'll see
[14:28:47] there are 2-3 ways
[14:37:54] I think that Reuven may have more context on this, not sure if we actually test POST requests to mediawiki though (the only post that I can think of is for jobrunners but they are not idempotent so probably not tested with httpbb)
[14:58:51] Good morning all
[15:06:37] Bonjour!
[15:08:44] 'allo 'allo :)
[15:19:37] Machine-Learning-Team, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[16:04:01] Machine-Learning-Team, Infrastructure-Foundations, SRE-tools, Patch-For-Review: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (isarantopoulos) In the patch above I convert the dictionary passed in `form_body` field to json if there is the header `Content-Type...
[16:06:44] morning!
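Both JSON issues discussed above come down to how a dictionary is turned into a request body. The sketch below uses only the standard library. `encode_body` is a hypothetical helper that mimics the idea described for the httpbb patch (serialize a dict as JSON when the `Content-Type` header asks for it, otherwise form-encode it); it is not httpbb's actual code. The second part shows what happens if a payload is serialized twice, under the assumption (a guess from the patch title, not confirmed in the log) that the Eventgate code passed an already-dumped string to a client that serializes on its own.

```python
import json
from urllib.parse import urlencode


def encode_body(body: dict, headers: dict) -> bytes:
    """Hypothetical helper: pick the wire format from the Content-Type header."""
    # Header names are case-insensitive, so look the value up accordingly.
    content_type = next(
        (value for name, value in headers.items() if name.lower() == "content-type"),
        "",
    )
    if content_type.lower().startswith("application/json"):
        return json.dumps(body).encode("utf-8")
    # Default: classic form encoding, which is what a bare dict usually becomes.
    return urlencode(body).encode("utf-8")


# Double serialization: the second dumps() wraps the JSON text in a JSON string,
# so the receiver decodes a str instead of an object.
event = {"rev_id": 123}
once = json.dumps(event)
twice = json.dumps(once)
decoded_once = json.loads(once)    # a dict again
decoded_twice = json.loads(twice)  # still a str, equal to `once`
```

For example, `encode_body({"rev_id": 123}, {"Content-Type": "application/json"})` produces a JSON object, while the same dict with no such header is sent as `rev_id=123`.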
[16:15:51] (PS1) Elukey: Avoid sharing the same aiohttp session in rr and outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884924
[16:16:59] (CR) Elukey: "Still need to carefully test it but lemme know your thoughts :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884924 (owner: Elukey)
[16:26:19] added a patch for supporting json payload in httpbb https://gerrit.wikimedia.org/r/c/operations/software/httpbb/+/884920
[16:26:24] more info on it here - https://phabricator.wikimedia.org/T328280#8570422
[16:28:18] I added rlazarus as a reviewer since he seems to be the core maintainer of the tool
[16:31:04] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MPhamWMF)
[16:31:19] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MPhamWMF)
[16:32:38] (CR) Ilias Sarantopoulos: [C: +1] "LGTM ✔" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884924 (owner: Elukey)
[16:44:26] klausman: o/ not sure if you saw https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/883964, but I wanted to have your opinion on it
[16:44:44] without it changeprop is not able to validate the TLS certs exposed by inference :(
[16:44:50] looking
[16:45:31] For some reason, the Gerrit main dash did not inform me of that
[16:46:08] +1'd
[16:46:37] Machine-Learning-Team: Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp - https://phabricator.wikimedia.org/T327302 (elukey) Next steps: 1) Extend https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/883964 to the inference prod endpoints. 2...
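The "one session per endpoint" idea from earlier in the day (and from the "Avoid sharing the same aiohttp session" patch) could look roughly like the sketch below. It is not the code from that patch. The `Endpoint` class, `session_kwargs`, `build_sessions` and the two example URLs are hypothetical; only the `en.wikipedia.org` Host value comes from the discussion. It also assumes the header was carried over as a session-level default, which the log does not confirm. `headers` and `raise_for_status` are real `aiohttp.ClientSession` parameters, and aiohttp is imported lazily so the configuration part can be read and tested without it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of one upstream service."""
    name: str
    base_url: str  # callers use this to build request URLs
    headers: dict[str, str] = field(default_factory=dict)


def session_kwargs(endpoint: Endpoint) -> dict:
    # Copy the headers so no two sessions ever share one mutable dict.
    return {"headers": dict(endpoint.headers), "raise_for_status": True}


async def build_sessions(endpoints: list[Endpoint]) -> dict:
    # Imported here so the rest of the module works without aiohttp installed.
    import aiohttp

    # One ClientSession per endpoint: default headers stay scoped to it.
    return {e.name: aiohttp.ClientSession(**session_kwargs(e)) for e in endpoints}


# Placeholder URLs, not the real service addresses.
ENDPOINTS = [
    Endpoint("mwapi", "http://api-ro.example.internal", {"Host": "en.wikipedia.org"}),
    Endpoint("eventgate", "http://eventgate.example.internal"),
]
```

The design choice is simply isolation: the `Host` override needed for the MediaWiki API lives in the `mwapi` session only, so a call made through the `eventgate` session cannot inherit it. Each session still needs to be closed by whoever creates it.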
[16:47:37] (CR) Klausman: [C: +1] Avoid sharing the same aiohttp session in rr and outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884924 (owner: Elukey)
[16:47:53] klausman: ack perfect thanks! I'll do it for production as well
[16:48:09] (PS6) Ilias Sarantopoulos: test: liftwing manual testing on deployment server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884292 (https://phabricator.wikimedia.org/T327787)
[16:48:50] Machine-Learning-Team, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (LSobanski)
[16:49:17] (PS7) Ilias Sarantopoulos: test: liftwing manual testing on deployment server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884292 (https://phabricator.wikimedia.org/T327787)
[16:49:27] (CR) Ilias Sarantopoulos: test: liftwing manual testing on deployment server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884292 (https://phabricator.wikimedia.org/T327787) (owner: Ilias Sarantopoulos)
[16:50:03] (CR) Ilias Sarantopoulos: test: liftwing manual testing on deployment server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884292 (https://phabricator.wikimedia.org/T327787) (owner: Ilias Sarantopoulos)
[16:51:19] Machine-Learning-Team: get a GPU on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) Some ideas/constraints/etc..: 1) We could get a new GPU that is 1:1 similar (in height/width/etc..) to the one deployed on hadoop nodes, and with similar power requirements. DE is trying to move the Hadoop one...
[16:52:58] Machine-Learning-Team, Patch-For-Review: [Liftwing testing] - Post deployment testing - https://phabricator.wikimedia.org/T327787 (isarantopoulos) As discussed within the team we want to proceed with httpbb which is a more standard tool for this purpose. The python script has been uploaded to inference s...
[16:53:28] going afk folks, cu tomorrow 🤗
[16:53:32] \o
[16:53:40] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[16:54:49] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[16:56:18] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (LSobanski)
[17:11:45] heading out now, seeya tomorrow
[17:25:28] going afk folks! Have a nice evening/day :)
[17:55:23] Machine-Learning-Team: get a GPU on Lift Wing - https://phabricator.wikimedia.org/T327923 (Isaac) Super excited by this given that Research has been exploring more advanced transformer models that strongly benefit from GPUs not just as training but at prediction time as well. Maybe very naive question but h...
[21:08:23] artificial-intelligence, Code-Review-Workgroup: AI which suggests best reviewers for a patch ("Patch wrangler") - https://phabricator.wikimedia.org/T155851 (Aklapper)