[06:01:28] Guten Tag o/ [07:08:57] (03CR) 10Kevin Bazira: "Thank you for working on this Aiko. I tried running the patch locally and got this error: https://phabricator.wikimedia.org/P65511" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031512 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [07:18:21] (03CR) 10Kevin Bazira: "After digging through the phab task (T356102#9833865), I noticed that this is intended to work with the RRML model. I tested it and it was" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031512 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [08:17:44] FIRING: LiftWingServiceErrorRate: ... [08:17:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-language-agnostic-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [08:22:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [08:28:58] hello folks! [08:29:06] the errors seem to be related to mwapi.errors.ConnectionError: Cannot connect to host incubator.wikimedia.org:443 ssl:default [None] [08:29:37] we don't have it in our istio list, but I am wondering if those requests are spurious or not [08:30:17] and I also see "INFO:root:Unsupported lang: enwiki." [08:30:40] so maybe it is the wrong wiki id passed as parameter that forces the model to use incubator? [08:32:43] single ip from gcloud, I think it is not something to worry too much about.. ideally we should follow up to see if we can return 400s to the client [09:03:21] Incubator? Are we maybe following a redirect there? [09:08:31] I am wondering if it uses incubator for wiki-ids that it doesn't know [09:08:41] enwiki was probably passed as lang [09:08:44] instead of en [09:08:56] can you send the gcloud IP as a DM so I can dig it out of logstash? [09:09:20] I found the IP in the logstash Istio gateway dashboard [09:09:30] ah, ack [09:09:36] I filtered for ml-serve-codfw, then http 500 [09:09:41] you'll find it quickly [09:09:47] it pops up in the first lines [09:11:06] Got it, starts with 34? [09:12:23] elukey: yeah I think if users use enwiki as lang then it will redirect to incubator [09:12:26] But agreed, if the root cause is requests having a wiki suffix for the lang field, we should fix that [09:13:40] do u mean manipulating the lang field? [09:14:12] As far as I can tell, we currently (try to) talk to incubator, which fails and then we return a 500 [09:14:28] Instead we should not make any backend calls, and return a 4xx series response [09:16:00] https://phabricator.wikimedia.org/P65519 [09:16:22] First request works normally, but adding `wiki` suffix makes it a 500. [09:16:58] This was staging, but I don't think prod would differ substantially [09:17:55] ack [09:18:35] We could even add a useful error message that people should not have `wiki` suffixes on the lang field. [09:18:38] I wanted to bring up something similar to this. that perhaps instead of language we should accept wiki_db to be more aligned with mediawiki [09:19:15] wouldn't that break existing users that use `lang`? [09:20:40] we could add "wikidb" or "wiki_db" but that is a discussion after we solve this. [09:21:19] for now I think that just checking that "lang" is one of the available languages would do. And I think we have some validation . lemme check [09:22:00] I agree we should raise an error when we know it is an unsupported lang, and we don't make a call to mwapi >> we used to do that, but for revertrisk-language-agnostic, for new and unknown wiki, we said we could use default values... so we removed the error response (see https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/176666cfbaa50ba744797085356fcba6ece4ba84%5E%21/#F0, [09:23:57] Ah, so knowledge integrity is the root for icnubator, I guess. Well, if we were able to contact it, I don't think it would get us a useful response, would it? [09:24:51] Maybe the error should not be "unkown lang" as it used to be but rather "invalid lang", as I doubt we will ever have something like enwiki being a valid RHS for that field. [09:25:02] thanks aiko I had forgotten about that. So this is something that we missed back then [09:25:12] So a static check for "has wiki suffix" as a common mistake that people make [09:26:22] basically `if lang.endswith("wiki"): raise WhateverException("Language field should not have a `wiki` suffix, i.e. use `en`, not `enwiki` etc.") [09:26:33] +1 for just checking wiki suffix. We had remove the error as we want to be able to use revertrisk with new wikis that were not represent in the training data with some default feature values [09:27:12] shall I file a patch for the above? raise a 400 - invalidinput wherever we have a wiki suffix [09:27:56] yeah I agree we could solve the issue by checking wiki suffix for now, and discuss if we want to accept wiki_db as input [09:27:59] SGTM [09:28:00] looking at the model card there is no such lang with a wiki suffix https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_revert_risk [09:28:15] +1 for the suffix check, seems an easy one [09:28:36] I think the only "lang" that I know of that contains the string `wiki` is wikidata, and that's a prefix. [09:37:38] (03PS1) 10Ilias Sarantopoulos: revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 [09:39:17] (03CR) 10CI reject: [V:04-1] revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 (owner: 10Ilias Sarantopoulos) [09:40:12] (03PS2) 10Ilias Sarantopoulos: revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 [09:41:10] (03PS3) 10Ilias Sarantopoulos: revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 [09:41:31] it's ready [09:41:48] on it [09:42:23] (03CR) 10Klausman: [C:03+1] "LGTM!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 (owner: 10Ilias Sarantopoulos) [09:44:14] aiko: is it ok if I merge this? [09:44:20] isaranto: we also need to add it to batch_model.py. rrla currently uses batch model [09:44:32] ack! [09:45:05] isaranto: is there a reason why you used a @staticmethod and then self. ? [09:45:09] (03PS4) 10Ilias Sarantopoulos: revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 [09:45:44] aiko: done! [09:47:09] elukey: since it is a class method (even though static) it can be accessed via self. (but doesn't have the self argument). tbh this could just go somewhere else in a utils or sth but I just put it there since it is specific to the rr model [09:47:33] could be generic as well I guess (check lang param or similar) [09:47:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [09:48:26] (03CR) 10Klausman: [C:03+1] revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 (owner: 10Ilias Sarantopoulos) [09:48:29] elukey: thanks for the help once again <3 [09:48:35] (03CR) 10AikoChou: [C:03+1] revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 (owner: 10Ilias Sarantopoulos) [09:49:00] isaranto: okok, I think it may be a little confusing since other methods are not static, like check_supported_wikis, I'd avoid a static method in this case tbh, buuut feel free to go :) [09:51:26] the others cant be static because they need access to the object (self). [09:58:31] sure it was an example, I doubt we'll ever need to call the method via class call [09:58:41] agreedo! [09:58:47] anyway, it was just a curiosity, don't bother what I say :) [09:59:25] (03PS5) 10Ilias Sarantopoulos: revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 [09:59:57] I removed the staticmethod and I'm ready to merge1 [10:00:10] :+1: [10:00:15] (03CR) 10Klausman: [C:03+1] revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 (owner: 10Ilias Sarantopoulos) [10:00:32] (03CR) 10Elukey: [C:03+1] revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 (owner: 10Ilias Sarantopoulos) [10:02:07] (03CR) 10Ilias Sarantopoulos: [C:03+2] revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 (owner: 10Ilias Sarantopoulos) [10:03:24] (03Merged) 10jenkins-bot: revertrisk: add check for wiki suffix in lang param [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1050276 (owner: 10Ilias Sarantopoulos) [10:12:17] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1050286 [10:17:50] +1 [10:19:37] deploying to staging! [10:23:02] tested and works as expected, procceeding with prod [10:27:22] all deployed and works as expected [10:28:01] Excellent, thank you! [10:29:05] no, thank u! [10:30:41] ok, fine ;) [10:30:45] * klausman lunch [10:31:20] thank you Ilias for filing the patch and deploying!! [10:32:44] RESOLVED: LiftWingServiceErrorRate: ... [10:32:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-language-agnostic-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [10:37:21] phew [10:39:14] * aiko lunch 2 [11:46:50] isaranto: I missed a file on the httpbb change: [11:46:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050328 [11:48:07] ah! I should have checked it as well. I falsely assumed that subdirs would be included [11:49:24] Yeah, I had hoped for some globbing in the deploy code, but no [11:54:04] * isaranto lunch [12:35:24] 06Machine-Learning-Team, 06Research, 10Research-engineering: Deployment of model updates - https://phabricator.wikimedia.org/T366528#9930318 (10XiaoXiao-WMF) [12:49:08] 06Machine-Learning-Team, 06DC-Ops, 10ops-codfw, 06SRE: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9930380 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=14aea618-f5d5-481c-ab19-4eb0daea0ad6) set by klausman@cumin2002 for 4... [13:24:29] isaranto: I have to re-bother you with the httpbb change because pcc was lying to me. [13:24:42] no prob! [13:47:18] 06Machine-Learning-Team, 06DC-Ops, 10ops-codfw, 06SRE: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9930554 (10Jhancock.wm) Swapped A1 and A2 to see if error recurs/moves [13:53:47] klausman: I don't have +2 permissions on that repo so someone should merge it [16:06:40] isaranto: I can do that, I was just looking for a review <3 [16:06:52] Also, ml-server 2007 is back in prod with all its memory [16:07:42] nice! [16:08:38] I reviewed , everything seems fine. the mix of `_` and `-` in the names are confusing (but ok since the match the namespaces) [16:08:56] I meant confusing to read [16:11:40] yeah, agreed. but better consistently confusing than inconistently confusing :) [16:12:01] yes! [16:26:21] heading out now, se yall tomorrow! [16:29:05] heading out as well, ciao folks o/ [17:21:50] (03PS8) 10AikoChou: revertrisk: accept revision data as input and bypass MW API [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031512 (https://phabricator.wikimedia.org/T356102) [17:21:56] (03CR) 10CI reject: [V:04-1] revertrisk: accept revision data as input and bypass MW API [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031512 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [17:26:37] (03PS9) 10AikoChou: revertrisk: accept revision data as input and bypass MW API [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031512 (https://phabricator.wikimedia.org/T356102) [17:36:37] (03CR) 10AikoChou: "Thank you for testing it, Kevin! I updated the patch. It should now work for both RRLA and RRML. Upon reconsideration, I think it's better" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031512 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [17:46:19] (03CR) 10AikoChou: "I'm thinking if we should add an env variable so we can enable/disable this feature in the deployment-chart, since it is intended for inte" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1031512 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [20:42:11] night all! [22:09:57] 06Machine-Learning-Team, 10ORES, 07dark-mode: Damaging edits are not readable in contributions page in dark mode due to ORES highlight - https://phabricator.wikimedia.org/T368680 (10Jdlrobson) 03NEW [22:11:12] 06Machine-Learning-Team, 10ORES, 07dark-mode, 10Web-Team-Backlog (FY2024-25 Q1 Sprint 1): Damaging edits are not readable in contributions page in dark mode due to ORES highlight - https://phabricator.wikimedia.org/T368680#9932852 (10Jdlrobson) p:05Triage→03High @Ebrahim would you be interested in help...