[06:10:28] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (kevinbazira) a: kevinbazira→None
[07:55:45] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Test MultilingualRevertRiskModel inference service locally with docker - https://phabricator.wikimedia.org/T323613 (achou) The poor performance I reported in my last comment was actually due to the MacBook with the M1 (Max) processor. I tested the model...
[07:56:34] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Test MultilingualRevertRiskModel inference service on ml-sandbox - https://phabricator.wikimedia.org/T323613 (achou)
[08:29:30] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Test MultilingualRevertRiskModel inference service on ml-sandbox - https://phabricator.wikimedia.org/T323613 (achou) I wrote a Lua script https://phabricator.wikimedia.org/P42235 that can read a file with multiple inputs and generate different reque...
[08:44:28] morning :)
[08:53:38] o/
[08:59:47] (CR) Ilias Sarantopoulos: [C: +2] asyncio: cast asyncio_aux_workers env var to int on read [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/863259 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[09:03:18] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (elukey) @isarantopoulos nice tests :) A while ago I added some "time" decorators to measure (more-or-less) the time taken by various functions using MP, and those give some cl...
[09:05:09] o/ morning!
[09:05:38] (Merged) jenkins-bot: asyncio: cast asyncio_aux_workers env var to int on read [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/863259 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[09:07:07] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Test MultilingualRevertRiskModel inference service on ml-sandbox - https://phabricator.wikimedia.org/T323613 (elukey) @achou really nice tests :) One question - you mentioned "missing" responses, but we don't log the `response` in `get_current_revis...
[09:08:48] https://phabricator.wikimedia.org/P42235
[09:09:19] ^^^ the script can be used for revscoring models as well
[09:09:32] really nice yes :)
[09:09:32] only need to modify lines 21-22 and the input file (as revscoring models don't use the lang parameter)
[09:10:48] this could be a nice thing to collect in inference-services, maybe under "tests" or similar
[09:11:03] so we have it code reviewed and saved somewhere
[09:11:54] elukey: yeah!
[09:12:09] elukey: thanks for the feedback :) I'll add logging of the response in get_current_revision
[09:12:26] Morning!
[09:14:50] aiko: <3
[09:15:03] morning!
[09:20:33] Santhosh of the Language team has confirmed that their software works with the new AWS setup, in principle. Still more testing needed (load, real-world queries etc.), but I am confident we've made the deadline
[09:20:42] nice!
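The "time" decorators elukey mentions at [09:03:18] are not shown in the log; a minimal sketch of that kind of per-function timing, assuming nothing about the actual inference-services code (the decorator and function names below are hypothetical), could look like:

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def log_elapsed(func):
    """Log (more or less) how long each call to the wrapped function takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            logger.info("%s took %.3fs", func.__name__, time.perf_counter() - start)
    return wrapper


# Hypothetical usage on a model server step whose duration we want to measure:
@log_elapsed
def preprocess(inputs):
    ...
```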
[09:21:08] klausman: do we have plans to change that Docker image to something more in line with our standards?
[09:21:24] (and also to add basic monitors/alarms for the service)
[09:21:30] "plans" is a bit bold a term, but definitely "strong desire"
[09:22:13] it shouldn't take much to fix it, but it is surely something needed, independently of who will manage this service (that will be us or SRE, so we'll have to do it anyway)
[09:22:22] Ack.
[09:22:47] It's very tempting to leave that effort to 2023-me :)
[09:24:42] we will run that image in our prod environment (broad terms, I know), and that Docker image wouldn't pass a regular code review in here :)
[09:24:50] it is not anybody's fault except Meta's, I know
[09:25:08] but I can help, if needed, to make it a little better
[09:27:14] That would be appreciated. There are still open questions about some of the blobs we have in there, which I'll address at this week's syncup with Meta.
[09:28:01] The big zip file with the actual model should be hosted somewhere we control, for example, and it should be auto-added/downloaded instead of requiring the user to do it.
[09:29:17] And deploy.py is doing too many things, IMO. It's a bit like trying to be its own Terraform, which it is not.
[09:29:59] sure sure, but that is something we can change anytime in 2023 at a slower pace; what I am worried about is monitoring and not running the sw as root :)
[09:32:14] ack
[09:44:39] how do you prefer to proceed with the Docker image? You send a code review and I help/test/review etc., or do you prefer me doing it?
[09:45:52] If you have the time to spare, that'd be great
[09:48:18] I'll make sure all my changes are pushed to my fork of the GH repo. Not that I did any edits in the Dockerfile, but y'know
[09:49:03] I've added the lambda code running on AWS as aws_lambda.py
[10:01:23] ack ack
[10:12:27] Also, we might want to move the whole misc/ subdir into a WMF-hosted repo separate from stopes. I don't think that the rest of that repo is really necessary for what we do. I'll ask in the Thu meeting, just to be sure.
[10:18:15] aiko: o/
[10:18:36] I am trying to install the knowledge_integrity package on stat1004 via pip, but I get an error that the setup.py file is not there
[10:18:41] did it happen to you as well?
[10:24:45] ah ok, I needed to upgrade pip
[10:24:46] lol
[10:25:38] When it's not DNS as the root cause, it's always PIP
[10:32:23] elukey: o/ yeah I think I upgraded pip at first
[10:35:38] Lift-Wing, Machine-Learning-Team, ORES, artificial-intelligence, and 2 others: Create transclusion markup for ORES model card classes. - https://phabricator.wikimedia.org/T324448 (kevinbazira)
[10:40:20] Lift-Wing, Machine-Learning-Team, ORES, artificial-intelligence, and 2 others: Create transclusion markup for ORES model card classes. - https://phabricator.wikimedia.org/T324448 (kevinbazira) Created a transclusion template for the editquality model class and successfully tested it. I noticed...
[11:16:11] aiko: if you have time, could you run pyspark for the RR model with wiki: es, rev-id: 142965340, to see if you get any missing responses?
[11:16:16] I have a theory that I have to prove
[11:19:00] elukey: ah I didn't give you the notebook
[11:19:46] elukey: https://drive.google.com/file/d/1wUZfzrLxuJ5Z-Wpk7gZjTnmOmZqpDd3B/view?usp=share_link
[11:20:35] how do I import it in jupyter? Never done it :)
[11:21:50] the current theory that I have is that some mw api appservers may have some consistency issues for $reason
[11:21:51] you need to open an ssh tunnel `ssh -N stat1004.eqiad.wmnet -L 8880:127.0.0.1:8880`
[11:22:13] and scp the notebook to stat1004
[11:22:19] okok
[11:24:10] I can run it for you if you want :)
[11:24:45] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Test MultilingualRevertRiskModel inference service on ml-sandbox - https://phabricator.wikimedia.org/T323613 (MunizaA) This is also related to T323023 but could it be that since we're sharing a client session between requests, the host header is not...
[11:24:47] what are the consistency issues?
[11:30:10] Muniza brought up a good point. I'm gonna verify it
[11:31:26] ah yes yes it is a good point indeed!
[11:31:59] my theory is that if there is a stale mw api appserver (that for some reason returns missing) we are likely to hit it with a lot of requests
[11:32:33] and even more so when we create new aiohttp sessions for each request
[11:33:03] (otherwise we'll keep hitting the same api appserver; if we re-create the session we have to go through the load balancer again, in theory)
[11:33:20] it is just a thought that could explain it, if Muniza's theory doesn't hold
[11:33:58] let's test Muniza's theory first, seems very promising :)
[11:36:05] ok!! :)
[11:37:02] going afk for lunch!
[11:44:09] ditto
[13:02:51] (Abandoned) Kevin Bazira: editquality: refactor setting of the HTTP host header into its own method [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/805388 (https://phabricator.wikimedia.org/T309623) (owner: Kevin Bazira)
[13:29:47] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Test MultilingualRevertRiskModel inference service on ml-sandbox - https://phabricator.wikimedia.org/T323613 (achou) @MunizaA your theory is right!!!!!! The host header is not getting updated. It's stuck on the first host it uses. This time I got a...
[13:33:35] ^ elukey: \o/
[13:42:31] now need to figure out why the host header is not getting updated
[14:09:35] aiko: nice!
[14:22:09] aiko: indeed, if you check https://docs.aiohttp.org/en/stable/client_advanced.html#custom-request-headers it suggests passing headers as a parameter to ClientSession or to the get/post methods directly
[14:30:33] Morning all!
[14:30:48] morning!
[14:46:49] \o
[14:54:15] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (klausman)
[14:54:52] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Add monitoring+alerting for AWS service - https://phabricator.wikimedia.org/T324467 (klausman)
[14:55:09] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Add monitoring+alerting for NLLB200 AWS service - https://phabricator.wikimedia.org/T324467 (klausman)
[14:55:20] Machine-Learning-Team, Discovery-Search: Create Model Card for Search MLR - https://phabricator.wikimedia.org/T323794 (MPhamWMF) p: Triage→Low
[14:56:20] Machine-Learning-Team, Wikimedia Enterprise: Write/polish documentation for NLLB200 on AWS - https://phabricator.wikimedia.org/T324468 (klausman)
[14:57:11] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (klausman) a: klausman→None
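Muniza's point, confirmed at [13:29:47] above, is that a single shared aiohttp ClientSession keeps the default headers it was created with, so a Host header set once sticks to every subsequent request. A minimal sketch of that failure mode (the endpoint URL and host names are made-up placeholders, not the real service configuration):

```python
import asyncio
import aiohttp

# Hypothetical internal MW API endpoint where the target wiki is chosen via the Host header.
MW_API = "https://mw-api.internal.example/w/api.php"


async def main():
    # The session is created once with the first wiki's Host header...
    session = aiohttp.ClientSession(headers={"Host": "es.wikipedia.org"})
    try:
        # ...so a later request meant for another wiki still carries
        # Host: es.wikipedia.org, because the session default headers are reused.
        async with session.get(MW_API, params={"action": "query", "format": "json"}) as resp:
            print(resp.request_info.headers.get("Host"))
    finally:
        await session.close()


asyncio.run(main())
```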
[15:46:20] klausman: o/ there are alerts for the kube api, do you have time to check them?
[15:46:34] In a moment.
[15:47:00] thanks :)
[15:47:28] I wonder if this was caused by me testing APIGW stuff :D
[15:48:03] You fixed this in the past by restarting pods, right?
[15:49:03] yes, but let's see what the problem is now
[15:49:30] the hammer approach is fine, but only if it is the same problem
[15:50:21] 504s are up
[15:51:06] Which is a gateway timeout o.O
[15:57:53] elukey: have you ever seen error messages about iptables in the kube-proxy logs?
[15:59:13] klausman: sometimes there was some garbage IIRC, it depends on what kind of error.. usually the kubelet and the proxy are very noisy
[15:59:39] The proxy on 1001 is pretty quiet, but has:
[15:59:48] Dec 05 15:51:42 ml-serve-ctrl1001 kube-proxy[2078293]: E1205 15:51:42.830105 2078293 proxier.go:1428] Failed to execute iptables-restore: exit status 2 (iptables-restore v1.8.7 (legacy): Couldn't load target `KUBE-MARK-DROP':No such file or directory
[15:59:50] Dec 05 15:51:42 ml-serve-ctrl1001 kube-proxy[2078293]: Error occurred at line: 5637
[15:59:52] Dec 05 15:51:42 ml-serve-ctrl1001 kube-proxy[2078293]: Try `iptables-restore -h' or 'iptables-restore --help' for more information.
[15:59:54] Dec 05 15:51:42 ml-serve-ctrl1001 kube-proxy[2078293]: )
[16:00:12] It doesn't happen often, so I suspect it's unrelated
[16:00:24] yeah it seems a weird log
[16:00:51] The apiserver is logging about once per sec, so not something I'd find worrying usually
[16:01:55] And it doesn't log errors
[16:06:04] (if you restart daemons remember to !log on #operations)
[16:06:49] oops, sorry
[16:08:19] The 504s are now back at zero
[16:09:24] ok nice, not sure what happens here, this is a recurrent problem
[16:09:24] And the latency alerts are recovering
[16:10:31] It's annoying that neither proxy nor apiserver log anything odd or useful, but restarting the apiserver "helps"
[16:12:14] they do log something useful: some of the ops that they serve take a huge amount of time
[16:12:42] I don't recall exactly how the metrics are calculated, but the kube-api server also calls webhooks
[16:13:02] so client -> kube-api -> webhook pod (for validation of CRDs etc..)
[16:13:18] my bet is that there is a weird bug with knative's webhook
[16:16:45] Sounds credible
[16:16:56] Sounds credible
[16:17:05] oops, wrong window for
[16:21:42] ack, let's keep an eye on the alerts :)
[16:21:56] Only one is still firing
[16:22:07] how often are the alert conditions evaluated?
[16:22:18] the one still firing is for staging though
[16:25:46] Oh. Wish that was more obvious
[16:51:27] In other news, most of the LW models can be queried through the API gateway, but I need some help with testing data. Will write up what I've found tomorrow and share on Slack or here
[16:51:44] For today I'm done :)
[16:52:14] nice! Have a good evening :)
[16:56:56] elukey: o/ currently mwapi doesn't support passing headers to the get/post methods directly https://github.com/mediawiki-utilities/python-mwapi/blob/master/mwapi/async_session.py#L163
[16:57:04] I'll send a patch for it
[16:57:37] have a nice evening!
[17:00:26] aiko: ack nice, we'll ask for another release then after the patch is reviewed/merged :)
[17:30:49] have a good rest of the day folks!
[17:30:51] * elukey afk
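The fix direction discussed above (elukey's aiohttp docs pointer at [14:22:09], and the mwapi patch aiko mentions at [16:56:56]) is to pass the Host header per request rather than baking it into the shared session. The actual mwapi change is not shown in the log; this is only the aiohttp-level pattern, again with made-up endpoint and host values:

```python
import asyncio
import aiohttp

# Hypothetical internal MW API endpoint; the target wiki is selected per request via the Host header.
MW_API = "https://mw-api.internal.example/w/api.php"


async def query(session, host, params):
    # Headers passed to get()/post() are merged with (and override) the session
    # defaults, so each request can target a different wiki over one shared session.
    async with session.get(MW_API, params=params, headers={"Host": host}) as resp:
        return await resp.json()


async def main():
    params = {"action": "query", "format": "json", "meta": "siteinfo"}
    async with aiohttp.ClientSession() as session:
        for host in ("es.wikipedia.org", "de.wikipedia.org"):
            data = await query(session, host, params)
            print(host, list(data))


asyncio.run(main())
```

This keeps the connection-pooling benefit of a single shared session while still letting each request name its own wiki.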