[04:42:36] Machine-Learning-Team, ORES, Edit-Review-Improvements-RC-Page, Growth-Team: Add ability to see good and bad edits to English Wikiquote - https://phabricator.wikimedia.org/T312592 (Tgr) @calbon does ORES support Wikiquote?
[06:11:19] (CR) Kevin Bazira: [C: +2] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043) (owner: AikoChou)
[06:19:14] (Merged) jenkins-bot: outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043) (owner: AikoChou)
[07:15:33] elukey: thanks! I tested the new version and it works \o/
[07:21:45] kevinbazira: thanks for merging the code!
[07:22:06] np! :)
[07:59:30] Lift-Wing, Machine-Learning-Team (Active Tasks): Upload outlink topic model to storage - https://phabricator.wikimedia.org/T313887 (achou)
[08:03:16] Lift-Wing, Machine-Learning-Team (Active Tasks): Create outlink topic model inference service - https://phabricator.wikimedia.org/T313888 (achou)
[08:11:09] Lift-Wing, Machine-Learning-Team (Active Tasks): Upload outlink topic model to storage - https://phabricator.wikimedia.org/T313887 (achou) The outlink topic model has been uploaded successfully to Thanos Swift. ` aikochou@stat1007:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/articletop...
[08:45:56] o/
[08:46:12] folks what is the relationship between article topic and outlink?
[08:46:20] are those separate?
[08:49:27] they are separate
[08:50:31] outlink is a model to predict article topic but not using revscoring
[08:50:53] ah okok I was trying to understand the difference between s3://wmf-ml-models/articletopic/outlink/20220727080723/model.bin and what Kevin is working on
[08:51:12] yep article topic uses revscoring
[08:51:26] okok
[08:51:36] maybe the path will be confusing
[08:51:41] hmm
[08:52:20] aiko what do you mean by path?
[08:53:04] the s3 path above
[08:53:12] but I think that once we know the diff it should be ok
[08:53:13] I uploaded the outlink model under s3://wmf-ml-models/articletopic/ because I was thinking it is also a model to predict article topic
[08:53:52] oh ... it might be better to specify that it's outlink in the path
[08:54:25] It is specified: s3://wmf-ml-models/articletopic/outlink/20220727080723/model.bin
[08:54:44] It should be ok
[08:54:45] great. thanks
[08:56:36] possibly will also need its own namespace on the staging and prod machines.
[08:57:18] yep we will need its own namespace
[08:58:17] perfect
[08:58:24] I was just curious to understand the difference
[08:58:27] now it is more clear :)
[09:00:08] :)
[09:03:58] aiko: qq - with https://phabricator.wikimedia.org/T313888 do you mean creating the k8s namespace etc..?
[09:05:49] because I am wondering how many pods we'll have in there..
[09:06:23] there is only one outlink model
[09:07:05] but the model is big
[09:08:22] 863M
[09:08:32] I am wondering if our policy of one k8s namespace for each model needs a review or not
[09:10:43] I don't get it.. what exactly do you mean?
[09:11:36] Machine-Learning-Team, SRE, ops-codfw: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (elukey) @Papaul host rebooted! It is not running any K8s pods at the moment so if any maintenance is needed, feel free to downtime and go ahead :) For the ML-Team - the node...
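(Editor's note: for reference, a rough sketch of how the outlink model upload and check mentioned in T313887 might look, using the ml-team s3cmd config quoted at 08:11:09. The local file name on stat1007 is an assumption; the bucket path is the one given at 08:50:53.)

    # upload the serialized model to Thanos Swift (local file name "model.bin" is an assumption)
    s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg put model.bin \
        s3://wmf-ml-models/articletopic/outlink/20220727080723/model.bin
    # confirm the object is where the isvc config will expect it
    s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/articletopic/outlink/20220727080723/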
[09:12:18] aiko: so we add isvc resources to k8s namespaces, that we have been calling 'revscoring-modeltype' up to now
[09:12:37] for example, in revscoring-editquality-goodfaith we have all pods running goodfaith etc..
[09:12:39] elukey: is that the same host as last time?
[09:12:50] klausman: ?
[09:12:56] The memory issue
[09:13:33] good morning to you as well, without context I am a little lost :) Papaul asked us to do it in https://phabricator.wikimedia.org/T313822
[09:13:39] Sorry :)
[09:13:53] We had a memory issue like the one in that ticket a while back (months).
[09:14:19] trying to find it rn
[09:14:34] aiko: so for your new model, our policy would be to create something like a new namespace called `articletopic-outlink` or similar, with all settings/secrets/etc.. Now I am wondering if we want to have a generic namespace for single models or not
[09:15:40] .oO(Am I having a déjà-vu?!)
[09:15:58] elukey: got it.
[09:16:34] 2022-01-18 16:29:43 papaul klausman: hey on ml-serve2001 i have this The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B1. Reboot system to initiate self-heal process.
[09:16:46] Same machine, different DIMM
[09:17:01] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey)
[09:17:23] ah okok
[09:17:26] elukey: maybe a generic naming is better
[09:17:40] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Test async preprocess on kserve - https://phabricator.wikimedia.org/T309623 (elukey) Open→Resolved This has been worked on in various tasks, we decided to: 1) add async support for the mwapi package 2) move the outlink code (...
[09:18:35] elukey: back then, we decided to just reboot and see if it happens again, then possibly try memtest86. We could keep the machine cordoned, disable alerts and do a memtest run. I doubt it's the DIMM, but it might be the memory controller.
[09:18:50] klausman: sure
[09:19:03] I'll take care of it
[09:19:36] klausman: if you have time today can you work with Aiko on https://phabricator.wikimedia.org/T313888 ?
[09:19:46] Sure
[09:20:09] in theory we have a new single model for outlink (a non-revscoring one) and if we follow our policy we should add the k8s puppet config for namespaces, secrets, etc..
[09:20:16] as we did for Kevin the last time
[09:20:29] not sure if we want a generic namespace for these kinds of models or not
[09:20:35] maybe keeping consistency with the rest is better
[09:25:56] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/817732 to move all docker images in prod to kserve 0.8
[09:26:03] I'll do it in steps of course :)
[09:26:14] lemme know your thoughts when you have a moment
[09:28:44] then shall we just keep consistent with the name articletopic-outlink?
[09:29:32] sounds good to me
[09:31:36] great, let's do it :)
[09:32:39] Machine-Learning-Team, SRE, ops-codfw: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b087dff3-f32b-4842-9f10-401f09f59c0c) set by klausman@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their ser...
[09:34:29] aiko: sounds good to me! The downside is that we create a lot of configs for a single model, but for the moment we can do it as we have been doing so far. If we'll have more and more single models in the future something may be reviewed/changed etc..
[09:45:33] elukey: that's true. We'll need to review our policy at some point in the future. For now it should be fine :)
[09:53:22] elukey: have you ever run memtest86+ via the idrac console?
[09:57:58] klausman: not that I recall
[09:58:30] but the DCOps team has tools to do it, so in case we can ask Papaul to take care of it
[09:58:43] will do
[10:00:24] it is more of a priority to unblock Aiko for outlink, so we can deploy it also to the API gateway etc..
[10:00:42] and do the end-to-end test to check if everything works as expected (fingers crossed
[10:00:45] )
[10:00:49] ack
[10:01:12] I had hoped I could get this going more quickly, so it can run for a while :-/
[10:01:41] Machine-Learning-Team, SRE, ops-codfw: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (klausman) Ok, the machine is booted and sitting in GRUB. @Papaul I can't seem to run memtest86+ via idrac (I just get a black screen). Can you check whether it works with direc...
[10:12:59] elukey: so since these are not revscoring models, what is the deployment procedure? The wikitech article on that is a bit thin
[10:13:14] (or I am looking at the wrong one)
[10:13:54] klausman: it should be the same one that we used for revscoring-articletopic, but with a different namespace name
[10:14:12] I don't think there is a procedure written down on wikitech
[10:14:20] So it would still be a revscoring model in the sense of how deployment charts etc are set up?
[10:15:09] the kserve-inference chart supports two templates, one for revscoring models and another one for other isvc models
[10:15:15] there is nothing in the deployment-charts repo that contains the word "articletopic"
[10:15:34] argh. git pull helped
[10:16:35] there are multiple steps to follow, first puppet + puppet private (+ fake private) and finally deployment-charts
[10:22:01] elukey: https://gerrit.wikimedia.org/r/c/labs/private/+/817744 <- does this look like what you meant?
[10:22:36] I'd do a similar edit on the puppet private repo, and then send the main puppet PR
[10:35:47] klausman: yeah, do you recall the three code reviews that you deployed for Kevin some days ago? Basically the same thing, but with different naming
[10:36:16] aiko: is "articleoutlink" (one word) ok as the namespace name?
[10:36:19] alright,
[10:36:57] 11:28 then shall we just keep consistent with the name articletopic-outlink?
[10:37:00] I think so :)
[10:37:19] ah, the dash
[10:37:30] I have no particular feelings either way
[10:37:47] It just seemed very mildly more consistent with articletopic
[10:39:21] yep yep, makes sense, let's see what Aiko and Kevin think
[10:39:26] I am +1 on the name
[10:39:31] going afk for lunch, ttl!
[10:39:33] \o
[10:48:33] +1 on "articletopic-outlink", and it is not a revscoring model, so we don't need to add revscoring- in the name
[10:48:47] Sure, will adjust accordingly
[10:58:46] Ok, private bits all done. Made the puppet change for elukey to review, and now lunch :)
[11:06:03] klausman: the name should have been "articletopic-outlink" not "article-outlink"
[11:06:11] Oh
[11:06:15] welp.
[12:14:05] thanks for the merge klausman
[12:14:19] I am going to run the deployment now ...
[12:18:15] both eqiad and codfw prod deployments have been completed successfully.
[12:18:15] checking pods now ...
[12:20:46] all new pods on prod are up and running.
[12:20:46] NAME READY STATUS RESTARTS AGE
[12:20:46] euwiki-articletopic-predictor-default-pgxdr-deployment-7d8rh9bs 3/3 Running 0 2m7s
[12:20:46] huwiki-articletopic-predictor-default-hq58s-deployment-8b4rmwfn 3/3 Running 0 2m5s
[12:20:47] hywiki-articletopic-predictor-default-gkxc6-deployment-664bt75x 3/3 Running 0 2m4s
[12:20:58] :+1:
[12:30:31] Morning all!
[12:35:33] \o
[13:12:56] klausman: reviewed changes :)
[13:13:13] one question, tho :)
[13:13:45] helmfile.d/ml-services/article-outlink/values.yaml <- in change 817751, I commented out a section at the bottom of that file, because I was unsure about S3 and names
[13:14:19] Will this model not having the revscoring- prefix influence how the expected names/paths are derived?
[13:14:49] Also not sure about the whole predictor: subsection
[13:15:13] that part will be taken care of by Aiko when deploying the isvc, in theory
[13:15:56] there will be as always an "inference" sections
[13:15:59] *section
[13:16:08] but instead of revscoring_inference_services we'll use inference_services
[13:16:19] the former uses a template and the latter another one
[13:16:29] the latter is more flexible even if a little more verbose
[13:16:32] Roger that. I'll leave the commented section in for reference?
[13:16:39] yes yes, it is fine
[13:16:55] Ok, will merge now and then do the namespace dance
[13:17:10] klausman: puppet first :)
[13:17:21] sure :)
[13:17:38] The puppet change is already merged :)
[13:17:41] also remember to run puppet on the kubernetes control plane nodes of all clusters to pick up the new users
[13:18:06] okok, but it needs to be rolled out on deploy1002 and all control plane nodes
[13:18:13] ack
[13:19:53] I am going to roll out the kserve 0.8 docker images in a few
[13:20:02] will start from the small namespaces
[13:21:01] klausman: just to coordinate - when you diff/sync admin_ng please use -l name=namespace to limit the changes, since the kserve chart is updated but deployed only on staging
[13:21:52] aye
[13:22:05] also just noticed I made a mistake in deployment_server.yaml
[13:22:46] ah yes, the naming
[13:22:52] just noticed it as well :(
[13:26:12] eh, it's a quick fix
[13:31:46] https://phabricator.wikimedia.org/P31996 <- looks good?
[13:32:24] yep
[13:32:32] ok, syncing staging
[13:33:14] # kubectl get namespaces -A
[13:33:16] NAME STATUS AGE
[13:33:18] articletopic-outlink Active 20s
[13:33:20] (plus the rest)
[13:33:30] Will now sync prod in eqiad and codfw
[13:34:38] and all namespaces done
[13:36:24] super
[13:48:02] klausman: The subdirectory name under helmfile.d/ml-services is still article-outlink.. will it be updated?
[13:48:07] klausman: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services
[13:48:23] dammit
[13:54:14] Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (elukey) The ORES extension runs PHP code that calls ORES for damaging and goodfaith only (but others are supported, see the `extension.json` file). The function that returns the HTTP URL to hit is: ` /**...
[13:54:22] positive news about the ORES extension --^
[13:54:34] should be doable to move/add support for LiftWing
[13:54:45] Really?
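(Editor's note: a hedged sketch of the helmfile invocation elukey asks for at 13:21:01, limiting an admin_ng diff/sync to one release with -l name=... so the pending kserve chart bump does not get applied by accident. The working directory on the deployment host and the environment name are assumptions based on the clusters mentioned in the log.)

    # on deploy1002, restrict the admin_ng diff/sync to the new namespace release
    cd /srv/deployment-charts/helmfile.d/admin_ng   # directory is an assumption
    helmfile -e ml-serve-eqiad -l name=articletopic-outlink diff
    helmfile -e ml-serve-eqiad -l name=articletopic-outlink sync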
[13:55:23] yeah, there may be some tweaks needed as well to parse the response (if it differs from the ORES one too much)
[13:55:41] the main question mark is if we want to have another extension and deprecate the ORES one as well :)
[13:56:05] Honestly I need to look into how much the community cares about the ORES extension
[13:56:34] https://www.mediawiki.org/wiki/Topic:Wx1wkmwnerl0pwmj
[13:56:50] Amire80 brings up a good point, do people actually use the extension?
[13:57:07] If so, how can we make it better fit their needs
[13:57:16] aiko: all fixed now
[13:57:24] If not, we can unclutter Wikipedia:RecentChanges a bit by removing it
[13:59:28] klausman: great, thank you :)
[13:59:47] I'll take a quick break before meetings
[13:59:57] chrisalbon: yes definitely, we can do anything
[14:00:22] I'm just not interested in continuing to support something simply because we've done so in the past
[14:00:33] Either it's useful to the community, in which case let's make it even more useful
[14:00:41] or it's not useful, in which case we kill it
[14:01:43] 100% agree, not sure how to get the feedback from the community though
[14:01:52] I'll figure that part out
[14:02:25] anyway, while we do that, if it takes a while, we can move mediawiki to liftwing with a few PHP code changes afaics :)
[14:08:56] Lift-Wing, Machine-Learning-Team (Active Tasks): Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (elukey)
[14:09:06] created --^ also for the revscoring models
[14:10:29] taking a little break before meetings :0
[14:10:31] :)
[14:31:26] Machine-Learning-Team, SRE, ops-codfw: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (Papaul) The reboot fixed the DIMM error for now: ` The self-heal operation successfully completed at DIMM DIMM_A2. Wed 27 Jul 2022 09:06:24 The self-heal operation succes...
[14:52:36] folks, the ml-serve-codfw cluster (prod) has been upgraded to kserve 0.8
[14:52:55] I'll wait a bit before doing eqiad so we can see if there are any issues during the next deploys of isvcs
[14:52:58] aiko, kevinbazira --^
[14:53:57] since all isvc images are on kserve 0.8 now in the helmfiles, it is probably better to avoid deploying to ml-serve-eqiad
[14:55:24] elukey: ack!
[14:58:36] elukey: thank you for the upgrade on codfw. so we can continue deploying on both codfw and eqiad as you monitor?
[15:01:01] no worries, I've seen the "no deploy on eqiad" message :)
[15:27:56] aiko: do you want to deploy outlink?
[15:28:23] (just merged the change, running puppet on deploy1002, after that feel free to deploy to staging anytime)
[15:29:34] elukey: yes, thanks for merging the change!
[15:29:43] done! green light :)
[15:30:41] (afk for a bit, will read later)
[15:53:46] outlink was deployed to staging
[15:54:03] NAME READY STATUS RESTARTS AGE
[15:54:03] outlink-topic-model-predictor-default-9wzcw-deployment-5fd2xzl6 3/3 Running 0 107s
[15:54:03] outlink-topic-model-transformer-default-dpr9l-deployment-75bdb2 3/3 Running 0 107s
[15:54:09] very nice!
[15:54:21] What is the endpoint for the staging cluster?
[15:54:42] I want to test the model
[15:56:40] like for prod, we'll curl https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict to test the model, do we have something similar for staging?
[16:00:20] it's just inference-staging instead of inference, IIRC
[16:01:21] I.e. https://inference-staging.svc.codfw.wmnet:30443/ with the path on the URL being identical between all three environments
[16:04:31] thanks Tobias!!
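(Editor's note: a minimal sketch of how the staging isvc discussed above might be tested with curl, mirroring the prod example at 15:56:40. The Host-header routing convention (isvc name plus namespace) and the request payload fields are assumptions, not confirmed in the log.)

    # POST an inference request to the outlink isvc on the staging cluster;
    # routing via the Host header and the payload fields are assumptions
    curl -s "https://inference-staging.svc.codfw.wmnet:30443/v1/models/outlink-topic-model:predict" \
        -X POST -H "Content-Type: application/json" \
        -H "Host: outlink-topic-model.articletopic-outlink.wikimedia.org" \
        -d '{"page_title": "Example", "lang": "en"}'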
[16:09:54] nice :)
[16:09:56] does it work?
[16:11:37] aiko: btw I forgot the link yesterday https://kserve.github.io/website/0.8/modelserving/servingruntimes/
[16:11:55] this is the new format for isvcs, not entirely sure if we need it
[16:12:17] there are some pre-baked docker images that kserve provides for various model types, that we haven't imported into our docker registry yet
[16:12:35] https://kserve.github.io/website/0.8/modelserving/servingruntimes/#previous-schema has a note about previous configs
[16:13:04] but we don't specify any format/framework so it should be ok
[16:14:13] elukey: sadly.. it doesn't work. There is a "mwapi.errors.ConnectionError: Cannot connect to host en.wikipedia.org:443 ssl:default [Connection reset by peer]"
[16:14:38] elukey: I guess it is because I don't use the internal endpoint https://api-ro.discovery.wmnet. What do you think?
[16:16:52] ah yes definitely
[16:18:36] ok.. I'll fix it tomorrow
[16:18:44] super :)
[16:19:22] thanks for the link. I'll check it tomorrow as well :)
[16:28:38] going afk as well!
[16:31:16] bye Luca!
[16:32:26] night all!
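(Editor's note: a hedged sketch of the fix aiko mentions at 16:14:38, i.e. reaching the MediaWiki API through the internal read-only endpoint instead of the public hostname. The Host-header pattern for api-ro and the sample query are assumptions, not taken from the log.)

    # query the MW API via the internal endpoint, selecting the wiki with a Host header
    curl -s "https://api-ro.discovery.wmnet/w/api.php?action=query&meta=siteinfo&format=json" \
        -H "Host: en.wikipedia.org"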