[06:52:34] 10Machine-Learning-Team: Update existing fawiki-articlequality isvc with new model on LiftWing - https://phabricator.wikimedia.org/T322614 (10kevinbazira)
[07:27:28] 10Machine-Learning-Team: Update existing fawiki-articlequality isvc with new model on LiftWing - https://phabricator.wikimedia.org/T322614 (10elukey) Definitely, +1 for the extra documentation steps. In this case, the error is in the `kserve-container`: ` kubectl logs fawiki-articlequality-predictor-default-fkl...
[07:30:53] 10Machine-Learning-Team: Update existing fawiki-articlequality isvc with new model on LiftWing - https://phabricator.wikimedia.org/T322614 (10elukey) In the logs of the last failed pod I can see: ` kubectl logs fawiki-articlequality-predictor-default-fkl9p-deployment-5bcjbk -n revscoring-articlequality storage-...
[08:00:59] 10Machine-Learning-Team: Update existing fawiki-articlequality isvc with new model on LiftWing - https://phabricator.wikimedia.org/T322614 (10elukey) >>! In T322614#8378383, @elukey wrote: > IIUC `20221107044250` is the last model, maybe you picked up `20220509071250` from the logs of the pod that was terminatin...
[08:04:10] Good morning! Is this the correct page to check about CI/CD? https://wikitech.wikimedia.org/wiki/Deployment_pipeline
[08:05:21] morning! Yep, that is basically the starting page
[08:05:41] I can give you a quick high-level overview
[08:07:04] - We have a tool called "Blubber" that is responsible for taking a YAML description of what a Docker image should do and producing a Dockerfile as a result. The idea is to hide the complexity of a Dockerfile from users, who can concentrate only on what they want. Usually there is a ".pipeline" config in every repo that uses Blubber (even for inference-services)
[08:08:16] - Then we have Jenkins - when you send a code review for a repo with Blubber enabled, Jenkins tries to build the Docker images to see if everything looks good.
If you then +2, CI will take care of publishing the Docker images to https://docker-registry.wikimedia.org/
[08:08:39] great, thanks! I'll take a look around and come back with more questions
[08:08:53] - Then we have the 'deployment-charts' repository, where we store our K8s Helm/Helmfile configs. We deploy Docker images via those configurations.
[08:08:58] sure :)
[08:39:57] 10Machine-Learning-Team, 10Data-Services, 10Wikilabels, 10Cloud-VPS (Debian Stretch Deprecation), and 2 others: Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (10taavi) 05In progress→03Resolved All done. Thank you!
[09:21:12] morning :)
[09:36:31] morning :)
[09:38:12] aiko: I read https://python.plainenglish.io/how-to-manage-exceptions-when-waiting-on-multiple-asyncio-tasks-a5530ac10f02 and I am now wondering if asyncio.gather may be the cause of our troubles
[09:38:57] it is the big difference between the revscoring model servers that I can see
[09:39:04] what troubles?
[09:40:17] latency without MP and connections terminated (the task that you opened)
[09:42:27] the RR model uses asyncio.gather as well https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/revision.py#L271 but latency doesn't seem to be a problem there
[09:44:11] ack, didn't know that
[09:44:17] but do you see connection terminations as well?
[09:44:46] yep, it happened to RR as well
[09:45:04] it may not be gather-related, but the above link has a point
[09:45:23] in theory our coroutines are all MW API calls that should terminate on their own without issues
[09:45:36] so they shouldn't pollute the ioloop etc..
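The gather() behaviour being debated above can be reproduced in isolation. Below is a minimal sketch (not LiftWing code; the coroutine names and delays are invented for illustration): with the default `return_exceptions=False`, `asyncio.gather()` raises as soon as one awaited task fails, while the sibling tasks keep running on the loop unless they are cancelled explicitly.

```python
import asyncio

async def fetch(name: str, delay: float, fail: bool = False) -> str:
    """Stand-in for an MW API call; names/delays are made up."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return name

async def main() -> list:
    tasks = [
        asyncio.create_task(fetch("rev", 0.01, fail=True)),
        asyncio.create_task(fetch("parent", 0.05)),
        asyncio.create_task(fetch("user", 0.05)),
    ]
    try:
        # Default behaviour: raises on the first failure, but the other
        # tasks are still running in the background at that point.
        return await asyncio.gather(*tasks)
    except RuntimeError:
        # Cancel the still-pending siblings so they don't linger on the
        # loop, then collect everything (results and exceptions alike).
        for t in tasks:
            t.cancel()
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(main())
```

If the MW API coroutines really do terminate on their own, dangling siblings shouldn't accumulate; an explicit cancel-and-regather in the failure path, as above, just makes that guaranteed rather than assumed.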
[09:55:20] I ran benthos with articlequality for an hour with no issues; meanwhile, with damaging I am seeing some sporadic conn terminations
[09:55:39] it is true that we are making 3 MW API calls for damaging and 1 for articlequality
[09:55:51] but it still doesn't make a lot of sense to me
[10:00:06] klausman: o/
[10:00:27] https://labels.wmflabs.org/campaigns/ and related return 500s
[10:00:38] Nov 08 09:54:40 wikilabels-03 uwsgi-wikilabels-web[18075]: psycopg2.OperationalError: could not connect to server: No route to host
[10:00:41] Nov 08 09:54:40 wikilabels-03 uwsgi-wikilabels-web[18075]: Is the server running on host "wikilabels.db.svc.eqiad.wmflabs" (172.16.3.117) and accepting
[10:00:44] Nov 08 09:54:40 wikilabels-03 uwsgi-wikilabels-web[18075]: TCP/IP connections on port 5432?
[10:13:58] taavi: o/ do you know where wikilabels.db.svc.eqiad.wmflabs is defined?
[10:14:35] elukey: that sounds like something managed by wmcs-wikireplica-dns. why?
[10:15:03] taavi: it is in the wikilabels config, afaics we use it to get the db host
[10:15:59] you should update your config to reference the new name directly, or we can create a per-project .svc. zone for you to manage via Horizon
[10:16:56] I'll find where it is configured, thanks
[10:18:01] ah, I think it is in https://github.com/wikimedia/wikilabels-wmflabs-deploy
[10:18:02] sigh
[10:20:02] klausman: ---^ let's open a task about it and decide what's best
[10:47:56] kevinbazira_: o/
[10:48:10] did you see my updates in the task? We can discuss the fawiki issue in here if you want
[10:48:11] 10Machine-Learning-Team, 10Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (10achou) I tested outlink with benthos for around 9 hours the other day ([[ https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?from=1667772000000&orgId=1&to=1667804400000&var-backen...
[10:58:07] I think that no route to host is caused by a Puppet-enforced thing.
I'll take a closer look
[10:58:36] did it work before?
[10:58:40] I think I may have missed a spot that should really say wikilabels-database-03 instead of that svc record
[10:58:59] That page used to work, yes
[10:59:12] but using the old db, no?
[10:59:15] I see https://github.com/wikimedia/wikilabels-wmflabs-deploy/blob/master/config/00-main.yaml#L20
[10:59:27] so once we tore down the old instance, it stopped working
[10:59:38] There are about four different layers of db config in that uwsgi app
[11:00:30] mmm sure, but in the code it seems that we use that .svc. endpoint
[11:00:52] ...yes
[11:01:10] not really urgent, let's open a task to fix it
[11:01:31] Kevin may need it during the coming weeks
[11:01:35] Thing is that that string (wikilabels.db.svc.eqiad.wmflabs) is nowhere in Puppet, hence I missed it
[11:02:08] the wikilabels repo is a bit of a mess, hopefully we'll deprecate it soon-ish
[11:02:20] it being configured from a GH repo torpedoed me a bit there
[11:06:40] 10Machine-Learning-Team, 10Data-Services, 10Wikilabels, 10Cloud-VPS (Debian Stretch Deprecation), and 2 others: Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (10klausman) This still needs a fix to https://github.com/wikimedia/wikilabels-wmflabs-deploy/blob/mast...
[11:06:51] elukey: added a comment to T307389 and opened https://github.com/wikimedia/wikilabels-wmflabs-deploy/pull/57
[11:08:01] klausman: looks good, but keep in mind that the 'deploy' branch is the one used. IIUC the policy is to merge to master and then merge to 'deploy' too
[11:08:11] ack, will do that
[11:14:03] elukey: is deployment happening from deploy1002? or somewhere else?
[11:22:46] Yeah, I can't deploy this, apparently. Not sure what's wrong with the setup. It wants to ssh to wikilabels-staging-02.wikilabels.eqiad.wmflabs (which works manually), but gets a DNS error.
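The uwsgi traceback above mixes two distinct failure modes: the service name no longer resolving, versus the resolved host/port being unreachable ("No route to host", connection refused, timeout). A hypothetical stdlib-only helper to tell the two apart when an endpoint like wikilabels.db.svc.eqiad.wmflabs:5432 stops working (this is not part of wikilabels; names below are placeholders):

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a connectivity failure: DNS vs TCP vs reachable."""
    try:
        family, socktype, proto, _, sockaddr = socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM)[0]
    except socket.gaierror:
        return "dns-error"      # the name no longer resolves
    s = socket.socket(family, socktype, proto)
    s.settimeout(timeout)
    try:
        s.connect(sockaddr)
        return "open"           # something is listening on the port
    except OSError:
        return "tcp-error"      # no route, refused, or timed out
    finally:
        s.close()
```

In the incident above, `probe("wikilabels.db.svc.eqiad.wmflabs", 5432)` would have returned "tcp-error": the stale svc record still resolved (to 172.16.3.117) but pointed at a torn-down instance.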
[11:32:23] anyhoo, lunch
[11:41:15] klausman: sorry, I was afk - so the wikilabels::web class contains a git::clone, the content is pulled directly
[11:42:17] I'm still not sure how deployment is supposed to work.
[11:42:56] in theory, most of the time we have ensure => latest, so a puppet run would pull the latest HEAD
[11:43:05] in this case I think it is a manual pull
[11:43:16] deploy1002 can't talk with wmcs
[11:43:47] I just did a sudo -u www-data git pull in /srv/wikilabels/config
[11:43:56] then a restart of uwsgi-wikilabels-web
[11:44:02] on what machine? staging?
[11:44:08] aaand https://labels.wmflabs.org/campaigns/ works
[11:44:15] nope, on wikilabels03
[11:44:30] we can probably cancel staging
[11:44:40] I mean deleting it, we don't really use it
[11:45:07] I've git pulled and restarted on staging as well, just now
[11:45:50] If things work now, grand. And back to watching dough rise :)
[11:48:42] it all seems to be working afaics
[11:50:22] 10Machine-Learning-Team: Update existing fawiki-articlequality isvc with new model on LiftWing - https://phabricator.wikimedia.org/T322614 (10achou) The new model size looks very suspicious, maybe you could check if you have uploaded the model properly. ` aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cf...
[12:02:52] * elukey lunch
[13:18:06] 10Machine-Learning-Team, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Seen): gerrit: scoring/ores/editquality takes a long time to git gc - https://phabricator.wikimedia.org/T237807 (10hashar) a:03hashar
[14:25:09] 10Machine-Learning-Team, 10Data-Engineering, 10Event-Platform Value Stream, 10Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (10EChetty)
[14:45:34] elukey o/
[14:46:04] hey Kevin :)
[14:46:10] yes, I saw your updates on the task. We'll discuss the isvc update in the meeting.
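For reference, the manual wikilabels deploy described above could be written down roughly as follows. A sketch only: it assumes the checkout lives in /srv/wikilabels/config on wikilabels-03 and that uwsgi-wikilabels-web is a systemd unit (the systemctl invocation is an assumption; the log only says "a restart"). It prints the commands rather than running them, since the real steps need root on the host.

```shell
#!/bin/sh
# Dry-run sketch of the manual wikilabels deploy; swap 'echo' for real
# execution on the wikilabels host.
run() { echo "+ $*"; }

run sudo -u www-data git -C /srv/wikilabels/config pull   # pull the 'deploy' branch
run sudo systemctl restart uwsgi-wikilabels-web           # restart the uwsgi app
```

Note the 'deploy' branch caveat from earlier in the log: changes are merged to master first and then to 'deploy', which is the branch actually pulled.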
[14:46:50] there is a workflow I have come up with that I can't paste here as it won't be easily legible
[14:47:14] kevinbazira_: this is the part that I don't get, in theory the helmfile sync is the only step that you need to do
[14:47:31] I did that :)
[14:48:19] but the existing pod went into a "CrashLoopBackOff" state.
[14:48:29] sure, but that is due to an error in the model binary
[14:48:47] (see what Aiko pointed out)
[14:48:49] the model binary is fixed ...
[14:49:05] ah, so you re-uploaded it?
[14:50:44] I see, yes
[14:51:14] ok, so in theory kserve should try to re-create the isvc, which should pull the new binary etc..
[14:51:32] but if it doesn't work, an SRE can kill the pod to force its re-creation
[14:51:50] kubectl delete pod fawiki-articlequality-predictor-default-fkl9p-deployment-5bcjbk -n revscoring-articlequality
[14:51:55] I just did it, let's see
[14:52:27] oh ok, waiting ...
[14:53:32] done!
[14:53:34] it is running now
[14:53:40] you can test it
[14:54:27] phew! yep, I see it's running now. Thank you for restarting the pod.
[14:54:41] QQ: do only SREs have the rights to restart pods?
[14:55:49] yes, exactly - in this case it was an explicit delete pod
[14:56:10] it should be an admin action, not needed often (hopefully)
[14:57:40] ok, thanks for the clarification.
[14:58:03] so, regarding updating an existing isvc/pod: will the helmfile sync suffice without the need to delete the pod?
[14:58:47] yep yep, correct - we only have to do it if a crash loop happens etc.. (and if it doesn't resolve by itself)
[15:00:02] great, thank you for clarifying this. Let me add a note to the docs regarding updating an existing isvc.
[15:17:00] 10Machine-Learning-Team: Update existing fawiki-articlequality isvc with new model on LiftWing - https://phabricator.wikimedia.org/T322614 (10kevinbazira) Updating the fawiki-articlequality isvc on LiftWing has succeeded and it's now up and running: ` $ time curl "https://inference.svc.eqiad.wmnet:30443/v1/mode...
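The isvc update flow that was just worked out above could be summarized as below. A sketch, not official docs: the pod and namespace names are copied from the log, `helmfile sync` stands for the usual sync from the deployment-charts checkout, and the commands are printed rather than executed since they need cluster credentials (the delete is SRE-only, and only needed when a crash loop doesn't resolve by itself).

```shell
#!/bin/sh
# Dry-run sketch of the isvc update steps discussed above; swap 'echo'
# for real execution on the deployment host.
run() { echo "+ $*"; }

NS="revscoring-articlequality"
POD="fawiki-articlequality-predictor-default-fkl9p-deployment-5bcjbk"

run helmfile sync                        # normal path: apply the updated isvc config
run kubectl get pods -n "$NS"            # watch for CrashLoopBackOff
run kubectl delete pod "$POD" -n "$NS"   # last resort (SRE-only): force re-creation
```

After a model binary is re-uploaded, the sync alone should suffice; kserve re-creates the isvc and the new pod pulls the fixed binary.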
[15:44:32] 10Machine-Learning-Team: Move Wikilabels Postgres Instances to VMs - https://phabricator.wikimedia.org/T312564 (10klausman) 05In progress→03Resolved
[16:01:25] * elukey groceries
[17:28:20] * elukey afk!
[17:28:25] have a nice evening folks
[18:13:41] \o
[18:14:17] o/
[19:38:59] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team (Current Sprint), 10User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (10Trizek-WMF)