[07:03:26] Good morning!
[07:12:33] Machine-Learning-Team: Retrain fawiki articlequality model - https://phabricator.wikimedia.org/T317531 (kevinbazira) Thank you for catching that, @achou. Yes, it is the articlequality model. Now that the results indicate that the new model takes into account both ref tags and sfn templates, I will prepare t...
[07:24:14] morning folks!
[07:28:34] aiko: enwiki multi-process seems to be going well - https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?from=now-6h&orgId=1&to=now&var-backend=enwiki-goodfaith-predictor-default-rp4d7-private&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&var-quantile=0.75&var-response_code=All
[07:28:53] there was a spike at some point to ~20s, but the rest looks really consistent
[07:31:07] ilias: o/ what is the shell username that you chose when you created the Wikitech credentials?
[07:31:12] (or developer account)
[07:31:38] ah I see, isaranto
[07:31:42] found you :)
[07:44:51] I have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/853090, once it gets reviewed and Chris approves the access in the phab task we should be good to go :)
[08:09:30] Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (elukey) I'll keep reporting improvements to the Istio connection termination issues in https://phabricator.wikimedia.org/T322196, that seems a more specific task. I have been testing...
[08:10:28] the new MP code seems to be working nicely, I think that the broken process pool errors were due to not enough `limit` resources (the ceiling of the cpu usage that the pod may use in its cgroup)
[08:10:55] but, in my tests I have used request: 2s and limit: 5s for the enwiki-goodfaith pod
[08:11:12] that may be a lot if we extend it to all other model servers
[08:11:41] I still need to wrap my head around request/limit vs total available resources in the system
[08:12:25] I'll also start testing other model servers, to compare results
[08:12:36] maybe goodfaith is the slowest
[08:22:49] articlequality is performing way better
[08:27:05] hey klausman o/ are you up to merging the ml staging role rename or should I take care of that?
[08:54:03] * elukey afk for a bit!
[09:26:28] jayme: I can do that
[09:26:38] cool, thanks!
[09:26:40] jayme: has the private/labs side of things been done?
[09:27:07] No. I've not touched that to not interrupt things
[09:27:35] ack.
[09:30:57] I think that should only be `mv worker/staging.yaml staging/worker.yaml` and `mv master/staging.yaml staging/master.yaml` in private/hieradata/role/common/ml_k8s, right? Plus the actually-private side on the puppetmaster
[09:35:37] jayme: I'll merge the private change, run pcc on the main one, then do the puppetmaster changes, then run PCC again and merge if everything is fine
[09:37:01] the second pcc run is probably not needed as the "real" private data is not used for it
[09:38:21] For additional safety you could also stop puppet on the ml staging nodes to prevent it from running between you merging the "real" private change and the actual puppet change. But I don't think that is strictly required
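[Editor's note: the merge sequence discussed above, gathered into one hedged sketch. The pcc command line and host names are taken verbatim from the log; disable-puppet/enable-puppet/run-puppet-agent are the standard WMF helper scripts, and running them directly on each staging node for the optional "stop puppet" step is an assumption.]

    # Optional safety step suggested above: keep the agent from running on the
    # ml staging nodes between the labs/private merge and the real puppet merge.
    sudo disable-puppet "ml staging role rename"         # on each ml-staging node

    # Merge the labs/private change, then compile the main change against a
    # worker and a control-plane node (command line taken from the log).
    pcc 852158 ml-staging2001.codfw.wmnet,ml-staging-ctrl2001.codfw.wmnet

    # If the catalogs look sane: merge the puppet change, apply the equivalent
    # move in the actually-private repo on the puppetmaster, then re-enable and
    # run the agent on the staging nodes.
    sudo enable-puppet "ml staging role rename"          # on each ml-staging node
    sudo run-puppet-agent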
[09:39:06] Error: Function lookup() did not find a value for the name 'profile::kubernetes::master::controllermanager_token' (file: /srv/jenkins/puppet-compiler/37952/production/src/modules/profile/manifests/kubernetes/master.pp, line: 2) on node ml-staging-ctrl2001.codfw.wmnet
[09:40:37] where is that from? the pcc looks good to me
[09:40:54] cmdline was pcc 852158 ml-staging2001.codfw.wmnet,ml-staging-ctrl2001.codfw.wmnet
[09:41:04] https://puppet-compiler.wmflabs.org/pcc-worker1002/37952/ml-staging-ctrl2001.codfw.wmnet/prod.ml-staging-ctrl2001.codfw.wmnet.err
[09:41:24] https://puppet-compiler.wmflabs.org/pcc-worker1002/37952/ml-staging-ctrl2001.codfw.wmnet/prod.ml-staging-ctrl2001.codfw.wmnet.err is the Jenkins run
[09:41:33] gah, wrong paste
[09:41:37] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37952/
[09:43:37] uh, that's odd. The run failed but pcc V+1'ed the change
[09:44:10] yeah. I've disabled Puppet while I'm figuring out what's going on
[09:46:13] klausman: have you "puppet-merge"'d the labs/private change?
[09:46:19] yes
[09:47:29] I've also tried running pcc directly with the Jenkins webui (instead of on my machine) just in case I got something muddled up there. Same result
[09:51:27] morning :)
[09:51:32] Rebased 852158, trying again
[09:51:35] \o hey aiko
[09:51:48] yeah, same here
[09:52:00] Same error
[09:52:12] elukey: wow, nice!! The latency looks way better!!
[09:54:59] I suspect the "it fails, but +1's anyway" part is unrelated to the not-found tokens, but I can't be 100% sure
[10:03:14] elukey: maybe we can just extend it to large wikis e.g. wikidata, dewiki, eswiki, frwiki, not all other wikis, if we don't have enough resources
[10:59:00] aiko: yep it could be an option!
[10:59:48] aiko: articlequality seems to be performing way better
[10:59:57] maybe it is a problem only for goodfaith
[11:00:03] or better, editquality
[11:00:12] I'll try all models with benthos extensively and report back in the task
[11:00:15] (revscoring models)
[11:11:37] (PS1) AikoChou: Add logging for BAD_REQUEST responses [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/853258 (https://phabricator.wikimedia.org/T320374)
[11:33:02] (CR) Elukey: [C: +1] "Non blocking comment, the rest LGTM :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/853258 (https://phabricator.wikimedia.org/T320374) (owner: AikoChou)
[11:35:23] * elukey lunch! ttl
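[Editor's note: for the latency comparisons above (multi-process vs. single-process revscoring servers), a single scoring request can be timed from an internal host with something like the following. The kserve :predict path and the rev_id payload are the real model-server interface; the staging endpoint, Host header convention, and revision id are assumptions to adjust as needed.]

    # Time one request against the enwiki-goodfaith predictor in staging.
    # Endpoint and Host header are illustrative; rev_id is an arbitrary revision.
    time curl -s "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict" \
        -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" \
        -H "Content-Type: application/json" \
        -d '{"rev_id": 123456}'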
[11:46:20] heading to lunch as well
[12:19:04] (PS2) AikoChou: Add logging for BAD_REQUEST responses [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/853258 (https://phabricator.wikimedia.org/T320374)
[12:20:53] (CR) AikoChou: Add logging for BAD_REQUEST responses (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/853258 (https://phabricator.wikimedia.org/T320374) (owner: AikoChou)
[12:26:42] (PS3) AikoChou: Add logging for BAD_REQUEST responses [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/853258 (https://phabricator.wikimedia.org/T320374)
[12:41:50] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/853258 (https://phabricator.wikimedia.org/T320374) (owner: AikoChou)
[12:55:01] (Merged) jenkins-bot: Add logging for BAD_REQUEST responses [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/853258 (https://phabricator.wikimedia.org/T320374) (owner: AikoChou)
[13:38:54] weird, the pipeline hasn't started publishing images
[13:43:46] Machine-Learning-Team, Data-Engineering-Planning, Research, Shared-Data-Infrastructure: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (Ottomata) FYI, we have deployed a `rc0.media...
[14:08:58] klausman: o/
[14:09:03] FYI today I noticed https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlserve&from=now-30d&to=now
[14:09:18] there were several alerts in alerts.w.o that we completely missed :(
[14:09:35] the "fix" was a restart of kube-api on ml-serve-ctrl2002 (active at the time)
[14:09:59] it was related to LIST ops for knative taking ages to complete
[14:10:07] Hmm.
[14:10:19] I wonder what happened on the 25th that caused that
[14:10:25] my understanding is that with newer knative versions this will go away, probably the 0.18's webhook is not that great
[14:10:39] I deployed some things, nothing else in the SAL
[14:10:55] but let's keep an eye on alerts.w.o more frequently from now on
[14:11:03] Aye
[14:11:04] (I also missed the IRC notifications apparently)
[14:11:46] we both di :-/
[14:11:49] did*
[14:12:28] Morning all
[14:12:33] Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (elukey) Articlequality testing went really great, latencies are stable and acceptable without multi-processing.
[14:12:38] morning!
[14:13:15] Morning, Chris
[14:13:22] Hello!
[14:14:44] Klausman I don’t know if my comment was in the right ticket for
[14:14:50] NLLB
[14:15:09] I saw it either way :)
[14:15:34] Cool. If it isn’t the right ticket let’s make a right ticket so we can follow and groom the work
[14:16:03] ack. I have poked Prabhat about syncing up today to see where we're at, and what his thoughts on the whole thing are
[14:19:58] Cool. The way Deb described the plan, you would be the person actually doing the migration. Pau would check if the model actually works. Prabhat would be the AWS expert ready to answer questions.
[14:20:31] I don’t know what input you had in that plan but that is how it was laid out to me
[14:22:00] Also I was very firm that moving the model from their AWS to our AWS has to be completely separate from any future work about bringing the model into our infrastructure
[14:22:12] Aye.
[14:22:15] This isn’t regular work, it’s an emergency
[14:22:29] And it shouldn’t be treated as an addition to our regular work
[14:22:45] There is the question of who will maintain it in the long term, and adding better monitoring etc., but that is a less urgent question
[14:24:26] Let’s just get it off their AWS and then worry about that later. I know that sounds wild but the deadline is Jan 1st, so that gives maybe 1 month of regular work (ie not Christmas etc) to get it on our AWS
[14:24:33] That should be the goal.
[14:25:21] If it crashes on our AWS then that is the risk WMF took by getting ourselves in this pickle
[14:25:31] And a lesson for everyone
[14:31:23] Ack.
[14:34:45] chrisalbon: qq - so we are onboarding the service and it will be owned by us?
[14:42:11] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/853298
[14:43:11] ^ update images for outlink and rr model
[14:44:34] aiko: will check in a bit, there is an outage in progress (so no deployments)
[14:45:54] elukey: ack! 😲
[15:23:29] aiko: green light :)
[15:37:44] chrisalbon: could you approve https://phabricator.wikimedia.org/T322350?
[15:37:54] thanks Luca :)
[15:38:37] Done
[15:39:05] super
[15:42:41] Thanks!
[16:03:37] tried to deploy new images to staging but both ran into Init:CrashLoopBackOff
[16:04:35] not sure why.. checking pod
[16:17:03] istio-validation:
[16:17:03] State: Waiting
[16:17:03] Reason: CrashLoopBackOff
[16:17:03] Last State: Terminated
[16:17:03] Reason: Error
[16:17:03] Exit Code: 126
[16:17:25] yeah saw that, weird
[16:17:29] I am trying to kill the pod
[16:17:55] ilias: your shell access is propagating, next week you'll be able to ssh to prod etc..
[16:20:44] interesting, the crash happens with the nsfw pod as well (just tried to delete it)
[16:21:37] Thanks for all the help every1 :)
[16:21:50] should I also delete outlink pods?
[16:23:53] aiko: nono I think it was a change that we rolled out today
[16:24:20] klausman: I see a kube-apiserver.service on ml-staging2001
[16:24:24] started 6h ago
[16:24:47] I think that was probably started by an agent run
[16:25:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/852158/4/manifests/site.pp
[16:25:30] my last r-p-a command there was at 05:50 UTC
[16:25:30] this is wrong
[16:25:48] the worker has become a master
[16:25:56] I completely missed that, will prep a fix
[16:26:11] same thing for me uff
[16:27:04] and pcc didn't show anything?
[16:27:25] Well, it might've been lost in the noise of the private repo thing
[16:30:24] yeah pcc makes sense now
[16:30:46] so we need to clean up the old systemd units.. I'll take 2002
[16:31:34] grmbl, merge conflict
[16:32:04] what units need cleaning?
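[Editor's note: one quick way to answer the question above, i.e. to see which control-plane units the bad site.pp match left behind on a staging worker. The systemctl glob patterns are standard; the unit names are the ones mentioned in the conversation.]

    # List any kube-* units and unit files present on the node; on a healthy
    # worker only kubelet/kube-proxy should show up, so kube-apiserver,
    # kube-scheduler and kube-controller-manager here are the strays to clean.
    systemctl list-units --all 'kube-*'
    systemctl list-unit-files 'kube-*'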
[16:33:03] change is merged
[16:34:48] klausman: all the ones that the other change brought up
[16:34:55] like kube-apiserver
[16:35:04] puppet will not do it by itself
[16:35:13] apiserver, scheduler, controller-manager I know of
[16:36:05] yep
[16:38:07] Those three I have stopped on 2001
[16:38:15] (and disabled)
[16:38:42] let's also remove the files and systemctl daemon-reload
[16:39:11] rm /lib/systemd/system/kube-scheduler.service /lib/systemd/system/kube-apiserver.service /lib/systemd/system/kube-controller-manager.service
[16:39:49] done & done
[16:40:10] ah there are also packages sigh
[16:40:17] `systemctl status kube-scheduler.service kube-apiserver.service kube-controller-manager.service` should say "not found" for all three
[16:40:41] apt purge kubernetes-master should be enough, right?
[16:40:45] yeah
[16:41:05] and also config files under /etc/default
[16:41:37] I would've hoped that purge nixes them, but oh well
[16:41:57] they are created by puppet IIRC
[16:42:35] and there are files also under /etc/kubernetes
[16:42:50] elukey@ml-serve2001:~$ ls /etc/kubernetes
[16:42:50] kubelet_config kubelet-config.yaml kubeproxy_config kube-proxy-config.yaml ssl
[16:42:56] this is a regular worker node
[16:43:33] elukey@ml-staging2002:/etc/kubernetes$ sudo rm controller-manager_config infrastructure-users kube-scheduler-config.yaml scheduler_config
[16:43:42] same
[16:45:38] deleted the pods that were not starting
[16:45:55] from deploy1002?
[16:46:13] yeah
[16:46:21] ack.
[16:46:23] istio-validation now passes
[16:46:47] let's see if storage-init etc.. work
[16:48:29] aiko: new pods running
[16:48:33] if you want to test
[16:53:05] elukey: nice!!
[16:54:23] elukey: outlink is still Init:CrashLoopBackOff. should i delete them?
[16:56:58] ah yes I didn't check it
[17:02:03] went ahead and deleted them for you
[17:02:33] elukey: I got Error from server (Forbidden): pods "outlink-topic-model-predictor-default-ccssc-deployment-548gh6pb" is forbidden: User "articletopic-outlink" cannot delete resource "pods" in API group "" in the namespace "articletopic-outlink"
[17:02:50] ahh okok
[17:03:04] makes sense yes, only me and Tobias can do it
[17:03:11] new pods running!
[17:03:42] elukey: ohhh I see :)
[17:04:30] thanks Luca and Tobias for fixing it!
[17:05:04] aiko: it is interesting to see that the revscoring model servers that make only one MW Api call are fast without MP, the others have some slowdown
[17:06:35] elukey: that's interesting. is it articlequality model?
[17:07:15] articlequality and articletopic do 1 api call, and they are good without MP
[17:07:28] draft quality and edit quality make 3 api calls
[17:07:40] (didn't check drafttopic yet)
[17:08:14] drafttopic probably is the same as articletopic
[17:08:21] * elukey nods
[17:08:45] we can see if MP can be enabled on selective use cases, as you said earlier on
[17:09:55] yep sounds good
[17:15:13] I think it makes sense that draft quality and edit quality can benefit from MP, because they have a more complex feature set (includes parent revision and user info)
[17:20:33] Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (elukey) articletopic seems to work well without multi-processing (need to test it but I think that drafttopic will follow the same trend).
[17:28:45] going afk for the weekend!
[17:28:54] o/
[17:43:41] o/ have a lovely weekend folks!
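[Editor's note: the Forbidden error above is the per-namespace deploy user (articletopic-outlink) being denied pod deletion, which is why an admin had to delete the stuck pods. A hedged sketch of the admin-side steps from the deployment host; kube_env is the WMF KUBECONFIG helper, but the exact service/cluster names and the pod name are assumptions/placeholders.]

    # Switch to an admin context for the ML staging cluster (names illustrative),
    # inspect the failing istio-validation init container, then delete the stuck
    # pod so the controller recreates it.
    kube_env admin ml-staging-codfw
    kubectl -n articletopic-outlink get pods
    kubectl -n articletopic-outlink logs <pod-name> -c istio-validation --previous
    kubectl -n articletopic-outlink delete pod <pod-name>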
[18:18:06] \o heading out as well
[18:26:22] Have a great weekend /o
[23:09:05] (PS1) Reedy: Helpers: Fix string interpolation [extensions/ORES] - https://gerrit.wikimedia.org/r/853453 (https://phabricator.wikimedia.org/T314096)
[23:16:59] (CR) Zabe: [C: +2] Helpers: Fix string interpolation [extensions/ORES] - https://gerrit.wikimedia.org/r/853453 (https://phabricator.wikimedia.org/T314096) (owner: Reedy)
[23:27:28] (Merged) jenkins-bot: Helpers: Fix string interpolation [extensions/ORES] - https://gerrit.wikimedia.org/r/853453 (https://phabricator.wikimedia.org/T314096) (owner: Reedy)