[06:49:32] good morning :)
[07:19:17] morning, Luca!
[07:59:32] Buon giorno :)
[08:01:49] elukey: \o Say, how are AS numbers for k8s allocated/managed?
[08:02:12] https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations Just adding them here and that's it?
[08:04:19] (PS14) Elukey: Update scap settings for the Python 3.7 migration [services/ores/deploy] - https://gerrit.wikimedia.org/r/784649 (https://phabricator.wikimedia.org/T303801)
[08:04:28] (PS15) Elukey: Update scap settings for the Python 3.7 migration [services/ores/deploy] - https://gerrit.wikimedia.org/r/784649 (https://phabricator.wikimedia.org/T303801)
[08:06:27] klausman: salve, yes I think so, but better to follow up with Arzhel or Cathal since it has been a while
[08:06:37] there should be a note in https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New in theory
[08:06:42] (if not let's add it at the end)
[08:06:50] ack, will do some reading
[08:11:47] The question is also which comes first: the deployment charts bits, or the AS addition linked from the new-setup page (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055) ?
[08:16:21] the AS/homer config is needed to make calico pods work
[08:16:35] so deployment-charts should come first
[08:17:16] alright. I already got a change for that, just missing AS# and v6 prefix
[08:17:57] (PS16) Elukey: Update scap settings for the Python 3.7 migration [services/ores/deploy] - https://gerrit.wikimedia.org/r/784649 (https://phabricator.wikimedia.org/T303801)
[08:54:02] elukey: is there a way to test the helmfile/deployment changes?
[08:57:14] klausman: there is only the diff from CI
[08:57:23] Hrm.
[08:57:36] And with my stuff all being additions, that won't be super useful
[08:59:46] Oh that has to be one of the least useful errors I've ever seen
[09:01:46] I hope it's just the ref to a nonexistent file
[09:04:11] you can run the CI locally with `rake run_locally['default']` if you want
[09:08:33] I'm probably missing packages as that fails
[09:08:42] https://integration.wikimedia.org/ci/job/helm-lint/7238/console CI diff
[09:09:43] let's also reserve the ipv6 subnet
[09:09:59] otherwise we'll forget, it should be a simple /64 pick from the reserved pool
[09:10:01] Is that just from the same prefix as the existing v6 net in codfw?
[09:10:49] https://netbox.wikimedia.org/ipam/prefixes/229/prefixes/
[09:11:00] i.e. 2620:0:860:302::/64
[09:12:41] In theory yes, but double check with netops to be sure
[09:46:52] elukey: what do you think re: 302 and 303 as mentioned in #-traffic?
[09:55:01] all good for me, any subnet is fine :)
[09:57:57] Alsright
[09:57:59] -s
[10:13:34] I did the reservations in netbox, now going to hunt down lunch
[10:16:05] I see allow-all-icmp in the CI diff for the GlobalNetworkPolicies, that we haven't overridden, so I think that the allow-pod-to-pod override should work
[10:16:16] roger
[10:17:44] * elukey lunch too
[10:43:46] (PS2) AikoChou: outlink: handle http bad request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/785112 (https://phabricator.wikimedia.org/T306029)
[10:51:22] (CR) Klausman: [C: +1] outlink: handle http bad request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/785112 (https://phabricator.wikimedia.org/T306029) (owner: AikoChou)
[11:29:59] (CR) AikoChou: outlink: handle http bad request (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/785112 (https://phabricator.wikimedia.org/T306029) (owner: AikoChou)
[11:53:57] (CR) AikoChou: "The articlequality-transformer pipeline fails because there is no code to run. Sent a patch in integrations/config: https://gerrit.wikimed" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/785848 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[13:12:01] Morning all!
[13:15:49] morning!
[13:16:14] chrisalbon: qq - I see a late technical meeting in my calendar, is it on purpose or a leftover?
[13:16:39] What is it called?
[13:16:48] ML Team Technical Meeting
[13:16:59] I see a one hour mid week planning
[13:17:05] Yeah you can delete that
[13:17:06] and then half an hour for the tech meeting
[13:17:09] super
[13:17:14] I deleted it so I’m not sure why it is appearing
[13:17:36] maybe it was only in my calendar
[13:19:29] I see it too
[13:19:48] x3
[13:23:17] Alright maybe I deleted myself from the meeting
[13:23:25] Listen, nobody ever said I was smart
[13:26:28] (CR) Elukey: outlink: handle http bad request (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/785112 (https://phabricator.wikimedia.org/T306029) (owner: AikoChou)
[13:37:20] Lift-Wing, Machine-Learning-Team: Support (or not) the ORES batch scoring in Lift Wing - https://phabricator.wikimedia.org/T306986 (achou)
[13:51:02] elukey: is there any time-sensitive stuff that needs doing post submit (like the LVS+pybal change)?
[13:51:51] klausman: for admin-ng? Nono, just run-puppet-agent on deploy1002 if you want to deploy the rbac rules etc..
[13:52:01] (so a git pull is forced for deployment-charts)
[13:52:10] ack
[13:52:19] then if anything is wrong you'll notice it as soon as you try to use helmfile
[13:52:46] grrr, jenkins being a Jerk again
[14:03:19] afk for a bit to get groceries, bbiab
[14:18:31] klausman thanks for the merge. starting deployment now ...
[14:30:23] both eqiad and codfw deployments have been completed successfully.
[14:32:21] all 4/4 new pods are up and running:
[14:32:22] NAME READY STATUS RESTARTS AGE
[14:32:22] wikidatawikiwiki-damaging-predictor-default-d79ph-deployme4lvb6 3/3 Running 0 8m51s
[14:32:22] wikidatawikiwiki-goodfaith-predictor-default-c7w4h-deploym96ljd 3/3 Running 0 4m
[14:32:22] zhwikiwiki-damaging-predictor-default-pj2xt-deployment-6d6f2wps 3/3 Running 0 8m50s
[14:32:22] zhwikiwiki-goodfaith-predictor-default-7g8hp-deployment-85jg2f7 3/3 Running 0 3m58s
[14:33:00] Yay!
[14:41:26] kevinbazira: nice! Are all the models also returning scores without erroring etc..?
[15:28:03] kevinbazira: merged the fixes for the "wiki" substring
[15:28:18] thanks. deploying now ..
[15:40:21] hmmm... the wikidatawiki editquality predictors are running into a CrashLoopBackOff:
[15:40:21] NAME READY STATUS RESTARTS AGE
[15:40:21] wikidatawiki-damaging-predictor-default-p9pqp-deployment-5frwd7 1/3 CrashLoopBackOff 5 4m57s
[15:40:21] wikidatawiki-goodfaith-predictor-default-fm2xw-deployment-kf6x4 1/3 CrashLoopBackOff 4 2m15s
[15:40:21] investigating what the cause could be ...
[15:47:09] looks like the wikidatawiki model is not being mounted in the predictor pod
[15:47:09] $ kubectl logs wikidatawiki-damaging-predictor-default-p9pqp-deployment-5frwd7 -n revscoring-editquality-damaging kserve-container
[15:47:09] FileNotFoundError: [Errno 2] No such file or directory: '/mnt/models/model.bin'
[15:48:11] what does the storage-initializer's log say?
[15:52:22] surprisingly the storage-initializer seems to have mounted the model ... hmmm
[15:52:22] kubectl logs wikidatawiki-damaging-predictor-default-p9pqp-deployment-5frwd7 -n revscoring-editquality-damaging storage-initializer
[15:52:22] [I 220427 15:32:45 storage:85] Successfully copied s3://wmf-ml-models/damaging/wikidata/20220214192318/ to /mnt/models
[15:52:48] kevinbazira: so the URL looks weird, i'd have expected "wikidatawiki" in there
[15:52:55] damaging/wikidatawiki/etc..
[15:54:20] yep, that's weird ...
[15:57:12] trying to delete the pod to see if it is running a stale config
[15:57:23] great. thanks!
[16:04:53] ah ok so it breaks also for goodfaith
[16:05:03] lemme check the templates
[16:05:21] We did just merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/786982
[16:05:35] ("just"==during the meeting)
[16:06:26] yes it does break goodfaith too ... the templates' regex might have the answer
[16:07:27] yeah in revscoring_services.yaml there is a bug for wikidata
[16:08:39] we don't append "wiki" if "wiki" or "wiktionary" are in the variable
[16:10:38] so we should add wikidata to the condition
[16:11:16] elukey should I add it or you're adding it?
[16:11:29] kevinbazira: sending a code change in a bit
[16:11:50] ok
[16:12:16] I think that the issue is that the "wiki" in the regex matches wikidata, so we have to avoid it
[16:12:26] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/787004/
[16:12:46] the .*wiki matches "wiki", meanwhile we want to match something like enwiki
[16:12:56] so a .+ should do the trick in theory
[16:13:00] what do you think?
[16:13:30] I'll also add a test in the fixtures
[16:14:26] let's see if CI agrees
[16:14:57] if you use .+ doesn't that eliminate other wikis that start with substring "wiki"?
[16:15:32] do we have any?
[16:15:41] checking ...
[16:16:16] I think that we have always a prefix in front
[16:16:58] yep
[16:18:54] ok so let's see the diff from the CI, if it makes sense
[16:20:59] 18:15:37 + - name: STORAGE_URI
[16:20:59] 18:15:37 + value: s3://wmf-ml-models/goodfaith/wikidatawiki/202106140666/
[16:21:03] seems good!
[16:21:19] great. let me check the pod
[16:21:39] 18:16:48 - value: s3://wmf-ml-models/damaging/wikidata/20220214192318/
[16:21:42] 18:16:48 + value: s3://wmf-ml-models/damaging/wikidatawiki/20220214192318/
[16:21:48] nono I still have to merge it :)
[16:21:55] oh ok ...
[16:21:56] this is the diff from the CI/Jenkins
[16:22:22] ok to merge kevinbazira ?
[16:22:32] (if so can you +1 ?
[16:22:33] )
[16:25:04] ok I'll go ahead :
[16:25:06] :)
[16:32:39] :) I had a question about the version: "202106140666" added to revscoring_inference_no_transformer.yaml
[16:32:39] is this an inference version or a model id/number that is used in thanos-swift?
[16:32:39] I am asking because these are the wikidatawiki model ids I see in thanos-swift:
[16:32:39] s3://wmf-ml-models/damaging/wikidatawiki/20220214192318/
[16:32:39] s3://wmf-ml-models/goodfaith/wikidatawiki/20220214192321/
[16:41:59] kevinbazira: it is the model id/number
[16:43:58] ah no you mean https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/787004/2/charts/kserve-inference/.fixtures/revscoring_inference_no_transformer.yaml right?
[16:44:15] this is inside the .fixture, it is only to test the rendering of the template
[16:44:24] so if we modify anything we'll see a diff in the CI
[16:44:28] nothing more
[16:46:23] oh ok, so it's only used for testing ... I get it.
[16:47:40] you should have seen me hunting for the 202106140666 id/number in thanos-swift 🤦‍♂️
[16:48:37] sorryyy
[16:48:55] np... thank you for the clarification :)
[17:05:17] ok so the ml-serve-eqiad cluster looks good, but I don't see the wikidata pods up
[17:05:29] I think that with the new kserve-inference chart pods have been all recycled
[17:05:38] and it may take time for the wikidata isvc to come up
[17:06:31] the isvc is in a weird state indeed
[17:08:36] this is weird
[17:08:37] Container failed with: /opt/lib/python/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found
[17:08:56] ah no
[17:08:57] FileNotFoundError: [Errno 2] No such file or directory: '/mnt/models/model.bin'
[17:09:19] that is revision 2, related to the second deployment done by Kevin
[17:09:26] why is there no revision 3?
[17:09:50] I'll restart working on this tomorrow morning, really strange
[17:09:58] have a nice rest of the day folks!
[17:15:44] thanks for your help elukey. see you tomorrow.
[18:34:24] Night all!
[18:34:34] I'm just going to sit here alone... answering emails
[18:34:50] Why did I become a manager even
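
Note on the 16:08 to 16:16 regex discussion: the difference between `.*wiki` and `.+wiki` can be checked with a small Python sketch. This is only an illustration under assumptions. The helper name `storage_wiki_name` and the exact shape of the condition are hypothetical; the real logic lives in the deployment-charts Helm template and is not reproduced in this log. The sketch also assumes an unanchored match, which is what `re.search` does.

```python
import re

OLD_PATTERN = r".*wiki"  # ".*" may match nothing, so a bare "wiki" prefix is enough
NEW_PATTERN = r".+wiki"  # ".+" needs at least one character before "wiki"


def storage_wiki_name(name: str, pattern: str) -> str:
    """Hypothetical helper: append "wiki" unless the name already looks complete.

    Mirrors the rule described in the chat: do not append "wiki" when the
    name already matches the wiki pattern or contains "wiktionary".
    """
    if re.search(pattern, name) or "wiktionary" in name:
        return name
    return name + "wiki"


if __name__ == "__main__":
    for name in ("en", "enwiki", "wikidata", "wikidatawiki", "enwiktionary"):
        old = storage_wiki_name(name, OLD_PATTERN)
        new = storage_wiki_name(name, NEW_PATTERN)
        print(f"{name}: old={old} new={new}")
```

With the old pattern, "wikidata" is treated as already complete, which is consistent with the `damaging/wikidata/` path seen in the storage-initializer log. With `.+wiki` it becomes "wikidatawiki", while names such as "enwiki" and "enwiktionary" stay unchanged.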