[07:34:55] o/
[07:43:59] yeah I see that the istio validation init container fails
[07:44:15] tried to quickly restart the kubelet and delete the pod but the quick solution didn't work :D
[07:54:34] good morning :)
[08:02:09] kevinbazira: o/ Hi Kevin, have you looked into the library https://pypi.org/project/async-mediawiki/ that you found?
[08:02:49] do you think it can be used for our revscoring models or new model?
[08:05:12] the last commit is from 2021 (https://github.com/Gelbpunkt/aiowiki), which is a little concerning
[08:05:24] but mwapi is far worse from this pov sooo :)
[08:05:46] we may need to send pull requests in the future, so knowing if the project is still active would be good
[08:06:04] I'd suggest quickly evaluating whether there are things missing compared to mwapi etc..
[08:06:07] and if the lib works
[08:06:32] and possibly opening a github issue explaining our use case, to see if the upstream author is still open to pull requests etc..
[08:06:35] does it make sense?
[08:06:35] o/
[08:06:39] aiko: It was a quick glance at the time I brought it up. I have not dug into it in depth. Please feel free to give it a shot.
[08:09:48] interesting - https://preliminary.istio.io/latest/docs/ops/diagnostic-tools/cni/#diagnose-pod-start-up-failure
[08:09:58] this seems to be the problem with the istio-validation container on staging
[08:09:59] mmmm
[08:11:05] ahhh ok we are missing the config in /etc/cni/net.d/10-calico.conflist
[08:12:02] elukey: thanks for the suggestion!
[08:14:26] np :)
[08:14:34] Filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/809534/ for the calico cni plugin
[08:16:28] ok so merging, running puppet and restarting the kubelets
[08:18:38] ok istio-validation now works
[08:19:01] but the storage initializer says
[08:19:02] botocore.exceptions.NoCredentialsError: Unable to locate credentials
[08:19:16] that is correct, we haven't really added credentials in puppet private yet
[08:19:34] so we have two roads:
[08:20:00] 1) we use a separate swift/s3 account for ml-staging, so separate credentials, models, etc..
[08:20:14] 2) we use the same one as the ml-serve clusters
[08:28:48] the last task was https://phabricator.wikimedia.org/T280773
[08:29:06] and IIRC we haven't really created a read-only account for ml-serve
[08:29:13] so, another idea:
[08:29:36] 3) create a read-only account shared by staging and production
[08:44:21] elukey: who should I be asking for what to do about T307389?
[09:08:18] taavi: ah yes definitely, apologies if nobody answered :( kevinbazira what do you think about the task above?
[09:17:42] taavi: I'll ping people today, if you don't hear from anybody please ping us again tomorrow
[09:18:40] Machine-Learning-Team, Data-Services, Wikilabels, Cloud-VPS (Debian Stretch Deprecation), cloud-services-team (Kanban): Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (kevinbazira) Sorry for the late response. @elukey regarding cleaning up/deprecati...
[09:19:07] elukey: https://phabricator.wikimedia.org/T307389#8036241
[09:19:25] I've added my take to the task
[09:23:36] ack thanks! Let's wait for Chris' response but I think that we can remove all those VMs
[09:30:39] kevinbazira: It's not super urgent. But if you can find time to quickly evaluate the library you found (like Luca suggested), that would be great. Let me know how it goes or update it in the task https://phabricator.wikimedia.org/T309623 :)
[09:57:05] ok. I'll look into it when I get a minute.
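To make the NoCredentialsError above concrete, here is a minimal sketch of the S3 download the kserve storage-initializer effectively performs via boto3/botocore, assuming Swift's S3-compatible API on the Thanos cluster; the endpoint, bucket and object names are illustrative, not the real ml-serve values.

    # Minimal sketch (not the actual kserve code): boto3 resolves credentials from
    # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY. When the Secret is not mounted,
    # as on ml-staging right now, the credential chain is empty and botocore
    # raises NoCredentialsError at request time, matching the pod log above.
    import os
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("S3_ENDPOINT_URL", "https://thanos-swift.discovery.wmnet"),  # illustrative
        aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    )

    # Hypothetical bucket/object names, only to show the shape of the call.
    s3.download_file("wmf-ml-models", "articlequality/enwiki/model.bin", "/mnt/models/model.bin")

Populating the two AWS_* variables from a Kubernetes Secret, as production does, is what makes the same call succeed.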
[10:32:51] elukey: \o
[10:33:07] Staging is missing credentials for S3/Thanos, it seems: https://phabricator.wikimedia.org/P30609
[10:36:57] Sorry, Swift, not Thanos
[10:37:14] (or maybe I am once again confusing things)
[11:27:08] <- Lunch
[12:32:17] (PS2) AikoChou: WIP - outlink: use tornado async http client to fetch outlinks [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043)
[13:02:25] elukey: I can't for the life of me figure out how the credentials for Thanos/Swift are wired up in puppet/private/deployment-charts
[13:03:57] klausman: o/ did you see the backscroll? :)
[13:04:21] Oh, it got lost in me scrolling past :-S
[13:05:06] Well, at least it vindicates you for the missed backscroll yesterday :D
[13:05:23] I think using a readonly account for prod+staging is the right approach
[13:05:56] Still, how do revscoring and so on get the credentials?
[13:05:59] yes yes fully vindicated :D
[13:07:02] so kserve is the only one that needs to get the credentials, and in our templates/yamls we fetch them from known variables
[13:07:20] that we set in the puppet private repo, which in turn creates the helmfile private yaml configs on deploy1002
[13:08:10] Still not sure I'd be able to find the spot in our templates where it is wired up.
[13:08:15] we can go through the whole chain if you want, it took me a while the first time :D
[13:08:27] Just a pointer to the right file would probably do it
[13:08:30] let's see if I can come up with some pointers
[13:09:47] ah ok found it
[13:10:09] so in deployment-charts -> charts -> kserve-inference, there is serviceaccount.yaml
[13:10:22] and we have .Values.inference.swift_s3_secret_name
[13:11:25] then, if you look in helmfile.d -> ml-services -> revscoring-articlequality -> helmfile.yaml
[13:11:46] at line 40 there is "service_secrets"
[13:12:06] Also line 25
[13:12:15] it uses the "secrets" chart, which is in deployment-charts, and the values come from
[13:12:18] - "/etc/helmfile-defaults/private/ml-serve_services/revscoring-articlequality/{{ .Environment.Name }}.yaml"
[13:12:32] this specific file is on deploy1002, and it gets populated by puppet using private values
[13:13:06] so the inference service is created to use a service account that can read a Secret resource, containing the swift credentials for the storage-initializer
[13:13:09] I presume the file on deploy1002 is still managed from puppetmaster /srv/private?
[13:13:41] Or does it only live on deploy machines?
[13:15:33] it is managed by puppet (public) with values coming from puppet private
[13:15:41] Also, should I file a ticket similar to T280773 for an r-o account?
[13:16:27] the file on puppet private is hieradata/role/common/deployment_server/kubernetes.yaml
[13:16:36] yep I think we should file a task
[13:19:58] once we get the new credentials we set them up in puppet private and deploy them via helmfile
[13:20:02] and we should be ok
[13:21:53] Does the `:` in `mlserve:prod` have any particular meaning or is it just convention?
[13:22:31] (PS1) Elukey: Update the ores submodule to deploy the last changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/809597
[13:22:52] I am wondering if/how the "readonly" part should be reflected in the name
[13:23:09] klausman: IIRC other accounts follow a similar pattern, they are all stored in puppet
[13:23:14] mlserve_prod-ro and mlserve:prod-ro?
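As a side note on the WIP outlink change above (Gerrit 807135): a rough sketch of the idea, fetching a page's outlinks with Tornado's non-blocking HTTP client. The function name, wiki host and query parameters here are illustrative and simplified, not the actual patch.

    # Sketch: fetch outlinks of a page via the MediaWiki Action API without
    # blocking the event loop, using tornado.httpclient.AsyncHTTPClient.
    import json
    from urllib.parse import urlencode

    from tornado.httpclient import AsyncHTTPClient
    from tornado.ioloop import IOLoop

    async def fetch_outlinks(title, wiki="en.wikipedia.org"):
        client = AsyncHTTPClient()
        params = urlencode({
            "action": "query", "prop": "links", "titles": title,
            "plnamespace": 0, "pllimit": "max", "format": "json",
        })
        response = await client.fetch(f"https://{wiki}/w/api.php?{params}")
        data = json.loads(response.body)
        pages = data["query"]["pages"]
        return [link["title"] for page in pages.values() for link in page.get("links", [])]

    if __name__ == "__main__":
        print(IOLoop.current().run_sync(lambda: fetch_outlinks("Toni Morrison")))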
[13:23:58] we could create mlserve:ro and keep mlserve:prod, not super clear but it wouldn't cause a change in the actual account
[13:24:14] at some point in the future we'll have to migrate our accounts to the MOSS cluster (we are still on thanos)
[13:24:22] That would also work, even if mlserve:prod then is a bit misleading
[13:24:24] so we may be able to change the naming
[13:24:54] I just don't want to change a million things at the same time
[13:25:39] I agree, we can just add one account at the moment
[13:25:52] Ok, will make a ticket
[13:30:32] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) Totally makes sense, thanks for the clarifications! Given the work to do on the Python side (no time for the new lib sorry), I am...
[13:32:42] Lift-Wing, SRE-swift-storage, Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (klausman)
[13:34:29] (PS2) Elukey: Update the ores submodule to deploy the last changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/809597
[13:41:39] (PS3) Elukey: Update the ores submodule to deploy the last changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/809597
[13:44:13] (CR) Elukey: "Cherry picked the change in deployment prep, tested manually all use cases and ran httpbb's test suite. Nothing weird found :)" [services/ores/deploy] - https://gerrit.wikimedia.org/r/809597 (owner: Elukey)
[13:45:05] all right it seems that Releng found a way to make github -> gerrit mirroring work!
[13:45:14] I filed a change for the ores submodule, all tested, looks good
[13:45:20] small changes that are pending to be deployed
[13:47:46] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) Ya that sounds good! FWIW, if you are just trying to test stuff, you are welcome to produce directly to a test kafka topic. The...
[13:50:07] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) >>! In T301878#8037150, @Ottomata wrote: > Ya that sounds good! > > FWIW, if you are just trying to test stuff, you are welcome to...
[13:50:53] klausman: not sure if you followed the last updates in https://phabricator.wikimedia.org/T310980 but we may or may not be able to use bullseye with cassandra :D
[13:51:12] if Eric Evans uploads Cassandra 4.x to our repos we could try it
[13:51:29] we'd be the first ones, but we won't need to migrate/upgrade/whatever for a long time :D
[13:51:39] lemme know your thoughts when you have a moment
[13:51:52] (eqiad is running on buster + cassandra 3.11)
[13:52:22] * elukey bbiab
[14:08:09] Hmmm.
[14:09:00] Much as I like being slightly ahead of the curve, I'm not super keen on being the first to try and run Cassandra on Bullseye.
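If a Cassandra 4.x package does land on a Bullseye test host, a quick smoke test along these lines (a sketch using the Python cassandra-driver; the hostname is made up and authentication is omitted) would confirm which version a node actually runs:

    # Sketch: connect to a single node and read the server-reported release version.
    from cassandra.cluster import Cluster

    cluster = Cluster(["ml-cache-test.example.wmnet"], port=9042)  # hypothetical host
    session = cluster.connect()
    row = session.execute("SELECT release_version FROM system.local").one()
    print(f"Cassandra release: {row.release_version}")  # e.g. 3.11.x today on eqiad, 4.x after an upgrade
    cluster.shutdown()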
On the third hand, we would only be the first to do so _in WMF_, not the world
[14:13:10] (CR) Klausman: [C: +1] Update the ores submodule to deploy the last changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/809597 (owner: Elukey)
[14:40:21] yeah I get the feeling
[14:40:41] cassandra 4.x has a ton of improvements (also in performance)
[14:41:03] but it may of course give us some headaches that we (as WMF) haven't faced before
[14:41:11] but the good side of it is that Eric is a Cassandra committer :D
[14:43:33] --
[14:44:06] if people are ok we could deploy ORES tomorrow for https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/809597/, should be a tiny patch
[14:47:19] (PS4) Elukey: Update the ores submodule to deploy the last changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/809597
[14:47:21] (PS1) Elukey: scap: increase ores canary targets from 1 to 4 [services/ores/deploy] - https://gerrit.wikimedia.org/r/809617
[14:47:34] klausman: I have also filed --^ to increase the canary nodes to 4
[14:47:45] 2 for each DC, may be a good compromise over the 18 nodes
[15:56:59] all right going afk, have a nice rest of the day folks!
[15:58:10] bye Luca! :)
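Once the read-only account from T311628 exists and its credentials are in puppet private, a sanity check along these lines (a sketch; endpoint, bucket and key names are illustrative placeholders) would confirm that the account can read model artifacts but not write them:

    # Sketch: with read-only credentials, listing/downloading should succeed and a
    # write should be rejected with an access-denied style error.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client(
        "s3",
        endpoint_url="https://thanos-swift.discovery.wmnet",  # still on thanos, per the discussion above
        aws_access_key_id="READ_ONLY_KEY",         # placeholder
        aws_secret_access_key="READ_ONLY_SECRET",  # placeholder
    )

    print(s3.list_objects_v2(Bucket="wmf-ml-models", MaxKeys=5))  # expected to work

    try:
        s3.put_object(Bucket="wmf-ml-models", Key="canary.txt", Body=b"should fail")
    except ClientError as e:
        print("write correctly rejected:", e.response["Error"]["Code"])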