[05:02:23] (CR) Kevin Bazira: [C: +2] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/790700 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:06:26] (Merged) jenkins-bot: articlequality: add wmf-certificates to blubber.yaml [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/790700 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[07:43:59] hello folks, I checked some ores metrics and logs, nothing weird that I can see after the last reimages (to confirm what Tobias checked yesterday)
[07:52:15] started the reimage of ores2009
[08:14:26] o/ good morning Luca! :)
[08:17:44] elukey: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/790983
[08:17:58] elukey: I updated the articlequality image to the most recent build, which includes wmf-certificates and the extended feature output
[08:20:55] aiko: good morning, change ready to deploy whenever you want
[08:22:39] elukey: yesterday I tried to deploy but I couldn't
[08:22:48] ah right, still that problem
[08:22:49] mmm
[08:23:55] aiko: let's try this
[08:24:00] - ssh to deploy1002
[08:24:12] - in your home dir, mkdir helm_cache
[08:24:33] - export HELM_CACHE_HOME=/home/aikochou/helm_cache
[08:24:37] then try to deploy
[08:24:41] lemme know if it works
[08:24:46] ok
[08:28:23] elukey: I don't need to put anything in that helm_cache dir?
[08:28:29] nope
[08:29:22] basically helmfile uses "helm" behind the scenes, and helm uses some caching dirs. SRE changed those during the past days, and the new ones have permissions for groups that we are not in
[08:29:38] see
[08:29:39] elukey@deploy1002:~$ env | grep HELM
[08:29:39] HELM_CONFIG_HOME=/etc/helm
[08:29:39] HELM_CACHE_HOME=/var/cache/helm
[08:29:39] HELM_DATA_HOME=/usr/share/helm
[08:30:04] in your case the issue should be the CACHE_HOME dir
[08:30:29] and with `export etc..` you should override it (maybe double check it with the command that I used above just to be sure)
[08:33:30] elukey: doesn't work.. still the same error
[08:33:56] and the env var is correctly overridden?
[08:34:11] yep
[08:35:38] weird
[08:35:45] if you run the helmfile with --debug?
[08:35:48] do you get more info?
[08:35:56] let me see
[08:37:09] ohh yeah there's more info
[08:37:12] helm.go:88: [debug] open /home/aikochou/helm_cache/repository/wmf-stable-index.yaml: no such file or directory
[08:37:17] no cached repo found.
[08:37:58] ahhhh interesting
[08:39:12] at this point we can try helm repo update, aiko
[08:39:20] ok
[08:39:26] it shouldn't fail, and it should populate that custom cache dir
[08:39:57] (famous last words)
[08:40:11] yes, helm repo update works
[08:40:35] and you can now sync?
[08:40:43] yes!
[08:40:45] super :)
[08:40:48] deployed
[08:40:57] :)))
[08:41:05] ok so this is a workaround, we'll have to follow up with SRE to get a permanent and more stable solution
[08:41:17] when you have a moment could you please update your task with the workaround?
[08:41:24] nice, thanks Luca!!
[08:41:27] np :)
[08:41:43] Ok! no problem, I'll update the task
[08:53:47] ores2009 reimaged, will pool it later
[09:01:59] something is not right, the new pod doesn't start successfully. Checking logs..
[09:04:01] botocore.exceptions.NoCredentialsError: Unable to locate credentials in storage-initializer
[09:06:36] It seems the storage-initializer can't pull the model from S3 to local storage
[09:12:13] elukey: \o that leaves 6-8 to do, right?
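Putting the pieces above together, the cache workaround plus a deploy as a single shell session might look like the sketch below. The helmfile.d working directory and the ml-serve-eqiad environment name are assumptions for illustration (neither appears in full in the log); the rest is taken from the conversation.

```
# Sketch of the HELM_CACHE_HOME workaround, run on deploy1002.
mkdir -p ~/helm_cache                       # per-user cache dir that you can read/write
export HELM_CACHE_HOME="$HOME/helm_cache"   # override the unreadable /var/cache/helm
env | grep HELM                             # confirm the override took effect

helm repo update                            # populate the empty cache with repo indexes;
                                            # fixes the "no cached repo found" error

# Assumed working directory and environment name:
cd /srv/deployment-charts/helmfile.d/ml-services/revscoring-articlequality
helmfile -e ml-serve-eqiad --debug diff     # --debug surfaces the underlying helm error
helmfile -e ml-serve-eqiad sync             # deploy once the diff looks sane
```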
[10:16:28] yep I see a secret in the diff now
[10:16:52] syncing it, very weird that it got removed
[10:18:03] aiko: the pod is up :)
[10:18:07] going afk, ttl!
[10:18:29] about to start the reimage etc. of 2006
[10:35:46] elukey: do you think the swift-s3-credentials issue is related to the recent workaround (changing HELM_CACHE_HOME)?
[10:35:58] I'll continue the deployment to codfw and see if that happens again
[11:27:37] 2006 reimaged, examining and if good, repooling
[11:58:16] Pooling 2006 now, will start work on 2007 in a bit
[12:02:23] Lift-Wing, Machine-Learning-Team: Unable to run helmfile and check pods - https://phabricator.wikimedia.org/T307927 (achou) The following is a workaround for now: 1. ssh to deploy1002 2. in your home dir, `mkdir helm_cache` 3. `export HELM_CACHE_HOME=/home/aikochou/helm_cache` 4. `echo $HELM_CACHE_HOME`...
[13:10:44] Morning all!
[13:10:54] Yay! It's working!
[13:26:16] Heyo Chris
[13:31:33] I'm doing the final codfw ORES now (2008, since Luca already did 2009 this morning)
[13:39:19] klausman: nice :)
[13:39:36] Hope that mention of your first name didn't ping ya :)
[13:39:47] nono :)
[13:40:36] aiko: it may be related, do you recall if you have seen a Secret resource in the helmfile diff getting removed?
[13:46:39] elukey: I just checked the helmfile diff for codfw, and yeah I see the Secret has been removed
[13:46:58] aiko: ah snap, then it is an issue
[13:48:36] elukey: why does it get removed?
[13:48:56] aiko: no idea, it may be some internal helm behavior that I am not aware of
[13:50:46] elukey: ok, I'm going to update the task to add this
[13:51:10] elukey: does this secret come from the private repo?
[13:55:31] klausman: yes exactly
[13:55:47] Is it still present on the puppetmasters?
[13:55:56] it is rendered into a special helmfile private yaml on deploy1002
[13:56:05] it is yes, I was able to see the diff to add it bak
[13:56:06] *back
[13:56:38] I think that the special HELM_CACHE_HOME that Aiko uses is not the same as the one that puppet creates
[13:56:45] that is /var/cache/helm
[13:57:20] so the quick solution is to add Aiko and Kevin to the `deployment` group
[13:58:00] so they can read /var/cache/helm, but serviceops suggested creating a special dir for us etc..
[13:58:06] it may be cumbersome
[13:59:07] Lift-Wing, Machine-Learning-Team: Unable to run helmfile and check pods - https://phabricator.wikimedia.org/T307927 (achou) We just found the above workaround may cause an issue that the swift-s3-credentials Secret resource got removed for some reason: ` aikochou@deploy1002:/srv/deployment-charts/helmfi...
[14:00:33] ah no, another issue, sigh
[14:00:34] elukey@deploy1002:~$ ls -l /etc/helmfile-defaults/private/ml-serve_services/revscoring-articlequality/ml-serve-codfw.yaml
[14:00:37] -rw-r----- 1 mwdeploy deployment 407 Feb 17 20:15 /etc/helmfile-defaults/private/ml-serve_services/revscoring-articlequality/ml-serve-codfw.yaml
[14:00:40] aiko: --^
[14:01:03] ok now it makes sense, you cannot read the file with private values anymore
[14:05:41] ooh I see :(
[14:05:58] basically https://gerrit.wikimedia.org/r/c/operations/puppet/+/791036
[14:06:13] let's see what SRE thinks about it (also klausman lemme know your thoughts :)
[14:06:33] havin' a look
[14:07:19] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Unable to run helmfile and check pods - https://phabricator.wikimedia.org/T307927 (elukey) >>! In T307927#7920993, @achou wrote: > We just found the above workaround may cause an issue that the swift-s3-credentials Secret resource got removed for so...
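To make the failure mode above concrete, a sanity check for the private-values permission problem could look like this. The `ls` path and the `deployment` group come straight from the log; using `id` to check group membership and grepping the diff for the Secret are assumptions added here for illustration.

```
# Sketch: diagnose the unreadable private-values file on deploy1002.
id                     # assumption: check whether you're in the `deployment` group
ls -l /etc/helmfile-defaults/private/ml-serve_services/revscoring-articlequality/ml-serve-codfw.yaml
# -rw-r----- 1 mwdeploy deployment ...  -> readable only by owner and the `deployment` group

# Per the conversation above, when this file can't be read the chart renders
# without the private values, so the diff shows the Secret as removed rather
# than failing outright:
helmfile -e ml-serve-codfw diff | grep -B2 -A2 swift-s3-credentials   # assumed grep, for illustration
```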
[14:07:37] since some sudo is involved, we'll need to wait for a proper signoff from SRE
[14:07:40] it may take a few days
[14:07:49] aiko: I can deploy to codfw for you in the meantime
[14:08:31] (done)
[14:13:30] elukey: thanks!!
[14:42:10] ores2008 is done and I'll pool it in a minute, then take a quick break before the meeting
[14:43:16] klausman: nice, all codfw done :)
[14:43:44] Yep!
[14:44:14] I am debating with myself whether we could do eqiad within two days if we do two machines at a time, but we can discuss that during the meeting
[14:44:52] klausman: I think so, we can take two nodes out at the same time, we left 2001 and 2002 down for days without issues
[14:49:07] My main concern is doing major work on a Friday, which can still result in outages stretching into the weekend.
[14:56:30] yep I agree, probably we can take it easy and do some nodes tomorrow
[14:56:37] and finish on Monday, there is really no rush
[14:57:11] Ack
[15:18:09] going afk folks, have a nice rest of the day :)
[15:21:03] bye Luca! :)
[15:56:27] Bye Luca!
[20:44:37] chrisalbon: https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/ This might be good news for any future GPU efforts we might have
[20:44:52] The jury is still out on stability, functionality etc
[20:46:29] AH. Note:
[20:46:33] Will the source for user-mode drivers such as CUDA be published?
[20:46:35] These changes are for the kernel modules; the user-mode components are untouched. So user-mode will remain closed source and published with pre-built binaries in the driver and the CUDA toolkit.
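If we ever try those open kernel modules on a GPU host, a quick way to tell which flavor is loaded could be the sketch below. Nothing here comes from the log; the license strings are assumptions based on NVIDIA's announcement (the open modules are dual MIT/GPL licensed, while the proprietary ones report "NVIDIA").

```
# Sketch: check which NVIDIA kernel module flavor a host is running.
modinfo nvidia | grep -i '^license'   # "Dual MIT/GPL" -> open modules; "NVIDIA" -> proprietary
lsmod | grep '^nvidia'                # confirm the module is actually loaded
nvidia-smi                            # user-mode tools stay closed source and work with either
```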