[06:57:59] hi folks!
[06:58:14] I am going to do some work this morning, starting very easy
[07:01:42] good morning folks :)
[07:07:41] aiko: hello! Nice work on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/788747
[07:09:41] elukey: Hi Luca! thanks :)
[07:26:56] kevinbazira: o/ I deployed the change to fix the wikidata pods in codfw too, it went fine
[07:27:37] nice. thanks elukey 👏👏👏
[07:29:01] I noticed Tobias mentioning some broken pods in eqiad, I am trying to clear them out
[07:35:28] klausman: o/ I was able to clean up the broken pod in eqiad, did something like the following
[07:35:46] kubectl describe pod blabla -n revscoring-editquality-goodfaith | grep -i replica
[07:35:49] then
[07:36:06] kubectl delete rs $what-found-above -n revscoring-...-...
[07:36:58] all the broken pods had a duplicate in running state, so I think that the kserve controller (or maybe knative) may have left things in a weird state
[07:52:12] 10Lift-Wing, 10artificial-intelligence, 10articlequality-modeling, 10Machine-Learning-Team (Active Tasks): Upload articlequality model binaries to storage - https://phabricator.wikimedia.org/T307417 (10kevinbazira) 13/13 articlequality models were uploaded successfully to Thanos Swift. Here are their stor...
[07:52:13] need to run a quick errand, bbiab
[07:55:12] 10Lift-Wing, 10artificial-intelligence, 10articlequality-modeling, 10Machine-Learning-Team (Active Tasks): Create articlequality inference services - https://phabricator.wikimedia.org/T307418 (10kevinbazira) a:03kevinbazira
[08:17:28] I'm going to deploy new editquality and draftquality images for a single wiki to test the feature https://phabricator.wikimedia.org/T301766
[08:20:08] elukey: when I ssh to deploy1002 and check the current pods, it shows: You don't have permission to read the configuration for revscoring-draftquality/ml-serve-codfw (try sudo)
[08:20:41] elukey: I typed kube_env revscoring-draftquality ml-serve-codfw
[08:21:15] elukey: and the same for eqiad
[09:00:01] aiko: ah yes you'd need to use kube_env $kubernetes-namespace ml-serve-{eqiad,codfw}
[09:00:23] mmmm sorry so you are already using draftquality
[09:00:24] weird
[09:01:28] ah okok
[09:01:35] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add a new helmfile config for outlinks topic model - https://phabricator.wikimedia.org/T307895 (10achou)
[09:01:55] aiko: you are probably using `kubectl get pods -A` right? Try to replace `-A` with `-n revscoring-draftquality`
[09:02:14] -A tries to list all pods in all namespaces, and you don't have perms
[09:04:39] elukey: no no, I got the permission issue when I typed kube_env revscoring-draftquality ml-serve-codfw
[09:08:10] ah!
[09:10:47] aiko: indeed there is a perm issue, but were you able to check pods before?
[09:11:13] even on other namespaces I mean
[09:14:04] elukey: yeah I was able to check pods when we deployed articlequality and draftquality before. But now I can't.. very weird
[09:17:17] aiko: so https://gerrit.wikimedia.org/r/c/operations/puppet/+/790288 should fix it, I am testing it atm, but it will need some consensus from other folks
[09:17:24] I can check pods for you in the meantime
[09:17:44] you should be able to deploy but not check pods
[09:19:22] elukey: ok! I will proceed with deployment and let you know when I'm done
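The deployment-server workflow discussed above (kube_env plus namespace-scoped kubectl, and the broken-pod cleanup from 07:35) boils down to roughly the following sketch; the pod and ReplicaSet names are placeholders, not the exact ones from the session.

    # On a deployment server (e.g. deploy1002); names below are placeholders.
    # Point kubectl at a namespace/cluster pair with the kube_env helper:
    kube_env revscoring-editquality-goodfaith ml-serve-eqiad

    # List pods only in that namespace; -A would need cluster-wide read
    # permissions that deployers don't have:
    kubectl get pods -n revscoring-editquality-goodfaith

    # For a stuck pod, find the ReplicaSet that owns it...
    kubectl describe pod <broken-pod> -n revscoring-editquality-goodfaith | grep -i replica

    # ...and delete that ReplicaSet. In this session every broken pod already
    # had a healthy duplicate running, so removing the stale ReplicaSet was enough.
    kubectl delete rs <replicaset-name> -n revscoring-editquality-goodfaith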
[09:31:20] elukey: thanks for that cleanup.
[09:31:43] I figured it was something like that, but wasn't sure how much I'd break by randomly deleting stuff :)
[09:32:01] :)
[09:32:22] in theory the ReplicaSets are managed by the controllers and k8s, those don't show up when we diff
[09:32:36] since only the k8s resources we modify (like, say, the isvc etc.) are shown
[09:32:42] in this case I think it is a bug in kserve or similar
[09:33:13] Yeah, I suspect with all the churn of drain+reboot+uncordon, we tickled a race condition or something like that
[09:43:14] klausman: I may have found a way to make the ores wheels work
[09:43:48] so what I did was run `git lfs install --system` on ores2001
[09:43:55] it added the following to /etc/gitconfig
[09:44:04] [filter "lfs"] clean = git-lfs clean -- %f smudge = git-lfs smudge -- %f process = git-lfs filter-process required = true
[09:44:22] and then I did rm -rf /srv/deployment/ores and ran puppet
[09:44:28] now I see the submodule full of zip files
[09:44:32] Hmm.
[09:44:43] Do you think that maybe --global was just _renamed_ to --system?
[09:46:25] this is the weird part, I checked --global between the 2.3.x and 2.7.x releases and code, trying to find a reference to it
[09:46:28] but found none
[09:46:47] even `man git-lfs-install` on say ores1001 doesn't mention it
[09:46:50] Yeah, one would expect something along the lines of "Btw, we renamed this option" in the changelog
[09:47:11] there was a mention of --system IIRC
[09:47:31] /etc/gitconfig on ores1001 (stretch) doesn't have the filters though
[09:47:50] so I guess that somehow the config is added in a place that works for the whole deployment
[09:47:50] Might be that defaults changed, and now we need those options
[09:48:52] we could add to our ORES roles something like git::systemconfig
[09:49:16] since we manage it via puppet, and then apply the config to buster+ nodes
[09:49:21] so when reimaging we should be fine
[09:49:39] Yeah. Maybe test with 2002 as well, see if it fixes both machines. Since they clearly weren't 100% the same coming out of the install
[09:49:57] ahahahah yeah so on 2002 it will not work for sure
[09:50:18] As in: add it to puppet for Buster nodes, see if it auto-fixes (well, semi-auto) 2002, and if it does, reimage 2003
[09:50:28] Why?
[09:55:07] I was joking since 2001 worked perfectly at first and 2002 failed miserably :D
[09:55:13] Oh. Phew.
[09:55:21] I'd reimage 2001 or 2002 to avoid too many hosts down if you agree
[09:55:26] Yarp
[09:55:30] I am testing 2002 now
[09:55:39] Maybe do the puppet change first, reimage 2002, see what happens
[09:56:06] If you want me to help/review/shoulder-surf, lmk
[09:56:22] yep yep definitely
[10:00:49] worked on 2002
[10:02:38] Nice.
[10:33:12] klausman: https://gerrit.wikimedia.org/r/c/operations/puppet/+/790297/ this is the idea
[10:33:37] Looking...
[10:34:49] LGTM!
[10:35:59] ok so I'd just disable puppet everywhere, run on a couple, and re-enable
[10:36:53] :+1:
[10:48:25] 2001 and 2002 up, everything seems to work afaics
[10:48:57] so we can reimage one of them and see how it goes just to be sure
[10:49:16] I am going to log off in a bit, will check after lunch but will take it light for the rest of the day
[10:49:30] klausman: if you want to reimage please go ahead, otherwise we can do it tomorrow :)
[10:56:29] I'll do a reimage of 2002 in a bit.
[10:56:42] Basically retrace my steps from last time, see if the outcome is favorable this time.
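Condensing the manual git-lfs fix that worked on ores2001 and ores2002 into a rough shell sketch; the verification steps (the git config check and the wheel inspection glob) are assumptions about how one might double-check, not commands taken verbatim from the session.

    # On a Buster ORES host, as root; mirrors the steps described above.
    # Write the LFS filter configuration into /etc/gitconfig:
    git lfs install --system

    # Sanity check: the filter.lfs.* entries should now exist system-wide.
    git config --system --get-regexp '^filter\.lfs\.'

    # Throw away the broken checkout and let puppet restore it, now with the
    # smudge filter in place so LFS objects are materialized:
    rm -rf /srv/deployment/ores
    puppet agent --test

    # The wheels submodule should now hold real zip archives rather than LFS
    # pointer files (assuming the usual *.whl naming); no output means every
    # wheel is a real zip archive:
    file /srv/deployment/ores/deploy/submodules/wheels/*.whl | grep -v Zip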
[11:00:00] <- Lunch, bbiab
[11:45:13] Starting reimage of 2002 now
[11:50:50] Installer running
[12:08:33] post-install reboot
[12:13:40] fingers crossed :)
[12:15:08] Now hitting the first puppet run
[12:16:03] it will take a bit :)
[12:16:20] Yah, making a pot of tea in the meantime
[12:29:22] going out for a walk :)
[12:39:44] Puppet run is complete, now waiting for the reimage cookbook to complete
[12:39:51] then scap deploy and a reboot
[12:40:34] Hmm, the cookbook does a reboot of its own, so I'll skip a manual one :)
[12:45:32] reinstall+reboot complete, doing a scap deploy
[12:46:03] and complete
[12:46:26] # file /srv/deployment/ores/deploy/submodules/wheels|grep -v Zip
[12:46:27] /srv/deployment/ores/deploy/submodules/wheels: directory
[12:46:29] Looking good!
[12:47:07] elukey: once you're back, can you share some other ways to check whether 2002 now works fine? I presume looking at logstash
[13:12:45] klausman: \o/
[13:12:53] so we can check a few things
[13:12:59] 1) systemctl status uwsgi-ores
[13:13:07] 2) systemctl status ores-celery-worker
[13:13:15] and their log files, under /srv/logs/ores
[13:13:27] we can also check the ORES error log dashboard in logstash
[13:13:37] but it should contain what we see in the logs
[13:13:54] we can add traffic to say 2001 or 2002 and then observe if they log anything horrible
[13:14:02] celery-ores-worker.service*
[13:14:10] ah yes sorry
[13:14:14] always mess that up
[13:14:19] np, tab completion to the rescue :)
[13:15:34] "directing traffic" would be done how?
[13:17:27] yes sorry, I mean pooling
[13:17:31] sudo -i pool on the node
[13:17:35] and then checking logs
[13:22:39] Morning all!
[13:24:59] morning chris
[13:25:09] I'll pool 2002 and look for noise
[13:27:24] Looking good so far. Requests happen, no visible errors on-machine or on logstash
[13:30:17] super
[13:30:21] Now I see a few JSON decoding errors (likely malformed requests), but not many
[13:30:38] Two in total, so far
[13:31:21] https://phabricator.wikimedia.org/P27771 One example
[13:32:10] If I remember Python JSON decoding errors correctly, an error at 1,1 means the input was empty
[13:32:28] (whereas an actually valid empty JSON input would be `{}`)
[13:33:58] Let's keep this running for the rest of the hour, wait for explosions/noise/complaints, then pool 2001, give it 30m as well, and then decide how to proceed. If all is well, I can reimage two more nodes tomorrow (one by one, not in parallel), and if that goes well, pick up the pace for the rest.
[13:34:18] yep if it is wikibase-related we have a fix on revscoring for it that needs to be deployed
[13:34:33] +1
[13:35:00] I am going away in a bit, but please go ahead anytime, the plan looks good
[13:37:14] Alrighty!
[13:37:20] Take care, Luca.
[13:42:08] <3
[13:50:56] With the LW kernel updates done and now a plan for the ORES updates, I think I've got a grip on things (and can thus now strangle them :D )
[14:00:42] Pooling 2001 now
[14:25:58] So far similar to 2002: a few malformed-request errors, but nothing worse.
[15:10:05] 10Lift-Wing, 10Machine-Learning-Team: Unable to run helmfile and check pods - https://phabricator.wikimedia.org/T307927 (10achou)
[16:17:25] Looks like the change for ORES makes things work fine, will proceed with more machines tomorrow, as planned.
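The post-reimage validation that was settled on above, condensed into a rough per-host checklist; the exact log file names under /srv/logs/ores are not given in the session, so the tail glob is a placeholder.

    # On a freshly reimaged ores host:
    # 1) service health
    sudo systemctl status uwsgi-ores
    sudo systemctl status celery-ores-worker
    # 2) local logs (file names may differ; the directory is from the session)
    sudo ls /srv/logs/ores/
    sudo tail -f /srv/logs/ores/*.log
    # 3) put the host back in rotation and watch for errors, both locally and
    #    on the ORES error dashboard in logstash
    sudo -i pool
    # if anything looks wrong, take it back out of rotation:
    sudo -i depool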