[05:43:45] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks): Upload editquality model binaries to storage - https://phabricator.wikimedia.org/T301413 (kevinbazira) a: kevinbazira
[05:48:34] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks): Upload editquality model binaries to storage - https://phabricator.wikimedia.org/T301413 (kevinbazira) 76/76 editquality models were uploaded successfully to Thanos Swift.
[05:51:04] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: Add editquality isvc configurations to ml-services helmfile - https://phabricator.wikimedia.org/T301415 (kevinbazira) Inference services were created for all the 76 editquality mo...
[05:53:14] Lift-Wing, artificial-intelligence, editquality-modeling, Epic, Machine-Learning-Team (Active Tasks): Migrate editquality models - https://phabricator.wikimedia.org/T301409 (kevinbazira)
[05:55:14] Lift-Wing, artificial-intelligence, editquality-modeling, Epic, Machine-Learning-Team (Active Tasks): Migrate editquality models - https://phabricator.wikimedia.org/T301409 (kevinbazira) a: kevinbazira The migration of editquality models has been completed. 76/76 editquality [[ https://p...
[06:34:14] good morning folks
[06:34:25] I am planning to start the ores2001 reimage in a few
[06:39:25] (CR) Elukey: [V: +2 C: +2] Update scap settings for the Python 3.7 migration [services/ores/deploy] - https://gerrit.wikimedia.org/r/784649 (https://phabricator.wikimedia.org/T303801) (owner: Elukey)
[07:03:35] of course one puppet/erb conditional doesn't work
[07:03:42] uff
[07:03:47] trying to fix it before the reimage
[07:26:13] all right reimage of ores2001 started
[07:26:16] fingers crossed
[08:11:26] the first puppet run on an ores node takes ages
[08:11:54] also a ton of packages and configs on HDDs is probably not super fast
[08:27:07] ores2001 up and running!
[08:27:28] before repooling uwsgi, I'll leave it running as it is to check if any celery error comes up
[08:27:45] (celery should connect to redis to pick up tasks independently from the "local" uwsgi instance)
[08:28:00] * elukey bbiab
[08:50:20] host repooled, nothing horrible in the uwsgi logs afaics
[08:50:23] celery logs are clear
[08:51:09] so far it seems that it worked :)
[08:53:44] (logstash error logs are clear too)
[08:58:02] \o
[08:58:23] https://as13030.net/status/?ticket=4266277 <- not a great way to start the week :)
[08:58:34] (AS13030/init7 is my ISP)
[08:59:10] Also, nice work on the first ORES host, Luca. Lmk if I can help
[09:19:30] klausman: o/
[09:19:52] for ORES let's keep an eye together on the logs and logstash, atm all seems good from what I can see
[09:20:06] if you could double check as well I'd be very happy and reassured :D
[09:20:56] it seems too good to be true
[09:21:01] :D
[09:21:09] Let me dig up my logstash bookmarks
[09:21:24] but yeah the only error logs on the node that I can see are
[09:21:25] json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
[09:21:38] and this is the wikidata itemquality issue that Aiko is working on
[09:21:47] we need to deploy a new version of revscoring to fix it
[09:21:53] That is indeed suspiciously well-working
[09:22:36] we tested all models carefully in deployment-prep, so in theory we should be safe
[09:25:23] AFAICT the only errors are the JSON one you mentioned and a few timeouts
[09:26:03] e.g. `Feature extraction error for model 1085764976 and revision damaging due to: Timed out after 15 seconds.`
[09:26:21] Which is probably benign/normal
[09:26:47] these are the ones that I found as well, I was trying to see if they are reproduced in other nodes
[09:27:09] the model/revision logging is twisted, we need to release a patch to fix it
[09:27:21] but sadly it doesn't tell much about what wiki it was
[09:27:50] Yeah, that should be fixed if possible
[09:28:14] So color me cautiously optimistic about this migration :)
[09:28:32] IIRC there was no way to add the wiki when I added the logging, at least the info was not available
[09:29:21] ok yeah I see the feature extraction errors on some past logs on ores2* nodes
[09:29:41] let's keep an eye on their numbers
[09:34:24] klausman: I think that we could reimage ores2002 too to have a bigger set of nodes to check
[09:34:27] what do you think?
[09:34:47] I also wrote the procedure that I followed in https://phabricator.wikimedia.org/T303801#7895223
[09:36:59] in case, change is ready :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/788293/
[09:37:52] Should I do the actual deed?
[09:38:23] if you want yes!
[09:38:28] +1'd that change
[09:38:56] the scap repo should not give any issue during the first puppet run, but in case we'll need to scap deploy --limit
[09:39:05] klausman: if you have time please go ahead with the procedure
[09:39:08] ack.
[09:39:43] to depool/repool the node, I used the shortcut `sudo -i depool/pool` on the node itself
[09:39:48] Just one question: the reimage-to-buster would happen without updating the netboot cmdline? Or are those already done for all ORES nodes?
[09:39:50] but it is the same if done via conftool etc..
[09:40:19] klausman: so the partman recipe is the same and we are good, the buster-specific bits are set via cookbook
[09:40:28] ah, I see
[09:40:55] So, do you have an example cmdline for that?
[09:41:12] 2002 is depooled now
[09:41:29] ah so one thing that I noticed only now since you mentioned netboot
[09:41:42] on ores2001 I see one md0 raid1 array
[09:41:49] on ores2002 two, md0 and md1
[09:42:16] so I believe this is due to the fact that ores now uses the "standard" partman recipes created by SRE a while ago, but surely after the last ores node reimage
[09:42:28] looks good from my side but lemme know your opinion as well
[09:42:30] just to be sure
[09:42:57] I used `sudo cookbook sre.hosts.reimage ores2001 --os buster` on cumin1001 (forgot to add the task though)
[09:43:03] I think it should be fine. Will keep an eye on the reinstall in any case
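In rough shell form, the start of the per-host procedure being discussed here (just a consolidation of the commands quoted in the conversation; the canonical write-up is the T303801#7895223 comment linked earlier, and the follow-up steps, puppet-merge, scap deploy and repool, come up below):

  # on the ORES host itself: take it out of the serving pools
  # (same effect as doing it via conftool, per the shortcut mentioned above)
  sudo -i depool

  # on cumin1001: reimage to Buster with the standard cookbook;
  # -t attaches the run to the Phabricator task
  sudo cookbook sre.hosts.reimage -t 303801 --os buster ores2002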
[09:43:36] +1 perfect
[09:43:48] remember to puppet merge etc before the reimage
[09:44:01] so the first puppet run will pick up the change
[09:44:32] should I merge the change gerrit-side or will you?
[09:45:00] going to run pcc quick to verify the two changes in the diff
[09:45:03] then you can proceed
[09:45:34] (celery's systemd unit changes, and another config bit for ores as well, string -> int)
[09:45:47] (otherwise celery refused to start sigh)
[09:46:05] all good! https://puppet-compiler.wmflabs.org/pcc-worker1002/35024/ores2002.codfw.wmnet/index.html
[09:46:10] klausman: green light to merge!
[09:46:36] ack
[09:49:39] About to run: sudo cookbook sre.hosts.reimage -t 303801 --os buster ores2002
[09:49:52] +1
[09:50:44] Running
[09:51:20] the first puppet run took ages for me, I thought it was stuck but it was simply pulling a million packages in
[09:51:36] Server is POSTing...
[09:52:20] * elukey brb
[09:52:38] You always run away when things get Interesting :D
[09:53:20] (I kid, of course)
[09:55:49] and we're in the installer
[09:57:19] https://phabricator.wikimedia.org/P27336 New layout lgtm
[10:06:23] yep all good
[10:06:24] Man, installs on spinning rust sure take longer than on SSDs :D
[10:06:33] yeah :D
[10:07:42] apt's insistence on using O_SYNC/fsync() doesn't help
[10:07:57] what could possibly go wrong on HDDs
[10:08:29] I mean, during an install, it would be easier to not sync every write and just at the end run a sync, and maybe check all file checksums
[10:08:54] yes yes I was joking about how slow it is now :)
[10:09:10] If an install gets interrupted, the system is likely toast anyway
[10:09:44] \o/ Installing grub
[10:10:10] the first puppet run will take a lot more, brace yourself :D
[10:10:53] in theory we can do a couple of nodes per day if nothing comes up, a gentle pace that will allow us to have all on Buster during the next couple of weeks
[10:15:41] rebooted
[10:16:58] So I have to do the puppet cert dance?
[10:17:36] Doesn't look like it.
[10:17:53] nope all taken care by the cookbook
[10:18:01] puppet running now
[10:18:02] it is only for vms that we need the special handling
[10:19:37] and the cookbook thinks the install failed since it thinks the puppet run failed (it didn't)
[10:19:59] weird, it should retry in theory
[10:20:50] https://phabricator.wikimedia.org/P27339
[10:21:40] this may be a bug, the "asking operator etc.." didn't work right?
[10:22:02] I told it to skip the step (I can always do a manual run), but that was apparently still a failure
[10:22:11] ahhh okok
[10:22:44] Question is: do I have to remove downtimes etc manually?
[10:23:05] in theory yes, but we can let them expire natually
[10:23:09] *naturally
[10:23:09] ack.
[10:24:23] have you checked the puppet error?
[10:24:42] ah no it seems running
[10:25:19] klausman: did you kick off another run?
[10:25:29] yep
[10:25:46] Maybe I did so too early and that's what "broke" the cookbook
[10:25:52] okok I am following from the mgmt console
[10:35:20] created https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/788308, we'll see what Riccardo has to say about it :)
[10:43:22] Puppet run was done, but with non-0 exit, so rerunning it
[10:44:36] ah interesting, I think that we have to do the scap deploy --limit
[10:44:56] part of what it does is to kick off some post deployment scripts, that create the venv etc..
[10:45:16] I see. Is a reboot also maybe useful?
[10:45:23] yep definitely
[10:45:33] Ok will do that in a moment
[10:45:36] we can do both, scap deploy --limit ores2002.codfw.wmnet from deploy1002
[10:45:38] then a reboot
[10:45:41] klausman: --^
[10:45:44] Ack
[10:46:05] does scap have to run in a particular directory?
[10:46:21] puppet completed with 0 exit status
[10:47:06] yes, /srv/deployment/ores/deploy
[10:47:19] as root or normal user?
[10:49:07] trying as normal user seems to have worked
[10:49:40] rebooting now
[10:50:40] when you have a moment, Riccardo asked a question in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/788308/ about how the cookbook was run
[10:51:58] replied
[10:52:41] ahh so you were running puppet before the cookbook
[10:52:47] ores2002 is back
[10:54:34] still same issue with celery
[10:54:35] mmmm
[10:54:59] How is that manifesting?
[10:55:14] the celery-ores systemd unit fails
[10:55:26] Maybe we need a post-reboot deploy?
[10:55:48] nono it says that /srv/deployment/ores/deploy/venv/bin/celery is missing
[10:56:07] so there is some issue with scap
[10:56:15] I am going to try another deployment
[10:56:19] I also can't SSH into deploy1002 anymore?!
[10:56:40] probably moritz is rebooting it
[10:56:46] ah
[10:57:33] as for celery saying that /srv/deployment/ores/deploy/venv/bin/celery isn't there: that is actually the truth
[10:57:47] the dir exists, but no celery binary
[10:58:27] yep I think that the post deploy scripts that scap launches are not creating the venv correctly
[10:58:36] so I tried again with scap deploy --limit blabla --force
[10:58:53] since I think without --force it doesn't do anything if the target node is already up-to-date with the code
[10:58:58] meanwhile we want to run the scripts
[10:59:05] but it failed for a zip issue on wheels
[10:59:12] retrying
[10:59:26] it says
[10:59:26] Processing ./submodules/wheels/billiard-3.6.4.0-py3-none-any.whl
[10:59:26] Exception:
[10:59:36] raise BadZipFile("File is not a zip file")
[10:59:36] zipfile.BadZipFile: File is not a zip file
[10:59:52] but it doesn't make any sense, it worked for ores2001
[11:00:07] let's use the hammer, I am going to clean up the /srv/deployment/ores dir on ores2002 and redeploy
[11:00:28] ack
[11:01:20] so lovely to use git-lfs
[11:05:17] Was that ironic?
[11:05:24] yep :)
[11:05:54] I have used it personally, and it does have some rough edges, but it hasn't bitten me (yet)
[11:15:21] of course I made a mistake on deploy1002 sigh
[11:19:00] What happened?
[11:22:06] Oh.
[11:22:10] Just saw it.
[11:22:12] yep
[11:22:40] Well, shit happens. You're not the first and certainly not the last to do something like that
[11:23:00] Plus, it's recoverable.
[11:23:10] sure sure, not happy about it, very silly mistake
[11:23:11] Anything I can do to help with that?
[11:25:22] nah thanks
[11:25:48] then I'll go find some lunch, bbiab
[14:10:14] Morning all!
[14:10:54] Any issues with the reimage?
[14:11:05] hi!
[14:11:15] ores2001 perfect, 2002 problems with git-lfs
[14:11:26] Lol fuck
[14:11:34] then I caused a problem on a deployment-server by mistake and lost hours :(
[14:13:19] Ooophf. You all got this!
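For reference, the "hammer" redeploy described above boils down to roughly the following (a sketch built from the paths and flags mentioned in the log, not a verbatim replay of what was run):

  # on the target host: wipe the half-deployed tree so scap starts from scratch
  sudo rm -rf /srv/deployment/ores

  # on deploy1002, from the scap checkout (scap has to run from this directory);
  # --force makes scap redeploy even if the target already has the latest revision
  cd /srv/deployment/ores/deploy
  scap deploy --force --limit ores2002.codfw.wmnet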
[14:14:04] we got some issue while reimaging 2002 though, going to reimage it
[14:14:08] (again)
[14:14:09] let's see
[15:08:00] seems the same problem
[15:09:13] so on 2002, the submodules/wheels repo doesn't contain all zip files (the wheels) but some git-lfs ones
[15:09:18] on 2001 all zip files
[15:11:38] it is probably my fault, some wheels in https://gerrit.wikimedia.org/r/plugins/gitiles/research/ores/wheels/+/refs/heads/python37/billiard-3.6.4.0-py3-none-any.whl are git-lfs entities
[15:11:42] others are binaries
[15:12:39] pulled from https://git-lfs.github.com/spec/v1
[15:12:52] mmm I never synced the gerrit wheels in github
[15:12:58] I recall that there was a procedure
[15:20:08] so github was mirrored via diffusion to gerrit, but this is for model repos
[15:20:14] the ores wheels one didn't have this IIRC
[15:20:25] kevinbazira: do you recall?
[15:21:56] ok so https://github.com/wikimedia/research-ores-wheels/commit/a5de707d2ff292c8d7d31c51ba81531ce6e9911b
[15:22:12] this is an example, on the gerrit mirror the last time Aaron pushed the wheels
[15:42:02] ok no so https://github.com/wikimedia/editquality is the principal repo
[15:42:22] https://github.com/wikimedia/research-ores-wheels is a gerrit mirror
[15:52:02] something is weird
[15:52:11] I can't really understand the ores wheels set up
[15:52:26] and why on 2001 it worked
[15:54:21] for example, on deploy1002
[15:54:22] elukey@deploy1002:/srv/deployment/ores/deploy/submodules/wheels$ file billiard-3.6.4.0-py3-none-any.whl
[15:54:25] billiard-3.6.4.0-py3-none-any.whl: ASCII text
[16:17:26] I remember git annex (not LFS) having a separate pull command that worked in a non-obvious way. Is this maybe happening on 2002?
[16:27:49] no idea, it needs more work, for the moment it seems that even on the scap repo some files are lfs some files are not
[16:28:18] we have also upgraded scap to avoid a setting for git-lfs, but of course nothing like this happened in deployment-prep
[16:28:34] I am curious to see if we get the same reimaging ores2001
[16:29:24] when I reimaged ores2001, the scap repo on deploy1002 didn't get updated (I forgot it), so I did it afterwards and ran scap deploy to the target node
[16:29:36] meanwhile 2002 had the right code since the beginning to pull from
[16:29:41] maybe there is some weird race condition
[16:34:23] mmm so the last commit on master, that we are running on stretch nodes, has a mixture of lfs and binary wheels
[16:34:31] but it was mirrored to github
[16:35:01] maybe we just need to force the mirroring on github
[16:43:06] on ores2001
[16:43:07] ii git-lfs 2.7.1-1+deb10u1 amd64 Git Large File Support
[16:43:40] same on 2002
[16:43:54] same scap version
[16:50:41] Listen, I don't want to interrupt your flow, but it is okay to leave this. Go for a run. Have a glass of wine. Read a good book. And come back to this tomorrow.
[16:51:20] ah yes definitely
[16:51:30] I was just braindumping if others have ides :)
[16:51:34] *ideas
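A quick way to spot which wheels are still git-lfs pointer files rather than real binaries, based on the checks quoted above (a sketch; an un-fetched LFS object is a small text file whose first line is "version https://git-lfs.github.com/spec/v1", while a real wheel is a zip archive):

  cd /srv/deployment/ores/deploy/submodules/wheels
  # real wheels show up as "Zip archive data", LFS pointers as "ASCII text"
  file *.whl | grep -v 'Zip archive'
  # or look at the pointer header directly
  head -c 100 billiard-3.6.4.0-py3-none-any.whl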
[16:52:20] You can come to the Tech Dept budget meeting I'm attending in 8 minutes and let your brain rest
[16:55:17] thanks for the great opportunity but I'll keep going with the SRE meeting :D
[16:55:24] lol
[16:56:02] Also I probably won't be at the DSE K8s Sync meeting because it is 2:30am my time
[16:56:20] lol
[16:56:22] So you'll be representing us Luca
[16:58:31] If we do it, klausman might be best situated to do the work but I think Olja wanted a small initial meeting just to see if the idea is even viable at a basic level
[16:58:52] definitely
[16:59:46] You got volunteered to do it because you know k8s the best on this team, unfortunately for you.
[17:00:13] super fine, I have been talking about this with Ben and Joseph in the past months
[17:00:35] In theory it is just a matter of creating phab tasks
[17:00:40] and splitting the work
[17:01:23] Cool, the only thing, which I am sure you are aware, is that we might need to buy new hardware and the deadline is closing fast.
[17:01:35] yes yes +1
[17:02:29] It's okay to say "We don't know exactly what we need but our quick guess is X hardware" and we'll put in the procurement order before the deadline. Not ideal but whatever.
[17:04:16] elukey: one oddity I found:
[17:04:26] root@ores2002:/srv/deployment/ores/deploy/submodules/wheels# git lfs pull
[17:04:28] Skipping object checkout, Git LFS is not installed.B/s
[17:04:37] But it happens on both 2002 and 2001
[17:05:56] Plus, 2001 has one modified file (numpy)
[17:11:53] klausman: yeah weird
[17:12:12] I think we'll need to ask people with some context, I have zero idea atm
[17:12:19] Same.
[17:12:21] going afk for today, have a nice rest of the day folks!
[17:12:31] Night elukey!
[17:12:42] I will ruminate on it some tonight, maybe I'll have a bright idea by tomorrow (also, fresh eyes etc)
[17:12:52] Thanks both of you!
[17:17:53] Random question: what IRC client do you all recommend?
[17:18:32] Really depends on what you want out of it, and with what kind of environment (GUI, text terminal, web ui) you are comfortable
[17:19:57] Popular choices for text/terminal clients are weechat (which I use) and irssi.
[17:20:08] I have no strong preference in regards to the environment. I just want something "good". I am new to this IRC world. Right now I am using the web client of Libera chat.
[17:20:12] Thanks
[17:20:18] For GUI, I've heard good things about X-Chat and Quassel on Linux, but my knowledge there is limited.
[17:21:00] I have no idea about GUI clients for Windows. Well, in the 1990s, mIRC was popular, but 20+ years old "knowledge" is not very useful today :)
[17:22:14] I think I will go with WeeChat. Thanks.
[18:36:40] hey all, do you know what 'wikilabels' is? I have a pair of stretch VMs (clouddb-wikilabels-01,02.clouddb-services, officially maintained by the Cloud VPS admins but not in practice) that seem to have a postgres database related to that
[18:44:40] Machine-Learning-Team, Data-Services, Cloud-VPS (Debian Stretch Deprecation), cloud-services-team (Kanban): Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (Majavah) p: Triage→Medium
[19:31:10] Machine-Learning-Team, Data-Services, Wikilabels, Cloud-VPS (Debian Stretch Deprecation), cloud-services-team (Kanban): Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (Majavah)