[08:05:53] morning
[08:13:14] o/
[08:31:55] I think I found why sometimes the jobs run without memory-optimized runners, it seems that the tags are not being applied, and the memoptimized runners are part of the regular pool (so when I tested it and it got a memoptimized runner, it was just chance)
[08:32:44] quick review https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/45
[08:33:15] good morning, while reimaging an instance I ended up with a stuck DNS entry pointing to the old IP address. Could someone delete it for me if that is possible? ;) integration-cumin.integration.eqiad1.wikimedia.cloud. 58 IN A 172.16.1.230
[08:33:55] (I deleted that instance like half an hour ago but immediately created a new one with the same hostname, and I guess that has confused something)
[08:37:54] looking
[08:42:21] hashar: what would be the new ip?
[08:42:37] I ended up deleting the "new" instance
[08:42:40] so I will create a new one ;)
[08:42:57] oh, okok
[08:43:13] but if I create a new one, the DNS entry is not updated with the new ip
[08:43:36] dns is a bit complicated in our setup (many moving parts), so it takes a bit to debug (and andre.w is the one with most experience with it xd)
[08:44:00] I thought there was some guidance against recreating instances with the same name
[08:44:38] yep, historically it has been tricky (and it seems it's still not 100% ok, though it should be afaik)
[08:45:28] let me try to clean up dns leaks, though I'm not sure it's detected as one
[08:46:50] I guess next time I should delete the instance and wait for things to settle :)
[08:49:41] it did consider it a leak :/
[08:49:42] 0320dada-360f-473b-a8aa-9131fb7cd68d is linked to missing instance integration-cumin.integration.eqiad1.wikimedia.cloud.
[08:50:05] was it an old VM?
[08:54:36] dcaro: I guess so yes, at least that was the hostname
[08:55:22] blancadesal: are you ready for the toolforge k8s upgrade?
[08:55:39] dcaro: looks like the DNS entry is gone! you are a hero :)
[08:56:45] hashar: very old VMs will leak the dns entry when deleted and we manually run a script to clear those up from time to time (what I just did), new ones should not leak anything :), so if you see that issue again, raise it as it should be looked into
[08:57:26] dcaro: awesome, thank you. I have some other instances to create later today but they will come with a different hostname ;)
[08:57:40] arturo: yep
[08:57:41] I just wanted to retain the short `integration-cumin` as a convenience
[08:58:03] 👍
[09:00:23] arturo: we're in the meet
[09:00:48] dcaro: it worked. Thank you!
[09:18:46] and the puppet self-config ends up being broken: CSR retrieved from the master does not match the agent's public key. https://phabricator.wikimedia.org/T370130
[09:32:24] solved :)
[10:06:49] I also found out cloud/instance-puppet is not updated anymore and filed https://phabricator.wikimedia.org/T370136
[10:11:28] Oh yes, we noticed the other day :/, but forgot to open a ticket (something else was on fire)
[10:11:29] thanks!
[10:12:09] it's not "functional", so everything still works, but it does not reflect the changes in the DB anymore
[10:31:23] hmm
[10:36:10] at least there is still an ssh user hitting the repo in Gerrit ;)
[10:36:14] anyway, lunch time!!
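For context, a minimal sketch of how a leaked record like the one above could be checked and removed by hand; in practice the cleanup is done by the script dcaro mentions, so this is only illustrative. It assumes the designate plugin for the openstack CLI, admin credentials, and that the project zone is integration.eqiad1.wikimedia.cloud.; the <recordset-id> placeholder is hypothetical and would come from the list output.

```bash
# is the stale A record still being served?
dig +short A integration-cumin.integration.eqiad1.wikimedia.cloud

# list recordsets in the project zone and spot the one pointing at the old IP
openstack recordset list integration.eqiad1.wikimedia.cloud.

# remove it once it clearly belongs to the deleted instance
openstack recordset delete integration.eqiad1.wikimedia.cloud. <recordset-id>
```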
[10:37:20] oh I also filed a flavor request in order to rebuild the CI instances that are building Debian packages
[10:37:35] they used `g2.cores2.ram4.disk40` because they are from 2019, and that flavor no longer exists
[10:37:39] https://phabricator.wikimedia.org/T370127
[10:37:39] :)
[11:10:40] arturo: fyi, nfs-21 was stuck, we just rebooted it and I'm upgrading it now
[11:11:56] hashar: we are looking into it :)
[11:13:42] blancadesal: ok
[11:16:26] arturo: do you want me to do your remaining ones? I'm done with the non-nfs ones
[11:21:16] I just got paged
[11:21:20] by harbor
[11:21:36] 13:20 <wmcs-alerts> FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[11:22:11] the prometheus ssh session I had open is now stuck
[11:22:38] that's the same one that went down tonight
[11:23:21] arturo: I'm finishing the remaining worker nodes, then I'm going for lunch
[11:23:37] console does not show a prompt for me
[11:23:43] (on tools-prometheus-6)
[11:25:33] blancadesal: ack, thanks
[11:25:40] dcaro: shall I just force-reboot?
[11:25:45] on it
[11:26:07] back online
[11:26:41] wow, it has 32G ram
[11:26:53] the alert did not show up in alerts.w.o, I guess because the VM died
[11:27:26] prometheus is booting up, lots of stuff to load
[11:27:57] okok, prometheus is up and running now
[11:29:23] maybe we need a dedicated prometheus alert
[11:29:49] and have a bit longer `for` in the harbor one
[11:30:19] last log before prometheus died
[11:30:20] Jul 16 11:25:36 tools-prometheus-6 sssd[66645]: Child [1190843] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
[11:32:27] https://www.irccloud.com/pastebin/GhUWGPqZ/
[11:32:31] from sssd.log
[11:32:56] and sssd_wikimedia_org.log
[11:32:58] https://www.irccloud.com/pastebin/rzxdiODl/
[11:33:11] that error is repeated many times before too (so not new)
[11:33:41] arturo: I'll let you do the debugging and stop stepping on your toes xd
[11:33:57] ok
[11:34:04] I don't have a lot more information at the moment
[11:35:18] arturo: worker nodes all done
[11:35:30] blancadesal: ok, thanks!
[11:36:10] arturo: maybe we can finish the ingress nodes after the toolforge meeting?
[11:36:47] ok
[11:36:50] if not, tomorrow
[11:37:12] 👍
[11:41:14] * dcaro lunch
[11:46:39] I just created T370143
[11:46:39] T370143: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143
[13:58:59] dcaro: so is adding the special flavor just a case of adding it here? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/modules/cloudvps_flavors/main.tf
[13:59:39] yes, we try to modify flavors via tofu now
[13:59:47] blancadesal: let's do ingresses now?
[13:59:49] blancadesal: yep
[14:00:44] ok, I'll send a patch later then. could someone please +1 the request? T370127
[14:00:45] T370127: Request new flavor for integration project - https://phabricator.wikimedia.org/T370127
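A quick sketch of how the flavor situation behind that request could be double-checked, purely for illustration. `wmcs-openstack` is the admin wrapper mentioned later in the log (plain `openstack` with the right credentials on a cloudcontrol behaves the same way), and the assumption that current flavors follow a `gN.coresX.ramY.diskZ` naming pattern is mine.

```bash
# the 2019-era flavor really is gone (this should fail)
wmcs-openstack flavor show g2.cores2.ram4.disk40

# what similarly sized flavors exist today (2 cores / 4G RAM)
wmcs-openstack flavor list --all | grep 'cores2.ram4'
```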
[14:00:57] arturo: ok for ingresses
[14:01:25] blancadesal: +1d
[14:01:26] +1'd
[14:01:30] thanks
[14:03:01] blancadesal: so I'll just follow the instructions in the notes etherpad
[14:03:06] kubectl -n ingress-nginx-gen2 scale deployment ingress-nginx-gen2-controller --replicas=2
[14:03:17] ok
[14:03:17] then wait for the pod to go away -- it can take a while
[14:04:33] ✅ done, now waiting for pod to terminate
[14:05:55] unrelated, I see some ingress pods were OOMkilled. They have a request of 2GB memory. Given they run on dedicated VMs, I would just give them more memory
[14:05:56] so will we start with the node without the controller, or does that matter?
[14:07:24] we can start with that one, it will be faster, otherwise we will need to wait for another fat pod to relocate
[14:07:39] so, start with tools-k8s-ingress-9
[14:07:51] shall I do it?
[14:07:56] blancadesal: yes, go
[14:09:20] cookbook done
[14:09:34] ok, then go to the next!
[14:10:19] 8
[14:10:43] ack
[14:11:58] I've got a couple of easy MRs https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/170 https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/169
[14:12:06] 7: last one
[14:13:33] arturo: done, logs look ok
[14:13:44] I'll scale the replicas back up
[14:13:51] cool
[14:13:56] T370162
[14:13:56] T370162: toolforge: ingress-nginx pods get OOMkilled, consider scaling up - https://phabricator.wikimedia.org/T370162
[14:15:05] arturo: how do you detect they get killed?
[14:16:08] mmm, we just lost the pods, they got recreated, so they lost the info
[14:16:28] the Pod resource has something like 'reason for last termination' which contained 'OOMkilled'
[14:16:44] I see
[14:18:14] that's it for the ingress nodes
[14:18:27] that's it for the 1.25 upgrade :-)
[14:18:36] * arturo closes a bunch of tickets
[14:19:40] next up: 1.26 upgrade :))
[14:20:32] the king is dead, long live the king
[14:21:39] this test is currently failing
[14:21:44] https://www.irccloud.com/pastebin/1crhTTf4/
[14:22:24] seems unrelated to the k8s upgrade itself
[14:24:04] blancadesal: you can check if there's such a job, is that in tools?
[14:24:18] tools, yep
[14:25:05] there are two jobs right now
[14:25:07] https://www.irccloud.com/pastebin/CzSqLVBw/
[14:25:15] are you running any tests?
[14:25:56] not right now, I can see the one you are listing just terminating
[14:26:01] gone
[14:26:33] oops – false alarm: I was still running the tests xd
[14:27:19] xd
[14:27:42] and I was too -- sorry. Just cancelled my loop
[14:27:56] might have been a "collision" then
[14:27:56] ok, seems to be working now
[14:28:14] we might want to add some check to avoid running it in parallel on the same tool
[14:28:36] (or do something smart to be able to run in parallel in the same tool)
[14:31:16] we need to always be smarter xd
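For reference, a minimal sketch of the check behind the OOMKilled observation earlier (the 'reason for last termination' field arturo mentions). The pod name is a placeholder, and the namespace is the ingress-nginx-gen2 one from the scale command above.

```bash
# reason for the last termination of the pod's first container (e.g. OOMKilled)
kubectl -n ingress-nginx-gen2 get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# or just eyeball the "Last State" section in the describe output
kubectl -n ingress-nginx-gen2 describe pod <pod-name> | grep -A 5 'Last State'
```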
[14:44:48] argh, I somehow pushed this to main? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/commit/8554c52543ed188584216c2c9e55cf2cab84c51d 🤦
[14:47:20] branch not protected :-(
[14:47:24] I'll update the settings
[14:48:13] it is protected, just checked
[14:48:21] it's only protected against force-pushing
[14:48:38] and against members who are not 'maintainers'
[14:49:08] I think we need this
[14:49:09] there should be some settings though, at least in github there are
[14:49:10] https://usercontent.irccloud-cdn.com/file/BX4QgVly/image.png
[14:49:46] how does 'no one' work in the case of merging?
[14:50:21] anyway, should I revert and open a PR as normal folks do?
[14:52:14] blancadesal: I would force-revert, to not leave a weird history
[14:52:21] and then yes, open a PR
[14:52:35] the merge case should be covered by the other permission, no?
[14:52:37] https://usercontent.irccloud-cdn.com/file/rXSrWRcx/image.png
[14:53:24] ah yeah, that looks right
[14:53:26] unless you are extra sure that the commit should stay in the history :-)
[14:53:36] I'd revert and rewrite the history
[14:53:53] ok
[14:54:45] so 1) enable push by maintainers, 2) enable force push, 3) do the force push to rewrite history and remove the commit, 4) put the settings back into disabling push & force push
[14:57:14] might break some scripts though if they are not handling history rewrites (e.g. doing rebases, pulls, etc. instead of reset --hard)
[14:59:31] it's toooo late
[14:59:36] to apologize
[14:59:45] too late! 🎵
[14:59:47] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/15
[14:59:48] ....
[14:59:53] now that's in my head too xd
[15:00:00] I'm too suggestible
[15:00:17] I will keep exploiting that xd
[15:00:54] about the repo settings, I think all our other repos might need a similar change
[15:02:03] I think we don't allow force pushing by default on the repos, so only when really needed do you go, change the setting, force-push, and revert the setting so there's no accidental force push
[15:02:24] though I'm ok with changing it if everyone is ok
[15:02:28] yup, but it's totally possible to accidentally push to main
[15:03:05] oh, I thought it wasn't
[15:04:01] that's what just happened :/ I also thought it wasn't possible
[15:04:27] so all the repos are like that?
[15:05:12] yes, also in gerrit in general, unless manually disabled
[15:05:19] at least the ones I've sampled are like that
[15:05:53] on my local clones of gerrit repos I run
[15:05:54] "git remote set-url --push origin no_push_use_review"
[15:06:00] https://gitlab.com/gitlab-org/terraform-provider-gitlab
[15:06:01] so `git push` won't work
[15:06:04] xd
[15:06:19] so when the tofu patch gets merged what happens? gitops magic and the flavor becomes available, or are there extra steps?
[15:06:51] yes, extra steps
[15:06:59] we have the magic that exists before the gitops magic
[15:07:21] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[15:10:31] thanks
[15:10:54] please review when you can: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/15
[15:13:14] I think I need to up my commit message game btw https://usercontent.irccloud-cdn.com/file/RXePEHhr/Screenshot%202024-07-16%20at%2017.12.28.png
[15:13:34] xd
[15:14:05] at CERN I grepped the history with a set of swear words, and the count was >100
[15:14:08] xd
[15:14:22] (puppet repo, we had no CI/linting before)
[15:14:33] xd
[15:15:00] there were a lot of "Do this" -> "Now for real" -> "again" -> "XXXX" ...
[15:15:39] :))
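A minimal sketch of what step 3 of the four-step procedure above could look like on a local clone, assuming the accidental commit is still the tip of main and push/force-push were temporarily re-enabled; as noted above, other clones would then need a `git reset --hard origin/main` (or a fresh clone) rather than a plain pull.

```bash
git fetch origin
git checkout main
git reset --hard HEAD~1        # drop the accidental commit (8554c52...) locally
git push --force origin main   # rewrite main on the remote
# then put the branch protection settings back (step 4)
```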
[15:23:21] after the `tofu apply`, is there a way to check that the flavor now is indeed available, or is tofu's success message to be trusted?
[15:24:47] wmcs-openstack flavor list or similar I guess
[15:26:31] https://www.irccloud.com/pastebin/FOjyn7SW/
[15:26:45] (you can also run it directly in a cloudcontrol)
[15:30:22] dcaro: thanks
[15:40:58] dcaro: I'll deploy https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/423 tomorrow morning, I won't be around for much longer today to test it
[15:41:12] blancadesal: ack, np
[16:00:49] * arturo offline
[17:13:42] xd, while doing tests creating venvs I think I'm killing nfs (on toolsbeta)
[17:13:46] https://www.irccloud.com/pastebin/9emnBmK5/
[17:14:57] I'm running webservice shell to generate the venv, but if right after I source bin/activate (in a script), it does not find it
[17:16:28] if right after `dcaro@toolsbeta-bastion-6:~/toolforge-deploy$ sudo -i -u toolsbeta.automated-toolforge-tests rm -rf /data/project/automated-toolforge-tests/venv` I log in as the tool, the directory is there :/, if I log out and run the rm again, then it deletes it
[17:16:40] nfs shenanigans
[17:18:52] https://www.irccloud.com/pastebin/PxvoPy6g/
[17:19:09] hmpf
[17:21:17] hmm, if I ls $HOME in between, then venv appears, probably it's just cached that the path does not exist or something
[17:29:22] done
[17:29:46] andrewbogott: as yesterday, I'm leaving ceph adding/removing the single osd for load, feel free to ping me if anything goes awry
[17:29:50] * dcaro off
[17:29:54] s/ping/page
[17:31:00] there should be an alert downtime (added by the cookbook) so no alerts should trigger (Downtiming alert from cookbook - Adding hosts ['cloudcephosd1034.eqiad.wmnet'] to the cluster - dcaro@urcuchillay), but if you see any that are not silenced, send me a message and I'll check tomorrow
[17:35:30] ok!
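A tiny sketch of the workaround the `ls $HOME` observation above suggests, for anyone scripting venv creation over NFS and hitting the same stale attribute cache; it assumes the venv lives at $HOME/venv, as in the toolsbeta test.

```bash
# after the venv was created from another session (e.g. webservice shell):
ls "$HOME" > /dev/null             # force a fresh lookup of the home directory
source "$HOME/venv/bin/activate"   # the venv is now visible and can be sourced
```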