I think I found why the jobs sometimes run without a memory-optimized runner: it seems the tags are not being applied, and the memoptimized runners are part of the regular pool (so when I tested it and got a memoptimized runner, it was just chance)
quick review https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/45
good morning, while reimaging an instance I ended up with a stuck DNS entry pointing to the old IP address. Could someone delete it for me, if that is possible? ;) integration-cumin.integration.eqiad1.wikimedia.cloud. 58 IN A
(I deleted that instance like half an hour ago but immediately created a new one with the same hostname and I guess that has confused something)
looking
hashar: what would be the new ip? I ended up deleting the "new" instance
so I will create a new one ;)
but if I create a new one, the DNS entry is not updated with the new ip
dns is a bit complicated in our setup (many moving parts), so takes a bit to debug (and andre.w is the one with most experience with it xd)
I thought there was some guidance against recreating instances with the same name
yep, historically it has been tricky (and it seems it's still not 100% ok, though it should afaik)
let me try to cleanup dns leaks, though I'm not sure it's detected as one
I guess next time I should delete the instance and wait for things to settle :)
it did consider it as a leak :/
0320dada-360f-473b-a8aa-9131fb7cd68d is linked to missing instance integration-cumin.integration.eqiad1.wikimedia.cloud.
was it an old VM?
dcaro: I guess so yes, at least that was the hostname
blancadesal: are you ready for the toolforge k8s upgrade?
dcaro: looks like the DNS entry is gone! you are a hero :)
hashar: very old VMs will leak the dns entry when deleted, and from time to time we manually run a script to clear those up (what I just did). New ones should not leak anything :), so if you see that issue again, raise it as it should be looked into
dcaro: awesome thank you. I have some other instances to create later today but they will come with a different hostname ;)
I just wanted to retain the short `integration-cumin` as a convenience
arturo: we're in the meet
dcaro: it worked. Thank you! and the puppet self config ends up being broken: CSR retrieved from the master does not match the agent's public key. https://phabricator.wikimedia.org/T370130
I also found out cloud/instance-puppet is not updated anymore and filed https://phabricator.wikimedia.org/T370136
Oh yes, we noticed the other day :/, but forgot to open a ticket (something else was on fire)
it's not "functional" so everything still works, but it does not reflect the changes in the DB anymore
at least there is still an ssh user hitting the repo in Gerrit ;)
oh I also filed a flavor request in order to rebuild the CI instances that build Debian packages
they used `g2.cores2.ram4.disk40` because they are from 2019, and that flavor no longer exists
https://phabricator.wikimedia.org/T370127
arturo: fyi, nfs-21 was stuck, we just rebooted it and I'm upgrading it now
hashar: we are looking into it :)
arturo: do you want me to do your remaining ones? I'm done with the non-nfs ones
I just got paged
by harbor
13:20 <+wmcs-alerts> FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
the prometheus ssh session I had open is now stuck
that's the same one that went down tonight
arturo: I'm finishing the remaining worker nodes, then I'm going for lunch
console does not show a prompt for me
(on tools-prometheus-6)
blancadesal: ack, thanks
dcaro: shall I just force-reboot? on it
back online
wow, it has 32G ram
the alert did not show up in alerts.w.o, I guess because the VM died
prometheus is booting up, lots of stuff to load
okok, prometheus is up and running now
maybe we need a dedicated prometheus alert
and have a bit longer `for` in the harbor one
last log before prometheus died
Jul 16 11:25:36 tools-prometheus-6 sssd[66645]: Child [1190843] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason. https://www.irccloud.com/pastebin/GhUWGPqZ/
from sssd.log
and sssd_wikimedia_org.log
https://www.irccloud.com/pastebin/rzxdiODl/
that error is repeated many times before too (so not new)
arturo: I'll let you do the debugging and stop stepping on your toes xd
I don't have a lot more information at the moment
arturo: worker nodes all done
blancadesal: ok, thanks!
arturo: maybe we can finish the ingress nodes after the toolforge meeting? if not, tomorrow
I just created T370143
T370143: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143
dcaro: so is adding the special flavor just a case of adding it here? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/modules/cloudvps_flavors/main.tf
yes, we try to modify flavors via tofu now
blancadesal: let's do ingresses now?
blancadesal: yep
ok, I'll send a patch later then. could someone please +1 the request? T370127
T370127: Request new flavor for integration project - https://phabricator.wikimedia.org/T370127
arturo: ok for ingresses
blancadesal: +1d
+1'd
blancadesal: so I'll just follow the instructions in the notes etherpad
kubectl -n ingress-nginx-gen2 scale deployment ingress-nginx-gen2-controller --replicas=2
then wait for the pod to go away -- it can take a while
✅ done, now waiting for pod to terminate
unrelated, I see some ingress pods were OOMkilled. They have a request of 2GB memory. Given they run on dedicated VMs, I would just give them more memory
so will we start with the node without the controller, or does that matter?
we can start with that one, it will be faster, otherwise we will need to wait for another fat pod to relocate
so, start with tools-k8s-ingress-9
shall I do it?
blancadesal: yes, go
cookbook done
ok, then go to the next!
8
I've got a couple of easy MRs https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/170 https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/169
7: last one
arturo: done, logs look ok
I'll scale the replicas back up
T370162
T370162: toolforge: ingress-nginx pods get OOMkilled, consider scaling up - https://phabricator.wikimedia.org/T370162
arturo: how do you detect they get killed?
mmm, we just lost the pods, they got recreated, so they lost the info
the Pod resource has something like 'reason for last termination', which contained 'OOMKilled'
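A minimal sketch of that check, assuming the pod list has been fetched as JSON (for example with `kubectl get pods -o json`). The field path `status.containerStatuses[].lastState.terminated.reason` is the standard Pod status layout; the function name and the sample data are made up for illustration. The information lives on the Pod object itself, so it is gone once the pod is recreated.

```python
from typing import List, Tuple


def find_oom_killed(pod_list: dict) -> List[Tuple[str, str]]:
    """Return (pod name, container name) pairs whose last termination was an OOM kill."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            # lastState is empty ({}) for containers that never restarted
            terminated = status.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((pod_name, status.get("name", "<unknown>")))
    return hits


# Hypothetical sample, shaped like `kubectl get pods -o json` output
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "ingress-a"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "controller",
                        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "ingress-b"},
            "status": {"containerStatuses": [{"name": "controller", "lastState": {}}]},
        },
    ]
}

if __name__ == "__main__":
    for pod_name, container in find_oom_killed(SAMPLE):
        print(f"{pod_name}/{container} was OOM killed on its last termination")
```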
that's it for the ingress nodes
that's it for the 1.25 upgrade :-)
* arturo closes a bunch of tickets
next up: 1.26 upgrade :))
the king is dead, long live the king
this test is currently failing
https://www.irccloud.com/pastebin/1crhTTf4/
seems unrelated to the k8s upgrade itself
blancadesal: you can check if there's such a job, is that in tools?
tools, yep
there's two jobs right now
https://www.irccloud.com/pastebin/CzSqLVBw/
are you running any tests?
not right now, I can see the one you are listing just terminating
gone
oops – false: I was still running the tests xd
and I was too -- sorry. Just cancelled my loop
might have been a "collision" then
ok, seems to be working now
we might want to add some check to avoid running it in parallel on the same tool
(or do something smart to be able to run in parallel in the same tool)
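One possible shape for such a check, as a sketch only: an exclusive lock file in the tool's home, created with `O_CREAT | O_EXCL` so the second runner fails fast instead of colliding. The lock path, the exception name and the helper name are hypothetical, and this assumes exclusive create behaves atomically on the filesystem in use (worth verifying on NFS). A crashed run would leave a stale lock behind that has to be removed by hand.

```python
import os
import tempfile
from contextlib import contextmanager


class AlreadyRunning(RuntimeError):
    """Raised when another test run already holds the lock."""


@contextmanager
def single_run_lock(lock_path):
    """Hold an exclusive lock file for the duration of a test run."""
    try:
        # O_EXCL makes the create fail if the file already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(f"lock {lock_path} exists, another run is in progress")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        yield
    finally:
        os.unlink(lock_path)


if __name__ == "__main__":
    # Hypothetical lock location; a real one would live in the tool's home
    lock = os.path.join(tempfile.mkdtemp(), "functional-tests.lock")
    with single_run_lock(lock):
        print("running tests with the lock held")
```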
we need to always be smarter xd
argh, I somehow pushed this to main? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/commit/8554c52543ed188584216c2c9e55cf2cab84c51d 🤦
branch not protected :-(
I'll update the settings
it is protected, just checked
it's only protected against force-pushing
and against members who are not 'maintainers'
I think we need this
https://usercontent.irccloud-cdn.com/file/BX4QgVly/image.png
there should be some settings though, at least in github there are
how does 'no one' work in the case of merging? anyway, should I revert and open a PR as normal folks do?
blancadesal: I would force-revert, so as not to leave a weird history
and then yes, open a PR
the merge case should be covered by the other permission, no? https://usercontent.irccloud-cdn.com/file/rXSrWRcx/image.png
ah yeah, that looks right
unless you are extra sure that the commit should stay in the history :-)
I'd revert and rewrite the history
so 1) enable push by maintainers, 2) enable force push, 3) do the force push to rewrite history and remove the commit, 4) put the settings back into disabling push & force push
might break some scripts though if they are not handling history rewrites (ex. doing rebases, pull, etc. instead of reset --hard)
it's toooo late
to apologize
too late! 🎵
https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/15
....
now that's in my head too xd
I'm too suggestible
I will keep exploiting that xd
about the repo settings, I think all our other repos might need a similar change
I think we don't allow force pushing by default on the repos, so only when really needed do you go change the setting, force-push, and revert the setting, so there's no accidental force push
though I'm ok to change if everyone is ok
yup, but it's totally possible to accidentally push to main
oh, I thought it wasn't
that's what just happened :/
I also thought it wasn't possible
so all the repos are like that?
yes, also in gerrit in general, unless manually disabled
at least the ones I've sampled are like that
on my local clones of gerrit repos I run
"git remote set-url --push origin no_push_use_review"
https://gitlab.com/gitlab-org/terraform-provider-gitlab
so `git push` won't work
so when the tofu patch gets merged what happens? gitops magic and the flavor becomes available, or are there extra steps?
yes, extra steps
we have the magic that exists before the gitops magic
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
please review when you can: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/15
I think I need to up my commit message game btw https://usercontent.irccloud-cdn.com/file/RXePEHhr/Screenshot%202024-07-16%20at%2017.12.28.png
at cern I grep'ed the history with a set of swearwords, and the count was >100
(puppet repo, we had no CI/linting before)
there were a lot of "Do this" -> "Now for real" -> "again" -> "XXXX" ...
after the `tofu apply`, is there a way to check that the flavor is indeed available now, or is tofu's success message to be trusted?
`wmcs-openstack flavor list` or similar I guess
https://www.irccloud.com/pastebin/FOjyn7SW/
(you can also run it directly in a cloudcontrol)
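For scripting that check, a small sketch assuming the flavor list is requested as JSON (the upstream `openstack` client accepts `-f json`; whether the `wmcs-openstack` wrapper passes it through unchanged is an assumption here). The `Name` key matches the column header of the upstream client's output; the helper name and the sample flavors are hypothetical.

```python
import json


def flavor_exists(flavor_list_json: str, name: str) -> bool:
    """Return True if a flavor called `name` appears in `flavor list -f json` output."""
    flavors = json.loads(flavor_list_json)
    return any(flavor.get("Name") == name for flavor in flavors)


# Hypothetical output, trimmed to the fields used here
SAMPLE_OUTPUT = json.dumps(
    [
        {"ID": "1", "Name": "example.small", "RAM": 2048, "Disk": 20, "VCPUs": 1},
        {"ID": "2", "Name": "example.large", "RAM": 8192, "Disk": 40, "VCPUs": 4},
    ]
)

if __name__ == "__main__":
    print(flavor_exists(SAMPLE_OUTPUT, "example.large"))
```

In practice the JSON text would come from running the list command (for instance via `subprocess.run(..., capture_output=True, text=True)`) instead of the inline sample.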
dcaro: thanks
dcaro: I'll deploy https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/423 tomorrow morning, I won't be around for much longer today to test it
blancadesal: ack, np
xd, doing tests creating venvs I think I'm killing nfs (on toolsbeta)
https://www.irccloud.com/pastebin/9emnBmK5/
I'm running webservice shell to generate the venv, but if I source bin/activate right afterwards (in a script), it does not find it
if right after `dcaro@toolsbeta-bastion-6:~/toolforge-deploy$ sudo -i -u toolsbeta.automated-toolforge-tests rm -rf /data/project/automated-toolforge-tests/venv` I log in as the tool, the directory is there :/, if I log out and run the rm again, then it deletes it
nfs shenanigans
https://www.irccloud.com/pastebin/PxvoPy6g/
hmm if I ls $HOME in-between, then venv appears, probably it's just cached that the path does not exist or something
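A sketch of a workaround for the test script, based only on the observation above: list the parent directory (like the `ls` did) and retry until the path shows up or a timeout expires. Whether re-listing the directory really invalidates the cached lookup depends on the NFS client and mount options, so treat that part as an assumption; the function name and defaults are made up.

```python
import os
import time


def wait_for_path(path, timeout=30.0, interval=1.0):
    """Poll until `path` exists, re-listing its parent directory on each attempt."""
    deadline = time.monotonic() + timeout
    parent = os.path.dirname(os.path.abspath(path))
    while True:
        try:
            os.listdir(parent)  # mimic `ls`, which seemed to make the venv appear
        except OSError:
            pass  # parent may not be visible yet either, keep polling
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical usage before sourcing bin/activate in the script
    print(wait_for_path(os.path.expanduser("~/venv/bin/activate"), timeout=5.0))
```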
andrewbogott: as yesterday, I'm leaving ceph adding/removing the single osd for load, feel free to ping me if anything goes awry
s/ping/page
there should be an alert downtime (added by the cookbook) so no alerts should trigger ( Downtiming alert from cookbook - Adding hosts ['cloudcephosd1034.eqiad.wmnet'] to the cluster - dcaro@urcuchillay), but if you see any not silenced also send me a message and I'll check tomorrow