[07:26:33] greetings
[07:41:58] I'm taking a look at nfs-12, please do not reboot
[07:51:47] mmhh ok that's essentially background noise from lsof by wmf-auto-restart that gets stuck
[07:53:41] hola
[07:59:53] morning!
[08:01:41] hello!
[08:14:53] karma rebalanced quite quickly here, note I closed the task seven days ago https://phabricator.wikimedia.org/T336845#11166242
[08:29:53] tools-prometheus-9 went down it seems
[08:30:20] we had some issues like that in the past, and enabled the query log on prometheus side to get the queries that potentially killed it
[08:30:26] (I think it's still enabled)
[08:46:47] hello https://codesearch.wmcloud.org/ is unresponsive, I don't know anything about it but the instance is `codesearch9.codesearch.eqiad1.wikimedia.cloud` and my guess is the services are down there
[08:46:51] antoine opened T404163 for codesearch and AFAICT the host is stuck (ssh hangs). Is that something we usually look into or not?
[08:46:52] T404163: CodeSearch is unresponsive - https://phabricator.wikimedia.org/T404163
[08:47:03] there should be a frontend on port 3003 and a backend on 3002
[08:47:10] I would just restart the VM if that's ok
[08:47:22] the service should be sufficient :]
[08:47:44] I don't know anything about that system unfortunately
[08:47:54] hashar: not being able to ssh because it hangs makes it harder...
[08:47:57] ;)
[08:48:03] to just restart the services
[08:49:06] ahh "ssh hangs"
[08:49:09] sorry I missed that one
[08:49:13] :]
[08:50:03] volans: you can use the vm_console cookbook to get a console to the vm if you want to debug stuff
[08:50:42] k, looking
[08:51:25] looking at https://sal.toolforge.org/codesearch , the instance has a history of becoming unresponsive and requiring a reboot
[08:51:28] so I guess nothing unusual
[08:53:06] dcaro: it hangs too unfortunately, doesn't get to give me the prompt (but it works fine with the other instance in the project)
[08:53:30] I guess hard reboot it is at this point?
[08:53:33] that sounds like VM overwhelmed yep, little else to do
[08:57:34] volans: for more context, codesearch being down/overwhelmed is a pretty regular occurrence FWIW
[08:57:52] T403434 T403323 etc
[08:57:53] T403434: Codesearch down/unreachable (2025-09-02) - https://phabricator.wikimedia.org/T403434
[08:57:53] T403323: Codesearch down/unreachable (2025-08-30) - https://phabricator.wikimedia.org/T403323
[08:57:54] hashar: vm rebooted, indexes are starting up, should be up in a few
[08:58:50] hmpf... my loki in lima-kilo stopped working :/
[08:59:00] alloy stopped being able to send data to loki-tools backend
[08:59:05] `│ ts=2025-09-10T08:58:25.339304338Z level=warn msg="error sending batch, will retry" component_path=/ component_id=loki.write.loki_tools component=client host=loki-tools.loki.svc.cluster.local:3100 status=-1 tenant=tool-tf-test error="Post \"http://loki-tools.loki.svc.cluster.local:3100/loki/api/v1/push\": context deadline exceeded" │`
[09:01:16] volans: awesome thank you very much!
[09:05:59] 👋
[09:23:06] o/
[09:34:00] found a bugging error in our functional tests when checking logs, easy review: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[09:35:26] I got https://gerrit.wikimedia.org/r/c/operations/puppet/+/1186937 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1186938 out to fix their related tasks, please take a look
[09:35:32] it is just grief at the moment
[09:40:07] godog: you can test it by cherry-picking it to one of the puppetservers (ex. toolsbeta)
[09:41:35] oh yeah good point dcaro ! hadn't thought of that, what's the rollback like in that case after cherry-pick ?
[09:41:40] hmpf... that test is also failing in prod now, maybe the logs on loki are not working, looking
[09:42:12] godog: you can just `git reset --hard gerrit/production` on the puppetserver
[09:42:24] (or `HEAD^` if you only cherry-picked one commit)
[09:42:33] ack, thank you dcaro
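
[editor's note: a minimal sketch of the cherry-pick test flow suggested above, assuming a puppetserver checkout with a `gerrit` remote as implied by the rollback commands quoted just before; the checkout path and the change ref/patchset are hypothetical (Gerrit shows the exact ref on each change page):]

    # On a test puppetserver (e.g. toolsbeta), apply the pending change on top of production:
    cd /srv/git/operations/puppet                  # hypothetical checkout path
    git fetch gerrit refs/changes/37/1186937/1     # exact ref/patchset taken from the Gerrit UI
    git cherry-pick FETCH_HEAD
    # ... run puppet agent on an affected host and verify the fix ...
    # Roll back to the deployed branch afterwards (the commands quoted above):
    git reset --hard gerrit/production             # or `git reset --hard HEAD^` for a single cherry-pick
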
[09:45:06] logs are working in tools/prod, so maybe just slow?
[09:45:08] https://www.irccloud.com/pastebin/dONnK53O/
[09:53:24] could be yeah
[09:55:00] hmm... it's been >5m, and I still don't see the logs in loki :/, something is going on
[09:55:09] I'll open a task
[10:00:22] T404176
[10:00:23] T404176: [jobs-api] loki logs take really long to appear - https://phabricator.wikimedia.org/T404176
[10:00:29] some logs were lost too, looking
[10:20:59] there's something weird going on between k8s writing the logs in files, and alloy picking them up
[10:21:09] * dcaro lunch
[12:24:02] something else reminded me of this talk and wanted to share, probably one of the best I've seen https://www.youtube.com/watch?v=SxdOUGdseq4
[12:24:54] classic <3
[12:25:16] yeah great speaker Rich
[12:35:30] uff... my laptop took some extra time upgrading firmware... scary
[12:37:07] hmm... okta'd again too xd
[12:41:39] godog: thanks for sharing! (watching in the background)
[13:08:33] my lima-kilo is failing to start :,(, it's getting `sudo: unable to resolve host toolslocal: Temporary failure in name resolution` type of errors (in case anyone has seen that already)
[13:09:39] hmm... just adding the line `127.0.0.1 toolslocal` in the /etc/hosts of the VM fixes it, maybe a race condition somewhere?
[13:12:42] I suspect that the sssd config might be messing up with sudo
[13:21:07] I think I found it, `-hosts: files myhostname dns`, that `myhostname` should not be removed from nsswitch.conf
[13:23:19] I've run wmcs.toolforge.component.deploy for maintain-kubeusers, which worked fine but one test is broken: jobs-api/continuous-job-smoke-test.bats
[13:23:28] is that still the logs issues from before?
[13:23:39] that might be me messing things up, do you have the output/error?
[13:24:04] I should stop using automated-toolforge-tests tool for manual tests :/
[13:24:30] yes: https://phabricator.wikimedia.org/P83144
[13:24:57] that might be me yep
[13:25:07] let me use a different tool for my tests and leave that one free
[13:26:50] this is the hostname fix https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/274
[13:27:03] (easy review, unless you want to rebuild, then takes a bit)
[13:30:50] when did this break?
[13:31:41] I noticed just now
[13:31:51] rebuilt my lima-kilo, as loki was failing inside
[13:32:55] interesting, testing a quick rebuild to see if I also get the error
[13:33:37] "TASK [basic_system : Install base packages]" completed successfully
[13:33:45] that did not work for me already
[13:33:48] (with the main branch, without your patch)
[13:34:00] the task right after the one changing nsswitch already hung for me
[13:34:07] (rebuilt it 3 times with that failure)
[13:34:56] let me retry
[13:35:09] can you try using a different name? (I use `toolslocal`)
[13:35:19] maybe that's what triggers the issue
[13:35:49] started using different names when testing upgrades (adding `v128` or such for different k8s versions)
[13:37:08] where do you set the hostname?
[13:37:38] "myhostname" looks like a placeholder, so I'm not convinced that could fix the issue
[13:37:39] yep still hanging
[13:37:53] `./start_devenv.sh --name toolslocal`
[13:37:57] trying
[13:38:46] I'm trying without `--name`
[13:39:44] yep, without `--name` it works
[13:39:49] with --name it bails out immediately with something that looks related to the cache
[13:40:16] ah maybe because I had the other VM running
[13:40:53] yeah now it's starting
[13:41:42] oh yep, you have to manually stop the other
[13:42:24] still working fine with name=toolslocal
[13:42:59] :/
[13:43:16] it's quite reproducible on my setup, linux/qemu?
[13:43:56] oh, I just upgraded my laptop a couple hours ago
[13:44:01] maybe that's what changed
[13:44:09] yeah it could be a macos vs linux thing, but I'm still very surprised that "myhostname" can fix it
[13:44:27] unless it causes something else as a side effect
[13:44:29] that's a subsystem/library of nsswitch
[13:44:35] ahhhh
[13:44:40] I misread the help page
[13:44:47] then fine for me to have it
[13:44:55] confusing naming xd
[13:45:27] https://man7.org/linux/man-pages/man8/nss-myhostname.8.html
[13:46:12] "This resolves well-known hostnames like "localhost""
[13:46:38] gotcha. I was reading https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html that was less clear
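
[editor's note: for reference, the nsswitch.conf line the lima-kilo MR above keeps; `myhostname` is the systemd nss-myhostname module linked just before, which resolves the machine's own hostname (here the VM name passed via --name, e.g. toolslocal) without needing an /etc/hosts entry or DNS:]

    # /etc/nsswitch.conf inside the lima-kilo VM -- keep the myhostname module:
    hosts: files myhostname dns
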
[13:37:38] "myhostname" looks like a placeholder, so I'm not convinced that could fix the issue [13:37:39] yep still hanging [13:37:53] `./start_devenv.sh --name toolslocal` [13:37:57] trying [13:38:46] I'm trying without `--name` [13:39:44] yep, without `--name` it works [13:39:49] with --name it bails out immediately with something that looks related to the cache [13:40:16] ah maybe because I had the other VM running [13:40:53] yeah now it's starting [13:41:42] oh yep, you have to manually stop the other [13:42:24] still working fine with name=toolslocal [13:42:59] :/ [13:43:16] it's quite reproducible on my setup, linux/qemu? [13:43:56] oh, I just upgraded my laptop a couple hours ago [13:44:01] maybe that's what changed [13:44:09] yeah it could be a macos vs linux thing, but I'm still very surprised that "myhostname" can fix it [13:44:27] unless it causes something else as a side effect [13:44:29] that's a subsystem/library of nsswitch [13:44:35] ahhhh [13:44:40] I misread the help page [13:44:47] then fine for me to have it [13:44:55] confusing naming xd [13:45:27] https://man7.org/linux/man-pages/man8/nss-myhostname.8.html [13:46:12] "This resolves well-known hostnames like "localhost"" [13:46:38] gotcha. I was reading https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html that was less clear [14:00:39] I think that we are collecting logs from containers (k8s pods) each ~10s, and if your cronjob was quick enough, it does not get picked up (10s is the default here https://grafana.com/docs/alloy/latest/reference/components/local/local.file_match/#arguments) [14:01:11] though we probably will want to retain the logs a bit more instead/in addition to increasing that frequency [14:05:45] dcaro: interesting edge case, I deployed https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/960 but then I had to rebase it before merging [14:05:53] does it mean my change got overwritten and I should deploy again? [14:06:12] that was me that forgot to merge a change yesterday [14:06:17] (I deployed, but did not merge) [14:06:28] so I overwrote _your_ changte [14:07:06] hmm, now that you say it maybe xd, as both were maintain-kubeusers right? [14:07:11] yes [14:07:25] sorry about that then [14:07:45] you rebased then? we can just redeploy [14:07:52] that's fine, I will deploy again and we should be set [14:08:24] yes please [14:08:31] the default quota change was reverted [14:08:35] https://www.irccloud.com/pastebin/xxGGJobC/ [14:09:16] deploying now [14:10:09] this is why I'm not a fan of "deploy before merge", unless you have a proper locking system :) [14:10:23] maybe we could implement one, like a check in the deploy cookbook [14:11:27] usually the MR is created by the scripts, and would have merged both into one, but as this was a manual change in toolforge-deploy, the branches were different [14:11:45] +1 for improving though [14:11:53] ah the script already can merge two changes? that's cool [14:12:16] quota changes are a bit of an edge case [14:12:23] inside the edge case :) [14:13:10] yep, it was quite useful the last 'first of the month' automatic upgrades, so I could batch the pre-commit + python deps upgrades into one deploy [14:13:38] nice [14:15:15] https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#cleanup-for-finished-jobs this might be helpful for the logs issues [15:00:43] andrewbogott: are floating ip required for magnum load balancers? 
[14:05:45] dcaro: interesting edge case, I deployed https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/960 but then I had to rebase it before merging
[14:05:53] does it mean my change got overwritten and I should deploy again?
[14:06:12] that was me that forgot to merge a change yesterday
[14:06:17] (I deployed, but did not merge)
[14:06:28] so I overwrote _your_ change
[14:07:06] hmm, now that you say it maybe xd, as both were maintain-kubeusers right?
[14:07:11] yes
[14:07:25] sorry about that then
[14:07:45] you rebased then? we can just redeploy
[14:07:52] that's fine, I will deploy again and we should be set
[14:08:24] yes please
[14:08:31] the default quota change was reverted
[14:08:35] https://www.irccloud.com/pastebin/xxGGJobC/
[14:09:16] deploying now
[14:10:09] this is why I'm not a fan of "deploy before merge", unless you have a proper locking system :)
[14:10:23] maybe we could implement one, like a check in the deploy cookbook
[14:11:27] usually the MR is created by the scripts, and would have merged both into one, but as this was a manual change in toolforge-deploy, the branches were different
[14:11:45] +1 for improving though
[14:11:53] ah the script already can merge two changes? that's cool
[14:12:16] quota changes are a bit of an edge case
[14:12:23] inside the edge case :)
[14:13:10] yep, it was quite useful the last 'first of the month' automatic upgrades, so I could batch the pre-commit + python deps upgrades into one deploy
[14:13:38] nice
[14:15:15] https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#cleanup-for-finished-jobs this might be helpful for the logs issues
[15:00:43] andrewbogott: are floating ip required for magnum load balancers? T404150
[15:00:44] T404150: Additional floating IPs for gitlab-cloud-runner testing in testlabs project - https://phabricator.wikimedia.org/T404150
[15:01:09] dhinus: should not be, you can specify that you want one or don't in the template.
[15:01:40] For the moment best practice is to stick a nova-proxy in front of octavia unless someone has a special use case
[15:02:48] thanks. do you mind replying to that task?
[15:04:15] sure
[15:05:11] jobs-emailer has been flaky lately too
[15:05:14] https://usercontent.irccloud-cdn.com/file/KqaNBahO/image.png
[15:05:23] it triggered an alarm (email at least)
[15:05:44] hmm... might be related to the prometheus issue?
[15:49:14] hmm... i think loki is not working in lima-kilo
[15:49:20] anyone has been able to use it lately?
[15:51:47] hmpf... keep getting errors like `│ ts=2025-09-10T15:51:21.827061407Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-tools-memberlist: lookup loki-tools-memberlist on 10.96.0.10:53: read udp 192.168.211.217:51318->10.96.0.10:53: i/o timeout"` on loki-tools
[15:55:27] hmm, kube-dns is failing `│ [ERROR] plugin/errors: 2 loki-tools-memberlist. A: read udp 192.168.211.244:44714->172.18.0.1:53: i/o timeout`
[16:06:16] thanks for the tofu-cloudvps release taavi <3
[16:18:39] this looks suspicious `loki loki-tools-0 loki level=info ts=2025-09-10T16:16:37.075520977Z caller=table_manager.go:136 index-store=tsdb-2024-04-01 msg="uploading tables"`, 2024-04-01? (april fool's day? xd, loki is playing with me)
[16:21:42] hmpf... I'm rebuilding my lima-kilo from scratch, failing to make loki work after it started failing
[16:42:36] the dns errors happen also on a fresh lima-kilo install :/
[16:45:38] and alloy also fails 🤦‍♂️
[16:45:41] `alloy alloy-5btwq alloy ts=2025-09-10T16:45:23.481523511Z level=warn msg="error sending batch, will retry" component_path=/ component_id=loki.write.loki_tools component=client host=loki-tools.loki.svc.cluster.local:3100 status=-1 tenant=tool-tf-test error="Post \"http://loki-tools.loki.svc.cluster.local:3100/loki/api/v1/push\": context deadline exceeded"`
[16:45:57] minio is erroring too, though the pods don't get restarted
[16:46:00] I'll open a task
[16:46:18] anyone rebuilt lima-kilo lately? Anyone seeing those errors?
[16:50:12] created T404226 to followup, this is blocking me from running the functional tests
[16:50:12] T404226: [logging,lima-kilo] loki setup failst to start on linux - https://phabricator.wikimedia.org/T404226
[16:50:32] Raymond_Ndibe: if you have time, can you try that too? see if it happens on mac also?
[18:12:31] * dcaro off
[18:13:47] alloy won for today... `│ ts=2025-09-10T17:01:24.485001373Z level=error msg="final error sending batch" component_path=/ component_id=loki.write.loki_tools component=client host=loki-tools.loki.svc.cluster.local:3100 status=-1 tenant=tool-tf-test error="Post \"http://loki-tools.loki.svc.cluster.local:3100/loki/api/v1/push\": context deadline exceeded"` final error until next time
[18:13:56] cya!
[18:46:42] win 20
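
[editor's note: a hedged debugging sketch for the still-open T404226, checking the two failures seen above (the in-cluster DNS timeouts and the failing pushes to Loki); the service name, namespace and port are taken from the log lines, while the container images and the use of Loki's /ready endpoint are assumptions:]

    # 1. Can cluster DNS resolve the Loki service at all?
    kubectl -n loki run dnscheck -it --rm --restart=Never --image=busybox:1.36 -- \
        nslookup loki-tools.loki.svc.cluster.local
    # 2. If DNS works, is Loki itself ready on the port alloy pushes to?
    kubectl -n loki run lokicheck -it --rm --restart=Never --image=curlimages/curl -- \
        curl -sv http://loki-tools.loki.svc.cluster.local:3100/ready
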