[07:26:33] greetings
[07:41:58] I'm taking a look at nfs-12, please do not reboot
[07:51:47] mmhh ok that's essentially background noise from lsof by wmf-auto-restart that gets stuck
[07:53:41] hola
[07:59:53] morning!
[08:01:41] hello!
[08:14:53] karma rebalanced quite quickly here, note I closed the task seven days ago https://phabricator.wikimedia.org/T336845#11166242
[08:29:53] tools-prometheus-9 went down it seems
[08:30:20] we had some issues like that in the past, and enabled the query log on prometheus side to get the queries that potentially killed it
[08:30:26] (I think it's still enabled)
[08:46:47] hello https://codesearch.wmcloud.org/ is unresponsive, I don't know anything about it but the instance is `codesearch9.codesearch.eqiad1.wikimedia.cloud` and my guess is the services are down there
[08:46:51] antoine opened T404163 for codesearch and AFAICT the host is stuck (ssh hangs). Is that something we usually look into or not?
[08:46:52] T404163: CodeSearch is unresponsive - https://phabricator.wikimedia.org/T404163
[08:47:03] there should be a frontend on port 3003 and a backend on 3002
[08:47:10] I would just restart the VM if that's ok
[08:47:22] the service should be sufficient :]
[08:47:44] I don't know anything about that system unfortunately
[08:47:54] hashar: not being able to ssh because it hangs makes it harder...
[08:47:57] ;)
[08:48:03] to just restart the services
[08:49:06] ahh "ssh hangs"
[08:49:09] sorry I missed that one
[08:49:13] :]
[08:50:03] volans: you can use the vm_console cookbook to get a console to the vm if you want to debug stuff
[08:50:42] k, looking
[08:51:25] looking at https://sal.toolforge.org/codesearch , the instance has a history of becoming unresponsive and requiring a reboot
[08:51:28] so I guess nothing unusual
[08:53:06] dcaro: it hangs too unfortunately, doesn't get to give me the prompt (but it works fine with the other instance in the project)
[08:53:30] I guess hard reboot it is at this point?
[08:53:33] that sounds like VM overwhelmed yep, little else to do
[08:57:34] volans: for more context, codesearch being down/overwhelmed is a pretty regular occurrence FWIW
[08:57:52] T403434 T403323 etc
[08:57:53] T403434: Codesearch down/unreachable (2025-09-02) - https://phabricator.wikimedia.org/T403434
[08:57:53] T403323: Codesearch down/unreachable (2025-08-30) - https://phabricator.wikimedia.org/T403323
[08:57:54] hashar: vm rebooted, indexes are starting up, should be up in a few
[08:58:50] hmpf... my loki in lima-kilo stopped working :/
[08:59:00] alloy stopped being able to send data to loki-tools backend
[08:59:05] `│ ts=2025-09-10T08:58:25.339304338Z level=warn msg="error sending batch, will retry" component_path=/ component_id=loki.write.loki_tools component=client host=loki-tools.loki.svc.cluster.local:3100 status=-1 tenant=tool-tf-test error="Post \"http://loki-tools.loki.svc.cluster.local:3100/loki/api/v1/push\": context deadline exceeded" │`
[09:01:16] volans: awesome thank you very much!
[09:05:59] 👋
[09:23:06] o/
[09:34:00] found a bugging error in our functional tests when checking logs, easy review: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[09:35:26] I got https://gerrit.wikimedia.org/r/c/operations/puppet/+/1186937 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1186938 out to fix their related tasks, please take a look
[09:35:32] it is just grief at the moment
[09:40:07] godog: you can test it by cherry-picking it to one of the puppetservers (ex. toolsbeta)
[09:41:35] oh yeah good point dcaro ! hadn't thought of that, what's the rollback like in that case after cherry-pick ?
[09:41:40] hmpf... that test is also failing in prod now, maybe the logs on loki are not working, looking
[09:42:12] godog: you can just `git reset --hard gerrit/production` on the puppetserver
[09:42:24] (or `HEAD^` if you only cherry-picked one commit)
[09:42:33] ack, thank you dcaro
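
[editor's note: a minimal sketch of the cherry-pick test flow suggested above, assuming a puppetserver checkout with a `gerrit` remote as implied by the rollback commands quoted just before; the checkout path and the change ref/patchset are hypothetical (Gerrit shows the exact ref on each change page):]

    # On a test puppetserver (e.g. toolsbeta), apply the pending change on top of production:
    cd /srv/git/operations/puppet                  # hypothetical checkout path
    git fetch gerrit refs/changes/37/1186937/1     # exact ref/patchset taken from the Gerrit UI
    git cherry-pick FETCH_HEAD
    # ... run puppet agent on an affected host and verify the fix ...
    # Roll back to the deployed branch afterwards (the commands quoted above):
    git reset --hard gerrit/production             # or `git reset --hard HEAD^` for a single cherry-pick
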
[09:45:06] logs are working in tools/prod, so maybe just slow?
[09:45:08] https://www.irccloud.com/pastebin/dONnK53O/
[09:53:24] could be yeah
[09:55:00] hmm... it's been >5m, and I still don't see the logs in loki :/, something is going on
[09:55:09] I'll open a task
[10:00:22] T404176
[10:00:23] T404176: [jobs-api] loki logs take really long to appear - https://phabricator.wikimedia.org/T404176
[10:00:29] some logs were lost too, looking
[10:20:59] there's something weird going on between k8s writing the logs in files, and alloy picking them up
[10:21:09] * dcaro lunch
[12:24:02] something else reminded me of this talk and wanted to share, probably one of the best I've seen https://www.youtube.com/watch?v=SxdOUGdseq4
[12:24:54] classic <3
[12:25:16] yeah great speaker Rich
[12:35:30] uff... my laptop took some extra time upgrading firmware... scary
[12:37:07] hmm... okta'd again too xd
[12:41:39] godog: thanks for sharing! (watching in the background)
[13:08:33] my lima-kilo is failing to start :,(, it's getting `sudo: unable to resolve host toolslocal: Temporary failure in name resolution` type of errors (in case anyone has seen that already)
[13:09:39] hmm... just adding the line `127.0.0.1 toolslocal` in the /etc/hosts of the VM fixes it, maybe a race condition somewhere?
[13:12:42] I suspect that the sssd config might be messing up with sudo
[13:21:07] I think I found it, `-hosts: files myhostname dns`, that `myhostname` should not be removed from nsswitch.conf
[13:23:19] I've run wmcs.toolforge.component.deploy for maintain-kubeusers, which worked fine but one test is broken: jobs-api/continuous-job-smoke-test.bats
[13:23:28] is that still the logs issues from before?
[13:23:39] that might be me messing things up, do you have the output/error?
[13:24:04] I should stop using automated-toolforge-tests tool for manual tests :/
[13:24:30] yes: https://phabricator.wikimedia.org/P83144
[13:24:57] that might be me yep
[13:25:07] let me use a different tool for my tests and leave that one free
[13:26:50] this is the hostname fix https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/274
[13:27:03] (easy review, unless you want to rebuild, then takes a bit)
[13:30:50] when did this break?
[13:31:41] I noticed just now
[13:31:51] rebuilt my lima-kilo, as loki was failing inside
[13:32:55] interesting, testing a quick rebuild to see if I also get the error
[13:33:37] "TASK [basic_system : Install base packages]" completed successfully
[13:33:45] that did not work for me already
[13:33:48] (with the main branch, without your patch)
[13:34:00] the task right after the one changing nsswitch already hung for me
[13:34:07] (rebuilt it 3 times with that failure)
[13:34:56] let me retry
[13:35:09] can you try using a different name? (I use `toolslocal`)
[13:35:19] maybe that's what triggers the issue
[13:35:49] started using different names when testing upgrades (adding `v128` or such for different k8s versions)
[13:37:08] where do you set the hostname?
[13:37:38] "myhostname" looks like a placeholder, so I'm not convinced that could fix the issue
[13:37:39] yep still hanging
[13:37:53] `./start_devenv.sh --name toolslocal`
[13:37:57] trying
[13:38:46] I'm trying without `--name`
[13:39:44] yep, without `--name` it works
[13:39:49] with --name it bails out immediately with something that looks related to the cache
[13:40:16] ah maybe because I had the other VM running
[13:40:53] yeah now it's starting
[13:41:42] oh yep, you have to manually stop the other
[13:42:24] still working fine with name=toolslocal
[13:42:59] :/
[13:43:16] it's quite reproducible on my setup, linux/qemu?
[13:43:56] oh, I just upgraded my laptop a couple hours ago
[13:44:01] maybe that's what changed
[13:44:09] yeah it could be a macos vs linux thing, but I'm still very surprised that "myhostname" can fix it
[13:44:27] unless it causes something else as a side effect
[13:44:29] that's a subsystem/library of nsswitch
[13:44:35] ahhhh
[13:44:40] I misread the help page
[13:44:47] then fine for me to have it
[13:44:55] confusing naming xd
[13:45:27] https://man7.org/linux/man-pages/man8/nss-myhostname.8.html
[13:46:12] "This resolves well-known hostnames like "localhost""
[13:46:38] gotcha. I was reading https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html that was less clear
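
[editor's note: for reference, the nsswitch.conf line the lima-kilo MR above keeps; `myhostname` is the systemd nss-myhostname module linked just before, which resolves the machine's own hostname (here the VM name passed via --name, e.g. toolslocal) without needing an /etc/hosts entry or DNS:]

    # /etc/nsswitch.conf inside the lima-kilo VM -- keep the myhostname module:
    hosts: files myhostname dns
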
[13:37:38] "myhostname" looks like a placeholder, so I'm not convinced that could fix the issue [13:37:39] yep still hanging [13:37:53] `./start_devenv.sh --name toolslocal` [13:37:57] trying [13:38:46] I'm trying without `--name` [13:39:44] yep, without `--name` it works [13:39:49] with --name it bails out immediately with something that looks related to the cache [13:40:16] ah maybe because I had the other VM running [13:40:53] yeah now it's starting [13:41:42] oh yep, you have to manually stop the other [13:42:24] still working fine with name=toolslocal [13:42:59] :/ [13:43:16] it's quite reproducible on my setup, linux/qemu? [13:43:56] oh, I just upgraded my laptop a couple hours ago [13:44:01] maybe that's what changed [13:44:09] yeah it could be a macos vs linux thing, but I'm still very surprised that "myhostname" can fix it [13:44:27] unless it causes something else as a side effect [13:44:29] that's a subsystem/library of nsswitch [13:44:35] ahhhh [13:44:40] I misread the help page [13:44:47] then fine for me to have it [13:44:55] confusing naming xd [13:45:27] https://man7.org/linux/man-pages/man8/nss-myhostname.8.html [13:46:12] "This resolves well-known hostnames like "localhost"" [13:46:38] gotcha. I was reading https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html that was less clear [14:00:39] I think that we are collecting logs from containers (k8s pods) each ~10s, and if your cronjob was quick enough, it does not get picked up (10s is the default here https://grafana.com/docs/alloy/latest/reference/components/local/local.file_match/#arguments) [14:01:11] though we probably will want to retain the logs a bit more instead/in addition to increasing that frequency [14:05:45] dcaro: interesting edge case, I deployed https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/960 but then I had to rebase it before merging [14:05:53] does it mean my change got overwritten and I should deploy again? [14:06:12] that was me that forgot to merge a change yesterday [14:06:17] (I deployed, but did not merge) [14:06:28] so I overwrote _your_ changte [14:07:06] hmm, now that you say it maybe xd, as both were maintain-kubeusers right? [14:07:11] yes [14:07:25] sorry about that then [14:07:45] you rebased then? we can just redeploy [14:07:52] that's fine, I will deploy again and we should be set [14:08:24] yes please [14:08:31] the default quota change was reverted [14:08:35] https://www.irccloud.com/pastebin/xxGGJobC/ [14:09:16] deploying now [14:10:09] this is why I'm not a fan of "deploy before merge", unless you have a proper locking system :) [14:10:23] maybe we could implement one, like a check in the deploy cookbook [14:11:27] usually the MR is created by the scripts, and would have merged both into one, but as this was a manual change in toolforge-deploy, the branches were different [14:11:45] +1 for improving though [14:11:53] ah the script already can merge two changes? that's cool [14:12:16] quota changes are a bit of an edge case [14:12:23] inside the edge case :) [14:13:10] yep, it was quite useful the last 'first of the month' automatic upgrades, so I could batch the pre-commit + python deps upgrades into one deploy [14:13:38] nice [14:15:15] https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#cleanup-for-finished-jobs this might be helpful for the logs issues [15:00:43] andrewbogott: are floating ip required for magnum load balancers? 
[14:05:45] dcaro: interesting edge case, I deployed https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/960 but then I had to rebase it before merging
[14:05:53] does it mean my change got overwritten and I should deploy again?
[14:06:12] that was me that forgot to merge a change yesterday
[14:06:17] (I deployed, but did not merge)
[14:06:28] so I overwrote _your_ change
[14:07:06] hmm, now that you say it maybe xd, as both were maintain-kubeusers right?
[14:07:11] yes
[14:07:25] sorry about that then
[14:07:45] you rebased then? we can just redeploy
[14:07:52] that's fine, I will deploy again and we should be set
[14:08:24] yes please
[14:08:31] the default quota change was reverted
[14:08:35] https://www.irccloud.com/pastebin/xxGGJobC/
[14:09:16] deploying now
[14:10:09] this is why I'm not a fan of "deploy before merge", unless you have a proper locking system :)
[14:10:23] maybe we could implement one, like a check in the deploy cookbook
[14:11:27] usually the MR is created by the scripts, and would have merged both into one, but as this was a manual change in toolforge-deploy, the branches were different
[14:11:45] +1 for improving though
[14:11:53] ah the script already can merge two changes? that's cool
[14:12:16] quota changes are a bit of an edge case
[14:12:23] inside the edge case :)
[14:13:10] yep, it was quite useful the last 'first of the month' automatic upgrades, so I could batch the pre-commit + python deps upgrades into one deploy
[14:13:38] nice
[14:15:15] https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#cleanup-for-finished-jobs this might be helpful for the logs issues
[15:00:43] andrewbogott: are floating ip required for magnum load balancers? T404150
[15:00:44] T404150: Additional floating IPs for gitlab-cloud-runner testing in testlabs project - https://phabricator.wikimedia.org/T404150
[15:01:09] dhinus: should not be, you can specify that you want one or don't in the template.
[15:01:40] For the moment best practice is to stick a nova-proxy in front of octavia unless someone has a special use case
[15:02:48] thanks. do you mind replying to that task?
[15:04:15] sure
[15:05:11] jobs-emailer has been flaky lately too
[15:05:14] https://usercontent.irccloud-cdn.com/file/KqaNBahO/image.png
[15:05:23] it triggered an alarm (email at least)
[15:05:44] hmm... might be related to the prometheus issue?
[15:49:14] hmm... i think loki is not working in lima-kilo
[15:49:20] anyone has been able to use it lately?
[15:51:47] hmpf... keep getting errors like `│ ts=2025-09-10T15:51:21.827061407Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-tools-memberlist: lookup loki-tools-memberlist on 10.96.0.10:53: read udp 192.168.211.217:51318->10.96.0.10:53: i/o timeout"` on loki-tools
[15:55:27] hmm, kube-dns is failing `│ [ERROR] plugin/errors: 2 loki-tools-memberlist. A: read udp 192.168.211.244:44714->172.18.0.1:53: i/o timeout`
[16:06:16] thanks for the tofu-cloudvps release taavi <3
[16:18:39] this looks suspicious `loki loki-tools-0 loki level=info ts=2025-09-10T16:16:37.075520977Z caller=table_manager.go:136 index-store=tsdb-2024-04-01 msg="uploading tables"`, 2024-04-01? (april fool's day? xd, loki is playing with me)
[16:21:42] hmpf... I'm rebuilding my lima-kilo from scratch, failing to make loki work after it started failing
[16:42:36] the dns errors happen also on a fresh lima-kilo install :/
[16:45:38] and alloy also fails 🤦‍♂️
[16:45:41] `alloy alloy-5btwq alloy ts=2025-09-10T16:45:23.481523511Z level=warn msg="error sending batch, will retry" component_path=/ component_id=loki.write.loki_tools component=client host=loki-tools.loki.svc.cluster.local:3100 status=-1 tenant=tool-tf-test error="Post \"http://loki-tools.loki.svc.cluster.local:3100/loki/api/v1/push\": context deadline exceeded"`
[16:45:57] minio is erroring too, though the pods don't get restarted
[16:46:00] I'll open a task
[16:46:18] anyone rebuilt lima-kilo lately? Anyone seeing those errors?
[16:50:12] created T404226 to followup, this is blocking me from running the functional tests
[16:50:12] T404226: [logging,lima-kilo] loki setup failst to start on linux - https://phabricator.wikimedia.org/T404226
[16:50:32] Raymond_Ndibe: if you have time, can you try that too? see if it happens on mac also?
[18:12:31] * dcaro off
[18:13:47] alloy won for today... `│ ts=2025-09-10T17:01:24.485001373Z level=error msg="final error sending batch" component_path=/ component_id=loki.write.loki_tools component=client host=loki-tools.loki.svc.cluster.local:3100 status=-1 tenant=tool-tf-test error="Post \"http://loki-tools.loki.svc.cluster.local:3100/loki/api/v1/push\": context deadline exceeded"` final error until next time
[18:13:56] cya!
[18:46:42] win 20
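
[editor's note: a hedged debugging sketch for the still-open T404226, checking the two failures seen above (the in-cluster DNS timeouts and the failing pushes to Loki); the service name, namespace and port are taken from the log lines, while the container images and the use of Loki's /ready endpoint are assumptions:]

    # 1. Can cluster DNS resolve the Loki service at all?
    kubectl -n loki run dnscheck -it --rm --restart=Never --image=busybox:1.36 -- \
        nslookup loki-tools.loki.svc.cluster.local
    # 2. If DNS works, is Loki itself ready on the port alloy pushes to?
    kubectl -n loki run lokicheck -it --rm --restart=Never --image=curlimages/curl -- \
        curl -sv http://loki-tools.loki.svc.cluster.local:3100/ready
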