[10:05:19] I'm having an issue with my Cloud VPS, project `procbot`. The instances are stuck in the error state, with error message "libvirtError". If I try to stop/start/reboot them, I get error: "You are not allowed to soft reboot instance: procbot-k8s-b-5l2c6rwuicm2-master-0" (etc)
[10:26:34] that sounds weird :/, can you open a task with the details to follow up/paste logs and such?
[10:32:01] will do. in what phab project should I file it?
[10:34:44] you can use cloud-vps
[10:34:55] and your procbot project
[10:38:31] oh, your VMs are magnum-managed?
[10:39:23] yea, magnum-managed
[10:39:44] from the logs it seems they failed to connect to the ceph cluster, looking
[10:43:42] For some context, if it helps: This issue happened after I tried to reboot the nodes to fix an issue where the disk pressure of my nodes seemingly becomes unavailable. This happens every few months; a few pods in the `kube-system` namespace start deathrolling and need some reboots of nodes to resolve. It's mainly the CSI cinder plugin pods initially which start failing, which I think are the root cause. After that, more pods
[10:43:42] (incl my workload pods) start failing / cannot be scheduled due to disk pressure constraints. I've not managed to figure out why this happens. On this occasion, there were CSI/disk-pressure errors again and the logs of the CSI plugin pods had lots of `W0113 00:47:18.956363 1 connection.go:173] Still connecting to unix:///csi/csi.sock`
[10:46:31] from the libvirt side, it seems it was using the old ceph mon ips, stopped and started (to recreate the libvirt process), let's see if it comes up at least
[10:51:14] it seems libvirtd is misbehaving in cloudvirt1058 :/, gtg for a bit, maybe dhinus can help around
[11:12:01] I can have a look
[11:39:45] maybe related: "instance tools-k8s-worker-nfs-7 is down", now up again
[11:53:08] that node is on a different hypervisor so probably not related
[11:53:27] I looked at the logs in cloudvirt1058 and I'm not seeing anything strange
[12:01:42] both instances in the "procbot" project are now in status "Shutoff". I'll try restarting them from Horizon.
[12:05:34] that didn't work, in Horizon it's now stuck in "powering on"
[12:12:31] "virsh list" shows State: paused for that vm. I tried "resume" and "shutdown" and they both failed
[12:12:37] error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)
[12:34:47] I'm tempted to drain the whole hypervisor and reboot it, but I'll wait for a.ndrew to be online in case he's seen this before
[12:34:59] yep, I had the same instinc
[12:35:15] I don't find any past occurrences of that error in phab
[12:35:21] *instinct
[12:37:59] proc: if you could create that phab task describing your observed behavior, it would be really helpful
[12:46:02] just checked, I don't see any more VMs in a similar error status, only one canary, but seems to be for a different reason (bad secret uuid at the libvirt level, not related to ceph/rbd)
[13:32:54] hmmm that vm just disappeared from "virsh list"
[13:33:05] but horizon says it's still in cloudvirt1058 :/
[13:35:55] :S
[13:45:49] doing now dhinus. I have a large dump of cluster state I took yesterday with `kubectl cluster-info dump`. apparently too large to upload to phab (22MB). is there anywhere else I should upload it? (alternatively, if you can access my cloud vps instance, it's the file titled `dump` on my bastion instance)
[13:55:10] https://phabricator.wikimedia.org/T383560
[13:56:15] proc: thank you! the filename of the dump should be enough. you could also try storing it in an object-storage bucket in your project
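A minimal sketch of that object-storage route, assuming the `openstack` CLI is installed on the bastion and the project's credentials are loaded; the container name is only an example, and the user guide linked below has the supported details:

```
# create a container (bucket) in the project's object store and upload the dump
openstack container create procbot-dumps
openstack object create procbot-dumps dump
# confirm it landed
openstack object list procbot-dumps
```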
[13:56:49] https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide
[15:13:22] dhinus: you can try just migrating those specific VMs, unless you tried that already
[15:18:43] oh, nm, i see that in the logs, already tried
[15:25:35] I haven't actually, I just tried resuming/restarting
[15:26:17] but I don't even see them anymore in "virsh list"
[15:26:38] this one for example: https://horizon.wikimedia.org/project/instances/c8cdb44d-fcf0-44e9-b5c1-b59eda68f7fd/
[15:26:55] "sudo virsh dumpxml i-0009fb9b" returns "error: failed to get domain 'i-0009fb9b'"
[15:26:59] it did work earlier today
[15:28:07] virsh only shows things that are running, so that's consistent with them being in shutdown state
[15:28:23] ah right, my bad
[15:28:31] I'm worried about the ceph mon name mismatch, though, is that going to be true for /every/ VM now that we replaced the mons?
[15:29:15] anyway, we can discuss this in the call :)
[15:29:22] yep
[15:30:47] um... give me 2 minutes, still eating breakfast
[15:31:28] andrewbogott: I was wondering the same, that's why I checked everywhere, but did not find any other VMs having issues
[15:32:06] weird
[15:46:43] The new VM (ctt-qa-03) is looking good so far but I am going to give it a week before I shut down parsing-qa-02.
[16:20:34] bd808: I'm getting a crawler bot hammering my IABot webservice with a generic UA. Since IPs are getting filtered, I can't impose an actual block to reduce traffic.
[16:21:09] Any chance you can help with that?
[16:27:51] probably best if you file a task
[16:28:17] Well, it seems to have slowed down for now, so I'll just observe.
[16:33:03] proc, can you check if things are working better now?
[16:42:49] I wonder if we should turn on some global rate limiting for the *.wmcloud.org sites like we have for *.toolforge.org?
[16:45:41] we already have that iirc
[16:52:15] There does seem to be some config. It is not the same as the toolforge config which makes me wonder if we have ever tuned/tested the non-toolforge version.
[16:54:27] oh, maybe it does have the same config after all. Just some confusion in my reading of puppet files
[16:57:51] not exactly the same, but similar. wmcloud has `limit_req_zone $binary_remote_addr zone=cloudvps:10m rate=100r/s;` and toolforge has `limit_req_zone $binary_remote_addr zone=toolforge:10m rate=50r/s;`. So I guess we allow double the burst traffic outside of toolforge.
[17:33:00] bd808: barring the fact that I can't see actual IPs, any chance cloud users could be given the option to block known crawlers?
[17:34:46] Skynet: I think we have config to block a given crawler from all proxies, if you open a ticket with details (ip and user agent string, at least) and assign to me I can work on that.
[17:35:32] I think if you just grab all the headers they'll have what I need
[17:35:56] andrewbogott: The UA is Chrome browser, but I can't actually see the IPs. It's being sanitized.
[17:36:18] just include what you can find
[17:41:36] andrewbogott: https://phabricator.wikimedia.org/T383592
[18:02:12] any chance you can log more of the headers on your end? Or is that really all you get?
[18:03:52] andrewbogott: that's probably mostly all Skynet can see that would also be findable in the front proxy (the path and the user-agent).
[18:04:15] It's all I got.
[18:04:45] the tuple of request time, path, and UA along with vhost name should match up pretty well.
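A rough sketch of that kind of matching against a proxy access log; the log path and field layout here are assumptions rather than the actual front-proxy configuration, so adjust before trusting the numbers:

```
# count requests per client IP for one vhost and user agent,
# assuming the client address is the first field of each log line
grep 'iabot.wmcloud.org' /var/log/nginx/access.log \
  | grep 'Chrome' \
  | awk '{print $1}' \
  | sort | uniq -c | sort -rn | head -20
```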
[18:04:54] The UA is practically the only thing I can work with.
[18:05:47] I mean I can program the UI to connect the UAs to user accounts, but I'm pretty sure that would draw in some unneeded controversy.
[18:06:00] bd808: so you're saying I would block all requests to that vhost with that ua? That seems like a wide net
[18:06:02] There is an opt-in to pass the requesting IP to the Cloud VPS project. That is used by xtools and a couple of other projects where it was deemed critical for access control.
[18:06:14] (Not that it would be of any help here)
[18:06:36] andrewbogott: no, that you would look for the matching time, UA, and vhost to find the real IP
[18:06:47] oh I see, ok
[18:07:17] (I'm in the midst of a mild unrelated cloud-vps crisis so won't actually be implementing anything for a bit)
[18:07:34] time, UA, vhost, and path probably. Or just request count per IP to the vhost and look for wildness
[18:07:39] bd808: I think I couldn't establish a case for critical access. I'm not realistically using IPs for any users' benefit here. It would only help with implementing IP blocks on undesirable crawlers.
[18:08:22] Skynet: another "fix" for crappy crawlers is to put everything remotely expensive behind an auth block
[18:08:34] Oh it is.
[18:08:55] hmmm... and that's not enough?
[18:08:56] Still brought down nginx at some point.
[18:09:25] Not in this case it seems.
[18:09:46] like nginx crashed or ran out of inodes or ??
[18:10:07] Yea.
[18:10:19] All it was doing was a 502 on every request
[18:11:04] That would mean the front proxy lost contact with your backend.
[18:11:34] I was still seeing access.log and error.log requests streaming in.
[18:12:03] is your backend also nginx as a reverse proxy to something?
[18:12:46] Honestly it didn't seem like it was excessive, but I didn't see anything else causing the issue.
[18:13:20] Nope. It's not a reverse proxy. It's just straight up the web service to iabot.wmcloud.org
[18:13:38] Only RP is the one configured by Horizon
[18:18:30] > Honestly it didn't seem like it was excessive, but I didn't see anything else causing the issue -- so you had a bug where the front proxy failed to talk to the backend and you reported it as excessive bot scraping without any actual evidence?
[18:21:44] The only thing happening was a burst of requests coming in when I was looking at the access logs.
[18:22:04] And that nginx was erroring on every request coming in.
[18:22:59] If you were seeing requests at the backend then the front proxy's 503 responses mean that your backend's replies were not getting back to the frontend.
[18:23:04] I do know for a fact there is a pesky crawler
[18:23:15] 502
[18:23:26] which may mean the backend was crashing and may mean something else entirely
[18:23:54] I figured the high load caused the crash?
[18:24:46] There's nothing else indicating a problem. Nginx just went down.
[18:25:59] Skynet: but you said that you don't have nginx in your backend stack, so what you are really saying is that your backend stopped talking to the shared nginx front end. That did not "go down" in the sense that the shared nginx did not crash.
[18:26:32] I use nginx as the web service backend. Not as a reverse proxy
[18:27:10] you run PHP inline in nginx?
[18:27:24] I did not know that was possible
[18:28:03] You use it with php-fpm
[18:28:16] as a reverse proxy
[18:29:14] Correct me if I'm wrong, but a reverse proxy simply directs traffic to another web service based on a set of criteria?
[18:31:01] nginx proxies inbound requests to php-fpm over the fcgi protocol rather than using http, but nginx is acting as a proxy for the php-fpm process.
[18:32:11] I never saw it that way. I see php-fpm as simply a php processor, acting as a component of nginx.
[18:33:31] Anyway, the stack here is nginx shared frontproxy -> your host -> nginx -> php-fpm then if I understand the explanation so far. And you were seeing HTTP 502 "Bad Gateway" status responses at which nginx layer? The shared frontproxy? Or your local nginx?
[18:34:14] Mine
[18:35:11] That would indicate that php-fpm crashed or at least hung in some manner then I think. Does that sound right?
[18:35:53] Yea. Now that I'm looking at it more closely. The socket to php-fpm was gone
[18:36:10] So php-fpm crashed out.
[18:36:48] and you likely had a burst of retries because folks were getting error responses
[18:38:28] Yea, I'm digging in the php-fpm log.
[18:38:59] Still would be nice to nuke the crawler though
[18:40:22] Okay, constant warnings about being out of children.
[18:42:15] That's weird, my pm.start_servers value got reset to 2. How did that happen? I had that set to 100
[18:44:48] my whole www.conf got reset to default. Not sure how that came to happen
[18:45:20] package update gone a bit wonky?
[18:45:53] Maybe. I wouldn't have thought to look there if I hadn't bounced thoughts/requests off you. :-)
[18:50:09] Nothing in the log indicating why it crashed though
[18:50:31] I guess I'll keep a watch over it
[19:24:15] andrewbogott: kube is back up and the pod was scheduled. though, some things in the kube-system namespace still look unhealthy:
[19:24:21] https://www.irccloud.com/pastebin/FodfvQ5X/
[19:24:37] logs of the CSI cinder plugin:
[19:24:40] https://www.irccloud.com/pastebin/OCAjDigm/
[19:25:23] I don't know much about the CSI cinder plugin, is that something you can configure/reconfigure?
[19:26:42] unsure. it came with the magnum setup
[19:26:59] I can try deleting the pods and let kube spin up new ones, see if that fixes it...
[19:27:40] that's where I'd start
[19:28:12] btw, what was the issue with the CSI stuff / the compute instances not starting up before?
[19:28:30] The short explanation for what happened with the VMs is: they were trying to talk to no-longer-existing ceph mons. more details than you want at T383583
[19:28:30] T383583: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583
[19:30:13] oh I see, quite unlucky I was the first to run into it then lol
[19:30:37] I wonder then what the cause of my bi-monthly crashes is, as I guess that's probably not related to this second issue
[19:31:30] yeah, likely unrelated
[19:36:45] looks like also some network issues. pods can't seem to talk to wikipedia (maybe safe to assume they can't talk to the internet): `/usr/local/lib/ruby/2.7.0/net/http.rb:960:in `initialize': Failed to open TCP connection to en.wikipedia.org:443 (getaddrinfo: Try again) (Faraday::ConnectionFailed)`
[19:40:44] is there precious state in the magnum cluster or can you just regenerate it? < definitely not an explanation but possibly the fastest way to get you back up and running
[19:42:05] I think I saved all the stuff needed to recreate the cluster as scripts on the bastion, so I should be able to recreate
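Given the `getaddrinfo: Try again` failure above, a quick in-cluster DNS check is one way to narrow things down before rebuilding anything; a sketch using a throwaway busybox pod (the pod name and image are just examples, and `k8s-app=kube-dns` is the usual CoreDNS label, which may differ in a Magnum cluster):

```
# show the resolver config a pod sees and try resolving from inside the cluster
kubectl run dns-test --rm -it --restart=Never --image=busybox -- \
    sh -c 'cat /etc/resolv.conf; nslookup en.wikipedia.org'
# check the state of the cluster DNS pods themselves
kubectl -n kube-system get pods -l k8s-app=kube-dns
```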
[19:42:29] may well be that a fresh start is a good idea lol
[19:43:42] that failure looks more like a dns issue than a full-blown network issue but it could be either
[19:43:56] unless your cluster is very old it should have a correct dns config
[19:57:39] speaking of network issues...
[19:58:54] uhoh, you aren't going to bring up T374830 are you?
[20:00:23] someone on the Telegram side asked “Toolforge down?” and given that the bridge stopped bridging and I can’t connect to it either (whether via web or SSH) I think something might be up indeed :/
[20:00:39] crap
[20:00:59] yeah can't connect to either of my webservices or login.toolforge.org
[20:02:44] hm one of my tools eventually responded (but without stylesheets)
[20:02:52] so it’s not 100% down (but possibly 99%)
[20:02:59] looks like nfs
[20:03:18] dcaro: if this ping reaches you... I think we're having nfs issues on toolforge
[20:04:40] * dcaro going to a laptop
[20:06:22] I don't see anything going on on ceph side
[20:06:39] seems to work right now
[20:06:49] I also couldn't SSH a few minutes ago
[20:07:19] !log tools restarted nfs server on tools-nfs, rebooting tools-bastion-12 and -13
[20:07:19] something happened for sure
[20:07:22] https://usercontent.irccloud-cdn.com/file/N2N92lQC/image.png
[20:07:30] number of stuck processes in k8s workers
[20:07:46] ~19:50 UTC
[20:08:17] Guess I'll reboot some more workers, it's been a few days
[20:08:27] dcaro: mind re-linking me to that grafana page?
[20:09:07] someone rebooted the bastion?
[20:09:08] https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools
[20:09:12] *anyone
[20:10:50] yeah, I did
[20:10:56] and I logged it except the log bot doesn't work :)
[20:11:02] “we probably won’t reboot the bastions that soon” – me, foolishly, 8 days ago :D
[20:11:59] andrewbogott: ack
[20:12:24] there's no network throughput issues either (on the switches/osd nodes at least)
[20:13:23] I live-migrated a bunch of things for T383583, maybe including the nfs server (ugh) let me check
[20:13:39] become: no such tool 'stashbot'
[20:13:39] o_O
[20:13:45] guess I won’t restart it yet then
[20:13:47] yes
[20:13:53] did anyone reboot the nfs server? (tools-nfs-2)
[20:14:02] dcaro: so, likely that's the cause
[20:14:12] because that might cause the nfs issues xd
[20:14:19] https://www.irccloud.com/pastebin/1Hnd3k8z/
[20:14:20] not reboot, but live migrated
[20:14:31] well....
[20:14:34] I /told/ it to make it live
[20:14:40] but anyway, yes, that's clearly the cause
[20:15:03] okok, good, so that's not going to happen while we fix stuff xd
[20:15:08] so let's fix stuff
[20:15:26] lucaswerkmeister: looking, that was from the bastion I guess?
[20:15:59] yeah, bastion-13
[20:16:14] ack
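For reference, the kind of check behind that "stuck processes" graph, runnable on a bastion or worker; `/data/project` is one of the Toolforge NFS mounts discussed here, and the 5-second timeout is only a guard so the probe itself cannot hang:

```
# count processes in uninterruptible sleep (state D), usually blocked on NFS I/O
ps -eo state,comm | awk '$1 == "D" {print $2}' | sort | uniq -c | sort -rn
# probe an NFS mount without letting the shell get stuck
timeout 5 ls /data/project >/dev/null && echo "nfs responds" || echo "nfs stuck"
```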
[20:16:16] I’m fine with waiting if you’re still working on it, just wanted to see if I could fix that already
[20:16:28] (and was surprised by the failure mode)
[20:16:55] that's ok, ldap seems unrelated to the nfs issues
[20:17:03] so might be something else going on
[20:18:06] My guess is that the live migration command I ran on a few dozen VMs was not actually live, and those unexpected reboots are the cause of everything
[20:18:11] although no idea why that would include ldap :/
[20:19:25] it's weird yep, as I was able to login with my user (and others), and I can sudo -u to the tools
[20:19:28] so maybe not ldap?
[20:19:46] oh, wait, now become works too :/
[20:19:53] lucaswerkmeister: can you try now? ^
[20:20:32] $home wasn't mounted properly on the bastions until just now
[20:20:44] or... maybe still isn't
[20:21:02] dcaro: still the same (in the same bash, i.e. I haven’t re logged in)
[20:21:10] ack
[20:21:14] hmm, weird
[20:21:15] `groups` shows all my groups though
[20:21:20] (and did a few minutes ago too)
[20:21:26] interesting
[20:21:42] puppet keeps saying
[20:21:44] https://www.irccloud.com/pastebin/aRzwIt9y/
[20:21:49] `if ! id "$prefix.$tool" >/dev/null 2>&1 || ! [ -d "/data/project/$tool" ]; then`
[20:21:57] still the same from a new session
[20:22:02] yep, it seems `become` looks for the home dir too
[20:22:10] so if that did not work it would not get there
[20:22:32] andrewbogott: yep, that looks like it, I see lots of `intr is deprecated` logs in `dmesg` from nfs
[20:23:08] andrewbogott: it's getting no route to host
[20:23:31] https://www.irccloud.com/pastebin/zyT8wfe5/
[20:24:17] oh, maybe the ip is not set on the vm? (got lost on the migration?)
[20:24:29] (on the nfs vm I mean)
[20:26:02] so puppet was broken and I suspect that meant that some part of the nfs server didn't get set up or updated...
[20:26:17] Notice: /Stage[main]/Profile::Wmcs::Nfs::Standalone/Interface::Ip[nfs-service-ip]/Exec[ip addr add 172.16.7.14/32 dev ens3]/returns: executed successfully (corrective)
[20:26:18] Notice: /Stage[main]/Cloudnfs::Fileserver::Exports/Systemd::Service[nfs-exportd]/Service[nfs-exportd]/ensure: ensure changed 'stopped' to 'running' (corrective)
[20:26:30] yep, that sounds likely
[20:26:40] but the bastions still won't mount
[20:26:58] I don't see the ip on the nfs vm though
[20:27:15] oh, now I see it
[20:27:46] and now it's working \o/
[20:28:06] lucaswerkmeister: can you test again? (thanks in advance xd)
[20:28:40] what, what's working? I still see mount failures on the bastions
[20:28:53] is this limited to toolforge or cloud vps in general?
[20:29:00] just toolforge
[20:29:08] my tools are working again and I can ssh to bastion-13
[20:29:10] ah. funny coincidence then lol
[20:29:20] andrewbogott: I can `become stashbot` on the bastion-13
[20:29:20] dcaro: works!
[20:29:54] well, working for some definition of working, seems like tools-static is still having issues
[20:29:56] dcaro: I'm not convinced :/
[20:30:20] and everything else is slow to load
[20:30:23] !log lucaswerkmeister@tools-bastion-13 tools.stashbot bin/stashbot.sh restart
[20:30:43] AntiComposite: yep, workers/other VMs might still have issues (need restarting/remounting), looking
[20:30:55] /home and /data/project are still wrong on tools-bastion-12
[20:31:04] andrewbogott: ack
[20:31:16] let me remount stuff there
[20:31:21] oh, but they're right on -13!
[20:31:32] so that is indeed progress
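A sketch of the checks implied above, with the service IP and interface taken from the puppet output; the remount step assumes fstab-managed client mounts, which may not match how these hosts are actually configured:

```
# on the NFS server: confirm the service IP came back and exports are published
ip addr show dev ens3 | grep 172.16.7.14
sudo exportfs -v | head
# from a client (e.g. a bastion): check the server answers, then force a remount
showmount -e 172.16.7.14
sudo umount -fl /data/project && sudo mount /data/project
```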
[20:32:25] !log lucaswerkmeister@tools-bastion-13 tools.sal webservice restart
[20:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[20:32:43] I think it might be the way the newer kernel handles nfs mounts :/, on -12 there's a lot of stuck mounts as D, andrewbogott can you force-restart it?
[20:32:45] !log lucaswerkmeister@tools-bastion-13 tools.stashbot bin/stashbot.sh restart
[20:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[20:33:02] ok looks like stashbot is back alive for logging the repair work
[20:33:09] dcaro: sorry, force restart what?
[20:33:11] the whole host?
[20:33:34] the -12 bastion
[20:33:38] yep, ok
[20:33:42] there might be some ip6 issues too?
[20:33:45] Jan 13 20:32:53 tools-bastion-12 sshd[905]: error: connect to 2a00:1450:4001:81c::2002 port 443 failed: Network is unreachable
[20:34:08] (low prio but tools-sgebastion-10 probably also needs a reboot Eventually™)
[20:34:11] done
[20:34:25] lucaswerkmeister: done also
[20:34:37] dcaro: so now we're just rebooting k8s nfs clients? Or are there more missing pieces?
[20:34:48] static might also be in trouble
[20:35:13] tools-bastion-12 is up and running :)
[20:35:24] shall I reboot static?
[20:35:54] lucaswerkmeister: tools-sgebastion-10 seems to be ok for me
[20:36:04] !log tools restore root-owned /tmp/framer.txt on tools-sgebastion-10, tools-bastion-12, tools-bastion-13 (cf. 2025-01-05 log entry) following bastion reboots
[20:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:36:17] dcaro: I was unable to SSH into it until the reboot
[20:36:24] andrewbogott: looking
[20:36:49] andrewbogott: thanks!
[20:37:10] andrewbogott: yep, let's reboot, it's back connected to nfs but nginx got already stuck
[20:37:37] done
[20:37:41] thanks
[20:38:28] I’m restarting bridgebot, there may be a minor flood of catching-up messages from telegram in a second
[20:38:35] !log lucaswerkmeister@tools-bastion-13 tools.bridgebot toolforge jobs restart bridgebot
[20:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[20:38:40] dcaro: I am also restarting select nfs workers
[20:39:16] andrewbogott: thanks, yep that should get stuff unstuck
[20:39:30] I think maintain-kubeusers might have gotten stuck too (it's failing to report prometheus metrics)
[20:40:11] oh, back online :)
[20:40:52] redis might be having issues
[20:41:06] https://usercontent.irccloud-cdn.com/file/U5ZqTSMo/image.png
[20:43:05] I am tracking with T383625
[20:43:06] T383625: tools nfs outage - https://phabricator.wikimedia.org/T383625
[20:43:15] andrewbogott: thanks, I'll add stuff there
[20:48:26] !log anticomposite@tools-bastion-13 tools.stewardbots SULWatcher/manage.sh restart # SULWatchers disconnected
[20:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[20:48:47] dcaro, alert manager seems happy, are we good outside of more worker reboots?
[20:49:07] If so then I'll send an email and you and dhinus can go back to your dinner
[20:49:44] !log tools restart prometheus to pick up the new ips for vms and such
[20:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:50:03] andrewbogott: should be ok yep, things are coming back online
[20:50:13] ok. Sorry for the racket :(
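For the remaining stuck workers, the usual pattern is drain, reboot, uncordon, one node at a time. A sketch, where the node name is only an example taken from earlier in the log, and `--delete-emptydir-data` may be spelled `--delete-local-data` on older kubectl versions:

```
# move workloads off the stuck worker (daemonset pods stay put)
kubectl drain tools-k8s-worker-nfs-7 --ignore-daemonsets --delete-emptydir-data
# reboot the VM (via Horizon or on the node itself), then let it take pods again
kubectl uncordon tools-k8s-worker-nfs-7
```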
[20:50:30] Jan 13 20:49:56 metricsinfra-prometheus-2 prometheus@cloud[2107623]: ts=2025-01-13T20:49:56.821Z caller=refresh.go:80 level=error component="discovery manager scrape" discovery=openstack config=tf-infra-test_node msg="Unable to refresh target groups" err="could not authenticate to OpenStack: Authentication failed"
[20:50:31] thanks for poking the bridge lucaswerkmeister
[20:50:33] interesting
[20:51:52] hmm, `msg="Error sending alert" err="Post \"http://metricsinfra-alertmanager-3.metricsinfra.eqiad1.wikimedia.cloud:9093/api/v2/alerts\": context deadline exceeded"`
[20:52:01] Toolforge is gradually coming back to life (I just restarted the bridge, hence the above messages – there should be more but they might be lost), but it’s not fully healthy yet
[20:53:43] no probs andrewbogott, you managed to time this perfectly between my dinner and my bedtime :)
[20:55:48] all my tools seem to be alive again \o/
[20:57:27] now I can see the 'nfs too many workers have processes in d state' alert xd
[20:57:54] lucaswerkmeister: \o/
[20:58:16] "no longer so broken that I can't see it's broken"
[20:58:48] yep :)
[20:59:09] :D
[21:06:47] andrewbogott: I think all that's left is to reboot the workers that are stuck bit by bit, ping me again if you need any help, otherwise I'll retire for the night :)
[21:07:23] sounds good. thank you for appearing during my hour of need!
[21:07:32] you too lucaswerkmeister and dhinus
[21:13:38] andrewbogott: #hugops
[21:13:57] thx
[22:20:23] AFAICT the quickcategories background runner kept running right through the outage btw
[22:20:28] ✨ the magic of not needing NFS mounted ✨
[22:20:56] (are NFS-less pods scheduled to a separate pool of runners and that’s why the runner didn’t need a reboot?)
[22:26:19] Yep, they are
[22:26:25] nice
[22:26:40] (I think “there are plans” in https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#NFS_and_LDAP needs an update then? ^^)
[22:27:20] also are there really only 6 non-NFS workers or am I using `kubectl sudo get nodes` wrong? (just curious)
[22:27:53] that's what I see on https://k8s-status.toolforge.org/nodes/ too
[22:28:08] oh, good to know that doesn’t need sudo :D
[22:28:40] hm, bridgebot seems to be down again (or possibly lagging)
[22:28:43] unless you're using buildpacks, the default is NFS, so most usage is NFS
[22:29:21] (also that k8s-status link gave me 502 Webservice is unreachable 😬)
[22:29:36] that's likely my fault
[22:30:23] the perf is bad and I tried to load 6 pages simultaneously
[22:30:44] well, combined with the bridgebot problem I’m worrying if it’s a wider issue
[22:30:52] but I guess I'll try restarting bridgebot and see if that helps
[22:31:02] !log lucaswerkmeister@tools-bastion-13 tools.bridgebot toolforge jobs restart bridgebot
[22:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[22:31:15] some other pages on the tool load fine
[22:32:30] ok bridge is mostly working again (one telegram->IRC message still pending)
[22:32:47] bridge test
[22:42:48] Any log alternative on Toolforge that doesn’t use NFS?
[23:11:55] @MaartenDammers: nothing durable, but if your Kubernetes processes write to stderr/stdout it will get captured by Kubernetes until the container has gone away. Those records can be found with `kubectl logs <pod>` or `toolforge jobs logs <job name>`. The jobs wrapper I think only works if your job is not configured to log to disk.
[23:14:11] T97861 is maybe the current feature request for something more awesome? We have talked about this need for years and years of course. There always seemed to be something more urgent in line ahead of it.
[23:14:12] T97861: [toolforge.infra] Provide centralized logging (logstash) for Toolforge - https://phabricator.wikimedia.org/T97861
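A small sketch of that NFS-free logging path, run from the tool account; `mycronjob` is only an example job name, and the records last only as long as Kubernetes keeps the container around:

```
# find the tool's pods and read their captured stdout/stderr
kubectl get pods
kubectl logs <pod-name> --since=1h
# or go through the jobs framework for a job that writes to stdout/stderr
toolforge jobs logs mycronjob
```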
[23:15:34] ah, T127367 is the specifically non-NFS logging for tools task
[23:15:34] T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367
[23:20:30] One of the harder bits in past investigations has been finding a FOSS log aggregator that is also multi-tenant, so that we aren't accidentally leaking secrets that end up in debug logs to everyone with a Toolforge account. https://wikitech.wikimedia.org/wiki/User:Taavi/Loki_notes did seem promising when it was reviewed a couple of years ago.
[23:26:30] I helped implement a central logging infrastructure where Apache Kafka is used in the middle between data producers and consumers. That gives quite some control
[23:27:21] Our production ELK cluster does that. https://wikitech.wikimedia.org/wiki/Logstash
[23:28:23] back in the olden days when I first built the ELK stack out here it was just UDP and hope :)