[07:21:20] morning
[07:56:47] morning
[07:57:08] morning
[09:10:18] anyone playing with toolsbeta-prometheus-1?
[09:10:27] it went down (alert)
[09:10:27] not me
[09:10:36] I can look in a bit
[09:11:07] let me do a quick check firsh
[09:11:09] *first
[09:12:14] hmm, ssh timed out, console seems unresponsive, maybe having memory/load issues
[09:12:29] yep, just very unresponsive, not dead
[09:13:58] I'll do a snapshot of the current processes, and reboot the VM
[09:15:29] uuhh, me likey the project alerts section in the grafana cloudvps board
[09:15:31] https://usercontent.irccloud-cdn.com/file/Hc99p63Y/image.png
[09:16:05] hmm, that VM is periodically struggling
[09:18:00] what happens every 2h? will check
[09:19:38] * dcaro missing a cloudvps.reboot_vm --force cookbook
[09:25:55] it has a bunch of processes in D state, I think that might be when prometheus is compacting metrics or similar
[09:26:44] and it's using all its RAM (2G only)
[09:31:31] sssd-pam is also failing there
[09:31:31] Oct 27 09:29:52 toolsbeta-prometheus-1 systemd[1]: sssd-pam.service: Main process exited, code=exited, status=70/SOFTWARE
[09:31:49] (now at least, I was able to log in, seeing that on an ssh shell)
[09:32:13] Oct 27 09:31:44 toolsbeta-prometheus-1 sssd[515]: Child [608] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
[09:32:36] I think it might need more resources :/ (at least, it would help with debugging too)
[10:13:05] hmm, just realized that the freshly reimaged cloudvirt-wdqs1001 is only using a fraction of the available space for the nova instances LV. I'll extend it a bit to get a canary to schedule there, as I'm not sure if it's supposed to use the whole VG or if I should leave some buffer space there just in case
[10:32:18] I think it's ok to use all the space, I don't see why having some spare space would be better than using all of it at once
[10:32:47] (unless backups or such, but if so, we would be using the space, and it would not be spare)
[12:31:09] tools-nfs-2 is almost out of disk space: T349895
[12:31:10] T349895: tools-nfs-2 almost out of disk space (October 2023 edition) - https://phabricator.wikimedia.org/T349895
[12:39:56] oh, there's no alert?
[12:40:47] yeah, looks like that was lost on the virtual nfs move
[12:41:13] I'm running a script to find the largest tool directories
[13:28:29] I think I might've found the culprit :D T349904
[13:28:30] T349904: 'fixsuggesterbot' tool cache directory is very large - https://phabricator.wikimedia.org/T349904
[13:29:38] wow... feels like a failing cleanup function xd
[13:30:06] (or they thought they'd clean up when the space limit is reached... forgetting that it's a shared storage server)
[14:36:20] taavi: sorry I missed you yesterday, did you figure out about getting new canaries on the wdqs boxes?
[14:39:47] andrewbogott: yeah, I figured it out in the end: the LVM volume for /var/lib/nova/instances was like 10G by default. I added a few dozen GB more to get the VM to schedule. I was meaning to ask whether you've usually used the full disk for that, or left a few percent of the LVM VG unused to be available in case of an emergency?
[14:41:20] I'm pretty sure it was using all available space before.
[14:42:03] But it would also be fine to leave a few free G for wiggle room... I think the project that required that exact amount of disk space is long finished.
[14:43:26] taavi: is that enough of an answer?
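A minimal sketch of the "use most of the VG, keep a little wiggle room" resize being discussed here, written as a small Python helper in the spirit of a cookbook. The VG/LV names, the 95% target, and the use of lvextend's built-in filesystem resize are illustrative assumptions, not details taken from cloudvirt-wdqs1001.

```python
#!/usr/bin/env python3
"""Sketch: grow the nova-instances LV to ~95% of its VG (assumed names)."""
import subprocess

# Assumptions: VG and LV names are illustrative only; adjust before use.
VG = "tank"
LV = "data"
TARGET = "95%VG"  # leave ~5% of the VG free as wiggle room


def run(cmd: list[str]) -> None:
    """Run a command, echoing it first, and fail loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def main() -> None:
    # lvextend -r (--resizefs) grows the LV and the filesystem on top of it
    # in one step. -l accepts a %VG target; lvextend errors out if the LV is
    # already at or above the requested size, which is a reasonable guard here.
    run(["lvextend", "-r", "-l", TARGET, f"{VG}/{LV}"])
    # Show the result for the admin log.
    run(["lvs", f"{VG}/{LV}"])


if __name__ == "__main__":
    main()
```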
[14:44:06] it is, thanks. I think I'll resize the LV to use 95% or something similar
[14:45:57] sgtm
[14:55:08] I freed up 600G on the tools nfs server by truncating two (2) log files. that gives us enough headroom that I'm comfortable leaving it like this over the weekend
[14:57:00] that's a whole log of log
[15:24:32] I'm not sure if I should be happy or sad that it's this easy to clean up disk space. I just found 150G more worth of log files that I'm going to clean up
[15:36:47] It's a long-running gag with toolforge NFS. The 'right solution' is some kind of managed log service I guess
[15:37:12] dcaro: can you link me to that ceph monitoring/alerting/smart bug? (I think you mentioned that one exists)
[15:37:32] yeah. although ideally I'd rather not store 150G worth of "please create a pywikibot config file" in the first place
[15:40:34] quota on the storage is also a good one
[15:41:01] dcaro: that would mean having a per-tool nfs volume?
[15:41:02] andrewbogott: what do you mean? (my memory is quite bad :/ )
[15:41:10] andrewbogott: or similar yes
[15:41:17] maybe not nfs
[15:41:43] dcaro: I think last week you asked me something about why those disk failures didn't alert, and found some bitrotted metrics that you wanted me to revive.
[15:42:04] ahhhh
[15:43:05] andrewbogott: yes, those metrics are being gathered again, but it's only on the ceph side (ceph device get-health-metrics)
[15:43:18] quota-wise: the half-baked fix would be to write some kind of agent that crawls all the tool dirs, does a 'du' and takes measures: email, then disable tools, etc. etc. A hack, but it would be a simple hack.
[15:43:37] yep, it would be nice to get some metrics about disk usage too
[15:44:22] andrewbogott: for prometheus we will need to export them somehow
[15:44:22] https://phabricator.wikimedia.org/T348716
[15:44:31] We have metrics and dashboards that measure and identify high disk usage per tool, but they're turned off because running an nfs-wide 'du' was messing up performance.
[15:44:37] (the ceph device metrics, as they are not exposed by the ceph prometheus endpoint)
[15:44:38] s/have/had/ :/
[15:45:04] yep, that's a downside of an nfs-wide 'du' xd
[15:45:23] If it were per tool we could throttle it pretty easily.
[15:45:34] Is T349694 meant to say 'enable'?
[15:45:35] T349694: [ceph] Unable disk failure prediciton - https://phabricator.wikimedia.org/T349694
[15:45:58] hahahahah, yes xd
[15:46:59] ok, makes a lot more sense that way!
[15:49:25] I think it's time for me to log off
[15:49:34] cya on monday!
[15:49:47] * andrewbogott waves
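The per-tool disk-usage idea comes up twice above: the one-off "largest tool directories" script and the half-baked crawler agent that could be throttled per tool. Below is a rough Python sketch of what that could look like. The /data/project path, the 100 GiB threshold, the sleep-based throttle, and the report-only behaviour are assumptions for illustration; the real agent would need the email/disable escalation bolted on.

```python
#!/usr/bin/env python3
"""Sketch: crawl tool home directories and report the largest ones.

Assumptions: tool homes live under /data/project (illustrative), GNU `du` is
available, and anything over THRESHOLD_GIB is worth flagging. A real agent
would also notify the maintainers or disable the tool, not just print.
"""
import subprocess
import time
from pathlib import Path

TOOL_ROOT = Path("/data/project")   # assumed NFS mount point for tool homes
THRESHOLD_GIB = 100                 # assumed per-tool soft limit
SLEEP_BETWEEN_TOOLS = 5             # crude throttle so the NFS server isn't hammered


def tool_usage_gib(tool_dir: Path) -> float:
    """Return the size of one tool directory in GiB, using `du -s`."""
    out = subprocess.run(
        ["du", "-s", "--block-size=1", str(tool_dir)],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.split()[0]) / 2**30


def main() -> None:
    usage = {}
    for tool_dir in sorted(p for p in TOOL_ROOT.iterdir() if p.is_dir()):
        try:
            usage[tool_dir.name] = tool_usage_gib(tool_dir)
        except subprocess.CalledProcessError:
            continue  # permission or transient NFS errors: skip, don't die
        time.sleep(SLEEP_BETWEEN_TOOLS)

    for tool, gib in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        flag = "OVER SOFT LIMIT" if gib > THRESHOLD_GIB else ""
        print(f"{gib:10.1f} GiB  {tool}  {flag}")


if __name__ == "__main__":
    main()
```

Given that 600G came back from truncating two log files, the report alone would likely have surfaced the same offenders.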
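On the ceph health-metrics point ("gathered again, but only on the ceph side... for prometheus we will need to export them somehow"), one possible bridge is a node-exporter textfile collector. This is a hedged sketch only: it assumes `ceph device ls --format json` returns a list of objects with a `devid` field, that `ceph device get-health-metrics` returns a JSON object keyed by sample timestamp, and that the node exporter reads *.prom files from the chosen directory. It only exports how many samples each device has, not the SMART attributes themselves.

```python
#!/usr/bin/env python3
"""Sketch: expose `ceph device get-health-metrics` sample counts via the
node-exporter textfile collector.

Assumptions (not verified against the cluster discussed above):
  * `ceph device ls --format json` yields a list of {"devid": ...} objects
  * `ceph device get-health-metrics <devid> --format json` yields a dict
    keyed by sample timestamp
  * the node exporter is configured to read *.prom files from TEXTFILE_DIR
"""
import json
import subprocess
from pathlib import Path

TEXTFILE_DIR = Path("/var/lib/prometheus/node.d")  # assumed collector directory


def ceph_json(*args: str):
    """Run a ceph subcommand with JSON output and parse it."""
    out = subprocess.run(
        ["ceph", *args, "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


def main() -> None:
    lines = [
        "# HELP ceph_device_health_samples Number of stored health samples per device",
        "# TYPE ceph_device_health_samples gauge",
    ]
    for dev in ceph_json("device", "ls"):
        devid = dev["devid"]
        samples = ceph_json("device", "get-health-metrics", devid)
        lines.append(f'ceph_device_health_samples{{devid="{devid}"}} {len(samples)}')

    # Write atomically so the node exporter never sees a half-written file.
    tmp = TEXTFILE_DIR / "ceph_device_health.prom.tmp"
    tmp.write_text("\n".join(lines) + "\n")
    tmp.rename(TEXTFILE_DIR / "ceph_device_health.prom")


if __name__ == "__main__":
    main()
```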