[07:21:20] morning
[07:56:47] morning
[07:57:08] morning
[09:10:18] anyone playing with toolsbeta-prometheus-1?
[09:10:27] it went down (alert)
[09:10:27] not me
[09:10:36] I can look in a bit
[09:11:07] let me do a quick check firsh
[09:11:09] *first
[09:12:14] hmm, ssh timed out, console seems unresponsive, maybe having memory/load issues
[09:12:29] yep, just very unresponsive, not dead
[09:13:58] I'll do a snapshot of the current processes, and reboot the VM
[09:15:29] uuhh, me likey the project alerts section in the grafana cloudvps board
[09:15:31] https://usercontent.irccloud-cdn.com/file/Hc99p63Y/image.png
[09:16:05] hmm, that VM is periodically struggling
[09:18:00] what happens every 2h? will check
[09:19:38] * dcaro missing a cloudvps.reboot_vm --force cookbook
[09:25:55] it has a bunch of processes in D state, I think that might be when prometheus is compacting metrics or similar
[09:26:44] and it's using all its RAM (2G only)
[09:31:31] sssd-pam is also failing there
[09:31:31] Oct 27 09:29:52 toolsbeta-prometheus-1 systemd[1]: sssd-pam.service: Main process exited, code=exited, status=70/SOFTWARE
[09:31:49] (now at least, I was able to log in, seeing that on an ssh shell)
[09:32:13] Oct 27 09:31:44 toolsbeta-prometheus-1 sssd[515]: Child [608] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
[09:32:36] I think it might need more resources :/ (at least, it would help with debugging too)
[10:13:05] hmm, just realized that the freshly reimaged cloudvirt-wdqs1001 is only using a fraction of the available space for the nova instances LV. I'll extend it a bit to get a canary to schedule there, as I'm not sure if it's supposed to use the whole VG or if I should leave some buffer space there just in case
[10:32:18] I think it's ok to use all the space, I don't see why having some spare space would be better than using all of it at once
[10:32:47] (unless backups or such, but if so, we would be using the space, and it would not be spare)
[12:31:09] tools-nfs-2 is almost out of disk space: T349895
[12:31:10] T349895: tools-nfs-2 almost out of disk space (October 2023 edition) - https://phabricator.wikimedia.org/T349895
[12:39:56] oh, there's no alert?
[12:40:47] yeah, looks like that was lost on the virtual nfs move
[12:41:13] I'm running a script to find the largest tool directories
[13:28:29] I think I might've found the culprit :D T349904
[13:28:30] T349904: 'fixsuggesterbot' tool cache directory is very large - https://phabricator.wikimedia.org/T349904
[13:29:38] wow... feels like a failing cleanup function xd
[13:30:06] (or they thought they'd clean up when the space limit is reached... forgetting that it's a shared storage server)
[14:36:20] taavi: sorry I missed you yesterday, did you figure out about getting new canaries on the wdqs boxes?
[14:39:47] andrewbogott: yeah, I figured it out in the end: the LVM volume for /var/lib/nova/instances was like 10G by default. I added a few dozen GB more to get the VM to schedule. I was meaning to ask whether you've usually used the full disk for that, or left a few percent of the LVM VG unused to be available in case of an emergency?
[14:41:20] I'm pretty sure it was using all available space before.
[14:42:03] But it would also be fine to leave a few free G for wiggle room... I think the project that required that exact amount of disk space is long finished.
[14:43:26] taavi: is that enough of an answer?
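A minimal sketch of the "use most of the VG, keep a little wiggle room" resize being discussed here, written as a small Python helper in the spirit of a cookbook. The VG/LV names, the 95% target, and the use of lvextend's built-in filesystem resize are illustrative assumptions, not details taken from cloudvirt-wdqs1001.

```python
#!/usr/bin/env python3
"""Sketch: grow the nova-instances LV to ~95% of its VG (assumed names)."""
import subprocess

# Assumptions: VG and LV names are illustrative only; adjust before use.
VG = "tank"
LV = "data"
TARGET = "95%VG"  # leave ~5% of the VG free as wiggle room


def run(cmd: list[str]) -> None:
    """Run a command, echoing it first, and fail loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def main() -> None:
    # lvextend -r (--resizefs) grows the LV and the filesystem on top of it
    # in one step. -l accepts a %VG target; lvextend errors out if the LV is
    # already at or above the requested size, which is a reasonable guard here.
    run(["lvextend", "-r", "-l", TARGET, f"{VG}/{LV}"])
    # Show the result for the admin log.
    run(["lvs", f"{VG}/{LV}"])


if __name__ == "__main__":
    main()
```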
[14:44:06] it is, thanks. I think I'll resize the LV to use 95% or something similar
[14:45:57] sgtm
[14:55:08] I freed up 600G on the tools nfs server by truncating two (2) log files. that gives us enough headroom that I'm comfortable leaving it like this over the weekend
[14:57:00] that's a whole log of log
[15:24:32] I'm not sure if I should be happy or sad that it's this easy to clean up disk space. I just found 150G more worth of log files that I'm going to clean up
[15:36:47] It's a long-running gag with toolforge NFS. The 'right solution' is some kind of managed log service I guess
[15:37:12] dcaro: can you link me to that ceph monitoring/alerting/smart bug? (I think you mentioned that one exists)
[15:37:32] yeah. although ideally I'd rather not store 150G worth of "please create a pywikibot config file" in the first place
[15:40:34] quota on the storage is also a good one
[15:41:01] dcaro: that would mean having a per-tool nfs volume?
[15:41:02] andrewbogott: what do you mean? (my memory is quite bad :/ )
[15:41:10] andrewbogott: or similar yes
[15:41:17] maybe not nfs
[15:41:43] dcaro: I think last week you asked me something about why those disk failures didn't alert, and found some bitrotted metrics that you wanted me to revive.
[15:42:04] ahhhh
[15:43:05] andrewbogott: yes, those metrics are being gathered again, but it's only on the ceph side (ceph device get-health-metrics)
[15:43:18] quota-wise: the half-baked fix would be to write some kind of agent that crawls all the tool dirs, does a 'du' and takes measures: email, then disable tools, etc. etc. A hack, but it would be a simple hack.
[15:43:37] yep, it would be nice to get some metrics about disk usage too
[15:44:22] andrewbogott: for prometheus we will need to export them somehow
[15:44:22] https://phabricator.wikimedia.org/T348716
[15:44:31] We have metrics and dashboards that measure and identify high disk usage per tool, but they're turned off because running an nfs-wide 'du' was messing up performance.
[15:44:37] (the ceph device metrics, as they are not exposed by the ceph prometheus endpoint)
[15:44:38] s/have/had/ :/
[15:45:04] yep, that's a downside of an nfs-wide 'du' xd
[15:45:23] If it were per tool we could throttle it pretty easily.
[15:45:34] Is T349694 meant to say 'enable'?
[15:45:35] T349694: [ceph] Unable disk failure prediciton - https://phabricator.wikimedia.org/T349694
[15:45:58] hahahahah, yes xd
[15:46:59] ok, makes a lot more sense that way!
[15:49:25] I think it's time for me to log off
[15:49:34] cya on monday!
[15:49:47] * andrewbogott waves
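The per-tool disk-usage idea comes up twice above: the one-off "largest tool directories" script and the half-baked crawler agent that could be throttled per tool. Below is a rough Python sketch of what that could look like. The /data/project path, the 100 GiB threshold, the sleep-based throttle, and the report-only behaviour are assumptions for illustration; the real agent would need the email/disable escalation bolted on.

```python
#!/usr/bin/env python3
"""Sketch: crawl tool home directories and report the largest ones.

Assumptions: tool homes live under /data/project (illustrative), GNU `du` is
available, and anything over THRESHOLD_GIB is worth flagging. A real agent
would also notify the maintainers or disable the tool, not just print.
"""
import subprocess
import time
from pathlib import Path

TOOL_ROOT = Path("/data/project")   # assumed NFS mount point for tool homes
THRESHOLD_GIB = 100                 # assumed per-tool soft limit
SLEEP_BETWEEN_TOOLS = 5             # crude throttle so the NFS server isn't hammered


def tool_usage_gib(tool_dir: Path) -> float:
    """Return the size of one tool directory in GiB, using `du -s`."""
    out = subprocess.run(
        ["du", "-s", "--block-size=1", str(tool_dir)],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.split()[0]) / 2**30


def main() -> None:
    usage = {}
    for tool_dir in sorted(p for p in TOOL_ROOT.iterdir() if p.is_dir()):
        try:
            usage[tool_dir.name] = tool_usage_gib(tool_dir)
        except subprocess.CalledProcessError:
            continue  # permission or transient NFS errors: skip, don't die
        time.sleep(SLEEP_BETWEEN_TOOLS)

    for tool, gib in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        flag = "OVER SOFT LIMIT" if gib > THRESHOLD_GIB else ""
        print(f"{gib:10.1f} GiB  {tool}  {flag}")


if __name__ == "__main__":
    main()
```

Given that 600G came back from truncating two log files, the report alone would likely have surfaced the same offenders.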
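On the ceph health-metrics point ("gathered again, but only on the ceph side... for prometheus we will need to export them somehow"), one possible bridge is a node-exporter textfile collector. This is a hedged sketch only: it assumes `ceph device ls --format json` returns a list of objects with a `devid` field, that `ceph device get-health-metrics` returns a JSON object keyed by sample timestamp, and that the node exporter reads *.prom files from the chosen directory. It only exports how many samples each device has, not the SMART attributes themselves.

```python
#!/usr/bin/env python3
"""Sketch: expose `ceph device get-health-metrics` sample counts via the
node-exporter textfile collector.

Assumptions (not verified against the cluster discussed above):
  * `ceph device ls --format json` yields a list of {"devid": ...} objects
  * `ceph device get-health-metrics <devid> --format json` yields a dict
    keyed by sample timestamp
  * the node exporter is configured to read *.prom files from TEXTFILE_DIR
"""
import json
import subprocess
from pathlib import Path

TEXTFILE_DIR = Path("/var/lib/prometheus/node.d")  # assumed collector directory


def ceph_json(*args: str):
    """Run a ceph subcommand with JSON output and parse it."""
    out = subprocess.run(
        ["ceph", *args, "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


def main() -> None:
    lines = [
        "# HELP ceph_device_health_samples Number of stored health samples per device",
        "# TYPE ceph_device_health_samples gauge",
    ]
    for dev in ceph_json("device", "ls"):
        devid = dev["devid"]
        samples = ceph_json("device", "get-health-metrics", devid)
        lines.append(f'ceph_device_health_samples{{devid="{devid}"}} {len(samples)}')

    # Write atomically so the node exporter never sees a half-written file.
    tmp = TEXTFILE_DIR / "ceph_device_health.prom.tmp"
    tmp.write_text("\n".join(lines) + "\n")
    tmp.rename(TEXTFILE_DIR / "ceph_device_health.prom")


if __name__ == "__main__":
    main()
```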