[06:52:46] greetings
[09:36:52] morning
[10:45:33] I was looking at the tools object storage quota alerts, trending upwards like there's no tomorrow :| https://w.wiki/H2hz
[10:45:55] I guess we can bump the quota for now and revisit in jan if growth is expected
[10:47:19] what happened when it went down? manual cleanups?
[10:47:47] no it's at midnight so it must be something automatic
[10:47:58] but not enough to compensate the growth
[10:48:05] I'd say so too, automatic cleanup
[10:48:22] yeah like 2% a day growth
[10:50:53] the 4w-view is basically modern art https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=(sum%20by%20(user)%20(ceph_rgw_quota_objects_used)%20%2F%20sum%20by%20(user)%20(ceph_rgw_quota_objects_total%7Bcluster%3D%22wmcs%22%2Cuser%3D~%22(tools.*%7Cadmin%7Cquarry%7Cmetricsinfra%7Cpaws%7Ctofu%7Ccloudinfra)%22%7D))&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=4w
[10:51:02] (sorry I should've shortened that link)
[10:52:26] +1 to bumping the quota, but we'll need to find out who
[10:52:32] who's writing all the data
[10:53:11] harbor, probably
[10:53:22] yeah 4w is basically the NVDA valuation graph
[10:54:14] do we have per-bucket usage breakdowns ?
[10:55:20] and how do we go about bumping the quota ?
[10:56:25] LOL re: NVDA graph :)
[10:57:00] per-bucket usage: not sure, quota: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle#swift_/_S3_/_radosgw_/_object_storage
[10:57:31] (it's linked from the clinic duties wikitech page, as sometimes we have to change it for a cloudvps project)
[10:57:52] ah! thank you, I must have missed it
[10:58:33] ok so ceph of course has per-user stats, we could export them periodically like we do for swift
[10:58:40] anyways I'm looking into the quota bump
[10:59:37] "tools" is a single user from the ceph perspective, so I'm not sure if we can do per-bucket stats inside that user
[11:01:30] ah yes you are right
[11:01:41] root@cloudcontrol1006:~# radosgw-admin quota set --quota-scope user --uid 'tools$tools' --max-size 500G
[11:01:51] quota is 200G now
[11:03:41] lgtm
[11:04:01] {{done}}
[11:04:23] thanks! please leave a SAL about it :)
[11:04:30] you were 1 second faster :P
[11:04:41] hahah great minds think alike
[11:04:54] I think I would use "!log tools" in this case as it's related to the tools cloudvps project
[11:05:12] that's fair yeah, I'll do that too
[11:06:15] previous bump: https://sal.toolforge.org/tools?d=2025-11-19
[11:06:38] a.ndrew did increase the object count quota as well... let's check if we're close to the limit or not
[11:10:21] yeah 83k objects and 100k quota
[11:11:37] I'll bump to 200k dhinus
[11:12:15] +1
[11:14:48] ahhh the alerting quota was actually the "object count" quota and not the "gb" quota
[11:15:11] good catch dhinus, I misread the alert which does mention 'objects'
[11:15:23] I also did not notice it when I looked before :)
[11:15:24] whereas toolsbeta is by size
[11:16:07] I have to run an errand, bbl :)
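For reference, the quota work discussed above maps roughly to the following radosgw-admin commands, run as root on a cloudcontrol host. Only the --max-size line is taken from the paste at 11:01:41; the usage/quota checks, the --max-objects bump at 11:11:37 and the per-bucket breakdown are assumed syntax (a sketch, not copied from the log) and would need verifying before use.

    # current usage for the single "tools" radosgw user
    radosgw-admin user stats --uid 'tools$tools'
    # current quota settings (user_quota / bucket_quota appear in the JSON output)
    radosgw-admin user info --uid 'tools$tools'
    # size bump as pasted at 11:01:41
    radosgw-admin quota set --quota-scope user --uid 'tools$tools' --max-size 500G
    # object-count bump to 200k mentioned at 11:11:37 (assumed flag, mirroring --max-size)
    radosgw-admin quota set --quota-scope user --uid 'tools$tools' --max-objects 200000
    # possible per-bucket breakdown inside that user: list its buckets, then stat each one
    radosgw-admin bucket list --uid 'tools$tools'
    radosgw-admin bucket stats --bucket <bucket-name>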
[14:02:57] what's the triage/debug story for the updatetools job emails to tools.admin ?
[16:32:23] godog: sorry I didn't see your message about updatetools ^ -- I created T413099
[16:32:24] T413099: updatetools frequently emailing about failures - https://phabricator.wikimedia.org/T413099
[16:32:40] nice, thank you dhinus !
[17:00:51] updatetools sent another email "Start timestamp 2025-12-18T15:59:01Z. Finish timestamp 2025-12-18T16:20:22Z. Exit code was '1'. With reason 'Error'."
[17:01:46] but if I do "sudo become admin" and "tail updatetools.err" the last run seems successful
[17:02:26] ok in the log file there is an error 4 runs from the bottom, which probably matches the time of the error
[17:02:42] "ConnectionResetError: [Errno 104] Connection reset by peer"
[17:02:59] I have to log off but I'll check again tomorrow if it keeps on failing
[17:04:49] ok there's an increased traffic activity to toolsdb starting today: https://grafana.wmcloud.org/d/PTtEnEyVk/toolsdb-mariadb?orgId=1&from=now-7d&to=now&timezone=utc&var-server=tools-db-6
[17:05:08] but if you look at the 30d graph it's not unprecedented
[17:05:58] aborted connects have spiked https://grafana.wmcloud.org/d/PTtEnEyVk/toolsdb-mariadb?orgId=1&from=now-7d&to=now&timezone=utc&var-server=tools-db-6&viewPanel=panel-10
[17:12:04] looks like it's improving in the past hour
[17:18:40] logging off, if you see more issues with toolsdb ping me and I'll try to have a look later
[17:18:48] * dhinus off
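A possible way to follow up on the updatetools failures, built from the commands quoted above ("sudo become admin", "tail updatetools.err"); the grep pattern and line count here are assumptions, not from the log.

    ssh login.toolforge.org
    sudo become admin                                # tool account that receives the failure emails
    tail -n 100 updatetools.err                      # recent runs; the error was a few runs from the bottom
    grep -c 'ConnectionResetError' updatetools.err   # count connection-reset failures to correlate with toolsdb aborted connects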