[06:52:46] greetings
[09:36:52] morning
[10:45:33] I was looking at the tools object storage quota alerts, trending upwards like there's no tomorrow :| https://w.wiki/H2hz
[10:45:55] I guess we can bump the quota for now and revisit in jan if growth is expected
[10:47:19] what happened when it went down? manual cleanups?
[10:47:47] no it's at midnight so it must be something automatic
[10:47:58] but not enough to compensate the growth
[10:48:05] I'd say so too, automatic cleanup
[10:48:22] yeah like 2% a day growth
[10:50:53] the 4w-view is basically modern art https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=(sum%20by%20(user)%20(ceph_rgw_quota_objects_used)%20%2F%20sum%20by%20(user)%20(ceph_rgw_quota_objects_total%7Bcluster%3D%22wmcs%22%2Cuser%3D~%22(tools.*%7Cadmin%7Cquarry%7Cmetricsinfra%7Cpaws%7Ctofu%7Ccloudinfra)%22%7D))&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=4w
[10:51:02] (sorry I should've shortened that link)
[10:52:26] +1 to bumping the quota, but we'll need to find out who
[10:52:32] who's writing all the data
[10:53:11] harbor, probably
[10:53:22] yeah 4w is basically the NVDA valuation graph
[10:54:14] do we have per-bucket usage breakdowns ?
[10:55:20] and how do we go about bumping the quota ?
[10:56:25] LOL re: NVDA graph :)
[10:57:00] per-bucket usage: not sure, quota: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle#swift_/_S3_/_radosgw_/_object_storage
[10:57:31] (it's linked from the clinic duties wikitech page, as sometimes we have to change it for a cloudvps project)
[10:57:52] ah! thank you, I must have missed it
[10:58:33] ok so ceph of course has per-user stats, we could export them periodically like we do for swift
[10:58:40] anyways I'm looking into the quota bump
[10:59:37] "tools" is a single user from the ceph perspective, so I'm not sure if we can do per-bucket stats inside that user
[11:01:30] ah yes you are right
[11:01:41] root@cloudcontrol1006:~# radosgw-admin quota set --quota-scope user --uid 'tools$tools' --max-size 500G
[11:01:51] quota is 200G now
[11:03:41] lgtm
[11:04:01] {{done}}
[11:04:23] thanks! please leave a SAL about it :)
[11:04:30] you were 1 second faster :P
[11:04:41] hahah great minds think alike
[11:04:54] I think I would use "!log tools" in this case as it's related to the tools cloudvps project
[11:05:12] that's fair yeah, I'll do that too
[11:06:15] previous bump: https://sal.toolforge.org/tools?d=2025-11-19
[11:06:38] a.ndrew did increase the object count quota as well... let's check if we're close to the limit or not
[11:10:21] yeah 83k objects and 100k quota
[11:11:37] I'll bump to 200k dhinus
[11:12:15] +1
[11:14:48] ahhh the alerting quota was actually the "object count" quota and not the "gb" quota
[11:15:11] good catch dhinus, I misread the alert which does mention 'objects'
[11:15:23] I also did not notice it when I looked before :)
[11:15:24] whereas toolsbeta is by size
[11:16:07] I have to run an errand, bbl :)
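For reference, the quota work discussed above maps roughly to the following radosgw-admin commands, run as root on a cloudcontrol host. Only the --max-size line is taken from the paste at 11:01:41; the usage/quota checks, the --max-objects bump at 11:11:37 and the per-bucket breakdown are assumed syntax (a sketch, not copied from the log) and would need verifying before use.

    # current usage for the single "tools" radosgw user
    radosgw-admin user stats --uid 'tools$tools'
    # current quota settings (user_quota / bucket_quota appear in the JSON output)
    radosgw-admin user info --uid 'tools$tools'
    # size bump as pasted at 11:01:41
    radosgw-admin quota set --quota-scope user --uid 'tools$tools' --max-size 500G
    # object-count bump to 200k mentioned at 11:11:37 (assumed flag, mirroring --max-size)
    radosgw-admin quota set --quota-scope user --uid 'tools$tools' --max-objects 200000
    # possible per-bucket breakdown inside that user: list its buckets, then stat each one
    radosgw-admin bucket list --uid 'tools$tools'
    radosgw-admin bucket stats --bucket <bucket-name>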
[14:02:57] what's the triage/debug story for the updatetools job emails to tools.admin ?
[16:32:23] godog: sorry I didn't see your message about updatetools ^ -- I created T413099
[16:32:24] T413099: updatetools frequently emailing about failures - https://phabricator.wikimedia.org/T413099
[16:32:40] nice, thank you dhinus !
[17:00:51] updatetools sent another email "Start timestamp 2025-12-18T15:59:01Z. Finish timestamp 2025-12-18T16:20:22Z. Exit code was '1'. With reason 'Error'."
[17:01:46] but if I do "sudo become admin" and "tail updatetools.err" the last run seems successful
[17:02:26] ok in the log file there is an error 4 runs from the bottom, which probably matches the time of the error
[17:02:42] "ConnectionResetError: [Errno 104] Connection reset by peer"
[17:02:59] I have to log off but I'll check again tomorrow if it keeps on failing
[17:04:49] ok there's an increased traffic activity to toolsdb starting today: https://grafana.wmcloud.org/d/PTtEnEyVk/toolsdb-mariadb?orgId=1&from=now-7d&to=now&timezone=utc&var-server=tools-db-6
[17:05:08] but if you look at the 30d graph it's not unprecedented
[17:05:58] aborted connects have spiked https://grafana.wmcloud.org/d/PTtEnEyVk/toolsdb-mariadb?orgId=1&from=now-7d&to=now&timezone=utc&var-server=tools-db-6&viewPanel=panel-10
[17:12:04] looks like it's improving in the past hour
[17:18:40] logging off, if you see more issues with toolsdb ping me and I'll try to have a look later
[17:18:48] * dhinus off
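A possible way to follow up on the updatetools failures, built from the commands quoted above ("sudo become admin", "tail updatetools.err"); the grep pattern and line count here are assumptions, not from the log.

    ssh login.toolforge.org
    sudo become admin                                # tool account that receives the failure emails
    tail -n 100 updatetools.err                      # recent runs; the error was a few runs from the bottom
    grep -c 'ConnectionResetError' updatetools.err   # count connection-reset failures to correlate with toolsdb aborted connects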