[00:11:53] * bd808 off
[06:25:34] morning
[07:33:27] morning
[08:05:56] morning
[08:08:03] o/
[08:25:32] cloudbackups are failing, does anyone know if that's expected?
[08:26:13] I remember puppet was failing at some point I think, but this is not puppet (that I know of, have not looked yet)
[08:33:28] it's all "codfw" cluster backups though (-dev)
[08:35:17] T361912 isn't "urgent" but we'd definitely appreciate a promptish look at it — sorry to pressure! <3
[08:35:17] T361912: xtools quota increase - https://phabricator.wikimedia.org/T361912
[08:40:08] TheresNoTime: are xtools-prod08 and 09 a pair of redundant appservers, or do they have different functionality?
[08:40:54] one handles API requests, the other is the main app (afaik)
[08:41:39] but now looking at this, it may be the trove db (`xtools-db01`) that needs resizing (as well as, or instead of, those)
[08:41:48] they don't share the quota, do they?
[08:42:30] I think trove quota is separate
[08:42:55] since based on looking at https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=xtools&var-instance=All&from=now-6h&to=now those instances themselves don't seem particularly loaded
[08:43:23] do trove db instances also report usage to a grafana?
[08:43:39] I don't think so..
[08:43:44] but I can ssh to those and have a look
[08:44:51] thank you
[08:45:47] it's not particularly loaded at this exact moment.. but it is a g3.cores1.ram2.disk20, so I can totally see that suffering when it is
[08:46:51] Shall I close T361912 and open a new task for upping the trove quota instead?
[08:46:52] T361912: xtools quota increase - https://phabricator.wikimedia.org/T361912
[08:47:38] (or I can repurpose 361912 I suppose..)
[08:47:58] I suspect you could resize the trove thing to at least cores2.ram4 within the default quotas (although I don't remember what they are exactly)
[08:51:53] Okay, I'll look at that, thank you again :) I've closed 361912
[08:54:12] One more question — https://wikitech.wikimedia.org/wiki/Help:Trove_database_user_guide#Backup_or_snapshot_your_databases mentions that saving a backup to swift isn't yet implemented, but the linked task there is resolved and clicking "backup" in horizon does prompt me for a swift container name
[08:54:18] is that something we can do, or no?
[08:55:54] if I recall correctly it currently only works on instances created after a certain date. and moving all older instances to those newer ones is difficult without the ability to natively backup/snapshot them
[08:56:52] Okay, in that case and given that at this exact moment it's not all crashing and burning I'm gonna defer to the person on commtech who understands xtools better (:
[08:57:04] thank you very much for the help and patience ^^"
[09:10:57] hmm, I think that https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016447 never worked
[09:13:37] (is that part of the trove backup thing I mentioned?)
[09:16:33] oh nonono, sorry, completely off-topic :)
[09:16:53] it's related to the alerts we are seeing on cloudbackup*-dev
[09:19:01] oh :p
[09:19:57] looking for a proofread of https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/RBAC_and_PSP/PSP_migration to make sure it doesn't contain any obvious nonsense
[09:23:35] arturo: which PSA policy are you thinking of using for tools?
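[Editorial sketch, referring back to the Trove resize suggested at 08:47:58 above: one way the flavor bump could be done is with the OpenStack CLI, assuming the python-troveclient plugin is available. The clouds.yaml profile name, the target flavor name, and sub-command availability in the installed client version are all assumptions here, so treat this as illustrative rather than the documented Cloud VPS procedure.]

```shell
# Illustrative only: assumes the python-troveclient OpenStack CLI plugin is installed.
# The "xtools" profile name and the target flavor name are guesses, not confirmed values.
openstack --os-cloud xtools database instance list    # confirm the instance name/ID
openstack --os-cloud xtools database flavor list      # check which flavors are available
# Resize xtools-db01 from g3.cores1.ram2.disk20 to a cores2.ram4 flavor (placeholder name):
openstack --os-cloud xtools database instance resize flavor xtools-db01 g3.cores2.ram4.disk20
```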
[09:24:03] restricted + baseline, but it doesn't cover everything, as you can see in the table
[09:24:29] I thought we decided not to decide yet on replacing the admission controller with policy rules, and to look at it after we had some more experience with the new policy rules engine
[09:25:18] hmmmm, restricted does not seem to allow hostPath mounts, so that would block tools using NFS mounts
[09:26:34] dcaro: I think that is the smaller problem here. The PSA as they are can't work as a replacement for PSP. Meaning that we need a policy agent anyway
[09:27:07] once we have a policy agent in place, replacing the admission controllers is trivial
[09:27:25] that's the point I don't want to decide on yet
[09:27:26] (famous last words?)
[09:27:51] I don't think it's that beneficial to move the admission controllers to (whichever agent) policies
[09:27:52] why don't you want to decide yet?
[09:28:52] I'm curious -- we definitely don't need to decide yet. But it would be good to keep it on the radar, it might inform our decision regarding which policy agent to use
[09:28:59] e.g. envvars, I think it's better to keep the admission controller separated from any policy agent we have, and I don't think yaml is any nicer than golang for expressing it
[09:29:59] I would prefer not adding a coupling on another system unless there's no other option
[09:31:02] that's ok, we can decide later
[09:31:24] just for reference, here is an example of what mutating a pod manifest to inject envvars looks like with kyverno https://kyverno.io/policies/other/add-env-vars-from-cm/add-env-vars-from-cm/
[09:31:34] again, just an example, not saying we should go with kyverno
[09:32:57] mmm I may start playing with kyverno just because I like the docs more than OPA gatekeeper
[09:33:01] we currently use many configs to store the envvars (one per envvar)
[09:33:57] regarding replacing the admission controllers, this is my only concern at the moment:
[09:34:03] << make sure the policy agent we choose can also absorb the functionalities of the several custom admission controllers we have >>
[09:34:23] I would not make that a "must" though
[09:34:28] in other words, make sure the moves we make don't close future doors
[09:35:02] if there's a big advantage to using a policy agent that does not support the custom admission controllers, that's also ok, and probably preferable to choosing a different policy agent
[09:36:33] taavi: the hostPath thing is too bad. It may mean that it isn't even worth introducing PSA at all
[09:37:07] even the baseline disallows hostPath
[10:17:18] dcaro: I looked at the cloudbackup1*-dev alerts, backy2 is refusing to start because postgres is down
[10:18:10] I would wait for a.ndrew to be online, I'm not sure how postgres is configured on those hosts
[10:18:46] maybe it's a simple postgres config issue
[10:21:22] dhinus: agree yes, let's wait
[10:29:03] the tf-infra-test alert was again failing because Trove quotas are not working correctly (T359412)
[10:29:04] T359412: [trove] wrong quota_usages values in project tf-infra-test - https://phabricator.wikimedia.org/T359412
[12:05:34] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/72
[12:06:16] dcaro: flake8 does not support pyproject.toml, right?
[12:06:26] not yet no :,(
[12:06:51] yeah that's what I thought. approved
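[Editorial sketch for the "restricted + baseline" PSA discussion above (09:24–09:37): the built-in Pod Security Admission controller is configured per namespace through labels, which is roughly what a migration away from PSP would rely on. The namespace name below is made up, and as the conversation notes, both baseline and restricted reject hostPath volumes, so labels alone would not accommodate NFS-mounted tools.]

```yaml
# Minimal sketch of Pod Security Admission namespace labels; "tool-example" is a
# placeholder name, not a real Toolforge namespace. Both levels shown reject
# hostPath volumes, which is the NFS-mount concern raised in the conversation.
apiVersion: v1
kind: Namespace
metadata:
  name: tool-example
  labels:
    pod-security.kubernetes.io/enforce: baseline   # hard-reject pods above "baseline"
    pod-security.kubernetes.io/warn: restricted    # warn (but allow) on "restricted" violations
    pod-security.kubernetes.io/audit: restricted   # record "restricted" violations in audit logs
```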
[12:06:56] thanks
[12:07:41] dcaro: just move to ruff :P
[12:07:49] ;)
[12:08:37] I might send a patch after we get the decision request sorted
[12:21:53] quick review for https://gerrit.wikimedia.org/r/c/operations/alerts/+/1017273?
[12:22:45] done :)
[12:22:46] thanks
[12:22:47] * dcaro lunch
[12:41:24] dcaro: the backup alerts are from T358855, a new service I'm setting up. I'm not sure why those alerts un-ack'd themselves but you can ignore them.
[12:41:25] T358855: Use cloudbackup100[12]-dev for cinder backup test/dev - https://phabricator.wikimedia.org/T358855
[12:59:48] andrewbogott: ack
[13:04:07] * arturo food
[13:28:28] * dhinus running an errand
[14:27:53] dcaro: mind doing the etherpad rotation at some point so I can add things to next week's pad?
[14:28:26] oh, the new one is there already, I forgot to delete the old one
[14:28:43] and the /topic needs updating
[14:31:38] thanks :-)
[14:32:47] I kind of got distracted by one of the subjects of the daily and started looking into it, and forgot to come back to the task
[14:32:53] task buffer overflow xd
[15:09:13] * dcaro off
[15:09:17] have a good weekend
[15:09:28] you too!
[15:09:38] same!
[15:09:44] note that I'm leaving some tests running on cloudcephosd1034 (in a tmux), don't kill them xd
[15:55:03] * arturo offline
[17:27:57] taavi, dcaro: there is a flake8 plugin that adds the pyproject.toml support that upstream does not want to add themselves -- https://pypi.org/project/Flake8-pyproject/
[17:28:53] I still tend to use tox.ini out of habit in most places
[17:38:36] * bd808 lunch
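[Editorial sketch on the flake8/pyproject.toml point (12:06 and 17:27 above): the Flake8-pyproject plugin linked at 17:27 makes flake8 read a [tool.flake8] table from pyproject.toml, which plain flake8 still ignores. The options below are placeholders, not any project's real configuration.]

```toml
# pyproject.toml -- only honoured when the third-party Flake8-pyproject plugin is
# installed alongside flake8 (e.g. pip install flake8 Flake8-pyproject);
# unpatched flake8 keeps needing setup.cfg, tox.ini, or .flake8 instead.
[tool.flake8]
max-line-length = 100     # placeholder values, not a real project's settings
extend-ignore = "E203"
```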