[08:09:25] morning! I am back, anything you want me to look at first? [08:12:29] welcome back! [08:12:40] I'm currently looking at maintain-kubeusers [08:12:48] we deployed the refactored version yesterday [08:12:56] but there are a few things that I'm not liking [08:13:21] looping over all the accounts takes so long that the daemon fails the liveness checks [08:13:29] I'm exploring how to solve that [08:14:44] I think checking for all the configmaps in each namespace is taking a long time, so I'll explore how to cache that [08:17:23] T366564 [08:17:26] T366564: toolforge: new maintain-kubeusers takes long time to loop over all the accounts to reconcile them - https://phabricator.wikimedia.org/T366564 [08:18:06] another strategy is to introduce parallelism, as dcaro suggested yesterday [08:19:51] welcome back! [08:20:15] didn't the old maintain-kubeusers also read the configmap for each namespace on each run? what changed? [08:20:38] it did, didn't it? [08:21:03] a lot changed, it was a major refactor of the code [08:21:17] arturo: is it failing on every loop? I thought it was only having the issue on the first loop [08:21:29] like, why is it suddenly much slower? [08:21:30] (when it was refreshing all the certs and such) [08:21:39] the loop is still surprisingly slow [08:21:49] so it still misses the liveness probe [08:22:04] taavi: I don't know, and in my local tests, the refactor seemed faster [08:22:18] so I guess a good starting point is to see how it was looping before when there were no changes needed, and what it is doing now [08:23:34] the check for the kube certs now includes a filesystem check, maybe this is NFS introducing the delay. I'm not sure this filesystem check was there before [08:24:47] now: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/blob/main/maintain_kubeusers/resources/kubecerts.py?ref_type=heads#L45 [08:28:27] one thing I'm seeing is that before we checked if the tool was disabled before checking anything else, and those were skipped [08:28:37] now you retrieve the configmap and such even if it's disabled [08:29:00] ok [08:29:20] do we know how many tools are disabled? [08:29:22] but I guess the ratio of disabled/enabled accounts should be in favor of enabled [08:30:20] if it's high enough it might be relevant [08:31:24] in the current loop, only 7 [08:31:36] https://www.irccloud.com/pastebin/Nxcc8E7C/ [08:32:15] those are not many xd [08:32:29] can you get times from the logs? (see if all are more or less the same or a few are way slower) [08:33:14] In that sense, can we enable a verbose/debug mode to get extra logs also? 
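One way to attack the configmap cost mentioned above is to fetch all the state configmaps in a single LIST call and reconcile from that cache instead of issuing one GET per namespace. A minimal sketch, assuming the official kubernetes Python client; the function name and label selector are illustrative, not taken from the maintain-kubeusers code base:

```python
# Sketch only: one LIST across the cluster instead of a per-namespace GET,
# assuming the official `kubernetes` Python client. Names are illustrative.
from kubernetes import client, config


def load_state_configmaps(label_selector: str = "app=maintain-kubeusers") -> dict:
    """Return {namespace: configmap} for every matching configmap in the cluster."""
    config.load_incluster_config()
    core_v1 = client.CoreV1Api()
    result = core_v1.list_config_map_for_all_namespaces(label_selector=label_selector)
    return {cm.metadata.namespace: cm for cm in result.items}


# The reconcile loop could then read each account's state from this in-memory
# cache, e.g. cache.get(f"tool-{account.name}"), instead of ~3300 single GETs.
```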
that might hint at which parts are the slow ones [08:33:18] https://disabled-tools.toolforge.org is the list of currently disabled tools [08:33:35] previous code to detect if certs needed renewal: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/blob/cca65ee3c03812a9ea3f3e70cd61835f7a595299/maintain_kubeusers/k8s_api.py#L67 [08:33:37] that matches yes [08:33:47] so indeed it did not check for the existence of the kube certs [08:34:10] I think this supports the NFS-induced delays [08:34:25] I'd say it does not refute them [08:34:41] there are a few other fs checks when checking 'needs_delete' in the new code [08:36:44] there are a few checks like this [08:36:45] if self.account.is_disabled and os.path.exists(self.home): [08:36:52] but I was assuming lazy evaluation [08:37:09] and then the os.path.exists would not execute if the first part evaluated to false [08:37:11] given that most accounts are not disabled, the check will run [08:37:29] wait, no, there's no `not` there xd, so yes, it should not run [08:37:54] I'm going to start looking at the WMCS hosts on https://phabricator.wikimedia.org/T366555 [08:38:31] taavi: ack [08:38:32] taavi: 👍 [08:39:23] dcaro: there are others too: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/blob/main/maintain_kubeusers/resources/homedir.py?ref_type=heads#L21 [08:40:11] this one can be saved by storing something in the state configmap [08:41:46] hmm, shouldn't that be returning False? [08:41:57] (if the user is disabled, no need to create the home) [08:42:15] the flag itself is inside the home directory xd [08:42:23] yeah, this is confusing [08:42:35] the disabled_flag is something created when the user is disabled [08:42:52] so that first if is there to support a transition from disabled to not-disabled [08:43:21] self.disabled_flag is the name of a file in the home dir [08:43:29] that transition does not mean that the flag was removed? [08:43:39] (as in, if the flag exists, the user is disabled) [08:43:41] self.account.is_disabled is the actual boolean that was fetched from LDAP [08:44:23] well, this is the function/code in charge of that transition [08:44:25] ok, so it's expecting that the users passed in are all enabled at that point? [08:44:34] (on ldap) [08:44:43] yes, needs_created() is only called for enabled users [08:46:41] os.makedirs will fail if the directories already exist, I think [08:46:47] did you test that flow? [08:47:02] yes, they have `exist_ok=True` [08:47:09] I see, nice [08:49:44] 2024-06-04T08:49:31.970568402Z finished run, reconciled 22 admins, 3293 tool accounts [08:50:30] can you check the stats? how many needed creating and such? [08:51:30] yes [08:51:52] eventually we may want to set up grafana dashboards [08:51:59] but for now, running curl inside the pod [08:52:27] in the last loop, no resources were created, and no resource errors [08:52:45] maintain_kubeusers_run_seconds_sum 1934.9139037914574 [08:52:54] and the loop lasted this many seconds [08:53:05] it seems to be doing ~10 accounts per minute [08:53:16] *per second [08:53:25] about 32 minutes to loop [08:53:52] 10 accounts per second is not that bad if you think about it [08:53:56] (I got that from the logs) [08:54:26] that should be 300 seconds, but it takes ~2k, so some accounts are way slower? [08:54:46] let me recheck how I counted xd [08:55:27] I counted the `response body` entries, which seem to happen only once per account [08:55:33] is that assumption ok? 
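To make the two points above concrete (short-circuit evaluation and idempotent directory creation), a minimal sketch; the names are placeholders, not the real maintain-kubeusers homedir code:

```python
# Minimal illustration of the two behaviours discussed above; names are
# placeholders, not the actual maintain-kubeusers code.
import os


def needs_disabled_cleanup(account, home: str) -> bool:
    # `and` short-circuits: when account.is_disabled is False (the common
    # case), os.path.exists() -- a stat that may hit NFS -- never runs.
    return account.is_disabled and os.path.exists(home)


def ensure_home(home: str) -> None:
    # exist_ok=True makes this idempotent: no exception if the directory is
    # already there, so repeated reconcile loops are safe.
    os.makedirs(home, mode=0o755, exist_ok=True)
```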
[08:55:43] yes, the state configmap read [08:56:44] if we are doing 3200 accounts in 32 mins, that is 1000 accounts in 1 min, which is about 16 accounts in 1 second [08:57:44] I mean, the k8s API is also stressed here, so I would like to maybe don't query it even more? [08:58:27] it's 100 accounts per minute [08:59:05] that's a bit more than 1.5 accounts per second [09:00:56] found a slow bump in the logs: [09:01:00] https://www.irccloud.com/pastebin/Pk4JGMCM/ [09:01:24] it's the end of a run xd [09:02:19] right [09:06:58] first proposed change: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/28 [09:13:00] hmm... I think that the stats we are gathering are not true, run_seconds_sum is the aggregated runtime, not the time per run [09:13:11] *not true/not what we think [09:13:26] ok, that is certainly possible [09:13:33] it actually takes ~300 s for each run [09:14:05] now it's kinda hard to extract as it's a counter in grafana, and you have to find the right 'rate' to match 1 loop [09:14:37] you have maintain_kubeusers_run_seconds_count [09:14:48] https://grafana-rw.wmcloud.org/d/E9HyRBySk/maintain-kubeusers-copy?orgId=1 [09:14:49] so maybe you can divide the runtime by that number to get the average time per run? [09:14:55] that's the number of runs [09:15:08] that might work, let me check [09:15:37] maybe? https://grafana-rw.wmcloud.org/d/E9HyRBySk/maintain-kubeusers-copy?orgId=1 [09:16:04] I see the same as before [09:16:29] https://usercontent.irccloud-cdn.com/file/fECGdonZ/image.png [09:16:41] even with hard refresh? [09:16:47] (I'll hit save again) [09:16:55] now is better! [09:20:11] it's ~10x slower than before xd (old runs took ~30s) [09:20:26] :-( [09:23:38] 5min is not crazy long either, but might be nice to improve [09:27:28] I agree [09:31:07] I've updated a bit the graphs, feel free to improve/add stuff, we might want to add an alert also at some point if there's no data [09:31:13] did not see any error-related counter though [09:32:10] yeah I think the prom client lib only creates counters when you increment it [09:32:26] (and there wasn't any error so far) [09:32:52] hmm, there's no other "action" counters, the ones shown are from before the refactor I think [09:34:28] https://www.irccloud.com/pastebin/5cjKpbkH/ [09:34:37] this is what maintain-kubeuser exports [09:34:49] but both counters are empty because doing noop at the moment [09:35:07] does it export anything about how many accounts are created/deleted/etc? [09:35:11] or only total? [09:35:22] only when doing any action [09:35:35] so, maybe to see some real data, delete the certs or the namespace of some test account [09:36:55] ahh, the total has a label operation [09:37:37] hmm... those are also "sums" it seems, as in aggregated sum, not total in the last run [09:37:52] correct [09:38:16] they are global counters [09:39:10] I'm guessing that there's one entry per resource action, as in, one account might have several entries no? [09:40:10] I think at some point I reduced the amount of labels because the cardinality exploded [09:40:35] so for normal reconciliation actions, the labels are [09:40:39] that makes it kinda hard to get actual counts (ex. 
how many accounts were created in the last day?), you get instead the Xmin rate of account creates xd [09:40:45] "account_type", "account_name", "operation", [09:41:27] where operation can be "create", "update", "delete" [09:42:17] at the moment, with the number of resources we have, for a normal account creation, you will get a counter like this [09:42:43] maintain_kubeusers_resources{account_type="tool", account_name="whatever", operation="create"} 7 [09:42:51] 7 resources were created for a new account [09:52:38] dcaro: arturo: is it ok to reboot cloudcumin1001 now? asking as you both are currently logged in there [09:52:47] taavi: yes [09:52:51] taavi: ok from me [09:52:53] thanks [09:55:32] arturo: this is what I got so far for actions taken, not sure it makes sense [09:55:50] https://usercontent.irccloud-cdn.com/file/xdsc80XS/image.png [09:56:12] that's actions taken per hour [09:57:08] yeah, I think that may be correct [09:57:31] (done) [10:14:06] dcaro: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/29 another patch proposal [10:15:21] that will not regenerate the certs if the files are missing no? [10:15:42] if someone removes the certs by hand, this code will not detect it [10:15:53] this is what the previous version of maintain-kubeusers was doing [10:16:12] to force-regenerate the certs, we will need to touch the state configmap [10:17:16] hmm, okok, would be nice though to have more idea on what's making it slower, it seems stably slower (all account are more or less as slow all the time), so might not only be just one file check [10:17:48] trying to get some profile data from a single account run might show which parts are taking the most time [10:20:21] I think we can do that with debug logs, they are pretty verbose in what they are doing [10:25:28] amazing: dropping the sleep(1) in the cert generation makes it reliably fail :-( [10:26:35] yep, that's expected, iirc cert-manager takes some time between saying that the cert is ready to actually putting the data inside the cert crd [10:26:55] we can maybe loop in a different way [10:27:17] *sleep [10:27:45] I don't think this cert has anything to do with cert-manager, is generated by the API server directly, but yeah, probably same logic [10:35:46] kube-controller-manager or whatever manages the certs yep [10:36:39] cloudlbs have failing systemd units (networking) [10:37:06] looking [10:37:17] Jun 04 10:01:12 cloudlb1001 ifup[659]: ifup: failed to bring up vlan1151 [10:38:18] ouch [10:38:21] aaaahh, they were rebooted [10:38:46] post-up ip route add 172.20.0.0/16 via 172.20.1.1 dev vlan1151 [10:38:46] post-up ip route add 185.15.56.0/24 via 172.20.1.1 dev vlan1151 [10:38:46] post-up ip route add 172.20.0.0/16 via 172.20.1.1 dev vlan1151 [10:38:46] post-up ip route add 185.15.56.0/24 via 172.20.1.1 dev vlan1151 [10:38:49] hmm, why are those duplicated? [10:38:59] the extra space is suspicious [10:39:02] yep [10:39:07] yeah, probably a puppet change at some point [10:39:12] a template change may have left those [10:39:20] i'm removing both from /e/n/i and then running puppet to add the current one back [10:39:23] puppet cannot cleanup old entries via augeas :-( [10:41:21] ok, /e/n/i fixed. I'll reboot both again just to be sure the config is correctly applied [10:42:38] ack [10:46:52] there is also a BGP alert for cloudsw @ codfw [10:47:09] oh, no, is for eqiad [10:47:11] so maybe related [10:47:35] hey topranks are you available for a BGP check? 
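On the metrics side, assuming the standard prometheus_client library mentioned above (the actual maintain-kubeusers wiring may differ), the behaviour being discussed looks roughly like this:

```python
# Rough sketch of how the exported metrics behave with prometheus_client;
# metric names are taken from the values quoted above.
from prometheus_client import Counter, Summary

run_seconds = Summary(
    "maintain_kubeusers_run_seconds", "Time spent on one reconcile run"
)
resources = Counter(
    "maintain_kubeusers_resources",
    "Resources acted on, cumulative since process start",
    ["account_type", "account_name", "operation"],
)

with run_seconds.time():  # adds to ..._sum and bumps ..._count, both cumulative
    pass                  # one full reconcile loop would run here

# Labelled children only show up on /metrics after their first increment,
# which is why no action/error series appear while every run is a no-op.
resources.labels(account_type="tool", account_name="whatever", operation="create").inc(7)

# In Grafana, the average time per run is the ratio of the two cumulative
# series, e.g.:
#   rate(maintain_kubeusers_run_seconds_sum[30m])
#     / rate(maintain_kubeusers_run_seconds_count[30m])
```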
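For the sleep(1) in the cert generation, a bounded poll is one alternative to a fixed delay. A sketch only; `fetch_certificate` stands in for whatever reads the signed certificate back from the API object and is not a real maintain-kubeusers function:

```python
# Illustrative only: a bounded poll in place of a fixed sleep(1) while the
# API server signs the cert.
import time


def wait_for_certificate(fetch_certificate, timeout: float = 30.0, interval: float = 0.5):
    """Poll until the signer has filled in the certificate, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        cert = fetch_certificate()
        if cert:  # signed data is present
            return cert
        time.sleep(interval)  # still waiting for the controller to sign it
    raise TimeoutError(f"certificate was not issued within {timeout:.0f}s")
```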
[10:47:39] probably related to rebooting the cloudlb hosts [10:47:42] since those speak BGP [10:48:10] if the BGP sessions went down unexpectedly, maybe we need these alerts to be paging [10:48:47] arturo: sure [10:49:01] what's the TLDR [10:49:18] it is expected [10:50:02] topranks: ok, so we got alerts about the BGP session in clousw @ eqiad going down. The interfaces on the cloudlbs were briefly misconfigured bc puppet, but it should be now back to normal [10:50:07] cloudlb1001 bounced bgp 2 mins ago but is back [10:50:19] ok, expected [10:50:20] ok yep that syncs with what I see [10:50:35] thanks for double checking! [10:50:51] looks ok at a glance [10:50:55] https://www.irccloud.com/pastebin/4gWfKtrr/ [10:51:06] np! ping me if there is any other issue [10:51:19] do you know if the BGP session being down means no traffic is circulating? [10:51:50] well, if the interfaces on one end are down, then I guess BGP is the last problem [10:52:04] it depends [10:52:10] right now I see cloudlb1002 BGP is down [10:52:33] went down about 90 seconds ago [10:52:33] Jun 4 10:51:07 cloudsw1-d5-eqiad bfdd[2322]: BFDD_STATE_UP_TO_DOWN: BFD Session 172.20.2.2 (IFL 571) state Up -> Down LD/RD(37/2731712460) Up time:00:39:21 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry. [10:52:42] taavi: I see now the same alert for codfw, could you double check the same file on cloudlb @ codfw ? [10:53:02] in terms of traffic circulating, it all depends on if there is *any* route to the destination [10:53:10] ok [10:53:11] those alerts are still caused by the reboot cookbooks I am running [10:53:19] I assume if both cloudlb's are down then there is no route, so traffic doesn't get where it should [10:53:30] ok [10:53:35] if they go down, but not at the same time, you'd expect traffic to keep working, flowing by the other one [10:53:45] taavi: ok cool [11:33:03] dcaro: another proposal: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/30 [12:02:19] dcaro: thanks [12:29:21] dcaro: are the functional tests in toolsbeta only jobs-api ones? or does that depend on the tool you test with? [12:35:51] (I think I'm confused because I ran them recently and think I saw some builds-api tests in there?) [12:38:45] blancadesal: main only has jobs-api, this has build and envvars https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/306 [12:38:51] (there's another with builds only) [12:38:57] I'm still testing that last one (envvars) [12:44:14] 👍 [12:51:03] this might also be interesting to test locally (I'm using it a lot since I wrote it xd) https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/134 [12:52:30] ohh, nice [13:01:00] ohhh, we are not deploying the envvars admission on lima-kilo at all [14:58:49] taavi: was the script you were thinking of "confctl depool", or was it something else? [14:58:57] I'm reading https://wikitech.wikimedia.org/wiki/Conftool [15:00:43] dhinus: no, just `pool` or `depool`. they need the conftool::scripts puppet class, which might or might not be missing from the hosts [15:01:16] I'm not finding those scripts, so maybe the class is not there [15:14:40] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038847 [15:15:49] hmm, do we care that also installs them to clouddb1021? 
good point [15:16:20] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/135 [15:16:21] maybe better to move it to a more specific class [15:22:50] is it acceptable to use class{} from a role:: file? I vaguely remember some convention of only including profiles [15:23:03] I'm rebooting one of the cloudcephosd hosts, but I did not silence any ceph alerts to try to see what happens (fyi) [15:23:10] dcaro: ack [15:23:27] there should be no pages, but maybe a warning will appear [15:24:29] dhinus: no, only profiles in roles [15:24:34] yup, just found https://wikitech.wikimedia.org/wiki/Puppet/Coding_and_style_guidelines#Roles [15:26:26] is there a way to avoid creating an extra profile just to include that class? [15:27:02] and ceph is back healthy, I did not see any alerts, but probably just timing [15:27:08] dcaro: nice [15:27:44] dcaro: if you are offering to take over T360378, I'm fine with that. I'm also happy to move it forward myself, but would like another Toolforge admin to give a +1 to https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/1012797 so I feel like there is some consensus. [15:27:44] T360378: Provide a Redis container for use within a tool's namespace - https://phabricator.wikimedia.org/T360378 [15:28:46] * bd808 remembers being told off by taavi for changing docker things without consensus and tries to avoid that now [15:29:27] ugh are you saying I need to form an opinion about offering that container? [15:29:29] xd, bd808 do you want me to test the patch? (that might take me a bit more) [15:30:36] looks ok to me (untested) [15:31:17] I'm pretty sure I tested it locally when I wrote the patch initially, but confirmation is always nice. [15:31:46] okok, I'll try to test it, probably tomorrow [15:31:51] taavi: the problem with expressing opinions is that you just keep having to do it once you start ;) [15:31:54] (not that we are in a rush xd) [15:32:18] yeah, I don't think a few days or even weeks is a problem here. [15:32:29] I'm more concerned about the semantics being exposed. We will need to support that container forever once people start using it [15:33:33] we have discussed a few times how to offer such services: a wrapper API, a command line interface, etc, that would allow us to "hide" the details and have more freedom to change the actual implementation [15:34:02] feel free to add your concerns in the task, it seems to me that for this specific one having the `redis` option in toolforge jobs might be a simple way to go [15:34:24] I don't think it is the case that we will have to support it forever. That was my reversal on the -2 block. We can change our interfaces. I really don't think that y'all can reasonably wrap the entire universe in your own custom services [15:35:16] my two immediate thoughts are 1) can/should we do that with build service in a similar way to the pywikibot container, and 2) should we pick one of the redis forks immediately instead of pushing that (and an image rename?) down a few years? [15:36:10] once we have persistent volumes we might want to retire it and use some dbaas instead [15:36:20] not the entire universe, but we know there are a few interesting services that people want to use: redis, elasticsearch, mariadb, maybe others. While I agree the way to go is each namespace having its own tiny deployment, for me the main question is: How do we want to expose them? 
[15:36:48] If we are going to shift shared containers to build service then yes this would be a good candidate for that. The fork question is basically blocked by that. There are right now no forks that are published to an apt repo we can get at. [15:37:40] we don't have a "build a c program" build thing [15:37:54] and we don't have a "add a random apt repo" build thing [15:38:01] nope to both :/ [15:39:07] T363033, T363027 [15:39:08] T363033: [builds-builder] Support using custom buildpacks - https://phabricator.wikimedia.org/T363033 [15:39:08] T363027: [builds-builder] Support adding repositories for Apt buildpack - https://phabricator.wikimedia.org/T363027 [15:42:30] ok, so given my concerns found no traction, I'll mark the patch as +1 because otherwise it looks good to me :-) [15:45:50] dcaro: quick review? https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/33#cd69e70ac18f9d0c61d15505d300190e2f3b755f [15:46:06] wrong URL; good one: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/33 [15:47:34] it's the same MR no? [15:48:57] I'm thinking about taavi's (1)... doing this container using build service would make it easier to pull in some of the active users of the container as long term maintainers. That would give them agency going forward on decisions like a more Toolforge friendly upstream. [15:49:09] Right now it would take some annoying boilerplate code in Python or something else to make that container as the Apt buildpack doesn't work stand alone. [15:49:32] you mean having 'redis as a buildpack'? [15:49:49] or building the redis container using a buildpack? [15:50:16] the latter. making a solution like https://gitlab.wikimedia.org/toolforge-repos/pywikibot-buildservice [15:51:11] ack, you can use an empty requests.txt [15:51:34] and get a full python runtime in the container with you! :) [15:51:39] yep [15:51:46] *requirements [15:52:06] you only want to install packages? Or also run some build scripts? [15:52:55] should be relatively easy to make the apt buildpack a possibility (if there's a need there's a way) [15:53:57] For this very particular case I only need apt and Procfile. the rest is static config and an entrypoint shell script. I'm not sure this is a highly generic need though. [15:54:24] meaning I'm not sure it is worth the work to make Aptfile work standalone [15:54:45] unless that is really trivial I guess [15:56:14] way more than running inline buildpacks xd [16:08:05] my whinging about using an empty requirements.txt or one of the other tricks to get a language runtime buildpack to trigger is really just being pedantic about image size and contents. That's not super helpful, so I will try to stop using it as an argument. [16:09:39] reviews welcome: https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/12 [16:10:31] oh, thanks david, just saw your comments [16:13:58] * arturo offline [16:24:55] np [16:25:16] bd808: I understand, and it bothers me too [16:41:59] ffmpeg has many, many dependencies... [16:49:42] I was looking at T365164, and I can see there are some Icinga alerts that used to be visible in alertmanager that now are only visible in Icinga [16:49:43] T365164: [wikireplicas] clouddb* free memory decreases over time - https://phabricator.wikimedia.org/T365164 [16:49:58] is there something broken in the icinga->alertmanager flow, or am I missing something? [16:54:36] * dhinus offline
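For reference, the build-service variant being discussed would likely reduce to a very small repo in the spirit of pywikibot-buildservice; the layout below is a guess at what it could contain, not a tested configuration or an existing repo:

```
# Hypothetical repo layout, mirroring the pywikibot-buildservice approach
# discussed above (assumption, untested):
Aptfile            # "redis-server"  -> pulled in by the apt buildpack
requirements.txt   # empty, only there to trigger the python buildpack
Procfile           # e.g. "web: ./entrypoint.sh"
entrypoint.sh      # exec redis-server --bind 0.0.0.0 --port 6379 ...
```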