[08:38:14] morning
[09:27:51] morning!
[09:28:06] o/
[10:25:47] created T360016 from yesterday's toolforge meeting, had a quick look so far and feels a bit complex to be fair :/
[10:25:48] T360016: [jobs-api,buildservice-api,envvars-api] evaluate crossplane for composite objects creation and maintenance - https://phabricator.wikimedia.org/T360016
[10:26:04] feel free to take over/add stuff
[10:33:15] I have been thinking about the database thing we discussed yesterday
[10:34:10] I think we should re-evaluate before introducing the DB. I'm worried about the extra maintenance, and the DB getting out of sync with the data plane, like what happens with openstack, which from time to time requires manual updates to the DB
[10:36:14] We already have that issue, it's just mixed up with the k8s object, so for example if we change the format we embed the data within the k8s object, or the overall data we put in there, then we have to resync all the objects (recreate the k8s objects)
[10:36:27] or if a k8s object gets deleted, we lose data
[10:37:01] it just disappears
[10:37:11] (or if it's malformed, etc.)
[10:37:44] we never had any problem so far. Updating the representation was solved in the past with versioning the k8s object, via a label
[10:38:01] yep, and it's getting more and more complicated
[10:38:54] and will just keep getting more complicated with every version + new feature addition
[10:39:15] I guess my point is: how to avoid that?
[10:39:24] we will always make changes to the software
[10:39:28] T359649
[10:39:28] T359649: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version - https://phabricator.wikimedia.org/T359649
[10:39:46] but not needing to embed our data structure in the k8s objects allows for easier upgrades
[10:39:54] we can evolve both of them separately
[10:41:28] you suggest some kind of migration script?
[10:41:57] yep
[10:42:21] not sure xd, can you rephrase?
[10:42:37] so we have jobs in version format 1
[10:42:54] the only way to force them to version format 2 would be some kind of migration script, no?
[10:43:05] that takes each old job and generates a new one in the newer format
[10:43:07] ah, yes, or manual updates if there aren't many
[10:44:10] how many objects do we have in the old format, even? let me see if I can get an answer to that
[10:44:18] sure
[10:45:15] ok, that was easy
[10:45:20] aborrero@tools-sgebastion-11:~$ kubectl sudo get cronjobs --all-namespaces --selector=app.kubernetes.io/version="1" | wc -l
[10:45:20] 402
[10:45:21] aborrero@tools-sgebastion-11:~$ kubectl sudo get cronjobs --all-namespaces --selector=app.kubernetes.io/version="2" | wc -l
[10:45:21] 1366
[10:47:09] are there any without the version label?
[10:47:16] (that's handled in the code too)
[10:47:25] aborrero@tools-sgebastion-11:~$ kubectl sudo get cronjobs --all-namespaces --selector=app.kubernetes.io/managed-by=toolforge-jobs-framework | wc -l
[10:47:25] 1767
[10:47:31] only one (1) apparently
[10:47:38] that feels weird
[10:47:38] xd
[10:48:13] that's only cronjobs right? not jobs/deployments?
[10:48:26] jobs I guess should be ok, as they are one-off
[10:50:00] yeah, only cronjobs
[10:50:41] seems enough though to make a script
[10:50:49] https://www.irccloud.com/pastebin/Tmoc0tM1/
[10:50:58] deployments (cont jobs) ^^^
[10:51:14] that does not add up, no?
[10:51:21] yeah, that's weird
[10:51:42] can't why explain at the moment
[10:51:49] can't explain why*
[10:54:15] ack, can you add that to the task?
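As a starting point for the migration script discussed above, here is a rough inventory sketch. It assumes the standard `kubernetes` Python client and the `app.kubernetes.io` labels visible in the kubectl output; the actual rewrite of each version-1 job into the newer format is not shown, and the function and variable names are illustrative only.

```python
# Hypothetical inventory step for a jobs-framework migration script: count
# Toolforge cronjobs per version label, mirroring the kubectl one-liners above.
# Assumes the standard `kubernetes` Python client; depending on the cluster
# version, CronJobs may live under BatchV1beta1Api instead of BatchV1Api.
from collections import Counter

from kubernetes import client, config


def count_cronjobs_by_version() -> Counter:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    batch = client.BatchV1Api()
    cronjobs = batch.list_cron_job_for_all_namespaces(
        label_selector="app.kubernetes.io/managed-by=toolforge-jobs-framework"
    )
    return Counter(
        (cj.metadata.labels or {}).get("app.kubernetes.io/version", "<unlabelled>")
        for cj in cronjobs.items
    )


if __name__ == "__main__":
    for version, count in sorted(count_cronjobs_by_version().items()):
        print(f"version {version}: {count} cronjobs")
```

Anything reported as `<unlabelled>` would be the objects that need special-casing, per the 10:47 exchange about cronjobs missing the version label.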
[10:54:29] sure
[10:55:20] thanks
[10:57:46] would this version format migration script relax the need for the extra DB?
[10:58:47] not really
[10:59:55] it would help simplify the current code
[11:00:23] (so it's easier to modify and maintain, fewer things to test, etc.)
[11:53:14] turns out you can do countdowns in phabricator, so https://phabricator.wikimedia.org/C10
[11:56:40] :D
[11:56:49] looks ominous
[12:08:12] may qualify for a good topic in the -cloud IRC channel hehe
[12:08:54] oh my
[12:10:11] what happens when it reaches 0?
[12:10:57] i start shutting down things
[12:12:44] hahahah, I meant on the phabricator side xd, like it sends a notification/email, closes the counter, the UI becomes red and on fire, or confetti shows up...
[12:12:59] no idea :D we'll see tomorrow it seems
[13:04:32] quick review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1010887? replacing tools-sgegrid-master with another VM used in designate monitoring
[13:13:05] taavi: +1
[13:13:21] ty
[13:37:12] taavi: wdyt? https://gerrit.wikimedia.org/r/1010558
[13:37:53] why are we parsing HTML there?
[13:38:12] because we don't know the version of the deb, which is only present in the html listing
[13:40:30] it seems like in the new setup the version is always -1.1 (instead of -00), I think we could just use that instead
[13:41:37] the script may break again if they decide to change the scheme, I prefer just to ignore the suffix
[13:43:17] another option is to not verify if there are debs, just that the component is present. But that doesn't prevent some human mistakes, like actually fetching the packages via reprepro
[13:43:45] which is harmless anyway. So maybe those 2 are my preferred options:
[13:43:59] a) parse the HTML like the patch
[13:44:00] i don't think reprepro will create the dir in /pool/ unless it has a package to add there
[13:44:16] oh
[13:44:16] ok
[13:44:39] taavi: I think I'm ready to switch tools VMs over to the new puppetserver. The other day you warned me about k8s re-using puppet certs... do you think it would be an adequate test to just migrate a single k8s exec node and confirm that it still works?
[13:45:10] andrewbogott: etcd reuses puppet certs. Maybe start there
[13:45:18] arturo: or you could just do something like `self.dst_version in result.text and ".deb" in result.text` which will effectively do what the current code does
[13:45:48] andrewbogott: yeah, it's the certs between k8s control nodes and etcd nodes. I think testing on a single etcd node first would be the safest
[13:45:55] taavi: good idea, will do that
[13:46:14] great, I'm going to flip etcd-16 in a moment
[13:47:53] hm, ferm just did this:
[13:47:55] https://www.irccloud.com/pastebin/zKYGZdvN/
[13:48:10] Which I'm guessing is because the new puppetdb doesn't know about everything yet
[13:48:17] mmmm
[13:48:20] that seems like puppetdb data has not been carried over
[13:48:28] warning, k8s outage incoming=
[13:48:29] ??
[13:49:08] oh, just one etcd server. Ok. Not an outage incoming
[13:49:55] taavi: I didn't migrate puppetdb data, I assume it will repopulate after migrating. Is that wrong?
[13:50:06] (Of course I didn't think about the transition)
[13:51:25] it would eventually fill up again, but I'm afraid that's not good enough here
[13:51:36] ok, let me revert and then we can strategize :)
[13:52:26] ok, it's back
[13:53:02] Now... what do you think, dump + export? Or can I add the new puppetdb as a backend for the old puppetmaster for a while and let it populate that way?
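On the 13:45 suggestion above: a minimal sketch of the substring check that could replace the HTML parsing in the cookbook, assuming a `requests`-style fetch of the reprepro pool listing. The function name and URL parameter are illustrative, not the cookbook's real API.

```python
# Hypothetical helper for the wmcs cookbook: instead of parsing the repository's
# HTML listing, just check that the target version and a ".deb" entry both
# appear in the page text, which is effectively what the current patch does.
import requests


def component_has_debs(pool_listing_url: str, dst_version: str) -> bool:
    result = requests.get(pool_listing_url, timeout=30)
    result.raise_for_status()
    return dst_version in result.text and ".deb" in result.text
```

This stays agnostic about the package revision suffix (-00 vs -1.1), which was the concern about the scheme changing again.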
[13:54:09] good q, I thought I did the latter yesterday already but apparently not
[13:54:35] * andrewbogott looks
[13:54:56] I think running puppet in noop mode everywhere after migrating is one option too. but how to do that atomically is an interesting question
[13:55:22] Oh yeah, that's this:
[13:55:25] https://www.irccloud.com/pastebin/bWZGlgsc/
[13:55:40] sure seems like that should've done something...
[13:56:19] btw, in deployment-prep puppet won't run at all because the host I'm running it on isn't in puppetdb yet. So unless that's some other problem, it suggests that the submit_only trick won't work there at all.
[13:57:06] taavi: I had a pointer in my CV to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1010503/8/modules/dynamicproxy/files/urlproxy.lua to showcase some Lua I had written in the past :-P
[13:57:22] taavi: please review again https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1010558 when you have a moment
[13:58:52] just fyi, I've invited ebomani (the qte/catalyst intern) to the envvars-cli repo. there's no coding work on catalyst yet so you may see her work on some wmcs tasks in the meantime!
[14:00:41] taavi: submit_only_hosts is implemented in the puppetdb4 template but not in the modern 7 one
[14:02:32] blancadesal: what do you mean by invited? I don't think you need any special access to send MRs and commit access should be tied to toolforge root rights
[14:02:34] is the puppet migration risky? If so, should we wait until after the grid deprecation? (avoiding having an outage while we decom grid things)
[14:03:24] blancadesal: that's awesome! Let me know if you don't find easy tasks, we can try to prepare a few
[14:04:40] long read about this problem on T338811
[14:04:40] T338811: puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811
[14:04:54] taavi: so that she won't have to send MRs from a fork. Is that tied to toolforge root rights?
[14:06:18] dcaro: she has started on T359558 and got lima-kilo set up
[14:06:19] T359558: [envvars-cli] Either hide or show envvars values, but not both - https://phabricator.wikimedia.org/T359558
[14:06:38] that was quick! any issues with lima-kilo?
[14:07:04] no, that was smooth :)
[14:07:32] :-)
[14:07:36] \o/
[14:08:31] just a year ago, getting an intern started on anything toolforge would have been much more difficult 🎉
[14:09:24] * arturo food
[14:09:36] blancadesal: i would say that at least 'maintainer' access (= ability to merge to main) is. then there's 'developer' access which lets you (at least) write to non-main branches to send MRs from, I have a weak preference for using forks instead of that to keep the ACLs more maintainable
[14:10:05] taavi: I gave her the developer role
[14:10:24] but we have not experimented with the different gitlab roles very much and it's possible that 'developer' also has something that should be restricted to roots. not sure
[14:11:40] hmm, this sounds like something that may need clarification/a clear policy written up somewhere
[14:12:03] it's probably fine this time but I'd rather not make this a habit
[14:12:38] 'people with root access also get full access to the repositories' is the current de facto policy
[14:13:48] taavi: everything I said about the template is wrong, since the puppetdb.conf template seems to be unused. So something else interesting is happening :(
[14:14:49] it would be good to have a canonical way to give limited rights on specific repos without having to make someone full root though
[14:21:34] blancadesal: what's the use case beyond not having to fork the repository?
[14:23:49] feeling included/not like an outsider
[14:30:08] i suspect that's a 'feature' in gitlab unfortunately
[14:30:33] there are lots of things to think about if opening up even limited access to the repos, like harbor tokens for MR branches, etc, etc
[14:30:33] we might want to start using forks ourselves maybe?
[14:31:01] i mean, that's one option if you want to get rid of that separation
[14:31:07] hmm, I wonder if CI is ready for that
[15:00:37] taavi: hacky solution is to move the k8s-masters to the new puppetserver first, then do the etcd nodes (at which point they'll be in puppetdb). How do you feel about that?
[15:00:46] that sort of glosses over any other possible races...
[15:01:11] i think a bit less hacky but still somewhat hacky solution would be to:
[15:01:18] * disable puppet everywhere
[15:01:28] * move the existing puppetserver to use the new puppetdb
[15:01:38] * run puppet in no-op mode everywhere
[15:01:41] * re-enable puppet
[15:01:48] * move to new puppetserver
[15:01:54] thoughts?
[15:02:34] That should work.
[15:03:22] I'm going to watch this meeting and then I'll try it on a test VM (= grid node)
[15:04:14] ack
[15:04:29] (I'm planning to watch the recording later as usual)
[15:06:20] yeah, maybe watching the recording at 1.50x speed is not a bad idea
[15:50:49] arturo: the prepare_upgrade cookbook expects the full version including the patch version, I'm a bit worried you running it without one has broken something
[15:51:16] broken the hiera setting?
[15:51:23] at least profile::wmcs::kubeadm::kubernetes_version in https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/638cf2a12dd2f5632f3c4eaa28780156d2fc61af%5E%21/#F0 is using the version without the patch now
[15:51:46] that definitely feels inconsistent
[15:53:00] profile::wmcs::kubeadm::kubernetes_version seems to be only used for kubeadm bootstrapping, which is not involved in the upgrade, no?
[15:53:40] yeah
[15:53:57] anyway the cookbook may need an update, to fail if a full version is not provided
[15:54:09] yeah, I will send a patch
[15:54:14] thanks
[16:03:16] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1010901
[16:06:57] commented
[16:21:38] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1010906
[16:25:54] which servers are you seeing those issues on?
[16:26:09] toolsbeta-sgebastion-05 for example
[16:26:30] maybe we can just get rid of it instead
[16:27:07] sure, but there are a bunch of buster/bullseye servers everywhere, I don't think we should block on that
[16:27:41] as far as I'm aware the bastions are the only ones that have this issue
[16:28:32] and as the grid is being decom'd tomorrow I was hoping we would not need 1.24 packages for anything other than bookworm
[16:28:41] fair
[16:31:44] taavi: oddly your suggestion worked but only partially.
[16:31:46] https://www.irccloud.com/pastebin/164Czik7/
[16:32:07] pretty close, but still one missing host
[16:33:26] I am inclined to just move forward and let things sort themselves out via repeated cumin-forced puppet runs
[16:33:52] go ahead if you're confident you won't cause an outage
[16:34:12] did you try that first in toolsbeta?
[16:34:53] I wasn't as worried about temporary outages in toolsbeta. So yes, I tried it, but it might've hiccuped for a few minutes on the way
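Following up on the prepare_upgrade discussion at 15:50 onwards, here is a minimal sketch of the kind of argument validation that the cookbook patch could add so a bare minor version is rejected. The names and error handling are assumptions for illustration, not the cookbook's actual code.

```python
# Hypothetical guard for the prepare_upgrade cookbook: require a full
# major.minor.patch Kubernetes version (e.g. "1.24.17"), so that hiera keys
# such as profile::wmcs::kubeadm::kubernetes_version never end up with a
# bare "1.24" again.
import re

FULL_VERSION_RE = re.compile(r"^\d+\.\d+\.\d+$")


def validate_full_version(version: str) -> str:
    if not FULL_VERSION_RE.match(version):
        raise ValueError(f"expected a full major.minor.patch version, got {version!r}")
    return version
```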
[16:35:33] if toolsbeta worked just fine, that gives me confidence
[16:38:52] I would have more confidence if someone who isn't me double-checks that k8s is still working in toolsbeta :)
[16:39:12] I assume we'd be getting alerts if it didn't...?
[16:39:18] andrewbogott: it is working! I'm working with it at this very moment
[16:39:29] oh good.
[16:39:56] the k8s API -and therefore etcd- seems to be responding just fine
[16:41:02] OK, so, taavi, I've moved these hosts to the new puppet infra: tools-k8s-etcd-16, tools-k8s-worker-102, tools-k8s-control-8. Do you see any distress, and/or suggest other canaries to check before I flip the big switch?
[16:41:36] one second
[16:42:49] as far as I can tell we're fine
[16:43:38] ok. Then, here goes...
[16:48:46] is there a way to execute a cookbook in a way that if a cumin command fails, it gives me the original command + the stdout/stderr of the failed command? like, an envvar or something?
[16:49:15] all I get is `spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)` which is really not useful
[16:49:52] I could figure out the original command by backtracking to the calling code, but it is still very cumbersome
[16:51:42] ok, nevermind, I forgot about /var/log/spicerack/
[16:57:11] taavi: please review https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1010914
[17:02:48] * arturo offline
[17:46:30] andrewbogott: I think you might want to scale up `profile::puppetserver::java_max_mem` on the tools puppetserver
[17:46:57] Yeah, it's really struggling
[17:49:11] although 'top' shows it as more cpu bound
[17:49:18] but that could be a side-effect of lacking ram
[17:51:40] if that does not help, it might be time to scale the tools puppet setup horizontally
[17:53:08] java_max_mem seems to help, not sure yet if it's enough
[17:54:18] seems fine now
[17:54:25] going to drop back to 8 cpus and see if it still manages
[17:54:30] jvms will become cpu bound when they are starved for ram. The GC loop will thrash the cpu
[17:54:43] that certainly fits what I'm seeing
[17:55:42] * bd808 has very ungenerous thoughts about java's garbage collection implementation
[17:57:47] sounds like experience to me :)
[17:57:56] * dcaro off
[17:57:57] cya tomorrow
[18:11:20] * bd808 lunch
[19:19:16] The 'tools' project is now using puppet7 with a new puppetserver and puppetdb host. Please let me know if oddities appear. I'm going to shut down, but not delete, the old puppet infra there.
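Regarding the 16:48 question about surfacing the failing cumin command from a cookbook run: besides reading the logs under /var/log/spicerack/, a cookbook can log the command itself before re-raising. A rough sketch follows, assuming spicerack's RemoteHosts.run_sync(); the wrapper name is illustrative and not an existing helper.

```python
# Hypothetical wrapper around spicerack's RemoteHosts.run_sync() that records
# which command failed before re-raising, so a bare
# "RemoteExecutionError: Cumin execution failed (exit_code=2)" at least points
# to the offending command. The full stdout/stderr still ends up in
# /var/log/spicerack/ as noted above.
import logging

from spicerack.remote import RemoteExecutionError, RemoteHosts

LOGGER = logging.getLogger(__name__)


def run_sync_logged(hosts: RemoteHosts, command: str):
    try:
        return hosts.run_sync(command)
    except RemoteExecutionError:
        LOGGER.error("cumin command failed on %s: %s", hosts, command)
        raise
```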