[00:02:33] * bd808 off
[08:52:56] * arturo online
[09:17:44] morning!
[09:17:53] morning!
[09:18:03] welcome to π day :)
[09:18:19] did you mean "toolforge grid engine shutdown day"?
[09:18:37] yelp
[09:29:19] I'm shutting down the toolsbeta grid
[09:39:58] taavi: if you need us to do anything just let us know
[11:00:03] ok, it's time
[11:00:19] (no fancy fireworks or explosions on phab timers it seems)
[11:00:50] I'm starting by running the disable grid access scripts for the remaining tools, this will archive the crontab files and create the readme files in tool home directories
[11:02:31] now stopping the remaining grid-related VMs
[11:03:03] done
[11:03:43] now merging the dynamicproxy removal patch
[11:06:02] * dhinus tunes in for the big moment
[11:07:03] * dhinus is disappointed by the absence of fireworks, but relieved by the absence of explosions
[11:09:13] I am more or less done already
[11:09:16] 🎊 🦄 🎊
[11:10:27] 🎉
[11:10:41] 👏🏻
[11:10:56] quick review here? https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1011095
[11:11:04] should we add something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1011096 ?
[11:12:26] I was thinking on adding also a shell wrapper to qstat/jsub to show a message instead of just hanging and timing out
[11:12:29] dcaro: LGTM
[11:13:05] I think the idea by taavi may be to just rebuild the bastions without jsub at all, he can confirm
[11:13:12] a shell wrapper sounds good, but I'm not a huge fan of a MOTD banner
[11:13:19] I'm indeed planning to rebuild the bastions quite soon
[11:13:36] good bye grid 👋
[11:14:10] taavi: I know, it's just for a few days, I agree that the shell scripts might be more effective
[11:14:17] so long, and thanks for all the jobs
[11:14:47] hahahah
[11:15:08] :D
[11:15:28] we have one less reason to keep using NFS. The grid had a strong dependency on it.
[11:20:19] can I get a +1 here please? https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1011095
[11:20:27] cc taavi
[11:20:49] +1
[11:20:56] thanks
[11:27:49] oh, we changed from tools-puppetmaster to tools-puppetserver xd
[11:28:09] old habits die hard
[11:30:42] hmm... in toolsbeta the instance is -1, not -01, meh
[11:33:07] when someone has a moment, I'm looking for reviews for the patches starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1010504
[11:33:16] this is to add support for svc.toolforge.org in dynamicproxy
[11:34:28] I will look later
[11:42:36] hmmm... I'm really spoiled by autoformatters
[11:43:36] I don't care if there's a non-functional extra space, just get rid of it for me instead of complaining please?
[11:47:56] last quick review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1011096
[11:49:58] dcaro: done
[11:50:03] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1011095 <--- quick review here too?
[11:52:14] wait, wrong link
[11:52:32] correct link: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1011098
[11:52:40] xd /me started wondering
[11:57:44] dcaro: I'm sorry, I did a last minute update
[11:57:53] np
[12:10:10] hmmm.... I'm confused.... the patch I just merged is not being applied on toolsbeta-bastion-6 :/, it did when I cherry-picked it though
[12:11:52] does it work if you run `sudo puppetserver-deploy-code` on the puppetserver?
[12:12:17] there's a git hook that's supposed to run that automatically after the code is updated :/
[12:13:04] yep \o/
[12:13:08] we should do a puppet 7 demo
[12:13:09] !
[12:13:38] taavi: we missed this before, no? https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1011102
[12:13:40] andrewbogott: ok, so apparently we need https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009798 regardless of the hooks I added :/
[12:21:18] * dcaro lunch, ping me on tg if something is needed, be back in a bit
[12:27:49] komla: sorry for probably flooding your inbox with my mass phab task decline
[12:29:10] I'm late, but nevertheless here for the excitement. Congratulations to all of you! 🎉
[12:29:43] * arturo waves to balloons !
[12:30:06] oh hello balloons! it is indeed finally done now
[12:33:06] * balloons waves back to everyone
[12:35:26] hello balloons!
[12:35:38] I'm very proud of you all. Shutting down something like this is a huge accomplishment. I hope you celebrate 🥳
[12:38:26] it is your doing too!
[12:40:08] taavi: Any interest in troubleshooting yet one more puppetdb host? This time it puppetized just fine but no one will talk to it. deployment-puppetdb05.deployment-prep.eqiad1.wikimedia.cloud backing deployment-puppetserver-1
[12:40:19] I'll have a look
[12:40:38] yep, it's largely because of balloons pushing for a deadline that we're doing it 'now' rather than still doing it 'eventually'.
[12:40:42] thanks taavi
[12:42:18] I'm pretty sure that this is me writing eqiad1.wikimedia.cloud when I should write eqiad.wmflabs, or the other way around, but no matter how many times I re-read the hiera code I can't find my mistake.
[12:42:33] andrewbogott: where are you seeing things not talking to it?
[12:42:42] You can see the problem with run-puppet-agent on deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud
[12:42:43] also puppet-git-sync-upstream.service is broken on the puppetserver too
[12:43:03] ok, I'll fix that
[12:43:14] FYI I'm upgrading all toolsbeta workers now to 1.24
[12:44:17] /var/log/puppetserver/puppetserver.log has Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
[12:44:45] java? :-(
[12:45:05] puppetserver is written in clojure which runs on the JVM
[12:45:58] accessing that url works fine with curl
[12:46:00] I like that they re-wrote it but moved from ruby to clojure. "let's make this even harder to read!"
[12:46:40] taavi: is there any special consideration for ingress nodes to upgrade?
[12:46:54] "we'll never have any trouble hiring clojure devs!"
[12:47:29] arturo: yes, https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Upgrading_Kubernetes#Ingress_nodes
[12:48:30] balloons: are you getting to enjoy some leisure? Planning elaborate landscaping/cultivation projects?
[12:48:46] taavi: thanks
[12:52:05] andrewbogott: I have a feeling that deployment-puppetdb05 needs to become a client of deployment-puppetserver-1
[12:52:26] right now it's using the shared puppetmaster, and that one has a different CA
[12:52:31] taavi: I tried that in one iteration and it didn't help much...
[12:52:42] but I can switch it over for troubleshooting purposes, stay tuned...
[12:52:58] and deployment-puppetserver-1 should probably be its own client, for the same reason
[12:53:26] ok, that I haven't tried
[12:53:38] I was trying to eliminate the 'manages itself' use case
[12:54:34] in tools, at least, the puppetdb is a client of the puppetserver but the puppetserver is a client of the central master
[12:55:00] Andrew, I've managed to unwind yes, at least as much as I can. It's a learning exercise for me
[12:55:13] taavi: I see the scale-down instructions about the ingress controller. But, noob question, aren't taints and labels supposed to achieve the same?
[12:57:03] taavi: ok, deployment-puppetdb05 is now a client of puppetserver-1. Of course puppet runs don't work because...
[12:57:16] but the cert dance is done at least
[12:58:14] (part of why I don't like the self-hosted puppetmaster thing is this chicken/egg mess we are now in)
[12:59:16] yeah..
[13:00:51] I kept rechecking to make sure this was the same setup as tools and toolsbeta, and finally decided "taavi must've done another magic step that I don't know about." I'm sad to learn that that is not the case :/
[13:01:10] If you want to rebuild that puppetdb host from scratch I won't mind, it's already the third or fourth incarnation.
[13:02:08] I manually hacked out the puppetdb setup from the puppetserver config to get puppet to run successfully on the puppetdb host to update the server certs
[13:02:31] let's see what happens next with the config back there
[13:03:02] ok! Nice to see you trying all the exact same things I tried last night, maybe it will work when you do it.
[13:03:36] :-)
[13:12:57] andrewbogott: so I made the server a client of its own, and now it works
[13:15:40] huh
[13:15:50] why does tools work then, I wonder?
[13:16:37] or hmh
[13:16:39] now it's broken again
[13:17:12] * andrewbogott doesn't know whether to prefer success or consistency
[13:18:14] ok i did `systemctl restart nginx` on the puppetdb host and now it really seems to work
[13:19:11] andrewbogott: I'm declaring success
[13:19:34] ok, thank you! I'm going to reboot those servers just to be sure and then I'll see about adding some more clients.
[13:42:33] taavi: is this familiar?
[13:42:36] https://www.irccloud.com/pastebin/oOHkR6Jk/
[13:43:27] John's notes suggest that it's because of a uid mismatch between client + server https://phabricator.wikimedia.org/T234315#5552284
[13:44:42] andrewbogott: we're missing https://gerrit.wikimedia.org/g/operations/puppet/+/c4e87775a53e8fe67231d183e8f0aadf89227dad/hieradata/role/common/puppetserver.yaml#58 at least
[13:45:13] ok, trying...
[13:45:38] and likely profile::puppetserver::volatile
[13:46:22] * andrewbogott tries to not be curious why this is different from tools/toolsbeta
[13:46:44] tools/toolsbeta does not use that geoip data
[13:47:17] ah, I see
[13:50:53] extra mounts reduced (but didn't eliminate) the errors. profile::puppetserver::volatile::geoip_fetch_private: seems to be a no-op on the puppetserver.
[13:51:06] now the only error is
[13:51:07] Error: /Stage[main]/Geoip::Data::Puppet/File[/usr/share/GeoIP]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///volatile/GeoIP
[13:52:07] I don't see that profile applied to deployment-puppetserver-1?
[13:55:23] oh, I misunderstood, hang on...
[13:55:38] you want to leave that hiera key to its default value
[14:02:04] komla, Raymond_Ndibe, meeting time
[14:02:12] (also bd808 but only if you want to)
[14:31:48] profile::puppetserver::volatile is sending me down a bit of a rabbit hole... it's not applied on the old puppet5 master in deployment-prep is it?
[14:32:45] no, the code path used there is via puppetmaster::geoip
[14:33:26] * andrewbogott dives deeper into the rabbit hole
[14:33:27] so how broken are things currently? can we just leave them that way until someone who cares about that data being there appears?
[14:33:49] I'm not sure. It seems like we were going to get puppet errors on every node
[14:36:36] profile::puppetserver::volatile includes profile::puppetserver
[14:36:39] so that's not grea
[14:36:41] *great
[14:38:04] if we think this is worth spending time on, the easiest fix is likely to split the geoip things to a different profile from the rest of stuff in profile::puppetserver::volatile
[14:39:29] If it's not worth spending time on... do we leave things in an error state or do you think there's a way to exclude the geoip stuff entirely?
[14:41:05] I honestly do not know.
[14:42:13] ok!
[14:48:06] there's alerts on puppet stuff from toolsbeta (silenced) and paws, not sure if that's related, if not I can try to give it a look (I was ignoring it because I thought it was related)
[14:48:57] Paws appears to be `Error: The certificate 'CN=Puppet CA: paws-puppetmaster-01.paws.eqiad.wmflabs' has expired, verify time is synchronized`
[14:49:48] andrewbogott: ^
[14:51:11] That's the old master, can be shut down.
[14:51:13] I'll look
[14:51:55] Although shutting it off may not silence the alert
[14:52:26] Congratulations on the plan and hard work folks. Sorry I missed the real-time event and the following team meeting. DST shift plus meeting time shift made my brain nope out of being in front of a computer at either moment in time.
[14:53:30] dcaro: is that in icinga or alertmanager?
[14:53:45] alertmanager
[14:53:51] https://alerts.wikimedia.org/?q=team%3Dwmcs
[14:54:52] I think that the puppetmaster one is gone though
[14:55:16] ok, I guess shutting them down did make the alerts go away
[14:55:58] I was going to shut those off today anyway :)
[14:56:14] kind of weird to see so much code I wrote on https://disabled-tools.toolforge.org/ this week.
[14:57:26] A parent shouldn't outlive their children but software devs pretty much always outlive their code
[14:58:24] * arturo bbiab
[15:08:46] andrewbogott: a fun aspect to think about is how much of that code was written while I had a manager title and was not supposed to be writing code for work according to my managers and their managers. Small acts of rebellion. ;)
[15:09:23] "I wrote hacky code so my team didn't have to"
[15:11:57] andrewbogott: I just got this https://phabricator.wikimedia.org/P58797 when trying to revoke certs for a host I'm removing. restarting nginx did not work so I think that means I need to make tools-puppetserver-01 a client of itself :/
[15:13:43] huh. Well, that's consistent at least!
[15:14:15] hopefully that's only needed w/puppetdb and not for all puppetservers
[15:15:31] i would assume that's the case
[15:16:58] and that fixed the issue I was having
[15:17:25] btw did you see https://phabricator.wikimedia.org/P58796 in the puppetserver log?
[15:20:36] ugh, no, I didn't. But I guess it won't be an issue for a few years.
[15:21:09] So the old puppet master is shut down, though is there a reason that the paws bastion is getting that cert error, or why it seems to be trying to use a local puppet master?
[15:22:46] Rook: I'll look, maybe that one escaped migration
[15:23:31] Thanks!
[15:23:59] Rook: should it just use the central vps puppetmaster? I lost track of which hosts needed the local puppet patches
[15:24:10] Only nfs needs the local
[15:25:13] I don't understand why it cares about that cert at all
[16:39:41] how do you usually devel/deploy local changes to jobs-api in lima-kilo?
[16:49:06] toolforge_deploy_mr.py
[16:49:18] to deploy from the images/charts made in CI for an open PR
[16:51:31] dcaro: ok! where is that script?
[16:51:45] nevermind, I found it!
[16:51:49] thanks!
[16:52:24] it's in the path yep :) I think there's a note in the README for lima-kilo too, if not we should add it
[16:52:53] I was very distracted by the Makefile not playing well with buildkit on the lima-kilo VM
[16:53:37] that's something we have not sorted out yet, being able to build the projects there (would be nice for example to have pack installed too, to inspect buildservice images/etc.)
[16:53:52] ok
[17:00:54] this should be ready for another round of reviews: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/66
[17:01:38] * arturo offline
[17:04:31] I was just helping essie (the catalyst intern) with her MR for envvars-cli, which made me reflect on that as a team, we've converged on ways of working e.g. around git workflows (basically trying to emulate gerrit) and other expectations that I'm not sure are easy for newcomers to know about unless explicitly told
[17:10:17] ^ I agree :) I don't try to emulate gerrit though
[17:11:08] a CONTRIBUTING.md file might be helpful, or a wiki page if none exists yet.
[17:12:07] +1 for a contributing.md, better if all to the wiki for reusage xd
[17:22:56] I also created a python script to make it easier for her to test her changes inside lima-kilo. where would I share it in case it would be useful to other folks too?
[17:24:55] lima-kilo repo itself? what does the script do?
[17:29:57] it automates cloning/fetching changes from a gitlab branch and setting up a poetry shell to test stuff in basically `./test_cli {envvars | builds}`, then it lists the possible branches and sets up everything for you
[17:31:53] why not install the package generated by ci if you are testing the branch?
[17:32:01] is she developing inside lima-kilo?
[17:32:29] (that's an option we could explore I guess, most editors nowadays have a 'remote' kind of feature, we could ssh to the VM)
[17:33:49] the package generated by ci is just from MRs though, not a branch?
[17:34:50] true
[17:35:09] hmm, not sure now if branches get CI run on them too, maybe? xd
[17:35:35] if so, it might generate a package, though the script might need some modifications (it currently just looks for open MRs)
[17:35:35] they don't, afaik
[17:35:57] vscode does the remote thing really well, but I haven't taken the time to figure out if it can connect to lima-kilo running on a remote server
[17:36:08] Rook: puppet should be sorted out in paws now, let me know if you bump into anything new there
[17:58:21] Will do, thank you!
[18:03:54] sorry for that proxy-03 page. and now victorops has logged me out
[18:04:19] taavi: does it need fixing?
[18:04:35] or just acking?
[18:04:38] * andrewbogott acks
[18:04:46] I'm on it. traffic is failed over to the other host
[18:05:09] ok. know why it died?
[18:06:09] I merged a patch to the nginx config that only works if you have a single domain in dynamicproxy. codfw1dev only has codfw1dev.wmcloud.org, and eqiad1 has both wmflabs.org and wmcloud.org
[18:06:29] ah, ok
[18:06:37] I'll leave you to it then, unless there's something I can do to help
[18:08:36] * bd808 lunch
[18:10:50] * dcaro off
[21:21:42] I think still holds up pretty well 8 years later as a high level vision of what Toolforge should strive to provide to the community. We have come a long way, but I'm sure all y'all have ideas about how we can go farther together.
[21:32:54] maybe we should write down and publish a similar vision for the next few years.
[21:33:53] btw, is anyone planning to post anything about the grid shutdown to cloud-announce? bd808 or komla maybe?
[21:45:53] I sort of assumed this was komla's thing to do, but yeah something should go out. Has anyone heard from komla today?
[21:59:09] I have not.
[22:11:01] andrewbogott: do you need any help with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009798? lack of that is still breaking updates of the puppet repo on our puppet 7 servers
[22:12:09] taavi: I haven't been paying attention to it but I can definitely move it forward soon.
[22:13:00] I would do it at least before you make more servers we'll need to fix by hand
[22:13:18] ok
[22:13:35] Wait, will we need to fix things by hand /after/ that patch is merged? Everything will catch up right?
[22:14:32] oh, right, you will only need to fix the servers with puppetdb (= are their own clients) by hand
[22:18:06] yep, ok
[23:04:48] email sent -- https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/L32RGQLPBPO7KHGYE6WKXJGZKDPUQULB/
[23:09:31] Thanks for writing, bd808
[23:09:35] * andrewbogott -> cook dinner
[23:11:52] We have maybe the first(?) request for quota to use the S3 gateway in Cloud VPS. T360162
[23:11:55] T360162: Increase Object Storage quota for QRank - https://phabricator.wikimedia.org/T360162
[23:37:27] * bd808 off
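
A minimal Python sketch of the wrapper idea floated at 11:12:26: a stub installed in place of `jsub`/`qstat` that prints a notice and exits right away instead of hanging until a timeout. The message wording, the function names and the exit code are assumptions for illustration; the log does not show what, if anything, was actually deployed.

```python
import os
import sys
from typing import Sequence

# Hypothetical wording. The README reference follows the 11:00:50 message,
# which says readme files were created in the tool home directories.
NOTICE = (
    "{cmd}: the Toolforge grid engine has been shut down, "
    "so this command no longer works. "
    "See the README file in your tool's home directory."
)


def shutdown_notice(argv0: str) -> str:
    """Build the notice for whichever grid command name was invoked."""
    cmd = os.path.basename(argv0)
    return NOTICE.format(cmd=cmd)


def main(argv: Sequence[str]) -> int:
    """Print the notice to stderr and return a non-zero exit status."""
    print(shutdown_notice(argv[0]), file=sys.stderr)
    return 1
```

An entry point installed under the old command names would end with `sys.exit(main(sys.argv))`, so scripts that still call the grid commands fail fast with a readable message.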