[08:40:24] dcaro: for a tool (= app) with several components, would we ever want to scope envvars to the component where they are needed, or is the easier option of considering that config is global to the tool good enough?
[08:40:43] morning!
[08:41:08] morning :)
[08:41:20] blancadesal: I think that for starters considering it global is good enough, though it should not be too hard to be able to scope it to a specific component
[08:41:53] note that the envvars are not meant to be put in the manifest (because they might be secrets)
[08:43:19] > toolsbeta being able to page in the first place seems like a bug
[08:43:19] I'm surprised about that too
[08:45:25] I was thinking that secrets could be put in the manifest as something like `name: $(UI_RUN_SECRET)`, without the explicit value, and also separate config from secrets so that the user is aware of which needs to be kept secret.
[08:47:12] xd, that comes back to having two different services, secrets and envvars, we decided to go with one for simplicity
[08:47:21] that would be a bigger change
[08:47:37] on the backend it could still be one, at least for now
[08:47:41] what's the time-scope of what you are thinking of?
[08:49:38] time-scope: nearest future, but with the manifest format being flexible enough to accommodate some of the future options we're already thinking about
[08:52:32] I think the manifest could be used as a way for users to define the app, in addition to being an internal format
[08:52:34] in your opinion, how are users supposed to manage secrets/envvars in a push-to-deploy scenario?
[08:52:45] good morning!
[08:52:51] hey :)
[08:53:42] I have some ideas about that but need to grab a ☕ before the meeting :))
[08:56:00] the way heroku/etc do it is manually set on the UI or cli
[08:56:14] (we mimicked the envvars from heroku)
[08:56:58] ok
[08:57:27] for the nearest future, I would say put in as little as you need; adding anything extra would mean that any future decision will need to change it, so if we can go without specifying the envvars as a start and defer the decision to later, that might be better
[08:57:54] secrets will have to be set manually for sure; other config _could_ be in a manifest, if we choose to
[08:57:59] dcaro: ack
[08:59:10] the key is in the balance...
[08:59:28] (most helpless sentence ever xd)
[08:59:40] hahahaa
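(A minimal sketch of the "set manually on the UI or cli" model discussed above, mirroring heroku's config vars; the exact `toolforge envvars` invocation is an assumption, and the variable name/value are made up:)
```
# Secrets stay out of the manifest and are set out-of-band via the CLI (assumed invocation):
toolforge envvars create UI_RUN_SECRET "s3cr3t-value"
toolforge envvars list

# The heroku equivalent this mimics:
heroku config:set UI_RUN_SECRET=s3cr3t-value
```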
[09:56:30] I'm reading the last comment on T357388 and wondering how difficult it would be to introduce some debug shortcuts, to for example get a shell inside a job container, whether it is a build-service image or a generic one
[09:56:30] T357388: toolforge jobs current image aliases - https://phabricator.wikimedia.org/T357388
[09:58:26] should be simple enough, yes; not sure we'd even need anything specific
[09:58:38] (build-service images have bash installed by default)
[09:59:54] in my mind, eventually users would not log into bastions, but into shell containers on k8s
[10:00:18] yes, that idea has been circulating for years
[10:00:38] yep
[10:01:01] soon without the grid, that would be way easier :)
[10:01:47] jobo: T296434
[10:01:48] T296434: [SPIKE] Identify potential metrics which could be computed for tools - https://phabricator.wikimedia.org/T296434
[10:01:57] I wonder if there are web-based shells/ttys for k8s
[10:02:22] jobo: T299152
[10:02:22] T299152: View Tool Version Control Information - https://phabricator.wikimedia.org/T299152
[10:03:40] I had a quick look at those a couple of years back, I found something but it was not open-source iirc
[10:03:56] hopefully there might be something now
[10:10:26] maybe https://cloudtty.github.io/cloudtty/
[10:11:11] also possibly related: T311917
[10:11:12] T311917: Make `webservice shell` a standalone tool - https://phabricator.wikimedia.org/T311917
[10:11:59] I think we should focus on eliminating the need for dedicated bastion hosts (by e.g. making the CLI installable locally and building a web interface) instead of inventing "cloud-native bastion hosts" or whatever
[10:13:08] yeah, but the point here is that debugging and testing are not going to be any easier if there are no bastions and we don't come up with an alternative
[10:13:26] what do you think?
[10:18:08] not sure; if there are no bastions, people won't expect that whatever they put on the bastion would be available in the image
[10:19:52] but I agree that at least being able to shell into the webservice environment might help some debugging (though `docker run -ti -- bash` kind of gives you some of that, without envvars/nfs)
[10:20:51] heroku has ssh tunneling into dynos https://devcenter.heroku.com/articles/exec; iirc, with Digital Ocean you can get a web-based tty
[10:21:56] yep, that'd be a similar thing to 'kubectl exec', needs a running pod though
[10:22:09] yeah, I don't think we can fully get rid of all forms of shell access, something like that is what I have in mind too
[10:22:39] whether it's in the running pod or spawns a new one doesn't seem like that significant of a difference
[10:23:07] `kubectl debug` allows using a different image on the same pod, also interesting
[10:42:42] inject a new container in the pod with a different image?
[10:43:04] I think I never got to fully understand the debug/exec dance in k8s
[11:07:25] yep, that way you can have an image with debugging stuff connected to the same namespaces as the thin running image
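(A minimal sketch of the exec/debug options mentioned above, assuming a running pod in the tool's namespace; pod, container and image names are placeholders:)
```
# Shell inside an existing container of a running pod (build-service images ship bash):
kubectl exec -it <pod-name> -- bash

# kubectl debug attaches an ephemeral container with a different image to the same pod,
# sharing its namespaces, so debugging tools can come along without fattening the thin
# running image:
kubectl debug -it <pod-name> --image=<debug-image> --target=<container-name> -- bash
```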
[11:29:53] I'm rebooting clouddumps1002, in theory this should have no impact as all traffic was moved to 1001 yesterday
[11:32:22] do we have a list of HW refreshes happening this FY?
[11:33:04] I believe andrew has a spreadsheet somewhere
[11:34:41] I think I found something
[11:36:19] I needed a combo of this https://docs.google.com/spreadsheets/d/1FkBT8BJfN7t0r9ZhE1NLGheNJ5KCRQNfljQicb8CW30/edit and this https://docs.google.com/spreadsheets/d/1y3kh8JAYlb3VqJOazwq7y6EksIKlBw-GY5IaWXqQ_VA/edit
[11:43:03] FYI I'm going to reimage cloudvirt1031 into the single NIC setup
[11:45:27] moritzm: both clouddumps hosts have now been rebooted, I closed https://phabricator.wikimedia.org/T321313
[11:48:28] great, thanks
[11:51:38] * dcaro lunch
[11:52:59] FYI cloudvirtlocal1001 seems to be in the maintenance aggregate, not sure if that's expected
[11:53:09] web ttys are fancy, but my feeling (possibly completely wrong!) is that people who want to run shell commands will usually prefer to run them from a terminal, if there's an easy-to-install CLI and you don't need to think about bastions, etc.
[11:53:18] arturo: looking, I restarted cloudvirtlocal1001 yesterday
[11:54:05] ah, I think I know
[11:55:08] I tried first using the cookbook wmcs.openstack.cloudvirt.safe_reboot
[11:55:28] that failed halfway through, but it probably changed the aggregate before failing
[11:56:04] not sure if the aggregate has any impact on cloudvirtlocal though, as we don't schedule new VMs on them
[11:57:29] yeah, maybe just clean up by hand, and that's it
[11:57:53] done
[12:04:32] can I get a +1 here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003616
[12:05:16] looking
[12:05:44] +1
[12:08:40] thanks
[12:11:31] about the tty/ssh into debug containers discussion: in addition, some users might still want kubectl access to manage/troubleshoot their tools. is that at all doable without bastion hosts?
[12:16:50] good question. i guess the main questions are accessing the api endpoint (doable, either by exposing it directly on the internet, inventing some authenticating proxy, or having people ssh forward via some host), and k8s api authentication (currently certs in NFS, again there are some options but none are super simple)
[12:19:00] maybe we could keep the bastions _only_ for kubectl and in general "advanced" usage? and migrate all the simple use cases (like running a shell) to the CLI
[12:19:03] at least as a first step
[12:19:18] i mean yes, the move from bastions to other interfaces will be incremental
[12:19:49] exposing the raw k8s API to the internet sounds scary
[12:20:07] agree
[12:22:40] I think that the step there would be accessing the k8s cluster from within the shell in k8s, with no bastion involved
[12:23:58] afaik, giving direct access to their backends is not something any hosting provider does, if not for security reasons then because they want to abstract it completely. I also think that cutting our users off from raw k8s would be very unpopular
[12:24:43] I think most users would not mind, only a few power users actually use the k8s backend directly
[12:25:30] those are the ones who scream the loudest though xd
[12:25:36] yep
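(A rough sketch of the "ssh forward via some host" option mentioned above, not a decision; hostnames are placeholders and the k8s API authentication problem is left untouched:)
```
# Forward the non-public k8s API through a bastion-like host:
ssh -N -L 6443:<k8s-api-host>:6443 <user>@<bastion-host> &

# Point kubectl at the local end of the tunnel; TLS verification is skipped here only
# because the API certificate will not match 127.0.0.1:
kubectl --server=https://127.0.0.1:6443 --insecure-skip-tls-verify=true get pods
```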
[12:35:53] Hi, can a WMCS SRE take a look at this patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003420/2 This fixed sending mail in the beta cluster and I want to stop cherry-picking it
[12:36:02] I'd be grateful
[12:58:51] LGTM, though taavi probably has better judgement on it
[13:00:31] Amir1: PCC is showing changes on unrelated hosts though, https://puppet-compiler.wmflabs.org/output/1003420/1375/bastion-eqiad1-03.bastion.eqiad1.wikimedia.cloud/index.html
[13:01:04] so I think mediawiki_smarthosts needs to be empty by default on WMCS and then be overridden on beta to point to deployment-mx03
[13:01:37] Yup
[13:01:51] Why is it even set fleet-wide in wmcs?
[13:02:33] no clue :D
[13:03:02] is any other host using it though? as in, does it have any effect at all on VMs right now?
[13:03:30] it would have an effect on practically all VMs except toolforge, which does its own thing
[13:04:09] well, the actual change would be pretty much a no-op, but the config file would be changed and it's trivial to avoid, so I think I'd prefer that
[13:05:29] I mean, before this change, did setting mediawiki_smarthosts have any effect on the VMs?
[13:05:54] no, it would have been completely ignored
[13:10:03] I'm seeing this nova-compute error on a freshly reimaged host, cloudvirt1031
[13:10:04] nova.exception.InvalidConfiguration: No local node identity found, but this is not our first startup on this host. Refusing to start after potentially having lost that state!
[13:10:17] I was not aware of any nova state change required to reimage a hypervisor
[13:10:21] is this something new?
[13:10:45] do we need any manual step on the openstack API to reset the compute agent or something?
[13:13:03] * arturo reads https://docs.openstack.org/nova/latest/admin/compute-node-identification.html
[13:22:53] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003753
[13:23:00] I will merge all three together
[13:24:01] * taavi looking
[13:25:01] Amir1: Warning: Failed to compile catalog for node bastion-eqiad1-03.bastion.eqiad1.wikimedia.cloud (retrying with pson): Function lookup() did not find a value for the name 'profile::mail::default_mail_relay::mediawiki_smarthosts' (file: /srv/jenkins/puppet-compiler/1378/change/src/modules/profile/manifests/mail/default_mail_relay.pp, line: 6) on node bastion-eqiad1-03.bastion.eqiad1.wikimedia.cloud
[13:25:32] FWIW, it's also set in codfw1dev, but that has labtestwiki so it actually might have had a broken mail system
[13:25:40] ah, I'll make it empty then
[13:26:51] taavi: what about now?
[13:27:41] well, in codfw1dev it's set to mx1001/2001, that's obviously not going to work :D but I'll have a look at that later
[13:28:05] hmmm https://puppet-compiler.wmflabs.org/output/1003753/1379/bastion-eqiad1-03.bastion.eqiad1.wikimedia.cloud/index.html is still showing a diff
[13:28:21] maybe you need to compare for it being non-empty instead of just truthy
[13:29:43] empty array is falsy :(((
[13:30:07] not in ruby apparently
[13:40:23] I looked at other erb files and the newest patch should work
[13:44:31] taavi: are you available to assist me on a nova thing?
[13:45:02] Amir1: why not just `<%- if @mediawiki_smarthosts != [] -%>`?
[13:45:03] arturo: sure
[13:45:14] let's use the coworking meeting?
[13:45:36] joining, one minute
[13:45:38] I wanted to protect against lack of definition, but yeah, let's just do that
[13:56:53] Amir1: it still has a whitespace change but it's good enough, +1'd
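(For the record, the Ruby behaviour behind the `!= []` guard: only nil and false are falsy, so an empty array is truthy; a quick check from a shell:)
```
ruby -e 'smarthosts = []; puts(smarthosts ? "truthy" : "falsy")'   # prints "truthy"
ruby -e 'smarthosts = []; puts(smarthosts.empty?)'                 # prints "true"
# hence the explicit comparison in the ERB template: <%- if @mediawiki_smarthosts != [] -%>
```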
[14:20:27] arturo: I ran homer on the switch cloudvirt1031 is connected to, which added the instances VLAN, and now it works
[14:33:49] topranks: XioNoX: hi, could either of you please do the reverse zone delegation change on the RIPE side required for T341338?
[14:33:50] T341338: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338
[14:38:52] taavi: I'll take a look shortly
[15:04:43] thanks taavi
[15:37:45] so radosgw seems to be fully broken in eqiad1
[15:38:54] ah, I think I was operating on the wrong service
[15:39:42] no, still broken
[15:39:49] Feb 15 15:38:51 cloudcontrol1006 radosgw[1293082]: 2024-02-15T15:38:51.619+0000 7f7194b4e6c0 -1 Fail to open '/proc/1293783/cmdline' error = (2) No such file or directory
[16:46:02] that sounds weird
[16:46:13] (unable to find the process cmdline entry)
[16:54:57] any recent package upgrades?
[16:55:09] however, per the demo during the team meeting, it seems to actually work?
[16:56:38] not on the ceph side, I think
[17:02:16] andrewbogott: please take a look at T357631 if you have a chance
[17:02:17] T357631: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631
[17:03:51] * arturo offline
[17:04:50] arturo: yes!
[17:19:09] I'm gonna reboot cloudnets and cloudservices (T356975)
[17:19:49] ok. Do you remember the thing about how there's a 'right' order for the cloudnets?
[17:20:03] yep, cloudnet1005 is the standby so I'll start from that one
[17:20:07] cool
[17:20:39] there's https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_upgrade#Upgrading_cloudnet_nodes but we should probably add a section about "reboots", as it's not obvious to search for "upgrade" when you don't want to upgrade :)
[17:21:19] sure
[17:21:23] Is there any way I can quickly test if radosgw is still failing? (besides creating a new object + auth + etc.)
[17:22:45] ah, there's also wmcs.openstack.roll_reboot_cloudnets, I forgot about that one
[17:23:10] Do standalone puppet masters' clients act on themselves? Since yesterday the nfs server seems happy with the changes, but the puppet master itself is having issues
[17:23:42] Rook: they can, but it generally makes things worse.
[17:24:05] oh wait, I misread the question
[17:24:19] it's actually quite simple, it seems to work https://object.eqiad1.wikimediacloud.org/swift/v1/AUTH_toolsbeta/dcarotest1/hojas.png
[17:24:33] Rook, sorry, I think I don't understand the question.
[17:25:14] Rook: I just manually regenerated the puppet client cert on the puppetmaster, and it's happy now
[17:25:23] Neat, how did you do that?
[17:25:24] I like that photo :)
[17:25:27] got a weird error once though, so did it twice
[17:26:56] if you ran it manually, it would output the steps, essentially:
[17:27:52] https://www.irccloud.com/pastebin/VJJ42GFN/
[17:28:18] usually the `puppet cert *` commands are run on the master, and the others on the client; in this case it was the same host
[17:28:35] the `find` is missing a `-delete`
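(The procedure being described, as a rough sketch rather than the actual paste; this assumes the older `puppet cert` CLI still shipped on these hosts, the fqdn and ssl directory are placeholders, and Help:Standalone_puppetmaster on Wikitech has the canonical steps:)
```
# On the puppetmaster: clean the client's old certificate
puppet cert clean <client-fqdn>

# On the client (the same host here): wipe the old certs, then re-run the agent so it
# generates and submits a new CSR (the ssl path is an assumption)
find /var/lib/puppet/ssl -type f -delete
puppet agent -tv

# Back on the puppetmaster: sign the new request if it was not auto-signed
puppet cert sign <client-fqdn>
```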
[17:29:02] hmm... I got a different output from something similar. Though I put a `puppet agent -tv` after the first step, perhaps that caused issues
[17:29:52] the -v should be verbose, should be ok
[17:30:17] the steps are here too: https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#Step_2:_Setup_a_puppet_client, and there's the cookbook also
[17:30:43] Oh I see, maybe, I was looking for the cert with `puppet cert list`, which wasn't showing it. I've updated the wiki with a `--all`
[17:31:21] those are already signed
[17:31:36] it will show only unsigned by default (and should be a new one that you just generated)
[17:32:07] I always get a bit confused too :)
[17:32:08] odd, I didn't get any cert output. Just a warning about the command being deprecated
[17:32:49] At any rate, thanks for getting it working
[17:33:06] yw
[17:35:32] striker is failing on cloudweb2002-dev, it seems there's a wrong param passed (--nostatic), anyone know something about it?
[17:35:55] it comes from the docker image itself
[17:36:13] that message is code for "something in the settings is failing to load"
[17:36:57] nice misdirection xd
[17:36:59] dcaro: bd808 and I looked at that a month ago and decided it wasn't worth getting into since striker was never properly provisioned there anyway.
[17:37:12] So I recommend ignoring it, and/or nagging me to remove the service entirely from that host
[17:37:20] (the codfw1dev striker thing, that is)
[17:37:33] oh, it showed up as a puppet alert
[17:37:37] I mean, you can look at it if you're genuinely curious :)
[17:37:49] Yeah, I don't know why it cleared for a while and then returned
[17:38:00] wmcs.openstack.roll_reboot_cloudnets seems to be working nicely, thanks dcaro for creating it :) https://phabricator.wikimedia.org/P56845
[17:38:05] cool!
[17:38:44] the only issue is that it won't work from cloudcumins because it must connect to icinga/am
[17:39:01] happy it's useful :)
[17:39:10] taavi: are your shell session pastes on T357631 meant to be "I did this and now it's fixed" or "here is evidence of the problem"?
[17:39:11] T357631: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631
[17:39:25] we should create one for cloudservices too!
[17:39:51] +1
[17:39:52] andrewbogott: "I did this and it seems a bit better, but I don't know what it really did"
[18:20:03] ah-ha, yeah just found it!
[18:20:13] I missed the bin vs sbin :P
[18:20:46] sbin is first in the path though :D
[18:25:02] * dcaro off
[18:25:13] cya tomorrow
[18:25:46] * andrewbogott waves
[18:26:58] andrewbogott: so how do you check if the VM gets deleted, from horizon?
[18:27:20] it's created in the admin-monitoring project
[18:27:31] Which currently contains 0 VMs.
[18:27:49] you could also do wmcs-openstack server list --project admin-monitoring
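(A minimal sketch of that check and the manual cleanup discussed just below, run from a cloudcontrol host; the server ID is a placeholder:)
```
# nova-fullstack creates (and normally deletes) its canary VMs in the admin-monitoring project:
wmcs-openstack server list --project admin-monitoring

# A test VM that outlives the 2-3 minute test run has leaked and can be removed by hand:
wmcs-openstack server delete <server-id>
```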
[18:28:50] hmm, I've restarted the unit, shouldn't I expect a VM to be present now?
[18:28:58] or is it that fast to create and delete?
[18:29:16] yeah, it should create one right away
[18:29:21] usually takes 2-3 minutes
[18:29:30] I mean, the VM usually lives 2-3 minutes if all is well
[18:29:36] ok, then maybe it's already gone, let me run it onem ore time
[18:29:40] *one more
[18:29:45] nope, it just showed up
[18:30:07] yes, I see it now, it's the new one
[18:30:16] second restart of the unit
[18:30:23] ok
[18:33:35] wait, there were 2 VMs though, and now there's only one
[18:34:13] https://phabricator.wikimedia.org/P56854
[18:34:33] the datetime is only 30 seconds apart
[18:34:42] in the VM name
[18:35:03] Yes, one of them is leaked from the previous time you started the service and then restarted it :)
[18:35:21] hmm, but I did wait about 5 mins between the two "systemctl restart" commands
[18:35:36] and there was no VM before I typed that command the second time
[18:35:42] I don't know why there was a delay
[18:35:59] hmm, so now do I have to delete that VM manually?
[18:36:04] yep
[18:36:23] openstack server delete {id}?
[18:37:15] yep
[18:39:41] done. can I proceed with rebooting the other cloudservice, or would you run nova-fullstack one last time to check it doesn't leak?
[18:42:33] go ahead and reboot the other one
[18:44:03] reboot started
[18:44:11] I captures all the details about the procedure here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#Rebooting_hosts_(e.g._for_security_upgrades)
[18:44:18] *captured
[18:53:04] looks like it's working fine after the second reboot
[18:55:03] did you restart nova-fullstack? because I didn't
[18:55:22] no, but it runs periodic tests
[18:55:27] and the last one was good
[18:55:43] I see. it's not a systemctl timer, is it?
[18:55:58] what's triggering it?
[18:56:53] it's a daemon, I think
[18:59:25] the code definitely contains a 'while True:' before starting the test
[19:03:17] ok!
[19:03:38] I think we only have cloudrabbits left to reboot, I'll do those tomorrow
[19:04:39] * dhinus off
[19:05:21] * andrewbogott waves
[19:15:32] The `--nostatic` striker problem is, I think, caused by some failure of the Django app to load the correct settings file. That flag is available when the contrib.staticfiles app is loaded. It's some variation on the T355522 dev environment bug
[19:15:32] T355522: Striker dev env fails to start with `manage.py runserver: error: unrecognized arguments: --nostatic` - https://phabricator.wikimedia.org/T355522
[19:15:45] * bd808 lunch
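(A sketch of why that error is misleading: `--nostatic` is only registered on runserver by django.contrib.staticfiles, so if the settings fail to load, Django falls back to its bare built-in commands and the flag becomes "unrecognized". Illustration only, not the actual striker setup:)
```
# With a working settings module, staticfiles' runserver override registers the flag:
python3 manage.py runserver --help | grep nostatic    # shows --nostatic

# If the settings fail to import, only Django's built-in runserver is available, and the
# same invocation dies with the confusing error that masks the real problem:
#   manage.py runserver: error: unrecognized arguments: --nostatic
```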