[11:17:12] hello there, I am trying to provide VPS access to a teammate, and I added them access to a project I manage
[11:17:40] it added them to the bastion project automatically- so far, so good
[11:18:03] but they cannot ssh into the bastion hosts, is there something else needed for first time access?
[11:18:23] do I need to manually add the bastion group to LDAP? or maybe it just takes some time?
[11:19:01] (the right groups are on horizon but not on ldap)
[11:19:40] what bastion?
[11:19:57] I assume they set up a SSH key in their profile?
[11:20:00] bastion.wmcloud.org
[11:20:13] I can see it on ldap, but is idm separate from lda?
[11:20:21] *ldap?
[11:20:41] jynus: what's the username?
[11:20:52] fceratto
[11:21:11] jynus: this is the reference documentation https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances
[11:21:16] root@bastion-eqiad1-03:~# groups fceratto
[11:21:16] fceratto : wikidev ops wmf project-mariadbtest
[11:21:25] yeah, it was followed
[11:21:31] I ask after it didn't work
[11:21:47] it says it gets added to bastion automatically
[11:21:53] * andrewbogott notes that project-bastion isn't in that list
[11:22:01] so that's that bug happening again. One moment...
[11:22:13] so I see it on horizon
[11:22:16] but not on ldap
[11:22:35] he is part of the project on horizon, I saw his name there
[11:22:45] (of the bastion, I mean)
[11:23:59] "fceratto reader"
[11:24:02] oops I'm going to burn my breakfast! arturo I added them to the bastion project which might have fixed things
[11:24:22] ok
[11:24:29] feel free to ping me later of what I did wrong or what I could do differently
[11:24:34] to amend docs
[11:24:45] jynus: it may have been a bug, try again?
[11:25:11] I've seen people missing from the project bastion before, but it's interesting if they're listed in the project in horizon but not in ldap. maybe andrewbogott did the change just as jynus was looking?
[11:25:41] now it's visible in ldap too https://ldap.toolforge.org/user/fceratto
[11:25:46] or maybe it just takes time to sync, which would be ok
[11:25:59] I can add that advice to the docs
[11:26:47] jynus: no, there's a legit error with adding new users to the bastion...
[11:26:53] which I have hopefully corrected
[11:26:59] maybe "the project-bastion LDAP group will be added automatically" should read "SHOULD be added automatically" :P
[11:27:03] ok, let me know if I can help in any way
[11:27:08] try again now?
[11:27:26] yeah, I told him. he may not respond immediately, sorry
[11:27:36] ldap looks fine now
[11:27:59] thank you, will ping if I need further help
[11:28:11] please go have breakfast
[11:47:02] I was confirmed the log-in was successful
[11:47:05] thanks again
[11:50:43] great!
[11:51:13] jynus: tangentially, I need to reboot the bastions for maintenance in about 10 minutes so that'll be a rude one-time surprise
[12:51:50] andrewbogott: I see your cloud-announce message in "Held Messages" (because it's too big), but I also see it in the archives
[12:51:59] I will delete the one in "Held"
[12:52:09] thanks -- I sent the first one from the wrong email address I think
[12:53:08] looks correct to me: "abogott@wikimedia.org The message is larger than the 40 KB maximum size"
[12:53:33] I clicked "reject" so it should bounce back to you
[13:10:17] hm, so how did it get to the archives?
[13:10:31] it being in the archives means it got delivered to people, right?
[13:45:15] it was delivered to me
[13:49:55] ok then :)
[14:00:00] dhinus: prefer video or irc?
[14:00:14] irc should be fine
[14:00:30] I will do a graceful shutdown of toolsdb
[14:00:43] and take the chance to upgrade to the latest minor version
[14:00:56] ok
[14:01:20] What do you think about the proxy: move the floating IP, or just reboot the active one?
[14:02:27] I forgot to find the cookbook for it
[14:02:40] if you want to look for it while I do toolsdb
[14:02:43] I don't think there is one, only for the ha proxies but not the web proxies
[14:03:29] I thought taavi mentioned there is a cookbook but maybe he was referring to the ha proxies
[14:04:26] yeah, I looked in the backscroll and he says 'probably not' about the regular proxies
[14:04:43] I'm going to just reboot since there's less opportunity for me to mess something up
[14:05:03] btw dhinus, I'm seeing that occasionally a hard reboot is not enough to reset the ceph settings. I know not why, but a cold reboot always works.
[14:05:33] So I've been doing 'openstack server migrate --shared-migration --wait && openstack server migration confirm ' rather than just a reboot
[14:06:16] andrewbogott: reboot for web proxies sounds good then
[14:07:31] give me a sec before rebooting tools-db
[14:07:48] ok, proxy is done
[14:10:00] there is some apt pinning that is preventing me from upgrading mariadb from 10.6.9 to 10.6.20
[14:10:51] I'm giving up for now as it's taking too long, will research that separately
[14:11:18] ok -- can't you take your time if you do it on the standby host?
[14:14:15] yep exactly
[14:14:33] though we are now using the standby as a "read-only" host as well
[14:14:47] but I can still take my time to understand how to change the pin
[14:14:56] I'm shutting down gracefully the primary now
[14:15:00] it's taking a while
[14:15:07] ok it's stopped
[14:15:30] sorry, I'm looking in the wrong place, that's -5 or -4?
[14:15:40] -4 is primary, I stopped -5 as well
[14:15:47] ok, ready for me to reboot both?
[14:16:12] yep
[14:16:54] ok -4 is done
[14:17:46] and -5
[14:18:10] everything back and happy?
[14:18:25] restarting mariadb in -4
[14:18:35] the systemctl unit is configured not to start automatically
[14:18:49] I remember that, although I don't remember why :)
[14:20:26] aaaand here come the complaining emails :)
[14:21:00] I tried to set a silence, but it didn't work :/
[14:21:20] I need to write a cookbook that does this reboot dance properly
[14:21:35] I think we're back now
[14:21:44] replication is also working and in sync
[14:21:50] great
[14:22:06] nice to have that working reliably that we can mess with it without a day of downtime!
[14:22:31] I'm doing 'integration' nodes by hand, and have a script running in the background rebooting everything else
[14:24:34] thx dhinus
[14:25:27] thank you! did you reboot prometheus/alertmanager VMs by any chance? because I see the FIRING notifications but not the RESOLVED ones
[14:25:49] um... I'm not sure. What project?
[14:26:11] metricsinfra I think
[14:26:12] oh, metricsinfra
[14:26:28] yes, those are in the background script but I haven't been paying attention to when they'll be hit
[14:26:30] let's see...
[14:27:32] looks like not yet, the script will get to them in 5-10 minutes
[14:28:43] hmm then I'm confused why the RESOLVED notifications are missing
[14:29:24] the alerts are not firing in https://prometheus.wmcloud.org/alerts
[14:31:38] hmm I caused a page to go off sorry arturo
[14:32:27] I resolved the incident in victorops now
[14:40:22] no problem!!
[14:41:07] the RESOLVED notifications arrived eventually, both IRC and email
[14:58:01] dhinus: uh which cookbook?
[14:58:22] there's a generic cookbook to move a floating IP from one server to another
[14:58:37] taavi: I think I misread your comment the other day
[15:01:03] you mentioned the "move a floating IP" cookbook and I thought you were saying it could be used for rebooting the webproxies with no downtime?
[15:01:10] I'm not familiar with the webproxies setup
[15:02:53] also I keep confusing the different proxies we have :D
[15:03:03] which ones were you rebooting today andrewbogott?
[15:03:40] to reboot the tools outermost proxy, you move the floating IP from one server to another?
[15:04:39] tools-proxy-7 I think?
[15:08:34] I found the conversation from Feb, 3rd that caused my confusion:
[15:08:53] I rebooted the inactive proxy, tools-proxy-7. Not sure if it's better to move the floating IP over
[15:08:59] andrewbogott: previously we've just accepted the short outage from moving the floating IP without any announcements. there's a cookbook to do it really fast
[15:09:30] ah, ok. So I caused slightly more outage than necessary.
[15:19:24] I misread that comment from the other day as "there is a cookbook specifically to failover tools-proxy", but I think taavi was referring to wmcs.vps.migrate_floating_ip, which could have been used to failover tools-proxy before rebooting the VM
[15:19:37] yes
[15:19:53] all clear now :)
[15:23:29] * andrewbogott wishes that Integration project had cookbooks because depooling/repooling via the web ui is tedious
[16:27:39] I drained and rebooted a cloudvirt and it worked! But also caused an alert
[16:31:30] there will be lots of alerts about kernel errors, as always when we reboot servers
[16:49:34] in theory the patch that was merged the other day should reduce them
[16:49:38] but maybe not enough :)
[16:56:27] ah good I see we only have "warning" level alerts so far, that's already an improvement
[16:56:35] because they don't create tasks
[16:57:36] I will try setting a 1-week silence on all "warning"-level KernelErrors alerts
[16:57:44] that should get rid of the emails
[17:02:41] silence created
[17:27:30] thanks!
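
Note on the 1-week silence mentioned at the end of the log: the log does not show how the silence was created. As a rough sketch only, a silence like that could be described as an Alertmanager v2 API payload (the body of a POST to /api/v2/silences). The label names "alertname" and "severity", the creator name, and the comment text are assumptions for illustration, not values taken from the log; only "KernelErrors", "warning" and the one-week duration come from the conversation.

```python
from datetime import datetime, timedelta, timezone
import json


def build_silence(alertname, severity, days, created_by, comment, now=None):
    """Build a silence payload in the shape the Alertmanager v2 API accepts.

    The label names "alertname" and "severity" are assumed here; check the
    labels that the real alerts carry before relying on these matchers.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(days=days)

    def matcher(name, value):
        # Exact (non-regex) equality match on a single label.
        return {"name": name, "value": value, "isRegex": False, "isEqual": True}

    return {
        "matchers": [
            matcher("alertname", alertname),
            matcher("severity", severity),
        ],
        # Timezone-aware isoformat() output is a valid RFC 3339 timestamp.
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    # "example-user" and the comment are placeholders, not from the log.
    payload = build_silence(
        "KernelErrors", "warning", 7, "example-user", "expected noise after reboots"
    )
    print(json.dumps(payload, indent=2))
```

The sketch only builds and prints the payload; sending it to an Alertmanager instance, or using a tool such as amtool instead, is left out on purpose.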