[09:50:12] FYI I've upgraded thanos on cloudmetrics hosts, no impact expected
[10:36:04] 👍
[12:23:33] FYI prometheus roll-restart coming up to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/709032 no impact expected
[12:48:30] is there a regular place that tools put their operational documentation?
[12:50:37] addshore: yes, wikitech has the Tool: namespace for that, https://wikitech.wikimedia.org/wiki/Category:Toolforge_tools
[12:51:01] wonderful!
[13:47:45] !log admin Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus
[13:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:47:50] !log admin Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) (T285858) - cookbook ran by dcaro@vulcanus
[13:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:48:21] !log admin Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by dcaro@vulcanus
[13:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:51:29] !log admin Finished rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by dcaro@vulcanus
[13:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:20:12] !log wikidata-dev wb-reconcile configured mediawiki-runJobs.service to restart itself frequently to pick up LocalSettings.php changes
[14:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-dev/SAL
[14:20:35] (--maxjobs 10 and Restart=on-success, if anyone’s curious)
[14:28:56] Lucas_WMDE: if it's in puppet, why not just have a notify
[14:29:08] If git or anything a post-merge hook
[14:29:10] it isn’t
[14:29:34] totally manual
[14:30:21] Sounds a nightmare
[15:44:52] Lucas_WMDE: you may also want a RuntimeMaxSec so it gets restarted even if no new jobs come in (idk how often your wiki is being edited)
[15:45:00] for reference, https://salsa.debian.org/mediawiki-team/mediawiki/-/blob/master/debian/mediawiki.mediawiki-jobrunner.service is what I use
[15:45:09] but RuntimeMaxSec will make it fail, not succeed
[15:45:18] so then I’d need a different Restart= setting too, right?
[15:45:24] and I didn’t want to mask legitimate failures
[15:45:25] yeah, I use Restart=always
[15:45:37] fair enough
[15:45:40] I’m happy if the job runner doesn’t restart itself while idle
[15:46:56] (I tried --maxtime but it didn’t seem to do much, I think that only restarts if one “batch” of jobs takes too long to run)
[15:48:08] ah, but looking at your commit message, Restart=always makes sense if the job runner might start before mysql or other stuff is ready
[15:48:11] so that makes sense in your case :)
[16:09:08] hi. I have a possibly very stupid question but I can't log in to bastion.wmcloud. I have followed https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances and also been added to one of the projects (Traffic)
[16:09:21] I am using restricted.bastion.wmcloud.org and I have confirmed (thrice now!) that I am using the correct key
[16:10:00] what else can I debug and what may I be missing?
[16:10:13] does it work on primary.bastion.wmcloud.org ?
[16:10:28] omg
[16:10:29] yes it does
[16:10:44] sukhe: what's your uid/username?
[16:10:54] majavah: ssingh
[16:11:23] sukhe: that account is missing from the ops ldap group, which is needed for restricted. access
[16:11:28] my understanding was as per the docs, "A member of Wikimedia SRE Teamrestricted.bastion.wmcloud.org"
[16:11:49] https://ldap.toolforge.org/group/ops
[16:11:55] majavah: that is sukhe there
[16:12:13] https://ldap.toolforge.org/user/sukhe
[16:12:14] as far as I can tell, sukhe != ssingh
[16:12:45] correct, but I think the confusion stems from that I started with ssingh when I had restricted access and now as SRE, I am sukhe
[16:13:25] so sukhe@primary doesn't work but ssingh@primary does
[16:13:31] but neither works for @restricted
[16:13:47] sukhe is missing from project-bastion, which is needed to access any bastion
[16:14:03] but somehow sukhe is in project-traffic, which should auto-add it to project-bastion
[16:14:12] :)
[16:14:27] and ssingh is in project-bastion, but not any other cloud vps project, so I have no clue on how it got there
[16:17:35] looks like https://openstack-browser.toolforge.org/user/sukhe doesn't match up with ldap
[16:17:58] yep, I have a feeling something got confused as you have two ldap accounts with the same email, but I'm not sure what exactly got that something is
[16:18:14] andrewbogott: ^^ want to have a look?
[16:19:04] ldap account reconciliation is my worst nightmare but, yes, I can look shortly :)
[16:19:48] andrewbogott: thanks, it's not urgent! I am also currently looking
[16:39:26] ok, just catching up... I see ssingh/Sukhbir Singh and sukhe/Ssingh do you have history with both accounts or should we just murder one of your clones?
[16:39:49] andrewbogott: I think I can live without ssingh/Sukhbir Singh since it's sukhe/Ssingh for my main job :)
[16:40:03] And also majavah is right that regardless we're in a "this can't happen" situation
[16:40:34] ok, I'm going to just add sukhe to bastion manually and then maybe we'll never have to think about this again.
[16:40:40] thanks andrewbogott
[16:40:48] ...but... would you be willing to change your not-work account to have a non-wmf email associated with it?
[16:40:54] That'll make forensics a lot easier in the future
[16:42:38] I wonder if we can do ssingh+...@ or something but yeah, I think we can do it for the purposes of debugging this at least
[16:44:05] if you know the login for the personal account you can just log in to wikitech and change it
[16:44:57] Haha I think this used to happen now and then when I was around 😁 (re @wmtelegram_bot: And also majavah is right that regardless we're in a "this can't happen" situation)
[16:45:09] andrewbogott: done
[16:45:29] sukhe: this started with you trying to ssh someplace? Now is a good time to try again
[16:46:47] tried with sukhe since we killed ssingh but can't log in :/
[16:47:05] show me your commandline?
[16:48:07] (and I didn't actually kill ssingh, I'm hoping that won't be needed)
[16:49:02] andrewbogott: see PM for detailed logs
[16:49:15] https://ldap.toolforge.org/user/sukhe is now in ops/wmf/project-bastion and https://ldap.toolforge.org/user/ssingh is in project-traffic
[17:06:02] @yuvipanda: the impossible happens all the time at scale :)
[19:33:29] fwiw and as an update, I managed to fix the problem by standardizing the accounts I have. I haven't killed my former self yet because of the other places I have used it so I updated my ops account instead. thanks to andrewbogott for his help in helping debug this
[19:33:41] (and AntiComposite and majavah as well)
[19:48:15] !log tools.lexeme-forms deployed 37acc67c90 (l10n updates)
[19:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[21:27:04] "I haven't killed my former self" sounds... suicidal
[21:27:05] xD
[21:46:41] Platonides: or just things a clone might say :)
[21:59:22] or a time-traveler