[00:44:20] * bd808 off
[08:45:00] morning
[08:54:15] morning
[09:31:04] o/
[10:11:24] taavi: you might want to block https://wikitech.wikimedia.org/wiki/Special:Contributions/I, per -stewards, that's also Q28
[10:25:19] there are some widespread puppet issues on cloudinfra and others, looking, has anyone touched something?
[10:25:58] i merged the sssd socket activation patch recently
[10:26:10] okok, looking
[10:27:41] weird, I don't really see emails/alerts of a host failing to run puppet, just the widespread ones
[10:28:11] the widespread alerts have a shorter activation time than the individual ones
[10:28:28] i also see the widespread alert resolving in some projects
[10:30:14] Mar 06 10:05:35 enc-1 puppet-agent[4005019]: (/Stage[main]/Ldap::Client::Sssd/Service[sssd-nss]) Failed to call refresh: Systemd restart for sssd-nss failed!
[10:30:48] but yep, a second run works
[10:31:10] and it's running
[10:31:18] https://www.irccloud.com/pastebin/ffcm0SXi/
[10:31:20] so probably an ordering issue with how the everything-in-one unit was stopped and the socket units started?
[10:31:24] in that case, sorry about that
[10:31:34] that'd be my guess :), np
[10:32:12] hmm, the logs for that service don't help much
[10:36:35] it seems to be restarting every 8 min or so by itself (just says it shuts down, unit succeeded, starting up)
[10:39:31] 7:30m, that's the cadence xd
[10:42:48] it might be that some cron/ssh check does something and triggers the socket
[10:46:08] * dcaro be back in a bit
[12:14:59] Is there any timeline for moving to bobcat?
[12:46:01] Rook: I think roughly by the end of the FY (T356287 was added to the "goals" for Q3/Q4)
[12:46:02] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287
[12:47:24] it should be an easier upgrade than antelope because we don't have to reimage hosts... but I learnt the hard way not to make any estimate when openstack is involved :P
[12:54:29] Sounds good.
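[editor's note: the 10:31:20 "ordering issue" guess above could, if confirmed, be addressed with a systemd ordering drop-in. This is only a hedged sketch of that idea, not a known fix; the drop-in path and the ordering directive are assumptions, only the sssd-nss unit name comes from the log:]

```ini
# hypothetical drop-in: /etc/systemd/system/sssd-nss.socket.d/order.conf
# Order the socket unit's start after the monolithic sssd.service has
# been handled, so the first puppet run's stop/start sequence cannot
# race the two units. Untested assumption, not the actual patch.
[Unit]
After=sssd.service
```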
Thanks
[14:07:08] * taavi looking for code review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008462
[14:15:42] dhinus: I remember weird things on trove side, I think that the runbook has some details on that, is that the same thing?
[14:17:56] taavi: done
[14:18:45] dhinus: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Reserved_quota_does_not_go_down that was it
[14:23:25] I'm thinking of adding logrotate to the bullseye standalone image (https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/1009264), I think it's a nice workaround for users to get log rotation on all their jobs
[14:23:54] I thought we already had some image with it included? Mariadb maybe?
[14:24:13] would be nice if we could enable something like that by default even xd, at least until we get a nicer logging solution
[14:24:26] taavi: let me check, if that's the case that's good already :)
[14:25:22] yep it does!
[14:25:25] nice :)
[14:25:39] let me try to add some note in the docs also
[14:28:03] thanks!
[14:31:55] the docs are there already xd https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Pruning_log_files
[14:44:44] I'm thinking about T357881, why do we use 1/2 of the limit for the requests, shouldn't we allow using requests as high as the limit?
otherwise users would not be able to request more than half their limit (we could actually limit the request to the max RAM of a worker node minus a little bit, so the pod will be able to start)
[14:44:45] T357881: [maintain-kubeusers] Allow setting the requests cpu and mem quota - https://phabricator.wikimedia.org/T357881
[14:45:14] as in, we don't need to let users set whatever they want there, but we can have that default so users can use as much as they could
[14:58:58] i am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008462, this should have no impact but please ping me if you see any network issues in codfw1dev
[15:58:12] bd808: is the striker work for T148048 still something you plan to do soon or can I take over?
[15:58:13] T148048: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048
[15:58:31] the trove quotas are even more inconsistent than the situation described in https://wikitech.wikimedia.org/wiki/Portal -- I filed T359412
[15:58:32] T359412: [trove] wrong quota_usages values in project tf-infra-test - https://phabricator.wikimedia.org/T359412
[15:58:51] *correct link: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Reserved_quota_does_not_go_down
[16:06:15] :facepalm:
[16:17:29] hmm, we have 3 osds running on buster
[16:24:06] nm, we have a mixture
[16:43:03] taavi: I have a patch, but it is busted. I could get it up in gerrit for you to look at and take over or rewrite from scratch if you've got big Striker energy (which seems to be the cool case right now)
[16:44:27] The bustedness is that the way I have coded it, the LDAP directory needs a migration to add a new objectClass to all of the Developer accounts. Specifically the objectClass that holds the new attribute.
[17:03:43] * arturo offline
[17:11:57] there are some DNS leaks alerts that triggered 5 hours ago.
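[editor's note: the quota reasoning at 14:44:44 above can be sketched as a small calculation. This is a hedged illustration of the idea (requests quota as high as the limits quota, capped by what one worker node can schedule), not the actual maintain-kubeusers code; the node size and safety margin are made-up numbers:]

```python
# Sketch of the request-cap idea discussed above: let a container
# request as much as its limit, but never more than a single worker
# node can actually host (node RAM minus a small margin, so the pod
# can still be scheduled). NODE_RAM_GIB and the margin are assumptions.
NODE_RAM_GIB = 8.0        # hypothetical worker node size
SAFETY_MARGIN_GIB = 0.5   # headroom so the pod is still schedulable

def max_request_gib(limit_gib: float) -> float:
    """Cap a memory request at the limit, bounded by node capacity."""
    schedulable = NODE_RAM_GIB - SAFETY_MARGIN_GIB
    return min(limit_gib, schedulable)

print(max_request_gib(4.0))   # fits on a node, so the full limit: 4.0
print(max_request_gib(16.0))  # capped by node capacity: 7.5
```

Compare with the current behaviour the task questions, where the request would be fixed at half the limit regardless of node size.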
I ran the wmcs-dnsleaks script and it points to tools-sgeweblight-10-32
[17:12:33] does anybody know why they leaked? should I just run "wmcs-dnsleaks --delete"?
[17:15:43] that rings a bell (I think I saw something in the emails)
[17:17:18] maybe not (can't find it)
[17:17:34] found an email with "alertname=InstanceDown" on that host
[17:17:50] the InstanceDown resolved after 5 mins
[17:18:04] but the time matches the time of the dns leak
[17:18:06] taavi removed it https://sal.toolforge.org/log/h5ZgE44BGiVuUzOd3Ah0
[17:18:54] it had gotten stuck on NFS earlier this morning too
[17:19:40] I'll just run "wmcs-dnsleaks --delete" for now, then keep an eye on that instance
[17:20:48] 👍
[17:35:12] bd808: ugh, I see, that might require changes in the ldapdb library. please push whatever you do have so far, I'll either use that as a base or not
[17:37:17] (or backfill the object class to existing users instead of changing the filter in the library)
[17:39:45] taavi: I will try to get it uploaded for you today. I've got too many things open in parallel at the moment so it is helpful to hand some off, but it also means I have to gather my WIP bits and make them semi-presentable :)
[17:46:06] I assumed that you still want to review all striker patches, so I'm really just converting your work from writing code to reviewing it instead of reducing it :-)
[17:49:55] * dhinus off
[17:55:18] taavi: if you are interested in "owning" Striker for a while I don't have any objections. The 2023 reorg made my maintainership a volunteer/best-effort thing that was maybe going to be backfilled in WMCS. Nicholas leaving makes that backfill seem unlikely in the near term.
[17:56:02] but I'm also totally willing to review and merge, you will just need to have some patience with my turnaround times.
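[editor's note: the "backfill the object class to existing users" idea at 17:37:17 above is a standard LDAP migration and could be expressed as an LDIF modify batch, one stanza per account. This is only an illustration: the DN layout and the objectClass name are placeholders, not the actual Wikimedia schema or the new attribute's real auxiliary class:]

```
dn: uid=exampledev,ou=people,dc=wikimedia,dc=org
changetype: modify
add: objectClass
objectClass: exampleSulAccount
```

Backfilling this way would let the ldapdb library keep its existing filter, at the cost of a one-off migration over all Developer accounts.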
:)
[18:01:55] in that sense we will want to start working on striker soonish too from the toolforge perspective, once we have the APIs exposed
[18:10:59] * dcaro off
[18:11:01] cya tomorrow
[18:42:49] taavi: you might be amused that I just fixed another wikibugs bug that has existed longer than you've been around the movement. T127506 was just over 8 years old.
[18:42:50] T127506: wikibugs no longer says new tasks are "NEW" - https://phabricator.wikimedia.org/T127506
[18:54:18] bd808: hmm. the wikibugs gerrit listener seems to be stuck again. do you want to poke at it or should I just restart it?
[18:55:16] taavi: I'll take a look. Probably nothing to see, but I can try.
[18:56:03] last event timestamp in the logs is 2024-03-06T17:56:59Z, so yeah the ssh connection is busted
[19:49:18] * bd808 late to lunch
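[editor's note: the manual check at 18:56:03 above, comparing the listener's last event timestamp against the clock, could be automated as a staleness watchdog. A minimal sketch, assuming ISO-8601 timestamps like the one in the log and an arbitrary 30-minute threshold; this is not part of wikibugs itself:]

```python
# Flag the gerrit listener as stuck when its newest logged event is
# older than a threshold. Timestamp format matches the log line above;
# the 30-minute threshold is an assumption.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)

def is_stuck(last_event_iso: str, now: datetime) -> bool:
    """True when the last event is older than STALE_AFTER."""
    last = datetime.fromisoformat(last_event_iso.replace('Z', '+00:00'))
    return now - last > STALE_AFTER

now = datetime(2024, 3, 6, 18, 56, tzinfo=timezone.utc)
print(is_stuck('2024-03-06T17:56:59Z', now))  # ~1h old -> True
```

A real watchdog would feed this from the listener's log and restart the ssh connection instead of printing.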