[07:58:06] * arturo online
[07:58:17] <taavi> morning!
[07:59:16] <dcaro> \o morning
[07:59:18] <arturo> o/
[08:12:08] <dcaro> there's a bgp alert in codfw 'BGP CRITICAL - AS64605/IPv4: Active - Anycast'
[08:13:36] <dcaro> is anyone doing anything there? maybe topranks ?
[08:13:57] <taavi> that's cloudservices2005-dev
[08:14:01] <topranks> not I
[08:14:16] <taavi> not letting me ssh either, I'll look
[08:14:28] <taavi> also reminds me we need to merge https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1142517
[08:17:04] <taavi> management console is super slow, I'll powercycle
[08:29:14] * arturo brb
[08:34:51] <taavi> arturo: (when you're back) what do you think of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144452?
[10:06:19] <arturo> taavi: for T393760 do you have a link to the code creating that call?
[10:06:20] <stashbot> T393760: trove: Unable to create user with IPv6 address as host - https://phabricator.wikimedia.org/T393760
[10:06:58] <taavi> arturo: in a meeting, but https://gitlab.wikimedia.org/repos/cloud/metricsinfra/tofu-provisioning/
[10:07:30] <arturo> thx
[12:11:05] <dcaro> Hi all, I'm thinking on moving the next toolforge workgroup sync to wednesday, as tomorrow I'm off, I'll send an update to the meeting but let me know if you prefer any specific time/day
[12:13:40] <dhinus> no preference, wed is fine
[12:27:50] <taavi> arturo: chuckonwu: something I've been thinking about the toolforge tofu repo, is that right now if you want to provision a service, you need to touch lots of different places since all the resources (security groups, dns records, server groups, volumes, floating ips, etc) are all defined separately in separate files
[12:27:53] <andrewbogott> I had to reboot cloudservices2004-dev last night, and cloudnet2005-dev for the same reason -- unresponsive even via mgmt console.
There were also bgp alerts on one of the cloudsw nodes in codfw1dev
[12:28:35] <taavi> instead i'm thinking of having a generic "service" module that could have the common things each service needs defined all in one place, and then deployment-specific things (mostly exact VM names/sizes) defined per-project
[12:29:03] <taavi> so something like the current legacy_redirector module that I introduced, but with some level of abstraction to reduce the required boilerplate
[12:29:10] <arturo> taavi: sounds good to me
[12:31:07] <dcaro> that's the same approach as the projects in the tofu-infra repo right?
[12:33:24] <taavi> i guess yes
[12:33:52] <taavi> alright, I'll draft up a proof of concept of what that could look like
[12:34:50] <dcaro> +1 for that yep
[12:35:41] <dcaro> btw. I have updated https://phabricator.wikimedia.org/T393564 with the doc info, and set some dates and such, feel free to keep asking/adding comment/editing there on the task
[12:35:48] <dcaro> T393564
[12:35:48] <stashbot> T393564: [Hypothesis] WE6.3.10 start a beta for the push-to-deploy features - https://phabricator.wikimedia.org/T393564
[12:46:59] <andrewbogott> Is anyone else working on the tools bastion issue? It seems like an emergency if it's still happening although I'm pretty sure https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143963 resolved the symptom at least
[12:50:32] <taavi> andrewbogott: i was last week, i would've today but after your fixes it seemed stable enough to wait for you to get online
[12:51:22] <andrewbogott> In theory I'm out this morning but I have a big before I need to go. Shall we try merging my patch and enabling puppet?
[12:51:47] <andrewbogott> I suspect that the next step is to get ldap monitoring which is more than a 20 minute project
[12:51:53] <andrewbogott> *have a bit*
[12:52:50] <andrewbogott> I knew you had worked on it before, I wasn't sure if you stopped today because you thought it was better or because you'd despaired :)
[12:53:44] <taavi> yeah. i see someone else helpfully tried to run PCC on that patch across all of wikiprod hosts
[12:53:50] <taavi> anyway, +1
[12:55:41] <andrewbogott> when I enable puppet it'll flip back from the codfw to the eqiad replica but we don't think that that helped do we?
[12:55:53] <taavi> yeah that's fine I think
[12:56:22] <andrewbogott> ok, here goes
[13:00:59] <andrewbogott> I keep trying to run a watch on logins for a day but my laptop finds different ways to kill off the process after an hour or two.
[13:01:07] <andrewbogott> anyway it's working right now at least
[13:03:05] <arturo> taavi: please review this: https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/51 and this as example: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/merge_requests/23
[13:12:43] <taavi> arturo: done
[13:19:09] <taavi> what's up with these WidespreadPuppetAgentFailure alerts?
[13:20:14] <taavi> andrewbogott: ^ seemingly related to that sssd patch
[13:20:24] <andrewbogott> yeah
[13:20:31] <taavi> more specifically, sssd-nss apparently failing to restart
[13:20:34] <andrewbogott> for some reason a manual 'systemctl restart' solves it
[13:20:37] <andrewbogott> but puppet can't restart it
[13:20:46] <taavi> weird
[13:20:50] <taavi> need any help with that?
[13:21:03] <andrewbogott> I'll just restart with cumin and move on unless you're curious
[13:21:16] <taavi> sounds ok to me
[13:22:31] <taavi> i silenced (Widespread)PuppetAgentFailure alerts for the next hour to avoid spam
[13:23:03] <andrewbogott> thx
[13:28:45] <andrewbogott> taavi: one more https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144572
[13:29:06] <taavi> +1
[13:34:25] <taavi> arturo: for your consideration https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/24
[13:38:49] <arturo> I need to pick up child, be right back
[13:48:16] <andrewbogott> taavi: I'm going to go in a minute. login.toolforge.org may or may not be stable, but puppet is definitely going to start breaking everywhere. Can you run sudo cumin --force O{*} "systemctl restart sssd"
[13:48:25] <andrewbogott> in 30 minutes to reset things?
[13:51:43] * taavi sets a reminder
[13:51:49] <andrewbogott> thank you!
[13:52:03] * andrewbogott out for a few hours
[14:30:00] <arturo> taavi: LGTM
[15:20:50] <chuckonwu> arturo: can you check this MR https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/20
[15:21:08] <arturo> yes
[15:26:13] <arturo> chuckonwu: +1'd please merge
[15:50:55] <arturo> taavi: I updated https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/51
[16:00:09] <taavi> arturo: thank you, will look tomorrow morning
[16:00:18] <arturo> ack
[16:16:55] * dhinus offline
[16:26:28] * arturo offline