[01:10:18] * bd808 off
[10:52:38] somehow web services for grid-disabled tools end up in 'dr' state but don't actually go away, so I'm doing a bit of scripting to actually stop and unregister them
[14:23:44] taavi: that feels like some kind of order-of-operations thing... possible https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/11 ?
[15:13:28] taavi: noted. let me know when I can proceed with disabling the next batch of tools.
[15:19:31] komla: you can continue, I have a script to work around that issue for now
[15:21:01] andrewbogott: we can try that, but I have a hard time imagining how that affects stopping already existing jobs
[15:23:53] taavi: agreed, I was thinking that maybe you were seeing examples of a service trying to start after the quota was set
[15:25:40] no, it's the already existing job that just refuses to stop. and unlike the usual case of the grid losing track of something, the processes don't actually stop on the exec node
[15:29:29] ok, then I've got nothing
[15:30:03] maybe I'll have a look at the gridengine-exec logs on the node itself next time it happens
[16:13:07] dhinus: when you make the new etherpad can you pls copy the "who's around what days" to the new pad for easy reference? thanks!
[16:23:15] "interesting" observation of the day: The time between the first commercial radio broadcasts in the US (~1920) and the invention of Unix (~1970) is about the same as from Unix's invention to today. Not sure what this tells us about anything, but there it is.
[16:23:54] If something happens, we can call it bd808's law, and make a prediction about the next cycle!
[16:24:36] andrewbogott: good idea, will do!
[16:32:08] I've just realized that the grafana link in the topic of this channel has been broken for a while
[16:32:29] I think that dashboard was not migrated to the new grafana instance
[16:32:43] I could try and find it in the grafana backup... but does anybody need it?
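The cleanup scripting mentioned above could start from something like this sketch: finding the jobs stuck in the 'dr' (deletion-requested but still running) state from `qstat` output. The column layout is assumed from typical Sun Grid Engine output; the actual script is not shown in the log and may work differently.

```python
# Hypothetical sketch: pick out grid jobs whose state column is 'dr'
# from plain `qstat` output. Column positions are an assumption based
# on the usual SGE layout (job-ID prior name user state ...).

def stuck_dr_jobs(qstat_output: str) -> list[str]:
    """Return job IDs whose state column is exactly 'dr'."""
    jobs = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # skip the header and separator lines
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        if state == "dr":
            jobs.append(job_id)
    return jobs


sample = """\
job-ID  prior   name       user     state submit/start at     queue
-------------------------------------------------------------------------------
9901234 0.30000 lighttpd-a tools.a  dr    12/21/2023 10:52:38 webgrid@exec-01
9905678 0.30000 lighttpd-b tools.b  r     12/21/2023 10:52:38 webgrid@exec-02
"""
print(stuck_dr_jobs(sample))  # ['9901234']
```

Each returned ID could then be handed to something like `qdel -f` to force-remove the job from the scheduler, which matches the "stop and unregister" step described above.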
[16:35:07] For now, I've removed the broken link from the topic. The part I removed was: "Tools status board: https://grafana-labs.wikimedia.org/d/000000012/tools-basic-alerts"
[16:35:13] https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 is maybe a reasonable replacement?
[16:36:12] I'll add a note to discuss at the next weekly meeting in Jan
[16:56:18] that and https://grafana.wmcloud.org/d/zyM2etJ4k/toolforge-grid-deprecation?orgId=1 is what I've ended up staring at most often in the last few days
[16:59:16] but I'm not sure if it's important enough for the channel topic
[17:19:39] I've added that one to the etherpad too
[17:20:03] did anyone use the wmcs.vps.create_project cookbook recently?
[17:20:25] I ran it from cloudcumin1001, it did create the project but then it failed with "Gateway Timeout (HTTP 504)"
[17:23:18] dhinus: do you know which call failed?
[17:26:31] trying to find out
[17:32:07] the command that failed is the first one: "wmcs-openstack project create...."
[17:32:21] but the project is now there
[17:32:49] ok, so it's likely some of the hooks within keystone that timed out
[17:33:00] what's the name of the new project?
[17:34:08] adiutor
[17:34:25] I can see the full command logged in cloudcontrol1005
[17:34:37] not sure if some logs will contain the reason for the failure
[17:34:59] I fear that it'll turn out to be ldap things that didn't get finished. I'll see what I can see
[17:37:45] yeah, it didn't create the initial group but when I added myself that created it.
[17:37:57] so far I don't see anything actually wrong...
[17:38:33] related phab: T353421
[17:38:34] T353421: Request creation of Adiutor VPS project - https://phabricator.wikimedia.org/T353421
[17:38:55] should I just continue and add admin users to the project?
[17:39:06] and I'll have to set the quotas manually
[17:40:14] dhinus: have a timestamp for when the failure happened?
[17:40:26] But yes, I think it's OK to proceed
[17:40:55] 17:15:23 UTC
[17:41:14] that's when the command started, the failure was recorded 2 minutes later
[17:41:30] thx
[17:43:12] I see an error at 17:18:23.144 about a duplicate project name, did you try to re-run after the failure? Or did the cookbook?
[17:43:19] yes that was my retry
[17:43:21] ok
[17:44:09] looking again at the cookbook, the quota handling is only for "trove-only" projects, so I don't think I need to do any additional step
[17:44:13] I can continue and add the users
[17:45:17] the logs seem perfectly happy during the first run
[17:46:09] keystone logs at least
[17:46:10] not sure if it's worth creating a test project to see if the error is reproducible or not
[17:47:13] Sure, do you mind retrying?
[17:47:21] Are you running the cookbook on cloudcumin or locally?
[17:49:27] first time from cloudcumin
[17:49:36] then retried locally
[18:04:10] I'm retrying with a test project name, running the cookbook from cloudcumin1001
[18:06:07] failed again with the same error: Gateway Timeout (HTTP 504)
[18:06:47] and again no complaints in the logs
[18:07:14] anyway dhinus I think you can move on, I'll see if I can reproduce the issue here
[18:07:26] yep I've already added the users and resolved the ticket
[18:07:33] (but I guess delete the test project if you can)
[18:07:41] I think the next debugging step could be running the openstack command manually from the cloudcontrol, without using the cookbook
[18:07:49] yep
[18:09:06] I'll open a task to track this, it's not urgent
[18:09:20] thanks!
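The timestamp correlation being done above (command started 17:15:23, failure recorded ~2 minutes later, duplicate-name error at 17:18:23 from the retry) is the kind of thing that is easy to script when grepping logs by hand gets tedious. A minimal sketch, assuming simple `HH:MM:SS message` log lines rather than keystone's real log format:

```python
from datetime import datetime, timedelta

def lines_in_window(log_lines, start: str, minutes: int):
    """Yield log lines whose leading HH:MM:SS timestamp falls within
    `minutes` after `start`. The log format here is an assumption,
    not the actual keystone format."""
    t0 = datetime.strptime(start, "%H:%M:%S")
    for line in log_lines:
        try:
            ts = datetime.strptime(line.split()[0], "%H:%M:%S")
        except (ValueError, IndexError):
            continue  # line has no leading timestamp
        if timedelta() <= ts - t0 <= timedelta(minutes=minutes):
            yield line

# Hypothetical log entries mirroring the timeline discussed above.
logs = [
    "17:15:23 keystone: project create received",
    "17:17:23 keystone: gateway timeout returned to client",
    "17:18:23 keystone: duplicate project name",
    "17:45:00 keystone: unrelated entry",
]
print(list(lines_in_window(logs, "17:15:23", 5)))  # first three entries only
```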
[18:09:26] I've deleted the test project
[18:13:13] T353829
[18:13:14] T353829: [openstack] Creating a new project returns Gateway Timeout (HTTP 504) - https://phabricator.wikimedia.org/T353829
[18:17:50] I updated the section "Creating a new project" here https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle#Creating_a_new_project
[18:27:44] umpf, the quota_increase cookbook is also failing :/
[18:31:54] the 504 happens on the cli too (for creation)
[19:04:32] created T353833 to track the quota_increase cookbook error
[19:04:33] T353833: [wmcs-cookbooks] quota_show fails to parse openstack CLI output - https://phabricator.wikimedia.org/T353833
[19:04:47] I suspect the openstack CLI output changed in Antelope
[19:05:30] I tried to fix it but the output is quite different so there's a bit more work needed, we need to update test fixtures as well
[19:06:57] different topic: I'm always unsure about toolforge membership requests... this one comes from a very active user, but the reason is... short :D https://toolsadmin.wikimedia.org/tools/membership/status/1620
[19:07:57] while this one has a nice explanation but the user account is not active at all https://toolsadmin.wikimedia.org/tools/membership/status/1605
[19:14:23] seemingly all of the tools shut down in the last batch today did shut down the web services cleanly, nothing is stuck in 'dr'. not sure what was wrong with the earlier ones, but I'm certainly not complaining
[19:22:39] great
[19:22:42] woo!
[19:23:27] dhinus: I'd ask for more explanation from the first one.
The second one I'd probably just approve [19:45:29] andrewbogott: thanks [19:46:20] the alerts about cloudservices100[5-6] are caused by the test project "T353829test3" [19:46:37] "Exception: Unable to parse project=T353829test3" [19:46:50] for some reason it's crashing labs-ip-alias-dump.py [19:51:48] bd808: the user reply here is not very convincing, I'm tempted to decline and suggest they host their tool somewhere else, WDYT? https://toolsadmin.wikimedia.org/tools/membership/status/1619 [19:52:49] weird, [19:52:52] I'll delete it [19:53:09] or, actually, I did already [19:53:43] let me try restart that systemd unit [19:54:10] oh it's a timer [19:55:10] but not very frequent [19:55:28] It's every 30 I think? I was going to just wait for it to re-run [19:56:09] dhinus: It is not convincing at all. This person has no history as a member of the Wikimedia community to convince me that there is any longterm connection from their hobby project and the movement. I think your instinct is correct. [19:56:17] andrewbogott: every 60, I've triggered manually and it looks fine [19:56:23] ok! [19:56:31] I wonder if it didn't like the capital T [19:56:45] maybe, I read somewhere the name should be all-lowercase [19:56:56] bd808: thanks, I'll decline [19:58:34] dhinus: related, for ToprakM I would personally take the history at https://meta.wikimedia.org/w/index.php?title=Special%3ACentralAuth&target=ToprakM as enough to approve. 48K edits to their home wiki is more than enough for me to believe they will do good things. [20:00:56] yes I was also impressed by the history, but I've already replied asking for feedback following andrewbogott's suggestion :) I think it's fair to ask for a tiny bit more detail... 
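The capital-T guess above could be captured in a small validation check. The exact rule that `labs-ip-alias-dump.py` enforces isn't shown in the log, so the regex below is an assumption based on the "names should be all-lowercase" convention mentioned in the conversation:

```python
import re

# Hypothetical validator mirroring the all-lowercase naming convention
# discussed above; the real parser's rule may be stricter or different.
PROJECT_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]*$")

def is_valid_project_name(name: str) -> bool:
    """True if the name is lowercase alphanumeric (hyphens allowed)."""
    return bool(PROJECT_NAME_RE.match(name))

print(is_valid_project_name("adiutor"))       # True
print(is_valid_project_name("T353829test3"))  # False: leading capital T
```

Running a check like this in the create_project cookbook before calling keystone would reject names like "T353829test3" up front instead of letting downstream scripts crash on them.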
[20:02:34] agreed that it is fine to ask for a bit more
[20:03:30] I just generally don't ask for more from accounts with community standing of a few hundred edits or more and a year or more of existence
[20:05:32] that sounds reasonable, I wonder if we should update the guidelines at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Users_and_community
[20:09:30] could do :)
[20:09:32] * bd808 lunch
[20:14:55] Jon's new https://wikipediayir.netlify.app/ tool can now do your wikitech year in review. I had 443 edits in 2023 apparently.
[20:35:40] nice! I did 139 edits, and BryanDavis was my biggest fan :)
[20:36:59] 691 edits :-P
[21:29:24] taavi: Tricia's said 513 edits, so I would guess that you have beaten everyone except Stashbot. :)
[21:32:26] stashbot: you had 76,248 edits on wikitech this year. Congratulations!
[21:32:26] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[21:40:04] bd808: looks like it, although with a surprisingly tiny margin :-P https://quarry.wmcloud.org/query/78875
[21:41:55] nice. Looks like Onfirebot could use a bot flag too.
[21:42:44] * bd808 gave Onfirebot a bot flag
[21:49:01] https://wikitech.wikimedia.org/wiki/Special:Contributions/Meno25 got into the top 10 by running a double redirect fix pywikibot script in January 🤣