[00:13:44] @forzagreen: https://toolviews.toolforge.org/api/ can give you very high level usage stats in the form of "number of 2xx responses per day", but no granularity on which specific URLs are being hit.
[00:17:38] there is an infrastructure level prometheus cluster in Toolforge, but I don't think we have a documented way for a tool to interact with it.
[00:51:19] Can someone do a restart or something like that on petscan? It's offline...
[00:51:53] https://wikitech.wikimedia.org/wiki/Nova_Resource:Petscan#Starting_the_service
[00:53:47] Magnus is working on it https://wikis.world/@magnusmanske/113879148267213952
[00:54:53] Oh, ok. Thanks!
[10:21:44] !log lucaswerkmeister@tools-bastion-13 tools.lexeme-forms deployed 223cafa209 (l10n updates: ms)
[10:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[13:40:12] BTW shouldn't this be deleted on the Telegram side? (re @wmtelegram_bot: Hey guys... Joe Biden here. I've decided to step down from the White House to focus on other projects. Billion...)
[13:41:04] yeah, fair enough. (it was spam mirrored from IRC)
[13:43:12] thanks
[13:43:25] also looks like XTools is down
[14:48:51] XTools issue appears to be on the proxy level
[14:57:48] musikanimal: the VM "xtools-prod08" is in state "Reboot", but is not actually rebooting
[14:58:31] I would try shutoff+start of that VM, let me know if you want me to do it
[15:00:04] yeah I tried soft reboot and that happened
[15:00:45] I also tried hard rebooting xtools-prod09 which finished quickly, and went back to the same proxy error
[15:02:14] I checked in the proxy and I didn't find any errors, so my guess is the proxy->VM connection is failing somehow
[15:02:31] hmm okay, tried shutting off xtools-prod08 and it's still stuck in "reboot started" :/
[15:02:47] also FYI https://xtools-dev.wmcloud.org is working, part of the same project
[15:02:58] yep that's a different VM
[15:03:37] I'll try shutting down the prod VM from the hypervisor level
[15:03:58] okay, thank you :)
[15:04:48] the hypervisor lists the VM as "paused", which reminds me of T383583
[15:04:48] T383583: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583
[15:05:21] I'll try "openstack migrate" from a cloudcontrol
[15:05:49] any chance T384711 and T384642 could be related?
[15:05:59] since it sounds like both are some kind of “VM stuck in reboot” thing
[15:06:34] they sound related and it seems it happened around the same time
[15:08:46] "openstack migrate" fails with "Cannot migrate while it is in task_state reboot_started"
[15:09:31] "server stop" doesn't work either
[15:12:22] cc andrewbogott if you're around
[15:12:53] well thanks for looking into it. I have to catch a flight soon and will be AFK for ~15 hours or so
[15:13:39] did you try 'resume'?
[15:13:45] musikanimal: thanks, we'll do our best to resuscitate it :)
[15:13:54] andrewbogott: not yet, trying
[15:14:51] Failed to resume domain 'i-00086a2e'
[15:16:15] ok, I'll try a few things
[15:18:22] dhinus: for reference, I'm going to 'virsh destroy' on the cloudvirt, which should put it into error state. then I'm going to 'openstack server set --state active' and then 'openstack server reboot --hard'
[15:18:28] we'll see if that does anything :)
[15:18:40] ack
[15:19:13] oh wait, as soon as I did 'virsh destroy' the pending reboot took effect and now it sees to be up
[15:19:16] *seems to be
[15:20:23] now i'm migrating it anyway just to make sure everything is as it needs to be with the ceph connection...
[15:20:45] great
[15:20:51] ok all done, I can log in
[15:20:55] I'm checking if https://xtools.wmcloud.org/ comes back up
[15:21:19] You maybe know this: 'virsh destroy' sounds scary but all it ever does is destroy the running state, aka shut down the VM by surprise
[15:21:29] so harmless in most cases
[15:21:43] I seemed to remember, but I couldn't find a document confirming it, I'll add it somewhere to wikitech
[15:22:21] actually we have https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Fixing_an_instance_that_won't_reboot
[15:22:22] It's pretty unusual to need to run virsh commands directly, it's been years since I've needed it. But this cephmon thing seems to get us into a state that nova can't understand
[15:22:29] hopefully that was the last of it
[15:22:49] ok, that guide seems good :)
[15:23:23] yep
[15:25:32] lol my airbnb hosts ordered a new washing machine for our apartment but the photo on Amazon was absolutely not to scale, the machine only comes up to my knee :)
[15:25:38] We all feel like we're being pranked
[15:28:51] haha I didn't even know a washing machine of that size existed!
[15:29:04] what is this, a washing machine for ants?!
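For reference, the recovery sequence described at 15:18:22 can be sketched as a small dry-run helper. This is only an illustration, not the documented procedure: the three commands are the ones quoted in the log, but the function names, the host labels, and the assumption that the `openstack` commands are run from a cloudcontrol (as the earlier migrate attempt was) are not from the log. The domain `i-00086a2e` and the server name `xtools-prod08` appear in the log; that they refer to the same VM is inferred from context.

```python
import shlex
from typing import List, Tuple

Step = Tuple[str, List[str]]


def stuck_reboot_recovery_commands(libvirt_domain: str, server: str) -> List[Step]:
    """Return (where_to_run, argv) pairs for the sequence quoted in the log.

    Nothing is executed here; the helper only builds the command lines so
    they can be reviewed before anyone runs them by hand.
    """
    return [
        # On the hypervisor: drop the running state of the libvirt domain.
        ("cloudvirt", ["virsh", "destroy", libvirt_domain]),
        # Assumed to run from a cloudcontrol: reset the state nova has recorded.
        ("cloudcontrol", ["openstack", "server", "set", "--state", "active", server]),
        # Then ask nova for a hard reboot.
        ("cloudcontrol", ["openstack", "server", "reboot", "--hard", server]),
    ]


def render(steps: List[Step]) -> List[str]:
    """Format each step as a shell-quoted line prefixed with where it runs."""
    return [f"[{where}] {shlex.join(argv)}" for where, argv in steps]


if __name__ == "__main__":
    for line in render(stuck_reboot_recovery_commands("i-00086a2e", "xtools-prod08")):
        print(line)
```

In the incident itself only the first step was needed: the pending reboot took effect as soon as 'virsh destroy' ran, and the VM was then migrated as a precaution.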
[15:29:52] re: T384642 it seems a different issue, I can ssh as root but not as my user (after adding myself to the project)
[15:29:54] T384642: petscan5 unresponsive - https://phabricator.wikimedia.org/T384642
[15:29:59] I have just learned that the word 'lilliputian' is the same in English and Spanish
[15:30:50] I guess ssh-key-ldap-lookup must be failing
[15:31:03] that or sssd
[15:31:35] lookup works fine apparently
[15:31:47] check resolv.conf?
[15:31:54] anything interesting in auth.log?
[15:32:24] Petscan down?
[15:32:26] https://petscan.wmcloud.org/?language=tr&project=wikipedia&depth=9&categories=Vikiproje%20Formula%201&ns%5B0%5D=1&ns%5B1%5D=1&ns%5B4%5D=1&ns%5B5%5D=1&ns%5B6%5D=1&ns%5B7%5D=1&ns%5B8%5D=1&ns%5B9%5D=1&ns%5B10%5D=1&ns%5B11%5D=1&ns%5B12%5D=1&ns%5B13%5D=1&ns%5B14%5D=1&ns%5B15%5D=1&ns%5B100%5D=1&ns%5B101%5D=1&ns%5B102%5D=1&ns%5B103%5D=1&ns%5B828%5D=1&ns%5B829%5D=1&ns%5B2300%5D=1&ns%5B2301%5D=1&ns%5B2302%5D=1&ns%5B2303%5D=1&ns%5B2600%5D=1&interface_language=en
[15:33:03] Yetkin: we're looking into it
[15:33:28] taavi: nope, just "Failed publickey for fnegri"
[15:34:24] the fact that the petscan web app is also unresponsive makes me think there's some network thing
[15:35:54] "have you tried turning it off and on again?"
[15:36:02] T384642 mentions it was also stuck during reboot, but then it came back
[15:36:03] T384642: petscan5 unresponsive - https://phabricator.wikimedia.org/T384642
[15:36:07] maybe I'll try migrating it
[15:37:14] the washer: https://www.instagram.com/p/DFNqxDYRqQs <- it's hard to get a photo that shows the scale properly without a 1-year-old to pose next to it
[15:37:25] !log petscan openstack server migrate {petscan5_id} T384642
[15:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Petscan/SAL
[15:37:46] looks like you just logged in
[15:37:52] well I didn't have much hope, but the migrate fixed it
[15:38:09] maybe you spoke too soon when you said the other one was the last of it? :P
[15:38:09] I hope there were only two of those and not 100
[15:38:16] LOL
[15:38:21] I'm going to keep saying things like that all day clearly
[15:43:59] xtools is still down for me btw, is that known?
[15:44:16] (asking because the discussion above sounded like we were hoping for it to have come back)
[15:44:34] we were
[15:44:39] ok
[15:44:47] I was hoping to, but I see it's still down, same for petscan
[15:44:56] let me make sure the other xtools hosts are actually up...
[15:46:33] all three xtools VMs look healthy to me. Suppose the service doesn't start on boot?
[15:47:35] curl localhost on the vm returns 200 OK
[15:48:18] although returns a default apache page, not sure if that's expected
[15:49:16] dcaro: earlier today...
[15:49:21] https://www.irccloud.com/pastebin/Wn3XBSxT/
[15:49:30] not sure if that's the issue you saw or something else
[15:49:36] and of course doesn't help with xtools
[15:51:17] let me try that for petscan
[15:52:46] yep that fixed it for petscan!
[15:52:58] and with "that" I mean https://wikitech.wikimedia.org/wiki/Nova_Resource:Petscan#Starting_the_service
[15:54:27] for xtools, no idea
[15:54:49] lol, seems like petscan could really do that for itself
[15:56:35] * dhinus looks for ideas in https://wikitech.wikimedia.org/wiki/Tool:XTools
[16:09:45] andrewbogott: sup? Toolforge having issues? Need a hand?
[16:10:34] dcaro: I don't think toolforge is having issues that I know of, although I am running a 'find' to clear out .err files
[16:10:43] dcaro: did you get paged?
[16:12:12] no, you pinged me ~20min ago
[16:12:57] you're right, I did :( It was a typo
[16:13:06] sorry!
[16:13:25] all is well as far as I know, except for xtools which is not really our problem to fix
[16:14:01] okok awesome :), got a bit scared it was NFS/ceph or something, phew
[16:14:02] xd
[16:14:33] nope, i drained cloudcephosd1013 yesterday and it seems to be fine
[16:16:30] * dcaro vanishes back into a book
[16:22:01] re: xtools I see a lot of errors in /var/log/apache2/error.log
[16:28:27] oops I think the VM is out of disk space
[16:28:49] I'll try deleting some old logs
[16:29:39] ah-ha, that fixed it
[16:30:51] nice!
[16:36:00] dhinus: I'd like to duck out and finish my day after dark if nothing else is currently on fire
[16:36:49] nothing I'm aware of :)
[16:36:52] thanks for your help
[16:37:13] I posted some details on how we fixed xtools at T384711
[16:37:13] T384711: XTools is down - https://phabricator.wikimedia.org/T384711
[18:55:27] o/ So when trying to run lima-kilo, I end up with a bunch of pull access denied for images while running things in the default setup. Is there something Obvious I am missing?
[18:55:27] Seemingly for images such as toolsbeta-harbor.wmcloud.org/toolforge/calico, but I also see HARBOR_IP=192.168.5.15 =o
[19:18:56] https://www.youtube.com/playlist?list=PLvYZ7eFy-VsxFuVVxBVVSnNWJOg2WSVi1
[19:27:16] coincidentally, how would I go about getting op in this channel? ^^
[19:29:33] (IIUC https://meta.wikimedia.org/wiki/IRC/Instructions#ChanServ_commands has the commands that someone else™ would need to use, but I don’t know if there’s a usual process to request it)
[19:32:03] This channel uses ircservserv to manage the ACL. Ask someone with +F or +f, like b.d808, then someone can do a patch
[19:37:35] lucaswerkmeister: ^ that, so send a patch to https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/blob/main/channels/wikimedia-cloud.toml
[19:46:29] thx :)
[21:18:42] xtools is down again
[21:31:38] addsore heyo :) that HARBOR_IP is for the build service to put images in when users do builds, the other one (toolsbeta one) is the source for the images of all (well most) the toolforge/k8s system/platform deployments, you might find more help in the -admin channel, probably during EU hours
[22:14:21] It might be that the Calico image is not there anymore, but if so, it has been deleted in the last couple days :/
[22:14:31] Should be publicly pullable
[23:17:04] dcaro: I think I was having issues with multiple of the images
[23:17:31] dcaro: https://phabricator.wikimedia.org/P72389
[23:18:09] anyway, I'll be looking again another day :)
[23:25:10] Ack, left a comment there
[23:25:20] Have a good weekend!
[23:52:57] you too! ty
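The XTools outage at 16:28 turned out to be a full disk. A minimal sketch of the kind of check that would surface that earlier, assuming Python 3 is available on the VM: the function names, the 90% threshold and the 30-day cutoff are illustrative choices, and `/var/log/apache2` is used only because the log mentions `error.log` under that path. The sketch lists candidates and deletes nothing.

```python
import os
import shutil
import time
from typing import List


def disk_used_fraction(path: str = "/") -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def old_files(directory: str, max_age_days: float = 30.0) -> List[str]:
    """Paths of regular files directly in `directory` not modified for max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                found.append(entry.path)
    return sorted(found)


if __name__ == "__main__":
    log_dir = "/var/log/apache2"  # path taken from the log; adjust as needed
    fraction = disk_used_fraction("/")
    print(f"/ is {fraction:.0%} full")
    # Only report cleanup candidates when the disk is nearly full.
    if fraction > 0.90 and os.path.isdir(log_dir):
        for path in old_files(log_dir):
            print("candidate for cleanup:", path)
```

Listing rather than deleting keeps the decision with whoever is on the VM, which matches how the fix was done by hand in the log.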