[10:53:13] I just can't find the user docs about Cinder volumes on wikitech
[10:53:45] ok found it https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances
[11:07:10] legoktm: hello
[11:14:34] Hi hauskater
[11:16:19] PM incoming
[12:00:46] !log paws implemented limits to storage. Cleared about half the storage used T327936
[12:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:00:49] T327936: Limit paws storage - https://phabricator.wikimedia.org/T327936
[13:36:32] !log admin restarting nova services in eqiad1, trying to free up db connections
[13:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:41:32] arturo: did you try [[Help:Cinder]]? I added a redirect using that title several months ago because I too was struggling to find the docs under the more use case centered title.
[15:42:32] bd808: I didn't try. Also the search bar will just redirect to https://wikitech.wikimedia.org/wiki/Cinder
[15:43:24] *nod* that [[Cinder]] page does at least have a see also link to the end user docs.
[15:56:05] !log tools reboot tools-k8s-worker-79, bunch of procs in D state because NFS hiccup
[15:56:36] (Stashbot is MIA, so won't actually log)
[15:56:45] will log later again
[15:56:46] thanks
[16:11:18] !log tools Hard reboot of tools-sgebastion-11 via Horizon (yes, I know the logging bot is down)
[16:11:56] !log tools hard reboot of tools-sgecron-2
[16:12:26] bd808: could you please kick stashbot back into service?
[16:12:44] do we have known working k8s nodes to pin it too?
[16:13:08] they mostly work, despite any incoming reboot
[16:13:16] no need to pin I think
[16:14:31] !log tools rebooted a bunch of nodes to cleanup D procs and high load avg because NFS outage (result of T316544)
[16:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:14:35] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[16:15:05] !log tools Hard reboot of tools-sgebastion-11 via Horizon (done circa 16:11Z)
[16:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:23:41] !log tools rebooting tools-sgeweblight-10-26 (T316544)
[16:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:23:44] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[16:32:00] !log tools rebooting tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud (T316544)
[16:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:32:03] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[16:39:46] !log tools rebooting tools-sgeweblight-10-17 (T316544)
[16:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:39:49] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[16:40:38] !log tools rebooting tools-sgeweblight-10-22 (T316544)
[16:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:41:05] !log tools rebooting tools-sgeweblight-10-32 (T316544)
[16:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:42:20] !log tools rebooting tools-sgeexec-10-14 (T316544)
[16:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:42:55] !log tools rebooting tools-sgeexec-10-16 (T316544)
[16:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:43:15] !log tools rebooting tools-sgeexec-10-18 (T316544)
[16:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:43:45] !log tools rebooting tools-sgeweblight-10-30 (T316544)
[16:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:44:24] !log tools rebooting tools-sgeweblight-10-16 (T316544)
[16:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:44:52] !log tools rebooting tools-sgewebgen-10-2 (T316544)
[16:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:44:55] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[16:45:23] !log tools rebooting tools-sgeweblight-10-24 (T316544)
[16:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:45:56] !log tools rebooting tools-sgeexec-10-8 (T316544)
[16:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:47:38] !log tools rebooting tools-sgeexec-10-19 (T316544)
[16:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:48:43] !log tools rebooting tools-sgeexec-10-21 (T316544)
[16:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:50:16] !log tools rebooting tools-sgeexec-10-17 (T316544)
[16:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:50:19] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[16:51:50] !log tools rebooting tools-sgeweblight-10-28 (T316544)
[16:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:52:14] !log tools rebooting tools-sgeexec-10-22 (T316544)
[16:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:52:46] !log tools rebooting tools-sgeweblight-10-21 (T316544)
[16:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:53:04] !log tools rebooting tools-sgeweblight-10-20 (T316544)
[16:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:53:34] !log tools rebooting tools-sgeweblight-10-25 (T316544)
[16:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:53:50] !log tools rebooting tools-sgeweblight-10-18 (T316544)
[16:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:55:19] !log tools rebooting tools-sgeexec-10-20 (T316544)
[16:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:55:22] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[17:00:48] !log tools rebooting tools-sgegrid-master (T316544)
[17:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:00:52] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[17:01:50] qstat says "No route to host"
[17:03:05] Wurgl: expected, grid control node rebooting
[17:03:16] okay
[17:06:02] !log tools rebooting tools-sgegrid-shadow (T316544)
[17:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:06:05] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[17:06:29] Wurgl: can you try again now?
[17:06:43] Works again. Thanks
[17:17:44] !log tools Rebuilding bullseye and buster docker containers to pick up openssh-client package addition (T258841)
[17:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:17:47] T258841: Add ssh-client support to Kubernetes containers - https://phabricator.wikimedia.org/T258841
[17:43:27] dcaro: I received `error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": got send timeout` on Mon, 15 May 2023 17:41:10 +0000 (roughly two minutes ago). Is that expected?
[17:45:32] two minutes ago? that should not have happened :/, looking
[17:46:41] I'm still receiving new ones, fwiw.
[17:51:59] from which node?
[17:53:21] error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": got send timeout
[17:54:04] dcaro: https://phabricator.wikimedia.org/P48233 is a complete message incl. headers if that's helps
[17:54:06] Me too: error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": got send timeout
[17:55:20] emails get stacked up in the mail service and keep trickling out for hours after a big grid outage
[17:56:35] You are right. Mail was generated at 16:43:09 and sent a few minutes ago
[17:58:24] oh yep, stuck email
[18:02:13] it might be possible to speed up the mail delivery by telling to "force" deliver.. not sure if it makes it better though https://wikitech.wikimedia.org/wiki/Exim#force_delivery_attempt
[18:02:25] just cause I remembered this from prod mx outage
[18:16:20] mutante: or maybe dropping the queue? might harm some legitimate mails, but...
[18:16:47] my mailbox looks like this now (mails stopped coming fortunately) https://usercontent.irccloud-cdn.com/file/AQvDc2oF/image.png
[18:20:19] Seems like wikibugs may need a kick
[18:21:41] sorry for the spam yep
[18:22:18] !log tools clear mail queue
[18:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:24:17] urbanecm: I have a couple hundred of these https://ttm.sh/WM3.png
[18:24:43] yeah, me too (plus a lot of them deleted already)
[18:25:06] well, not exactly those
[18:25:28] but hundreds for sure. disadvantage of tools submitting a job fairly regularly :-/
[18:26:07] I don't even maintain any tools on the grid - this is root@ spam for existing jobs that the grid has lost track of
[18:26:54] !log tools.wikibugs restart all containers
[18:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[18:27:25] MacFan4000: done, although the k8s cluster seems to be in a state that it might take a while to fix everything
[18:28:03] (k8s will generally though recover automatically if you restart all the nodes or something like that, while the grid needs lots of manual action to recover
[18:41:27] tools-static is down.
[18:41:33] For example, https://tools-static.wmflabs.org/cdnjs/ajax/libs/angular.js/1.8.2/angular.min.js
[18:46:43] !log tools Hard reboot tools-static-14 via Horizon per IRC report of unresponsive requests
[18:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:46:50] Iluvatar: ^^
[18:48:18] Thanks.
[20:41:15] !log tools rebooting frozen VMs: tools-k8s-worker-65, tools-sgeweblight-10-27, tools-k8s-worker-45, tools-k8s-worker-36, tools-sgewebgen-10-3 (fallout from earlier nfs outage)
[20:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:44:15] Andrew, https://phabricator.wikimedia.org/T336687#8851966 do you need to give something a kick to get that to work again?
[20:45:01] Oh, and I guess the (in)famous petscan screen also went down
[20:48:01] I'm not sure, I want to wait a bit and see what recovers from my reboots.
[20:51:36] My tool is up now. Thanks for the reboots
[21:22:46] Found it at https://wikitech.wikimedia.org/wiki/Nova_Resource:Petscan/SAL (no entry for today yet) (re @MaartenDammers: Oh, and I guess the (in)famous petscan screen also went down)
[21:25:36] @MaartenDammers: https://petscan.wmflabs.org/?psid=595690 is up for me
[21:26:13] I don't think that petscan should have been affected at all by the Toolforge NFS issues.
[21:26:52] In the Wikidata channel some people were complaining earlier. Wasn't getting any output around that time, but now it works again
[22:50:44] !log tools Rebuilding bullseye and buster docker containers to pick up make package addition (T320343)
[22:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:50:47] T320343: Include "make" in all images - https://phabricator.wikimedia.org/T320343