[07:46:52] @lucas I don't think they have a max runtime
[08:23:31] then I’ll have to debug some more, thanks
[13:54:58] !log admin recreating the codfw1dev galera cluster according to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Galera -- mariadb is stopped (and won't start) on all three cloudcontrol nodes
[13:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:58:04] andrewbogott: what is happening with mariadb?
[13:58:15] (from that message it seems it's down?)
[13:58:50] dcaro: I don't know why it crashed but I'm recreating the cluster. Not sure yet if that's going to go well.
[13:58:58] (and eating breakfast, back in a few)
[14:05:08] back
[14:05:42] cloudcontrol2004-dev seems to be hanging trying to cluster
[14:05:53] let's see if something happened with port access...
[14:15:58] dcaro: I'm trying to get 2004-dev to connect but it seems to be hanging. If you want to look at 2001-dev I'm not doing anything there. 2005-dev is the primary and seems to be happy (apart from not talking to other nodes)
[14:19:28] andrewbogott: okok, I'll take a look
[14:22:46] Jul 31 13:53:24 cloudcontrol2004-dev mariadbd[3926610]: 2023-07-31 13:53:24 0 [ERROR] WSREP: ./gcs/src/gcs_group.cpp:group_post_state_exchange():434: Reversing history: 354702166 -> 354702158, this member has applied 8 more events than the primary component.Data loss is possible. Must abort.
[14:23:09] it seems that there's some partitioning happening
[14:24:32] yeah
[14:24:48] let me make sure there's some actual data on 2005-dev...
[14:25:19] there is.
[14:25:24] let me know if you want me to do/check anything
[14:25:30] So maybe we just reset the other two nodes?
[14:25:34] I'm not sure yet.
[14:25:57] we can try starting it from 2004 first (maybe it has the latest version of the DB)
[14:26:40] if the split did happen already, then we can only reset xd, (and lose a bit of data)
[14:27:22] yeah, ok, I'll stop 2005 and see if we can restart the cluster from 2004
[14:38:20] dcaro: resetting the cluster from 2004-dev seems to have worked.
[14:38:30] Now we see if it crashes again
[14:41:52] 🤞
[16:10:15] !log tools.stashbot Updated to 1b5686e (T342666)
[16:11:47] !log tools.stashbot Updated to 1b5686e (T342666)
[16:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[16:11:51] T342666: tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666
[16:12:41] dhinus: ^ In theory stashbot is ready for the new message sender. Thanks for the patch!
[16:15:44] bd808: thanks! will test it in a sec
[16:16:34] yay it worked! https://sal.toolforge.org/admin
[16:16:46] can we have the new bot voiced in -feed too?
[16:16:57] and is there a reason why some messages are done with that and some via wm-bot?
[16:16:58] we need to register the new nickname
[16:17:16] wm-bot uses a custom logger in a wmcs-specific class
[16:17:40] in theory I would like to migrate those log messages to the standard spicerack logger, which now supports logging to cloud
[16:18:05] more details at T325756
[16:18:05] T325756: Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756
[19:03:52] it was just a good old out-of-memory error after all (re @lucaswerkmeister: I’m seeing some jobs getting killed that I can’t really explain otherwise)
[19:03:56] https://phabricator.wikimedia.org/T342519#9056738 if anyone’s particularly curious
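
For context on the "maybe it has the latest version of the DB" exchange at 14:25-14:27: when a Galera cluster refuses to re-form because one member is ahead of the primary component (the "Reversing history" error quoted at 14:22:46), the usual way to decide which node to bootstrap from is to compare the committed seqno recorded in each node's grastate.dat. The sketch below only illustrates that comparison and is not the procedure from the linked wiki page; the local file paths are hypothetical (on a real host the file lives at /var/lib/mysql/grastate.dat), and a seqno of -1 means the node stopped uncleanly and would need mysqld --wsrep-recover to report its real position.

    #!/usr/bin/env python3
    # Rough sketch: pick a Galera bootstrap candidate by comparing the
    # committed seqno in each node's grastate.dat.
    from pathlib import Path

    # Hypothetical local copies of each node's grastate.dat
    GRASTATE_FILES = {
        "cloudcontrol2001-dev": Path("grastate.cloudcontrol2001-dev"),
        "cloudcontrol2004-dev": Path("grastate.cloudcontrol2004-dev"),
        "cloudcontrol2005-dev": Path("grastate.cloudcontrol2005-dev"),
    }

    def read_seqno(path: Path) -> int:
        # grastate.dat contains a line like "seqno:   354702166";
        # -1 means the node shut down uncleanly.
        for line in path.read_text().splitlines():
            if line.startswith("seqno:"):
                return int(line.split(":", 1)[1].strip())
        raise ValueError(f"no seqno line in {path}")

    def main() -> None:
        seqnos = {node: read_seqno(path) for node, path in GRASTATE_FILES.items()}
        for node, seqno in sorted(seqnos.items()):
            note = " (unclean shutdown, needs mysqld --wsrep-recover)" if seqno == -1 else ""
            print(f"{node}: seqno {seqno}{note}")
        candidate = max(seqnos, key=seqnos.get)
        print(f"bootstrap candidate (most advanced): {candidate}")

    if __name__ == "__main__":
        main()

On the stashbot/SAL thread (T342666, T325756): tcpircbot works by listening on a plain TCP port and relaying each line it receives into an IRC channel, which is what lets cookbooks and bots write to the SAL feed. A minimal client sketch follows, assuming a made-up host and port rather than the real production endpoint.

    #!/usr/bin/env python3
    # Minimal sketch of a tcpircbot-style client: send one line over TCP and
    # let the bot forward it to the channel. Host and port are placeholders,
    # not the production endpoint.
    import socket

    TCPIRCBOT_HOST = "tcpircbot.example.invalid"  # placeholder
    TCPIRCBOT_PORT = 9200                         # placeholder

    def log_to_sal(message: str) -> None:
        with socket.create_connection((TCPIRCBOT_HOST, TCPIRCBOT_PORT), timeout=5) as sock:
            sock.sendall(message.encode("utf-8") + b"\n")

    if __name__ == "__main__":
        log_to_sal("!log tools.example testing the new cloud feed logger")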