[01:11:20] andrewbogott: If you are still about, could you arm keyholder on tools-cumin-1.tools.eqiad1.wikimedia.cloud pretty please. I'm hoping to test a theory about T385847, but cumin go boom without the keys in keyholder.
[01:11:21] T385847: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847
[01:12:10] bd808: yep, 5 minutes
[01:20:29] bah, no pointer to the pwd on https://wikitech.wikimedia.org/wiki/Keyholder
[01:22:25] bd808: try now?
[01:23:01] andrewbogott: you got it. thank you
[01:23:17] was the password on the puppetserver for me to find in the first place?
[01:23:22] lmk if you need me to cycle k8s workers
[01:23:42] yes, labeled 'cumin ssh key password' which I tried on a whim and it worked
[01:24:09] nice
[01:26:31] andrewbogott: if you have time to babysit restarting tools-k8s-worker-nfs-7.tools.eqiad1.wikimedia.cloud that might fix T385847
[01:26:32] T385847: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847
[01:27:35] bd808: ok, it's draining now
[01:28:11] thx
[01:33:44] it's back up
[01:38:04] andrewbogott: that fixed the NSS lookup problem. :) Now to see if things are happier in lucaswerkmeister's test tool
[01:38:11] cool
[01:43:35] I think that restart fixed things, but I'll let Lucas confirm.
[01:43:48] * bd808 wanders off towards the smell of dinner
[12:39:32] dhinus: setting aside the issue of why it's alerting in the first place... is there a reason why I always get dnsleak alert emails but never get a recovery email? do we not send recovery emails at all?
[12:52:53] andrewbogott: hmm good question, I see the RESOLVED messages in -cloud-feed but not in emails
[12:56:03] I don't see any RESOLVED emails at all sent from "sre-observability@wikimedia.org", while I can find many from root@wmflabs.org via lists.wikimedia.org
[12:56:35] (ignore the "via lists.wikimedia.org", wrong copy/paste)
[12:56:50] so it looks like the prod alertmanager is not sending the RESOLVED, while the cloud alertmanager is
[12:57:10] it seems like alert and resolve would follow the same config
[12:58:11] maybe they disabled "resolve" to reduce alert spam?
[12:58:15] let me see
[12:59:11] I'm trying to think of whether it is useful or not useful to get resolved emails... I think it is! Especially if I'm afk and rushing home to deal with a problem, seems good to know that it's no longer a problem
[13:00:12] send_resolved: false
[13:00:16] that explains it :)
[13:00:31] huh
[13:00:32] it can be changed just for our alerts
[13:00:39] +1 for doing that
[13:01:50] let's try it at least, and then maybe we'll learn why it was like that :)
[13:04:25] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1118110
[13:05:48] * dhinus lunch, back later
[14:08:35] Raymond_Ndibe: I'm available to review patches, send them my way if you need extra reviews
[17:52:53] arturo: thoughts about that tools-db alert? In the 5 minutes before you go to the airport?
[17:54:22] dhinus: if still around, something seems to be happening with toolsdb
[17:54:37] ...or maybe just with the monitoring, hm
[17:57:35] yeah, they're both read-only right now
[18:01:12] * andrewbogott remembers that both of them are on airplanes
[18:04:45] andrewbogott: I will be on a 2h long train ride in 2h. I can take a look later, but now I'm on the move
[18:05:14] arturo: I think it's fixed for now, I'm just confused as to why the db would spontaneously switch to RO
[18:05:39] * andrewbogott logs T385900 to keep an eye on things
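(For reference, a rough sketch of what flipping the primary back to read-write looks like from a mariadb client session; these are generic MariaDB statements, not a ToolsDB-specific runbook:)

    -- Check whether the server is currently refusing writes.
    SELECT @@global.read_only;

    -- If this host is the primary and looks healthy, re-enable writes.
    SET GLOBAL read_only = OFF;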
[20:11:34] ok, I'm now on the train with the laptop in front of me
[20:12:07] I would like stashbot to also react to ticket numbers in /me action messages
[20:12:08] T385900
[20:12:09] T385900: [toolsdb] tools-db-4 switched to read-only - https://phabricator.wikimedia.org/T385900
[22:02:41] sorry didn't see the ping, I can have a look now
[22:10:27] mariadb crashed and was automatically restarted by systemctl
[22:10:43] when it restarts, it's read-only by default
[22:11:14] it's a rare situation but it can happen. "SET GLOBAL read_only=OFF;" was the right thing to do, I believe
[22:11:28] I'll paste the journalctl logs to the phab task
[22:12:27] thanks for debugging
[22:12:36] is there any info about why it crashed?
[22:16:08] "Feb 07 17:33:16 tools-db-4 mysqld[874]: 2025-02-07 17:33:16 2880737 [Warning] InnoDB: A long wait (487 seconds) was observed for dict_sys.latch"
[22:16:12] not clear
[22:16:28] "mysqld[874]: Sorry, we probably made a mistake, and this is a bug."
[22:17:01] likely an obscure mariadb bug triggered by some operation
[22:19:03] the replica server is now back in sync, it took a while to catch up
[22:19:21] I suspect there was some big write activity on the primary, but I will check more carefully on Monday
[22:19:55] ping me if it crashes again
[22:22:31] thanks andrewbogott and arturo for attending the alerts
[22:23:49] * dhinus is not sure that is a correct sentence in english, but you get the idea :D
[22:26:32] thank you for showing up :-) I did not do anything, just stared at the laptop. It is true that I'm too tired at this time of the day :-S
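(A hedged sketch of the replication check implied above, i.e. confirming the replica has caught back up after the primary restart; this is a generic MariaDB command, with field names as MariaDB prints them:)

    -- On the replica: both threads should be running, and the lag
    -- should drop back to 0 as it catches up with the primary.
    -- Key fields: Slave_IO_Running = Yes, Slave_SQL_Running = Yes,
    --             Seconds_Behind_Master trending to 0.
    SHOW REPLICA STATUS\G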
[23:19:08] dhinus: it crashed again :(
[23:19:35] log is in the task, it looks like it's complaining about memory allocation, no idea if that's the real issue though
[23:32:44] :-(
[23:32:54] arturo, are you not on a jet?
[23:33:08] not jet* :-P
[23:33:36] I got off the train, but have to wait in an intermediate hotel until the morning
[23:33:59] ah, so in Madrid
[23:34:39] yeah
[23:34:53] The tracking task is T385900, and my approach is to just keep flipping it back to r/w until dhinus is available to consult, at which point I think we should try upgrading. It looks to me like this is a mariadb bug and not something we can configure our way out of.
[23:34:54] T385900: [toolsdb] tools-db-4 switched to read-only - https://phabricator.wikimedia.org/T385900
[23:35:12] did you check the Mem graph to see if the claim of not being able to allocate memory is true?
[23:36:02] no! I'm not sure I know how to do that on a VM.
[23:36:39] hmmm that's annoying... it's past midnight here so I'm not in the best position to help
[23:36:49] don't we have autogenerated grafana panels or something?
[23:36:50] I am guessing that in the 2 minutes that elapse between these log lines it is thrashing something fierce and running itself out of memory
[23:36:54] https://www.irccloud.com/pastebin/BuSpccE7/
[23:36:57] but that's just a guess
[23:37:33] arturo: we probably do, let me look
[23:38:09] dhinus: sorry for the late night ping! If you have time to think about this in the morning, ping me when you're ready and I'll keep myself available.
[23:38:14] you could try looking at the active connections, if there's something odd there, maybe some big queries
[23:38:50] well well https://grafana.wmcloud.org/d/PTtEnEyVk/toolsdb-mariadb?orgId=1
[23:39:23] "show processlist" after it's back to RW might give a lead
[23:39:32] but it's a long shot
[23:40:35] you can try upgrading from 10.6.19 to 10.6.20 but I'm not too confident it would help
[23:41:26] T385885
[23:41:27] T385885: [toolsdb] Remove apt pinning and upgrade to latest version - https://phabricator.wikimedia.org/T385885
[23:41:31] ...somehow that dashboard is not in utc?
[23:44:20] dhinus: I don't think I'm going to do anything this late in the day (it's not midnight here, but too late to start an upgrade).
[23:44:37] * arturo falling asleep
[23:44:39] yep makes sense
[23:44:53] I'll have a look tomorrow in the morning
[23:44:56] the innodb IO graph seems to presage the death, so I'll keep an eye on that before I go to sleep.
[23:44:58] thank you!
[23:45:15] thanks!
[23:45:48] arturo: you should go to sleep, there's nothing to be done immediately.
[23:46:16] In the near term, the fix is just to flip it back to r/w if you get paged, which you shouldn't for quite a while 'cause I'll get them for another few hours.
[23:47:30] you too, dhinus
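(A hedged follow-up to the "show processlist" suggestion above: a couple of generic MariaDB queries for spotting long-running or unusually big queries and rough memory pressure once the server is back up. MEMORY_USED is a MariaDB-specific processlist column; the Grafana graphs linked above remain the better source for host-level memory.)

    -- Active statements, longest-running first, with per-connection memory.
    SELECT id, user, db, time, memory_used, LEFT(info, 120) AS query
    FROM information_schema.processlist
    WHERE command <> 'Sleep'
    ORDER BY time DESC
    LIMIT 20;

    -- Rough view of InnoDB buffer pool pressure from the server itself.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';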