[09:30:58] mmhh https://sal.toolforge.org/ is down (?)
[09:53:46] !log taavi@tools-bastion-12 tools.sal toolforge webservice restart
[09:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[09:55:00] !log lucaswerkmeister@tools-sgebastion-10 tools.sal webservice restart # "server stopped by UID = 0 PID = 0" at 9:53:26 according to error.log, with plenty of /tmp/lighttpd-php.sock connection errors in the log before then
[09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[10:24:07] !log tools point toolserver.org DNS to tools-legacy-redirector-2 T311909
[10:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:24:11] T311909: Upgrade Toolforge legacy URL redirectors to Debian Bullseye or later - https://phabricator.wikimedia.org/T311909
[13:20:41] !log taavi@tools-bastion-13 tools.wikibugs toolforge jobs restart irc
[13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[13:20:47] bd808: it did it again :(
[13:21:27] Grrrrrr
[14:16:41] taavi@metricsinfra-puppetserver-1:~$ sudo chown -R gitpuppet:gitpuppet /srv/git/operations/puppet/
[14:17:56] hmm it'd help if I was reading the latest logs
[14:20:00] !log h2o@tools-sgebastion-10 tools.stewardbots ./stewardbots/StewardBot/manage.sh restart # RC reader not reading RC
[14:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[14:22:22] !log anticomposite@tools-sgebastion-10 tools.stewardbots SULWatcher/manage.sh restart # SULWatchers disconnected
[14:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[15:06:05] taavi: The testing deployment at wikibugs-testing has again survived overnight without needing a restart while running the same code as the main deployment. The irc job there now has 4d15h of uptime. I guess this reinforces the "chaotic network disruption" theory I have had in my head?
[15:07:34] It seems like if the root problem was noisy neighbors on the Redis service itself it would have an equal chance of affecting both deployments.
[15:14:44] According to https://meta.wikimedia.org/wiki/Data_dumps, the XML dumps include the Short URL (w.wiki) mappings, but I don't see anything in /mnt/nfs/dumps-clouddumps1001.wikimedia.org that looks like the right files.
[15:14:53] What directory name should I be looking for?
[15:20:34] roy649: try /public/dumps/public/other/shorturls/
[15:21:20] Ah, thanks.
[15:21:21] other!
[15:21:36] I tried every combination of url and short I could think of.
[15:21:41] Didn't think of "other" :-)
[15:22:13] https://dumps.wikimedia.org/ helps exploring the data structure
[16:21:54] https://qrcode-generator.toolforge.org/
[16:21:54] internal server error. is Jayprakash12345 here?
[16:22:36] I have to run out now but if it needs an extra maintainer I'm happy to help investigate this later.
[16:25:11] https://meta.wikimedia.org/w/index.php?title=Indic-TechCom/Management&action=history I see no human edits in 3 years. other tabs that I checked even longer. (re @jeremy_b: https://qrcode-generator.toolforge.org/
[16:25:11] internal server error. is Jayprakash12345 here?)
[16:26:35] or @mahir256, Srishakatux (re @jeremy_b: https://qrcode-generator.toolforge.org/
[16:26:36] internal server error. is Jayprakash12345 here?)
[16:35:29] having a look now, but no eta yet on when it will return (re @jeremy_b: or @mahir256, Srishakatux)
[17:16:14] ok, lmk if you want help. (re @mahir256: having a look now, but no eta yet on when it will return)
[20:50:00] Hello! I have a new VM that closes connection on SSH. `Connection closed by UNKNOWN port 65535`. Console logs show failures starting sssd unlike other hosts I have created today. Any advise on what to try next?
[20:50:33] s/advise/advice/
[20:50:51] which VM?
[20:51:04] logging-logstash-03.logging.eqiad1.wikimedia.cloud
[20:51:51] ok, at least it lets me log in as root
[20:52:12] the first puppet run failed: https://phabricator.wikimedia.org/P58930
[20:53:07] cwhite: this is replacing an existing instance with the same name, right?
[20:54:30] taavi: I say "yes" because I've recreated it a few times because I hit T280243
[20:54:30] T280243: openstack: Failed to create DNS entry for new instance, designate error 'Managed records may not be updated' - https://phabricator.wikimedia.org/T280243
[20:56:29] Thanks for the quick reply on that one, btw. Sorry to hear the purge triggered designate issues. :(
[20:59:09] this specific issue you're seeing with VM replacements is most likely caused by our Puppet 7 upgrade.. I manually fixed logging-logstash-03 which is now running Puppet to change the config needed to let you in, looking to fix the actual issue now..
[20:59:36] cwhite: try now?
[21:01:36] That did the trick. Thank you! I probably saw the same puppet problem when I tried creating it as logging-logstash-01. That one didn't show designate problems, but tried multiple times to get a certificate from puppet.
[21:02:12] is that still broken?
[21:02:42] I've since deleted that host (-01).
[21:07:16] ok, seems like we forgot to port the access rules required to delete certs from deleted VMs from puppet 5 to puppet 7. I'll fix that tomorrow (or poke a.ndrewbogott to do that)
[21:08:09] Cheers and thanks again :)
[21:33:17] * bd808 waves to cwhite and mumbles "we should do lunch one of these days"
[21:53:44] * cwhite waves back at bd808, now hungry for some grinders
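
For reference, a minimal sketch of the Toolforge restart commands behind the !log entries above, assuming you are a maintainer of the tool in question and are connected to a Toolforge bastion; the tool and job names are simply the ones appearing in this log:

    ssh login.toolforge.org          # connect to a Toolforge bastion
    become sal                       # switch to the tool account
    toolforge webservice restart     # restart the tool's web service (09:53:46 entry)
    exit                             # back to your own account on the bastion

    become wikibugs
    toolforge jobs restart irc       # restart the jobs-framework job named "irc" (13:20:41 entry)

The matching SAL entries above were produced via the !log command in IRC rather than from the bastion shell.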