[02:37:36] new idea while we suffer through replag: T321640
[02:37:58] 18:38:22 <-- stashbot (~stashbot@wikimedia/bot/stashbot) has quit (Ping timeout: 260 seconds)
[02:38:42] > but only older tools have them these days (yay?).
[02:39:49] I've added one to one of mine, probably should add it to signatures as well
[02:39:55] !log tools.stashbot restarted, didn't rejoin after a ping timeout
[02:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[02:40:21] T321640
[02:40:21] T321640: Create embeddable version of replag tool for other tools - https://phabricator.wikimedia.org/T321640
[02:40:24] there we go
[02:40:34] AntiComposite: yeah, I was hoping to take the lazy way out
[02:41:41] I was refreshing https://linkcount.toolforge.org/ earlier, thinking the job queue had broken before realizing it was just replag
[02:43:15] huh, signatures is the tool I put the replag banner in. wonder why it's not working
[02:43:16] # FIXME: re-implement replag detector (#58)
[02:43:18] oh
[02:44:21] // TODO: is replag something we still need to care about? meh
[02:44:21] "{}; data as of ~~~~~.\n",
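(For context on the banner being discussed: replag on Toolforge can be read from the heartbeat_p.heartbeat view on the wiki replicas. Below is a minimal sketch of such a check, assuming the section-alias hostnames and the per-tool replica.my.cnf credentials file documented on wikitech; it is not the signatures tool's actual implementation.)

```python
# A minimal sketch of a replag check for a Toolforge tool, in the spirit of the
# banner discussed above (not the signatures tool's actual code). It assumes the
# heartbeat_p.heartbeat view on the wiki replicas and the per-tool
# ~/replica.my.cnf credentials file; verify both against the wikitech docs.
import os

import pymysql


def get_replag(section="s1"):
    """Return replication lag in seconds for a replica section, or None."""
    conn = pymysql.connect(
        # section-alias hostname; an assumption to keep the example short
        host=f"{section}.web.db.svc.wikimedia.cloud",
        database="heartbeat_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT lag FROM heartbeat WHERE shard = %s", (section,))
            row = cur.fetchone()
            return float(row[0]) if row else None
    finally:
        conn.close()


if __name__ == "__main__":
    lag = get_replag("s1")
    if lag is not None and lag > 300:  # e.g. warn at five minutes of lag
        print(f"replag banner: s1 is ~{lag:.0f}s behind; data may be stale")
```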
[07:41:20] !log admin-monitoring deleted leaked nova-fullstack VMs
[07:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL
[08:45:10] !log tools depooling and rebooting tools-sgeexec-10-22 to get nfs scratch working again
[08:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:09:37] !log admin running wmcs-novastats-dnsleaks in delete mode
[09:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:34:37] !log admin running wmcs-puppetcertleaks in delete mode
[09:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:34:43] \o I built a new VM yesterday and have a PR with a new role for said machine, but I can't get pcc to work. The error I am getting is:
[10:34:49] [ 2022-10-26T10:34:07 ] CRITICAL: Unexpected error running run_host on wikilabels-database-02.wikilabels.eqiad1.wikimedia.cloud: Unable to find fact file for: wikilabels-database-02.wikilabels.eqiad1.wikimedia.cloud under directory /var/lib/catalog-differ/puppet
[10:35:01] I'm a bit stumped about how to fix that.
[10:45:41] klausman: you may need to sync facts to the PCC builders, perhaps
[10:46:03] I've tried https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Manually_update_cloud but the first command fails with no access (conn closed) and the second doesn't seem to help
[10:46:06] although I believe that was done automagically in the latest iteration?
[10:46:42] taavi mentioned that some stuff is synced nightly, but the VM was built yesterday (CEST), so I figured that would have happened
[10:47:29] klausman: I'm running the first command. I do have access
[10:47:38] thanks!
[11:05:00] arturo: did the run complete? I've tested just now and I'm still getting the same error.
[11:05:35] klausman: I got a timeout, let me try again, paying more attention this time
[11:16:29] !log admin rsyncing /srv/labsdb from clouddb1002 to tools-db-2, launched from a screen session in tools-db-2. this will take a few hours. T301949
[11:56:41] arturo: any news?
[11:58:22] klausman: yeah
[11:58:26] it failed :-(
[11:58:40] https://www.irccloud.com/pastebin/j1t138Z8/
[11:59:06] per that URL there seems to be some missing info in hiera https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Manually_update_cloud
[12:00:07] I wonder if the nightlies have also failed for similar reasons
[12:00:45] I won't have a lot more time to investigate this today, I'm sorry
[12:01:25] Sure. Can you point me to someone I can pester with this?
[12:01:47] Otherwise I'll ask in #w-sre
[12:03:39] klausman: I'd talk to jbond
[12:03:53] merci!
[12:16:00] klausman: have you followed https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Manually_update_cloud as that should fix the issue
[12:16:30] !log testlabs created VMs arturo-gre-test-1/2
[12:16:31] The first command I can't run, I just get conn closed
[12:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[12:16:50] the second command exits with 0, but didn't seem to make a difference
[12:17:20] As far as I know, this project (wikilabels) does not match the "own puppet master" scenario
[12:17:58] arturo ran the first command but got an error (https://www.irccloud.com/pastebin/j1t138Z8/)
[12:20:56] klausman: ack, looking
[12:21:54] klausman: what change are you trying to test?
[12:22:07] 849095
[12:22:26] i.e. `pcc 849095 wikilabels-database-02.wikilabels.eqiad1.wikimedia.cloud` (plus user/auth token)
[12:23:13] The change itself might be broken, but I'd expect a different error message for that
[12:23:34] Especially since atm, the VM does not have any roles assigned.
[12:23:47] (I hope that's not the root cause here)
[12:26:01] klausman: ack, I'll take a look, it seems a more general issue with cloud pcc
[12:26:10] thank you!
[12:27:06] np
[14:30:29] jbond: on a whim, I just ran PCC again, and it worked!
[14:31:34] klausman: yes, it's slowly processing all the cloud nodes, but yes, yours has already been done now
[14:42:48] and now all cloud hosts have completed
[14:55:30] thanks again!
[16:34:37] hi! can i find more information anywhere about the current replag which seems to be now >26h (https://replag.toolforge.org/)?
[16:35:28] lustiger_seth: the channel topic
[16:36:05] hihi, oops, thanks! :-)
[17:20:14] Amir1: is there a reasonable way to add T321562 so that folks reaching that page from replag.toolforge.org can see this not-quite-db-maintenance issue?
[17:20:14] T321562: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562
[17:21:44] bd808: at first I thought not but now thinking I can do something. On phone now. Will do it in half an hour
[17:22:03] Amir1: <3
[18:31:55] does someone have an idea why bots are restarting so frequently? I tried to find the reason using wmopbot logs, but I just found that the problem is related to the connection: the process continues to work but never receives the signal that the connection was closed
[18:33:56] !log tools.bridgebot Double IRC messages to other bridges
[18:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[18:34:57] (speak of the devil)
[19:05:47] !log tools.lexeme-forms deployed 55f9b203e5 (l10n updates: sl)
[19:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[20:31:28] !log tools.lexeme-forms deployed 2feba604c7 (update dependencies, use PEP 655 NotRequired)
[20:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[20:34:18] @lucaswerkmeister doubling
[20:34:36] !log tools.bridgebot Double IRC messages to other bridges
[20:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[20:37:40] btw, I ended up not making a script or alias for this command, because I can just get it from Ctrl+R "Double" ;) (re @lucaswerkmeister: ssh toolforge become bridgebot bb.sh restart "'Triple IRC messages to other bridges'")
[20:38:22] (so far my shell history has restart commands for “double”, “triple” and “quadruple” messages to other bridges, let’s see if we’ll ever see more than that…)
[20:38:52] 1.5x
[21:37:58] danilo: I think it is networking related, but I don't know if it is inside WMCS or between WMCS and libera.chat. I think I saw that taavi made a phab task to track the issue, but I can't find it right now. Something has definitely been different since early October.
[21:46:05] bd808: it is https://phabricator.wikimedia.org/T320975 (without a description yet), I also noted that wm-bot does not have those issues, so I suspect it is related to something wm-bot does not use
[21:50:02] danilo: *nod* I think that taavi is agreeing by naming it a Toolforge problem too. wm-bot lives on a Cloud VPS instance in its own project. That means it shares the Cloud VPS network with things living in the Toolforge project, but Toolforge's Kubernetes cluster adds yet another software-defined network layer for packets to transit.
[21:51:05] danilo: if you have concrete timing-related data that you can add to the ticket, especially anything that might help narrow in on when this started, that could be helpful.
[21:55:00] ok, I will search the logs for when exactly that started and add it to the task
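(For context: the failure mode danilo describes, a TCP connection that dies without the bot ever receiving a close event, is why IRC bots typically run their own ping-timeout watchdog. Below is a minimal sketch of that pattern, assuming a hypothetical bot on libera.chat; it is not wmopbot's or bridgebot's actual code.)

```python
# A minimal sketch of a ping-timeout watchdog for an IRC bot. The point is the
# one made in the discussion above: when the link dies silently, recv() never
# returns an error, so the bot must notice the silence itself and reconnect.
import socket
import time

SERVER, PORT, NICK = "irc.libera.chat", 6667, "examplebot"  # hypothetical bot
PING_INTERVAL = 60    # seconds of silence before we send our own PING
SILENCE_LIMIT = 180   # seconds of silence before we declare the link dead


def run_forever():
    while True:  # outer loop: reconnect whenever the link is declared dead
        try:
            sock = socket.create_connection((SERVER, PORT), timeout=PING_INTERVAL)
            sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())
            last_seen = time.time()
            while True:
                try:
                    data = sock.recv(4096)
                    if not data:
                        break  # server closed the connection cleanly
                    last_seen = time.time()
                    if data.startswith(b"PING"):  # naive: assumes PING arrives alone
                        sock.sendall(b"PONG" + data[4:])
                except socket.timeout:
                    # No traffic for PING_INTERVAL seconds. The socket still looks
                    # open, but the connection may already be gone: probe it.
                    if time.time() - last_seen > SILENCE_LIMIT:
                        break  # nothing heard for too long: assume a dead link
                    sock.sendall(f"PING :{NICK}\r\n".encode())
            sock.close()
        except OSError:
            pass  # connect or send failed; fall through and retry
        time.sleep(10)  # back off briefly before reconnecting


if __name__ == "__main__":
    run_forever()
```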