[03:58:42] !log tools.masto-collab Updated from e044fa2 to 7fbf277
[03:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.masto-collab/SAL
[06:57:06] !log tools.masto-collab Updated from 7fbf277 to 0631e5f
[06:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.masto-collab/SAL
[11:05:33] https://guc.toolforge.org/?by=date&user=Krinkle
[11:05:34] > Error: Database error: Unable to connect to meta.web.db.svc.eqiad.wmflabs
[11:05:44] Is this an expected side-effect of the current replica issue?
[11:07:47] * Krinkle is reminded of T176886
[11:08:01] T176886: Update meta_p database for new service names. https://phabricator.wikimedia.org/T176886
[11:08:02] T176886: Update meta_p database for new service names - https://phabricator.wikimedia.org/T176886
[11:49:29] isn’t .wmflabs an outdated domain names
[11:49:30] !log quarry deployed https://github.com/toolforge/quarry/pull/22 on 3 prod servers
[11:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[11:49:50] (*name?)
[11:52:10] Krinkle: I think the workaround in https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/guc/+/refs/heads/master/src/App.php#79 might need to be updated again
[11:52:39] with .web.db.svc.wikimedia.cloud
[11:53:07] and https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/guc/+/refs/heads/master/api.php#60 too
[12:00:53] Yeah, domain gets changed every couple of years. Toolserver, toollabs, wmflabs, toolforge.org/, .toolforge.org, what else did we have? 😆 (re @lucaswerkmeister: isn’t .wmflabs an outdated domain name?)
[12:01:18] pretty sure toolforge.org/ never existed
[12:02:25] Dunno, some tools had 5 different urls by now. I thought that was the last move we had
[12:03:10] Might have been wmflabs
[12:10:23] tools.wmflabs.org/ was directly followed by .toolforge.org
[12:19:43] Could rebuilding the replicas cause tools to error with "MySQL server has gone away" (i.e. all XTools pages)? Are any of the db servers being depooled?
[12:20:34] I'm on my phone and can't do much debugging atm
[12:45:45] Krinkle: I made T337682 for the guc error
[12:45:48] https://replag.toolforge.org/ doesn't show the full info it usually does. Maybe that's the same or related issue of why XTools is down, since it queries the replag db on every request?
[12:46:44] lucaswerkmeister - so `.web.db.svc.eqiad.wmflabs` needs to be `.web.db.svc.wikimedia.cloud` now?
[12:48:20] Lucas_WMDE: The DNS code for wiki replicas suggests that the wmflabs name is still meant to work at https://gerrit.wikimedia.org/g/operations/puppet/+/e732b82e2c8003b5144f19eeb5ba5843f668f698/modules/openstack/files/util/wikireplica_dns.yaml#58
[12:48:47] Unless it was intentionally removed, should probably work still :)
[12:49:09] https://codesearch.wmcloud.org/search/?q=svc.eqiad.wmflabs&files=&excludeFiles=&repos=
[12:49:42] Krinkle: FYI with guc.toolforge.org having db issues, musikanimal is mentioning above a similar(?) issue with xtools?
[12:50:00] XTools uses the new hostnames I believe
[12:50:02] https://xtools.wmcloud.org/articleinfo/en.wikipedia.org/Aeshna_isoceles - this responds in 0.043 seconds to tell me that "the requested information took too long to process (timeout 900 seconds)"
[12:50:25] https://phabricator.wikimedia.org/T337674
[12:50:32] looks unrelated
[12:50:33] but..
[12:50:41] the error is also clearly lying as it cant' be a timeout
[12:50:41] It's the "MySQL server has gone away" error, just that usually means the query timed out hence the error message
[12:50:52] so could be the same issue in theory.
[12:51:05] I'd rather not fix GUC again and instead go for having the hostnames in a working way in meta_p
[12:51:12] but maybe that's too difficult?
[12:51:49] more tools at https://github.com/search?q=org%3Awikimedia%20svc.eqiad.wmflabs&type=code
[12:51:56] * Krinkle adds labs/tools/* to Codesearch
[12:54:17] The old hostnames stopped working weeks ago, this I'm certain as all of our tools broke. They may have revived them and now they're gone again too, not sure... Anyway different issue from XTools I would think
[12:55:12] well, it seems GUC works until yesterday. But maybe my error is also lying then.
[12:55:19] maybe the hostname is fine and its' failing for a different reason
[12:58:44] https://xtools.wmcloud.org/ by itself gives the "MySQL has gone away" error, and that shouldn't query anything except the replag db. So I think that's the issue for XTools, maybe. I'm almost at my hotel and will debug more
[12:58:53] Anyway we have a lot of upset ppl right now :(
[13:00:58] Lucas_WMDE: I've added you to tools.guc maintainers. Want to try deploying it? See: https://gerrit.wikimedia.org/g/labs/tools/guc#deploy-changes (you can skip the composer step)
[13:04:35] ok, I’ll try
[13:04:45] (fyi Lucas_WMDE isn’t working today, I won’t see those pings until tomorrow)
[13:05:33] ooh, spicy shell prompt in tools.guc
[13:06:35] if I’m not doing the composer step then I suppose I can do the git pull outside the webservice shell too
[13:07:33] !log tools.guc deployed 83837bdcab (update db hostnames from .eqiad.wmflabs to .wikimedia.cloud; also many l10n updates)
[13:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.guc/SAL
[13:07:58] ok, same error :(
[13:24:23] I get `ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11` when trying to connect to `s7.analytics.db.svc.wikimedia.cloud`, so I'm assuming that's why `meta_p` isn't working
[13:25:46] ha, I had just discovered the same thing
[13:26:18] so yes, same issue for both XTools and GUC. In the case of XTools, it needs to query `metawiki_p.wiki`
[13:26:53] same error for `mysql --defaults-file=$HOME/replica.my.cnf -h metawiki.web.db.svc.wikimedia.cloud`
[13:27:20] sounds like we need a task
[13:27:49] be my guest, you've done a bit more debugging than me :)
[13:28:13] okay, I'm going to hijack https://phabricator.wikimedia.org/T337682
[13:28:18] I assume it's all T337446 fallout, and once that's resolved it'll All Be Fine(tm)
[13:28:19] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[13:29:53] @lucaswerkmeister Thanks!
[13:30:04] TheresNoTime: yeah, I missed that meta_p is in one of the affected sections
[13:44:12] all database things are a dark art to me, I just type things in until it works, *really* breaks or I get told to stop (:
[14:35:29] no longer an `ERROR 2013 ...` when connecting to `s7.analytics.db.svc.wikimedia.cloud` (yay!), but a more relatable `ERROR 1044 (42000): Access denied for user ...` - I note s7 is fully recloned (https://phabricator.wikimedia.org/T337446#8886795)
[15:13:29] !log tools.lexeme-forms deployed a5e90a0e02 (update dependencies)
[15:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[16:05:03] TheresNoTime: I wonder if grants or something are missing then
[16:05:41] But I'm like you, despite previously maintaining a replicated DB. They are black magic to me.
[16:40:21] great day to work on improving error handling in all of our tools 🙃
[16:48:37] Did you already find the root cause of all the incidents? (re @wmtelegram_bot: great day to work on improving error handling in all of our tools 🙃)
[16:49:26] unfortunately I'm not a MariaDB database whisperer :p
[17:15:37] The symptoms sounded quite familiar to a long running problem TheDJ told me about. Might be worth comparing notes to confirm or to rule out. (re @wmtelegram_bot: unfortunately I'm not a MariaDB database whisperer :p)