[09:38:55] snapshots nowadays are taking ~11 hours, from 0 UTC. I may revert them to start earlier so they at least finish by 8 am
[09:40:33] maybe 7 UTC
[12:01:00] hi, following up on Friday's mail I'd go ahead and move dbproxy2001 and db2132 to Puppet 7 next?
[12:03:43] o/
[12:04:35] Hello everyone. I’m wondering if anyone has had time to look into our beta cluster database replication issue?
[12:16:47] Do we have anyone familiar with MySQL replication who can help us debug the replication problem on beta cluster?
[12:40:10] pmiazga: m.arostegui is on leave today (having had to do a bunch of work on Saturday), and A.mir1 is likewise OOO
[12:40:50] I'm expecting them both back tomorrow, though
[12:42:38] Worst case scenario the beta cluster will be broken till tomorrow - so far it looks like some (if not all) writes to DBs do not go through
[13:12:16] moritzm: sorry for the late response: I think you're good to go, I don't recall any blockers. Should we maybe postpone until tomorrow to be able to double confirm with m.arostegui?
[13:17:35] sure thing, there's no rush
[13:31:02] pmiazga: is there a phab item to refer to? I'll mention it at our team meeting today (but as I say, there may be no relevant expert around 'til tomorrow)
[13:32:31] Yes, there are a couple
[13:32:58] @Emperor: so this is the global one for beta cluster DB: https://phabricator.wikimedia.org/T358329
[13:33:07] We found out about that because one of the Jenkins jobs started to time out
[13:34:05] And the relationship is _ worked on https://phabricator.wikimedia.org/T358236 - after fixing the `addwiki.php` script we found out that the Jenkins job is failing - therefore TheresNoTime filed https://phabricator.wikimedia.org/T358329
[13:34:49] And then people started to complain that other things on beta cluster are also broken - for example https://phabricator.wikimedia.org/T358364, https://phabricator.wikimedia.org/T358367
[13:35:15] After some time we narrowed it down to “replication doesn’t work on beta cluster”
[13:35:36] https://phabricator.wikimedia.org/T358329 -> but this should be your main ticket
[13:35:52] We also tried restarting replication, didn’t help
[13:36:18] pmiazga: I am out today but...we do not maintain BETA cluster. I don't even have access to it. And quickly reading the first ticket, it looks to me like someone dropped the database on the slave, you can try to see if executing this on the slave will work: set session sql_log_bin=0; create database test2wiki; stop slave; start slave;
[13:38:00] pmiazga: If you did it 3 times on the slave (https://phabricator.wikimedia.org/T358329#9575922) you will need that 3 times, as long as replication keeps breaking
[13:38:06] But I am sorry, I am out today, I shouldn't be here
[13:38:28] marostegui: sure, sorry, when I pinged you I didn’t know you’re away
[13:38:44] pmiazga: No worries at all, you couldn't know :)
[13:38:45] Thanks for the tip, we will try that. Go and enjoy your time off
[14:08:48] (PuppetFailure) firing: Puppet has failed on an-redacteddb1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:13:02] (PuppetFailure) firing: Puppet has failed on an-redacteddb1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:43:31] is an-redacteddb1001 one of ours?
[19:47:28] it's data engineering
[19:47:41] WIP, was only installed earlier today
[20:51:28] Emperor: yeah it's not, I wonder why we are getting those here
[22:13:48] (PuppetFailure) firing: Puppet has failed on an-redacteddb1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure