[01:08:39] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 10 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:09:03] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 25.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:13:27] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:15:57] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[04:47:33] What is up with all the alerts :-/
[05:09:23] I have cleaned up all the alerts - I am planning to spend the day pretty much catching up with email, if there's something urgent for me to look at, please let me know (I am recloning db1128 meanwhile)
[06:10:24] welcome back marostegui
[06:11:46] hi arnaudb !!
[08:11:22] arnaudb Amir1 jynus I am going to repeat this durding today's meeting, but just so you can organize your week, remember that Thursday morning (9 AM UTC) all database maintenance needs to be finished/stopped in preparation for the DC switch
[08:11:26] *during
[08:15:09] noted!
[14:00:19] topranks: Did you see https://phabricator.wikimedia.org/T344259#9152935 by chance? Any thoughts?
[14:00:41] it seems like we're reaching the desperation phase of this ticket, and I'm not sure where else to go :)
[14:01:00] I'm just joining a meeting
[14:01:05] no worries
[14:01:18] I think we can be certain that it won't work at 1G when booted to debian, given it doesn't from live debian environment
[14:01:33] 1G uses all 4 pairs in the RJ45, 100Mb only 2
[14:01:49] So usually this scenario means bad cable, but it could be the port on server or SFP on switch
[14:01:58] did we try connecting a laptop to the server? did it work at 1G?
[14:02:13] I don't think we did, no.
[14:02:31] worth a shot - also moving to a new switch port
[14:02:37] I'll update the ticket.
[14:02:51] We can reimage at 100Mb but we'll be diagnosing the same issue post-install I think
[14:03:15] I didn't realize that you'd tried it at 1G after booting into the live environment
[14:04:04] And I've been operating under the assumption that this might be the same issue as https://phabricator.wikimedia.org/T340055
[14:05:06] AFAIK, that host continues to work fine at 1G, but it had been unable to DHCP boot, with the same symptoms here (that the link would drop as soon as it tried)
[14:05:10] it could be, what's unusual is we've so many of the FS.com 1G optics with same hardware and no issues
[14:06:03] we encountered at least 1 more during that upgrade (both were "fixed" by using a Wave2Wave optics SFP-T)
[14:06:32] (not an option in eqiad)
[14:06:54] is Emperor not on IRC?
[14:07:50] nope, only a cardboard cutout
[14:07:57] oh
[14:08:24] my IRC client bounced twice overnight, I reauthed this morning but hadn't spotted the nick change
[14:08:38] np