[07:36:44] PROBLEM - MariaDB sustained replica lag on s7 on db1170 is CRITICAL: 20 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1170&var-port=9104
[07:36:52] PROBLEM - MariaDB sustained replica lag on s7 on db1227 is CRITICAL: 9.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1227&var-port=9104
[07:38:44] RECOVERY - MariaDB sustained replica lag on s7 on db1170 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1170&var-port=9104
[07:38:52] RECOVERY - MariaDB sustained replica lag on s7 on db1227 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1227&var-port=9104
[13:10:11] I think I killed two (and counting) replicas of s7 in codfw with back to back schema changes
[13:18:32] Amir is become death, destroyer of databases
[13:44:07] arnaudb: is db2100 your provisioning?
[13:44:26] seems old 🤔
[13:44:30] let me check
[13:45:35] nope it's been here for 3 years (a dbstoredbstore_multiinstance)
[13:47:21] ah okay
[13:47:26] thanks
[13:47:54] the second one that broke was because pagelinks in metawiki was partitioned
[13:48:06] db2168
[13:48:20] hm
[13:48:27] rings a bell, sec
[13:49:13] I've seen that for enwikinews (s3) for half of replicas too
[13:49:22] I must have failed to run a schema change on it and repooled it too early (21-02-24)
[13:49:28] oof
[13:49:58] the partition is an old thing, way before your time, don't worry.
[13:51:10] should we run a schema update to smooth it over?
[13:51:30] or maybe pick a half and clone it over the over?
[13:51:33] the other**
[13:52:00] nah, I just remove the partition and call it a day
[15:19:16] Anyone fancy a +1 on https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1014532 please? I've tested the new container builds OK
[15:27:43] {{done}}
[15:30:54] thanks :)
[17:12:17] jynus: hi, db2100 is down and it's a backup source, wanna take a look?
[17:14:29] db2100?
[17:14:37] it must be a not yet provisioned host
[17:15:11] I haven't provisioned those yet, it will be done next Q
[17:15:28] arnau*db above says it's a three year old db
[17:15:29] > nope it's been here for 3 years (a dbstoredbstore_multiinstance)
[17:15:42] ah, then it is the older ones
[17:16:31] is there a ticket or should I create one?
[17:17:15] when backup sources crash they have to be reprovisioned from backups
[17:18:22] I haven't created anything yet, sorry, a chaotic day
[17:18:22] https://phabricator.wikimedia.org/T361037
[17:32:06] it was a memory issue, I think
[17:32:25] there are 3 banks maped out!
[17:36:40] Amir1, arnaudb the above will take long, as it is currently showing 3 memory sticks as bad, so I will disable alerts on puppet
[17:36:54] then send it to dcops for a possible fix
[17:47:45] https://phabricator.wikimedia.org/T361037#9662658
[17:49:13] on the upside, I think it shouldn't break any upcoming codfw backups afaics
[21:03:50] https://usercontent.irccloud-cdn.com/file/0cDgJpON/image.png
[21:04:01] That's me pushing all sorts of schema changes last minute mwhahahahaha