[09:09:24] I have switched over pc1 master
[09:17:15] pc1 codfw just failed
[09:20:05] yep
[09:20:07] expected
[09:31:51] root@ panicked :D SMART error (OfflineUncorrectableSector) detected on host: pc1011
[09:32:30] rip
[09:32:37] it is still reimaging, we'll see what happens when it comes back
[09:36:41] https://usercontent.irccloud-cdn.com/file/uMikvJ30/image.png
[09:36:49] sobanski: jobo ^
[09:37:32] Amir1: wasn't me!
[09:37:40] Well aware of this one ;)
[09:38:03] :D
[09:38:38] so the RAID is in optimal state and there are no errors reported on any disk :)
[09:39:05] not even for the disk that was reported
[09:58:29] ok, pc1011 back as pc1 master, heartbeat clean, pc1014 reconfigured to replicate from pc1011
[09:58:34] pc1011 is running bullseye
[09:58:39] if you see something strange with pc1, let me know
[10:29:53] I am starting to pool db1128 with very low weight on s1, if you see something strange let me know
[10:31:29] grant issues, depooling
[10:36:05] it was all coming from mwmaint1002, it is fixed now, so repooling again
[10:49:23] I need to reboot dbmonitor1002, the remaining users of tendril-legacy.wikimedia.org are probably negligible, so I'd just do it right away?
[10:49:48] (to apply the new KVM machine type needed for the Ganeti/eqiad Buster update)
[10:49:51] moritzm: go for it
[10:49:59] ack, doing that now
[10:57:38] done
[10:58:02] dbcorch1001 also needs a restart, does that also need to be synced in some way?
[10:58:11] dborch1001
[10:59:40] moritzm: no, it should be fine to just restart it
[11:01:58] ok! I'll do that now then, if that's fine with everyone?
[11:02:05] fine by me
[11:02:12] 👍
[11:02:20] ack, doing that now
[11:10:26] it's back up
[11:11:21] orchestrator is up too
[11:43:35] marostegui: found this https://gerrit.wikimedia.org/r/c/mediawiki/core/+/753440 :D
[11:44:30] oh nice, less queries!
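The "heartbeat clean" check above refers to pt-heartbeat-style replication tracking: the master periodically writes a timestamp row to a heartbeat table, and a replica's lag is the age of the last row it has replicated. A minimal sketch of that computation (the timestamp format and function name are illustrative assumptions, not the actual tooling):

```python
from datetime import datetime, timezone

def heartbeat_lag(heartbeat_ts: str, now: datetime) -> float:
    # Replication lag in seconds: the age of the most recent heartbeat
    # row replicated from the master (assumed ISO-formatted, UTC).
    ts = datetime.fromisoformat(heartbeat_ts).replace(tzinfo=timezone.utc)
    return (now - ts).total_seconds()

# Example: the last replicated heartbeat row is 3 seconds old
now = datetime(2022, 1, 13, 9, 58, 29, tzinfo=timezone.utc)
print(heartbeat_lag("2022-01-13T09:58:26", now))  # → 3.0
```

A heartbeat-based measurement works even when the SQL thread is idle, which is why it is preferred over Seconds_Behind_Master for monitoring.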
[13:59:35] db1128 is on Bullseye and it is now serving on s1 with normal weight. If you notice something strange: dbctl instance db1128 depool ; dbctl config commit -m "Depooling db1128"
[20:28:04] PROBLEM - MariaDB sustained replica lag on s4 on db2119 is CRITICAL: 15.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2119&var-port=9104
[20:29:04] PROBLEM - MariaDB sustained replica lag on s4 on db2137 is CRITICAL: 55.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13314
[20:30:46] PROBLEM - MariaDB sustained replica lag on s4 on db1142 is CRITICAL: 46.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1142&var-port=9104
[20:30:46] PROBLEM - MariaDB sustained replica lag on s4 on db1148 is CRITICAL: 49.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1148&var-port=9104
[20:30:48] PROBLEM - MariaDB sustained replica lag on s8 on db1172 is CRITICAL: 64.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1172&var-port=9104
[20:31:48] PROBLEM - MariaDB sustained replica lag on s8 on db1111 is CRITICAL: 72.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1111&var-port=9104
[20:33:26] RECOVERY - MariaDB sustained replica lag on s4 on db1142 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1142&var-port=9104
[20:33:28] RECOVERY - MariaDB sustained replica lag on s8 on db1172 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1172&var-port=9104
[20:33:32] PROBLEM - MariaDB sustained replica lag on s8 on db1104 is CRITICAL: 41.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1104&var-port=9104
[20:33:36] PROBLEM - MariaDB sustained replica lag on s4 on db2106 is CRITICAL: 40.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2106&var-port=9104
[20:34:14] PROBLEM - MariaDB sustained replica lag on s8 on db2081 is CRITICAL: 19.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2081&var-port=9104
[20:34:30] RECOVERY - MariaDB sustained replica lag on s8 on db1111 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1111&var-port=9104
[20:34:54] RECOVERY - MariaDB sustained replica lag on s8 on db1104 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1104&var-port=9104
[20:35:42] RECOVERY - MariaDB sustained replica lag on s8 on db2081 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2081&var-port=9104
[20:36:16] PROBLEM - MariaDB sustained replica lag on s4 on db2095 is CRITICAL: 440.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2095&var-port=13314
[20:37:54] RECOVERY - MariaDB sustained replica lag on s4 on db1148 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1148&var-port=9104
[20:38:02] RECOVERY - MariaDB sustained replica lag on s4 on db2106 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2106&var-port=9104
[20:39:02] RECOVERY - MariaDB sustained replica lag on s4 on db2119 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2119&var-port=9104
[20:40:28] RECOVERY - MariaDB sustained replica lag on s4 on db2137 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13314
[20:41:06] RECOVERY - MariaDB sustained replica lag on s4 on db2095 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2095&var-port=13314
[21:09:32] what happened??
[21:10:42] it seems it aligned with a deployment, there is talk about it on -operations
[21:10:53] T299095
[21:10:54] train got rolled back, apparently
[21:10:54] T299095: Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases. - https://phabricator.wikimedia.org/T299095
[21:11:40] is it still ongoing?
[21:13:27] no
[21:14:18] maybe they could add Amir's dashboard to scap, to abort early if long-running queries start, or something like that?
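The alert bot messages above encode their thresholds inline: "55.6 ge 2" means the measured sustained lag (55.6 s) is greater than or equal to the 2 s critical threshold, and recoveries print "(C)2 ge (W)1 ge 0.4", showing the critical and warning levels alongside the current value. A sketch of that classification logic (the function is illustrative, not the actual check's code):

```python
def lag_status(lag: float, warn: float = 1.0, crit: float = 2.0) -> str:
    # Classify a sustained replica lag reading the way the alert output
    # reads it: CRITICAL if lag >= crit, WARNING if lag >= warn, else OK.
    if lag >= crit:
        return "CRITICAL"
    if lag >= warn:
        return "WARNING"
    return "OK"

print(lag_status(55.6))  # → CRITICAL  (db2137's reading at 20:29:04)
print(lag_status(0.4))   # → OK        (db1142's reading on recovery)
```

The "sustained" in the check name matters: the value is an average over a window rather than an instantaneous reading, so short replication hiccups do not page anyone.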
[21:18:17] I have commented on the task with some more timing details
[21:19:51] marostegui: shouldn't you go rest? I am around anyway, and there are probably no immediate actions to take :-)
[21:29:39] the only s4 host still pending lag recovery was db2139, but that was because of backups, and it is finishing now
[21:35:26] There's a risky patch in the train that likely caused it
[21:36:31] https://phabricator.wikimedia.org/T293958#7612230