[09:35:25] (SystemdUnitFailed) firing: prometheus-mysqld-exporter.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:25] (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:35] I also fixed db2186 ^
[09:40:41] even if it didn't alert yet
[09:44:17] <3
[09:54:38] marostegui: arnaudb are you doing stuff on s3? I need to reclone db2177 from another replica
[09:54:45] (corrupted)
[09:54:51] Amir1: I am recloning a host in s3
[09:54:58] but it must be almost done, let me check
[09:54:59] Amir1: everything ok on my end
[09:55:11] marostegui: ping me once done :P
[09:55:16] schema update is running on s2 (fyi)
[09:55:24] arnaudb: https://phabricator.wikimedia.org/T357189#9582671 can you update the progress on that schema change? it is useful to coordinate things
[09:55:34] sure!
[09:55:37] thanks
[09:55:45] Amir1: I am done, let me bring back db2156
[09:55:59] Amir1: you can use that one, as it is depooled already
[09:56:09] sure thing
[09:56:28] Amir1: it runs 10.4 so if you are recloning to a 10.6 you will need to run mysql_upgrade
[09:56:40] done marostegui
[09:56:43] thanks
[09:56:59] from what I'm seeing both are 10.4
[09:57:07] cool
[09:57:36] if you run the current clone cookbook it runs mysql_upgrade
[09:57:48] by default?
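(Editor's note on the 10.4 vs 10.6 exchange above: whether mysql_upgrade is needed after a reclone depends only on whether the clone source runs an older MariaDB series than the target host. A minimal sketch of that check, assuming a hypothetical helper `needs_mysql_upgrade` — this is an illustration, not the actual clone cookbook's logic.)

```python
def needs_mysql_upgrade(source_version: str, target_version: str) -> bool:
    """Return True when a host recloned from `source_version` must run
    mysql_upgrade because the target runs a newer MariaDB series.

    Hypothetical helper for illustration: compares the major.minor
    series only, e.g. cloning a 10.4 replica onto a 10.6 host needs an
    upgrade, while a 10.4 -> 10.4 reclone (as in the chat) does not.
    """
    def series(version: str) -> tuple:
        # "10.4.22" -> (10, 4); patch level is irrelevant here
        return tuple(int(part) for part in version.split(".")[:2])

    return series(target_version) > series(source_version)

print(needs_mysql_upgrade("10.4", "10.6"))  # True
print(needs_mysql_upgrade("10.4", "10.4"))  # False
```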
[09:57:49] nice
[09:58:02] it was even your idea afair :D
[09:58:07] haha
[09:58:18] I don't even remember what I did last week, too many things at the same time XD
[09:59:14] I had no idea about it either
[09:59:36] jokes aside, this is the second time s3 is getting db corruption
[09:59:54] some hosts seemed laggy when I ran the schema update, was it correlated ?
[10:00:07] 2105 and 2108 from the top of my head but maybe more
[10:00:27] I think it was the mistake I did in the reclone where it immediately asked for the position after stop slave
[10:00:40] now it has a one second sleep
[10:00:40] ah the sleep thingy
[10:00:44] yeah
[10:01:24] it's showing up in pagelinks which was getting a lot of writes due to the maint script
[10:08:23] Amir1: once you are done with db2156 merge this please: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007277
[10:08:31] (and once it is all green in icinga ofc)
[10:08:43] sure
[10:08:50] ta
[10:37:07] I am an idiot and I just realised I didn't record the coordinates of db2156 master binlog, so I need to reclone again. Amir1 let me know when you are done with db2156
[10:37:16] And please do not merge the puppet change above
[10:46:09] Sure thing.
[12:32:17] marostegui: I'm done :P
[12:32:25] \o/
[12:32:27] thanks
[12:32:28] is it up?
[12:32:58] it looks up to me
[14:36:06] 2024-02-28 14:33:57.348432 db-mysql db2190 -N -e "use azwiki; SET SESSION sql_log_bin=0; ALTER TABLE pagelinks DROP PRIMARY KEY, ADD PRIMARY KEY (pl_from, pl_target_id);"
[14:36:06] ERROR 1062 (23000) at line 1: Duplicate entry '677278-542604' for key 'PRIMARY'
[14:36:16] db2190 needs reclone too :(((
[14:36:31] marostegui: let me know once you're done with db2156
[14:39:17] hahaha
[14:39:19] will do
[14:39:34] Amir1: have you actually checked if that entry is real?
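(Editor's note on "have you actually checked if that entry is real?": the ERROR 1062 above means at least one (pl_from, pl_target_id) pair exists more than once on db2190, so the new PRIMARY KEY cannot be built. A minimal sketch of such a check, assuming the rows were already fetched with something like `SELECT pl_from, pl_target_id FROM pagelinks` — `find_duplicate_keys` is a hypothetical helper for illustration, not a tool mentioned in the chat.)

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Return (key, count) pairs appearing more than once.

    rows: iterable of (pl_from, pl_target_id) tuples. A clean replica
    yields an empty list; any hit means ALTER TABLE ... ADD PRIMARY KEY
    (pl_from, pl_target_id) will fail with ERROR 1062, as seen above.
    """
    counts = Counter(rows)
    return sorted((key, n) for key, n in counts.items() if n > 1)

# The duplicate entry '677278-542604' from the error would surface as:
rows = [(677278, 542604), (677278, 542604), (1, 2)]
print(find_duplicate_keys(rows))  # [((677278, 542604), 2)]
```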
[14:40:09] Amir1: I am now done with db2156, I am sanitizing the sanitarium host but that will take a bunch of hours anyway, so you can take db2156
[14:40:20] not actually but this doesn't happen in other replicas
[15:41:05] topranks hosts for T355871 are ready! downtimed for 40mins
[15:41:10] (and depooled)
[15:41:16] T355871: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871
[15:41:20] arnaudb: nice thanks!
[16:13:32] well done topranks ! repooling nodes :)
[16:13:38] arnaudb: all done on our side you can repool those hosts when you're ready