[00:48:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:48:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:57] sigh
[07:38:40] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:41:47] it lost a race with the creation of a redirect.
[08:16:10] Amir1 arnaudb you'll need to follow this up with DCOps https://phabricator.wikimedia.org/T373417
[08:16:23] ack
[08:17:53] those are new hosts that were used yesterday to test Amir1's changes to db-switchover; I wanted to reimage them to leave them clean
[08:18:07] But they are all failing and booting up from disk
[08:31:50] arnaudb: what is the plan with db2176?
[08:33:45] it was behaving badly, so I figured let's reimage + upgrade it to rule out everything but the SQL data, then observe its behavior upon the next switchover
[08:34:27] ok, but if it will take long, don't leave it depooled for too long, or it will be forgotten
[08:34:45] ack
[08:49:48] PROBLEM - MariaDB sustained replica lag on s5 on db1161 is CRITICAL: 55.75 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1161&var-port=9104
[08:51:48] RECOVERY - MariaDB sustained replica lag on s5 on db1161 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1161&var-port=9104
[08:56:05] I've depooled db2124, which suddenly had 15h+ of lag
[08:57:12] and db2114?
[08:57:51] it does not appear in dbctl
[08:59:20] I will decommission db2114
[09:00:42] so, next steps for db2124?
[09:02:25] https://phabricator.wikimedia.org/P67869
[09:03:18] nothing on the processlist marostegui, I'd say restart repl
[09:04:18] That won't fix it :)
[09:04:41] What's the main issue you are seeing?
[09:05:45] Error 'Can't DROP COLUMN `cuc_actiontext`; check that it exists' on query. Default database: 'frwiki'. Query: 'ALTER TABLE cu_changes DROP cuc_actiontext' → I'd say then clone the host from a healthy one
[09:06:43] That would work, yes
[09:06:55] But where is all that coming from?
[09:07:27] https://phabricator.wikimedia.org/T370903
[09:08:11] right
[09:08:18] so was that alter executed on the master directly?
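For context, a minimal sketch of how the failing replicated statement can be surfaced on a lagging replica like db2124 (assuming direct access to its MariaDB shell; the P67869 paste above presumably shows similar output):

    -- Last_SQL_Error carries the "Can't DROP COLUMN" message quoted above,
    -- together with the binlog position replication is stuck at.
    SHOW SLAVE STATUS\G

    -- Confirm nothing is running locally that would otherwise explain the lag.
    SHOW PROCESSLIST;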
[09:10:02] I don't think so
[09:10:31] If replication broke, the only explanation is that it did
[09:10:53] So I think you need to talk to Amir1 about it
[09:11:15] In any case, recloning would have fixed it, but it would have taken long; I fixed it for now by recreating the column and letting the alter table go through replication
[09:11:45] ah indeed, a more elegant quickfix
[09:11:53] That alter table contained more drops, and they all come from replication (so it did come through replication)
[09:11:56] so they are also going to fail
[09:11:58] PROBLEM - MariaDB sustained replica lag on s6 on db2124 is CRITICAL: 5.679e+04 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2124&var-port=9104
[09:12:03] there you go
[09:12:22] do you want to fix those yourself?
[09:12:31] let's try
[09:12:39] Also, it is better to create a task to track those things
[09:12:47] So we can keep a record of issues
[09:12:53] ack
[09:13:07] So first, identify the wiki that has failed, then the table, and then the column that is being dropped
[09:13:59] frwiki.cuc_only_for_read_old for the current one
[09:14:18] and the table?
[09:14:30] cu_changes
[09:14:34] great
[09:14:43] so creating a dummy column with that name would work, but let's do it the right way
[09:14:55] Try to find the column definition for that column (check eqiad)
[09:19:22] ALTER TABLE cu_changes
[09:19:22] ADD COLUMN cuc_only_for_read_old TINYINT(1) DEFAULT NULL;
[09:19:22] given https://www.mediawiki.org/wiki/Extension:CheckUser/cu_changes_table it should be correct
[09:20:36] yeah, that should work, add set session sql_log_bin=0;
[09:20:45] ack
[09:21:09] so try that, and then stop slave; start slave; -- given that the schema change has 3 changes, it is very likely it will fail for the 3rd one, so you'll need to do the same operation for the column that will fail
[09:21:18] and I also guess it will fail for the other 3 wikis that s6 has
[09:21:33] yeah, I'll run the creation for all of them
[09:21:44] and track everything back in a ticket
[09:21:45] so you'd need my alter: set session sql_log_bin=0; alter table cu_changes add column `cuc_actiontext` varbinary(255) NOT NULL DEFAULT ''; and then your alter, and the other one you'd need to find
[09:21:58] ack!
[09:22:02] thanks for yours :p
[09:22:20] I can chain all the alters behind a single sql_log_bin=0, right?
[09:22:26] as long as I don't close the SQL shell
[09:23:21] yes you can
[09:23:38] This only works because the table is small enough; if it was bigger, your approach of recloning would have been better
[09:23:47] we also need to tell Amir1 off for running this on the master :p
[09:24:29] repl restarted
[09:24:37] (waiting for the next one to fail)
[09:24:38] let's wait for the next failure
[09:24:39] yeah
[09:24:51] * Amir1 reads up
[09:24:58] PROBLEM - MariaDB sustained replica lag on s6 on db2124 is CRITICAL: 5.759e+04 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2124&var-port=9104
[09:26:09] I didn't run it on the master
[09:26:14] what?
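Pulling the pieces above together, a minimal sketch of the frwiki fix-up, assuming the column definitions quoted in the discussion (cuc_actiontext from the first failure, cuc_only_for_read_old from the CheckUser table docs; whichever third column fails next gets the same treatment once its name is known):

    -- sql_log_bin=0 keeps these fix-up ALTERs out of the binlog so they are
    -- not replicated further; it is per-session, so keep the same shell open.
    SET SESSION sql_log_bin = 0;
    USE frwiki;
    ALTER TABLE cu_changes ADD COLUMN cuc_actiontext VARBINARY(255) NOT NULL DEFAULT '';
    ALTER TABLE cu_changes ADD COLUMN cuc_only_for_read_old TINYINT(1) DEFAULT NULL;
    -- Restart replication so the replicated ALTER ... DROP can now succeed.
    STOP SLAVE; START SLAVE;

The dummy columns only need to exist; the replicated schema change drops them again straight away, which is why this is cheaper than recloning for a table this small.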
[09:26:22] This came through replication
[09:26:33] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db2124:9104 has too large replication lag (16h 1m 22s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2124&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[09:26:59] I'm running the usual auto schema. Only replicas for now
[09:27:09] Right, I know what has happened: https://phabricator.wikimedia.org/T373174
[09:27:11] This explains it
[09:27:21] set session sql_log_bin=0; USE frwiki; ALTER TABLE cu_changes
[09:27:21] ADD COLUMN cuc_private TINYINT(1) NOT NULL DEFAULT 0;
[09:27:51] arnaudb: that works
[09:27:56] * arnaudb pastes
[09:28:17] To avoid stepping on each other's toes: https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance
[09:28:30] Yes, we shouldn't have done the switchover if there is a schema change running
[09:29:56] my SQL query does not work
[09:30:02] * arnaudb debugs
[09:30:20] debugged
[09:30:25] mispasted
[09:31:07] arnaudb: I'd suggest you downtime the host for 1h or so, to avoid noise
[09:31:11] ack
[09:31:22] sorry for the mistimed switchover :-(
[09:31:43] arnaudb: Always check the maintenance map, this could have resulted in a larger outage :(
[09:31:59] PROBLEM - MariaDB sustained replica lag on s6 on db2124 is CRITICAL: 5.796e+04 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2124&var-port=9104
[09:36:51] probably another wiki broke again ^
[09:36:53] it's red
[09:37:01] I'm on it
[09:37:09] the host is depooled, don't worry
[09:37:31] Amir1: yes, it will be needed for all the wikis on that host, we discussed it above
[09:38:17] thankfully s6 is only four wikis
[09:38:23] s3 would have been fun
[09:39:08] set session sql_log_bin=0; use jawiki; ALTER TABLE cu_changes ALTER TABLE cu_changes ADD COLUMN cuc_private TINYINT(1) NOT NULL DEFAULT 0; alter table cu_changes add column `cuc_actiontext` varbinary(255) NOT NULL DEFAULT ''; ALTER TABLE cu_changes ADD COLUMN cuc_only_for_read_old TINYINT(1) DEFAULT NULL;
[09:39:08] set session sql_log_bin=0; use ruwiki; ALTER TABLE cu_changes ALTER TABLE cu_changes ADD COLUMN cuc_private TINYINT(1) NOT NULL DEFAULT 0; alter table cu_changes add column `cuc_actiontext` varbinary(255) NOT NULL DEFAULT ''; ALTER TABLE cu_changes ADD COLUMN cuc_only_for_read_old TINYINT(1) DEFAULT NULL;
[09:39:38] arnaudb: ALTER TABLE cu_changes ALTER TABLE cu_changes ?
[09:39:52] ah, mispaste
[09:40:07] fixed, good catch
[09:40:53] arnaudb: I guess labswiki may need it too
[09:41:12] ack
[09:44:30] replication is catching up
[09:44:42] or not
[09:44:57] ah, the drop is running the other way around, that's why :D
[09:45:23] what do you mean?
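For reference, a de-duplicated version of the per-wiki paste above (the stray repeated "ALTER TABLE cu_changes" removed); jawiki shown as an example, with ruwiki and, if needed, labswiki handled identically apart from the USE statement:

    SET SESSION sql_log_bin = 0;
    USE jawiki;
    ALTER TABLE cu_changes ADD COLUMN cuc_private TINYINT(1) NOT NULL DEFAULT 0;
    ALTER TABLE cu_changes ADD COLUMN cuc_actiontext VARBINARY(255) NOT NULL DEFAULT '';
    ALTER TABLE cu_changes ADD COLUMN cuc_only_for_read_old TINYINT(1) DEFAULT NULL;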
[09:45:34] the replication is running the alter table to drop the columns
[09:45:39] so it takes a bit of time
[09:45:55] yes, as I said, if the tables are large (a lot larger than this), recloning is the best option
[09:46:06] In any case, please always check the maintenance map before doing switchovers
[09:46:10] it could have been a lot worse
[09:46:16] yep
[10:06:03] FIRING: MysqlReplicationLag: MySQL instance db2124:9104 has too large replication lag (5h 50m 51s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2124&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[10:14:33] Random note: is there a switchover in codfw waiting? I want to test the wmfmariadbpy changes
[10:16:10] yep, let me find you the task
[10:16:25] https://phabricator.wikimedia.org/T373330
[10:16:50] s4 is a bit scary
[10:17:13] * arnaudb checks for another
[10:17:35] arnaudb: what's the status of db2124? is it all fixed?
[10:17:53] yep, the replag was going down; it's yet to be repooled
[10:18:10] arnaudb: good, let's repool it when it is fixed
[10:19:57] Amir1: https://phabricator.wikimedia.org/T373175 → this should wait a bit though, to avoid doing 2 switchovers. This one was to prepare the network maintenance - T370852 (hold on, I'll tell you when in a bit)
[10:19:58] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852
[10:20:22] okay, sounds good
[10:23:22] RECOVERY - MariaDB sustained replica lag on s6 on db2124 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2124&var-port=9104
[10:30:18] 18 sept (https://phabricator.wikimedia.org/T373104) 😶
[10:30:46] I guess you can try before, but please update https://docs.google.com/spreadsheets/d/1Hel16vdZyVpev1zD13luaJoG0jC2hxahdNW1x5vC_iY/edit?gid=0#gid=0 if you can
[10:54:47] no worries, I will do an s5 one and revert it back
[10:54:58] (do my schema changes on it)
[12:03:27] marostegui: Arnaud was saying you wanted to talk to me about a server/switch move? for pc2015?
[12:06:08] topranks: I will talk to you in private
[14:07:21] marostegui: o/
[14:07:29] elukey: o/
[14:07:41] I'm working only in the afternoon this week; I saw your ping for https://phabricator.wikimedia.org/T373417
[14:07:50] still ongoing or did you resolve it?
[14:07:55] elukey: can you work with arnaud? I will be gone on friday for two months :)
[14:07:58] (still ongoing)
[14:08:16] marostegui: I knowwww, I am super sad
[14:08:22] I just saw jenn answered, so arnaudb should probably check what she suggested
[14:08:35] yes yes, exactly: --force-dhcp-tftp is the reimage option to avoid the issue
[14:08:43] arnaudb: ^ can you try that?
[14:09:54] TL;DR - for some mysterious reason, with lpxelinux a lot of NIC firmwares have trouble fetching stuff over HTTP (we instruct PXE to fetch the image etc. via HTTP from the apt nodes)
[14:10:14] with that option, the DHCP configuration created by reimage forces tftp only
[14:10:30] * arnaudb backlogs
[14:10:39] (so everything is fetched from the dc-local install server; a slower protocol, but it seems more stable)
[14:10:45] so far we haven't seen any issues with it
[14:11:56] for 10G nodes it seems to be the only way to reimage, IIUC
[14:13:15] on it!
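Closing the loop on the db2124 lag alerts above: a minimal sketch of how catch-up can be confirmed before repooling (assuming shell access to the replica, and assuming the pt-heartbeat data lives at heartbeat.heartbeat as the MysqlReplicationLagPtHeartbeat alert name suggests):

    -- Both replication threads should report Yes and Seconds_Behind_Master
    -- should be back near 0 before the host is repooled.
    SHOW SLAVE STATUS\G

    -- The alerts are pt-heartbeat based; the newest heartbeat row gives a lag
    -- estimate that does not rely on Seconds_Behind_Master.
    SELECT server_id, ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;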
[14:15:10] Sorry elukey, I was in the middle of a switch
[14:15:54] elukey: Yeah, I recall papaul told me about those issues, but I wasn't sure whether in the end he reimaged them with 1G and then moved to 10G once reimaged
[14:57:39] is this with Dell kit or Supermicro? I think we've previously installed 10G ms-be* nodes OK without needing --force options
[14:58:00] Emperor: dell
[14:58:05] Huh.
[15:00:46] it depends on the NIC and its firmware; sadly it is not 100% clear what the problem is
[15:00:53] long term, I think we may need to test UEFI