[04:19:17] DBA, SRE, Datacenter-Switchover: Check "Days in advance preparation" for databases before DC switchover - https://phabricator.wikimedia.org/T285069 (Marostegui)
[04:19:25] DBA: Pre DC switchover eqiad -> eqiad DB work - https://phabricator.wikimedia.org/T284897 (Marostegui)
[04:43:18] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui) I haven't seen anything weird, so I am going to finish up s6 eqiad (not the master, as it requires a switch) and then deploy this change on all codfw sections.
[04:46:58] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[04:48:41] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[04:57:50] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[06:47:29] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[07:08:35] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[07:13:05] marostegui, kormat I recently set up db2100 (s7 and s8) again with data from 2 weeks ago. Pinging to check if it needs any schema changes reapplied (it will not if those were done using replication)
[07:13:29] it had to be so old because we were doing 10.4 backups meanwhile
[07:16:18] (obviously not a priority, but I was trying to avoid the issue I brought up a few meetings ago)
[07:16:30] jynus: if it is codfw, it will arrive via replication
[07:16:35] cool
[07:16:41] thanks for the heads up :)
[07:16:54] I'm just trying to be super-communicative
[07:17:05] yep!
[07:17:08] especially in this case where the recovery was super-old
[07:33:59] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[07:37:02] marostegui: I did a tiny investigation on the 840M rows query and the result is very sad
[07:37:18] I have the notification there pending to read :)
[07:37:53] let me know once you read it
[07:38:07] so we can discuss further sufferings
[07:39:00] wilco!
[07:58:53] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui) s5 eqiad [x] dbstore1003 [] db1161 [] db1154 [x] db1150 [x] db1145 [] db1144 [] db1130 [] db1113 [] db1110 [] db1100 [x] db1096 [] clouddb1021 [] clouddb1020 []...
[08:15:40] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[08:28:48] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[08:34:14] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[08:34:37] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[09:45:58] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat)
[09:46:10] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat) p:Triage→Low
[09:51:40] DBA: Rebase pt-heartbeat-wikimedia on modern upstream version - https://phabricator.wikimedia.org/T285082 (Kormat)
[09:51:51] DBA: Rebase pt-heartbeat-wikimedia on modern upstream version - https://phabricator.wikimedia.org/T285082 (Kormat) p:Triage→Medium
[09:55:34] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat) **Starting state**: pt-hb running, mariadb running. **State change**: stopping mariadb **Result**: pt-hb starts logging this once per second: ` Jun 17 09:55:01 zarcillo0 pt-heartbeat-wikimedia[3982]: Ca...
[09:57:19] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat) **Starting state**: pt-hb running, mariadb running. **State change**: restarting mariadb **Result**: pt-hb starts logging this once per second while mariadb is down, and then continues to work when mari...
[09:59:07] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat) **Starting state**: pt-hb stopped, mariadb stopped. **State change**: starting pt-hb **Result**: pt-hb exits immediately with this error: ` Jun 17 09:58:11 zarcillo0 pt-heartbeat-wikimedia[22267]: DBI c...
[10:03:25] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat) **Starting state**: pt-hb stopped, mariadb running. **State change**: starting pt-hb with invalid database name **Result**: pt-hb exits immediately with this error: ` Jun 17 10:02:32 zarcillo0 pt-heartb...
[10:05:25] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat) **Starting state**: pt-hb stopped, mariadb running. **State change**: starting pt-hb with invalid table schema **Result**: pt-hb exits immediately with this error: ` Jun 17 10:05:07 zarcillo0 pt-heartbe...
[10:08:37] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat)
[10:08:52] marostegui: ^ this looks a lot better than i thought.
[10:09:49] you are patching it?
[10:10:01] that's separate
[10:10:23] i wanted to investigate the current behavior, to see how close/far it is from our expectations
[10:10:30] ah, gotcha
[10:10:37] i've also filed https://phabricator.wikimedia.org/T285082 for patching it
[10:10:46] which we should do, but i'm thinking maybe it's not a high prio now
[10:10:56] because the existing behavior actually looks pretty sane to me.
[10:11:11] I will take a closer look at the states
[10:11:56] the only 'questionable' one, IMO, is where pt-hb is running and then mariadb is stopped. but in that case we'll have alerts for mariadb itself. in which case it probably doesn't matter that pt-hb is continuing to run
[10:12:05] yeah
[10:12:14] cause it will do the right thing once mariadb comes back, right?
[10:12:18] it will keep trying to insert things?
[10:12:21] yeah
[10:12:37] yeah, the most common case is mariadb crash -> mariadb restart
[10:12:47] (i double-checked that it was successfully updating the heartbeat table after mariadb returns)
[10:12:48] if it reconnects that's cool
[10:12:59] jynus: i don't think that's _that_ common
[10:13:04] at least, i certainly hope not :P
[10:13:10] kormat, ofc
[10:13:20] The only other one that can catch us is the reboot one
[10:13:24] I mean the most common case of an unscheduled mariadb stop
[10:28:40] DBA: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Marostegui) This is quite a good approach. The only one that can bite us is the one that happens after a reboot: - Both services stopped - Mariadb gets started before pt-heartbeat - pt-heartbeat remains stopped...
[10:29:26] marostegui: what'll actually happen, most of the time, is that pt-heartbeat will get started by the puppet run that happens on boot.
it'll then fail ~instantly, and we'll get an alert
[10:29:45] so it's a bit annoying, we'll need to start the pt-hb service manually to fix things
[10:29:51] but i guess at least we'll know
[10:30:06] kormat: but if we get to start mysql, on the next puppet run, pt-heartbeat will get started again right?
[10:31:17] marostegui: sure. but at that point we have a 'master' that's been running for up to 30mins without pt-hb running
[10:31:23] so i'm not sure that's useful :)
[10:31:58] yeah, but we'll get the alert before those 30 mins
[10:32:10] what I mean is cases where I simply forget to start it (ie: codfw)
[10:32:10] right. so my point is that the next puppet run is moot,
[10:32:19] because we'll be getting alerted in the meantime
[10:32:33] i guess it's a last-resort fallback
[10:32:44] going to reboot a pontoon node to double-check this is what happens
[10:32:46] is there a way to bond services together? ie: mysql starts -> check pt-heartbeat and if not, start it?
[10:33:02] yesbutno
[10:33:09] xdd
[10:33:16] systemd can have that logic,
[10:33:34] but managing it so that it differs between nodes which should and should not have pt-hb running is probably quite complex
[10:33:58] yeah, indeed
[10:34:38] mm. we don't have an alert for pt-hb running when it shouldn't be
[10:35:01] i wonder if i could convince a jbond to accept a patch to the logic for systemd::monitor
[10:36:06] even if the next puppet run would clean it up, it will have polluted the heartbeat table, which we'd want to fix
[10:38:20] ah. nrpe::monitor_systemd_unit_state would work, i think.
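Editor's note: the alerting gap discussed above (pt-hb stopped when it should be running, and pt-hb running when it shouldn't be) can be illustrated with a minimal sketch. This is not the logic of nrpe::monitor_systemd_unit_state or systemd::monitor; the function name, its parameters, and the use of Nagios-style exit codes are assumptions made purely for illustration.

```python
from typing import Tuple

# Nagios-style exit codes (assumed here only for illustration).
OK = 0
CRITICAL = 2


def check_pt_heartbeat(should_run: bool, unit_active: bool) -> Tuple[int, str]:
    """Compare the expected pt-heartbeat state with the observed one.

    should_run:  hypothetical flag, True on hosts expected to run pt-hb
                 (for example the current master of a section).
    unit_active: whether the pt-heartbeat unit is currently active.
    """
    if should_run and not unit_active:
        # The "forgot to start it after a reboot" case from the discussion.
        return CRITICAL, "pt-heartbeat should be running but is stopped"
    if not should_run and unit_active:
        # The case with no alert today: pt-hb would pollute the heartbeat table.
        return CRITICAL, "pt-heartbeat is running but should be stopped"
    state = "running" if unit_active else "stopped"
    return OK, "pt-heartbeat is " + state + " as expected"


if __name__ == "__main__":
    for should_run in (True, False):
        for unit_active in (True, False):
            code, message = check_pt_heartbeat(should_run, unit_active)
            print(should_run, unit_active, code, message)
```

The point of the sketch is only that the check has to be symmetric: a monitor that alerts solely on "unit is down" cannot catch the second case.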
[10:50:24] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[11:04:05] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui)
[11:04:58] Blocked-on-schema-change, DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (Marostegui) This is all done in codfw, waiting only for the DC switch. s5 and s6 are also mostly done (only pending the masters in eqiad)
[11:05:57] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui) a:Marostegui
[11:09:32] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui)
[11:12:13] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui) Going to deploy this on s6 on a couple of hosts in eqiad. If all goes well, I will try to deploy it entirely on codfw, so I can do eqiad once the dc switchover has happened.
[11:15:15] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui) ` root@cumin1001:/home/marostegui# mysql.py -hdb1096:3316 frwiki -e "show create table iwlinks\G" *************************** 1. row *************************** Table:...
[11:25:25] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui) s6 eqiad progress [x] dbstore1005 [x] db1180 [] db1173 [] db1168 [] db1165 [] db1155 [] db1140 [] db1131 [] db1113 [] db1098 [x] db1096 [] clouddb1021 [] clouddb1019 [] clou...
[11:41:06] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui)
[12:14:50] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui)
[12:17:18] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui)
[12:36:20] DBA, Datacenter-Switchover: Pre DC switchover eqiad -> eqiad DB work - https://phabricator.wikimedia.org/T284897 (Marostegui)
[12:57:47] Blocked-on-schema-change, DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (Marostegui)
[16:47:56] DBA, Datacenter-Switchover: Pre DC switchover eqiad -> eqiad DB work - https://phabricator.wikimedia.org/T284897 (Legoktm)
[16:48:16] DBA, SRE, Datacenter-Switchover: Check "Days in advance preparation" for databases before DC switchover - https://phabricator.wikimedia.org/T285069 (Legoktm)