[05:27:27] dhinus: it looks good! very nice job :)
[05:46:17] good morning, will start s4 eqiad switchover :)
[05:46:29] great!
[06:56:58] marostegui: thanks. I want to do a few more changes to that wiki page, I will ping you when I need more reviews
[06:57:06] sounds good
[07:37:25] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:41:23] dhinus: when codfw sanitariums were purchased, those were intended to be failovers for eqiad ones, sadly there needs to be additional investment on automation there
[07:42:16] the idea being that if eqiad production mw was broken, they could replicate from codfw
[07:42:25] RESOLVED: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:48] aka it is missing a virtual arrow from codfw to eqiad wikireplicas
[07:43:26] jynus: got it, I will update the note on the right side
[07:44:43] what's the blocker to manually set clouddb to replicate from codfw in case one eqiad sanitarium is down?
[07:45:47] dhinus: investigating position in binlogs and switching them
[07:46:14] It is not a blocker, it is just complicated
[07:46:32] You need to match the last transaction that happened in eqiad and find that one on codfw sanitarium hosts
[07:46:55] I see, so theoretically possible if e.g. there's a hardware issue on one sanitarium in eqiad
[07:47:03] it is possible yes
[07:47:19] We have wanted to automate that for years, but we've not got resources for it
[07:47:24] thanks, I will put that in the diagram
[07:47:25] dhinus: it requires automation, and right now it is so dangerous (it would break clouddbs if wrong) that I think it is seldom used
[07:47:32] what manuel says
[07:47:42] thanks both!
[08:37:25] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1154:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:25] RESOLVED: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1154:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:20] arnaudb: what's the status of the old s4 master? can I run a schema change there or is it pooled?
[08:49:37] it's repooling atm
[08:49:42] I can ^C if you want
[08:49:42] ah ok
[08:49:45] it's @1%
[08:49:47] Yeah, let's do that
[08:49:49] sure :)
[08:49:54] And I will take care of repooling it once I am done
[08:50:05] done!
[08:50:09] it's all yours
[08:50:13] thank you!
[13:46:57] Did someone start replication on db1216:3318?
[13:47:07] Ah, I did, because I am stupid
[13:47:08] Nevermind
[14:47:35] marostegui: can I play with the old s4 master?
[14:48:09] Amir1: not yet
[14:48:19] let me know then :D
[14:48:35] Wilco!
[15:44:21] jynus marostegui I updated the diagram: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Overview_diagram
[15:45:01] Thank you
[15:46:53] I also created T365717 for a general review/update of that wiki page
[15:46:54] T365717: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717
[16:27:03] marostegui: I have some bad news unfortunately
[16:27:21] we need to upgrade JunOS on our switches in Eqiad rows E and F - T348977
[16:27:22] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977
[16:27:47] as the switch reboots we have to do it on a rack-by-rack basis (but hey at least we don't need to take entire row down like we used to)
[16:28:13] first db hosts are in rack E1, which I've provisionally scheduled for July 2nd
[16:29:12] we can chat again about it - the timing is flexible tbh
[16:29:21] but wanted to give you a heads up
[16:31:26] topranks: thanks, I'd like this to be led by arnaudb as I'll be out on sabbatical and if there are more of these I will be out
[16:31:49] ok yep, shouldn't always fall to you either
[16:32:03] I hope it won't be too disruptive
[16:33:25] arnaudb: if you get a chance please take a look at the schedule / google sheet and let me know if you think it's workable
[16:33:34] no rush, and we can definitely push it out if needed
[16:34:00] Let's see how arnaudb wants to do this
[16:40:41] ack topranks marostegui will come up with a schedule soon and will let you know!
[16:40:54] cheers
[16:41:34] arnaudb: <3
[16:52:20] Amir1: I am done with the old s4 master
[16:52:25] oh thanks
[16:52:25] Would you take care of repooling it?
[16:52:28] sure
[16:52:35] good thanks
[18:29:59] Hi all. I'm having trouble developing a procedure to upgrade my MW 1.39 Aurora MySQL databases from 5.7 to 8. One warning I get says "By default zero date/datetime/timestamp values are no longer allowed in MySQL, as of 5.7.8 NO_ZERO_IN_DATE and NO_ZERO_DATE are included in SQL_MODE by default. These modes should be used with strict mode as they
[18:30:00] will be merged with strict mode in a future release. If you do not include these modes in your SQL_MODE setting, you are able to insert date/datetime/timestamp values that contain zeros. It is strongly advised to replace zero values with valid ones, as they may not work correctly in the future."
[18:30:26] It further says that global.sql_mode doesn't contain NO_ZERO_DATE or NO_ZERO_IN_DATE, thus allowing insertion of zero dates.
[18:38:31] So I'm not clear on what I should do here. I can open an AWS support case, but since they don't know MW, any recommendations they give may have unknown implications within MW.
[19:48:19] I'm chatting with an Aurora MySQL specialist and at this point he recommends setting SQL_MODE in the database cluster parameter group to “ONLY_FULL_GROUP_BY,NO_ZERO_IN_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION,STRICT_ALL_TABLES,NO_AUTO_CREATE_USER”. Does this sound correct from a MW perspective?
[19:49:24] The database policy documentation sounds like there shouldn't be any zero date/datetime/timestamp values. Is that correct?