[05:02:58] we are multidc now https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=codfw&var-group=core&var-shard=All&var-role=All
[05:43:07] for all wikis?
[05:43:47] Amir1: after s1, I will switch x1
[05:44:06] marostegui: yup, for all wikis and all of traffic
[05:44:39] sounds good to me, a host is under alter table so it didn't move (sigh), I probably need to run move-replica once it's done
[05:45:39] can you double check if db1107 has the alter table done btw?
[05:45:57] I want to use that one to clone another host, if not, just let me know one host that is already done
[05:47:16] let me check, which section is it?
[05:48:22] ah, s1
[05:48:59] marostegui: it has the new schema
[05:49:04] \o/
[05:49:05] thanks
[05:49:22] (sorry I had to ask the section because I'm running different schema changes on s4 and s6)
[05:49:28] yeah absolutely
[06:25:17] db1132 (s1) and db1143 (s4) running the patched version of mariadb 10.6 are being repooled
[07:14:38] interesting: https://github.com/charles-001/dolphie
[07:15:17] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db1196:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[07:15:19] wow that looks cool
[07:15:55] jynus: fancy creating a ticket so we don't forget to test it?
[07:16:36] it looks cool, I am not so sure about recommending it (not because I have any concern, it is just that I found it randomly)
[07:17:04] But it is probably worth a task to see if we can test it during a hackathon or whatever
[07:18:24] sorry, I just don't feel comfortable endorsing it without knowing much, I will save it on github by starring it though
[07:18:31] oki
[07:19:16] I was sharing it because "it looked cool" not as "we should use this"
[07:35:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (db1196:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[07:52:14] marostegui: so db1133 didn't move from the old master to the new one because it had a schema change going on, should I use the move-replica script for it?
[07:52:30] Amir1: you can use db-move yes
[07:52:36] db-move-replica
[07:52:44] awesome
[07:54:57] Amir1: you are not running any more schema changes in x1, right?
[07:55:00] I want to switch it back to row
[07:56:01] not to my knowledge, but let's double check if the schema changes are 100% done before the switch
[07:56:14] I don't see any from the dashboard
[07:56:17] you can run the schema change with --check to ... check
[07:56:31] so I am personally not running any
[07:56:39] the last one was the cx_corpora one
[07:56:56] I know, if they failed to depool etc.
[07:57:20] so sudo python3 change_cx_corpora_T312160.py --run --check
[07:57:25] Result: {"already done": ["db1103", "db1137", "db1116:3320", "dbstore1005:3320", "db1102:3320", "db2096", "db2131", "db2101:3320"]}
[07:57:28] That is done
[07:57:29] That one is completed
[07:57:30] yes
[07:57:43] and on dc masters: Result: {"already done": ["db1120", "db2115"]}
[07:57:45] So from your side there's none left, right?
[07:57:58] yeah, from what I'm seeing none left
[07:58:06] ok, I will switch back to ROW
[07:58:25] marostegui: can I run a triple check for a min?
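As an aside on the `--run --check` output quoted above: each check run prints an "already done" list of the hosts that already carry the new schema. A minimal sketch of that kind of per-host check is below, assuming pymysql, credentials in a readable ~/.my.cnf, and placeholder host, schema, table, and column names; it is not the actual production schema-change tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch of a schema-change "--check" pass.

Everything here (hosts, credentials file, schema/table/column names) is a
placeholder for illustration; it is not the production tooling.
"""
import json

import pymysql

# Replicas of the section; multi-instance hosts carry a port, e.g. "db1116:3320".
HOSTS = ["db1103", "db1137", "db1116:3320"]

CHECK_QUERY = """
    SELECT COUNT(*)
    FROM information_schema.COLUMNS
    WHERE table_schema = %s AND table_name = %s AND column_name = %s
"""


def has_new_schema(host: str, port: int) -> bool:
    """Return True if the host already has the new column."""
    conn = pymysql.connect(host=host, port=port, read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute(CHECK_QUERY, ("wikishared", "cx_corpora", "cxc_placeholder_column"))
            (count,) = cur.fetchone()
            return count > 0
    finally:
        conn.close()


def main() -> None:
    already_done = []
    for entry in HOSTS:
        host, _, port = entry.partition(":")
        if has_new_schema(host, int(port) if port else 3306):
            already_done.append(entry)
    # Same shape as the output quoted in the channel above.
    print("Result:", json.dumps({"already done": already_done}))


if __name__ == "__main__":
    main()
```

In practice the host list would come from the section's inventory, and the check would test for whatever DDL the ticket (T312160 above) actually describes.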
[07:58:30] of course
[07:58:31] I promise I'll be quick
[07:58:39] no rush
[07:59:16] rename_echo_push_indexes_T312975.py says it's applied in all of x1 (both replicas and masters), we are good to go
[07:59:23] cool
[07:59:29] I will switch to ROW then
[07:59:51] Thanks
[08:00:18] once you're done: do we use the osc_host.sh file in db tools? I found this https://wikitech.wikimedia.org/wiki/MariaDB#Online_Schema_Changes
[08:00:36] no, we don't anymore
[08:00:37] I want to clean up that page a bit
[08:01:00] cool
[08:10:44] why do we have so many hosts depooled in s1?
[08:10:51] Emperor: following the swift-proxy slowness of a couple of weeks ago, ms-fe1012 was left depooled and swift-proxy not restarted, are you interested in inspecting the proxy? if not I'll finish the roll-restart and repool the host to wrap things up
[08:11:28] marostegui: that was weird to me too, planning to check
[08:11:40] db1134, db1119, db1118 (old master)
[08:11:53] the last one is intentional. I'm on it
[08:12:00] And two on s7
[08:12:12] db1134 is also mine, the schema change is running
[08:12:16] not sure about db1119
[08:12:40] marostegui: I don't have anything running on s7
[08:12:57] I will repool db1119, I think it was me that used it to clone db1107
[08:13:50] in s7 we have db1127 (10.6) and db1174
[08:14:13] db1174 is not in SAL, so I am repooling it
[08:14:19] It is also all green in icinga
[08:15:14] And db1173 from s6?
[08:15:38] that should be me running a schema change
[08:15:40] let me check
[08:15:46] ah ok
[08:16:16] nope, mine is db1098:3316 now
[08:16:27] 08:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[08:16:35] that's from yesterday
[08:16:37] did it fail or something?
[08:17:04] ah yeah, I think I forgot to repool it
[08:17:21] cool, double check it is all green in icinga, and pool it back once you have time
[08:18:43] hmm, since there is a schema change ongoing, the code might mistakenly skip depooling and run it live there, I'll let it be until the schema change is finished, which should be done soon
[08:18:52] ok!
[08:27:27] Amir1: sorry to bother you again, can you tell me a host in s4 that already has the schema change applied so I can use it to clone another one?
[08:27:48] hmm, let me check
[08:28:55] ["db1143", "db1146:3314", "dbstore1007:3314", "db1138"]
[08:29:05] any of these would work
[08:29:13] thank you!
[08:29:23] I will take db1138
[12:07:56] Amir1: For later (as I am going for lunch), can you let me know a host in s6 that is also done?
[12:08:33] I'm out for lunch too 😅😅
[12:10:23] enjoy!
[13:15:18] marostegui: ["dbstore1005:3316", "db1098:3316", "db1113:3316", "db1096:3316", "db1180", "db1165", "db1168", "db1140:3316", "db1187", "db1173", "db2171:3316"]
[13:15:25] PROBLEM - Check unit status of swift_ring_manager on ms-fe1009 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:15:44] make sure it's not having one of the schema changes as we speak :D
[13:17:01] great!!
[13:17:03] thanks
[14:12:47] RECOVERY - Check unit status of swift_ring_manager on ms-fe1009 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
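On the "switch back to ROW" step discussed above: once no schema change is in flight, the core action amounts to flipping binlog_format on the section master. A hedged sketch follows, again assuming pymysql and ~/.my.cnf credentials; db1120 is one of the x1 DC masters seen in the --check output above, and the real procedure also persists the setting in configuration management and carries more safeguards.

```python
#!/usr/bin/env python3
"""Sketch of flipping a section master back to ROW-based replication.

Hypothetical connection details; the real procedure also persists the
setting via configuration management so it survives a restart.
"""
import pymysql

MASTER = "db1120"  # one of the x1 DC masters mentioned in the --check output above

conn = pymysql.connect(host=MASTER, read_default_file="~/.my.cnf")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT @@GLOBAL.binlog_format")
        (current,) = cur.fetchone()
        print(f"{MASTER}: binlog_format is currently {current}")
        if current != "ROW":
            # Only affects sessions opened after this point; existing
            # connections keep their session-level binlog_format.
            cur.execute("SET GLOBAL binlog_format = 'ROW'")
            print(f"{MASTER}: binlog_format set to ROW")
finally:
    conn.close()
```

Re-running the same `SELECT @@GLOBAL.binlog_format` from a fresh connection afterwards is a cheap sanity check that the switch took effect.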