[08:43:59] Amir1: for https://phabricator.wikimedia.org/T402763 we need to update the codfw DC master - can we do it live or with flip?
[08:45:02] we also need to run it on s4 codfw replicas and s4 codfw dc master
[09:58:20] federico3: thanks. I'm running on s4 in both eqiad and codfw now, I think we can do a flip of s1 or s8 in codfw to make it pick the new schema change
[09:58:28] I go do upgrade of s7
[09:58:35] (eqiad)
[10:00:02] Amir1: ok. Also when you have a sec can you review my comments about the mariadb version upgrade?
[10:00:30] https://www.irccloud.com/pastebin/MYWmgqi0/
[10:02:08] ah that one, yeah, the thing is that I messed it up and switched over the candidate when the candidate also had the bug. So we need to update the candidate and switch them over, I double checked in debmonitor and the old master/current candidates don't have the new version (it needs u2, not u1) in s1/s4/s8 in eqiad still
[10:02:18] tldr, ignore the ticket :D
[10:02:51] 10.6.22 is not enough
[10:03:27] e.g. https://debmonitor.wikimedia.org/hosts/db1193.eqiad.wmnet: wmf-mariadb106 10.6.22+deb12u2
[10:03:41] this is good, but wmf-mariadb106 10.6.22+deb12u1 is not
[10:04:10] so s8 is already upgraded, I think cause I ran it for the kernel reboot :D
[10:05:12] aaah, I see all three are updated
[10:05:35] probably because of the reboot for kernel upgrade.
[10:05:45] Thanks federico3 for checking, only thing left is switchovers
[10:05:47] yes the kernel upgrade is meant to also upgrade mariadb
[10:06:31] BTW I'm displaying drop_rc_new... in https://zarcillo.wikimedia.org/ui/schema_change if it's useful for you as well
[10:07:23] can I do s1 codfw master flip now?
[10:09:00] awesome
[10:09:14] yup, go for it
[10:09:36] starting flip now https://phabricator.wikimedia.org/T404178
[10:10:41] if you're running on cumin1003, you'll see how faster the topology changes are :D
[10:11:15] I'm on 1002 but... can we update wmfmaria* here as well?
[10:11:34] it is already there too
[10:11:36] for long time
[10:11:51] so you haven't seen the old way, it was so sloooow
[10:20:25] Amir1: time sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2212 db2203 recorded 7 minutes
[10:20:50] it used to take at least half an hour
[10:21:15] and also would make things not so good for mediawiki so users were seeing lagged information
[10:22:19] db1181 is now on 10.11 (T399955) now that I do the s7 switchover in eqiad, the ticket will be finished too
[10:22:20] T399955: Migrate s7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T399955
[10:24:36] federico3: actually one thing that'd be great if you could take a look if you have time: can you grab list of all of candidate masters (just grep "candidate master" in hiera files) and check whether all of them have u2 update, my reason is that if we have to do switchover cause the current master goes down or something, we will end up with the mess on our hands again
[10:25:23] if any is missing the update for the semi-sync bug, we can simply just depool it, run upgrade cookbook and pool it back again
[10:25:40] but it'll save us later
[10:29:26] s1 codfw switchover is done
[10:29:32] Amir1: yes one sec
[10:30:46] BTW I was using ssh but maybe DebMonitor is our friend here?
https://debmonitor.wikimedia.org/packages/wmf-mariadb1011 https://debmonitor.wikimedia.org/packages/wmf-mariadb106
[10:31:13] ah, yeah that's simpler
[10:31:32] I was originally thinking of doing xargs with cumin running dpkg list command or something
[10:31:44] but this is simpler
[10:33:47] we are looking for all candidate masters or only s* sections?
[10:35:00] I'd say let's go with all
[10:35:14] and only in codfw?
[10:35:57] nope, all
[10:36:58] if you do some quick script, that'd be simpler, maybe download the list of all u1 hosts, do the grep and do "common" command. something like that
[10:38:27] yep I'm doing ssh + dpkg
[10:44:32] https://phabricator.wikimedia.org/P83102
[10:44:43] there you go - we have to stragglers
[10:45:19] \o/
[10:45:21] Thanks
[10:45:44] db1173 thankfully is not a master
[10:46:09] it's a replica in s6 eqiad, shall I run the upgrade now?
[10:46:37] err.. we have *some* stragglers, not "to"
[10:47:13] yup, the point being in case the current master goes down or has issues, the problematic candidate shouldn't become the future master
[10:48:10] I'd be grateful if you run the upgrade script on those candidates
[10:56:29] ok, running on db1173
[10:57:20] also I should start the 2 schema changes on db2212 (ex master on s1 codfw)
[11:01:20] PROBLEM - MariaDB sustained replica lag on s6 on db1173 is CRITICAL: 69.75 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1173&var-port=9104
[11:01:50] odd
[11:02:42] stupid me, I forgot --repool
[11:03:20] RECOVERY - MariaDB sustained replica lag on s6 on db1173 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1173&var-port=9104
[11:03:48] actually let me test 1186532
[11:06:39] Amir1: can I start the 2 schema changes on db2212 (ex master on s1 codfw) ?
[11:07:02] I think it's the missing prometheus restart thing I told you
[11:07:06] federico3: sure, go ahead!
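[editor's note: for reference, a minimal sketch of the ssh + dpkg check discussed above, assuming the candidate-master list has already been grepped out of the hiera files; the two host names and the +deb12u2 suffix come from the conversation, everything else is illustrative.]

```python
#!/usr/bin/env python3
"""Check that every candidate master carries the +deb12u2 wmf-mariadb build.

Minimal sketch, not a supported tool: the host list is assumed to come from
grepping "candidate master" in the hiera files, as discussed above.
"""
import subprocess

# Hypothetical input list; db1173 and db2191 were the two stragglers found above.
CANDIDATE_MASTERS = ["db1173.eqiad.wmnet", "db2191.codfw.wmnet"]
WANTED_SUFFIX = "+deb12u2"

# Single remote command string so the remote shell, not the local one,
# handles the quoting of the dpkg-query format string.
DPKG_CMD = "dpkg-query -W -f='${Package} ${Version}\\n' 'wmf-mariadb*'"


def installed_mariadb_packages(host: str) -> list[str]:
    """Return "package version" lines for wmf-mariadb* installed on a host."""
    out = subprocess.run(
        ["ssh", host, DPKG_CMD], capture_output=True, text=True, check=True
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def main() -> None:
    for host in CANDIDATE_MASTERS:
        for line in installed_mariadb_packages(host):
            pkg, version = line.split(maxsplit=1)
            status = "ok" if version.endswith(WANTED_SUFFIX) else "NEEDS UPGRADE"
            print(f"{host}\t{pkg}\t{version}\t{status}")


if __name__ == "__main__":
    main()
```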
[11:07:18] it's in the CR on gerrit, 1186532
[11:07:32] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1186532
[11:08:32] looks good, +1'ed
[11:10:28] I don't remember if I can run the upgrade cookbook on x1 or needs some special handling
[11:10:38] we should document it in the cookbook
[11:11:32] x1 should be fine, only thing special for x1 is that it's RBR (maybe Manuel changed it, can't recall) so schema changes shouldn't be run on replicas first otherwise 🎆
[11:11:48] but beside that, x1 is really like a sX section
[11:33:18] ok upgrading db2191
[11:34:30] PROBLEM - MariaDB sustained replica lag on s7 on db1236 is CRITICAL: 36 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1236&var-port=9104
[11:35:30] RECOVERY - MariaDB sustained replica lag on s7 on db1236 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1236&var-port=9104
[11:45:25] FIRING: SystemdUnitFailed: ferm.service on db2191:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:27] okay, I don't know what this one is about
[11:46:51] claime: I think the ideal fix would be to move the p*ge to "can the app work normally with the database" layer, leaving the other individual checks as regular alerts
[11:48:03] I think it only become a p*ge when mediawiki used to go down when any replica went down, now fixed
[11:50:48] yup, the main problem is that we don't have a page for when all replicas are down or lagged, so we page for any of them
[11:51:15] ...not yet
[11:51:27] this is a known problem, just we haven't got to fix it
[11:51:39] db2191 is the x1 host
[11:52:01] i think we have a race condition somewhere in the monitoring
[11:53:12] https://www.irccloud.com/pastebin/q1UQ4zt4/
[11:53:33] ah no this is ferm once again
[11:55:19] also federico3, you'd need to add back the weight of old master from the old value of the new master, now i'm seeing a lot of pooled hosts in codfw that have weight of zero. e.g. "db2212": 0,
[11:55:20] and it logged out "Sep 10 11:40:17 db2191 ferm[1110]: DNS query for 'dborch1001.wikimedia.org' failed: query timed out"
[11:55:59] is it in the master flip process?
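[editor's note: since x1 being RBR means a schema change applied to a replica ahead of the master can break row-based replication, here is a minimal sketch of the pre-flight guard the exchange above implies. It is not auto_schema's actual logic; it assumes the plain mysql client on the operator's host can reach the section master, with credentials and TLS handling left out.]

```python
#!/usr/bin/env python3
"""Refuse a replicas-first schema change if the master replicates row-based.

Minimal sketch of the guard implied for x1 above, not auto_schema's actual
logic; assumes the plain `mysql` client can reach the given master.
"""
import subprocess
import sys


def binlog_format(master: str) -> str:
    """Ask the master for its global binlog_format (STATEMENT, MIXED or ROW)."""
    out = subprocess.run(
        ["mysql", "-h", master, "-BN", "-e", "SELECT @@GLOBAL.binlog_format"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip()


def assert_replicas_first_is_safe(master: str) -> None:
    fmt = binlog_format(master)
    if fmt == "ROW":
        # On RBR sections such as x1, altering replicas before the master can
        # break replication: the change has to go to the master first.
        sys.exit(f"{master} uses ROW binlog_format: do not run replicas-first")
    print(f"{master} binlog_format={fmt}: replicas-first is fine")


if __name__ == "__main__":
    assert_replicas_first_is_safe(sys.argv[1])  # e.g. the x1 section master
```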
[11:56:02] current masters always have to have weight of 0 but candidate masters should have some weight
[11:56:09] yup it is in the checklist
[11:58:35] I take a tiny break and get back to updating zarcillo masters for the semi-sync bug
[11:59:00] I found few hosts that have no api/vslow/dump set see https://phabricator.wikimedia.org/T404106 so after the flip i did not set those
[12:00:25] RESOLVED: SystemdUnitFailed: ferm.service on db2191:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:31] federico3: I created T403966 as Amir asked me, to do a global review before the switchover, so no worries, it is mentioned there as a TODO
[12:21:42] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[12:22:42] it would be nice, however, to plan to give enough time to that (finish schema changes early and other maintenance)
[12:32:52] ok I can eyeball https://zarcillo.wikimedia.org/ui/weights and set weights on replicas that currently have 0
[12:35:09] during the flip I save the original values on the old master but i think we need steps to also save weight from the old replica
[12:40:53] I'm going to downtime zarcillo hosts and upgrade them in-place so it should be inaccessible for a bit
[12:41:51] first db2185
[12:53:29] ack
[12:53:45] db2185 is done, now the master: db1215
[12:55:26] Amir1: are you running the schema change for s4 in codfw for 2025/drop_rc_new_T402763.py ?
[12:55:54] I think it was only the old master
[12:55:58] but give me a bit
[12:56:20] doublechecking
[12:56:25] no, on all hosts that haven't had it
[12:56:30] yep
[12:56:33] shall i start it?
[12:56:44] nope, my script is on it already
[12:56:47] ok
[12:56:49] it'll get to both codfw and eqiad
[12:57:19] should i flip s8 codfw master?
[12:58:12] one sec, let me finish this and i check
[13:03:36] so if you're done with the schema change of rc_new in s8 in codfw, go for it. But please check and make sure the weight is added back plus if zarcillo is working fine since I just upgraded it
[13:05:17] I'm saving the weights on both hosts before the changes and diff them and ask before setting them
[13:11:03] zarcillo looks ok
[13:13:17] ok db2212 is updated for both schema changes: https://zarcillo.wikimedia.org/ui/schema_change
[13:13:54] anything I can do for https://phabricator.wikimedia.org/T402763 BTW?
[13:15:10] actually it should be all done for codfw once s8 master is flipped and updated
[13:16:13] yeah, do it for s8 codfw master
[13:16:30] beside that, we are good until tomorrow (s4 codfw master)
[13:24:55] https://www.irccloud.com/pastebin/s8wIQlo0/
[13:26:20] do we want to set, say 300?
[13:29:13] you need to figure out what it used to be
[13:29:24] that's gonna be fun
[13:29:51] what was the previous switchover of s8, I think it should be there in SAL
[13:31:35] yes I have the logs
[13:34:01] let's finish the switchover and we can take a look. Right now, I need to write a comment, someone is wrong in the internet
[13:36:36] the flip just finished
[13:36:52] I can start the schema change[s] on the ex-master
[13:37:44] go for it
[13:38:52] [ a little --replicas cli flag would be so handy ]
[13:40:51] drop_rc_new_T402763 is running
[13:41:08] yup, I dreamt about it too :D
[13:41:47] and --live for small sections
[13:42:07] Thanks for running it!
[13:45:55] --replicas could be mutually exclusive with --dc-masters ...
it should be a 3-line change if we want to do it next week
[14:01:10] yeah but also the current HEAD of auto_schema is definitely broken
[14:01:21] I had to revert my local setup
[14:03:35] due to T395241?
[14:05:21] uhm I'm using ee1eaa66a4958b4413142f88289057d68765b4e5 for the schema changes
[14:05:35] anyhow we'll see later on
[14:44:13] meeting over, so regarding the zero weights on codfw, do you need help federico3?
[14:44:37] no, i found the old values from logs and restored them
[14:45:02] oh cool, I'm still seeing them as zero in https://noc.wikimedia.org/dbconfig/codfw.json though
[14:45:17] e.g. db2213 in s5
[14:45:37] maybe not committed yet? :D
[14:45:44] not changed yet
[14:54:05] yet 2213 had api and vslow set, are you sure we want main weight as well? 2192 (the host it flipped against) had 0 it seems
[15:09:22] it's really weird candidate master being in vslow and api. I think we simply should swap that role with something else and give it weight of 1 in general
[15:10:36] just 1?
[15:10:42] yeah
[15:11:00] also what makes it weird?
[15:11:09] nah, we do it a lot
[15:11:35] It makes the circuit breaking a bit harder but that's future-me problem
[15:11:51] uhm you wrote "it's really weird"
[15:12:15] ah, I thought you meant about the weight of 1
[15:13:00] for candidate master being in api/vslow. It's just a pain to update weights in multiple groups every time you want to do a switchover and also query patterns is much different making innodb caching not as warm as it should be
[15:13:28] you mean make another host main=1 with api/vslow, and make 2213 main=high value and no api/ no vslow?
[15:14:07] jynus: sorry to bother, I have a db question and I wonder if you can tell me the risks. the semi-sync bug happens when we are doing switchover so I'm wondering if I can just disable semi-sync right before I start the switchover (and let's say for a couple of minutes) and just re-enable it afterwards. That way the master stops becoming fully unresponsive during the switchover. My only worry is that if I disable semi-sync, it could lead to issues. Do you think it's okay to do that for a couple of minutes?
[15:14:13] ok good to know, future-me should discuss candidate selection with future-you
[15:14:34] "you mean make another host main=1 with api/vslow, and make 2213 main=high value and no api/ no vslow?" <- yup!
[15:14:48] ok
[15:15:45] my naive understanding of semi-sync is that it should be "fine" for a couple of minutes
[15:15:47] I cannot say because the script may expect that- or may disable and enable it on its own, either failing or ignoring the change
[15:16:12] ah yeah, that's another aspect
[15:16:25] then I won't disable it and just let it explode
[15:17:02] what script do you use?
[15:17:22] like, can you point me to the repo/code?
[15:19:44] it's your script
[15:19:51] let me find the code
[15:20:05] yes, what I mean is I no longer know where it lives, because of multiple forking, etc
[15:20:07] https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/blob/main/wmfmariadbpy/cli_admin/switchover.py?ref_type=heads
[15:20:18] wmfmariadb, wmfdb, etc.
[15:20:18] here you are
[15:22:39] so far it looks like only warnings
[15:23:02] it handles the semi-sync here: https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/blob/main/wmfmariadbpy/cli_admin/switchover.py?ref_type=heads#L623
[15:24:00] one the new master one before slave move
[15:24:06] *on
[15:24:29] and on the old one after replication inverting
[15:24:48] You could comment and do it manually, I guess?
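[editor's note: for reference only, a minimal sketch of the "comment and do it manually" toggle mentioned just above. It assumes the switchover script's own semi-sync handling (switchover.py around L623, linked above) has been commented out first, and uses MariaDB's built-in rpl_semi_sync_* variables; the host name is just an example. As the conversation notes, the script may fight the change, and the idea is not pursued further on in the log.]

```python
#!/usr/bin/env python3
"""Manually toggle semi-sync on a master around a switchover window.

Sketch only: assumes the switchover script's own semi-sync handling is
commented out, and that the shell `mysql` client can reach the host.
"""
import subprocess


def run_sql(host: str, statement: str) -> None:
    subprocess.run(["mysql", "-h", host, "-e", statement], check=True)


def disable_semi_sync_master(host: str) -> None:
    # Stop the master from waiting on replica ACKs for the duration of the flip.
    run_sql(host, "SET GLOBAL rpl_semi_sync_master_enabled = OFF")


def enable_semi_sync_master(host: str) -> None:
    run_sql(host, "SET GLOBAL rpl_semi_sync_master_enabled = ON")


if __name__ == "__main__":
    master = "db2203.codfw.wmnet"  # example host, not a recommendation
    disable_semi_sync_master(master)
    # ... run the switchover here, then re-enable ...
    enable_semi_sync_master(master)
```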
[15:41:58] amir: going to commit this:
[15:42:02] https://www.irccloud.com/pastebin/M8YxIfrz/
[15:42:12] sounds good?
[15:47:35] SGTM
[15:48:14] it'll make connections jump (no slow repool) but meh, once in a while we can do this, as a treat
[15:48:31] Thanks Jaime, I think for now, I wouldn't risk it
[15:55:18] Amir1: speaking of which, it seems to me we are a bit overly cautious with pool-in speeds on some sections. E.g. during this switchover, flips etc. I'm not seeing disk usage or CPUs jolt
[16:03:30] I think the answer is that it depends, e.g. in the secondary dc we can be more aggressive since they don't get that many queries at 100% but something to think about later
[16:17:44] we still need a schema change on db2220 requiring a flip (see https://zarcillo.wikimedia.org/ui/schema_change )
[16:17:50] as a heads up, you will see almost 500 "Errors" on grafana, related to bacula
[16:18:00] jynus: Jaime, as part of SRE management meeting we are looking ahead for next week's on-call schedule. Wanted to check if you are aware that you are on-call as part of spreadsheet (which happens to be source of truth) . (The reason why I am asking here is looks like in splunk it says Federico)
[16:19:47] re: grafana, you can ignore those, I cancelled some jobs and they show as errors, as technically they didn't end successfully
[16:20:51] Amir1: we still need a schema change on db2220 requiring a flip in codfw (see https://zarcillo.wikimedia.org/ui/schema_change )
[16:22:16] kavitha: I am happy to be on call during working hours whenever, if I am requiring to work outside of working hours, I will have to talk to HR
[16:24:53] Also, I have a few rehab hours on wednesday scheduled, next week
[16:25:31] I think rehab is the wrong word in english. I meant physical therapy
[16:27:14] kavitha: let me know if that answers your question, if not, feel free to contact me by email
[16:30:30] yes, this is for the existing working hours on-call
[16:30:46] then no problem
[16:30:59] I will fine a person for the few hours I will be out
[16:31:00] thanks for confirming. I will make sure splunk is reflected accordingly
[16:31:01] *find
[16:31:16] no worries, you will have another partner anyways
[16:37:05] I can handle splunk myself next week, no worries
[16:37:11] will now finish my day
[16:59:34] federico3: let it be for today, we have tomorrow too. Worst case, I do it
[16:59:46] we have made a lot of codfw switchovers
[17:04:38] Amir1: ok. We'll also have to flip s4 codfw master and run the schema changes on it
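[editor's note: related to the zero-weight cleanup discussed earlier ("I'm saving the weights on both hosts before the changes and diff them"), a minimal sketch of snapshotting per-host weights from the public dbconfig JSON mentioned in the log before a flip and diffing afterwards. The exact JSON layout is deliberately not assumed; the script just walks the document looking for the given host names, and the host names shown are only examples from the conversation.]

```python
#!/usr/bin/env python3
"""Snapshot per-host weights from noc dbconfig before/after a flip.

Sketch only: walks the JSON generically instead of assuming its exact layout,
and uses the public endpoint mentioned in the conversation.
"""
import json
import urllib.request

DBCONFIG_URL = "https://noc.wikimedia.org/dbconfig/codfw.json"


def fetch_config() -> dict:
    with urllib.request.urlopen(DBCONFIG_URL) as resp:
        return json.load(resp)


def host_weights(node, hosts: set[str], path: str = "") -> dict[str, float]:
    """Collect every numeric value keyed by one of the given host names."""
    found: dict[str, float] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key in hosts and isinstance(value, (int, float)):
                found[f"{path}/{key}"] = value
            else:
                found.update(host_weights(value, hosts, f"{path}/{key}"))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            found.update(host_weights(value, hosts, f"{path}[{i}]"))
    return found


if __name__ == "__main__":
    hosts = {"db2212", "db2203"}  # example: the s1 codfw pair flipped above
    print(json.dumps(host_weights(fetch_config(), hosts), indent=2))
    # Save this output before the switchover, rerun it afterwards and diff the
    # two snapshots to spot weights that were dropped to zero and not restored.
```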