[07:59:38] mornin'
[07:59:45] o/
[07:59:53] starting in -operations in 10 seconds
[08:00:50] uf, I said "I'm around BTW" on the wrong channel
[08:03:23] all was done, it took around 50 seconds
[08:12:24] I double checked, a dns change for the secondary is not needed
[08:13:14] secondary?
[08:13:47] m3-slave.eqiad.wmnet dns entry
[08:13:56] ah no, that hasn't changed
[08:14:18] m3 secondary is used for stats and the phab email, etc.
[08:15:20] yep
[08:31:23] marostegui, Amir1, Emperor: friendly poke to come to the "ERC - Implementation Phase Kick-off meeting" today at 15:00 UTC if you can.
[08:39:27] * Emperor should be able to make that
[09:16:43] Amir1: how do i tell autoschema to run against s6/codfw?
[09:16:59] kormat: you need to specify the replicas, otherwise it will get all of them
[09:17:09] so there's no way to say: just codfw, or just eqiad
[09:17:15] * kormat grumbles
[09:17:29] kormat: you can just put codfw master and run it with --include-masters and that'd work
[09:17:31] how does it know which dc is active?
[09:18:06] kormat: but once you are ready for eqiad, you'd need to specify eqiad replicas or else it will try to run it on codfw as well (which should be fine really as it will detect it was done already and skip it)
[09:18:27] kormat: https://gerrit.wikimedia.org/r/c/operations/software/+/748726
[09:18:50] * kormat grimaces
[09:19:24] kormat: however, let's coordinate before you start the schema change to make sure you don't do it on a section that already has one running. I am deploying one on s3 and s6, not sure about Amir1
[09:24:34] I got an error on yesterday's zarcillo backup, and it could be related to the upgrade
[09:24:56] which error did you get?
[09:25:35] the error is long, but it is probably due to leaving the HISTORY removal rights on the user
[09:25:44] ah that crap
[09:25:47] let me get rid of it
[09:25:56] which user was that?
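The auto_schema flow discussed in this conversation (each replica is downtimed, depooled, checked, altered if needed, and repooled automatically) could be modelled roughly as below. All function and host names here are illustrative stand-ins, not auto_schema's actual API.

```python
# Hypothetical sketch of the per-replica loop described in the chat:
# downtime -> depool -> check() -> apply -> repool. check() decides
# whether the change was already done, in which case it is skipped.

def run_on_replica(host, steps, change):
    """Apply `change` to one replica, recording each phase in `steps`."""
    steps.append(("downtime", host))
    steps.append(("depool", host))
    if not change["check"](host):      # skip hosts already altered
        steps.append(("apply", host))
    steps.append(("repool", host))

steps = []
change = {"check": lambda h: h == "db2141"}   # pretend db2141 is done already
for host in ["db2129", "db2141"]:
    run_on_replica(host, steps, change)
print(steps)
```

The "detect it was done already and skip it" behaviour mentioned above is what the `check` predicate stands in for.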
[09:25:59] if you can check it, if not I can
[09:26:06] dump @ 3 ips
[09:26:07] I'm running stuff on s2
[09:26:12] jynus: ok let me check
[09:26:35] it is added by mysql upgrade to SUPER privileges
[09:26:53] jynus: yeah, it is the history thing, let me remove it
[09:27:16] (the 3 ips are the dbprov1 hosts, in case of eqiad)
[09:27:21] codfw backup ran ok
[09:27:22] | dump | 10.64.0.95 |
[09:27:22] | dump | 10.64.16.31 |
[09:27:22] | dump | 10.64.32.26 |
[09:27:22] +------+-------------+
[09:27:25] those right?
[09:27:26] yep
[09:27:43] a revoke usually is enough
[09:27:50] yes I am doing that
[09:27:51] although check if it could affect other users
[09:28:28] I have revoked it only from dumps@thoseips
[09:28:45] that's the only thing, I need to rerun the backups
[09:28:56] let's try if that works then
[09:29:02] to double check, there is nothing to backup there, other than zarcillo, right?
[09:29:08] or is there orchestrator?
[09:29:11] or something else?
[09:29:34] orchestrator _will_ be there, but not just yet.
[09:29:38] ok
[09:29:43] pending T301315
[09:29:43] T301315: Move orchestrator from db2093 to db1115 - https://phabricator.wikimedia.org/T301315
[09:29:59] I remember being told not to backup yet, please ping me when things are ready so not to forget :-)
[09:30:05] will do
[09:30:09] 👍
[09:30:25] Added to the task just in case
[09:30:49] ok, will rerun backup
[09:40:57] kormat: I'm back to my PC. Do you still have questions about auto_schema?
[09:43:31] Amir1: does it downtime hosts before running the check() function?
[09:44:15] mmh, I am getting the same error as before. I will double check the host.
[09:44:16] it runs check() after downtime and depool, there is a ticket to move it forward
[09:44:23] maybe it is something else breaking
[09:44:28] jynus: let me know if I can help
[09:45:27] yeah, the grants are still there
[09:45:50] in db1115?
[09:45:53] I removed them :-/
[09:46:09] Amir1: oof, k.
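The fix discussed above is a targeted REVOKE of the privileges that mysql_upgrade added to the dump user at the three dbprov IPs. A small sketch of how such statements could be generated; the helper name and structure are illustrative, only the IPs and user come from the paste above.

```python
# Hypothetical helper rendering the REVOKE statements described in the chat.
# Revoking only from dump@<ip> leaves all other grants untouched.

DBPROV_IPS = ["10.64.0.95", "10.64.16.31", "10.64.32.26"]  # eqiad dbprov hosts

def revoke_statements(user, hosts, privileges=("SUPER",)):
    """Build one REVOKE per user@host pair."""
    privs = ", ".join(privileges)
    return [f"REVOKE {privs} ON *.* FROM '{user}'@'{host}';" for host in hosts]

for stmt in revoke_statements("dump", DBPROV_IPS):
    print(stmt)
```

As jynus notes, it is worth checking whether other users picked up the same privilege before assuming a single revoke is enough.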
and there's no way to run it with `--check`, so it will just print out hosts that have/have not been done?
[09:46:20] I will paste them just to make sure we are talking about the same thing
[09:46:33] jynus: hang on, I deleted them from db2093 I think
[09:46:52] marostegui: https://phabricator.wikimedia.org/P20770
[09:46:53] kormat: I actually want to implement it but never got the time to do it :D
[09:46:59] I can take care of it if you want
[09:48:06] they are gone for sure now jynus , sorry about that
[09:48:19] Amir1: does auto_schema know not to bother with depooling/repooling a passive section?
[09:48:24] | GRANT RELOAD, FILE, SUPER, REPLICATION CLIENT ON *.* TO `dump`@`10.64.16.31` IDENTIFIED BY PASSWORD '*xx' |
[09:48:25] no issue. :-)
[09:48:29] (and the other two ips too)
[09:48:39] great
[09:48:40] kormat: it does
[09:48:46] retrying
[09:48:52] the whole repool/depool is automatic
[09:49:28] Amir1: ok, phew
[09:49:37] marostegui: error log is empty now
[09:50:06] \o/
[09:50:07] will double check everything looks fine and be done with this
[09:53:33] all backups now looking good, only es4, es5 still running
[09:54:09] awesome
[10:01:03] I'm obviously also around for es5, but will get out of the way :-)
[10:02:04] I will monitor edit rate
[10:23:53] Amir1: sent you https://gerrit.wikimedia.org/r/c/operations/software/schema-changes/+/762778 for your delight (but i'm not 100% sure i used the right gerrit user for you..)
[10:25:26] (sorry, didn't realise the switch was still on-going, please ignore until later)
[10:42:38] kormat: responded :D
[10:55:23] Amir1: you are not doing anything on s4, right?
[10:56:53] nope
[10:56:58] s2 and es5 atm
[10:57:03] cool, will take it
[11:01:09] marostegui: I think this one I'm upgrading to bullseye (old es5 primary) needs PXE fix
[11:01:31] I can try without it maybe and see if it works?
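The `--check` mode kormat wishes for above (report which hosts have or have not had the change, without altering anything) is not implemented yet. A hypothetical sketch of what it might do; `applied` stands in for auto_schema's real detection logic, which inspects the live table definition.

```python
# Sketch of a hypothetical --check / dry-run mode: classify hosts as
# done vs pending given a predicate that detects the schema change.

def check_only(replicas, applied):
    """Return (done, pending) host lists given an `applied(host)` predicate."""
    done = [h for h in replicas if applied(h)]
    pending = [h for h in replicas if not applied(h)]
    return done, pending

# Toy usage: pretend only db2117 already has the change.
replicas = ["db2117", "db2129", "db2141"]
done, pending = check_only(replicas, lambda h: h == "db2117")
print("done:", done)
print("pending:", pending)
```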
[11:02:56] ah yes
[11:02:58] no, it won't work
[11:03:07] let me know when mysql is stopped and /srv umounted :)
[11:03:41] okay
[11:12:35] clearly you need some sort of automated queuing system in auto_schema :P
[11:13:25] taavi: I actually have some idea of an automated "map" of maintenance based on parsing SAL
[11:13:46] never got to implement it though :D
[11:14:15] ugh it's not draining :/
[11:30:57] marostegui: done. es1024 is yours
[11:33:47] I'm reimaging the codfw sanitarium (db2095)
[11:34:14] oh it's already bullseye
[11:34:15] nice
[11:42:22] then I'm upgrading codfw primary
[12:25:07] Amir1: does auto_schema run db-mysql with -BN?
[12:25:25] what is -BN :D
[12:25:36] 🤦‍♀️
[12:26:00] ah. mysql trips the ascii furniture if the output isn't a tty, ok
[12:26:30] *strips
[12:31:02] kormat: yup, just try anything with mysql and pipe it to a file
[12:31:33] so it implicitly does -B, but not -N
[12:41:15] I'm reimaging db1170, it's s2 and s7
[12:41:26] Amir1: will do it in a sec, I was having lunch
[12:45:30] marostegui: all good, in the meantime I reimaged a couple other db hosts
[12:54:03] Amir1: you should be good to go now with es1024
[12:54:59] thanks. I'll get to it once I'm done with db1170
[13:04:20] Amir1: so, the file i just added to schema-changes mentions a specific section. is it expected to send a CR to change that section as i progress with deploying the change?
[13:04:40] kormat: nah, just change the file
[13:04:43] kormat: we are not doing that no
[13:04:55] ok
[13:05:07] the plan is to make schema changes easier not harder :D
[13:05:12] and, i guess, you just copy that file into your own checkout of auto_schema on a cumin host?
[13:05:42] yeah, I want to package it in the future
[13:05:50] feel free to pick up that task :P
[13:07:55] mmhmm
[13:08:38] so.. how do i use this thing?
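On the `-BN` exchange above: the mysql client's `-B` (batch) flag emits tab-separated rows instead of the ASCII table (which is only drawn for a tty), and `-N` additionally suppresses the column-name header. A sketch of consuming such output in a script; the sample data mirrors the `dump`/IP paste earlier in the log.

```python
# Parse tab-separated output as produced by `mysql -B` (with header)
# or `mysql -BN` (no header). Sample data is illustrative.

def parse_batch(output, skip_header=False):
    """Split mysql batch output into row tuples; drop the header if asked."""
    rows = [tuple(line.split("\t")) for line in output.splitlines() if line]
    return rows[1:] if skip_header else rows

# With -B only, the first row is the column names; -BN omits it.
out_b = "User\tHost\ndump\t10.64.0.95\ndump\t10.64.16.31"
print(parse_batch(out_b, skip_header=True))
```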
the wikipage says that `--include-masters` will try to apply the schema change to the active dc's primary 😬
[13:08:53] it doesn't
[13:08:58] no, it will deploy on the non active master and the sanitarium master
[13:09:07] 🤦‍♀️
[13:09:10] as long as you don't explicitly say it
[13:09:18] https://wikitech.wikimedia.org/wiki/Auto_schema#Running
[13:09:19] (in list of replicas)
[13:09:56] > Without --include-masters, it will ask you before running schema change ... (including ... active dc master)
[13:11:02] kormat: that is correct but replicas = None won't pick up the active dc master
[13:11:06] the wiki page also later says "If it's the master of the active dc. The schema change will run without replication."
[13:11:28] oh
[13:12:02] the plan is that in the future it should be able to do that as well, so if you put the master in the list of replicas, it'll happily do it but not if you don't explicitly ask it to
[13:12:43] my plan is that once we are comfortable it works (including checks for whether it needs master switchover, etc.) then the implicit check handles it as well, but not right now
[13:13:06] ok
[13:13:12] so, am i good to run this against s6?
[13:13:21] sure
[13:13:36] you can add a replica first
[13:13:39] in codfw
[13:13:46] stuff like that
[13:24:05] kormat: oops :D
[13:24:12] I make that mistake all the time
[13:24:27] 😅
[13:30:28] Amir1: ok, so i ran it against a codfw s6 replica. what would you suggest as the next step? codfw s6 primary, with --include-masters?
[13:31:04] kormat: I think that would break replication to that replica you changed
[13:31:13] * kormat facepalms
[13:31:44] so, i need to change the `command` to be idempotent
[13:31:51] otherwise everything is going to hurt, a lot, right?
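The targeting rules Amir spells out above can be summarized: with `replicas = None`, everything except the active dc's master is picked up; a master is only touched when you list it explicitly. A hypothetical model of that selection logic (the function and host names are made up, not auto_schema's code):

```python
# Illustrative model of auto_schema's target selection as described here:
# an implicit run never touches the active dc master; an explicit replica
# list is taken verbatim, master included or not.

def pick_targets(all_hosts, active_master, replicas=None):
    """replicas=None -> all hosts minus the active dc master;
    an explicit list is used as given."""
    if replicas is None:
        return [h for h in all_hosts if h != active_master]
    return list(replicas)

hosts = ["db1131", "db2129", "db2141"]            # db1131: active dc master
print(pick_targets(hosts, "db1131"))              # master excluded implicitly
print(pick_targets(hosts, "db1131", ["db1131"]))  # explicit -> included
```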
[13:32:01] nah, lots of schema changes are not
[13:32:07] so it won't work
[13:32:15] i do not understand
[13:32:26] _this_ schema change can be made idempotent, aiui
[13:32:34] lots of schema changes can't have an idempotent command
[13:32:47] if so, then yeah
[13:33:17] my thinking was something like "get list of codfw replicas, put that in replicas and let it go"
[13:33:28] i hate your thinking :P
[13:33:31] for other sections obviously this is not needed
[13:33:43] I know
[13:36:36] can you tell auto_schema to run against a node which has replicas, without replication enabled?
[13:37:38] I originally allowed it IIRC but I removed it because it should be all automatic
[13:37:57] Amir1: except if the schema change is not idempotent :P
[13:38:11] worst case, just change the code for this specific thing now
[13:38:29] kormat: yeah, so the biggest issue is the "canary runs"
[13:38:42] right. let's just shoot the canaries.
[13:38:59] after that, you just run "s5" with None and it handles everything
[13:39:39] kormat: for now, let's do the replicas
[13:40:49] for those
[13:41:21] so that would be: manually create a list of _all_ instances in s6.. that aren't the one replica i've already done?
[13:41:35] oh, no. except i should avoid the primary in the active dc?
[13:42:23] yup
[13:42:47] but not all, only direct replicas of primaries of dcs
[13:42:51] oh, but it still will try to run _with_ replication against the primary in the secondary dc
[13:43:00] what's the procedure for that?
[13:43:39] actually there is code for that
[13:44:26] but let's just run it on codfw master now given the change
[13:44:34] for future cases, I will find a way
[13:44:40] ok :)
[13:45:03] implementing it should be rather easy
[13:45:11] create a ticket please
[13:45:26] (for anything you wish to change)
[14:20:59] Amir1: task filed as requested
[14:22:13] thanks
[15:29:38] godog: do we have specific grants for these ips?
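On the idempotency problem above: running a change on the codfw master with replication enabled re-applies it on the already-altered replica, which breaks unless the DDL is a no-op the second time. For column or index additions, MariaDB's `IF NOT EXISTS` clause gives exactly that. A sketch (table and column names are made up for illustration):

```python
# One way to make a schema-change `command` idempotent: MariaDB supports
# IF NOT EXISTS on ADD COLUMN / ADD INDEX, so re-applying the DDL via
# replication on a host that already has the change is harmless.

def idempotent_add_column(table, column, definition):
    """Build an ALTER that is safe to apply twice (MariaDB supports this)."""
    return (
        f"ALTER TABLE {table} "
        f"ADD COLUMN IF NOT EXISTS {column} {definition};"
    )

print(idempotent_add_column("revision", "rev_example",
                            "BIGINT UNSIGNED NOT NULL DEFAULT 0"))
```

As the conversation notes, not every schema change (e.g. data-dependent ones) can be expressed this way.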
https://phabricator.wikimedia.org/T301784
[15:30:43] As the hosts are gone I don't know what IPs they had :)
[15:30:57] But I would have thought we had 10.64.% ranges rather than individual IPs?
[15:31:22] Ah you meant the prometheus-mysqld-exporter I guess
[15:32:04] Changed the task to reflect that. I will check the existing IPs and assume the ones that do not resolve are the ones that need to go away :)
[15:36:08] marostegui: I'm in a meeting, but yeah all but prometheus2004 shouldn't resolve by now
[15:36:24] godog: no rush: https://phabricator.wikimedia.org/T301784#7711241
[16:01:36] marostegui: cheers, I'll fill in the task description too
[16:02:46] marostegui: I'm in favor of eqiad/codfw wide grants FWIW, as opposed to individual hosts that is
[16:51:21] Amir1: you working on db1170?
[16:51:32] it should be done now
[16:51:38] did it go kaput?
[16:51:51] Amir1: no no, it is alerting on prometheus exporter
[16:52:03] if it is a multi-instance I guess you need to disable the service one and reset-failed
[16:52:17] ugh
[16:52:21] yeah, okay I will do it
[16:52:23] sorry
[16:52:31] sure no rush!
[18:56:45] I have been chatting with Marko from mariadb, there's a workaround for the crash which I am testing at the moment
[18:56:53] So far it looks like it worked fine with aawiki
[18:57:02] So I am applying it to all the affected tables on db2074
[18:57:15] I will update the jira ticket and our ticket tomorrow
[18:57:35] it's been 12h today already so I am ready to leave the computre
[18:57:40] computer
[19:08:04] marostegui: fixed it