[09:52:09] marostegui: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/757389, it says that db1159 will be moved to m2. doesn't that leave us without a candidate primary for m1 in eqiad?
[09:52:32] kormat: All misc sections have the same candidate (standby host), db1117
[09:53:02] marostegui: ... so we'd have a multi-instance primary if something goes wrong?
[09:53:12] * kormat is wearing her Doubt face
[09:53:29] kormat: yes, but it is RO, so if the proxy fails over to it, we'd need to enable it manually. That's done in case the master flaps or for whatever other reason
[09:53:37] If we really need to fail over to it, then we need to set it RW manually
[09:53:51] and then reclone the old master, as replication won't be changed by the proxy or anything
[09:54:09] i can guarantee that at least some part of puppet/monitoring/automation will freak out at a multi-instance primary
[09:54:29] yeah, but it is sort of an emergency scenario; even if puppet fails, the services will continue to work
[09:54:32] but, yeah, i see your point that that is the standard setup for misc clusters. don't know why i never noticed it
[09:54:38] recloning the masters shouldn't take long
[09:54:53] mm
[09:55:03] having one real standby host per section is probably not worth the money for those 4 individual hosts
[09:55:08] ack
[09:55:38] but it can of course be discussed and/or planned for the next budget if we'd feel more comfortable with that
[09:57:00] marostegui: my main concern is just that the automation does _not_ have a scenario like this in mind. so some re-thinking might be in order
[09:57:37] yeah, but in that emergency scenario we simply change RW on db1117 and that becomes the master, as the proxy does the rest
[09:57:50] and we need to reconfigure codfw to replicate from that new master, but that's easy and non-impacting
[09:57:56] but yes, I get your point
[09:58:26] we currently don't have an automatic way to promote an sX candidate to master, or at least to configure replication on all the other hosts, if the master dies
[09:58:30] and that does worry me a lot :(
[09:58:34] ohh. so not just change RW, but _simply_ change it. ok, that changes everything!
[09:59:08] it is just a single mysql command
[09:59:17] for misc I mean
[09:59:35] if the proxy fails over to db1117, we just need to issue a read_only=off on that host to make it the master
[10:00:51] we don't run heartbeat on multi-instance nodes, something to bear in mind
[10:01:14] yeah, but it is not a big thing, as lag doesn't matter on misc
[10:01:17] for that period of time
[10:01:59] having db1117:3321 as master would be something very brief, until we recover or reclone the master
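(For reference, a minimal sketch of the manual promotion described above, assuming the m1 instance on db1117 is the one on port 3321 mentioned just above; the hostname, binlog coordinates, and CHANGE MASTER options are illustrative placeholders, not real production values, which in practice would come from the proxy and the replication/GTID state at the time.)

    -- On db1117's m1 instance (port 3321), once the proxy has failed over to it:
    -- make it writable so it can act as the new m1 master.
    SET GLOBAL read_only = OFF;

    -- On the codfw m1 master, repoint replication at the newly promoted host.
    -- The binlog file and position below are placeholders for illustration only.
    STOP SLAVE;
    CHANGE MASTER TO
        MASTER_HOST = 'db1117.eqiad.wmnet',
        MASTER_PORT = 3321,
        MASTER_LOG_FILE = 'db1117-bin.000123',
        MASTER_LOG_POS = 4;
    START SLAVE;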
[10:08:37] "2022-01-25 10:30:33 [ERROR] - Could not read data from frwiki.blobs_cluster26: Server shutdown in progress"
[10:08:50] yes
[10:11:04] when can I reschedule it?
[10:11:34] there was nothing connected to it when I stopped it
[10:12:15] should be back today, we've found the same PXE boot issue we saw with es1022, the whole batch had that misconfig
[10:12:28] v0lans is updating some cookbooks to fix that
[10:12:34] that's weird, there is normally at least 1 connection ongoing to keep consistency
[10:13:19] jynus: https://phabricator.wikimedia.org/P19262
[10:17:12] in any case, es backups, because of the very low concurrency, run from around 00:00 to 20:00 on Tuesdays
[10:17:31] no big issue if they fail, just you won't have backups :-)
[10:18:57] one thing: that's es1020, backups come from es1022
[10:19:25] so it could have been a confusion, no biggie :-)
[10:20:05] es1022 was reimaged yesterday and there were no connections when I checked
[10:20:22] I always check esXXXX connections before stopping mysql, especially because of snapshots
[10:20:30] I don't have the processlist for that one, so you'll have to believe me
[10:24:28] sure, just if I can ask you (this was something I mentioned after this shutdown) to either avoid Tuesdays or ping me if you want to do it that day, so we can see how to make it work?
[10:25:26] I thought you mentioned it for the failover. But yes, np
[10:25:59] this doesn't affect me or my work, just wanting to make sure we have fresh backups for you :-)
[10:26:32] ok, thanks
[10:50:04] jynus: you can re-run the dumps on es4 if you want. All the replicas are done
[10:50:24] ah, thanks, I thought the hw issues were still ongoing
[10:50:31] I was about to ask on the ticket
[10:50:35] not on es4
[10:50:41] those are done
[10:51:51] also, as I see it was a hw issue (I just wasn't aware of T299123), there was not much to avoid there!
[10:51:51] T299123: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123
[11:13:56] jynus: i have a question - i see you used CRC32(GROUP_CONCAT(...)) in db-compare. is there a particular reason for the choice of CRC32? (i've just been doing some benchmarking, and don't see any performance difference between CRC32 and, say, MD5)
[11:14:23] kormat: in that regard I copied what pt-table-checksum did
[11:14:33] ahh, ok. thanks!
[11:14:45] they mentioned CRC32 was very fast on some processors
[11:18:29] kormat: they claim so here: https://github.com/percona/percona-toolkit/blob/3.x/bin/pt-table-checksum#L12817
[11:18:53] of course I just took their word for it, but I believe they support others too
[11:18:56] * kormat nods
[11:19:08] yeah, they support a few, including sha1/md5
[11:20:25] there is also: https://www.percona.com/doc/percona-server/LATEST/management/udf_percona_toolkit.html but I haven't tried it (nor is it installed in production)
[11:26:56] one thing to note also is that the default chunk size is very conservative; querying 100000 rows or so is when performance starts to show (otherwise per-query overhead will be the main factor)
[11:28:53] aye. already played around with that.
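(For context, a rough sketch of the kind of chunked checksum query being discussed; this is not the literal query db-compare or pt-table-checksum issues, and the table name, columns, and chunk boundaries are made up for illustration. Comparing the per-chunk count and checksum across hosts is what detects drift, and the ~100000-row chunk reflects the point above about per-query overhead.)

    -- GROUP_CONCAT output is capped by group_concat_max_len, so raise it for
    -- large chunks before checksumming.
    SET SESSION group_concat_max_len = 100 * 1024 * 1024;

    -- Checksum one primary-key-bounded chunk of a hypothetical table.
    SELECT COUNT(*) AS row_count,
           CRC32(GROUP_CONCAT(id, '|', IFNULL(payload, '') ORDER BY id)) AS chunk_checksum
    FROM some_table
    WHERE id BETWEEN 1 AND 100000;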
[13:26:30] I have just removed the recentchanges group from s8, if you notice something please let me know
[13:26:39] The other big group, watchlist, is still pending
[13:46:26] "yay" https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=932217 (one of swiftrepl's dependencies is gone post-Buster)
[13:49:02] Well, that'll force our hand
[13:50:20] I was hoping we could procrastinate on that for longer, though
[13:54:50] Amir1: I am going to try something with volans, so I am going to take over es1025 from es5
[13:54:54] jynus: ^
[13:55:02] ok
[13:55:10] * volans hides
[13:55:14] less work for me? Awesome
[13:55:26] * volans declines any responsibility
[13:55:29] why isn't every day like this
[13:55:43] * sobanski plans to blame volans for any collateral damage
[14:34:42] is this how we do blame-free post-mortems, then? Make sure we have the blame lined up _before_ anything goes wrong...
[14:38:24] It's the best way. No ambiguity.
[14:38:55] Also, we never said anything about blame-free pre-mortems.
[14:39:00] Emperor: even better if specified in the commit message
[14:39:42] :)
[14:39:55] git commit -a -m "volans said it was a good idea"
[14:40:38] :)
[15:24:48] Can I get a review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/757389 ?
[15:25:20] oh, yeah. i was looking at that hours ago, and got distracted by misc/candidate primaries
[15:25:34] thanks!
[15:37:25] Amir1: es1025 is done, we can do the rest in eqiad tomorrow. There's something we need to run beforehand to fix some of the HW issues. Let's chat tomorrow
[15:37:34] I am going to leave the host repooling and log off, it's been enough for today
[15:38:01] awesome, thanks
[15:38:44] es2023 should be fine
[15:38:47] it is just the eqiad hosts