[05:51:46] marostegui: moaning
[05:51:52] o/
[05:52:06] o/
[05:57:26] ugh
[05:58:08] Wakey wakey ;)
[05:59:22] there is no amount of caffeine that makes 05:58 a reasonable hour
[05:59:32] ^
[06:00:15] is there a wiki page that describes the procedure?
[06:00:56] The steps are listed in https://phabricator.wikimedia.org/T294321
[06:02:16] Also here: https://wikitech.wikimedia.org/wiki/MariaDB#Production_section_failover_checklist, but this one requires a clean up
[06:02:40] ta
[06:02:50] sobanski: clean-up is implied by 'wikitech'. ;)
[06:03:29] are we using db-switchover script?
[06:03:49] * Emperor is also watching -operations, but figured the half-asleep witter might be better remaining here
[06:04:46] Emperor: yep, with 2 runs
[06:04:52] the first run moves all replicas below the new primary
[06:04:58] that's done ~30 mins before the switchover
[06:05:15] the second run then promotes the new primary, and moves the old primary to be a replica
[06:05:58] we had 31 seconds of read only time
[06:07:14] marostegui: that'll have to do i suppose
[06:08:23] I am going to reimage the old master
[06:08:24] kormat: so during that 30m period replication is OLD->NEW->everything else?
[06:08:33] yep 👍
[06:11:29] A question unrelated to the failover, is replication on db1102 (s3) disabled on purpose?
[06:11:40] sobanski: those are the backups running
[06:12:04] Ah. I somehow missed that fact, thanks :)
[06:12:47] sobanski: common cause for confusion. tbh i think having the backup sources be named dbXXXX is a bad idea for this sort of reason
[06:13:14] +1
[06:13:26] The idea behind that was that they can act as slaves if needed
[06:13:28] (naming things is hard, but)
[06:15:54] is that it for the switchover?
[06:16:02] marostegui: i mean, that sounds potentially useful, but what happens in reality (for me, at least) is that i frequently see orchestrator+icinga+alertmanager saying that nodes aren't replicating, and every time i need to look up in hiera to double-check if these are backup sources or not
[06:16:05] (beyond keeping an eye out for 🔥)
[06:16:43] kormat: I think we need to explore https://phabricator.wikimedia.org/T266869 as that might solve this
[06:17:09] marostegui: tell me how orch tags will solve icinga/alertmanager in the above statement? :P
[06:17:20] no, of course, the icinga issue won't be solved
[06:17:30] hostname is basically the only thing in common between those 3 envs
[06:17:33] tag would be good _too_
[06:18:03] speaking of badly named nodes, db1108 :/
[06:18:31] how hard is renaming a node? [e.g. while it's being reimaged anyway]
[06:19:23] if you're already reimaging it, annoying but doable
[06:19:36] if you weren't planning on reimaging it, tough luck, renaming requires a reimage.
[06:19:50] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
[06:24:22] ta
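A minimal sketch of what the two db-switchover runs described above amount to, assuming a GTID-based MariaDB topology. This is not the real db-switchover tool (which also handles catch-up checks, heartbeat, events and semi-sync); the host names, credentials and the run_sql() helper are placeholders for illustration only.

```python
#!/usr/bin/env python3
"""Rough sketch only: NOT the real db-switchover script."""
import pymysql


def run_sql(host, statements):
    # Placeholder helper (hypothetical): execute a list of statements on one host.
    conn = pymysql.connect(host=host, user="repl_admin", password="********",
                           autocommit=True)
    try:
        with conn.cursor() as cur:
            for stmt in statements:
                cur.execute(stmt)
    finally:
        conn.close()


def move_replicas_under(replicas, new_primary):
    # Run 1 (~30 minutes before the switch): repoint every replica except the
    # candidate itself, so the topology becomes OLD -> NEW -> everything else.
    for replica in replicas:
        run_sql(replica, [
            "STOP SLAVE",
            f"CHANGE MASTER TO MASTER_HOST='{new_primary}', "
            "MASTER_USE_GTID=slave_pos",
            "START SLAVE",
        ])


def promote(old_primary, new_primary):
    # Run 2: make the old primary read-only (the brief read-only window
    # mentioned above), promote the candidate, and demote the old primary
    # to a replica of the new one.
    run_sql(old_primary, ["SET GLOBAL read_only = 1"])
    run_sql(new_primary, ["STOP SLAVE", "RESET SLAVE ALL",
                          "SET GLOBAL read_only = 0"])
    run_sql(old_primary, [
        f"CHANGE MASTER TO MASTER_HOST='{new_primary}', "
        "MASTER_USE_GTID=slave_pos",
        "START SLAVE",
    ])
```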
[06:48:24] Amir1: re: T273054, https://i.imgur.com/o5CW9v5.png
[06:48:25] T273054: Investigate using PMM (Percona Monitoring and Management) for slow-query analysis - https://phabricator.wikimedia.org/T273054
[06:59:07] I have recreated db1108:3352 in tendril as it was failing
[08:22:25] good morning
[08:22:39] hey jynus welcome back!
[08:54:20] I'm trying to understand high level status of reimages- all mw hosts upgraded to 10.4?
[08:54:32] mw hosts?
[08:54:43] db, sorry, that serve mw requests
[08:54:46] ah yes
[08:54:48] all done
[08:54:59] cool, good job on all people involved
[08:55:17] https://phabricator.wikimedia.org/T290865 is all yours now
[08:55:31] And this too (the backup parts) https://phabricator.wikimedia.org/T290868
[08:55:38] thank you!
[12:59:06] marostegui: do we have a list anywhere of the steps we take before/after dc switchovers?
[12:59:21] like, stopping maintenance, checking weights, warming up caches, etc
[13:00:04] kormat: the switchover page?
[13:01:05] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki
[13:07:15] yep ^ that
[13:08:32] If you mean for us specifically, we also have some I think, let me check
[13:10:06] i do :)
[13:11:09] So this is what we have: https://phabricator.wikimedia.org/T288594
[13:11:15] But I don't know if we have that somewhere in wikitech
[13:11:17] I cannot really find it
[13:11:28] So maybe we should add it somewhere at least as a quick draft
[13:11:43] ok. it's all as i suspected :)
[13:11:47] i'll put together a quick draft
[13:11:56] thaaaanks
[13:45:10] marostegui, sobanski: https://wikitech.wikimedia.org/wiki/MariaDB/Switch_Datacenter
[13:55:44] kormat: thanks!
[14:04:38] now linked from https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Databases
[14:30:26] majavah: hi, can you access https://orchestrator.wikimedia.org/ now?
[14:31:21] no, I get an apache 403
[14:31:46] previously it was an idp error message, but now that's a very clear apache default error message
[14:32:27] :/
[14:32:52] idp lets you in but apache in orchestrator is unhappy.
[14:33:04] that's weird. I'll ask kormat
[14:33:16] https://gerrit.wikimedia.org/g/operations/puppet/+/production/hieradata/role/common/orchestrator.yaml#12
[14:33:21] you need to update that too
[14:34:06] that makes sense. I want to make sure that's not write access, let me ask
[14:34:50] that's idp
[14:43:00] majavah: let's try again
[14:43:40] seems to work now
[14:44:07] awesome
[14:44:14] thanks for spotting that
[15:10:57] kormat: kibana is better than tendril :P
[15:13:58] Amir1: re the orchestrator mail, there's a difference between being ok with some people accessing it, and advertising it for general availability 😬
[15:14:30] I'm ok with the former, I'm less.. enthused at the latter
[15:14:56] it's not for advertising it, ops@ is only people who have production access
[15:15:20] how many people are on ops@?
[15:15:26] ~150+
[15:15:26] so it's much smaller group
[15:15:32] I was a bit surprised with the email too
[15:15:41] much smaller than..? :)
[15:15:49] I'm on ops@ but I don't have shell on production
[15:15:50] foundation-all
[15:16:01] I'm not sure we'd even advertised it widely among _sre_
[15:17:42] the thing is that dbtree and tendril are used among devs too
[15:17:57] I don't want to one day tell them, it's gone
[15:18:04] having some proper heads up
[15:18:35] Yes, that was going to be taken into account of course
[15:36:50] I am having an issue with reimages of certain HP hosts - I want to ask if anyone has experienced similar issues
[15:37:30] when reimaging, partitioning fails because the HW raid device is on /dev/sdb instead of the expected /dev/sda
[15:37:54] The exact error I get is: ERROR: /dev/sda matches zero devices
[15:38:09] All devices: /dev/mapper/tank-data, /dev/sdb
[15:38:25] it is not a partman issue because it worked for non-HP hosts
[15:38:49] I wonder if a BIOS update caused the drive letter to change or something else?
[15:39:26] or a virtual cd drive to be detected or something
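On the device-letter problem above: /dev/sdX names depend on kernel enumeration order, so a firmware change or an extra device (such as a virtual CD drive) appearing first can shift the HW RAID volume from sda to sdb. A small illustrative sketch, unrelated to the actual partman recipes, that lists the stable /dev/disk/by-id and /dev/disk/by-path names for a device; the default device path is an assumption.

```python
#!/usr/bin/env python3
"""Illustrative only: show the stable names a block device is known by,
since /dev/sdX letters can change between boots or firmware revisions."""
from pathlib import Path


def stable_names(device="/dev/sdb"):
    # Collect every symlink under /dev/disk/by-id and /dev/disk/by-path
    # that points at the given device node.
    target = Path(device).resolve()
    names = []
    for directory in ("/dev/disk/by-id", "/dev/disk/by-path"):
        base = Path(directory)
        if not base.is_dir():
            continue
        for link in base.iterdir():
            if link.resolve() == target:
                names.append(str(link))
    return sorted(names)


if __name__ == "__main__":
    # The device path is a placeholder; pass whichever node you want to map.
    for name in stable_names():
        print(name)
```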
[15:54:16] Even if it is not the recipe, I will try a wipe to see if I can progress: https://gerrit.wikimedia.org/r/c/operations/puppet/+/738259
[15:54:30] (I don't care about the data there)
[15:55:39] otherwise I probably won't be able to contact any dcops for help until next week
[16:01:51] afk for a bit
[16:45:28] I am progressing, now it just says "Illegal OpCode" on boot :-(
[16:46:06] I am leaving it here for the day
[16:52:22] back
[16:53:34] btw, I'm helping with reducing extra parsing T292302
[16:53:34] T292302: CommonsMetadata extension causes every page on commons to be always parsed twice - https://phabricator.wikimedia.org/T292302
[17:16:11] Swift disk is sad, I've opened T295563
[17:16:12] T295563: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563
[19:58:18] The patch for djvu is ready finally \o/