[05:24:09] hashar: (or someone else) https://phabricator.wikimedia.org/P33014 is that expected?
[08:58:40] marostegui: not at all. The reason is I rolled back yesterday and forgot to push the revert to Gerrit :(
[08:59:25] aka https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/826507/
[08:59:40] I screwed that one up :-\
[13:04:58] topranks: XioNoX: Are you happy for me to push out this BGP change today, or is there a better time to schedule it? https://gerrit.wikimedia.org/r/c/operations/homer/public/+/826525
[13:14:35] btullis: yep! let me know if there is any issue
[13:22:30] Thanks. Will do.
[13:56:07] It seems I'm missing from the user list in homer. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/config/common.yaml#43
[13:57:13] btullis: it's not required to push stuff from the cumin hosts, but useful if you need to troubleshoot stuff there
[13:57:14] XioNoX: Should I make another patch to add myself, or would you rather push out the change? I thought I had already used homer before, but evidently not.
[13:57:21] feel free to add yourself, yep
[13:57:46] fyi I have to step away in ~15min
[13:58:08] It was the `homer * diff` that was failing.
[14:00:00] btullis: failing how?
[14:02:10] https://usercontent.irccloud-cdn.com/file/ym5ENQ0U/image.png
[14:03:05] btullis: just a warning :)
[14:03:21] btullis: ah, did you run puppet on cumin1001 to pick up the new change?
[14:04:01] Oh, I see. Sorry, I misinterpreted 'skipping device' as 'couldn't get diff for device'.
[14:04:14] > did you run puppet on cumin1001 to pick up the new change?
[14:04:23] No, I didn't. Thanks.
[14:05:16] OK, trying again.
[14:14:29] btullis: everything going well?
[14:14:55] Not too well. `ERROR:homer:Device cr1-eqiad.wikimedia.org failed to render the template, skipping.`
[14:16:32] https://www.irccloud.com/pastebin/mwOAEOb2/
[14:17:38] btullis: ah, right, jinja2.exceptions.TemplateNotFound: includes/customers/64609.policy
[14:18:24] see similar files
[14:18:56] btullis: do you mind reverting it, and sending a new CR with that file?
[14:19:11] I unfortunately have to step away
[14:19:17] Not at all, that's fine. I'll revert now.
[14:37:25] Here's the updated CR, but I'm happy to wait until next week to push it out if that helps: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/826579
[14:45:45] marostegui: Amir1: regarding x2 replica connections. The way we'd like to address this is not by special-casing MainStash but by removing the existing special-casing and instead utilising the fact that MW applies all these features automatically based on there being replicas configured. In other words, we'd like to change db config to not advertise the replicas in the first place. Our options here are to modify the data from dbctl/etcd in
[14:45:45] wmf-config, or for it to be modified/excluded in what materializes in etcd in the first place from dbctl. I don't know exactly where and how we want the separation of concerns here. On the one hand, I guess you still want the replicas registered somewhere in an easy-to-find manner, both for what they are for and for switchovers. On the other hand: dbctl's etcd structure is exactly that of MW, and making it different at runtime introduces
[14:45:45] confusion/on-boarding/intuition struggles.
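
For context on the failure at 14:17:38, here is a minimal Python sketch of that failure mode: a Jinja2 device template that includes a per-customer policy file cannot render at all if the included file is missing from the loader's search path. The directory and template names below are simplified stand-ins, not Homer's real templates.

```python
# Minimal reproduction sketch (not Homer's actual code): a template that does
# {% include 'includes/customers/64609.policy' %} fails to render entirely if
# that include file is absent from the search path.
import jinja2

env = jinja2.Environment(loader=jinja2.FileSystemLoader("templates"))

try:
    # "cr-device.j2" is a placeholder name for a router template.
    output = env.get_template("cr-device.j2").render()
except jinja2.exceptions.TemplateNotFound as exc:
    # Homer surfaces this as "failed to render the template, skipping."
    print(f"missing template or include: {exc.name}")
```

This is why the fix was to send a new CR that adds the missing includes/customers/64609.policy file rather than changing the device template itself.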
[15:05:07] please read working document for today's incident report
[15:21:59] Krinkle: I don't mind not having the replicas registered somewhere, we have orchestrator.wikimedia.org for that, but I would like to have them either fully depooled in dbctl or not showing up, as otherwise it can be confusing
[15:23:41] let me know if you want some sort of special handling in dbctl for these
[15:23:59] cdanis: Thanks :**
[15:24:28] I think we can just leave them with 0 weight and that's probably enough to show that they're really not used (once the bug is fixed)
[15:27:22] we already added a `flavor` field to sections for external sections vs core
[15:27:32] could do something similar (although I don't have context here)
[15:28:38] topranks: Could you comment on https://phabricator.wikimedia.org/T315955? I assume it's just entering something in Netbox, which I'm happy to do, but maybe there's more to the task than I'm getting.
[16:54:56] yes, for some reason the fact is misidentifying it; I'll take a look later tonight. Sorry, had friends turn up earlier than expected today, so I've been a bit distracted :/
[18:48:18] marostegui: "once the bug is fixed" equals "remove from config". The features around lag checks and chronology protection are enabled if and when a database has replicas provided to MW. We could build a separate feature flag for this, but it is currently zero-config based on there being replicas or not, and there is no use case besides x2 for this. I'd prefer as such to treat the existence of these as an operational detail and not
[18:48:18] something MW needs to know about.
[18:48:46] we already skip the logic and overhead related to lag checks and ChronoProt when a cluster is master-only.
[18:49:08] (e.g. for local development, small third-party farms, PC, and x2)
[18:49:33] I'm happy to do that through a post-processing step in wmf-config first. I just wanted to give you both options.
[18:50:05] weight:0 will not be enough, I think. It will dampen the effect but still register them as generally having replicas.
[18:50:19] it's a boolean change in behaviour
[18:50:28] * Krinkle copies to task and creates subtask for cdanis
[18:51:16] Krinkle: okay, thanks, that's all good context. I need to check, but I don't think it should be hard to add another dbctl section 'flavor' that simply doesn't include any replicas in the output
[18:51:56] which sounds like it would be enough?
[19:05:45] This sounds like dbctl stores data in two places in etcd, one as source and one as output.
[19:05:55] If so, yeah, that sounds like it would suffice
[19:06:21] yes
[19:06:29] that is right :)
[19:06:47] the output part very closely follows MediaWiki's data structures
[20:03:03] I am following https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings, and there is a bit where it talks about device enumeration being wrong (and "...a reboot of the host 'fixes' this"), but I'm seeing labeling that is wrong. sdh1 & sdi1 are crossed for swift-sdh1 & swift-sdi1 respectively, and both sdc1 and sdz1 are getting labeled swift-sdz1... does this ring any bells for anyone?
[20:04:20] (for ms-be2067 fwiw...)
[20:17:38] urandom: Our last batch of servers had their labels all mixed up on first image.
[20:18:51] In our case some of the drives had existing partitions from a previous install attempt. Once those were all cleaned up, a later install labeled things as expected.
[20:19:21] Something seems to have changed with Bullseye that makes reordering more common.
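
A rough Python sketch of the `flavor` idea discussed around 18:51:16: a section flavor on the dbctl output side that simply omits replicas from the MediaWiki-facing data. This is not the real dbctl/conftool code; the flavor name, field names, host names, and dict layout are assumptions made for illustration only.

```python
# Hypothetical sketch: build the MediaWiki-shaped output for one section,
# where a flavor such as "external-no-replicas" suppresses replica entries
# so MW treats the cluster as master-only (no lag checks, no ChronoProt).
# Field and flavor names are invented; the real dbctl schema differs.

def render_section(section):
    master = section["master"]
    hosts = {master: section["instances"][master]}
    if section.get("flavor") != "external-no-replicas":
        # Normal sections advertise all instances (and weights) to MW.
        hosts.update(section["instances"])
    return {"hosts": hosts}


x2_source = {
    "master": "db1151",  # placeholder host names
    "instances": {"db1151": 200, "db1152": 100, "db1153": 100},
    "flavor": "external-no-replicas",
}
print(render_section(x2_source))
# -> {'hosts': {'db1151': 200}}
```

This is the "boolean change in behaviour" point above: because MW enables the replica-related features purely on the presence of replicas in the advertised config, omitting them in the output is sufficient, whereas weight 0 would still register the section as having replicas.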
Some (not very helpful) discussion here: https://forums.debian.net/viewtopic.php?f=5&t=145185&sid=259af33014c42b08d05e12324b6daf39
[20:22:36] it changes every time the host is rebooted... :/
[20:22:56] https://wikitech.wikimedia.org/wiki/Swift/How_To#Rebooting_backends_/_Puppet_is_failing_on_a_recently-booted_backend
[20:23:31] "...it's worth checking drive ordering especially of /dev/sd{a,b} is correct; similarly when rebooting swift nodes, check this is correct. If not, reboot until the drives come up in the right order."
[20:32:48] that sounds... non-optimal. "Just keep toggling the power until things start working"
[20:33:14] bd808: I'm feeling a little better about it now
[20:33:17] have you tried turning it off and on and off and on and off and on again?
[20:33:59] I've logged into a number of other swift back-ends, and not a one of them has a full set of drives that match their labels
[20:34:16] so... maybe it's all good (somehow)?
[20:35:32] rzl: and off and on and off and on... yes
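
A small Python sketch for the label/device mismatch urandom describes: it compares each swift-sdX1 filesystem label against the kernel device it currently resolves to via /dev/disk/by-label. The swift-<dev> label convention is taken from the log above; the script itself is an assumption about how one might eyeball this, not an existing tool.

```python
# Sketch: report where a swift-sdX1 label points at a different kernel
# device name than its suffix suggests (e.g. swift-sdh1 -> /dev/sdi1).
import glob
import os

for link in sorted(glob.glob("/dev/disk/by-label/swift-sd*")):
    label = os.path.basename(link)                      # e.g. swift-sdh1
    device = os.path.basename(os.path.realpath(link))   # e.g. sdi1
    expected = label[len("swift-"):]
    status = "ok" if device == expected else "MISMATCH"
    print(f"{label}: expected {expected}, got {device} -> {status}")
```

If the filesystems are mounted by label (e.g. LABEL=swift-sdh1 in fstab), a mismatch between the label suffix and the kernel's sdX name is largely cosmetic, which would be consistent with the "maybe it's all good (somehow)" observation above.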