[11:23:31] I have something for the dbctl experts
[11:23:36] _joe_: Maybe you can help here?
[11:23:50] <_joe_> marostegui: moi?
[11:23:59] <_joe_> please do tell
[11:24:55] I set up pc6 (eqiad and codfw) yesterday and it all went well. However, it looks like something attempts to remove pc2016 from codfw after a while. I noticed it yesterday and thought maybe I forgot to commit the change, but that is not it: I saw it again in the evening and again right now https://phabricator.wikimedia.org/P72131
[11:25:01] It only happens with the codfw instance, not the eqiad one
[11:25:22] It seems to rewrite the content of dbctl -s codfw section pc6 edit
[11:25:24] From time to time
[11:25:29] But again, only for codfw
[11:42:08] <_joe_> uhm
[11:42:19] <_joe_> I'm trying to make sense of it
[11:42:38] I honestly have no idea what it can be, I've never seen this with any other section I've set live
[11:42:53] It is not puppet, cause I ran it on both cumin hosts
[11:42:56] And it doesn't happen after it
[11:42:56] <_joe_> ok, so, first of all we can check if something is writing to dbctl
[11:43:31] <_joe_> we can do so by looking at the git log on any puppetserver under /srv/git/conftool/auditlog
[11:43:52] <_joe_> so it's removing the instance, right
[11:44:14] The instance itself has the right config
[11:44:17] It is editing the section
[11:44:21] And removing it from the section
[11:44:41] So dbctl instance pc2016 edit looks good, while dbctl -s codfw section pc6 edit looks rewritten with default values
[11:45:05] So actually: dbconfig-section/codfw/pc6.yaml looks good on puppetserver
[11:45:25] It is the way it should be, but it is like something is trying to overwrite it
[11:45:37] <_joe_> the last change I see in dbconfig-section is from you yesterday
[11:45:45] https://phabricator.wikimedia.org/P72132 that is the value that shows
[11:45:47] Which is correct
[11:46:02] But the diff I pasted earlier shows that something is trying to modify it
[11:46:13] <_joe_> uhm
[11:46:17] <_joe_> -omit_replicas_in_mwconfig: true
[11:46:18] <_joe_> +omit_replicas_in_mwconfig: false
[11:46:26] <_joe_> what does this mean?
[11:46:50] <_joe_> marostegui: it's clear it's some bug in the code that translates those structures in mwconfig
[11:46:55] mmmm
[11:47:10] Ah, so eqiad has true
[11:47:12] Which makes sense
[11:47:17] Cause pc doesn't have replicas
[11:47:55] <_joe_> can you please modify that back?
[11:48:06] <_joe_> or I can if needed
[11:48:09] I will have to modify the whole section, but yes, will take me a sec
[11:48:44] done
[11:48:57] Diff is obviously gone now
[11:49:03] <_joe_> ok
[11:49:07] Let's see if it gets rewritten after a bit?
[11:49:10] <_joe_> so we need to understand why is that the case
[11:49:17] <_joe_> nah I had the diff a second ago
[11:49:52] <_joe_> so I'm thinking this is probably a quirk of how external sections were implemented (I don't know much about it)
[11:50:23] No, the diff is gone now, so nothing to commit (so all good)
[11:51:20] <_joe_> yeah so I guess the problem is just that - omit_replicas_in_mwconfig: true
[11:51:48] <_joe_> err, "false"
[11:51:57] <_joe_> somehow that causes the issue above
[11:52:35] And why and what attempts to do it after a while?
[11:52:44] Like what tries to modify it?
[11:54:46] <_joe_> I don't think that's the case
[11:55:06] <_joe_> anyways, I'll take a look at the code if I manage today
[11:55:10] <_joe_> but no guarantees
[11:57:22] Thanks _joe_
[12:11:15] _joe_: it is back again :(
[12:13:43] And both section look the same apart from the master of course https://phabricator.wikimedia.org/P72133
[12:14:07] Same with the masters: https://phabricator.wikimedia.org/P72134
[12:14:36] <_joe_> marostegui: uhh wth
[12:15:26] <_joe_> marostegui: I don't see any diff?
[12:16:00] I see this in cumin1002: https://phabricator.wikimedia.org/P72135 (external store is expected)
[12:16:03] <_joe_> I'm not sure what you're saying
[12:16:25] _joe_: What I am saying is that again the host is being removed from pc6 codfw
[12:16:40] <_joe_> oh yeah I see it as well now
[12:16:42] https://phabricator.wikimedia.org/P72135#289239
[12:16:44] yeah
[12:16:51] <_joe_> ok, this must be a problem with the datastore
[12:19:50] <_joe_> marostegui: uhhh I think I found the problem, but lemme look once more
[12:20:04] I am curious!
[12:21:10] <_joe_> marostegui: the key is not in etcd
[12:21:19] which key? pc6?
[12:21:21] <_joe_> yes
[12:21:27] But why only codfw is being removed?
[12:21:28] <_joe_> curl https://conf1009.eqiad.wmnet:4001/v2/keys/conftool/v1/dbconfig-section/codfw/pc6
[12:21:52] <_joe_> that I have no idea about
[12:22:07] <_joe_> seems like something is writing to etcd and it's definitely not conftool
[12:22:07] They are at modules/profile/files/conftool/json-schema/mediawiki-config/dbconfig.schema
[12:22:09] <_joe_> oh wait
[12:22:30] <_joe_> I guess it's missing from conftool-data
[12:22:34] <_joe_> so you add it
[12:22:44] <_joe_> when someone puppet-merges
[12:22:47] <_joe_> it gets removed
[12:22:52] Aaaaah
[12:22:53] <_joe_> and no one reads the output
[12:22:54] <_joe_> :D
[12:22:54] It makes sense
[12:23:13] Yeah, but isn't it at modules/profile/files/conftool/json-schema/mediawiki-config/dbconfig.schema where it needs to be added?
[12:23:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112017
[12:23:37] <_joe_> conftool-data/dbconfig-section/sections.yaml
[12:23:45] it is there too
[12:23:58] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112007
[12:23:58] <_joe_> ah sigh I didn't update my repo sorry
[12:24:01] <_joe_> yeah
[12:24:05] Ah wait!!!!
[12:24:10] It is missing in codfw on that patch
[12:24:11] Fixing
[12:24:19] <_joe_> it's only in eqiad yes
[12:24:34] <_joe_> so I was right, I assumed I didn't update :)
[12:24:37] <_joe_> ok, phew
[12:24:42] <_joe_> at least we solved this
[12:25:06] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112199
[12:25:20] I was scared that something was writing to conftool from somewhere, that was scary
[12:25:49] That patch should fix it
[12:26:41] <_joe_> yes
[12:26:50] <_joe_> and well something was writing to conftool from somewhere :D
[12:27:04] yeah, but for a good reason
[12:27:14] <_joe_> I'm not sure why it wasn't in conftool2git though, sigh
[12:27:43] <_joe_> I'll go to lunch on a happy note
[12:27:53] thanks for the help _joe_!
[12:36:36] I go for lunch too, while db2141 s1 tables are rebuilding
[12:37:34] s4 dump worked well for db2239, so all looking good
[12:39:52] Actually, I need to review it, it says ok, but backup only took 1h, which is very suspicious
[12:40:24] despite size being the same
[12:45:46] https://phabricator.wikimedia.org/T383971#10470261
[12:49:41] sorry- this was meant for #data-persistance
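
Note on the pc6 root cause discussed above: the section had been declared for eqiad but not for codfw in conftool-data, so the sync that runs on puppet-merge kept removing the hand-added codfw object. The sketch below only illustrates that reconcile pattern. It assumes the sync treats the declared data as the source of truth and removes anything not declared, as _joe_ describes; it is not conftool's actual code. The function name plan_sync, the dictionary layout, and the pc5 entry are hypothetical and exist only for illustration.

```python
# Hypothetical reconcile step: names and data layout are illustrative only,
# not taken from conftool.

def plan_sync(declared: dict[str, set[str]], stored: dict[str, set[str]]):
    """Return (to_add, to_remove) so that `stored` ends up matching `declared`."""
    to_add, to_remove = [], []
    for dc in sorted(set(declared) | set(stored)):
        want = declared.get(dc, set())
        have = stored.get(dc, set())
        # Declared but not stored: the sync would create these.
        to_add += [(dc, name) for name in sorted(want - have)]
        # Stored but not declared: the sync would delete these.
        to_remove += [(dc, name) for name in sorted(have - want)]
    return to_add, to_remove


# Stand-in for the declared data: pc6 listed for eqiad only (pc5 is made up).
declared = {"eqiad": {"pc5", "pc6"}, "codfw": {"pc5"}}
# Stand-in for the datastore after pc6 was set up by hand in both datacenters.
stored = {"eqiad": {"pc5", "pc6"}, "codfw": {"pc5", "pc6"}}

to_add, to_remove = plan_sync(declared, stored)
print("add:", to_add)        # add: []
print("remove:", to_remove)  # remove: [('codfw', 'pc6')]
```

With pc6 also declared for codfw, both lists come back empty, which matches the expectation in the log that the follow-up patch stops the removals.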