[10:35:25] lunch
[12:15:07] errand
[13:20:02] o/
[15:19:10] \o
[15:21:52] o/
[15:40:12] There's a question from SRE on this ticket: https://phabricator.wikimedia.org/T318820#8332742
[16:17:12] mpham will take a look, I think this gets back to the whole discussion about docker vs mediawiki-vagrant
[16:19:40] ebernhardson ryankemper do y'all want to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/835596 in the upcoming puppet deploy window?
[16:21:11] inflatador: sure, although we should probably follow dcaro's request to move the rule into the other module?
[16:21:40] ACK, I thought ryankemper already did that? Checking
[16:21:55] i'm curious about his comment though, "temporary addition, this should be removed after 2022-10-15"
[16:22:09] ACK on both counts
[16:22:09] suggests we have different understandings of the purpose? We wanted to use this long term, not for a single use
[16:23:00] agreed...which also raises the question, should we maybe mount on all w[cd]qs servers? Or worry about that later?
[16:23:28] that also reminds me, I need to check on the reload cookbook
[16:24:09] the script in the wdqs repo will have to be updated to read from the new paths, can check and assume but i was going to wait until it was mounted so can have direct evidence of the expected file path
[16:24:45] assuming is probably fine, it's mounting a filesystem and we know what's there :)
[16:27:50] I think that "temporary" message is a reference to the labstore servers that are being decommed (or may have already been decommed by now)
[16:29:04] I dunno though! I'll just work on resolving d-caro's comment for the time being
[16:29:14] ahh, perhaps
[16:35:04] b-tullis did email us about some decommed servers that are still referenced in /etc/fstab on a couple of servers, let me see if I can fix that now too
[16:45:37] hmm, if we do it d-caro's way, looks like we'll have to include the new ferm class in a bunch of places. Bah
[16:47:00] :S
[16:47:48] around now
[16:48:27] cool, will have a new patch set up shortly
[16:50:24] inflatador: ack, fyi you might want to reset `modules/profile/manifests/dumps/distribution/nfs.pp` to HEAD~1 cause there's a whitespace change still showing up
[16:51:16] ah your newest patch already does, nvm
[16:52:50] oh yeah, that must've fixed itself because I surely forgot ;)
[16:54:12] "profile::query_service::nfs::ferm not in autoload module layout" - any idea what that means?
[16:55:30] hmm, maybe because we don't define the ferm service in modules/profile/manifests/dumps/distribution/nfs.pp
[16:56:45] it's the same port and protocol, I wonder if there's a better way to handle this
[16:57:31] inflatador: how come we thought we needed to include the new ferm class in a bunch of places?
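One way to answer the "do the desired machines actually get the rules" question, beyond PCC output, is to look at what the ferm module wrote on the NFS host and at the live firewall ruleset. This is only a sketch; the /etc/ferm/conf.d fragment path and the grep pattern are assumptions about how the puppet ferm module lays out its output.

```
# Hypothetical check on a clouddumps/labstore host (paths and patterns are guesses):
# 1. Did puppet write a ferm rule fragment mentioning nfs?
sudo grep -ril 'nfs' /etc/ferm/conf.d/ 2>/dev/null

# 2. Is the NFS port (2049/tcp) actually open in the loaded ruleset?
sudo iptables -S | grep -w 2049
```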
[16:58:30] I'm looking at the only other class in ferm.pp (where d-caro told us to change it) and it shows up in a bunch of places...also the initial patch set I submitted failed with an error that suggests it, let me check it again
[16:59:00] can also check what pcc reports, if the desired machines get the rules
[16:59:41] actually, it looks like I got the same autoload error in both jenkins runs
[17:00:40] so maybe we don't need to add it
[17:01:31] I feel like we should just be able to add our servers to the 'srange' value in profile::wmcs::nfs::ferm, but I don't know if that would make life harder for WMCS, or if it's even possible to do
[17:03:08] inflatador: having two separate classes in `modules/profile/manifests/wmcs/nfs/ferm.pp` looks wrong to me
[17:03:15] I think we just want one class with two `ferm::service`
[17:03:23] lemme check how we had it before moving files, sec
[17:03:34] cool
[17:03:45] it does look like we can mix and match with srange, see https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/analytics_cluster/coordinator/replica.yaml#L34
[17:04:30] I'll have a patch up to try in a sec
[17:07:53] I gotta go, back in ~45
[17:08:03] ack
[17:08:06] trying PCC on new patchset
[17:10:47] looks pretty reasonable: https://puppet-compiler.wmflabs.org/pcc-worker1002/37666/
[17:13:31] yea looks plausible
[17:19:22] ebernhardson: okay, I'll give merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/835596 a go
[17:28:18] * ryankemper forgot to change the commit title to say clouddumps instead of labstore...oh well
[17:34:29] > Error: /Stage[main]/Query_service::Mount_dumps/Mount[/mnt/nfs/dumps-clouddumps1001.wikimedia.org]: Could not evaluate: Execution of '/usr/bin/mount /mnt/nfs/dumps-clouddumps1001.wikimedia.org' returned 32: mount.nfs: access denied by server while mounting clouddumps1001.wikimedia.org:/
[17:34:54] https://www.irccloud.com/pastebin/YxkEpdRx/cumin_puppet_runs.log
[17:38:07] hmm, iirc that whitelisting happens through /etc/exports with nfs
[17:38:39] and it looks like pcc put us in there, puppet must have run for the ferm rules to have loaded. hmm
[17:39:01] maybe puppet doesn't trigger a reload of the nfs server?
[17:42:02] seems plausible, i don't see anything in profile::dumps::distribution::nfs that would trigger a reload when writing /etc/exports (doesn't mean it doesn't, only that it's not obvious :P)
[17:43:33] ebernhardson: ack, so sounds like I should try a manual reload on `clouddumps[1001-1002].wikimedia.org`
[17:44:18] yea, usually nfs only needs a reload and not a restart
[17:56:25] ebernhardson: hiya! o/ when you have some time could you please take a look at the question in https://phabricator.wikimedia.org/T317682#8332931 about a bug/inconsistency with searchsatisfaction and wprov?
[18:02:05] ebernhardson: fwiw the reload seems to have done the trick. they just started internet maintenance on my block though so my internet will be very spotty for...however long
[18:02:09] ebernhardson: fwiw the reload seems to have done the trick. they just started internet maintenance on my block though so my internet will be very spotty for...however long
[18:02:17] bearloga: sure i can check
[18:02:19] oops sorry if that double sent
[18:02:26] back
[18:02:30] ebernhardson: thank you very much!
[18:03:56] fwiw `exportfs -r` will force NFS to reread its exports file
[18:28:21] ebernhardson ryankemper do y'all want to do pairing still? I could go either way. ryan and I will probably need to work on https://phabricator.wikimedia.org/T321310 at some point
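For the "access denied by server" error and the manual reload above, the usual sequence on the NFS side is to check what /etc/exports currently allows, re-export it without restarting the daemon (the `exportfs -r` mentioned at 18:03), and then confirm the client can see the export. A minimal sketch, assuming a stock Linux NFS server on the clouddumps hosts; the grep pattern is only illustrative:

```
# On the NFS server (e.g. clouddumps1001):
sudo exportfs -v                 # show the exports the kernel is currently serving
grep -i wdqs /etc/exports        # did puppet add the new clients? (pattern is a guess)
sudo exportfs -ra                # re-read /etc/exports and apply it; no daemon restart needed

# From a client, verify the export list is now visible:
showmount -e clouddumps1001.wikimedia.org
```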
[18:34:09] inflatador: hmm, up to you
[18:38:51] ebernhardson OK, let's skip it
[18:39:11] I’m on my phone’s hotspot but that inexplicably has 150 secs of ping so I can’t rly do google meet rn anyway
[18:39:23] I’m just gonna grab an early lunch, they should be done by the time I’m back
[18:39:25] 150s! that's quite amazing :)
[18:39:56] ryankemper sure sounds good, hit me up when you get back if you wanna work on the reboots, I'm gonna start on relforge now
[19:43:27] looks like the elastic service won't start after a reboot, it's still waiting for "/root/allow_es7"
[19:44:46] I guess we should take that out of modules/elasticsearch/templates/initscripts/elasticsearch_7@.systemd.erb
[19:45:39] hmm, yea we can drop that by now
[19:47:18] 1 sec, I'll get a patch up
[19:47:45] surprised that flag doesn't stick around though, i would have expected it to stay until an instance is reinstalled
[19:48:22] yeah, I didn't think about that, but you're right
[19:50:41] at first glance, it seems to be there on all the prod hosts
[20:17:46] got some shards that won't reroute on relforge small...I can get them to work if I do a call per shard but `curl -XPOST 'localhost:9400/_cluster/reroute?retry_failed=true'` doesn't seem to work
[20:18:02] ^^ is that new for ES7 or did I just screw up the reroute call?
[20:19:09] here's the call that does work: https://phabricator.wikimedia.org/P35738
[20:19:32] API ref: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/cluster-reroute.html
[20:31:55] inflatador: retry_failed is just if they won't assign due to having given up
[20:32:08] If you want to directly move shards, that's the shard-level reroute you mentioned
[20:32:29] inflatador: back in action in 10 mins btw
[20:32:35] ryankemper ACK
[20:43:00] inflatador: okay, can hop on a meet if you want, or irc
[20:44:03] ryankemper cool, up at meet.google.com/ygp-wenz-jcw
[22:46:58] hmm, interesting. My input dataset has 495M pages in it, should be all pages in all wikis. There are 595M distinct outgoing_links though :)
[23:10:57] and a potential significant annoyance (although solvable). outgoing_links are titles, we don't have namespace numbers or page ids. To turn them into updates we're going to have to resolve that :S
[23:11:54] (also i suspect the discrepancy between existing pages and distinct outgoing_links is redlinks...but i'm a bit surprised if there are 100M unique redlinks across wikis)
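On the earlier relforge reroute question (20:17–20:32): `?retry_failed=true` by itself only retries shards that have hit the allocation retry limit; to place a specific shard you pass an explicit command in the request body, which is presumably what the per-shard call in P35738 does. A sketch of the shard-level form against the 7.10 API linked above; the index, shard number, and node name here are placeholders, not the actual relforge values:

```
# Placeholder index/shard/node values; pick real ones from GET _cat/shards
# on the relforge cluster (port 9400 as in the chat).
curl -XPOST 'localhost:9400/_cluster/reroute?retry_failed=true' \
  -H 'Content-Type: application/json' -d '{
    "commands": [
      {
        "allocate_replica": {
          "index": "enwiki_content",
          "shard": 0,
          "node": "relforge1003"
        }
      }
    ]
  }'
```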