[00:25:01] I was thinking "why do I get updates on this ticket from 2014"?.. turns out it's just because someone somewhere used a string like "failed jobs for hour 2022-12-08-T11:00Z" and it got added to the ticket ... T11 lol and sigh :) https://phabricator.wikimedia.org/T11#8453851
[00:25:02] T11: enable DarkConsole in phabricator - https://phabricator.wikimedia.org/T11
[00:43:36] mutante: ugh. that is an unfortunate stashbot bug. If you'd like to file a bug about it I can spend some time scratching my head about how to change the regex to avoid that.
[06:12:26] <_joe_> bd808: you can probably just look at the capture length?
[06:12:40] <_joe_> there's not much chance anyone edits a task under 4 digits
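A rough sketch of those two ideas (this is not stashbot's actual code; the pattern, the helper name, and the four-digit threshold are all assumptions for illustration): either refuse task IDs that are glued to timestamp-like text, or only link IDs with enough digits, as _joe_ suggests.

```python
import re

# Hypothetical pattern, not stashbot's real one: require that the "T" is not
# preceded by a word character or a "-", so "2022-12-08-T11:00Z" never matches.
TASK_RE = re.compile(r'(?<![\w-])T(\d+)\b')

# _joe_'s idea: only treat it as a task reference if the ID is long enough.
MIN_DIGITS = 4  # assumption: hardly anyone comments on tasks below T1000 these days

def find_tasks(text: str) -> list[str]:
    """Return the Phabricator task IDs that look like genuine references."""
    return [f"T{digits}" for digits in TASK_RE.findall(text) if len(digits) >= MIN_DIGITS]

print(find_tasks("failed jobs for hour 2022-12-08-T11:00Z"))  # []  (no false hit on T11)
print(find_tasks("see T322695 for details"))                  # ['T322695']
```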
[07:32:47] <_joe_> andrewbogott: even if it was, it's at least a couple of months away, if everything goes right. I don't think it's a good reason not to fix stuff.
[07:33:00] <_joe_> esp when it's easy to fix
[07:33:48] <_joe_> for instance, we could switch wikitech to use mcrouter locally instead of nutcracker, without changing much if anything at all
[09:01:16] !incidents
[09:01:16] 3190 (RESOLVED) [FIRING:1] ProbeDown (10.2.2.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 ops page eqiad prometheus sre)
[09:49:25] I got a spurious KeyError: key not found: "PARALLEL_PID_FILE" on operations-puppet-tests-buster-docker, but couldn't replicate it
[14:57:04] effie, mind if I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/868354 in half an hour or so? It should have done its job by then :)
[15:45:25] andrewbogott: regarding the next-in-line question, the answer is, strictly speaking, no, cause we first have to migrate testwiki and probably testwikidata too. But after those are done and we are satisfied with the results, yes, it is a candidate.
[15:58:38] I **think** this PR should have opened the NFS port so wdqs2009 could mount clouddumps, but I don't see wdqs2009's IP in clouddumps1001's iptables rules, even after a puppet run. Suggestions? https://gerrit.wikimedia.org/r/c/operations/puppet/+/867646/5/hieradata/common/profile/dumps/distribution.yaml
[16:17:03] inflatador: that's a codfw host
[16:17:27] do I read this correctly? that cross-DC NFS will happen?
[16:18:31] akosiaris I think so... but I'm mounting the same share on wcqs2001 without problems
[16:18:42] inflatador and no, it wouldn't open up the tcp port btw
[16:19:04] in modules/profile/manifests/dump/distribution/nfs.pp that variable isn't used in the ferm::service stanza
[16:19:17] but rather the $ANALYTICS_NETWORKS ferm macro
[16:19:27] bd808: aww, thank you. that's more than I expected, originally I just wanted to share it as a curiosity :)
[16:19:42] I expect that it is used in line 39 though, for the /etc/exports file
[16:20:19] as in, it is used by the template
[16:20:51] yup, sure enough modules/profile/templates/dumps/distribution/nfs-exports.erb uses it
[16:21:02] so there you go, I hope I helped
[16:21:06] thanks akosiaris, looks like modules/profile/manifests/wmcs/nfs/ferm.pp is the correct place for fw rules
[16:21:19] yw
[16:28:39] andrewbogott: yes go ahead, though I still think we need to untangle the openstack nutcracker profile from the mediawiki profile
[16:28:53] inflatador: Is this a permanent setup for cross-DC NFS, or just for a one-off reload?
[16:30:29] inflatador: If it's for a one-off, have you thought about using `transfer.py`? https://wikitech.wikimedia.org/wiki/Transfer.py
[16:30:34] btullis it will be permanent as far as the mounts go, but only used for a single reload at a time
[16:31:06] cross-DC? do we have NFS over TLS nowadays?
[16:31:26] inflatador: yeah I was going to ask, is there a doc on why NFS was chosen? I quickly went through the tasks and couldn't find any info. We're overall trying to get rid of NFS in the infra (long term). For example, it's not flexible, particularly sensitive to network congestion, etc.
[16:31:51] inflatador: Thanks. I have to say, I'd try to avoid NFS between DCs as a rule.
[16:31:55] XioNoX it's linked here https://phabricator.wikimedia.org/T222349
[16:32:02] as far as the justification goes
[16:35:06] btullis as far as transfer.py goes, we use the same logic as transfer.py in our data-transfer cookbook. It's possible we could use it for reloads, but I think we'd still run into rate-limiting. It takes ~8 days to do a single reload, so any way we could speed that up would help. (Although to be fair, we only do reloads a few times/yr)
[16:35:39] <_joe_> inflatador: I'm curious, why is NFS faster than transfer.py?
[16:36:08] <_joe_> is that basically because there is no bandwidth limit?
[16:36:28] <_joe_> or some other reason I am not seeing rn
[16:36:53] not sure I see the justification for using NFS in that linked task? is there a specific comment I'm missing?
[16:37:15] It's in the linked ticket, "With the current rate limit at 2MB/s" is the operative phrase
[16:37:25] <_joe_> ack, thanks :)
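For a sense of why that 2 MB/s cap is the operative phrase, a back-of-the-envelope sketch (the ~1 TB data size is purely an assumption for illustration; only the rate limit is quoted from the linked task):

```python
# Back-of-the-envelope only: the 1 TB figure is an assumed dataset size,
# not something stated in T222349; the 2 MB/s rate limit is quoted above.
rate_mb_per_s = 2
assumed_size_tb = 1.0
seconds = assumed_size_tb * 1024 * 1024 / rate_mb_per_s
print(f"~{seconds / 86400:.1f} days just to move the bytes")  # ~6.1 days at 2 MB/s
```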
[16:49:03] I didn't see an answer to the TLS question. It should at least be encrypted; we don't trust our transport links.
[16:49:21] I haven't read the ticket yet either, or even most of the above backscroll :)
[16:49:48] NFS scares me in general, and doubly so over a WAN, but I'm not informed enough on this particular effort to weigh in heavily
[16:50:44] I'm not wedded to NFS, just need a way to get the data loaded
[16:51:52] I also have to temper my statement with: I haven't used NFS in production in a very long time. My past experiences with it in previous decades were generally terrible, but who knows, maybe modern implementations have made some things better.
[16:52:05] We're transferring publicly available information, no secrets. It would be possible for an attacker to MITM and modify the NFS data, but that seems a bit far-fetched to me
[16:52:56] on that point though, our general policy rule is that x-dc traffic should be encrypted. In practice, I don't think absolutely every packet actually is currently encrypted, but most of the major/important traffic is.
[16:53:21] the Snowden leaks, etc., showed us that US agencies can and will tap transport fibers to listen, and could possibly MITM the traffic as well.
[16:53:41] inflatador: I think that the suggestions from v.olans here are useful: https://phabricator.wikimedia.org/T222349#7574794
[16:54:25] For a permanent setup I would probably look at setting up an rsync module for this: https://github.com/wikimedia/puppet/blob/production/modules/rsync/manifests/server/module.pp
[16:55:19] (true MITM might be hard to do without detection, but they can at least both sniff + inject, which is most of the way there)
[16:57:05] btullis I looked at the rsync stuff and it seems more geared towards active/passive services. It doesn't seem to fit the use case here, where we're typically active on all nodes and data reloads don't come from a different server with the same roles, but from an external source that needs to be processed before it can be used
[17:08:44] rsync is just a highly flexible file copying tool, so I don't feel that it's really more suited to one specific use case than another. It's just very useful if you want to copy files from place to place, reliably.
[17:11:00] there are examples across the puppet tree of how to use rsync::server::module; it sounds to me too like a good fit for this use case. Since, from what I gather, the "external" source is actually within our network and we have access to it, rsync should work wonders
[17:11:43] and it won't need tricks like mount -o nofail to avoid system-wide instabilities in case of network issues
[17:11:45] sounds like inflatador meant the puppet implementation of rsync we have and the rsync::quickdatacopy class. that is indeed geared towards active/passive servers, but you can just skip the abstraction and use "rsync::server::module" directly on your source machine, plus some timer/cron that pulls from it elsewhere
[17:12:23] or pushes to it... either way
[17:16:10] and there is the stunnel option to encrypt it
[17:17:36] though I also have not been using that in cases where the data is a public dump anyway
[17:20:06] As is the case for most of WDQS, there's definitely room for improvement. Sounds like a-kosiaris and I will be meeting soon to talk about an rsync implementation. Thanks everyone for your suggestions!
[21:45:39] Cannot figure out what's wrong with this `elif` statement for the life of me... anyone around to rubber duck? :D
[21:45:44] Here's the block in question:
[21:45:50] https://www.irccloud.com/pastebin/KjsOjyG3/
[21:46:10] The line `elif args.reload_data in ['wikidata', 'commons'] and not is_datetime_valid(kafka_timestamp):` blows up but I can't quite figure out why
[21:46:49] do you get an error? :P
[21:46:50] When I run the cookbook containing that code I get `Failed to import module cookbooks.sre.wdqs.data-reload: invalid syntax (data-reload.py, line 275)` (line 275 corresponds to line 5 in my example above)
[21:48:03] That is sort of swallowing the actual error of course
[21:49:14] ryankemper: pretty sure that data-reload is not a valid python identifier (note the "-")
[21:49:41] Here's the actual patch in question: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/868198/5/cookbooks/sre/wdqs/data-reload.py
[21:49:46] ryankemper: full code?
[21:49:52] RhinosF1: ^
[21:50:18] Hold on, that patch has the line commented out, pushing a new patchset
[21:50:52] Ye
[21:50:54] Okay, https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/868198/7/cookbooks/sre/wdqs/data-reload.py
[21:51:56] gehel: you mean the `data-reload` in the error msg? that's just the cookbook name, spicerack is complaining cause it tries to import the cookbook module but fails due to whatever error is causing this problem
[21:53:13] ryankemper: right! Forget that I wrote anything. And I'll stop looking at IRC for today. Good luck!
[21:53:20] It says there's a syntax error
[21:53:26] But I can't spot it
[21:53:38] And it's too late for brains
[21:53:41] :P
[21:53:57] I got paranoid and even tried rewriting the line in case there was some weird unicode demon lurking somewhere
[21:54:03] But alas, no demons under the bed as far as I could tell
[21:56:20] Normally I bang my head on a wall, go and have a hot chocolate and then realise halfway through
[21:57:06] I once tried comparing 2 folders while forgetting to copy the changes into one of them first
[22:00:09] Heh I've made a couple mistakes like that already :P
[22:00:45] Well, I realized I might as well move the validation check to the actual extract_kafka_timestamp function and it's working now, so I guess that's good enough for now :D
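A minimal sketch of that final refactor, with heavy assumptions about the surrounding cookbook (the fetch_latest_offset_timestamp helper and the Namespace stand-in are invented for illustration; only extract_kafka_timestamp, args.reload_data, and the wikidata/commons check come from the discussion and patch): once the helper validates its own result, the branching code no longer needs the separate `elif ... and not is_datetime_valid(...)` line.

```python
from argparse import Namespace
from datetime import datetime, timezone

def fetch_latest_offset_timestamp(reload_data: str) -> int:
    """Placeholder for however the cookbook really looks up the Kafka timestamp."""
    return 1670500800000  # milliseconds since epoch, dummy value for this sketch

def extract_kafka_timestamp(reload_data: str) -> datetime:
    """Fetch the timestamp and validate it in one place, so callers can't skip the check."""
    raw = fetch_latest_offset_timestamp(reload_data)
    try:
        return datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
    except (TypeError, ValueError, OSError) as exc:
        raise ValueError(f"invalid kafka timestamp {raw!r} for {reload_data!r}") from exc

args = Namespace(reload_data="wikidata")  # stand-in for the cookbook's argument parsing

# No separate validity branch needed any more: a bad timestamp fails loudly
# inside the helper itself.
if args.reload_data in ("wikidata", "commons"):
    kafka_timestamp = extract_kafka_timestamp(args.reload_data)
    print(kafka_timestamp)
```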
[23:30:21] Hello, I used the 'sre.hosts.decommission' cookbook on netmon2001 (https://phabricator.wikimedia.org/T322695) but I entered the management password wrong, which caused some tasks to fail.
[23:30:22] What would be the best way to check whether the host's decommission was successful after the 2nd run (with the correct password)?