[10:06:50] lunch [13:19:40] greetings [13:36:41] inflatador: I think yesterday you mentioned the bullseye upgrade is still underway. We had originally estimated June 3 for Bullseye and Java 11 to go through. What are your thoughts on a new estimated completion date? [13:38:23] mpham: I'd say 2-3wks from today, although we'll want to confirm w ryankemper and ebernhardson when they get in [13:40:12] ok thanks. let me know when you have a better idea! [13:47:01] np, I think that is a reasonable assumption based on the mtg conversation yesterday [13:47:09] dropping off my kids, back in ~15 [14:07:50] back [14:09:28] inflatador, ryankemper, ebernhardson: I'm moving our SRE pairing session a bit later (conflicting meeting) [14:25:35] o/ [14:40:04] team (in particular dcausse): any objection to me publishing https://docs.google.com/document/d/15VMaIqE8GiMjnkZpiQjhH-zQqV2wl1eGuB7qOzmvxuM/edit ? [14:58:04] sure [14:59:46] \o [14:59:56] gehel: perfect [14:59:58] o/ [15:00:02] thanks! [15:01:04] getting close on the test runner, can now run two short-ish bash scripts and it will create a bunch of containers with mwcli and run the test suite. Just failing some file-related things (pdf/svg metadata) and intermittently failing redirect relevancy tests (runJobs.php doesn't respect waiting for 2*$refreshInterval before running a link count) [15:01:40] not sure what to do with the jobs....thinking about dropping the tests, which seems unfortunate [15:02:27] but anyways, i guess the point is that this makes it easy-ish to spin up multiple elasticsearch clusters in the same env and flip between them. [15:39:56] I just published https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/TechnicalInteractionsWithQueryServices. It probably makes sense to link it from a few places. Not sure which ones [15:59:20] workout, back in ~40 [16:01:27] dinner, back later [16:45:10] back [17:15:37] Working on the S3/swift/ES stuff now.
I've installed the same ES version and plugin outside WMF and confirmed it works with an S3-compatible API...also just confirmed that our creds work with thanos-swift when accessing via the S3 API [17:16:06] next step is to try some API calls from relforge to the thanos-swift endpoint and make sure we have connectivity [17:24:01] Relforge does indeed have connectivity [17:30:12] sweet! [17:42:03] yeah, my best guess at the moment is that the repo needs some more settings around specific indices, chunking, etc...will take another crack after lunch. Back in an hr or so! [17:51:22] connect timed out sounds like the wrong host/ports being accessed. I wonder if somehow it's trying to contact some other non-s3 aws service. [17:51:43] also, apparently the aws-sdk-java repo is multiple gigabytes :S [17:53:43] (also, that's just the most recent exception in relforge, not necessarily whatever is wrong :) [18:17:27] dinner [18:35:03] back [19:25:38] ebernhardson: are you able to join the SRE pairing at https://meet.google.com/eki-rafx-cxi ? We're playing around w the elastic snapshot settings [19:28:12] inflatador: sure, 5 min [20:47:43] poking more, appropriate logs come out when creating a new repository and confirm it's reading the right endpoint [20:48:25] the log "using instance profile credentials" is likely our culprit, that means "find the credentials somewhere", and the other option is "using basic key/secret credentials" which is probably what we want.
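The relforge-to-thanos-swift connectivity check mentioned above ([17:16:06]/[17:24:01]) can be sketched with just the Python standard library. The endpoint hostname comes from the chat; the helper names and the probe itself are illustrative, not the actual commands that were run:

```python
# Minimal TCP reachability probe for an S3-compatible endpoint.
# Helper names are illustrative; the endpoint is the one from the chat.
import socket
from urllib.parse import urlparse

def endpoint_hostport(url):
    """Return (host, port) for an http:// or https:// endpoint URL."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port

def can_connect(url, timeout=5.0):
    """True if a TCP connection to the endpoint succeeds within `timeout`."""
    host, port = endpoint_hostport(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from relforge against `https://thanos-swift.discovery.wmnet:443`, this kind of probe returning True is what the "Relforge does indeed have connectivity" message above confirms.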
Still dunno why :P [20:52:41] i suspect the keystore needs to have 's3.client.access_key' and not 's3.client.default.access_key', but not sure what the values we need are [21:05:28] interesting [21:05:45] I can try that [21:06:01] as general proof, if we set `-Des.allow_insecure_credentials=true` in jvm.options we can provide the key in the repository create request (and it fails with a new error) [21:06:13] currently relforge is running with allow_insecure_credentials enabled (and puppet disabled) [21:07:06] but it leans more into credentials being the problem, and trying some different keystore settings might be worth it [21:07:08] I think it might have something to do with our multiple instances too, in the way that elasticsearch-keystore hard-codes its file location; maybe there are some other assumptions happening too [21:07:33] What's the new error? [21:08:04] Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: The specified bucket is not valid. (Service: Amazon S3; Status Code: 400; Error Code: InvalidBucketName; Request ID: txa8d25a51bcef45fba2d7-00629fbd96; S3 Extended Request ID: null) [21:08:14] into more sane error land :P [21:09:38] seems reasonable that there's no bucket named after the ticket number, not sure what it should be though [21:10:21] actually, there is a bucket called T309648 , but maybe the S3 API can't see it [21:10:22] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [21:10:33] I created it with python-swift client [21:10:50] but it's possible that the uppercase T breaks S3 naming rules [21:11:09] there's also one called "rando" with an object called "random_data" if you wanna poke around with that [21:17:04] yup, boto does NOT like the T309648 bucket.
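On the keystore naming question above: the repository-s3 documentation names its secure settings per client as `s3.client.<client>.access_key` / `.secret_key`, with "default" being the client used when a repository doesn't name another one, so `s3.client.default.access_key` should be the correct form. A tiny sketch (the helper name is mine, not any real API) generating the expected entry names:

```python
# Secure-setting names the repository-s3 plugin reads from the Elasticsearch
# keystore; "default" is the client used unless the repository settings pick
# another. Helper name is illustrative only.
def s3_keystore_entries(client="default"):
    """Names to add with `bin/elasticsearch-keystore add <name>`."""
    return [
        f"s3.client.{client}.access_key",
        f"s3.client.{client}.secret_key",
    ]
```

With the multi-instance setup discussed above, each instance's own keystore file would need these entries added.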
But I was using the rando bucket for at least some of those API calls and still getting a failure [21:19:37] https://phabricator.wikimedia.org/P29488 has the example code I was using to access thanos-swift from relforge [21:22:12] inflatador: can you put the key on one of the relforge hosts where i can copy it? [21:23:53] ebernhardson: y, just copied the uncensored script to my homedir on relforge1003 , you should be able to read as root [21:25:43] inflatador: perfect, thanks! [21:27:52] np. I just deleted T309648 on the off chance it's reading the bucket list (LOL) and bombing out immediately when it sees a name it doesn't like [21:37:39] don't think this is our answer, but it has some interesting context https://github.com/elastic/elasticsearch/issues/55407 [21:41:50] yea i was looking at the same, boto3 is using path-style access.
But they removed that in favor of magic in elastic 6.0 [21:42:31] haven't located the magic yet though :) [21:51:56] huh, removed near https://github.com/elastic/elasticsearch/issues/26604#issuecomment-329605847 in time for 6.0, and re-added around 7.4 in https://github.com/elastic/elasticsearch/pull/41966 [21:52:57] i have some memory that the flink containers are using hostname access, but they have different dns magic running in k8s [21:59:38] "The default behaviour is to detect which access style to use based on the configured endpoint (an IP will result in path-style access) and the bucket being accessed (some buckets are not valid DNS names)" [22:00:05] so, we just need a bucket that isn't valid as a hostname :) [22:00:12] I think you're well past this point, but I can confirm that the initial "[es-snap-01] path is not accessible on master node" error happens on my home box if I don't add the s3 values to the keystore [22:01:14] can we get a bucket with an underscore? RFC822, does that actually get treated as invalid? [22:01:30] It should be invalid, let me make one with swift real quick [22:02:11] OK, new bucket called "__dunder__" , give that one a shot [22:04:20] inflatador: back to the same error from the T309648 bucket, "amazon_s3_exception: The specified bucket is not valid.". Any chance you did something different with `rando` than with the `T309648` and `__dunder__` buckets? [22:06:02] I think it has to do with the S3 client/official Amazon version of the S3 API [22:06:16] I created all 3 buckets the same way [22:06:24] with swiftclient [22:06:54] hmm [22:07:43] I'm not sure if it's RFC822, or whatever rules govern DNS A/AAAA records (no underscore, must start with letter)?
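The access-style detection quoted above comes down to whether the bucket name works as a DNS label. A rough stand-in for that check (an approximation of the virtual-hosted-style naming rules, not the actual SDK code) reproduces the behaviour seen with the three buckets in the chat:

```python
import re

# Rough approximation of the "is this bucket usable as a DNS hostname label"
# test that drives virtual-hosted vs path-style access. The real check in the
# AWS SDK differs in details (e.g. length limits, dots); this just mirrors
# the cases seen in the chat: lowercase letters, digits, and hyphens only,
# starting and ending with a letter or digit.
_DNS_BUCKET = re.compile(r"^[a-z0-9](?:[a-z0-9-]{1,61}[a-z0-9])?$")

def dns_compatible_bucket(name):
    """True if `name` could appear as a label in a virtual-hosted-style URL."""
    return bool(_DNS_BUCKET.match(name))
```

`rando` passes; `T309648` (uppercase) and `__dunder__` (underscores) both fail, which lines up with boto3 refusing the latter two.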
Kind of foggy but I think that's what's at play [22:08:12] i turned on logging in the s3 client as well, it now prints the tls connection debugging, the request it sends, etc. It's issuing what looks like a reasonable request, hmm [22:08:19] (in relforge-eqiad.log) [22:08:44] it sends `PUT https://thanos-swift.discovery.wmnet:443 /__dunder__/tests-VsSTWonnSdS2IK42Whs4Ew/master.dat `, so __dunder__ worked for that portion [22:09:20] interesting [22:11:39] I still swear it has something to do with our multi-instance setup, but not sure how to work around/test that [22:14:14] for the multi-instance and credentials that could be a problem, although right now i'm bypassing that with `-Des.allow_insecure_settings=true` in jvm.options [22:15:31] boto3 is also having problems seeing the T309648 and __dunder__ buckets, it only sees rando [22:16:13] yeah, I follow you. Are you getting errors about invalid buckets when you use "rando"? [22:17:27] I deleted T309648 FWIW. Let me see if I can set up swiftclient on relforge as well, might help [22:18:59] OK, it's up on relforge1003 ... become me, and source the ~/.swift file and you should be able to run swift cmds.
swift list, swift post, swift delete, etc. [22:19:35] kk, thanks [22:20:29] feel free to delete __dunder__ if you think it's causing problems [22:20:50] or make your own bucket ("container" in swift-speak) or whatever helps [22:28:32] turns out, we were already triggering __dunder__ behaviour; it has another condition in the same place that fails buckets with uppercase characters (and defaults to path-style access) [22:30:29] but on the other hand, swift doesn't want to return any buckets in list_buckets that break these rules :S [22:30:41] i guess we can try the edge cases and see which one they missed :P [22:36:01] but other than finding some weird bug to make it work...with elastic removing support for path-style access in 6.0 and bringing it back in 7.4, our more obvious options are to find a way for *.thanos-swift.discovery.wmnet to resolve to thanos-swift.discovery.wmnet, and make tls work (hah!), or wait for 7.10 which has the restored option. Or build a custom repository-s3 plugin with this [22:36:03] turned on [22:36:51] actually a custom plugin would be trivial compared to most of this....can't hurt to try [22:37:54] oh yeah, or maybe the old swift plugin would be enough for just the restore? [22:38:53] I also don't understand why it works on 6.8 on my vultr instance [22:38:55] in this case, it's a one-line change in the right function to turn on path-style access [22:39:33] Ah OK, maybe that is the easiest then [22:39:36] does your vultr instance support the hostname lookup? [22:40:11] because we have the hostname support in k8s, the flink updater talks to `http://rdf-streaming-updater-eqiad.thanos-swift` which is embedding the bucket name in the url [22:40:13] like via the metadata service?
No, I also tried it on my home lab (vanilla CentOS) and it works there too [22:40:36] i mean via dns lookup, does .vultr.host return a valid hostname [22:40:39] err, valid ip [22:41:26] Oh, I don't know [22:41:33] i think our problem here is that elastic is always trying http://.thanos-swift.discovery.wmnet/ and that never resolves [22:41:49] turning on path-style access reverts to http://thanos-swift.discovery.wmnet/ [22:42:36] but now i have to remember how to compile elastic without invoking tests, because those want every jvm from 8 to 15 or some such to test against [22:42:46] ah, indeed vultr does do the DNS magic [22:42:58] host es-snap.ewr1.vultrobjects.com [22:42:58] es-snap.ewr1.vultrobjects.com is an alias for ewr1.vultrobjects.com. [22:44:12] Anyway, I'm super late for making dinner! Will touch base tomorrow, thanks for your help [22:46:19] hmm, my weekly reminder that 16G isn't much memory came early this week :P should know better than running a massive gradle project without stopping all my mediawiki services [22:57:48] y [23:02:46] with the custom plugin, "rando" seems to work. Or at least it doesn't complain. Still doesn't answer the credentials problem; this is bypassing that [23:11:34] yea seems to work. So next step would be to un-cheat the credentials i guess [23:12:24] it still doesn't like Tnnnn or __dunder__, i suspect swift is being a good citizen there and not returning buckets that would be invalid S3 names [23:24:51] hmm, not going to figure out keystore today. maybe tomorrow :)
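To make the two addressing styles discussed above concrete, here is a hypothetical helper (the function name is mine; the endpoint is the one from the chat) showing why virtual-hosted-style requests need `*.thanos-swift.discovery.wmnet` to resolve while path-style requests do not:

```python
# Sketch of the URL an S3 client builds under each addressing style.
# Illustrative only; not taken from the repository-s3 or SDK source.
def s3_request_url(endpoint, bucket, key, path_style):
    if path_style:
        # Path-style: bucket goes in the path, the endpoint resolves as-is.
        return f"https://{endpoint}/{bucket}/{key}"
    # Virtual-hosted-style: the bucket becomes a DNS label, so
    # <bucket>.<endpoint> must resolve (which *.thanos-swift.discovery.wmnet
    # does not, absent a wildcard record like vultr's).
    return f"https://{bucket}.{endpoint}/{key}"
```

With `path_style=False` the bucket is prepended as a hostname label, which is what elastic tries by default in this version range; the one-line plugin change described above amounts to forcing the `path_style=True` form.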