[00:35:23] meh, cross-cluster errors came back in eqiad. Suspecting that's due to the cluster changing masters? trying to re-apply proper settings
[00:41:09] unclear what exactly happened, about 10 minutes ago we got a master_left log message, but the cluster restart finished about an hour ago, i would have expected the error to happen then?
[00:49:23] well, it's almost 6, not going to worry too much about it. Same as last time (T310924) cleared the cross-cluster settings for chi->omega and then reapplied them
[00:49:24] T310924: Investigate CirrusSearch eqiad failures - https://phabricator.wikimedia.org/T310924
[09:51:57] Lunch
[13:01:33] inflatador: I'll be late for our pairing session tonight. France has to work (last minute change) so I'll be alone to get the kids to bed.
[13:01:59] tanny411: if you are around, we're having Andrea's last meeting in https://meet.google.com/yau-mkip-tqg
[13:04:48] coming...
[13:10:50] ACK
[13:10:52] and greetings!
[13:52:18] dropping off kids, back in a few
[14:04:03] back
[14:40:33] Whoohoo, got the S3 plugin settings working in deployment-prep. Now to work on a snapshot...
[14:49:11] o/
[14:49:17] \o/
[14:51:10] may be ~5m late to retro
[14:56:54] inflatador: ack
[15:01:49] ebernhardson, ejoseph: retro time: https://meet.google.com/eki-rafx-cxi
[15:02:11] Trey314159: ^
[15:52:30] gehel: a question that you might be able to answer on meta today -- https://meta.wikimedia.org/wiki/Tech#LTS_plan:_https://github.com/wikimedia/search-highlighter
[16:10:01] i wonder which approach is appropriate for the airflow tests that aisha found we were missing. The current test collects all tasks that have a templated field that contains a TemplatedSeq and renders it to make sure it's valid. The updated test i made that triggers the error grabs all tasks of a specific name and renders a specific field to make sure it renders
[16:10:15] perhaps something that renders all templated fields everywhere and verifies jinja doesn't fail?
[16:12:25] quick workout, back in ~20
[16:12:50] * ebernhardson will never agree with pytest that it's spelled `parametrize`
[16:20:41] * ebernhardson shouldn't look at wiki, apparently parametrisation and parameterisation are also valid spellings :P
[16:31:42] back
[16:45:29] lunch, back in ~1h
[17:28:43] random thing to ponder, we sent a bunch of data updates for image suggestions from hive->mjolnir->elasticsearch. Mjolnir reported that the update was 99%+ noops (6k noops/s, 130 updates/s), but when looking in the indices the data wasn't there. Resending one of the data files to _bulk generates updates as expected and the pages now contain the appropriate tags
[17:29:37] i'm going to re-send the latest update but not clear on how all the data didn't make it the first time
[17:33:25] eritf
[17:33:31] or shall I say, "weird"
[17:33:37] also, back
[17:55:09] ebernhardson playing around with snapshots in beta cluster. Snapshot size appears to be 4.6G, while the elasticsearch datadir is smaller (2.9 G). Does that sound reasonable? I'm running the S3 service in a container so I guess it could be partly due to diffs/overlay stuff
[18:13:17] inflatador: hmm, comparing to the primary index size reported in _cat/indices or which?
[18:14:35] inflatador: in theory i think the snapshot should report approximately the same size as the primary index size, but I've only used snapshots a little bit so there might be something more in there. An extra 50% sounds significant though
[18:15:28] yeah, I wonder. Is there a way to get a total from _cat/indices?
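Going back to the chi->omega reset mentioned at 00:49: a minimal sketch of what "clear and reapply" looks like through the cluster settings API. The seed host and transport port are placeholders, not the real omega nodes, and the `cluster.remote.*` prefix assumes Elasticsearch 6.5 or later (older 6.x used `search.remote.*`).

    # clear the remote definition for omega on the eqiad chi cluster
    curl -XPUT 'https://search.svc.eqiad.wmnet:9243/_cluster/settings' \
      -H 'Content-Type: application/json' \
      -d '{"persistent": {"cluster.remote.omega.seeds": null}}'
    # re-apply it, pointing at a current master-eligible omega node's transport port (placeholder below)
    curl -XPUT 'https://search.svc.eqiad.wmnet:9243/_cluster/settings' \
      -H 'Content-Type: application/json' \
      -d '{"persistent": {"cluster.remote.omega.seeds": ["OMEGA_SEED_HOST:TRANSPORT_PORT"]}}'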
[18:16:52] inflatador: yea, the last column has it. You can specify an index such as `curl https://search.svc.eqiad.wmnet/_cat/indices/enwiki_content` and you can add `?v` to the end for it to print a header. The last column is pri.store.size
[18:17:48] and i suppose randomly, because i always found it confusing (and might still not be 100% right): the number of indexed docs is docs.count + docs.deleted, for the longest time i thought the live docs was docs.count - docs.deleted
[18:18:00] (those fields are also in _cat/indices)
[18:18:35] I see a size for individual indices, but is there a way to get a total for all indices, or is that even what we want?
[18:19:53] inflatador: snapshot should be a single index at a time, that reports the total for all primary shards of a single index. can add `?bytes=b` to have it report bytes instead of human readable, and then we can do the following to sum:
[18:19:55] curl https://search.svc.eqiad.wmnet:9243/_cat/indices/?bytes=b | awk '{sum += $10} END {print sum}'
[18:20:14] where $10 is the column index (starting at 1)
[18:20:50] OK, that's what I wasn't clear about, whether we needed to add it up
[18:22:26] Looking at the snapshot from minio's perspective, it seems to be a bunch of randomly named files. So I don't think there's a way to get individual index sizes from minio itself
[18:22:39] but we probably can from the snapshot API
[18:22:40] yea a search index is a bunch of randomly named files :)
[18:22:56] each segment has a random name, and there will be multiple files per segment
[18:23:27] and i suppose for completeness, a search index is a set of immutable segments
[18:25:21] if I add up the indices via your awk above, I get ~5.1 GB
[18:25:45] inflatador: which host are you on?
[18:25:56] deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud
[18:26:36] oh duh, this stuff is distributed
[18:26:57] I shouldn't expect a single datadir to encompass all data across the cluster
[18:27:19] so 5.1 GB vs 4.6 for the snapshot seems much more reasonable
[18:29:20] inflatador: i think the info you want about size is in the following, but has to be post-processed: curl http://localhost:9200/_snapshot/snapperson/snapshot_1/_status?pretty
[18:30:37] curl -s http://localhost:9200/_snapshot/snapperson/snapshot_1/_status?pretty | jq '.snapshots | map(.stats.total.size_in_bytes)' says ~5GB
[18:32:11] i guess it says 5.1 GB (base 10), which is 4.7 in GiB?
[18:32:22] i might have my units mixed up :P it's the 1000 bytes per kb vs 1024 bytes per kb
[18:33:53] cool, ebernhardson we are at https://meet.google.com/eki-rafx-cxi if you wanna join SRE pairing session
[18:34:21] i just have a few minutes, gotta be at liam's school by top of the hour to pick him up. Will check when I get back
[19:36:00] back
[19:40:00] inflatador: how goes snapshotting
[19:40:50] ebernhardson getting there, but it appears our puppet changes didn't work with the keystore
[19:41:05] :S
[19:42:28] since we can't look at values from elasticsearch keystore (at least, not until 7) it's tough to say where it fell down. It may be that the "unless" caused it to be a no-op
[19:43:05] **however** the modified date of the one server I've checked does look to be around the time we were working on it
[19:43:11] hmm, yea it doesn't seem easy to pull the data out. Maybe i can hax together a thing that uses the es7 jars to look at the es6 keystore, but would probably take an hour or two
[19:43:33] or actually, maybe just copy the keystore into a local es7 docker container and see if it reads it?
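A rough sketch of that copy-the-keystore-into-a-container idea, using the oss image suggested just after this. The source path is a guess patterned on the relforge one (/etc/elasticsearch/<cluster>/), and reading a value back rather than just listing keys needs `elasticsearch-keystore show`, which only arrived in 7.16.

    # grab a node's keystore (path is a guess at the instance dir; needs root on the host)
    scp elastic1050.eqiad.wmnet:/etc/elasticsearch/production-search-eqiad/elasticsearch.keystore /tmp/
    # list its entries with an es7 image; -u 0 avoids file-permission surprises inside the container
    docker run --rm -u 0 \
      --entrypoint /usr/share/elasticsearch/bin/elasticsearch-keystore \
      -v /tmp/elasticsearch.keystore:/usr/share/elasticsearch/config/elasticsearch.keystore \
      docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 \
      list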
[19:43:41] Ah, now that's an idea
[19:44:29] can probably use docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 docker-registry.wikimedia.org/dev/cirrus-elasticsearch:7.10.2-s0
[19:44:34] either of those
[19:45:54] ebernhardson OK, I'll grab the image from wikimedia I guess
[19:46:35] separately, i'm still annoyed that somehow debian 10.12 has `su -g ...` but debian 9.13 doesn't. Feels like an argument that should have existed for ~30+ years now
[19:46:49] not sure the workaround :P
[19:49:00] heh, apparently: sudo su root sg - elasticsearch -c "/usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore list"
[19:49:21] (found by seeing this random `sg` command referenced at bottom of `man su`)
[19:50:04] maybe doesn't even need the `su root` part, since it's already root
[19:53:35] inflatador: i dunno if it's the same on mac (probably not), but on linux to run the elasticsearch container you have to provide `-e discovery.type=single-node` or it fails the elasticsearch production bootstrap checks
[19:54:01] ebernhardson no worries, I'm running a bullseye VM
[19:54:15] errr actually, it does apply!
[19:57:36] inflatador: second sad news, they didn't add show until 7.16 :P Have to use docker.elastic.co/elasticsearch/elasticsearch:7.17.0
[19:58:40] ACK
[20:00:19] * inflatador follows up coffee spilling with rubbing eyes after eating chili powdered nuts
[20:01:14] lol, i somehow do that all the time after chopping some chilli's...some day will learn
[20:16:12] keystore looks reasonable from 7.17, access_key has search:platform and secret_key has a plausible looking password
[20:16:31] i copied the one from elastic1050, could try another
[20:19:00] copied one from elastic2057, also seems plausible
[20:22:30] ebernhardson nice, you beat me to it
[20:23:16] seems it might have been too long ago, but at 19:03 (UTC?) we get java.security.AccessControlException: access denied ("javax.management.MBeanServerPermission" "findMBeanServer") with a stack trace that says it's within the s3 related stuff
[20:23:29] that's like 1.5 hours ago i think
[20:23:51] that would be around the time I was working w MrG on elastic2028
[20:24:37] hmm, does that seem to line up with when snapshotting failed? I guess could always ask it to try again and see if it throws the same error
[20:25:30] registering the snapshot failed, I'm back on elastic2028 if you wanna sudo to me and join my tmux
[20:26:05] it's still giving the same error, haven't checked the logs yet
[20:26:11] It comes from AwsSdkMetrics which sounds terrible :P I don't want any kind of client code that reports metrics to AWS, even if i'm an aws customer running things in AWS. Or maybe that's about local metrics, who knows
[20:26:42] still getting "[thanos] path is not accessible on master node"
[20:26:44] nope not seeing a new error with same
[20:27:50] hmm, i wonder what path it's trying to access. We can probably sneak in with `strace -e trace=file -fp $(jps -q)` but who knows which host is the right one
[20:27:59] I got the exact same error in deployment-prep this morning, but it went away when I updated the elastic keystore values
[20:28:46] interesting, hmm
[20:29:34] inflatador: could it be 660 vs 640? It should be opening read only but who knows, maybe it tries to open as rw?
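For reference, the per-node fix that worked in deployment-prep amounts to roughly the following: rewrite the s3 client credentials in the keystore and ask the node to reload secure settings. The instance dir and the `s3.client.default.*` client name are assumptions, and the secret is obviously redacted; only the access_key value appears in the log above.

    # re-add the s3 client credentials to the node's keystore (instance dir is a guess)
    export ES_PATH_CONF=/etc/elasticsearch/production-search-codfw
    echo -n 'search:platform' | /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin --force s3.client.default.access_key
    echo -n 'REDACTED' | /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin --force s3.client.default.secret_key
    # secure settings only take effect after a reload (or a restart)
    curl -XPOST 'http://localhost:9200/_nodes/reload_secure_settings'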
[20:29:49] it really shouldn't though...just random guesses :P
[20:30:03] no, it works with 640 in beta
[20:30:21] Guessing the value is not correct everywhere
[20:30:41] the old "Currently, all secure settings are node-specific settings that must have the same value on every node."
[20:30:48] hmm, i suppose can write a quick bash script to collect keystores from everywhere and verify them in the local container
[20:31:40] I was about to brute-force keystore updates via cumin, which way would you prefer?
[20:31:52] inflatador: brute force seems fine, we don't really need to know exactly which one is bad
[20:32:13] cool, I'm hitting codfw and will let you know how it goes shortly
[20:39:34] hmm, really should disable puppet before doing this, one sec
[20:43:34] stopped puppet, pushed keystore values, forced reload of secure settings and got success across all 35 hosts
[20:43:42] awesome
[20:43:57] success for updates that is
[20:44:04] registering the snapshot still fails
[20:44:07] ;(
[20:50:50] hmm, poking
[20:51:29] I think we might try to register the snapshot repo in cloudelastic first, if you haven't already. Less moving parts
[20:55:05] ahh, i didn't realize this was emitting an s3 exception, claiming the bucket doesn't exist ... hmm, is `elasticsearch-snapshot` already defined somewhere?
[20:56:07] ebernhardson I've been trying a few different buckets, I can confirm that 'elasticsearch-snapshot' exists. Swift creds are on elastic2055.codfw.wmnet if you want to verify
[20:56:23] (in my homedir, ~/.swift)
[20:56:45] after sourcing that file, you can run 'swift list' to see all the buckets
[20:58:17] hmm, yea it seems fine there. We can turn up the logging levels of the s3 client and the snapshot repo i suppose
[21:02:29] I'm for it, but what do you think about trying in cloudelastic first?
[21:02:54] set `org.elasticsearch.repositories` and `com.amazonaws` to DEBUG, but still nothing great :S it generates about 20 messages now but most are just saying that the TLS connection was a success
[21:03:41] let me check that the plugins package version is correct everywhere
[21:04:55] yup, it's there
[21:05:12] hmm, any chance the endpoint is supposed to be different? This is usually the swift endpoint but we need the s3 version
[21:05:28] i don't remember how that worked before...wonder if i took notes
[21:06:57] inflatador: in relforge we have `endpoint: thanos-swift.discovery.wmnet`
[21:07:03] I wrote a python script using S3 libraries, it worked there...although I only did something simple with it
[21:07:10] inflatador: and after changing the endpoint in snapshot.json (now ~ebernhardson/snapshot.json) it accepted it
[21:07:30] elastic2028.codfw.wmnet:~ebernhardson/snapshot.json
[21:07:44] * ebernhardson should turn debug logging back off now, will forget otherwise :)
[21:08:43] whoops, accidentally killed your tmux
[21:09:13] but wow, congrats
[21:09:16] you can use `tmux kill-session` to kill your own :)
[21:09:25] no worries though
[21:10:17] i suppose random tangent, tmux is fully scriptable so you can do totally silly things that should never be done through it like providing inputs that don't have to be echo'd and such. But race conditions everywhere
[21:10:19] hmm, my script also just has the domain name. Why did I think the rest was needed? ;(
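Roughly what the repository registration ends up looking like once the endpoint is just the bare thanos-swift domain. This is a sketch, not the exact body: the real one is the snapshot.json mentioned above, the repo name matches the one checked later with _cat/snapshots, and `path_style_access` is an assumption about how the swift S3 gateway wants to be addressed.

    # register the s3-backed snapshot repository against the codfw chi cluster
    curl -XPUT 'https://search.svc.codfw.wmnet:9243/_snapshot/elastic_snaps' \
      -H 'Content-Type: application/json' \
      -d '{
            "type": "s3",
            "settings": {
              "bucket": "elasticsearch-snapshot",
              "endpoint": "thanos-swift.discovery.wmnet",
              "path_style_access": true
            }
          }'
    # quick read-back to confirm it registered
    curl 'https://search.svc.codfw.wmnet:9243/_snapshot/elastic_snaps?pretty'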
[21:10:37] because every swift doc we have everywhere says /auth/1.0
[21:10:45] and sometimes, just for fun, it has to be /auth/v1.0/auth/v1.0
[21:10:54] Yikes
[21:11:12] On that note, I'm running to the library, back in ~25
[21:11:21] kk
[21:18:32] Today I deployed a new version of Toolhub that is compatible with both the current Elasticsearch 6.x version and the upcoming 7.10.2. Toolhub should transparently deal with the service being updated now which gets us off your critical path for the upgrade.
[21:20:15] bd808: thanks! We are running a bit behind (obviously, i'm sure) but we almost have the snapshot/restore functionality we needed put in place and back to the os + elasticsearch upgrade soon
[21:20:46] i suppose taking longer made it easier on your end at least :)
[21:23:19] ebernhardson: yeah, I got to take my vacation and attend an offsite. :) I totally know how the long tail of things happens when doing the kind of semi-large update y'all are working on. No rush from us. We just wanted to make sure we weren't blocking you!
[21:38:53] Back
[21:40:34] will start the snapshot from elastic2028 shortly
[22:06:56] err, hmm. I just deleted http://search.svc.codfw.wmnet:9243/_snapshot/ebernhardson_test but realized i don't know if you had tried to use that. Hopefully i didn't break whatever you were doing
[22:07:03] (cleaning up things i created today)
[22:07:28] inflatador: ^
[22:07:41] of course these thoughts always come 5s after pushing enter :P
[22:14:53] ebernhardson nah, it's OK
[22:15:36] I made snapshot_t309648, which doesn't seem like it worked, no indices listed here
[22:18:02] don't see commonswiki_file_1647920262 in _cat/indices
[22:18:09] inflatador: hmm, curl http://localhost:9200/_cat/snapshots/elastic_snaps?v says it worked
[22:18:09] let me make sure that's the right one
[22:18:21] oh nevermind, it took 1s. so it "worked"
[22:18:40] I had "ignore_unavailable:true" which was a mistake
[22:18:48] inflatador: i would hope you can say `commonswiki_file` and reference it by alias instead of the exact index. But maybe not
[22:19:43] ah, let me try that
[22:19:55] I guess commonswiki_file_1647921177 is the latest one anyway
[22:20:34] every foo_bar_123456 index should have a matching foo_bar alias, the number at the end is the timestamp it was created. We copy the old index into the new one then swap the alias so mediawiki doesn't have to know about changing indices
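The command actually used is on the task (T309648#8042724); the sketch below is just the general shape of re-running the snapshot against the current commonswiki_file index, with ignore_unavailable left off so a wrong index name fails loudly instead of producing an instant empty snapshot. Leaving wait_for_completion off and polling is a design choice here, not what the task comment did.

    # clean up the earlier empty attempt (or pick a new snapshot name)
    curl -XDELETE 'http://localhost:9200/_snapshot/elastic_snaps/snapshot_t309648'
    # create the snapshot without blocking curl for hours
    curl -XPUT 'http://localhost:9200/_snapshot/elastic_snaps/snapshot_t309648?wait_for_completion=false' \
      -H 'Content-Type: application/json' \
      -d '{"indices": "commonswiki_file_1647921177", "ignore_unavailable": false}'
    # then poll for progress
    curl 'http://localhost:9200/_cat/snapshots/elastic_snaps?v'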
[22:22:53] Oh yeah, g-ehel explained that to me earlier
[22:23:33] I started the snapshot with this command: https://phabricator.wikimedia.org/T309648#8042724
[22:24:06] i wonder if curl times out before wait_for_completion=true finishes :) don't think it matters
[22:24:16] yeah, it's gonna take a long time
[22:24:27] I don't see any traffic going out on port 443 ATM
[22:24:55] can see a big traffic jump: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=codfw&var-cluster=elasticsearch&var-instance=All&var-datasource=thanos&from=now-1h&to=now&viewPanel=84
[22:25:18] suggests it's working, +300MB/s vs a few minutes ago
[22:25:52] ah yes, once again I'm not looking from a cluster perspective
[22:26:28] +600MB/s now :) Last time i ran this it was fine, but i suppose for random information that eqiad<->codfw link is 10G
[22:26:42] there are actually 2 10G links, but they are redundant (not parallel)
[22:27:34] probably won't be hurt by cross-DC much then
[22:27:50] I'm going to cook dinner, but will check in later unless you think I should stick around
[22:28:03] nah this should be fine, go do the things!
[22:28:18] taco time! Thanks again for your help today!
[22:28:22] np
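Since the snapshot is being left to run overnight, one way to check on it later without going through grafana; the repo and snapshot names match the ones used above, and the jq shaping is just a convenience.

    # overall state plus per-shard progress for the running snapshot
    curl -s 'http://localhost:9200/_snapshot/elastic_snaps/snapshot_t309648/_status' \
      | jq '.snapshots[0] | {state, shards_done: .shards_stats.done, shards_total: .shards_stats.total}'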