[09:36:15] errand (picking up new glasses) and lunch
[10:15:00] lunch
[12:36:43] dcausse: would you have a minute to jump in a meet?
[12:36:50] gehel: sure
[12:37:03] meet.google.com/jxu-oosg-twa
[13:00:29] greetings
[13:42:47] o/
[14:08:36] godog thanks for responding to that email. I made a phab ticket for thanos-swift access (https://phabricator.wikimedia.org/T309715), LMK if you need more info
[14:30:43] https://wikitech.wikimedia.org/wiki/Incidents/2022-06-01_Lost_index_in_cloudelastic incident report for cloudelastic data loss. If you have any suggestions let me know, or feel free to edit the page directly
[16:15:46] quick workout, back in ~30
[16:26:05] so dumbest idea i've had in a while, but we don't seem to be having any luck reproducing the 500s on wcqs. What we do have is 2.2T of disk space available on each host, can we tcpdump lo for a few days and hope to align a reproduction with a request? (assuming `/oauth/check_auth java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms` is even the source of problems)
[16:27:18] ifconfig claims wcqs1001 has only sent 80GB over lo since it was booted 70 days ago, should be fine-ish
[16:34:30] i dunno, seems plausible. I started it up on all the wcqs instances writing to /srv/T306899/, it will write up to 100 1GB files then start rewriting the first files
[16:41:12] from frequency of logs, with any luck can try and dig a reproduction request out of there in a couple days
[16:43:14] ebernhardson: no objections from me as long as we remember to stop it :)
[16:44:22] dinner
[16:52:34] back
[16:53:10] ebernhardson yeah, sounds good to me, let us know how it goes. If we keep coming up empty, we could ask Observability if you think it would help
[17:15:38] i'm in favor of the tcpdump as well
[17:15:58] hopefully we can find the needle in the haystack :P
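For reference, a rotating loopback capture like the one described above could be started roughly as follows. This is only a sketch under assumptions: the interface, the ~100 x 1 GB rotation, and the /srv/T306899/ directory come from the discussion, but the exact command run on the wcqs hosts is not shown in the log, and the file name and Python wrapper here are purely illustrative.

```python
# Sketch of a rotating packet capture on the loopback interface.
# tcpdump's -C is in millions of bytes and -W caps the file count, so
# "-C 1000 -W 100" keeps roughly 100 x 1 GB files, then overwrites the oldest.
import subprocess

capture_base = "/srv/T306899/lo.pcap"  # hypothetical file name under the directory from the log

cmd = [
    "tcpdump",
    "-i", "lo",          # loopback only, as discussed
    "-C", "1000",        # rotate after roughly 1 GB per file
    "-W", "100",         # keep at most 100 files, then reuse the first
    "-w", capture_base,  # tcpdump appends a sequence number to each file
]

# Requires root (or CAP_NET_RAW); in practice this would run under
# systemd-run/tmux so it outlives the SSH session, and it needs to be
# stopped once a reproduction has been captured.
subprocess.run(cmd, check=True)
```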
[17:21:56] inflatador: i suppose i didn't actually mention, but we shouldn't delete the commonswiki_file index until the new one is ready. Queries still "work" against the existing data even though it's missing a shard
[17:22:39] ebernhardson ACK, thanks for checking
[18:07:30] lunch/errands, back in an hr or so
[18:41:57] lunch
[19:19:08] back
[21:21:03] thanks e-bernhardson for addressing that blazegraph alert, I need to be more vigilant about those
[22:28:48] shrug, i probably glance at my emails too often
[22:58:36] What's the over/under on the eqiad cluster being upgraded to 7.10 before Tuesday June 21st? I see the rough plan at T308676 is 3 weeks long with eqiad being last.
[22:58:37] T308676: Elasticsearch 7.10.2 rollout plan - https://phabricator.wikimedia.org/T308676
[22:59:05] bd808: well, we have a team offsite where most of us get on planes june 10th
[22:59:14] bd808: so, i'd give it a 0 :(
[23:00:11] ebernhardson: this is not news that makes me sad. I was asking because I leave on vacation Saturday and return on the 21st
[23:00:21] bd808: lol, so works well for you too :)
[23:01:37] I have a patch that I think will make things work with both 6.x and 7.x, but I was not too excited about pushing that live tomorrow and then being out of contact for 2 weeks.
[23:03:40] ebernhardson: so I'm guessing that maybe cloudelastic gets done next week, then y'all have an offsite that eats a week, then codfw the week of the 20th, and if all is right in the world eqiad the week of the 27th?
[23:17:16] bd808: hmm, i'm guessing a bit slower, but hopefully not too far off from there. we are doing the bullseye deployment right now which is up to cloudelastic, i think the goal is to finish that through the clusters before pushing elastic 7
[23:18:35] ack. Thanks for the insight. I will stop panicking about Toolhub being a blocker and emergency deployments from a tropical beach.
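For context on the 6.x/7.x compatibility concern above, one common pattern is to detect the server version at runtime and branch on the major version. The sketch below is not the Toolhub patch referenced in the conversation; the endpoint is a placeholder and the branch bodies are left empty on purpose. It assumes the official Python client, whose info() call returns the cluster version string.

```python
# Sketch of runtime Elasticsearch version detection for code that must work
# against both 6.x and 7.x clusters. NOT the Toolhub patch mentioned above;
# the endpoint and branch contents are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

major_version = int(es.info()["version"]["number"].split(".")[0])

if major_version >= 7:
    # 7.x path: mapping types are deprecated, so prefer typeless requests here.
    pass
else:
    # 6.x path: kept until the eqiad cluster finishes the 7.10 rollout.
    pass
```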