[10:38:34] lunch
[12:46:48] o/
[12:52:57] dcausse I have to take my son to camp, so will be ~10m late to pairing
[12:57:17] inflatador: np
[13:37:59] dcausse actually I forgot, I have a dentist appointment too... so no pairing ;(
[13:38:12] no worries!
[13:57:14] inflatador: hope the dentist doesn't hurt too much :)
[13:57:49] perhaps someone else can assist with de-pooling elastic1104, elastic1089, elastic1090 before today's switch upgrade in E1?
[14:08:59] pfischer: would you be around to discuss some follow-ups on the graph split updater?
[14:22:17] topranks: hmm, it would typically be Ryan but he probably won't start for another hour or two. That's done with cookbooks and the rest of us can't run them.
[14:22:24] instructions: https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Banning_nodes_from_the_cluster
[14:23:14] ban vs. depool because in elasticsearch all nodes in the cluster serve queries via internal routing
[14:26:38] ebernhardson: is running the cookbook for them something I could do myself?
[14:27:59] topranks: should be able to, only note would be that it has to be run three times, once for each known clustergroup (not sure why the cookbook doesn't do that itself)
[14:28:20] ok, I will look through the docs and see how it looks
[14:28:22] thanks!
[14:28:24] because each server has 2 of the 3 cluster groups
[14:28:31] topranks: any questions, feel free to come back
[14:28:32] gotcha
[14:28:49] yep, I will tread cautiously :)
[14:30:04] topranks: oh, I was mistaken. I thought clustergroups was omega/chi/psi, apparently clustergroups is not what I thought. So only one time for `search_eqiad`
[14:30:40] ok cool, yep looking at the cookbook here it looks fairly straightforward to run
[15:01:05] ebernhardson: so I ran the cookbook, but now the network utilization of those 3 hosts has suddenly jumped to full 1G line rate
[15:01:16] is that expected? perhaps it's shifting data off them and will calm down once it's done?
[15:06:01] topranks: yes, I think so, shards are moving to other nodes (https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-1h&to=now&viewPanel=34)
[15:06:38] dcausse: ok thanks, I'll keep an eye on it and wait till it dies down before the switch reload in that case
[15:12:17] topranks sorry I missed that one
[15:12:39] nah, no worries, it's been educational :)
[15:12:51] I'll double-check the spreadsheet again and hopefully not be a blocker next time ;(
[15:12:52] cookbook makes it easy even for the likes of me :P
[15:13:10] this one was out-of-order as it was postponed from last week; usually we've just been doing Tues/Thurs
[15:17:15] hmm, I've seen that moved-page test in cindy intermittently fail a number of times :S But good call on the redirects, looks like it might resolve some of our issues
[15:19:58] inflatador: network usage on those elastic hosts has died back down, as has the graph David shared above
[15:20:04] do you think it's ok to proceed?
[15:20:19] topranks Y, feel free
[15:20:31] cool thanks
[15:22:13] I double-checked the spreadsheet BTW. I think we have all of our bans scheduled. If there are any maintenances after Jul 23rd LMK
[15:23:09] If you do end up having to ban or unban (not that you should have to) I have a tmux window called 'ban' on cumin2002 that has a good history
[15:47:57] inflatador: good to know, thanks!
[15:48:07] we're all done for today, if you can un-ban those hosts?
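[Editor's note: for context on the ban/unban discussion above, here is a minimal Python sketch of what "banning" a node amounts to at the Elasticsearch API level, per the wikitech page linked at 14:22. It assumes the cookbook ultimately sets the standard allocation-exclusion cluster setting; the endpoint URL, node names, and polling interval are illustrative assumptions, not the actual cookbook internals.]

```python
# Hedged sketch: ban nodes via the allocation-exclusion setting, wait for
# shard relocation to finish, then clear the exclusion to un-ban.
import time
import requests

ES = "http://localhost:9200"  # assumed endpoint; production clusters differ
# Assumed node-name format; real names come from the cluster's node list.
BANNED = "elastic1104-production-search-eqiad,elastic1089-production-search-eqiad,elastic1090-production-search-eqiad"

# Excluding nodes by name makes Elasticsearch relocate their shards to the
# rest of the cluster -- the network spike observed right after the ban.
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": BANNED}},
).raise_for_status()

# Wait until relocation has finished before starting the switch work.
while True:
    health = requests.get(f"{ES}/_cluster/health").json()
    if health["relocating_shards"] == 0 and health["status"] == "green":
        break
    print("still relocating:", health["relocating_shards"], "shards")
    time.sleep(30)

# Un-banning afterwards is the same call with the exclusion cleared (null).
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": None}},
).raise_for_status()
```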
[15:48:13] good to know how it's done anyway
[15:48:35] and nope - nothing beyond the 23rd in the current set of upgrades
[15:52:37] topranks ACK, will unban hosts
[15:53:44] sigh, I need a new flink 1.16.1 image, the one we have is based on buster
[15:55:15] dcausse I can help w/ that if you like. Guessing it won't be ready till next week though, if that's OK
[15:58:29] inflatador: thanks for the offer, I'll file a task
[15:59:35] I can look at upgrading to flink 1.17 too tho
[16:00:50] dcausse ACK, feel free to link/tag me when ready.
[16:00:55] Working out, back in ~40
[16:38:07] going to try flink 1.17.1, the build passes without a code change
[16:39:06] cindy test failures are so random you never know if what you do is improving something :(
[16:40:11] thought I was onto something this morning (with how redirects are populated) but now it's failing on something else...
[16:41:11] dinner
[16:41:39] the moved page one has been failing intermittently for a while now :( But if it manages to not fail the setup routines (which is how we get dozens of failed tests, usually) that's already a win
[16:43:12] otherwise... I dunno. The problem with debugging through the browser tests is it's way too verbose, hard to poke through the relevant logging. Maybe some sort of one-off maintenance script or API invoker could loop create->move->delete until it sees a problem, clearing old data/logs/etc. as it goes.
[16:44:05] and I have no clue if it's related, but it feels like it fails more frequently in my morning than afternoon
[16:46:02] back
[17:06:02] dcausse sounds good, hit me up if you still need someone to work on it
[17:44:27] lunch, then taking my k8s practice exam. Should be back in ~3h
[18:04:04] Missing unmeeting to pair with Steve
[19:59:26] * ebernhardson has of course been able to run one of the failing feature files 10 times in a row locally without fail :S
[20:37:57] * ebernhardson realizes that nodejs is printing line numbers that somehow don't include the comment at the top of the file in the counts, so they are all off by 8 lines
[22:16:48] well, I'm not happy with it... but adding an 'And I wait 2 seconds' at just the right place seems to let this run without failing :S
[22:17:10] suspect ApiTrait::loadDocuments isn't working as expected
[22:19:10] maybe elastic is returning the doc for the pre-move index and the post-move index, and it's correct in post-move but the pre-move hasn't flushed yet? not sure
[22:20:12] re: failures in `Moved pages that switch indexes are removed from their old index if they leave a redirect`
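[Editor's note: the one-off create->move->delete reproduction loop suggested at 16:43 might look roughly like the Python sketch below, driving the MediaWiki action API directly instead of the browser tests. The wiki URL, page titles, wait time, and the search-based staleness check are all assumptions for illustration, not the actual cindy setup; authentication is omitted and would be required for moves/deletes on a real wiki.]

```python
# Hedged sketch: loop create -> move (across namespaces, so the page changes
# Cirrus index) -> check that the old title is gone from search -> delete.
import time
import requests

API = "http://cirrustest.wiki.local.wmftest.net:8080/w/api.php"  # assumed URL
session = requests.Session()  # login omitted; assumes a permissive test wiki

def token(kind="csrf"):
    r = session.get(API, params={"action": "query", "meta": "tokens",
                                 "type": kind, "format": "json"}).json()
    return r["query"]["tokens"][f"{kind}token"]

def search_titles(term):
    r = session.get(API, params={"action": "query", "list": "search",
                                 "srsearch": term, "format": "json"}).json()
    return {hit["title"] for hit in r["query"]["search"]}

for i in range(1000):
    src, dst = f"MoveTest {i}", f"Talk:MoveTest {i}"
    session.post(API, data={"action": "edit", "title": src, "format": "json",
                            "text": f"movetestmarker {i}", "token": token()})
    # Leaving a redirect behind, matching the failing scenario's title.
    session.post(API, data={"action": "move", "from": src, "to": dst,
                            "reason": "repro", "format": "json", "token": token()})
    time.sleep(5)  # indexing is async via the job queue; may need tuning
    if src in search_titles(f"movetestmarker {i}"):
        print(f"iteration {i}: stale document for {src} still in the old index")
        break
    # clean up so data/logs don't pile up between iterations
    session.post(API, data={"action": "delete", "title": dst, "reason": "repro",
                            "format": "json", "token": token()})
```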