[10:12:34] lunch
[10:19:04] Lunch
[13:32:12] \o
[13:35:37] we've got 11 servers to go in EQIAD, but a lot of unassigned shards. I'm wondering if our row/rack awareness change in T391392 didn't work. Also, I don't understand why /_cat/shards shows an unassigned reason but `_cluster/allocation/explain` does not
[13:40:05] T391392: Use profile::netbox::host instead of regex.yaml for Cirrussearch rack/row awareness - https://phabricator.wikimedia.org/T391392
[14:00:01] o/
[14:00:54] .o/
[14:02:47] inflatador: I see plenty of IndexFormatTooNewException in _cluster/allocation/explain, could this explain it?
[14:03:29] dcausse good catch. My API call was pointing to CODFW ;(
[14:09:21] I'm still a little surprised that we have so many alloc failures, but it's not enough to worry about much
[14:25:29] hopefully they'll go down as more opensearch nodes enter the cluster
[14:26:06] I hope they'll just unblock automatically without having to do something manually
[15:00:48] Yeah, it wasn't a problem in CODFW, but we were using the old rack/row awareness there. That being said, the output of `/allocation/explain` seems to be about general lack of OpenSearch capacity vs. row awareness
[15:03:53] speaking of, quick CR to start row F (final row) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1149687 if anyone has time
[15:09:39] ^^ NM, I got a +1
[15:19:39] random thought...i wonder if the way deepcat sparql times out in cirrus is wrong. By that i mean, the sparql query should be providing partial results, but we don't do a partial application?
[15:20:06] * ebernhardson should probably experiment
[15:22:34] could we do it? if it times out do we get a partial response from blazegraph?
[15:22:59] i think i spoke too soon :P I was imagining that a bfs must start emitting results soon, it's not like crazy filters it's just walking?
[15:23:09] but now testing queries with curl .... i get 3s time to first byte
[15:23:45] not sure why blazegraph would need 3 seconds before it starts emitting any results for the deepcat query, seems odd
[15:26:03] and blazegraph timeout handling is rather crap, it'll spit out a java exception in the middle of the json response :/
[15:26:48] oh, hmm yea i suppose we can't parse a partial json either, it would have to be tsv format or some such
[15:27:45] yes... we could but that might be ugly
[15:29:15] curious, if i remove the `order by asc(?depth)` i go from 3s for deepcat query to 0.6s
[15:30:08] i suppose that is going to affect how over-the-limit queries run...but we could accept the order it walks them in? not sure
[15:30:48] i guess 3s->0.6s makes sense, because in the 3s case it needs the full results to sort, in the 0.6s case it emits the first $limit tuples
[15:41:47] yes not sure the order matters much?
[15:42:19] if we're past the limit we should send a warning anyways and user should be careful
[15:42:40] i suppose it depends, ordering by depth means we get the top-n closest categories, if we have one at depth 4 then we have all of depths 1,2,3
[15:43:23] curiously while the categoryTree query is implemented with a class called BFS, the nodes don't come out in depth order like i would expect from BFS without the sort
[15:45:57] I'm out for the weekend! Have fun!
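A minimal sketch of the timing experiment described above: post a gas:service BFS query in the rough shape of a deepcat category walk, once with and once without `ORDER BY ASC(?depth)`, and measure how long the first row of TSV output takes. The endpoint URL, seed category, `mediawiki:isInCategory` predicate, and GAS parameters here are illustrative guesses, not the exact query CirrusSearch sends.

```python
import time
import requests

# Placeholder endpoint: a Blazegraph categories namespace on a WDQS-style host.
# Substitute the real internal endpoint.
ENDPOINT = "http://localhost:9999/bigdata/namespace/categories/sparql"

# Rough shape of a category walk via Blazegraph's gas:service BFS. Seed
# category, linkType, direction, and iteration limit are illustrative only.
QUERY_TEMPLATE = """
PREFIX gas: <http://www.bigdata.com/rdf/gas#>
PREFIX mediawiki: <https://www.mediawiki.org/ontology#>
SELECT ?out ?depth WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;
                gas:in <https://en.wikipedia.org/wiki/Category:Physics> ;
                gas:linkType mediawiki:isInCategory ;
                gas:traversalDirection "Reverse" ;
                gas:out ?out ;
                gas:out1 ?depth ;
                gas:maxIterations 5 .
  }
}
%s
LIMIT 256
"""

def time_to_first_row(order_clause: str) -> float:
    """POST the query and return seconds until the first TSV line arrives."""
    start = time.monotonic()
    resp = requests.post(
        ENDPOINT,
        data={"query": QUERY_TEMPLATE % order_clause},
        # TSV can be consumed row by row, unlike a JSON body that only parses
        # once it is complete (or breaks mid-stream on a timeout).
        headers={"Accept": "text/tab-separated-values"},
        stream=True,
        timeout=60,
    )
    next(resp.iter_lines())  # first line of output as a proxy for time to first byte
    return time.monotonic() - start

if __name__ == "__main__":
    print("with ORDER BY ASC(?depth): %.2fs" % time_to_first_row("ORDER BY ASC(?depth)"))
    print("without ORDER BY:          %.2fs" % time_to_first_row(""))
```

The ORDER BY forces the full BFS result set to be materialized and sorted before anything is emitted, while the unordered variant can stream the first LIMIT tuples as it walks, which matches the 3s vs 0.6s timings observed above.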
[15:46:05] \o
[15:47:03] o/
[15:47:22] categoryTree might rely on gas:service which is multithreaded IIRC
[15:48:50] yes i found the impl for it, it's a thin wrapper around gas:service calling a BFS class
[15:49:00] i guess the multithreading would explain the ordering though
[16:07:05] heading out, have a nice weekend
[16:07:16] same!
[16:24:53] workout, back in ~40
[18:37:30] Trey314159: i noticed the last glent m1 test ran on de/en/fr, i suppose that's the set we want to use this time as well?
[18:38:48] I don't really remember the logic we had at the time. It's certainly a reasonable first test group.
[18:39:33] yea i suppose it's reasonable enough
[22:44:42] `elastic1110` is our last Elastic host. See ya Tuesday!
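For the unassigned-shard confusion earlier in the day, a small cross-check that keeps `_cat/shards` and `_cluster/allocation/explain` pointed at the same cluster (the mix-up above came from explaining shards on CODFW while looking at EQIAD). The base URL is a placeholder; the two API calls are the standard Elasticsearch/OpenSearch ones discussed above, sketched here rather than the exact commands that were run.

```python
import requests

# Placeholder cluster URL; the point is that both calls below hit the same
# cluster, otherwise the unassigned reasons and explanations won't line up.
BASE = "http://localhost:9200"

def unassigned_shards() -> list[dict]:
    """List unassigned shards with the short reason code _cat/shards reports."""
    rows = requests.get(
        f"{BASE}/_cat/shards",
        params={"format": "json",
                "h": "index,shard,prirep,state,unassigned.reason"},
        timeout=10,
    ).json()
    return [r for r in rows if r["state"] == "UNASSIGNED"]

def explain(index: str, shard: int, primary: bool) -> dict:
    """Ask the allocation-explain API about one specific unassigned shard."""
    return requests.post(
        f"{BASE}/_cluster/allocation/explain",
        json={"index": index, "shard": shard, "primary": primary},
        timeout=10,
    ).json()

if __name__ == "__main__":
    for row in unassigned_shards():
        detail = explain(row["index"], int(row["shard"]), row["prirep"] == "p")
        print(row["index"], row["shard"], row["unassigned.reason"],
              "->", detail.get("allocate_explanation", detail.get("error")))
```

Explaining a specific shard also returns per-node decider output (`node_allocation_decisions`), which is where an awareness-based NO decision would show up if the row/rack attributes, rather than general capacity, were blocking allocation.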