[06:47:49] looking into query-main 502s, blazegraph seems to restart a couple times (looking at wdqs1021)
[06:55:58] but that could be a data-transfer, I'm not sure...
[07:07:38] the 502s I see from nginx on the wdqs-main hosts are not at the time of the ats switch but more around a data-transfer
[07:10:25] and not seeing any surge in traffic/load at 23:00 yesterday when the ats patch got deployed
[07:29:17] my bad, it's not 23:00 but 21 utc
[07:29:51] something does not make sense... grepping: sudo journalctl -u wdqs-blazegraph | grep 'Stopping Query Service' -A 7
[07:30:16] I see: May 06 19:27:39 wdqs1021 systemd[1]: Stopping Query Service - Blazegraph - wdqs-blazegraph...
[07:30:28] and May 06 20:22:04 wdqs1021 systemd[1]: Started Query Service - Blazegraph - wdqs-blazegraph.
[07:30:47] which suggests blazegraph was down between 19:27:39 & 20:22:04?
[07:35:58] that's a transfer... actually
[07:40:48] well... same conclusion, I'm not seeing any evidence that traffic reached the wdqs-main nodes
[08:20:14] NamespacesToBeSearchedDefault is an int->int map and of course gets serialized in json as [ 1 ] for [ 0 => 1 ] and { "100": 1 } for [100 => 1] (see the sketch below)
[10:25:24] lunch
[13:13:14] o/
[14:05:38] \o
[14:09:23] o/
[14:19:30] .o/
[14:45:11] looks like we had another search pool rejection alert a few mins ago, but it cleared on its own https://grafana.wikimedia.org/goto/bZF1F8xHR?orgId=1
[14:45:55] sorry, meant to link https://grafana.wikimedia.org/goto/mtEaF8bNR?orgId=1 but I guess they're both relevant
[14:49:23] i suppose my initial question would be to differentiate between an increase in requests vs a slowdown in running queries
[14:52:41] oh, i was thinking poolcounter; this is the thread pool, which historically has been a single overloaded machine.
[14:52:41] damn, I thought I disabled completion indices in eqiad but I guess not... patch forthcoming
[14:55:09] ah, I never merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1141903 . fixing that now
[14:59:57] so randomly curious, 1081 had most of the rejections. Its cpu went from ~27% before rejecting queries to 3% while rejecting. Network utilization down, disk throughput unchanged, no obvious reason why it stopped doing things and rejected queries :S
[15:00:23] interesting. I just banned it but definitely worth a look
[15:00:37] would generally imply that queries were filling the search threadpool and waiting on IO, but without changes to IO that seems unlikely
[15:01:23] Indeed... this happened over the weekend and I happened to be at my desk, so I started banning nodes. The problem just moved to a different node, so it seems unrelated to individual hosts
[15:05:53] yup, same thing's happening now. Just banned 1081 and now 1060 is barking
[15:19:20] https://spinach.genie.stanford.edu/
[15:55:06] Workout, back in ~40
[15:55:54] i don't fully remember, will the scholarly entries in https://stream.wikimedia.org/v2/ui/#/?streams=rdf-streaming-updater.mutation-main.v2 fall out in due time, or is the main-side stream supposed to include those? i was talking with an ic who noticed and i took a note to myself to check on it, thinking i could give them a heads up if it's on the radar
[15:57:06] Wednesday meeting if anyone is available
[16:32:51] computer crashed :/
[16:45:01] back
[16:53:50] oh damn, I missed the weds mtg. I pinged r-zl to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142693
[16:54:16] and I guess we need to merge and schedule https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1129182 for a deploy window tomorrow?
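A minimal PHP sketch of the json_encode() behavior behind the [08:20:14] note about NamespacesToBeSearchedDefault; the JSON_FORCE_OBJECT workaround shown is just one option, not necessarily what the config code actually does.

```php
<?php
// Arrays with sequential int keys starting at 0 serialize as JSON lists;
// any other int->int map serializes as a JSON object with string keys.
echo json_encode([0 => 1]);   // [1]
echo json_encode([100 => 1]); // {"100":1}

// One way to get a consistent shape either way (assumption, not what
// CirrusSearch necessarily does): force object serialization.
echo json_encode([0 => 1], JSON_FORCE_OBJECT); // {"0":1}
```

So the same setting can round-trip as either a JSON list or a JSON object depending on its keys, which is why the [ 1 ] vs { "100": 1 } shapes quoted above differ.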
[17:06:07] heading out
[17:20:32] .o/
[18:10:19] lunch/errand, back in ~1h
[18:56:33] Trey314159: i've been pondering the ab test, i suspect what we need is to always run the glent query, even in control, but in control only report if there was a suggestion. Then we can subset both control and glent to the queries that would have gotten a suggestion
[18:59:07] or maybe post-hoc running the queries against glent would work (rough sketch at the end of the log)
[21:17:42] ebernhardson: that sounds very reasonable, though I want to think on it some more. Having that label (glent-eligible or not) would allow for other interesting comparisons between control and test.
[21:53:56] i just realized after far too long... Special:Search doesn't ask Cirrus for query suggestions if the page exists on the wiki
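A rough sketch of the subsetting idea from the [18:56:33] and [18:59:07] messages, assuming hypothetical field names (bucket, glentSuggestion) rather than the real event schema.

```php
<?php
// Hypothetical per-query rows; field names are illustrative only.
$events = [
    ['bucket' => 'control', 'query' => 'exmaple', 'glentSuggestion' => 'example'],
    ['bucket' => 'test',    'query' => 'example', 'glentSuggestion' => null],
];

// Keep only queries for which glent would have produced a suggestion, in both
// buckets, so control and test are compared over the same "glent-eligible" subset.
$eligible = array_filter($events, fn ($e) => $e['glentSuggestion'] !== null);

$byBucket = [];
foreach ($eligible as $e) {
    $byBucket[$e['bucket']][] = $e;
}
// $byBucket['control'] vs $byBucket['test'] can then be compared on the usual metrics.
```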