[13:10:58] .o/
[14:05:01] \o
[14:22:13] o/
[14:44:33] ebernhardson: regarding T372914 - the producer already uses a descriptor for stream sources that indicates revision-basedness, but we never read that property (never read it, according to git history). Could we get rid of it in favour of the event-based flag you suggested, or what was the original motivation for the stream-wide flag?
[14:44:33] T372914: Add flag revision_based to page_weighted_tags_changed schema - https://phabricator.wikimedia.org/T372914
[14:52:43] hmm, looking
[14:55:05] the commit introducing the flag was 770c73897d569b6bbac082c9e0783c50163af0ea
[14:55:24] yea, initially it was used here: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/commit/5e178f05e9a84bc09dd3edfe332ba31f7fc6a6e2#23e28ec7be9b3daa07013f3a3af9e967ea7f77d7_230_250
[14:55:51] essentially revision-based streams went to the merger, non-revision-based streams skipped the merge
[14:57:48] the justification for that going away is here: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/commit/8f52f70c5411fc90ebf6bacf378f62bb43ec32f3#23e28ec7be9b3daa07013f3a3af9e967ea7f77d7_319_329
[14:58:59] so there is a buffering question around allowing some events to skip the merge?
[15:56:42] should be deploying graph split fully to prod in 1-2 hours
[16:09:25] ebernhardson: hm, that makes sense, at least for a high-volume source stream, and it made sense in the context of early fetch (back then, fetch still happened inside the producer). I would assume that - now that the producer no longer fetches - it's safe to implement such bypassing, since there is no long-running operation that would create backpressure
[16:10:35] pfischer: yea, that makes sense
[16:21:53] looking into a page that should have been fixed by the saneitizer but wasn't. It's correctly detected by the cirrus-check-sanity API; realizing we don't have great visibility into what happens after that
[16:22:31] might also be convenient if we had an extra stream where we could manually inject sanity check events, but not 100% on that yet
[16:26:35] the prometheus graph suggests more fixes are being requested on each 2-week loop, but not sure how meaningful that is since we also added private wikis recently
[16:26:37] https://grafana-rw.wikimedia.org/d/2DIjJ6_nk/cirrussearch-saneitizer-historical-fix-rate?forceLogin&from=now-90d&orgId=1&to=now&var-k8sds=eqiad%20prometheus%2Fk8s&var-site=eqiad
[16:58:45] gehel: will likely miss today's 1:1 btw, depending on how long deploying everything takes
[17:01:22] ryankemper: good! Oscar has some fever, so I'll probably be busy!
[17:01:37] ebernhardson: I should still be there for our 1:1
[17:04:15] gehel: kk
[18:20:12] huh, that's kinda annoying. My reading of the way MediaWiki handles prometheus metrics is that the metric cache is keyed strictly by name, meaning you can't have multiple metrics with different labels and the same name
[18:24:33] seems unlikely though? I guess I need to do more testing
[18:45:20] yup, I was wrong :P each time you increment the counter it records a sample capturing the current labels, so they can then be overwritten without changing anything already recorded
[19:33:10] meh, that also means on every invocation you have to set all labels; you can't just pass a partially labeled thing in to somewhere :S
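The label behaviour described in the 18:45 and 19:33 messages matches how Prometheus client libraries generally treat labels: each increment records a sample with whatever complete label set is supplied at that moment. A minimal sketch of that point, assuming the Python prometheus_client rather than MediaWiki's PHP stats library, with made-up metric and label names:

```python
# Illustrative sketch only: the conversation above is about MediaWiki's PHP
# stats library, but the Python prometheus_client shows the same semantics.
# The metric and label names here are hypothetical.
from prometheus_client import Counter

# A counter is registered once per name; its label *set* is fixed at creation.
fix_requests = Counter(
    "saneitizer_fix_requests_total",
    "Fixes requested by the sanity checker (hypothetical metric)",
    ["wiki", "problem"],
)

# Every increment must supply a value for every declared label; there is no
# way to pass a partially-labeled handle around and fill in the rest later.
fix_requests.labels(wiki="enwiki", problem="wrong_index").inc()
fix_requests.labels(wiki="enwiki", problem="not_indexed").inc()

# Omitting a label raises ValueError rather than reusing a previous value:
# fix_requests.labels(wiki="enwiki").inc()  # ValueError: incorrect label count
```

The PHP library discussed above has its own API; the sketch is only meant to show why every call site ends up having to set the full label set itself.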
[20:13:27] We're so close on the graph split production rollout. We've got all the traffic steps done, LVS rolling restarts, etc., but example queries through the web UI are failing w/ `Request from REDACTED_IP via cp4040.ulsfo.wmnet, ATS/9.2.5 Error: 500, Cannot find server.`
[20:13:48] like this one:
[20:13:49] https://query-main.wikidata.org/#%23Cats%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%0A%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ146.%20%23%20Must%20be%20a%20cat%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cmul%2Cen%22.%20%7D%20%23%20Helps%20get%20the%20label%20in%20your%20language%2C%20if%20not%2C%20then%20default%20for%20all%20languages%2C%20then%20en%20language%0A%7D
[20:14:11] There's definitely something a little off, maybe our discovery records. Not sure yet
[20:15:24] hmm, maybe we still have non-graph-split-wdqs-specific stuff in the nginx config
[20:18:33] Hmm, unlikely to be anything with nginx, I don't see anything in the config that jumps out at me
[20:19:51] Yeah, must be something with discovery. I get failures when I do `ryankemper@cumin2002:~$ curl https://wdqs-main.discovery.wmnet/sparql`, but not when I do it for `wdqs.discovery.wmnet`
[20:20:32] oh, I think I see the problem:
[20:20:41] https://www.irccloud.com/pastebin/GpCGPT20/
[20:30:49] Switching those to pooled didn't seem to fix it. Taking a break for lunch now
[20:40:44] hmm
[20:45:48] back
[20:45:56] ^^ will take a look
[20:46:36] my best guess is that the ATS mappings aren't quite right
[21:00:37] ryankemper: getting NXDOMAIN for `wdqs-main.discovery.wmnet` and `wdqs-scholarly.discovery.wmnet`... sounds like the discovery records aren't enabled
[21:36:28] ohhh, I'm a dummy. The reason the saneitizer didn't fix this page is that it reports the right problem, but wrongly. The page is both in the wrong index (content instead of general) and not supposed to be indexed (it's a redirect). We reported that it's not supposed to be indexed, but didn't report that it's in the wrong index. So the remediation keeps trying to delete it from an index it's not in
[21:38:49] maybe we should just delete from both indexes for remediations, to be safe...
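A minimal sketch of the "delete from both indexes" remediation idea from 21:38, assuming direct use of the Elasticsearch document delete endpoint; the cluster URL, index names, and page id are placeholders, and this is not the actual saneitizer remediation code:

```python
# Hypothetical sketch of the "delete from both indexes to be safe" idea.
# Cluster URL, index names and page id are placeholders, not real values.
import requests

ES_HOST = "https://search.example.internal:9243"  # placeholder cluster URL
PAGE_ID = "12345"                                 # placeholder document id

# Instead of deleting only from the index the checker *thinks* the page is in,
# issue the delete against both the content and general indexes and treat a
# 404 (already absent) the same as a successful delete.
for index in ("examplewiki_content", "examplewiki_general"):
    resp = requests.delete(f"{ES_HOST}/{index}/_doc/{PAGE_ID}", timeout=10)
    if resp.status_code not in (200, 404):
        resp.raise_for_status()
```

Treating 404 as success keeps the operation idempotent, which is what would make the blanket delete safe to retry even when the page was only ever in one of the two indexes.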