[13:12:01] \o
[15:02:08] pfischer: triage meeting: https://meet.google.com/eki-rafx-cxi
[15:58:59] hmm, so cirrusbuilddoc doesn't report Wikipedia:REDIRECT as a redirect, but it's in the search engine. Either it returned it in the past, or the updater did that (it has a codepath that updates redirects without doing a full reindex)
[15:59:32] of the Minecraft page, specifically
[16:11:41] I just noticed that hardly any (if any at all) page_change event related to a redirect comes with a page_id for its target page (which we need to delete the page from ES). As a consequence we do not update the redirects incrementally (via ES extra plugin noop set add/remove)
[16:14:03] The EventBus code `$redirectPageIdentity = $redirectTarget->getPage(); if ( $redirectPageIdentity->exists() ) { $redirect['page_id'] = $redirectPageIdentity->getId(); //…` apparently does not do what I expected
[16:17:31] At least I do get a lot of debug log messages (when running locally against kafka-main) that we are trying to process a redirect-related page_change w/o target page_id (which implies that the target page does not exist). That would mean that quite often editors create redirects before the target exists.
[16:23:42] pfischer: huh, interesting
[16:29:58] i suppose in theory that would be fine: if the page doesn't exist yet, then when it does get created the redirect will already be available to build into the cirrusdoc
[16:58:58] hmm, further weirdness: for Minecraft the bad redirect was created aug 12 by a vandal and reverted ~14 hours later. The Minecraft page has been edited today, and the recorded revision id matches, suggesting the update made it through. But the bad redirect is still there, suggesting maybe we send the wrong noop hints on update? Still unclear
[16:59:27] Also it seems we have a title/prefixed title mixup somewhere which is allowing the Wikipedia:Wikipedia:REDIRECT (and similar) error
[17:01:17] (also we sadly don't have the intermediate event, it's just out of the 7-day window. Looking for a better example where we have all the events)
[17:01:26] dr0ptp4kt: 2 mins
[17:14:16] Seeing some failed queries pop back up. I think we're going to have to ban our old friend again
[17:14:17] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs-scholarly&from=1723983158021&to=1724087589942&viewPanel=43
[17:21:52] ryankemper ;( . Just when I was going to close that ticket
[17:23:32] on the positive side, at least they haven't crashed blazegraph again (yet)
[17:27:01] lunch, back in ~1h
[17:30:54] I wonder what these queries that kill Blazegraph are
[17:31:24] Are they very complex, requiring a lot of analysis, or straightforward but voluminous queries amounting to "dump the whole database"?
[17:33:52] hare: generally the former is my understanding
[17:34:10] but we've seen both types in the past
[17:43:50] hare: there is an example query in our private ticket, it's not all that complicated but it is very very long, uses a bunch of unions, and some syntax i'm not familiar with
[17:44:06] not sure if they all look like that, or just that particular example query
[18:08:04] back
[18:09:05] ryankemper are we sure that dashboard only counts prod wdqs? the promQL query `backend=~"wdqs.*"` makes me think it could match the graph split hosts too?
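(Editor's note: one way to answer the question above would be to group the panel's expression by the backend label and see which backends the regex actually matches. This is only a sketch: the metric name below is a placeholder and the graph-split backend names are assumptions; only the `backend=~"wdqs.*"` matcher comes from the dashboard panel.)
```
# Sketch: see which backends the dashboard's matcher actually selects.
# "trafficserver_backend_requests_total" is a placeholder metric name; only
# the backend=~"wdqs.*" matcher comes from the panel.
sum by (backend) (
  rate(trafficserver_backend_requests_total{backend=~"wdqs.*"}[5m])
)

# If graph-split backends (hypothetically named wdqs-main*/wdqs-scholarly*)
# show up above, a negative matcher would scope the panel back to prod wdqs:
sum by (backend) (
  rate(trafficserver_backend_requests_total{backend=~"wdqs.*", backend!~"wdqs-(main|scholarly).*"}[5m])
)
```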
[18:32:04] gehel: running into some issues with my hub that connects monitor/keyboard/mouse/etc, gonna try restarting laptop and see if that helps. few mins late to 1:1
[18:36:08] Do we have a date for the graph split?
[18:36:52] hare: we're hoping to have it out this week
[18:59:31] dr0ptp4kt: any advice for the best way to track down the new user agent? the most recent gmail user agent we were aware of doesn't show up in the logs after aug 13. they've probably switched to a new one but not sure how to efficiently track it down
[19:00:27] ryankemper: in a meeting, will take a look right after that and respond here
[19:00:34] ty!
[19:00:36] ryankemper I'd look for the domain mentioned here https://docs.google.com/spreadsheets/d/1Sr7b2QvYqs_ispVyMqcDBhqhwq_nsBWJHsrYW8Iebew/edit?gid=0#gid=0 . Still not 100% sure the attacker is back though
[19:01:54] I'd check the age of `wdqs-blazegraph` processes, as the bad query crashed them every time
[19:04:16] I've said this in other venues but, if you haven't heard yet, I will continue operating my own Blazegraph as a unified graph, so you can send people to me if they ask about that. I imagine at some point my unified Blazegraph will crash for good, but hopefully by then I will be using a replacement.
[19:04:26] unrelated, I am getting different results in the old ( https://grafana.wikimedia.org/goto/Hbe588jSR?orgId=1 ) and new ( https://grafana.wikimedia.org/goto/pMCbU8CIR?orgId=1 ) Cirrus failures dashboards, particularly the "failed" category. Any suggestions on this?
[19:08:05] wrt wdqs abuse, I suspect their current queries are causing timeouts but not directly crashing blazegraph yet
[19:08:26] i'm pretty sure it's the same individual due to the similar pattern in failure spikes (~2 hours apart)
[19:09:20] And it can't be main/scholarly since the metrics come from trafficserver, so even if the metric is counting both, there's no way the main/scholarly queries are >=20% of total wdqs query volume
[19:12:37] inflatador: hmm, looking
[19:16:11] curiously, the old dashboard even shows differences just between the dashboard and the "explore" version of the panel
[19:16:42] inflatador: oh, it's the log scale
[19:16:53] inflatador: the old dashboard has its y-axis on a log scale
[19:17:13] i think the point of that was to avoid one big spike from turning the rest of the dashboard into a flat line
[19:17:40] i suppose the peaks are also lower in the new version though
[19:19:09] changing the aggregation period from [5m] to [2m] brings them up higher, i suppose because the error increments are very small, so 4 errors over 5 min vs 4 errors over 2 min
[19:19:22] not sure what's best to do with that :S Sometimes i like keeping cumulative counts
[19:20:36] ebernhardson ACK, we could make a count panel or query also. In the meantime, would you like me to change the scale from log to linear and/or change the agg period?
[19:21:26] inflatador: no clue :P I suppose the exact values aren't particularly important, it's more about trends and outliers. Probably log scale is better since we know that sometimes things blow out the scale when an error affects hundreds of req/s
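(Editor's note: a sketch of the window trade-off discussed above. `rate()` divides the increase by the window length, so the same 4 errors read as 4/300 ≈ 0.013/s over [5m] but 4/120 ≈ 0.033/s over [2m], while `increase()` keeps the raw per-window count. The metric name is a placeholder; only the [5m]/[2m] windows come from the chat.)
```
# "cirrussearch_failed_requests_total" is a placeholder metric name.

# rate() divides by the window, so a small burst looks lower over [5m]
# (4/300 ≈ 0.013/s) than over [2m] (4/120 ≈ 0.033/s):
sum(rate(cirrussearch_failed_requests_total[5m]))
sum(rate(cirrussearch_failed_requests_total[2m]))

# increase() keeps the raw count per window, which may read better for
# spotting trends and outliers than a per-second rate on a log axis:
sum(increase(cirrussearch_failed_requests_total[5m]))
```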
[19:32:11] hare: Yes, we've heard about your proposal for longer-term support of a single graph. Thanks a lot! That's going to help the transition!
[19:32:39] pfischer: in case you're around: https://meet.google.com/iti-ahax-yme (JVM stewardship)
[19:33:39] ryankemper ACK, sounds reasonable to take a closer look then. LMK if you find anything
[19:37:52] do we need `CirrusSearch Job Abandonment Rates` anymore? https://grafana.wikimedia.org/goto/1cvx_8jSg?orgId=1 guessing we don't
[19:39:20] inflatador: we will still have a small number of jobs, for archive updates and the php weighted tags api. So it's not completely useless, but i'm not sure how useful it would be
[19:44:55] ryankemper: around?
[19:45:48] dr0ptp4kt: yeah
[19:45:56] moment, will send you meet link
[19:48:43] having trouble finding a prom equivalent of graphite's `jobrunner.job-abandon`
[19:51:50] `MediaWiki.jobqueue.run.cirrus` as well... will hit up 0lly if the trend continues
[19:52:40] maybe we want the flink metrics for this?
[19:56:03] inflatador: they might not have one yet, that would plausibly be coming from cpjobqueue
[19:57:02] for flink, i don't think there is an equivalent to abandoned jobs. iiuc an abandoned job is when the infra tries to run the job, it fails, and it runs out of retries
[19:57:23] i suppose we do have some similar-ish concept for fetch-failures
[20:05:17] I think we already have alerts for elevated fetch failure rate, but I could be hallucinating that
[20:06:10] yea we have a `CirrusConsumerFetchErrorRate` alert
[20:16:04] would it be useful to have the failure-related topics on this page, or is the existing "kafka by topic" panel enough? kinda torn on this one
[20:52:09] ryankemper: i'm going to do the /etc/hosts line on wdqs1021 so i can do more federated queries from the wiki page
[21:04:53] for the job rates dashboard (https://grafana.wikimedia.org/goto/Atui6UCIg?orgId=1), should i be looking at Kafka metrics, something like `rate(kafka_server_BrokerTopicMetrics_MessagesIn_total{topic="codfw.mediawiki.job.cirrusSearchElasticaWrite"}[15m])`?
[21:31:34] ryankemper: i removed the /etc/hosts line. https://phabricator.wikimedia.org/T370754#10075706
[21:56:30] inflatador: hmm, hard to say. I suppose to start off with, simply linking it should be reasonable. It's what we do on the SUP dashboards
[22:00:36] ebernhardson cool, thanks for the advice today
[22:46:49] ryankemper: doing /etc/hosts on wdqs1021 and wdqs1023 for federation in both directions
[22:53:15] ack
[23:05:33] * ebernhardson can never remember the difference between eventbus and eventgate without looking them up...
[23:06:05] ryankemper: i removed the entries from each. have a good afternoon and evening!
[23:06:29] * dr0ptp4kt feels like ebernhardson :)
[23:06:46] :)
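(Editor's note: following up on the job rates question at 21:04, if Kafka metrics turn out to be the right source, one option would be to widen the single-topic query to all cirrusSearch job topics and group by topic. A sketch only: the "eqiad." prefix and the existence of other cirrusSearch* job topics are assumptions; the only topic name taken from the chat is codfw.mediawiki.job.cirrusSearchElasticaWrite.)
```
# From the chat: per-topic message rate for one cirrus job topic.
rate(kafka_server_BrokerTopicMetrics_MessagesIn_total{topic="codfw.mediawiki.job.cirrusSearchElasticaWrite"}[15m])

# Sketch: widen to every cirrusSearch job topic in either DC and group by
# topic. The "eqiad." prefix and the other cirrusSearch* topics are
# assumptions, not taken from the chat.
sum by (topic) (
  rate(kafka_server_BrokerTopicMetrics_MessagesIn_total{topic=~"(eqiad|codfw)\\.mediawiki\\.job\\.cirrusSearch.*"}[15m])
)
```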