[07:32:43] And I'm back! As always, assume that any message sent to me over the last 2 weeks has been lost!
[09:21:51] I am back as well!
[13:17:53] o/
[13:25:39] someone is still (probably accidentally) attacking WDQS in CODFW, see T372074. gehel could use your advice on what to do next. Their user-agent has an email, what can/should we tell them if we reach out?
[13:25:52] Feel free to respond in DM/Slack if this is sensitive
[13:26:14] also, welcome back to you and pfischer!
[13:26:43] We don't have a template (AFAIK) for contacting users that are putting too much pressure on the service. It is rare enough that this happens AND that we have an email to contact them.
[13:27:57] If we do have a contact (which seems to be the case here), we should very much let them know that we are very happy that they find WDQS useful, but that we need them to reduce their load a bit to ensure that other users can also access the service.
[13:28:34] We **might** want to implement a grey list of users that we throttle more aggressively?
[13:28:39] Can/should we offer help?
[13:29:06] Have we had a look at the kind of queries they send? Do we know if those can be optimized? Or if another API would be a better fit?
[13:30:06] We should definitely offer help, in particular invite them to our office hours. And probably make it clear that we only have limited resources to help them (and that those resources also need to support all the other users).
[13:30:25] excellent, that is precisely what I wanted to know
[13:30:41] just didn't want to set expectations w/out approval
[13:31:06] Do we know if this is a use case that is core to Wikidata? Or something very peripheral? That might influence where it falls in our priorities.
[13:31:18] no idea
[13:33:10] Could you extract a few queries from that abusive UA? (queries are PII, so don't add them to a public phab task)
[13:35:03] will do, the private task is T372074. I've avoided putting PII in so far, as I don't know if I set the tags correctly for a private ticket
[13:37:54] It seems to be private enough that we could share PII on that task.
[13:40:09] inflatador: do you want to contact that user? Do you need help? Review?
[13:41:27] gehel I will contact the user, but it's going to take a bit of time to write up the email. I'll send it to you for review first if that is OK
[13:42:24] send it to discovery-private@wm.o, there might be others who can help wordsmith.
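A grey list like the one gehel floats at 13:28 could be as small as a per-user-agent token bucket with a tighter budget for listed agents. The sketch below is purely illustrative: the rates, the bucket shape, and the example user agent are all made up, and this is not how WDQS throttling is actually implemented.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Bucket:
    rate: float        # tokens added per second
    capacity: float    # maximum burst size
    tokens: float = 0.0
    updated: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then try to spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Hypothetical budgets: normal clients vs. grey-listed ones.
NORMAL = (10.0, 30.0)   # 10 req/s, burst of 30
GREY = (1.0, 5.0)       # 1 req/s, burst of 5
grey_list = {"SomeNoisyBot/1.0 (contact@example.org)"}  # made-up UA
buckets: dict[str, Bucket] = {}

def should_serve(user_agent: str) -> bool:
    if user_agent not in buckets:
        rate, cap = GREY if user_agent in grey_list else NORMAL
        buckets[user_agent] = Bucket(rate=rate, capacity=cap, tokens=cap)
    return buckets[user_agent].allow()
```

In practice the list would live in configuration rather than in code, but the shape of the idea is the same: identified-but-not-banned clients keep working, just more slowly.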
[13:43:02] ACK, will do
[14:00:47] \o
[14:05:32] ryankemper wdqs1022 (main graph split host) is alerting, I suppressed for 5 hrs
[14:09:52] looks like it doesn't have the graph data yet
[14:40:58] Everyone on Search Platform should have edit access to the draft email for our WDQS abuser now, reach out if not
[14:57:28] inflatador: if you have a chance some time today, need to stick a secret key into helmfile defaults from puppet private
[15:00:49] ebernhardson ACK, can do
[15:01:29] I'll be 2' late, relocating to other office
[15:16:33] ryankemper restarting elastic codfw for the openjdk ticket
[16:29:48] * ebernhardson is mildly suspicious of the saneitizer graph for codfw being empty
[16:36:54] hmm, something about the auth isn't working, deployed to codfw with the token via --set, and all requests failed :( Will investigate logs and test
[16:37:11] undeployed the token so it goes back to normal
[16:46:07] * ebernhardson realizes this is why we have a staging thing with a null output
[16:50:42] dr0ptp4kt thanks for the email review, very helpful
[16:55:17] for the WDQS queries, is the actual query above or below the line that contains `req.userAgent` and other PII?
[17:06:08] inflatador: not sure, send a query and see how yours is represented in the log?
[17:06:14] * inflatador is starting to wonder if we got the right culprit
[17:07:04] oh yeah, that sounds reasonable ;)
[17:07:37] inflatador: what was the path to the logfile again? it may be easier to tell with a hive query
[17:08:19] i'm on wdqs2007 right now so could take a quick look... heads up, gotta spruce up and then have a call and then gotta head out for work-from-mechanic time
[17:09:09] dr0ptp4kt yeah, dcausse was using hive originally
[17:09:39] I'm basing it on the fact that 2007 and 2009 were the last to crash and the "bad" UA doesn't seem to have hit 2009 for quite some time
[17:10:07] I'm also seeing a lot of sports-related queries, maybe Olympics-related?
[17:10:43] in which case maybe it'll just go away ;)
[17:13:44] inflatador: the updated user agent mentions RAG, probably retrieval augmented generation, which suggests to me it's being used for some search endpoint
[17:13:54] ohboy
[17:20:31] I'm QCing my comment in https://phabricator.wikimedia.org/T372074#10052432 against https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&viewPanel=3&from=1723161600000&to=1723247999000 (it's gross, but if you disable all but the wdqs search probes you should get a picture of what changed)
[17:57:34] the signal is pretty strong from when I banned them ~1800 UTC on Friday. Still checking my work though: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&viewPanel=3&from=1723223252350&to=1723244389054
[17:58:17] Lunch, back in ~40
[18:18:47] err, hmm... i can install tcpdump in the SUP container but not sure i'll be able to run it. Running as root inside a container requires ssh access to the host apparently :S
[18:19:05] trying to inspect the request/response cycle without logging it all to central logging
[18:22:53] i suppose i could put sudo into the container? seems odd but maybe if it's only going to run in staging
[18:27:59] Worth a try. I know there is a concept of running a tcpdump sidecar in the same pod, but I've never tried it
[18:29:32] dcausse: what kind of triple count do we expect for the main graph vs scholarly graph vs pre-graph-split graph? The reload finished on 1021 last thursday but I only see grafana reporting ~2.6 bil triples, which seems far too low if we expect the main graph triple count to be half of the total graph
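One way to sanity-check the ~2.6 billion figure from grafana is to ask Blazegraph itself over an SSH tunnel, which is what the count at 19:07 below does through the query UI. A rough Python equivalent, assuming the usual local namespace endpoint path (`/bigdata/namespace/wdq/sparql`, which may differ on the graph-split hosts):

```python
import requests

# Sketch: count triples on a wdqs host through a local SSH tunnel, e.g.
#   ssh -N wdqs1021.eqiad.wmnet -L 9999:127.0.0.1:9999
# The endpoint path below is the usual Blazegraph namespace endpoint and
# is an assumption; adjust it for the host being checked.
ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"
QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=600,  # a full count over the whole graph can take a while
)
resp.raise_for_status()
count = int(resp.json()["results"]["bindings"][0]["triples"]["value"])
print(f"{count:,} triples")  # compare against what the exporter reports
```

A full `COUNT(*)` is heavy, so it is best run against a depooled or otherwise idle host.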
[18:29:37] looks like there's a docker img called wmfdebug that's probably made for that: https://docker-registry.wikimedia.org/wmfdebug/tags/
[18:29:40] ryankemper: he's on vacation
[18:29:49] ah
[18:29:53] inflatador: oh interesting, thanks. Will check it out
[18:30:09] ebernhardson np, here's the docker file: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/images/wmfdebug/Dockerfile.template
[18:31:09] ryankemper: i don't remember the exact numbers, but i was also thinking it was in the ballpark of a 60/40 split
[18:31:31] that sounds about right. so something is very wrong with the reload then
[18:32:05] :(
[18:32:27] Haven't dug through the reload logs yet so hopefully there's something obvious
[18:53:21] Weird, no longer see a `/var/log/cookbooks_testing/logs/sre/wdqs/data-reload-extended.log` on `cumin2002`. I guess it got logrotated
[18:53:41] (repeating since erik's bouncer dced) Weird, no longer see a `/var/log/cookbooks_testing/logs/sre/wdqs/data-reload-extended.log` on `cumin2002`. I guess it got logrotated
[18:54:10] inflatador: wrt the wdqs service disruption, have we identified a new IP and/or user agent to ban?
[18:54:26] I looked over the email as well and it looks good, made a couple small style changes but that's about it
[18:58:42] ryankemper not yet, we had a couple of crashes this morning and that's about it so far
[18:59:29] ryankemper: that sounds lower than plausible as the file is near 670G. moment...
[19:03:19] ryankemper I'm ambivalent on whether or not I want to keep chasing the UA/IP combo since they seem to change it pretty frequently
[19:04:16] inflatador: we're still actively having disruption, although we're not blowing thru SLOs or anything. https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&viewPanel=43&from=now-30d&to=now
[19:07:09] ryankemper: i see `8239058383` when doing a `select (count(*) as ?triples) where { ?s ?p ?o }` at http://localhost:9999/bigdata/#query on an `ssh -N wdqs1021.eqiad.wmnet -L 9999:127.0.0.1:9999` tunnel.
[19:08:49] ryankemper ACK, we can look at requestctl again today at pairing
[19:09:21] dr0ptp4kt: interesting, I wonder if it's just an issue with the exporter or something then. haven't dug in too deep yet
[19:15:54] thanks for cleaning up the email btw
[19:27:28] inflatador: wanna hop on a meet for a bit to talk UAs and the sparql logs and stuff?
[19:31:52] dr0ptp4kt sure! Ready whenever
[19:31:59] sec
[20:46:31] Interesting pairing session w/ dr0ptp4kt... we dug up a reproducer query and it seems like more of a blazegraph bug
[20:48:21] reproducer is in https://phabricator.wikimedia.org/T372074#10052432. We triggered ProbeDowns on a couple of eqiad hosts by running it
[20:55:31] a bug in blazegraph? preposterous :P
[21:15:21] LOL. At least we found it before sending a mean email ;)
[21:24:15] we can still send them an email, we can't fix the blazegraph bugs.
[21:24:23] perhaps their query can be simplified
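Picking up dr0ptp4kt's suggestion at 17:06 to send a query and see how it shows up in the logs: a minimal probe with an easy-to-grep User-Agent makes it simpler to tell where the query text lands relative to `req.userAgent`, and whether the UA/IP being chased is really the culprit. The UA string below is made up; the endpoint and result format are the standard public ones.

```python
import requests

# Sketch: fire a trivial query at the public endpoint with a distinctive
# User-Agent, then grep for that UA in the sparql logs / hive to see how
# the query is represented. The UA string here is a placeholder.
ENDPOINT = "https://query.wikidata.org/sparql"
USER_AGENT = "wdqs-log-probe/0.1 (search-platform test; your-email@example.org)"
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"User-Agent": USER_AGENT, "Accept": "application/sparql-results+json"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["results"]["bindings"])
```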
[21:31:33] ebernhardson: inflatador: ryankemper: regarding T366253: do we need permission from service/ops before creating a low-volume kafka topic in kafka-main?
[21:31:33] T366253: Create a generic stream to populate CirrusSearch weighted_tags - https://phabricator.wikimedia.org/T366253
[21:32:10] pfischer: not currently
[21:35:33] okay, thanks!
[21:49:33] perhaps a stupid question... but how quickly after new link recommendations were generated should i expect them in the search index (=visible at https://de.wikipedia.org/w/index.php?search=hasrecommendation%3Alink&title=Spezial:Suche&ns0=1)? would two hours be sufficient for eventgate et al to process it? or should i be more patient? is there a way to learn where things are at (besides asking here)?
[21:52:45] urbanecm: a couple hours is probably reasonable, the events have to land in hive and then the hourly job has to see them available and run
[21:52:58] work is being done to bring that down to ~10min, getting closer
[21:55:18] okay, thanks for the info. and glad to hear the process would be quicker in the future! :)
[21:55:18] hmm, or maybe i'm misremembering how that pipeline works... looking for how exactly those go through the system
[21:55:27] oh, okay.
[21:55:52] urbanecm: the 10min would still apply, pfischer is creating a kafka topic that feeds directly into the streaming updater, and the updater has a 10min SLO
[21:56:00] i'm just not 100% sure how the existing ones go through the system
[21:56:53] but it's still 10 minutes at some point in the future (as opposed to "events generated today should be processed in 10 minutes"), is that right?
[21:57:25] urbanecm: right, the events would have to switch to the new kafka topic
[21:57:39] urbanecm: i'm wondering if maybe we broke something, because i'm not finding the hourly job in airflow
[21:58:49] urbanecm: so actually, the delay shouldn't exist, you should be feeding directly into the streaming updater already. The job doesn't exist in airflow because they feed directly already
[21:59:10] the change with the new one is mostly a unification, all weighted_tags over the same standardized stream instead of defining one per thing
[22:00:14] so the question would be, why doesn't your thing get in there :S
[22:01:23] oh, okay. in that case, something's probably broken. i started generating new suggestions for dewiki at Aug 12 19:28:05, which is 2+ hours ago. about 3k recommendations were generated already, which doesn't match what i see when i search for `hasrecommendation:link`.
[22:01:54] fwiw, the 822 results i see there now were there _before_ i started the script to generate more.
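For comparisons like the 822-result count above, the total hit count for `hasrecommendation:link` can also be pulled from the standard action API instead of eyeballing Special:Suche. A small sketch against dewiki (the User-Agent string is a placeholder):

```python
import requests

# Sketch: fetch the total number of dewiki mainspace pages matching the
# hasrecommendation:link keyword, to compare against the number of
# recommendations the generation script reports.
API = "https://de.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "search",
    "srsearch": "hasrecommendation:link",
    "srnamespace": 0,
    "srlimit": 1,
    "srinfo": "totalhits",
    "srprop": "",
    "format": "json",
    "formatversion": 2,
}
resp = requests.get(
    API,
    params=params,
    headers={"User-Agent": "search-index-check/0.1 (your-email@example.org)"},  # made-up UA
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["query"]["searchinfo"]["totalhits"])
```

Polling this number while the script runs gives a rough sense of whether new recommendations are reaching the index at all.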
[22:02:32] * urbanecm has no idea how to investigate further
[22:04:06] hmm, i would probably start at the `Kafka By Topic` dashboard in grafana, it does look like there are messages flowing in the eqiad topic
[22:04:22] then i would double check with kafkacat from a stat* machine to see if those messages are reasonable, looking
[22:09:24] we read from the kafka-main cluster; instructing kafkacat to wait for a new event, one came in with only a few seconds of delay
[22:09:38] dated 2024-08-12T22:09:02Z, so basically they are coming in
[22:11:25] taking an event from 21:55, so old enough to have made it through the update pipeline, i see https://de.wikipedia.org/wiki/Untergang_eines_Herzens at 21:55:43, and indeed it doesn't have the appropriate tag in the doc :S
[22:12:42] so then i guess i would check the intermediate topic to see which part of our updater is having problems, if it was properly ingested then the update should be in eqiad.cirrussearch.update_pipeline.update.rc0
[22:14:25] hmm, not seeing that page mentioned in there, scanning the events since 21:50
[22:14:38] implies something wrong with our ingestion :S
[22:15:30] the input stream is non-optional, so it should always be read from. And the events it generates are non-optional, meaning it can't read something and then decide that the row doesn't deserve an update... hmm
[22:16:21] OK, email sent re: WDQS, hoping we get a response soon
[22:21:34] watching the events that are coming through the intermediate topic though, there are certainly recommendation.link things being processed
[22:24:37] between 21:43 and 22:01 it processed 289 link exists requests, checking if any made it through
[22:27:27] some have image recommendations, but not seeing link recommendations for any of them :S
[22:28:18] urbanecm: i'm not likely to figure this out in a few minutes sadly, can you make a ticket and i'll look more tomorrow?
[22:28:29] ebernhardson: sure!
[22:28:39] thanks for taking a little look now :)
[22:39:56] ebernhardson: filed T372362. let me know if i should add more details.
[22:39:56] T372362: link-recommendations suggestions are not getting processed by Search infrastructure - https://phabricator.wikimedia.org/T372362
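One way to repeat the "doesn't have the appropriate tag in the doc" check from 22:11 without shell access is CirrusSearch's `action=cirrusdump` view of a page. A sketch, assuming the indexed document exposes its tags in a `weighted_tags` field and that link recommendations use a `recommendation.link` prefix; both are written from memory and worth verifying against a page known to carry the tag.

```python
import requests

# Sketch: dump the CirrusSearch document for a page and look for a
# recommendation.link weighted tag. Field name, tag prefix, and the exact
# shape of the cirrusdump JSON are assumptions; the UA string is made up.
PAGE = "Untergang_eines_Herzens"
resp = requests.get(
    f"https://de.wikipedia.org/wiki/{PAGE}",
    params={"action": "cirrusdump"},
    headers={"User-Agent": "cirrus-tag-check/0.1 (your-email@example.org)"},
    timeout=30,
)
resp.raise_for_status()
for doc in resp.json():  # one entry per search index the page lives in
    tags = doc.get("_source", {}).get("weighted_tags", [])
    hits = [t for t in tags if t.startswith("recommendation.link")]
    print(doc.get("_index"), hits or "no recommendation.link tag")
```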