[08:59:02] tarrow: just wanted to draw your attention to the work that Erik has done on https://phabricator.wikimedia.org/T334194#8857050 (a fix was just merged for this); we believe that it'll help *a lot* in your setup. Hopefully the fix is easy to backport to the version you're using
[09:01:48] dcausse: ooooh! That's mega exciting. I'll take a look in a bit; I'm actually still in Hackathon land and probably won't be thinking too smartly before Monday, but that's awesome!
[10:06:45] lunch
[10:51:47] Looks like there is an issue with WDQS codfw. See #wikimedia-operations.
[10:52:43] I'm not near a keyboard for at least one more hour. Not an emergency, but if someone could have a look...
[12:14:39] dcausse: do you need help on WDQS?
[12:25:45] gehel: looking at the access logs hoping to identify something... not sure we have great tools to identify bad queries
[12:28:49] Not sure either. And not even sure what such a tool would look like
[12:38:54] dcausse: I have an interview in 20', and I think Brian does as well. So not much help coming from us. Scream in -operations if you need technical help or psychological support!
[12:39:11] sure :)
[12:40:48] o/ I'm currently working on a core patch for `RedirectLookup`. Is there an IRC/Slack channel to ask for help?
[12:44:39] o/
[12:46:01] pfischer: I don't think there's a clear channel to discuss code changes like that... if you have a precise question you might just ask here; if it's more related to the overall design, Daniel K. might be the person to ping
[12:46:26] dcausse turnilo might have some hints for you re: the WDQS outage... assuming it is a bad query, they can usually block it at the varnish layer
[12:47:41] pfischer: btw, have you seen the discussion with Andrew related to this new service on your patchset? We're questioning whether it has enough value to promote it as its own service or not yet
[13:29:02] seems like we no longer have wdqs logs
[13:29:08] Caused by: java.lang.ClassNotFoundException: net.logstash.logback.appender.LogstashSocketAppender
[13:39:18] dcausse: yes, I have. That's why I followed Daniel's suggestion of adding a method to RedirectLookup that provides us with a PageReference.
[13:42:14] pfischer: oh ok, did not see this, thanks!
[13:43:07] ebernhardson: o/ https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/381 is ready to go whenever you want!
[14:04:54] dcausse, inflatador: how are we doing on that wdqs incident? Do we have a phab task? Should we send an email to the wikidata mailing list?
[14:05:25] gehel Interview just ended. Looks like it's still not recovered?
[14:05:34] no it's not
[14:07:48] OK, let me try the kafkacat method to search for abuse... ebernhardson found our bad query last time with it
[14:07:48] https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Kafkacat_method
[14:08:12] guessing it is a repeat of https://wikitech.wikimedia.org/wiki/Incidents/2022-11-22_wdqs_outage
[14:10:28] Before digging more into solving the issue, we should probably at least send a short message. Do you want me to do that?
[14:10:38] gehel sure
[14:11:02] do we have a phab ticket yet?
[14:11:07] no
[14:11:26] I'll let you reply to the email with the ticket once it is created (or a skeleton of an incident report)
[14:11:57] I'm just out of an interview too. Let me know if I can help with anything, although I'm brand new to WDQS at the moment.
[14:12:17] I have another meeting in 3', I'll just send that first email.
[14:12:52] btullis: thanks for the help! dcausse and inflatador might be able to give you more context. Basically, we're looking through logs to see if we can identify a source of queries overloading the cluster.
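(Editor's aside: a minimal sketch of what the kafkacat triage mentioned above can feed into. It assumes kafkacat output has been piped in as one JSON event per line; the field path used for the user agent is an assumption, not the real topic schema, so adjust it to whatever the runbook's topic actually contains.)

```python
#!/usr/bin/env python3
"""Tally user agents from kafkacat JSON output (one event per line on stdin)."""
import json
import sys
from collections import Counter

UA_PATH = ("http", "request_headers", "user-agent")  # assumed field path, not verified


def extract(event, path):
    """Walk a nested dict along `path`, returning None if any key is missing."""
    for key in path:
        if not isinstance(event, dict) or key not in event:
            return None
        event = event[key]
    return event


counts = Counter()
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # kafkacat can emit partial or diagnostic lines; skip them
    ua = extract(event, UA_PATH)
    if ua:
        counts[str(ua)] += 1

for ua, n in counts.most_common(20):
    print(f"{n:8d}  {ua}")
```

Invoked as something like `kafkacat -C -b <broker> -t <topic> -o -100000 -e | python3 top_uas.py`, with broker and topic taken from the runbook.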
[14:15:38] Ok right, lots of GC in blazegraph here: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=37 So someone is sending heavy queries?
[14:17:15] btullis Y, I was using the kafkacat method to check user agents, but that's not getting me too far. I think we might need to pick a query string topic, if such a thing exists. Going to ask in the larger SRE room
[14:20:25] OK, I'm also starting to look at turnilo for "host": "query.wikidata.org"
[14:20:46] inflatador, btullis: see the security chan
[14:20:54] Thanks, will do.
[14:42:12] \o
[14:51:48] o/
[14:53:58] gehel do you want to keep the meeting in 10m? We could work on the incident if so
[14:54:59] incident report will be at https://wikitech.wikimedia.org/wiki/Incidents/2023-05-23_wdqs_CODFW_5xx_errors
[14:58:46] inflatador: I'll keep the SRE meeting, but feel free to skip
[15:01:14] btullis: are you joining the SRE meeting? Or are you focused on the wdqs incident as well?
[15:34:19] dcausse: i'll need a reminder... what was i going to start looking into in the streaming updater?
[15:35:11] ebernhardson: re https://phabricator.wikimedia.org/T199220#8874014 (elastic backups in swift), do we plan to continue using the search_backup account?
[15:35:38] hmm
[15:36:24] ebernhardson: IIRC we discussed looking into weighted_tags, but perhaps working on redirects with Peter might make more sense, I dunno
[15:37:00] dcausse: i suppose that would be via snapshots, which would go through the s3 endpoint now iirc. We want to generally have that available as an option even if we aren't regularly using it
[15:37:22] ok, makes sense
[15:38:54] i don't remember if that used the search_backup user or not though :S we don't have snapshots currently configured afaict
[15:40:45] yes, saw that, but just answered that we'd like to keep it :)
[15:41:55] the other swift account we have is wdqs, but I see no containers related to elastic
[15:42:14] ok, seems reasonable then
[15:42:17] there's the one we use for shipping batch data to elastic; perhaps that's the one we use?
[15:42:41] yea, perhaps we used that one, because its creds are easy to look up
[15:53:45] https://phabricator.wikimedia.org/T309648 here's the ticket where we set up the Elastic S3 stuff
[15:54:49] oops, looks like we asked for another user: https://phabricator.wikimedia.org/T309715
[15:59:15] doh
[16:01:05] workout, back in ~40
[16:12:11] wdqs logging is broken because net.logstash.logback.appender.LogstashSocketAppender is not part of the jar
[16:12:56] the log config says it wants to connect to localhost:11514 but nothing is serving that port on the machine
[16:13:15] hmm, that should have been rsyslog iirc
[16:14:23] or maybe not, maybe that's always logstash udp?
[16:19:02] no clue :/
[16:21:10] on elastic machines we use logstash locally
[16:22:53] yea, i was just looking into those. Those use port 12201 for whatever reason. Poking puppet history, 11514 seems to have been used with logstash. It's not clear to me how wdqs ever worked though, i've not yet found what used to read that port
[16:27:13] I think https://phabricator.wikimedia.org/T232184 would have been the last time we set up logging? I'm guessing https://gerrit.wikimedia.org/r/c/operations/puppet/+/535345/7/modules/profile/manifests/wdqs/common.pp#17 is what was supposed to listen
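(Editor's aside on the snapshots-via-the-s3-endpoint idea at 15:37 above: registering an S3-backed snapshot repository in Elasticsearch looks roughly like the sketch below. The cluster URL, repository name, bucket and endpoint are placeholders, not the production values; the repository-s3 plugin reads its credentials from the Elasticsearch keystore, not from this request.)

```python
#!/usr/bin/env python3
"""Sketch: register an S3-backed snapshot repository in Elasticsearch."""
import json

import requests

ES = "http://localhost:9200"   # placeholder cluster endpoint
REPO = "swift_backups"         # hypothetical repository name

body = {
    "type": "s3",
    "settings": {
        "bucket": "elastic-snapshots",    # placeholder bucket name
        "endpoint": "swift.example.org",  # placeholder S3-compatible endpoint
        "protocol": "https",
        "path_style_access": True,        # swift's S3 layer usually wants path-style URLs
    },
}

resp = requests.put(f"{ES}/_snapshot/{REPO}", json=body, timeout=30)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))

# Once registered, PUT /_snapshot/swift_backups/<snapshot-name> takes a snapshot
# and GET /_snapshot/swift_backups/_all lists what's there.
```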
[16:29:25] dcausse: on which hosts is nothing listening on 11514? I checked a couple of hosts and rsyslog is there listening
[16:31:08] oh
[16:32:22] doh... checked with telnet but it's udp...
[16:32:34] netstat shows it
[16:32:45] yea, that'll do it :) I like `sudo lsof -P -i -n | grep ` to find things
[16:33:20] i suppose netstat is similar
[16:33:59] netstat -l shows the open port but does not help to find what process owns this port
[16:36:04] so I guess we just need to fix the logging jar to include net.logstash.logback.appender.LogstashSocketAppender
[16:41:10] back
[17:22:28] https://phabricator.wikimedia.org/T337327 ticket for the WDQS outage
[17:36:14] Do you guys remember if there's an email address that routes to our team that people external to the foundation can email? We want to provide an email in the 403 description for the AS we're banning for wdqs so they can email us
[17:36:37] the mailing list?
[17:36:45] i don't know if you have to sign up to send a mail though
[17:38:49] inflatador, ryankemper: https://gerrit.wikimedia.org/r/c/operations/puppet/+/922584 should fix blazegraph logging; tested on wdqs2007, I can finally see timeouts & stacktraces
[17:42:29] dcausse we're still working on banning that one ISP, will take a look afterwards. btullis, if you're still around and wouldn't mind puppet-merging, let us know
[17:46:12] I also see another giant query timing out, from another ISP I think
[17:46:44] dcausse NM, I'll take care of merging that now. I assume it will require a restart of blazegraph for it to work?
[17:47:01] inflatador: yes
[17:48:47] dcausse I merged, but we're not going to restart just yet. Just want to make sure the ban really works
[17:49:34] kk
[17:49:48] If it doesn't, we'll want to look closer at the other query you just mentioned
[17:50:56] * dcausse crosses fingers
[17:54:33] the one I see comes from https://github.com/searx/searx/blob/master/searx/engines/wikidata.py so it's certainly not new
[17:55:16] but the generated query is wild
[18:10:45] we're about to roll out the requestctl command and do a rolling restart, which should pick up the logging changes as well
[18:15:42] ebernhardson: by the mailing list did you mean the wikidata one? Or is there a discovery mailing list that will route to us but not be publicly visible to others?
[18:16:39] ryankemper: it would be publicly visible, the normal mailing list. We don't really have anything more private but not 1-on-1, afaik
[18:21:05] gehel Ryan and I are still working through the incident and won't make the 1x1. Ping me if you want to join our Google Meet
[18:22:05] dinner
[18:23:19] dcausse once you get back, can you paste the suspected bad query into https://phabricator.wikimedia.org/T337327 ? Just in case our current fix does not work
[18:27:13] inflatador, ryankemper: let's cancel our SRE pairing session. Either you'll be working on solving it and probably don't need me to get in the way, or you'll need to take a break to recover.
[18:27:24] gehel: agreed!
[18:39:11] inflatador: https://phabricator.wikimedia.org/P48495
[18:41:22] comes from https://github.com/searx/searx/commit/95bd6033fad53b584ae5be54f2229a6edfb5b6a2 (assuming that a single SPARQL call is better than 2 wikidata API calls)
[19:10:57] Ah, thanks for looking that up. GC is down but not completely flat as it is in eqiad, so we may need to block that query
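(Editor's aside on the 11514 check earlier: telnet only speaks TCP, so a refused telnet connection proves nothing about a UDP syslog socket, which is exactly the trap above. A minimal way to exercise the listener is to fire a syslog-style line at it and then look for it downstream; a sketch, with the port taken from the logback config discussed above and the message content arbitrary.)

```python
#!/usr/bin/env python3
"""Send a syslog-style test line to the local UDP listener on 11514."""
import socket

HOST, PORT = "127.0.0.1", 11514  # port from the wdqs logback config mentioned above

# <134> = facility local0, severity info; the tag is arbitrary
message = b"<134>wdqs-logging-test: hello from the udp probe"

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.sendto(message, (HOST, PORT))

# UDP is fire-and-forget: the only confirmation is finding the line in
# rsyslog/logstash on the receiving side.
print(f"sent {len(message)} bytes to {HOST}:{PORT}; check the downstream log pipeline")
```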
[19:13:21] inflatador: if you want to review the start of the SRE meeting, the recording is attached to the meeting invite
[19:50:37] back. I assume d-causse is gone for the day, but we might need to write a requestctl rule to block the bad query above... looks like the pattern type we need is 'cache-text/bad_param_q', per https://gerrit.wikimedia.org/g/operations/software/conftool/%2B/HEAD/conftool/extensions/reqconfig/
[19:58:09] inflatador: we should not block if it's not the cause of our problems (and apparently it was not their fault this time); if we believe it's hurting perf, we can post an issue on their github to discuss improvements
[20:03:25] dcausse Agreed, I am just wary since the last time we did a rolling restart we got about 90m of stability, then went right back to flapping
[20:03:45] I don't want to actively block them, I just want to have a rule ready in case the current fix doesn't work
[20:04:34] anyway, GC is looking much better, so hopefully our fix worked
[20:06:13] inflatador, ryankemper: if WDQS is stable again, could you post an update on the mailing list (reply to my email), and maybe on the phab task?
[20:06:18] * gehel is out for today!
[20:06:31] (unless you need me, in that case: scream loud enough!)
[20:07:06] gehel Yes
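(Editor's aside on having a rule ready without deploying it: before translating anything into an actual requestctl pattern, it's cheap to sanity-check a candidate match offline against sample request URLs. The regex and URLs below are made-up placeholders, not the real offending query, and the real rule syntax lives in the reqconfig extension linked above, not here.)

```python
#!/usr/bin/env python3
"""Dry-run a candidate block pattern against sample SPARQL request URLs."""
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical pattern: something distinctive from the suspect generated query.
CANDIDATE = re.compile(r"GROUP_CONCAT\(DISTINCT", re.IGNORECASE)

SAMPLES = [
    # placeholder requests, not taken from the real access logs
    "https://query.wikidata.org/sparql?query=SELECT%20%3Fitem%20WHERE%20%7B%20%3Fitem%20wdt%3AP31%20wd%3AQ5%20%7D%20LIMIT%2010",
    "https://query.wikidata.org/sparql?query=SELECT%20(GROUP_CONCAT(DISTINCT%20%3Flang)%20AS%20%3Flangs)%20WHERE%20%7B%20...%20%7D",
]

for url in SAMPLES:
    qs = parse_qs(urlsplit(url).query)
    query = (qs.get("query") or [""])[0]  # parse_qs already URL-decodes the value
    verdict = "WOULD BLOCK" if CANDIDATE.search(query) else "would pass"
    print(f"{verdict}: {query[:80]}")
```

The point is only to confirm the pattern catches the bad query shape without also matching ordinary traffic before anyone reaches for requestctl.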