[04:28:28] Still seeing some flapping on wdqs codfw hosts. Most logical options are to tighten the existing requestctl to completely ban the user we limited to 1 connection earlier, or alternatively to go spelunking for another candidate to ban
[05:37:26] Reload on 1021 finished. Haven't had time to check integrity yet
[06:25:46] I think the reindex-ochestrator is missing the last backfill somehow
[10:48:48] lunch
[12:49:57] looks like it's back to the drawing board on wdqs ;(
[12:55:29] o/
[12:55:36] yes :(
[13:00:03] 2011 and 2013 just fired, checking now
[13:03:25] I'll try banning that IP/UA combo completely...probably won't change anything but at least we can rule that out
[13:09:35] the ip might have changed or it's perhaps not the culprit
[13:12:28] yeah, I would expect to see something in turnilo when we look at TTFB that correlates with IP or at least ASN
[13:33:32] OK, the IP/UA combo is now completely banned. Let's see if it makes a difference
[13:34:23] if not, I think we're going to have to find the bad query...open to suggestions on the best way to do that. I can hit up Observability once we have a plan
[13:55:32] inflatador: seems like the UA of the offending IP changed
[13:55:44] it now has an email address
[13:56:17] they might have noticed the ban and have adapted their UA?
[14:49:23] heading out, have a nice weekend and see you all in a couple weeks
[14:53:09] I'm starting to wonder if there's something wrong with the throttling filter in codfw, I'm seeing pybal and prometheus getting throttled
[15:32:03] we have a couple of options here: failover to EQIAD to test the theory, or suppress alerts in CODFW during the long weekend. I think the latter is the right call as no one's around
[16:10:24] update: filter still doesn't seem to be working even after tweak
[16:57:45] working on wdqs has led me to constantly misspell "fqdn" as "fdqn"
[18:19:50] inflatador: lol I always make that fdqn mistake
[18:32:54] inflatador: I’ll monitor the situation for an hour or so before signing off btw
[18:33:32] ryankemper sounds good, you following in #security?
[18:34:44] ah no I just saw the phab comment, catching up now
[23:06:52] Meh, still have instability. I was seeing requests from the same IP as before in superset so I tightened the requestctl ban to ban the ip regardless of user agent, but I still see alerts so there is probably another culprit
[23:10:48] Starting to think we may have been barking up the wrong tree
[23:11:34] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&viewPanel=43&from=1720658320188&to=1723245013057 This graph shows our failed query rate has been exceptionally high since the end of July