[06:22:25] <zpapierski>	 it seems we again need to take down the WCQS :(
[06:23:13] <zpapierski>	 journal is 3.3T and there's 83gb left on the device - I highly doubt that it will survive the next update
[06:24:09] <zpapierski>	 actually, it didn't survive the last one, space left came from cleanup 
[06:31:16] <zpapierski>	 if nobody objects, I'll allert the commons-l and start the cleanup process
[06:51:54] <dcausse>	 zpapierski: while we're here it might be a good opportunity to do T284040 if no-one objects 
[06:51:54] <stashbot>	 T284040: Enable blank node skolemization on wcqs - https://phabricator.wikimedia.org/T284040
[06:53:23] <zpapierski>	 we could - it's still downloading the dump
[06:54:44] <zpapierski>	 (and from a looks of it, it will take a lot of time)
[06:55:12] <dcausse>	 if we're downloading from our own dumps.wikimedia.org yes it will :/
[06:55:19] <zpapierski>	 can we not?
[06:55:56] <dcausse>	 we use https://dumps.wikimedia.your.org/ generally from the prod machines
[06:56:22] <zpapierski>	 will it work from cloud?
[06:56:45] <dcausse>	 no clue, never tried
[07:00:30] <zpapierski>	 it does, now that's a difference
[07:02:39] <zpapierski>	 est. 6m vs 90m
[07:08:37] <zpapierski>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/698451
[07:12:07] <zpapierski>	 and https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/698452
[07:14:31] <zpapierski>	 gehel: help with this one? https://gerrit.wikimedia.org/r/c/operations/puppet/+/698451/
[07:44:18] <dcausse>	 zpapierski: just realized that https://gerrit.wikimedia.org/r/c/operations/puppet/+/698451 is based on a patch that we might not ship (shards increase on relforge)
[07:44:47] <zpapierski>	 my mistake
[07:45:08] <zpapierski>	 rebased that
[07:45:56] <dcausse>	 thanks!
[07:55:13] <zpapierski>	 ok, message sent
[07:55:33] <zpapierski>	 I manually modified puppet files and update script for now (I needed to disable auto update anyway)
[07:55:41] <zpapierski>	 so skolemization should be there after the update
[07:59:58] <gehel>	 zpapierski: sorry, late start this morning, looking
[08:00:07] <zpapierski>	 no worries
[08:01:00] <dcausse>	 we might want to add a quick note in the e-mail related to this reload (just pointing to https://www.mediawiki.org/wiki/Wikidata_Query_Service/Blank_Node_Skolemization should be sufficient) 
[08:01:09] <zpapierski>	 ah, true
[08:01:55] <gehel>	 zpapierski: just to make sure: I can merge this and apply it, it will not cause trouble on the current journal?
[08:02:15] <zpapierski>	 current journal is no good anyway
[08:02:48] <gehel>	 :/
[08:03:02] <zpapierski>	 yeah...
[08:03:36] <zpapierski>	 I remember when we discussed it in February, 3 months seemed like enough time to introduce streaming updater to WCQS, but we were too optimistic
[08:03:38] <gehel>	 merged and applied on the production puppetmasters
[08:03:42] <zpapierski>	 thanks
[08:03:46] <gehel>	 still needs to be updated on WMCS
[08:04:36] <zpapierski>	 yep\
[08:04:40] <zpapierski>	 I'll take care of it
[08:04:44] <gehel>	 thanks!
[08:05:02] <zpapierski>	 once this updates done (I already applied those changes there manually)
[08:07:51] <zpapierski>	 update's in progress, along with skolemization options (still need to merge wdqs script changes and update our puppetmaster )
[08:08:17] <zpapierski>	 (but that can be done afterwards)
[08:08:47] <zpapierski>	 ok, I need to relocate, be back in ~40m
[08:08:59] <gehel>	 zpapierski: I think the puppetmaster is updated every 30', so if you wait long enough, it's just going to happen
[09:22:50] <zpapierski>	 not the one for qcqs
[09:22:56] <zpapierski>	 s/qcqs/wcqs
[09:23:13] <zpapierski>	 we have our own, since we can't keep secrets with the WMCS one
[09:32:38] <zpapierski>	 huh - https://admin.phacility.com/phame/post/view/11/phacility_is_winding_down_operations/
[09:33:03] <zpapierski>	 apparently they still plan to maintain phabricator, but I'm guessing this will have a short expiration period
[09:37:39] <zpapierski>	 https://phabricator.wikimedia.org/T283980
[09:37:53] <zpapierski>	 Forgot  about the bot - T283980
[09:37:53] <stashbot>	 T283980: Phacility (Maintainer of Phabricator) is winding down. Upstream support ending. - https://phabricator.wikimedia.org/T283980
[10:01:47] <zpapierski>	 dcausse: I created a simple Api for suggest like you proposed - do I have to register it to use it ?
[10:02:24] <dcausse>	 zpapierski: yes, see APIModules in extension.json
[10:02:34] <zpapierski>	 looking, thx
[10:02:37] <zpapierski>	 after that?
[10:03:28] <zpapierski>	 ApiPropModules or ApiModules? what's the difference?
[10:03:28] <dcausse>	 then call api.php?action=the-name-you-chose&params=...
[10:03:35] <zpapierski>	 ah, cool
[10:14:30] <dcausse>	 lunch
[11:19:37] <zpapierski>	 ok, my API is working, now how to I test it...
[11:21:57] <zpapierski>	 ok, I think I know how
[11:51:05] <gehel>	 zpapierski, dcausse: those tickets are still assigned to Maryum. Should you take them over? T273095 T273098
[11:51:05] <stashbot>	 T273098: High Availability Flink - https://phabricator.wikimedia.org/T273098
[11:51:06] <stashbot>	 T273095: Deploy Helm Chart - https://phabricator.wikimedia.org/T273095
[12:08:00] <zpapierski>	 hmm, I'm not sure which tasks are done with which tickets - there's been a lot of changes
[12:08:30] <zpapierski>	 can we spend 4.5 minutes during triage discussing this?
[12:08:46] <zpapierski>	 lunch break
[12:19:56] <dcausse>	 gehel: I took them over, they should be in review, feel free to assign them to me if this makes more sense 
[12:33:50] <gehel>	 dcausse: I assigned them to you, it makes more sense to me since Maryum should not be expected to push them forward anymore
[12:34:25] <dcausse>	 sure
[12:34:41] <gehel>	 zpapierski: I'll try to remember to raise the point during sprint planning / triaging, but please raise it if I forget
[12:35:36] <gehel>	 zpapierski: is there anything left to do on T261119 ?
[12:35:36] <stashbot>	 T261119: Architecture review of Flink based WDQS Streaming Updater - https://phabricator.wikimedia.org/T261119
[12:36:14] <gehel>	 sub tasks are closed and as far as I remember, we don't have another review planned with Ververica
[12:51:19] <gehel>	 dcausse, zpapierski : blazegraph journal on wdqs1012 is > 2TB. Do you want to investigate something before we scrap it and recover from another node?
[12:53:35] <dcausse>	 gehel: according to https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=32&orgId=1&from=1621011941259&to=1623007039654 it's clear that it was due to the free allocator bug so I'm not sure we can learn more at this point
[12:54:27] <gehel>	 that would have been my guess :)
[13:02:06] <gehel>	 dcausse: did ryankemper talk with you about reimaging wdqs1009 ?
[13:02:52] <gehel>	 it's the last node still low on disk space, but since it is the only one with a skolemnized journal, there is no simple way to recover that journal after reimage (T280382)
[13:02:52] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[13:03:07] <gehel>	 mpham: ^^
[13:03:57] <gehel>	 there is no emergency to reimage, but I think this test server has been exposed to users for long enough, it is unlikely that we will get more reports. We probably want to send an email to wikidata@ before taking it down.
[13:04:33] <gehel>	 dcausse: are there other reasons to keep this skolemnized journal? For internal testing? Or should we just trash it and move back to the "standard"
[13:05:51] <gehel>	 ryankemper: for when around: T283269 seems to be completed. Or is there something else that needs to be done before closing ?
[13:05:52] <stashbot>	 T283269: Cirrussearch elasticsearch rolling operation cookbook causing alerts - https://phabricator.wikimedia.org/T283269
[13:06:43] <dcausse>	 gehel: I prefer to keep it for testing, esp with k8s coming up
[13:07:15] <gehel>	 dcausse: ok, I'll add a note on the ticket.
[13:08:36] <dcausse>	 I'm fine putting it down for a couple days if it helps the re-image assuming we have enough space somewhere else to save the journal
[13:09:34] <gehel>	 I think it is easier to just wait :/
[13:10:06] <dcausse>	 sure
[13:10:29] <gehel>	 zpapierski: should we close T264873 ? We have sonar enabled on all projects as far as I know. There are still issue of analysis being broken on merge commits, but that's rare enough that I think it make sense to wait until we move to gitlab.
[13:10:30] <stashbot>	 T264873: Ensure that SonarQube is commenting on gerrit code reviews of the Search Platform team - https://phabricator.wikimedia.org/T264873
[13:10:41] <gehel>	 damn, it's starting to rain
[13:10:44] * gehel needs to relocate
[13:30:02] <gehel>	 ryankemper: can we close T274788 ?
[13:30:02] <stashbot>	 T274788: potential disk issue on wdqs1010 - https://phabricator.wikimedia.org/T274788
[13:31:46] <gehel>	 ryankemper: same for T267927. I'm not sure what the status is with the reimages. And before closing this one, we might want to send an update to wikidata@ to let user know about all the issues this should fix.
[13:31:46] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[13:32:06] <mpham>	 gehel: catching up now. when will reimaging for T280382 become an emergency?
[13:32:07] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[13:32:28] <zpapierski>	 gehel: agreed - I lack the energy to get to the bottom of the merge commits analysis, let's wait for gitlab with that
[13:32:39] <mpham>	 Are we still having issues with WCQS?
[13:32:49] <zpapierski>	 service is down
[13:32:58] <zpapierski>	 and will be until probably Wednesday
[13:33:07] <zpapierski>	 nothing we can do about it
[13:33:14] <mpham>	 ok. did we announce yet?
[13:33:29] <zpapierski>	 yep, I sent a message
[13:34:15] <zpapierski>	 I didn't want to wait - it takes a long time and due to disk space, there was already a data corruption
[13:34:15] <mpham>	 zpapierski: mailing list? weird, i don't see it yet
[13:34:33] <gehel>	 zpapierski: I moved the ticket about sonar comments
[13:34:53] <gehel>	 mpham: it won't ever be an emergency, this is a test server
[13:35:23] <gehel>	 we just need to remember to reimage it when we can
[13:36:55] <gehel>	 this also reminded me that we haven't followed very closely with our users on the tests of skolemnized blank nodes. We need to end this test at some point. Maybe the best is to keep that server exposed until we move the streaming updater (and skolemized blank nodes) to production
[13:37:28] <zpapierski>	 mpham: commons-l?
[13:37:44] <gehel>	 It always seems that we're almost there, but it's been a few month of being almost there. So maybe it make sense to already push this skolemnization forward.
[13:37:46] <zpapierski>	 I didn;t send to wikidata this time, after we were scolded for crossmailing
[13:38:09] <mpham>	 ah, ok. i guess i'm not on the commons-l. 
[13:38:22] <gehel>	 zpapierski: the previous time, we were scolded for sending it only to wikidata and not commons :)
[13:39:04] <mpham>	 gehel: i think it makes sense to push it forward too
[13:39:22] <zpapierski>	 ah, you're correct
[13:39:23] <gehel>	 mpham: should we return T262265 to the backlog? I think we should focus on finishing the query completion work before starting work on WCQS
[13:39:23] <stashbot>	 T262265: Provide real-time updates for WCQS - https://phabricator.wikimedia.org/T262265
[13:39:49] <gehel>	 zpapierski: can you cross post it to commons@ as well? With an appology for the cross post?
[13:39:57] <zpapierski>	 sure
[13:40:07] <zpapierski>	 you mean to wikidata?
[13:40:15] <zpapierski>	 I already posted this to commons-l
[13:41:21] <mpham>	 it sounds like we're realistically not going to get to WCQS at all this quarter, so it makes sense to push it back to backlog in favor of finishing out other planned work
[13:41:27] <gehel>	 zpapierski: yes, of course!
[13:41:57] <gehel>	 zpapierski, dcausse: given the conversations around k8s, I think that T280166 is mostly done. Could you check?
[13:41:58] <stashbot>	 T280166: Investigate using session cluster for Flink - https://phabricator.wikimedia.org/T280166
[13:43:33] <zpapierski>	 mostly, we wait for the SRE review
[13:43:41] <gehel>	 mpham: I moved that WCQS ticket back to the SDAW column on the backlog board
[13:43:59] * gehel should review this current workboard more often!
[13:44:20] <zpapierski>	 we all probably should :(
[13:50:54] <mpham>	 ok thanks
[14:08:53] <gehel>	 Fire fighters intervention on the way back from Oscar's school. I might be delayed a bit.
[14:15:24] <zpapierski>	 dcausse: I'm writing a test for query completion api, but I wonder - how should I make sure I have some relevant data there?  
[14:15:44] <zpapierski>	 as in - is there some standard for that, or do I need another api to feed the completion index?
[14:19:15] <dcausse>	 we can expose an API just for testing that can trigger data ingestion
[14:20:38] <dcausse>	 see $wgAPIModules['cirrus-suggest-index'] = 'CirrusSearch\Api\SuggestIndex'; in tests/jenkins/Jenkins.php this is exposing an API that rebuilds the title completion suggester
[14:20:59] <dcausse>	 and can then be controller via integration tests
[14:22:18] <dcausse>	 any objections to add Aisha to analytics-search so that she can access airflow?
[14:29:45] <zpapierski>	 thx, will look into that
[14:30:30] <dcausse>	 basically writing a maint script that populates the index based on some data file might be ok
[15:01:17] <gehel>	 ryankemper: spritn planning: https://meet.google.com/qho-jyqp-qos
[16:12:37] <ebernhardson>	 huh, turns out the jar you upload to archiva and the one it gives you back have different hashes? Don't remember that happening before, but just did a maven release for glent, copied the .jar from target/ but it doesn't match what archiva is holding
[16:15:43] <ryankemper>	 hmm, I wouldn't expect that either...weird
[16:18:35] <ryankemper>	 gehel / dcausse: catching up on the `wdqs1009` discussion above. gehel you mentioned it's probably easier to just wait - do you mean wait till after we go live with the streaming updater and no longer need wdqs1009 to be a snowflake?
[16:18:58] <ryankemper>	 it seems like we could re-image 1009 with minimal disruption...basically a day of it being out of commission while we transfer its journal to 1010, reimage 1009, then transfer back
[16:19:39] <gehel>	 ryankemper: either way works for me. Since that server is exposed publicly, we want to announce that down time in advance.
[16:19:46] <gehel>	 ryankemper: your call!
[16:19:47] <dcausse>	 ryankemper: I'm fine with whatever is easier for you
[16:19:57] <gehel>	 dinner time, back later
[16:21:59] <ryankemper>	 okay I'll e-mail the wikidata mailing list and schedule that work for tomorrow (will let them know that query-preview will be unavailable during the maintenance window)
[16:23:02] <dcausse>	 thanks!
[16:24:22] <dcausse>	 dinner
[16:27:59] * ebernhardson tries to remember if there is any remaining reason wikimedia/discovery/analytics is deployed to stat1007
[16:48:05] <absorto>	 hey! quick question, when searching with integers only, is stemming off? e.g. 202
[16:49:00] <ebernhardson>	 absorto: typically numbers shouldn't be stemmed, so yes? Is there a more specific problem
[16:49:14] <absorto>	 I would expect it to find 2020 or 2021, but it seems I can only get this when suffixing the query with the tilde
[16:50:05] <ebernhardson>	 absorto: ahh, thats not really stemming. Stemming means dogs->dog or similar. Numbers are found as exact tokens
[16:50:39] <ebernhardson>	 absorto: you can do something like 202* which will expand 202 into ~1k random terms that start with 202
[16:53:31] <absorto>	 ebernhardson: Uhm, I see. Yes, the wildcard would work perfectly, but my client doesn't think that this is practical. Anyway to configure this behavior (other than adding the tilde with JS)?
[16:55:00] <ebernhardson>	 absorto: not really, the ~ and * and such are things that happen inside the lucene query parsing engine, we just provide the string
[16:57:25] <ebernhardson>	 in a more specialized use case something would extract the data into a specific field and range query on that, but that's not something particularly easy to do. So for example some pattern matching could extract everything \d{4} or some such from text into it's own field
[16:57:42] <Trey314159>	 absorto: there is also some normalization, so that  ²２𝟚₂ is the same as 2222. If you want to limit results to mostly 2020 and/or 2021, you could try `insource:/202[01]/` —but it's an expensive query that will time out on large wikis unless you add some regular search terms, like `insource:/202[01]/ NBA`. You will get some false positives from the regex like "20120204234030" in a URL, but it's close.
[16:58:13] <absorto>	 ebernhardson: ok, got it. thank you for your answers :-)
[16:59:09] <ryankemper>	 ebernhardson: lots of mw logs for `Pool error on CirrusSearch-Search:_elasticsearch:  pool-queuefull`, any ideas on what'd be causing this? maybe the sane-itizer?
[16:59:25] <ebernhardson>	 ryankemper: shouldn't be saneitizer, usually means a big spike in traffic
[16:59:39] <ebernhardson>	 elasticsearch percentiles dashboard has traffic data
[16:59:50] <absorto>	 Good idea, Trey314159. Is it possible to intercept the normalization? Or you mean doing the insource:// directly in the query? 
[17:00:25] <ebernhardson>	 p95 spiked about 9 hours ago, still increasing. p50 spiked at same time but is decreasing
[17:01:14] <ebernhardson>	 flultext qps is about double normal, ~1k qps vs 500 last week
[17:02:45] <ebernhardson>	 ryankemper: yea, poking at the percentiles dashboard (qps by type at bottom in particular), the problem seems to be a bot of some sort, spiked traffic from 650qps to 1800qps for a few hours, then went away and came back at a lower rate thats tapering off
[17:03:26] <ryankemper>	 ebernhardson: yeah seeing that too. definitely seems like the full text is the culprit, lines up nearly perfectly with the poolcounter spike
[17:04:45] <ebernhardson>	 not really sure what we do. Can do some analysis and try and figure out some signature of the requests, but who knows if that goes anywhere
[17:04:52] <Trey314159>	 absorto: you can't really stop (some of) the normalization. It even occurs when you use quotes—you can think of it as "advanced lowercasing". Regex searching does search for exact characters, but doesn't pay attention to word boundaries like regular searching does. And I was suggesting using the insource regex in the query, like so: https://en.wikipedia.org/w/index.php?search=insource%3A%2F202%5B01%5D%2F+NBA
[17:05:19] <Trey314159>	 If you have more details on your use case, I/we can try to help you find a query that works better for what you are tring to do.
[17:08:26] <ryankemper>	 ebernhardson: sanity check...are allow/reject being plotted on different scales in this panel? https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=2&orgId=1&from=1622945262851&to=1623085183053  the numbers of allows is higher than rejects - as expected - but visually it looks like rejects is a higher number
[17:09:00] <ebernhardson>	 ryankemper: "stacked" means A sits on top of B, and the total height is the total req's received by everything
[17:09:05] <absorto>	 Trey314159: got it. the use case is quite broad, in short, users want the same behavior they get on the likes of Word or Acrobat Reader, meaning that searching for 202 they would get matches for 2020, 2021, etc. Since using regex is not user friendly and forcing insource for all queries to catter for this need would be very expensive, I don't think there is much options left.
[17:09:36] <ebernhardson>	 ryankemper: so, the bottom to the first line is successes, and between the first line and the top line is the failures
[17:09:45] <ryankemper>	 gah...stacked gets me again
[17:10:05] <ebernhardson>	 it's never clear what the right way to visualize things is .... too many options
[17:11:02] <ryankemper>	 stacked and that other setting that just connects nulls with a line are the bane of my existence :P
[17:11:29] <Trey314159>	 absorto: ahh, like searching on a page in your browser. Got it. Unfortunately, fulltext search doesn't really do that very well.
[17:12:42] <ryankemper>	 Anyway back to the important stuff...yeah so the poolcounter is doing its job of protecting how much we're slamming mediawiki. but on the other hand if I just let it work itself out it's not guaranteed that whoever this actor is is going to stop flooding us
[17:13:03] <absorto>	 Yes, exactly, Trey314159. Bummer, but expected, I imagine it is because the scale is very different on each of these cases. Thanks a lot! 
[17:13:25] <Trey314159>	 No problem! (And yeah, the scale is a big part of it!)
[17:13:37] <ebernhardson>	 ryankemper: indeed, but we only have very broad tools to deal with it. I think we can either create ip blocks, or maybe user agent blocks if the bot was nice enough. We could also be able to contact the bot, but I don't find a ton of success there
[17:13:53] <ryankemper>	 Right
[17:14:51] <ryankemper>	 Well looking at the actual number of rejects, it doesn't feel like a super concerning number of rejections
[17:15:06] <ebernhardson>	 it means average users are getting failures though
[17:15:34] <ryankemper>	 Well...actually at current levels it's 10% of all requests so that is concerning actually (if it persists)
[17:15:39] <ryankemper>	 Yeah
[17:18:11] <ryankemper>	 ebernhardson: what's the difference between `entity_full_text` and `entity` in https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=1623006099108&to=1623085934837?
[17:18:22] <ryankemper>	 er sorry `entity_full_text` and `full_text`
[17:18:39] <ebernhardson>	 ryankemper: entity means wikidata, basically
[17:19:00] <ryankemper>	 ah, ofc
[17:19:26] <ryankemper>	 so full_text spiked up 2x but `entity_full_text` is something like 30x where it was before this spike...although it's not clear how much it's contributing in absolute numbers
[17:19:57] <ryankemper>	 However in the previous spike we had 2x `full_text` but no increase in `entity_full_text` and we didn't see any rejections
[17:20:09] <ryankemper>	 so I would wager that it is the `entity_full_text` that's really tipping things over the edge
[17:20:17] <ebernhardson>	 seems plausible
[17:21:02] <ryankemper>	 ebernhardson: I'm really rusty on the relationship between wikidata and elasticsearch...what wikidata-related stuff do we actually index?
[17:22:29] <ebernhardson>	 ryankemper: they are a wiki page, we index them like any other wiki page. The difference is wikidata has it's own query builder separate from the typical one. Mostly i think this handling is around how we choose which languages to search based on the user language, iirc
[17:22:53] <ebernhardson>	 (because wikidata has labels/description separated into ~300 separate language fields)
[17:28:37] <ebernhardson>	 hmm, with them dominating the entity_full_text logs should be easier to identify. A random hour earlier in the month has ~34k reqs/hr, 10-11 this morning had 1.1M. See if there are any obvious tells...
[17:29:31] <ryankemper>	 ebernhardson: what's the best way to trace the source requests? do we ship logs of the requests themselves to logstash?
[17:30:07] <ebernhardson>	 ryankemper: every cirrus search request generates a `mediawiki_cirrussearch_request` event, these are accessible in hadoop 
[17:30:10] <ryankemper>	 (if so presumably we could make a visualization of the source IP and/or user agent of the `entity_full_text` request logs and hopefully identify the source)
[17:30:31] <ebernhardson>	 those logs can be processed and loaded into relforge, from which visualizations could at least in theory be made
[17:31:24] <ryankemper>	 okay so it'd take some wrangling
[17:31:35] <ebernhardson>	 some basic groupby's, these requests are all coming from a browser that is identified as HeadlessChrome
[17:31:59] <ebernhardson>	 (sadly it seems these do not contain actual user agents we could use for blocking)
[17:32:32] <ryankemper>	 that's more evidence for the bot theory
[17:32:54] <ryankemper>	 ebernhardson: what host are you poking around on? I need to learn from your bash history :P
[17:33:56] <ebernhardson>	 ryankemper: any analytics host that can run pyspark, stat1007.eqiad.wmnet is a common one. Most people probably wouldn't do it how i do though. I think you could use the sql lab in superset: https://superset.wikimedia.org/superset/sqllab/
[17:34:19] <ebernhardson>	 ryankemper: the table of interest is event.mediawiki_cirrussearch_request
[17:34:54] <ottomata>	 not following at all but it sounds like you are trying to troubelshoot some stuff that is happenign now?  the data there lags by a few hours
[17:35:12] <ebernhardson>	 ottomata: not really troubleshoot, but have a strategy to block users that send a thousand reqs/sec
[17:35:22] <ebernhardson>	 ottomata: they've been sending them for ~9 hours, so its in there :)
[17:35:26] <ottomata>	 :)
[17:35:26] <ottomata>	 ok
[17:37:02] <ryankemper>	 `DB engine error` -> `(MySQLdb._exceptions.OperationalError) (2005, "Unknown MySQL server host 'analytics-slave.eqiad.wmnet' (-2)") (Background on this error at: http://sqlalche.me/e/13/e3q8)` from superset...I'll try it on a host instead
[17:37:12] <ryankemper>	 first things first tho, will do a quick update in #wikimedia-operations
[17:38:03] <ebernhardson>	 ryankemper: i guess you don't get bash history that way, but mostly i did equiv of: select browser_family, count(1) from event.mediawiki_cirrussearch_request where year=2021 and month=6 and day=7 and hour=10 and array_contains(elasticsearch_requests.query_type, 'entity_full_text') group by user_agent_map['browser_family'] as browser_family
[17:38:33] <ebernhardson>	 but slightly different because i'm using a programatic api instead of direct sql
[17:49:14] <ryankemper>	 ebernhardson: very unfamiliar with spark, if I want to do raw SQL does it make sense to do `spark2-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64`? Running that fails to connect to the metastore: `21/06/07 17:48:19 INFO metastore: Trying to connect to metastore with URI thrift://analytics-hive.eqiad.wmnet:9083`
[17:49:51] <ebernhardson>	 ryankemper: hmm, you probably need to `kinit` first
[17:50:19] <ebernhardson>	 ryankemper: at least, that would be my guess.  Otherwise that should all be plausible.  If i want to run spark i just run `/usr/lib/spark2/bin/pyspark --master yarn`
[17:51:56] <ebernhardson>	 limiting by ip address doesn't look like it would help much. Top ip has ~10k reqs/hr and declines from there. Suggests the bot is running on a cluster of some ort
[17:52:20] <ryankemper>	 ah looks like there's some sort of kerberos-related onboarding I'll need to do in the future because my user doesn't have creds
[17:52:40] <ebernhardson>	 yea seems possible, kerberos is separate afaik
[17:53:02] <ryankemper>	 what about that makes you think it's running on a cluster? just the pattern of "tons of requests that decline over time"?
[17:54:38] <ebernhardson>	 fwiw the top ip's resolve to  ipv6.gae.googleusercontent.com
[17:54:54] <ebernhardson>	 ryankemper: 1M reqs per hour, but top ip has 10k reqs
[17:55:02] <ebernhardson>	 ryankemper: so, reqs must be coming from 100+ ip addresses
[17:55:39] <ebernhardson>	 (using a very naive filter that if entity_full_text increased from 30k/hr to 1.1M/hr that "everything" is basically the bot in there)
[17:56:05] <ryankemper>	 ah I see, yeah that makes sense
[17:56:30] <ryankemper>	 very thoughtful of them to not make it trivially traceable to a single IP :P
[17:57:09] <ryankemper>	 FWIW looking at the `qps` chart again it looks like the number of `*full_text` is linearly declining...but at a rate that will take ~8 hours for the `entity_full_text` to go back to almost zero (assuming it really is linear)
[17:57:23] <ebernhardson>	 i wonder how identifiable clouds are at request time...probably not particularly easy but if we had a function isCloudIpAddress(...) we could give them their own limited pool counter that allows a few hundred req/s but chops them off without bothering normal users
[17:58:33] <ryankemper>	 that'd be interesting
[17:59:06] <ryankemper>	 so...given that these are split across a ton of IPs, that makes actually taking action (esp. w/o the user agent) exceedingly difficult
[17:59:37] <ebernhardson>	 hmm, can maybe lookup the useragent using these ip's in the webrequest log, sec
[18:00:40] <ebernhardson>	 probably: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/91.0.4472.77 Safari/537.36"
[18:00:56] <ebernhardson>	 So, no information about who
[18:02:29] <ryankemper>	 why would all that applewebkit / safari stuff be popping up in a headlesschrome ua?
[18:02:45] <ebernhardson>	 I'm not sure how bad it would be, but at a high level of the 1.1M reqs on the hour i'm looking at, basically all of them came from 2600:1900:2000:* which seems to be gae
[18:03:03] <ebernhardson>	 ryankemper: user agents all pretend to be each other. It's the way :)
[18:03:51] <ryankemper>	 haha
[18:05:17] <ryankemper>	 Heh https://stackoverflow.com/a/1114297 is a great simple explanation of that
[18:06:23] <ryankemper>	 ebernhardson: So if we just nuked `2600:1900:2000:*` does that seem likely to catch a lot of innocents as well?
[18:06:40] <ryankemper>	 I guess one crappy heuristic: how many requests don't match that in that same hour?
[18:10:15] <ebernhardson>	 hmm, for HeadlessChrome all but 2 requests that hour match the ip pattern. for all entity_full_text about 40k req's would have made it out of 1.15M
[18:13:39] <ryankemper>	 per grafana we were at somewhere between 10 and 18 `entity_full_text` per second before the spike, which works out to [36000, 64800] reqs per hour
[18:14:16] <ryankemper>	 (trying to see if that just under 40k number lines up with the kind of volume we saw before)
[18:14:19] <ebernhardson>	 yea, the 40k is about what we'd expect
[18:14:31] <ryankemper>	 tricky because i'm obv looking at the qps from several hours ago...but at least at a high level the numbers seem roughly what we'd expect
[18:16:13] <ryankemper>	 from a utilitarian perspective, accidentally banning 4k legitimate requests/hour would be the breakeven point (given 10-12% of requests being rejected currently)...although in this case it'd always be the same innocent IPs getting blocked which would be a lot worse than just a random 10% in any given window
[18:16:49] <ryankemper>	 oh I guess that also depends on how granular the poolcounter is...does it just reject 'cirrussearch requests' period or, say, 'entity_full_text' requests
[18:17:40] <ebernhardson>	 well, with poolcounter we have a named pool and have to decide which pool requests go to. For the most part we only have pools by request type, so regex has a tiny specialized pool and more_like is super heavy and gets its own pool, most everything else is bulked together into one thing
[18:17:58] <ebernhardson>	 so we would have to write some php to decide if a given request should be given a special named pool
[18:19:46] <ebernhardson>	 at a high level, the pool is just a named cluster-wide semaphore 
[18:19:59] <ryankemper>	 one way to think of it is we'd be banning a net of ~175 `entity_full_text` qps (at current volume) in exchange for getting back 1000 poolcounterrequests/sec
[18:20:57] <ryankemper>	 so we come out way ahead in raw number of requests, and probably have a much more favorable profile in terms of who "deserves" to have their requests get thru (ie punishing the offender)
[18:21:47] <ryankemper>	 but it also kind of depends if the bot owner is going to notice that their application has been spitting exceptions for several hours (once they get banned) and turn it off, because otherwise if we lift the block we get hammered again
[18:22:47] <ebernhardson>	 I'd be tempted to ban the combintion of 2600:1900:2000:* combined with User-Agent including HeadlessChrome to stop the pain for normal users, but thats not something we can leave permentnatly in place
[18:23:45] <ebernhardson>	 Putting in place some sort of bulk-requests pool counter also seems like a reasonable thing to add, shouldn't be too hard and gives us a place to add checks in the future 
[18:24:58] <ebernhardson>	 checking with traffic (in #mediawiki_security), it seems the normal place for ban hammer would be there
[18:27:56] <ebernhardson>	 on the other hand, the overall request rate is declining and they might be mostly done before we get much in place
[18:29:42] <ryankemper>	 yeah in the medium term putting a bulk-requests pool counter in place would help, but not for this specific incident so we should probably look at that as something to put in place soon but not now
[18:30:40] <ryankemper>	 banning the `combination of 2600:1900:2000:* && User-Agent including HeadlessChrome` at the varnish-level as a stop-gap would definitely help now, but if we don't intervene we'll probably be back to normal throughput in several hours (assuming they don't start up some new task which is not guaranteed ofc)
[19:37:53] <ebernhardson>	 hmm, CirrusSearch master doesn't pass CI. Looks like something with the node 12 upgrade
[20:07:04] * ebernhardson worries thats going to mean a cindy rebuild as well...
[20:23:02] <ebernhardson>	 well, we can probably avoid fixing today. But going to be another fun project to migrate that suite to node 12 (it's on 10 which is EOL a month ago)
[21:09:58] <Trey314159>	 I have just re-discovered the node 12 problem that Erik found. I know we collectively need to upgrade stuff, but isn't there a less traumatic way of doing it? Ugh.
[21:11:14] <ebernhardson>	 Trey314159: i poked releng about it, james was in a meeting but will look at it in a bit. I suspect we can still run the node 10 job for a bit
[21:11:31] <Trey314159>	 Cool. Thanks, Erik!