[02:27:04] hi! is there a known reason search would be slower while we are running out of codfw?
[02:28:56] We are investigating a ~200ms slowdown in a feature that depends on search. It started exactly a week after the DC switchover; the timing doesn't correlate with anything else, but some search results are cached for a week so that would be
[02:29:06] ...a possible explanation.
[05:08:30] tgr_: hmm, that is odd. I glanced at https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&refresh=1m and confirmed there's no overall slowdown for codfw, so it does seem that the slowdown is likely specific to whatever type of query that feature is relying upon
[05:10:15] that 1 week timing is indeed suspicious. although if the performance decreases due to a relevant item dropping out of the cache, one would think it'd get stuck back in the cache basically right away (or more likely that it would never drop out of the cache in the first place, if we're talking about an unchanging query or something)
[05:10:43] perhaps dcausse can take a look today, or e.bernhardson once US morning rolls around
[05:10:55] The queries are a combination of hasrecommendation:, articletopic: and template exclusion. The cache is on the client side.
[05:12:31] I'm not at all sure this is related to ES, I was just asking in case you were aware of something going on. That's the only thing I could think of in terms of timing. But I'll see if we have more direct metrics.
[05:13:14] Also you are right, the cache delay explanation doesn't really make sense.
[07:25:38] tgr_: I vaguely remember a very similar perf issue you or Kosta reported when cross-DC was first enabled (trying to find the ticket)
[07:33:32] that was around when we rolled out to es7 in addition to enabling MW cross-DC, a lot of moving pieces. it was T317187, unsure if it's similar
[07:33:33] T317187: GrowthExperiments Special:Homepage: investigate performance regression since September 6 2022 - https://phabricator.wikimedia.org/T317187
[07:50:48] sigh.. commons is using the wikidata query... https://commons.wikimedia.org/w/api.php?action=query&format=xml&list=search&requestid=de&srlimit=10&sroffset=0&srsearch=Klemke,+ulrich&srwhat=text&cirrusDumpQuery
[07:53:49] this is failing with "Parse error on Cannot search on field [labels.en] since it is not indexed"
[11:18:22] lunch
[14:56:04] \o
[14:57:22] hmm, shouldn't commons have gotten the new field? I wonder what i did wrong during reindexing... i guess i only checked the log files for failures but didn't actually check all the indices for the expected new mappings/fields
[15:02:10] cindy also seems to be very tedious recently... wonder what's made it more flakey
[15:08:27] o/
[15:08:45] ebernhardson: for commons we flip labels with descriptions, so that's why :/
[15:09:18] but looks like nobody has complained yet, search on ns=0 only seems very rare
[15:10:49] for cindy I was planning to finally stop being lazy and look into it after removing its vote for the third time :)
[15:10:50] oh! yea that makes sense
[15:57:28] cindy is hammered by spam :(
[15:58:07] the cleanup script might not be working tho...
[15:58:10] https://cirrustest-cirrus-integ02.wmcloud.org/w/index.php?search=insource%3A%22a%22+insource%3A%2Fb+c%2F+-rashidun&title=Special%3ASearch&fulltext=Search
[15:58:57] I wonder if we can restrict page creation to some local ips
[15:59:39] or remove the webproxy perhaps?
[15:59:42] hmm, i saw a test fail in a similar way a week or two ago
[16:00:36] i suppose i found the proxy useful, the test suite prints out the api queries when it fails and i repeat them to see if the results made it there later, or what was off with the results / page sources / etc.
[16:00:47] yes me too
[16:00:57] there must be a way we can limit edits to localhost though, hmm
[16:01:07] hm... looks like we always use Admin so we could disable anon edits?
[16:01:30] hmm, yea perhaps we can disable registration. Do we also set a custom password?
[16:02:32] probably https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_editing_by_all_non-sysop_users and https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_account_creation
[16:02:50] well, i guess if only sysops can edit then we don't need to care about account creation
[16:03:35] these are IP edits, will test $wgGroupPermissions['*']['edit'] = false; first and see
[16:05:03] wait no, mailbox is full of "Someone, probably you, from IP address 192.168.122.1, has registered an account 'EdnaMarko83717' with this email address on commons."
[16:09:33] ok, went with the most restrictive options: no account creation and sysop-only edits
[16:10:23] ebernhardson: no objection to dropping the old cindy instance (cirrus-integ)?
[16:10:42] dcausse: oh i thought we dropped that some time ago, certainly it can be deleted
[16:35:06] sigh... of course this breaks tests that want the page creation link to exist :(
[16:37:24] hmm, i guess we need to ensure it's always logged in when running queries?
[16:39:58] we do login for api based tests but probably not for browser ones
[16:49:24] hmm, my csrf tokens in airflow 2 seem to time out really quickly, but the docs claim the default should be 3600s which isn't so quick. Not seeing anything in our custom config that changes that default... curious
[19:15:32] ebernhardson: out of curiosity, why do you need csrf tokens with airflow 2?
[19:16:32] dcausse: the user interface uses them, for example i open the variable create dialog, then poke around in some files and the old airflow 1 ui to decide the right dates. When submitting the form it says i have no valid csrf token
[19:17:15] similarly if i tab into a dag that i previously opened and attempt to turn the dag on, the dag turns itself back off with no indication of why. Reloading the page i can turn it on/off, assuming that's also an expired csrf
[19:18:07] oh weird, never had issues with our airflow 1 instance and did not browse much on our airflow 2 instance...
[19:19:58] dcausse: airflow 1 has a bit of a bug... they were supposed to set WTF_ENABLE_CSRF, but they actually set ENABLE_CSRF, so it was never turned on
[19:20:21] (wtf = wtforms, not the other obvious acronym :)
[19:20:32] https://github.com/apache/airflow/issues/8915
[19:22:16] lol how can you name constants like that :)
[19:23:02] yea, i did a bit of a double take and had to see what flask.wtf was :)
[19:23:48] yes I'm on their page... the logo kind of makes sense tho :)
[19:27:29] meh, that was not smart to put a broken fix for cindy before a 6-patch chain...
[19:28:46] :)
[19:36:14] something about spark3 is slower ... the extract_general_subgraphs step of subgraph_metrics_weekly took 30 min for its last run, i'm re-running that same run in the new airflow instance and it's been running over an hour now, just the step it's currently on (a count() call) is up to 43 min
[19:36:41] it's not a tiny count, but also not insane: ~5GB per partition * 200 partitions
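To narrow down the Spark 3 slowdown described above, one cheap first check is to dump the shuffle-related settings of the running session and compare them against the configuration of the old 30-minute run. This is only a sketch and assumes nothing about the real subgraph_metrics_weekly job; the keys are standard Spark configuration properties.

```python
# Sketch: print shuffle-related settings of the active Spark 3 session so they
# can be diffed against the configuration of the older, faster run. Defaults
# for some of these differ between Spark versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for key in (
    "spark.sql.shuffle.partitions",   # width of the shuffle behind the count
    "spark.sql.adaptive.enabled",     # AQE can change partitioning in Spark 3
    "spark.shuffle.service.enabled",  # external shuffle service on or off
    "spark.executor.memory",
    "spark.executor.instances",
):
    print(key, "=", spark.conf.get(key, "<unset>"))
```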
[19:37:10] gc time isn't terrible, suggests it's not memory pressure
[19:37:22] maybe something to do with the shuffle service
[20:04:59] hmm, reading the code this should be super simple... doesn't make sense. It basically reads triples from a table and does a count distinct() (over ~14B rows). Big, sure, but nothing earth-shattering
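For reference, the slow step described above boils down to roughly the shape below. The table and column names are placeholders, not the real extract_general_subgraphs schema; this is only a sketch of a count distinct over a triples table.

```python
# Minimal sketch of the job shape: read triples from a table and count the
# distinct rows. Table and column names are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("triple_count_sketch").getOrCreate()

# Placeholder source; the real job reads its own triples table.
triples = spark.read.table("some_db.triples").select("subject", "predicate", "object")

# distinct() forces a full shuffle of every row before the count, which is
# where shuffle-service or partitioning differences between Spark versions
# would show up as wall-clock time.
print(triples.distinct().count())
```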