[01:47:09] 👍 thanks
[05:10:09] mszabo: Google purged almost all usage of pcre in Google-controlled code and replaced it with re2 where possible.
[05:10:48] ori: unsurprising :) it'd probably make sense for AF at least, where both the regexes and the input are user-controlled
[05:11:33] the only re2 binding for php that I found is more than a decade old, so I wrote a basic one, it's not a huge task
[05:22:06] a nice hackathon project would be to track the cost of each filter and expose it to editors via Special:AbuseFilter
[05:22:37] AF is responsible for a big chunk of edit latency IIRC
[05:26:24] it's already measured but not displayed in a central location
[05:26:30] only on per-filter pages
[05:28:45] TIL!
[05:30:47] "Of the last 7,003 actions, this filter has matched 0 (0%). On average, its run time is 0.41 ms, and it consumes 1 condition of the condition limit."
[05:31:09] this is actually the filter from T385395 heh
[05:31:10] T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395
[05:31:15] seems benign enough...
[05:42:49] https://github.com/mszabo-wikia/php-re2 pushed an incomplete, messy implementation
[06:22:41] mszabo: I somehow forgot that you switched over to the Dark Side and now work for the WMF. They're (and we're, as Wikimedians) lucky to have you.
[06:28:38] What's your reasoning for wanting RE2 for AF? What does worst-case behavior look like for PCRE on PHP 7? Don't the recursion and backtracking limits effectively prevent it from being an abuse vector?
[12:44:32] ori: great question. I don't necessarily see AF itself as an abuse vector here, since it's a tool limited to trusted functionaries; it's more that the combination of user-written regexes and user-generated content can be an explosive combo, as we just witnessed.
[12:46:25] The problem I see here is that the backtrack limit (pcre.backtrack_limit) needs to be at least as large as the input string. In our case, the article wikitext is 558161 bytes, but catastrophic backtracking already occurs with pcre.backtrack_limit=400000
[12:46:58] the current backtrack_limit in production is 1M, and it has historically been higher due to larger inputs in different contexts - e.g. T201184
[12:46:58] T201184: CirrusSearch jobs sometimes fail with "RemexHtml\Tokenizer\Tokenizer: pcre.backtrack_limit exhausted" - https://phabricator.wikimedia.org/T201184
[12:49:02] for regexes that we control, it's easier to audit and change them as needed, but that becomes difficult with hundreds of regexes across disparate local wiki AFs, so it might be worth using an engine there that offers guaranteed linear-time execution
[12:51:39] https://phabricator.wikimedia.org/phame/post/view/64/laughing_ores_to_death_with_regular_expressions_and_fake_threads/ was another case of a pcre-driven incident, which spawned T173574 (although that doesn't seem to have come to fruition)
[12:51:39] T173574: [Investigate] Non-backtracking regex parsers - https://phabricator.wikimedia.org/T173574
[16:55:27] Daimona found the task that I was thinking of in my first response: https://phabricator.wikimedia.org/T240884
[17:47:20] WDYM, "but catastrophic backtracking already occurs with pcre.backtrack_limit=400000"?
[17:58:33] ori: if I run php -dpcre.backtrack_limit=400000 re.php, where re.php just tries to match the problem pattern against the article wikitext, it still times out
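
The last message describes a standalone re.php script used to reproduce the problem. Below is a minimal sketch of what such a script could look like; the pattern and subject are stand-ins (the actual filter regex and article wikitext from T385395 are not included in the log). Normally pcre.backtrack_limit turns runaway backtracking into PREG_BACKTRACK_LIMIT_ERROR rather than a timeout, which is what this sketch demonstrates; in the incident discussed above, the match reportedly timed out instead.

```php
<?php
// re.php (illustrative sketch, not the actual repro from the log).
// Run with a lowered limit, e.g.:
//   php -d pcre.backtrack_limit=400000 re.php
//
// The pattern below is a textbook catastrophic-backtracking case
// (nested quantifiers) and the subject is a long near-match, standing
// in for the real filter regex and the 558161-byte article wikitext.

$subject = str_repeat('a', 5000) . 'X';   // stand-in for the article wikitext
$pattern = '/^(a+)+$/';                   // stand-in for the problem filter regex

$start   = microtime(true);
$result  = preg_match($pattern, $subject);
$elapsed = microtime(true) - $start;

if ($result === false && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    // Expected outcome with a working backtrack limit: the engine gives up
    // quickly and reports the limit, instead of burning CPU indefinitely.
    printf("backtrack limit exhausted after %.3f s\n", $elapsed);
} else {
    printf("match result %s after %.3f s\n", var_export($result, true), $elapsed);
}
```

An RE2-based engine, as proposed for AbuseFilter above, would avoid this failure mode entirely: matching cost stays linear in the input size regardless of how the pattern is written, at the cost of dropping backreferences and some other PCRE features.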