[08:12:13] serviceops, WMF-JobQueue, Wikimedia-production-error: Make changeprop-jobqueue error handling/httpbb tests better behaved: Uncaught Error: Class 'MWExceptionHandler' not found in /srv/mediawiki/rpc/RunSingleJob.php:42 - https://phabricator.wikimedia.org/T352265 (matmarex) Open→Resolved https:...
[09:38:16] serviceops, Prod-Kubernetes, Data-Platform-SRE (23/24 Q3 Milestone 1), Kubernetes: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (Gehel) a:brouberol
[10:29:42] serviceops, Data-Platform-SRE (23/24 Q3 Milestone 1), Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (elukey) +1, looks good (IIUC the new estimation are similar from the original ballpark figures, if not...
[10:50:45] Something happened to parsoid during the night https://grafana.wikimedia.org/goto/7F2IUHOIk?orgId=1
[10:51:25] Trying to work out what, there was a cxserver alert around that time
[10:52:09] A restbase flap as well
[10:56:25] and kibana is crashing my firefox
[10:56:29] oh this is going to be fun
[11:03:00] redis timeouts during Excimer calls, ok
[11:10:19] Ah no, looks like it's poolcounter timeouts
[11:11:48] uh...exactly at 03:00 - that feels odd
[11:12:01] yep
[11:12:27] nothing in SAL
[11:12:35] no, I saw nothing either
[11:14:54] poolcounters definitely are being more utilized as far as number of sockets, but network traffic has been halved
[11:15:33] "Parsoid PHP Production All Events" on the parsoid-php dashboard is elevated since then as well - but I'm not exactly sure what that counts
[11:15:51] jayme: absolutely every log event from PHP prod
[11:15:59] s/PHP/parsoid/
[11:16:02] ah
[11:16:26] It's basically this search https://logstash.wikimedia.org/goto/965336e63cffd48e4882d2b6d84c1d5b
[11:16:56] sorry, this search without the timeout filter
[11:17:46] the timeout log is full of timeouts for rest calls
[11:18:07] ah, that's what you said already :) sorry
[11:18:58] yeah actually I wanna see something
[11:21:27] jayme: lmao out of all the messages from parsoid since 0300, 1,727,384 hits for url:"/w/rest.php/commons.wikimedia.org/v3/page/pagebundle/File%3ABrezina_-_Brunelli"
[11:21:33] 2,336,605 hits total
[11:22:03] And it's got page numbers higher than 1500
[11:22:53] yeah, it has 2186
[11:23:34] last edit was 3:16
[11:30:08] if you except that URL we are at a rather normal rate of messages
[11:31:21] problem is it still has a ton of pages to parse
[11:33:36] I'm pretty ignorant about pagebundle...is this parsing all of the "pages" in background via some job?
[11:33:59] (e.g. that's why thumbnails are missing for a bunch of them on the commons page?)
[11:35:12] So am I
[11:39:20] ihurbain, nemo-yiannis, can you maybe enlighten us?
[11:39:43] AIUI there is this bot https://commons.wikimedia.org/wiki/User:SchlurcherBot adding metadata to all of the pages which probably causes parsoid to fetch the pagebundle for all of them
[11:40:00] serviceops, Prod-Kubernetes, Data-Platform-SRE (23/24 Q3 Milestone 1), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) I've taken a quick stab at creating a chart...
[11:40:01] but I don't understand why that is particularly slow
[11:49:54] looks like we had the bot blocked in 2021 already because it was causing overload
[11:49:56] https://commons.wikimedia.org/w/index.php?title=Special:Log/block&page=User%3ASchlurcherBot
[11:53:29] I need to run - biab
[11:54:49] ack, I'll try to find more info in the meantime, but I'm not sure I can without a better understanding of pagebundle. We'll see
[12:24:17] https://logstash.wikimedia.org/goto/6a62aed2fba4423e325eca398e2f1aae
[12:24:22] cc akosiaris ^
[12:25:19] The bot doesn't do that many requests but this little requests has triggered 2M log lines in the same timeframe
[12:27:49] interesting
[12:28:59] I wasn't expecting a detailed hardware description of where the bot runs when I clicked that link
[12:30:34] And not find the UA? Same.
[12:31:35] Although now that I look at it more closely, there are a few Alexa UA requests to the api for the same content group (Brezina Brunelli something)
[13:21:13] serviceops, Prod-Kubernetes, Data-Platform-SRE (23/24 Q3 Milestone 1), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (JMeybohm) Cool. I think we could/should deploy this vi...
[13:29:17] claime do we have any logs that show that the same page was accessed/rendered before ?
[13:29:38] i am trying to figure out if it's related to last deployment or not
[13:32:15] nemo-yiannis: checking how far back I can go
[13:32:29] serviceops, Prod-Kubernetes, Data-Platform-SRE (23/24 Q3 Milestone 1), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) Good call, I didn't think of that. Would yo...
[13:36:23] nemo-yiannis: It's been accessed by pageviews bot sporadically, and once that I can see by a non-bot user in November
[13:36:39] well, accessed, I have matching URIs
[13:37:11] The non-bot access is the Global usage for one of the pages
[13:37:54] So it probably hasn't been accessed actually ever in the last 6 months
[13:58:57] serviceops, Prod-Kubernetes, Data-Platform-SRE (23/24 Q3 Milestone 1), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) A thought: do we want to enable egress to s...
[14:34:25] serviceops, Prod-Kubernetes, Data-Platform-SRE (23/24 Q3 Milestone 1), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (JMeybohm) We want charts to explicitly define the serv...
[15:02:22] serviceops, Prod-Kubernetes, Data-Platform-SRE (23/24 Q3 Milestone 1), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) Alright! I just thought I'd asked.
[16:40:16] serviceops, Prod-Kubernetes, Data-Platform-SRE (23/24 Q3 Milestone 1), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) > I'd prefer to have something to review an...
[16:47:33] serviceops, MediaWiki-DjVu, Shellbox, Structured-Data-Backlog, and 3 others: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 (thcipriani) >>! In T352515#9415206, @Clement_Goubert wrote: > We've moved the affected job (AssembleUploadChunks) ba...
[17:06:20] serviceops, MediaWiki-DjVu, Shellbox, Structured-Data-Backlog, and 4 others: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 (Clement_Goubert) >>! In T352515#9421751, @thcipriani wrote: >>>! In T352515#9415206, @Clement_Goubert wrote: >> We'v...