[07:20:49] I'm afraid the answer is "grep"
[07:21:19] (typically combined with cumin to grep on all the frontends at once and collect the results)
[12:42:40] Amir1: the schema change on s5 is done (except DC masters), shall I move to another section?
[12:46:48] s3 maybe?
[12:52:24] ok
[13:24:07] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1180&var-port=9104&viewPanel=3&from=now-6h&to=now
[13:24:23] https://usercontent.irccloud-cdn.com/file/StlAKYK5/grafik.png
[13:24:39] Probably that maint script is running every three hours or something, it's too clean
[13:25:55] looking at 2 days, it's indeed repeating
[13:26:19] with the first peak at https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1180&var-port=9104&viewPanel=3&from=1744265297778&to=1744267095738
[13:26:37] oof https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1180&var-port=9104&viewPanel=3&from=now-24h&to=now
[13:27:01] The peak was probably on other replicas, until we set vslow in s6 so it was picking a random replica every time
[13:27:04] We will see if the slow query log catches up
[13:27:15] s/up/it
[13:27:58] in the meantime I ran --check on auto_schema on s3 but haven't started it
[13:28:52] Fri 2025-04-11 15:15:00 UTC 1h 49min left Fri 2025-04-11 12:15:00 UTC 1h 10min ago mediawiki_job_growthexperiments-updateMenteeData-s6.timer mediawiki_job_growthexper
[13:28:58] Was gonna say
[13:29:14] that's the only maint script in the timer list that matches and runs every three hours
[13:29:20] that's the only job that repeats every 3h
[13:29:31] at 15 past
[13:29:42] and the time of the s6 run matches down to the minute
[13:30:07] it's running all shards at the same time btw
[13:30:14] s/shards/sections/
[13:30:28] so idk what's up with s6 that makes it more expensive there
[13:30:30] that should be fine, I have a feeling it's something specific to one of the s6 wikis
[13:30:46] Maybe a missing index 😭
[13:31:02] or it could be too many mentors
[13:31:10] or a bot that is a mentor
[13:33:55] for comparison, vslow of s2: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1182&var-port=9104&viewPanel=3&from=1744294159813&to=1744378323933
[13:34:06] it's cute. 300K vs 3B
[13:34:26] aww, it goes up to even 1.2M
[13:35:31] sorry I can't read, 3M for s6
[13:36:45] still it's quite weird, s6 is the smallest or second smallest core section
[13:37:23] you really have to try to be able to break it
[15:08:12] so. The next one is bound to happen in ten to twenty minutes
[15:16:47] yup, something is destroying it. even show full processlist is hanging right now.
[15:20:55] kill the script, see if it recovers?
[15:21:03] That'd be definitive proof
[15:25:25] > SELECT /* GrowthExperiments\MentorDashboard\MenteeOverview\UncachedMenteeOverviewDataProvider::getFilteredMenteesForMentor */ user_id,user_editcount > 0 AS `has_edits` FROM `user` WHERE user_id IN ()
[15:25:41] claime: it'll show up in slow log
[15:26:10] Ouch, WHERE IN
[15:27:08] Rows_examined: 90507125
[15:27:13] and Rows_examined: 33718395
[15:27:38] honestly, disable the script until next week
[15:28:06] it's not critical to the functioning of the site
[15:34:41] Only for s6?
[15:34:45] Or for everything?
[15:34:57] Can you give me a task to link to as well?
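
For reference, the query quoted at 15:25:25 looks roughly like the sketch below once the list is filled in; the IN() list was elided in the paste, so the IDs here are placeholders, and the EXPLAIN is a hypothetical diagnostic rather than something run in this log. On a replica it would show whether the lookup stays on the user table's primary key or whether an oversized mentee list is what drives the multi-million Rows_examined figures quoted above.

    -- Hedged sketch: placeholder IDs stand in for the elided IN() list.
    EXPLAIN
    SELECT /* UncachedMenteeOverviewDataProvider::getFilteredMenteesForMentor */
        user_id,
        user_editcount > 0 AS `has_edits`
    FROM `user`
    WHERE user_id IN (1, 2, 3);
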
[15:40:44] hah, you already did a nuclear option patch x)
[15:40:45] claime: I struggle to find things in gerrit all the time https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135983
[15:41:00] other sections are not that much better either
[15:42:25] Rows_examined: 704534479
[15:42:30] That's 700 million
[15:50:25] Amir1: I'm cc'ing urbanec.m and michael on the patch so they're aware
[15:50:31] if you're ok with that
[15:53:46] and +1'd
[15:54:09] sure, they are at an offsite, so it'll take some time
[16:11:36] Yeah it's just for awareness, not requiring signoff
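
As an aside on the Rows_examined figures above: if slow-query logging were directed to the mysql.slow_log table (log_output = 'TABLE' is an assumption here; production replicas most likely log to a file and get digested with separate tooling), the worst offenders on a given host could be pulled with something like the sketch below.

    -- Hedged sketch, assuming log_output = 'TABLE'; adjust the threshold as needed.
    SELECT start_time, db, query_time, rows_examined,
           LEFT(sql_text, 120) AS query_head
    FROM mysql.slow_log
    WHERE rows_examined > 10000000
    ORDER BY rows_examined DESC
    LIMIT 10;
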