[05:11:38] I am switching s6 codfw master
[05:11:43] sanitarium master that is
[05:11:49] So I am stopping codfw master
[05:19:52] PROBLEM - Check unit status of swift_ring_manager on ms-fe1009 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:33:25] s6 done, going for s7 now
[06:14:42] RECOVERY - Check unit status of swift_ring_manager on ms-fe1009 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:02:36] jynus: when would it be a good day to stop db2078 (misc multi instance) for a couple of hours to clone db2160?
[07:02:51] any time now
[07:02:56] \o/
[07:02:56] until monday
[07:02:58] Doing it now then
[07:05:01] Amir1: db1132 fully repooled with P_S disabled, all yours for the test
[07:05:54] marostegui: awesome, going to depool it first :P
[07:06:04] :)
[07:06:13] Yeah, I pooled it to get it warmed up
[07:19:35] marostegui: https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/superbusy-without-p_s.106.svg
[07:23:56] Going to compare that with the one from yesterday
[07:25:02] So there's a bit of an improvement without P_S
[07:27:48] We can maybe try to tune the thread pool a bit?
[07:27:52] Thoughts?
[07:28:00] above my paygrade
[07:28:05] XDDD
[07:28:34] Amir1: On 10.4 you were never able to generate those errors after 700 connections?
[07:28:47] marostegui: I didn't try, I can
[07:28:55] Amir1: let's try and see if you can, yeah
[07:29:01] my biggest worry is: is it fine to depool two hosts?
[07:29:09] Amir1: yeah, it is ok for a bit
[07:29:15] we can repool both of them later
[07:29:17] ok
[07:29:19] I want to see how 10.4 behaves
[07:34:01] marostegui: good news, it went kaboom after seven hundred
[07:34:10] \o/
[07:34:12] Yay!!
[07:34:24] \o\ |o| /o/
[07:36:16] Amir1: ok, can you repool both hosts?
[07:36:24] sure
[07:36:29] let me make the fancy svg
[07:36:51] What I am going to do is disable P_S on the other 10.6 hosts (s4 and s7) and leave it for a few days to see if it gets reproduced with one of the "cache scenarios", if you know what I mean
[07:37:06] wink wink
[07:37:09] XD
[07:37:19] As I am sure it won't take long
[07:37:37] And s8 as well
[07:39:11] https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/superbusy-with-p_s.104.svg
[07:40:20] They're pretty similar
[07:46:39] Amir1: to confirm, you are repooling both hosts in s1?
[07:46:46] yup
[07:46:50] great
[08:48:21] I am trying to think: if we end up having to disable p_s on 10.6... that's not great, but I guess we'd need to go back to pt-kill :(
[09:10:06] if there is something you will need from me soon, please speak up before my vacation... otherwise I will focus on monitoring improvements rather than starting one of the other big projects
[09:11:37] jynus: thoughts on the regression?
[09:12:32] my guess is it is either pool-of-connections tracing or memory tracing, which is what is new in p_s since 10.5
[09:13:18] disabling parts of p_s or tuning the pool of connections to avoid that shouldn't be too hard - when I first implemented p_s I had to tune it a lot due to our memory size
[09:13:20] I guess I could try disabling the thread pool, but I prefer to go for p_s for now (as it is less impactful, I guess)
[09:13:41] disabling p_s is navigating blind, I don't like it
[09:13:53] what do you suggest?
[09:14:03] disabling only the memory parts
[09:14:11] let me see
[09:15:07] or the threads
[09:16:04] so either performance-schema-instrument='memory/%=OFF'
[09:16:53] or thread_instrumentation | NO
[09:17:05] I will try the first one then
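(Not part of the log, just a hedged aside: before choosing between the two options above, the current state of the memory instrumentation can be inspected through the standard P_S tables, assuming they are exposed on the 10.6 hosts. A minimal SQL sketch:)

    -- Sketch only: how many memory instruments are currently enabled.
    SELECT COUNT(*) AS total,
           SUM(ENABLED = 'YES') AS enabled
      FROM performance_schema.setup_instruments
     WHERE NAME LIKE 'memory/%';

    -- Sketch only: which events P_S attributes the most memory to right now.
    SELECT EVENT_NAME, CURRENT_NUMBER_OF_BYTES_USED
      FROM performance_schema.memory_summary_global_by_event_name
     ORDER BY CURRENT_NUMBER_OF_BYTES_USED DESC
     LIMIT 10;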
[09:17:20] I think seeing P_S as the cause is a great step, but not the last one
[09:18:53] note I didn't review any of the P_S initial tuning in the last 2 upgrades, there may be a lot to adapt there
[09:21:07] ok, disabled it on db1132 with: performance-schema-instrument='memory/%=OFF' in my.cnf
[09:21:45] e.g. obviously I don't know if that will work, but if it does while keeping the rest of P_S it would be great!
[09:21:52] yeah definitely
[09:22:02] as we would keep process and query performance information
[09:22:16] I am repooling it to warm it up and then I will ask Amir1 to run the test and see what happens
[09:22:22] If not we can also go for thread_instrumentation
[09:22:51] memory and thread instrumentation are what would make sense to me for thread contention
[09:23:01] Yeah
[09:23:05] Hopefully it is one of those
[09:23:40] also because we are not in a hurry - if we were fully on 10.6 now, we'd take the drastic approach
[09:24:08] as we have time -I think-, we can tune and test 0:-), and wait for the upgrade
[09:24:23] which BTW also made the bullseye + 10.4 upgrade a win
[09:24:38] that was a brilliant idea of yours, marostegui
[09:25:02] haha whatever we can do to work less!
[09:25:15] yeah, but imagine - we would be in a pretty bad situation right now
[09:25:19] yeah
[09:25:28] We'd have combined many variables
[09:25:34] We could even think it could be bullseye
[09:28:28] and sorry to be the "conservative" person in the gang, but I come from the backups (risk avoidance) perspective - only upgrade when we are very sure things work as expected :-P
[09:29:18] :)
[09:32:43] on a related note, it was clear and still is clear with things like MDEV-23936 that not all parts of mariadb receive the same attention :-(
[09:33:06] Yeah, I am thinking about pinging someone
[09:33:44] ps was never recommended because "performance"
[09:34:09] I was pinged. I'm afk, so unless you know a way to ssh to production from a phone (with yubikey support), I can't do it atm
[09:34:18] (surprising, real-time tracing has an impact on performance)
[09:34:32] Amir1: not a rush!
[09:35:00] jynus: fwiw mw does sampling for perf measuring
[09:35:17] so does P_S!
[09:36:00] my complaint is that if mariadb ignores a functionality, of course it gets spoilt!
[09:36:23] (but later they want to boast about it in the marketing materials)
[10:00:35] It is crazy how big otrs is
[10:02:54] T138915
[10:02:55] T138915: OTRS database is "too large" - https://phabricator.wikimedia.org/T138915
[10:03:29] https://i.imgflip.com/6lxa3s.jpg
[10:04:58] XDD
[10:05:02] yeah I am aware of that ticket
[11:49:22] I am going to get some air between meetings, take a break/lunch outside
[13:10:25] Amir1: you can go for db1132 (it is pooled and warm)
[13:11:11] sure
[13:27:49] marostegui: https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/superbusy-with-p_s.mem.106.svg
[13:28:14] it died with 600 but took longer. Maybe some sort of "hanging connection"?
[13:28:26] otherwise the connection should close and be done
[13:28:41] we can also try disabling thread_instrumentation as jaime suggested
[13:29:10] I will do that after our meeting
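(Hedged aside, not something done in the log: the same memory instruments can also be switched off at runtime instead of through my.cnf. A minimal sketch follows; the usual caveat is that runtime changes to memory instruments only apply to allocations made after the change, which is presumably why the my.cnf + restart route was used on db1132.)

    -- Sketch only: runtime equivalent of performance-schema-instrument='memory/%=OFF'.
    -- Only affects memory allocated after the change; restarting with the my.cnf
    -- setting is the more thorough option.
    UPDATE performance_schema.setup_instruments
       SET ENABLED = 'NO'
     WHERE NAME LIKE 'memory/%';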
[14:31:35] root@db1132.eqiad.wmnet[(none)]> UPDATE performance_schema.setup_consumers
[14:31:35]     -> SET ENABLED = 'NO' WHERE NAME LIKE 'thread_instrumentation';
[14:31:35] Query OK, 1 row affected (0.000 sec)
[14:31:35] Rows matched: 1  Changed: 1  Warnings: 0
[14:31:41] Amir1: can you hit db1132 again? ^
[14:31:58] doesn't it need a restart?
[14:34:59] Doesn't look like it
[14:35:04] | thread_instrumentation | NO |
[14:36:56] awesome
[14:37:43] yeah, enabling/disabling it fully requires a restart, but most instrumentation tuning is dynamic
[14:42:57] died again :(
[14:45:56] 600 too?
[14:46:27] yeah
[14:46:31] (if you don't mind - it doesn't have to be today - could you track all the tests done on the ticket (apologies if already done)? That way we don't lose track of them 0:-)
[14:46:42] Yeah, I was just writing that
[14:46:54] ah, ok sorry, didn't mean to pressure you
[14:46:54] 600 only? I was hoping it would be over 9000 !
[14:47:19] and with that useless comment, I am gonna show myself the door ;-)
[14:47:20] it's not even really 600 concurrent threads, it's lower
[14:48:08] akosiaris: "Amir1, what does the concurrency say about mariadb's regression": https://www.youtube.com/watch?v=SiMHTK15Pik
[14:48:10] all from cumin1001, maybe we are making a mess in the network because of that
[14:48:53] aaaah, I got the reference now
[14:48:57] lol
[14:50:20] https://phabricator.wikimedia.org/T311106#8055788
[14:50:21] akosiaris: could I get 5 minutes of your time later this week for a rubber duck session about bacula?
[14:50:23] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1132&var-port=9104&from=now-1h&to=now&viewPanel=10
[14:50:49] jynus: can we say next week, Monday maybe, instead? This is a tough week for me.
[14:51:29] it won't be anything weird; I have some doubts and need to talk with someone about them, and who better :-D
[14:51:53] also permission to break it, but we won't talk about that
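(Closing aside on the db1132 tests above: the harness used from cumin1001 to open the ~600-700 connections is not shown in the log, but the server side of such a run could be watched, and the dynamic consumer change rolled back afterwards, with something like the sketch below. It uses standard status counters and the setup_consumers table; the Threadpool_ counters only appear when the thread pool is in use.)

    -- Sketch only: thread and thread pool counters to watch while ramping up
    -- connections (covers Threads_connected, Threads_running, Threadpool_threads, ...).
    SHOW GLOBAL STATUS LIKE 'Thread%';

    -- Sketch only: re-enable the consumer once testing is finished
    -- (dynamic, like the disable above).
    UPDATE performance_schema.setup_consumers
       SET ENABLED = 'YES'
     WHERE NAME = 'thread_instrumentation';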