[00:13:29] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:16:41] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:51:09] Going to switch over the es1, es2 and es3 masters (RO sections)
[06:52:06] are es4, es5 hosts 10.6?
[06:52:53] jynus: there is one 10.6 host in es4 and none in es5
[06:53:03] The es1, es2 and es3 switchovers are just needed for kernel reboots
[06:53:29] because I have been having issues with backups lately
[06:53:52] only es1022 is 10.6
[06:53:58] what sort of issues?
[06:54:39] for example, yesterday all 4 backups on both eqiad and codfw failed at the same time, a few minutes after 0 hours (when it starts)
[06:55:05] jynus: That's weird, that host has been there for weeks though
[06:55:21] then it is not related to 10.6
[06:55:21] So es4 has 1022 and 2022 with 10.6
[06:55:27] the rest are 10.4
[06:55:34] es5 has zero 10.6 hosts
[06:55:43] but it is weird that all failed, especially on both dcs at around the same time
[06:55:55] network stuff perhaps?
[06:55:57] the hosts also didn't show anything special
[06:56:25] network seems weird on both dcs at the same time
[06:56:37] but both hosts too
[06:56:51] was it at the time of the codfw power outage?
[06:57:03] no, it was later
[07:22:31] I have sent you a reminder invitation
[07:22:50] got it
[07:22:58] this is in case a truck runs over me, so we remember to re-enable backups for es
[07:23:39] they have been disabled because they are running now, and if we leave them on the original schedule they will back up the same run from last week, so I disable now and re-enable tomorrow
[07:26:27] XDDDDDD
[07:26:34] I hope I don't need to use that invitation
[08:07:03] backups failed again, this time, for now, only on codfw
[08:07:22] mmm 10.6?
[08:08:01] one at least
[08:11:10] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=es2022&var-port=9104&from=1655880489262&to=1655880988695
[08:12:56] nothing in the logs at first glance
[08:12:59] crashes, reboots, etc.
[08:15:36] maybe, for some reason, the db gets overloaded if recently restarted or something
[08:15:48] the hit ratio was low, but that is expected for a dump
[08:16:00] it was rebooted a week ago for a kernel upgrade
[08:18:07] or maybe it is hitting some kind of corruption or hw failure
[08:18:36] nothing in dmesg
[08:19:31] so no worries on the backup side (other than it being annoying), but wondering if this is another data point regarding 10.6
[08:20:24] after all, the error is the same (connection lost/unavailable) and this is a very slow stress test
[08:20:57] But that host has been there for weeks
[08:21:20] Since May 26th from what I can see
[08:21:37] I am speaking facts, not saying they make a lot of sense :-/
[08:22:00] yeah :(
[08:26:30] on the good side, we are getting free memory from the space vacuum: https://phabricator.wikimedia.org/P29958
[08:26:54] -2GB for background processes!
[08:28:35] and that is not our custom sys, it is p_s
[08:29:05] I suggest one of the first things to try is to disable p_s, I doubt MariaDB tests with p_s=on :-(
[08:29:43] you mean for https://phabricator.wikimedia.org/T311106
[08:29:44] ?
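A minimal sketch of the p_s check being discussed above, assuming memory instrumentation is enabled on one of the affected 10.6 hosts (the invocation is illustrative, not the exact procedure used here). The per-event accounting lives in performance_schema, and sorting ascending surfaces negative byte counts like the -2GB seen in P29958; note that performance_schema itself is not a dynamic variable, so disabling it for a test means a config change plus a restart of mariadbd.

    # Inspect p_s memory accounting; negative values sort first with ASC.
    sudo mysql -e "
      SELECT event_name, current_number_of_bytes_used
        FROM performance_schema.memory_summary_global_by_event_name
       ORDER BY current_number_of_bytes_used ASC
       LIMIT 10;"
    # Disabling p_s is not dynamic: set performance_schema = OFF under
    # [mysqld] in the server config and restart the mariadbd service.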
[08:29:51] yeah
[08:29:56] Good point
[08:30:00] I have one doubt though
[08:30:20] Not sure how to proceed with all the tests, i.e. we disable p_s, and should I then repool the hosts and wait to see their behaviour?
[08:30:24] negative memory counts should be a trivial bug, but who knows what else could be broken there
[08:30:30] Otherwise we can do many things and end up not knowing which one it was
[08:30:40] marostegui: I would first keep things the same to reproduce the issue
[08:31:06] then I am trying to see if this bug is reported on jira
[08:31:07] Let me add the suggestion to the task for now
[08:33:05] We should probably have a small meeting with Amir.1 once he's back next Monday to see how we can try to reproduce this by simulating MW
[08:33:31] the implementation ticket: https://jira.mariadb.org/browse/MDEV-16431
[08:33:45] and: https://jira.mariadb.org/browse/MDEV-6114
[08:33:46] haha simon!
[08:33:48] new in 10.5
[08:34:01] but lots of comments regarding memory leaks/increased usage
[08:34:09] I think we contained that by tuning the defaults
[08:34:36] I have noticed weird memory issues though
[08:34:39] But of course who knows
[08:34:57] https://jira.mariadb.org/browse/MDEV-23936
[08:35:39] "it has a reproducible (on Windows, or maybe generally with threadpool)"
[08:36:02] p_s and the pool were the 2 things I suspected first
[08:36:17] I never suspected p_s
[08:36:37] in any case, vote for that issue if you can
[08:36:57] Done!
[08:36:59] memory instrumentation is useful, but not when it returns negative numbers
[08:37:03] I can try to ping valerii as well
[08:37:09] And see what he saw
[08:37:49] I will add a comment
[08:38:00] +1
[08:38:11] For now I have depooled all 10.6 hosts from s1, s4, s7 and s8
[08:38:13] is 10.6.7 the latest version of that branch?
[08:38:20] We are running 10.6.8
[08:38:23] 10.6.8
[08:38:26] ah
[08:38:28] well, that
[08:38:47] that is the latest
[08:43:42] sorry, I should have been more explicit
[08:44:05] I meant to test with p_s disabled as one possible factor
[08:44:19] not necessarily to disable things now
[08:45:02] that and the pool of threads, even if losing both would be a huge loss for us
[08:47:37] e.g. the last time they changed the pool of threads (10.0?) we had to tune a buffer that was too low
[08:47:38] yeah yeah
[08:47:45] I am not changing anything now :)
[09:13:22] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:26:00] I will share a google doc later today to collect potential projects for Q1 (OKR planning etc) - just as simple bullets for now, no OKR language yet.
[09:26:15] to start off that discussion for next week
[09:27:15] that reminds me I need to fill out the other doc, question_mark hehe
[09:27:17] I will note that
[09:27:25] :)
[09:27:56] yeah that's helpful for looking a bit further out and knowing what will come after
[09:29:05] question_mark: was that a spreadsheet? I cannot find it
[09:29:20] yes
[09:29:28] let me link you to the doc actually, it has a link to the spreadsheet as well
[09:29:38] ah cool
[09:29:41] thanks
[09:31:07] sent
[09:31:16] can you tell Filippo is sitting in front of me? ;)
[09:33:39] riccardo is here as well, shall I give him some db work? should prevent him from getting more rusty ;)
[09:34:08] lol
[09:34:36] question_mark: tell him we are hiring!
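On the 08:38 depool of the 10.6 candidates from s1, s4, s7 and s8: a rough sketch of the usual dbctl steps, assuming the standard workflow; the host name below is a placeholder and the exact invocation should be checked against dbctl's own help.

    # Depool one 10.6 replica (placeholder host name) and commit so the
    # change reaches MediaWiki.
    dbctl instance db1163 depool
    dbctl config commit -m "Depool db1163 (10.6) while investigating T311106"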
[09:34:45] * Emperor is looking forward to the day when swift nodes reboot first time reliably
[09:35:13] Emperor: how could you possibly be certain with a single day ;)
[09:35:21] within
[09:35:50] I have some robust "disaster recovery" testing plans ;-p
[10:20:58] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:02:41] db1124 restarted recently, maybe? Icinga: CRIT: read_only: "True", expected "False" (disabled alerts due to being a test host)
[13:02:54] ah yes
[13:02:57] I did it this morning
[13:03:01] fixing it now
[13:03:03] thanks for the heads up
[13:03:13] done
[13:03:14] no big deal
[13:03:41] I have to check db* alerts as sometimes (quite often) they are backup sources and I am breaking them
[13:19:37] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:36:49] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:06:29] Hey, https://gerrit.wikimedia.org/r/c/operations/puppet/+/800739 got merged but there are issues with the column still being in the view. Is this just analytics needing to run maintain-views?
[15:08:15] RhinosF1: it needs to run maintain-views
[15:08:25] do you have a wiki for me to check?
[15:08:30] marostegui: altwiki
[15:08:47] that's s3?
[15:08:53] https://phabricator.wikimedia.org/T311148
[15:09:03] checking
[15:09:23] s5
[15:09:24] checking
[15:09:41] mmm it works on clouddb1021:3315
[15:09:43] let me check the others
[15:09:59] enwiki's view looks updated
[15:10:11] so not sure if it got only half done
[15:10:13] yeah, 1020 is missing
[15:10:22] let me fix that
[15:10:29] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:12:50] RhinosF1: fixed and closed the task: https://phabricator.wikimedia.org/T311148#8020199
[15:12:53] thanks for the heads up
[15:14:04] marostegui: no problem, thanks for looking so quickly!
[15:24:11] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:59:20] le sigh, it's always about this time of day that one node needs seemingly-endless reboots :(
[16:00:08] I'm not going to get to any more swift hosts for T310483 beyond what I've updated on there just now. Hopefully ms-be2063 will get its drives up in the right order before my taxi shows up...
[16:31:10] OK, this is taking the mickey. After a dozen reboots with sdb and sda coming up swapped, I gave in and re-labelled them the way they kept appearing. Since doing that, they've come back the other way round on every reboot.
[16:36:05] Ah, we may finally be there...
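On the wiki replicas view fix at 15:08-15:12 above: the views get regenerated by running maintain-views on the clouddb host that was missed. A hedged sketch only; the flag name is recalled from the wikitech docs and may differ, so treat it as an assumption to verify on the host.

    # On clouddb1020, regenerate the views for the affected wiki only.
    # The --databases flag is an assumption; check maintain-views --help.
    sudo maintain-views --databases altwiki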
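And on the sda/sdb shuffle just above: kernel device names are not guaranteed to be stable across reboots, so one generic way to see which physical disk landed on which name after each boot is to compare persistent identifiers rather than kernel names; a sketch, not the exact checks used on ms-be2063.

    # Map kernel names to stable identifiers so a swapped sda/sdb is
    # obvious regardless of enumeration order.
    lsblk -o NAME,SERIAL,WWN,SIZE,MOUNTPOINT
    ls -l /dev/disk/by-path/   # ties each device node to its controller slot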
[16:37:11] Emperor: I don't think the Swift nodes want you to finish on time
[17:14:16] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:13:22] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:10:03] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:23:47] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:20:31] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:19:41] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
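The swift_ring_manager unit on thanos-fe1001 kept flapping for the rest of the day; a generic sketch of how its state could be inspected on the host, assuming it is a timer-driven service as the wikitech link suggests (the .service/.timer suffixes are assumptions):

    # Check the timer, the unit state, and the recent logs for the failures.
    systemctl list-timers 'swift_ring_manager*'
    systemctl status swift_ring_manager.service
    journalctl -u swift_ring_manager.service --since today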