[01:32:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[05:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[06:07:20] hello! trying to finally upgrade toolsdb to 10.4, some questions: 1) looks like mysql_upgrade is run after starting up mariadb.service, correct? 2) how can I start up mariadb.service without it trying to automatically start replication?
[06:08:19] taavi: 1) you'd need to run it after starting it, yes. 2) systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
[06:40:57] weird, I get an SSH soft icinga error, but the host looks good
[06:41:27] maybe a package update?
[06:42:07] jynus: which host?
[06:42:26] db1176
[06:43:18] [2022-06-29 06:37:04] SERVICE ALERT: db1176;SSH;CRITICAL;SOFT;1;connect to address 10.64.0.143 and port 22: Connection refused
[06:43:30] [2022-06-29 06:39:30] SERVICE ALERT: db1176;SSH;OK;SOFT;2;SSH OK
[06:43:32] works for me too indeed
[06:44:06] I have something
[06:44:11] "Jun 29 06:37:03 db1176 sshd[851]: Received signal 15; terminating"
[06:44:24] does it match some puppet run?
[06:44:37] 15 is sigint, right? Normal termination?
[06:45:03] Signal 15 is a SIGTERM (see "kill -l" for a complete list). It's the way most programs are gracefully terminated, and is relatively normal behaviour.
[06:45:08] weird
[06:45:18] that
[06:46:21] mmm who pooled db1132?
[06:46:38] 21:37 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1132 (T298560)', diff saved to https://phabricator.wikimedia.org/P30595 and previous config saved to /var/cache/conftool/dbconfig/20220628-213735-ladsgroup.json
[06:46:38] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[06:46:41] Depooling it again
[06:46:43] It is 10.6
[06:46:49] "Started Auto restart job: ssh."
[06:46:57] on daemon.log
[06:47:13] so I think it is expected, I just hadn't noticed it before
[06:48:28] Amir1: btw, toolsdb seems to be catching up on replication much faster than last time. not sure if what you did is behind that, but if it is, thank you!
[06:49:39] OnCalendar=Mon,Tue,Wed,Thu,Fri *-*-* 6:37:00
[06:49:59] that is why I haven't usually noticed it
[07:08:07] one mystery I have at this point is why the new replica only has the default root grants and nothing else
[07:08:24] how was it cloned?
[07:09:01] stopped mariadb on the other replica and rsynced the entire data directory over
[07:09:27] including the mysql directory?
[07:09:34] select * from mysql.user shows nothing?
[07:10:48] I assume you mean the mysql directory inside the datadir, yes that was included
[07:11:33] select * from mysql.user shows a few grants for localhost but nothing else https://phabricator.wikimedia.org/P30599
[07:11:59] The mysql_upgrade script?
[07:12:03] Did you run it?
[07:12:10] yes
[07:12:53] The table has the same definition as on the old host?
[07:14:34] no, a regular table on the old one and a view on mysql.global_priv on the new one
[07:14:48] and the content?
[07:15:18] worst case, just run a pt-show-grants on the old one and copy the content to the new one
[07:15:21] but that's very strange
[07:15:48] content of what?
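For reference, a minimal sketch of the two procedures being discussed: bringing MariaDB up with replication disabled for the mysql_upgrade run (as suggested at 06:08), and re-creating the grants from the old replica with pt-show-grants (07:15). OLD_HOST/NEW_HOST are placeholders, and mariadb.service is assumed to pass $MYSQLD_OPTS through to the daemon as the suggestion above implies.

    # Sketch only: start MariaDB without auto-starting replication, run
    # mysql_upgrade, then clear the environment override again.
    sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
    sudo systemctl start mariadb.service
    sudo mysql_upgrade
    sudo systemctl unset-environment MYSQLD_OPTS

    # Worst-case fallback for the missing grants: dump them from the old
    # replica and replay them on the new one (placeholder host names).
    pt-show-grants --host OLD_HOST > grants.sql
    mysql --host NEW_HOST < grants.sql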
[07:16:23] yeah, we can re-create the grants if needed, I'm just worried that other tables could have mysteriously disappeared too
[07:17:08] taavi: a select * from mysql.user on the old host
[07:17:45] "Last snapshot for s1 at codfw (db2141) taken on 2022-06-28 23:09:59 is 1069 GiB, but the previous one was 1134 GiB, a change of -5.8 %"
[07:17:55] it shows all the users correctly
[07:19:35] taavi: that's strange indeed
[07:20:03] taavi: it was a 10.1 -> 10.4 migration?
[07:20:08] yeah
[07:20:28] straight from stretch to bullseye, if that matters at all
[07:20:30] So the only thing I can think of is https://jira.mariadb.org/browse/MDEV-22645
[07:20:44] But those only affect roles
[07:20:56] At least that's what I saw when I reported it
[07:21:15] And only when doing a dump
[09:17:54] taavi: I asked a couple of users to improve their code. Some did. Fingers crossed it was useful :)
[09:22:10] did you see my s1 size shrinkage? I guess that was you? If yes, kudos
[09:23:55] jynus: probably the schema change Amir1 is doing in s1, yep
[09:25:44] weird, because dumps didn't seem to be affected - not sure if they just hadn't been altered by then or if it mostly affected raw size (e.g. indexes)
[09:26:07] jynus: the only one I remember is changing rev_timestamp to binary(14)
[09:26:18] it should shave off a couple of gigabytes but not much
[09:26:30] maybe then it caused defragmentation
[09:27:00] that could explain it, too
[09:27:10] probably. The templatelinks drop will start soon though, that'll be fun
[09:27:16] <3
[09:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[09:35:26] marostegui: I'm around for T311106 :D
[09:35:26] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[09:35:32] let me know when you have time
[09:36:02] Amir1: sure, one sec to change it
[09:36:39] Ah, it needs restarting, so we should wait till the host is a bit warmed up
[09:36:56] noted
[09:37:18] I will restart it and ping you a bit later
[09:37:21] Will probably pool it a bit
[09:37:24] To get it warmed up
[09:37:41] Host restarted now with ps disabled
[09:37:49] going to pool it with some weight
[09:39:07] you understood ps wasn't my #1 suspect now, right :-P
[09:39:38] (the thread pool is, because of how wide it appeared on 10.6 vs 10.4 on the graphs)
[10:32:44] Amir1: and of course now the new replica that had its replication started like an hour later isn't even able to keep up properly :/
[10:49:26] taavi: :/ check the binlog and see if any user is making large writes
[10:49:58] jynus: btw, do you have a way to check the size change of x1 backups in the past two months?
[10:50:32] sure, I would even have a dashboard link, but not on production
[10:50:44] want me to give you a table from SQL?
[10:50:55] anything would work for me
[10:51:04] total size only?
[10:51:13] yup
[10:54:10] https://phabricator.wikimedia.org/P30611
[10:55:01] marostegui: I was wondering, to make the experiment fully controlled (as much as possible), we could repool the 10.6 host and run perf on it alongside a 10.4 host with the same weight in the same section. We can depool and continue the experiment in different ways as well
[10:55:13] (I can even try it on s7, which is pooled I think?)
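A rough way to act on the "check the binlog" suggestion at 10:49 — decode recent binlog events and count row events per table; on toolsdb the database name (e.g. s12345__toolname) usually points at the tool doing the writes. The binlog path/file name and timestamp below are placeholders, not taken from the log, and this assumes row-based events (with statement-based binlogs you can read the statements directly).

    # Sketch: summarise which tables are receiving the most row events.
    sudo mysqlbinlog --base64-output=decode-rows -vv \
        --start-datetime='2022-06-29 10:00:00' /srv/path/to/binlog.012345 \
      | grep -E '^### (INSERT INTO|UPDATE|DELETE FROM)' \
      | awk '{print $2, $NF}' | sort | uniq -c | sort -rn | head -20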
[10:55:24] jynus: thanks
[10:55:36] Amir1: note the comment I just added
[10:55:44] Amir1: Yep, agreed
[10:55:53] I will repool db1132 in a bit with its full weight
[10:56:00] (I am afk for a bit now)
[10:56:07] And then we can run the test, and leave 10.6 depooled again
[10:56:30] I just want to see the impact without ps
[10:56:37] As that is easy to test and observe
[10:57:02] jynus: thanks. Interesting, I thought it got much better but it still needs cleanup
[10:57:11] marostegui: noted. Let me know once you're done
[10:57:21] Amir1: which host did you pick yesterday for 10.4?
[10:57:24] db1134 was it?
[10:57:29] (So I can mimic the hosts)
[10:57:31] db1135 I think
[10:57:38] based on dumps vs snapshots, I would guess some defragmentation could be needed
[10:57:40] ok, let me see its weight
[10:57:51] but they are basically the same
[10:57:52] 200
[10:58:15] Yeah, let me repool db1132 now with %
[10:58:20] jynus: yeah, I will also optimize it soon
[10:59:05] Amir1: Just started its repool slowly
[10:59:10] Should be done in 1h
[10:59:23] thanks!
[11:00:58] Amir1: hopefully by next quarter I can get you those on grafana
[11:01:22] awesome
[11:12:14] Amir1: I am probably doing https://phabricator.wikimedia.org/T311522 on Tuesday
[11:12:19] If the DIMM replacement goes fine today
[11:12:31] So for s3 and s4 let's try not to pick next Tuesday :)
[11:12:36] sure
[11:12:44] I will let you know for sure tomorrow
[11:17:39] marostegui: I won't add anything Tuesday, I'm not in a rush for these switchovers
[11:18:29] sounds good
[12:46:49] Amir1: Feel free to give db1132 some load
[12:46:52] Anytime you like
[13:03:01] marostegui: sorry, I was in a meeting, let me check
[13:03:12] no rush
[13:11:01] I'm running the perf now without adding extra load (it's already getting traffic)
[13:11:09] sweet
[13:11:15] let's see how the graphs look
[13:11:19] Once done
[13:12:52] https://www.irccloud.com/pastebin/ibydvJXd/
[13:12:58] that was there before too btw
[13:13:43] Yeah, that's not strange
[13:23:11] Amir1: where are we with the long-running DB change on cumin1001, do we expect that it's done by the beginning of next week?
[13:24:11] moritzm: yes, they'll all be done by the beginning of next week
[13:24:36] the revision alter table was done in one or two more dbs as expected
[13:24:50] marostegui: https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/normal.106.svg and https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/normal.104.svg
[13:27:37] Interesting, they look almost identical now
[13:27:52] yeah, pthread has a bit of an increase
[13:28:14] it might add up I guess?
[13:29:06] I am very tempted to leave db1132 serving
[13:29:14] And see how it does if there are spikes like we had in the past
[13:29:17] where it suffered
[13:30:11] marostegui: I want to do an experiment with adding a lot of load, like A LOT, and see
[13:30:23] Fine by me
[13:30:33] we should depool it so it doesn't bring down the whole s1 :D
[13:30:41] but first fooooood
[13:31:06] Amir1: Breakfast I guess?
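The flamegraphs linked at 13:24 were presumably produced with perf; the exact invocation isn't shown in the log, but a generic equivalent looks roughly like this, assuming a single MariaDB instance per host and the stackcollapse-perf.pl/flamegraph.pl scripts from Brendan Gregg's FlameGraph repo checked out in the working directory.

    # Sketch: sample the running server for 60s and render a flamegraph.
    # The daemon is mysqld on 10.4 and mariadbd on 10.6, hence the fallback.
    PID=$(pgrep -x mysqld || pgrep -x mariadbd)
    sudo perf record -F 99 -g -p "$PID" -- sleep 60
    sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg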
[13:31:36] lol
[13:32:01] thinking about it now, maybe let's pool it with a weight of 1 to add some randomness
[13:32:19] db1132 is fully pooled now eh
[13:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:33:42] ^ I have silenced that alert
[15:44:15] (we should probably update the topic to be more precise soon :-D)
[15:44:36] although I guess it is not that imprecise
[20:20:28] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:12:16] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
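The swift_ring_manager alert above is the usual "systemd unit failed" check (it recovered on its own at 21:12); a generic triage sketch, assuming the unit is named swift_ring_manager.service as the check text suggests.

    # Sketch: see why the timer-driven job failed, then clear the failed state
    # once understood so the unit-status check recovers.
    sudo systemctl status swift_ring_manager.service
    sudo journalctl -u swift_ring_manager.service --since '2 hours ago'
    sudo systemctl reset-failed swift_ring_manager.service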