[01:09:37] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 24.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:23] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 17 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:27] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:13:11] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[07:20:27] these systemd alerts are still too noisy IMO
[07:26:53] I don't mind it logging those things somewhere or having some place I can check them, but I don't think they need to be on this channel.
[07:29:45] Maybe move them to the feed channel
[07:30:14] There's already #wikimedia-data-persistence-feed for wikibugs “spam”
[08:24:37] the main issue is that the lag check is not configurable
[08:25:26] ideally there would be a warning_value => / critical_value => 100, but that is not implemented
[08:43:07] jynus: I think that might apply to the replica lag alerts, but not the 8 messages about Systemd units of various flavours overnight
[08:44:24] that's another thing we discussed with obs team
[08:44:37] those should mostly all be warnings
[08:51:04] not against moving stuff around, but ideally we should fix the "root causes" if we can, rather than hide them, when possible
[08:52:38] I'm guessing these new alerts are as a result of sprint week; is it unreasonable to be a bit grumpy at the prospect of having to work out how to turn them into warnings when I didn't cause them to appear in the first place?
[08:53:19] I don't think you should do it
[08:53:28] but the people that added them
[09:56:54] The replica lag is basically all on m1. e.g. replag in a mw would be quite important
[09:57:08] but we need to maybe fix something about m1
[09:59:21] I don't think that is possible- backups are created of 20 million files at a time from ci
[09:59:33] what we need is that configurable lag for misc servers
[10:01:49] yeah, it can be a separate alert for now maybe?
[10:02:35] don't ask me- I don't handle databases
[10:05:23] and last time puppet was touched I warned things like this would happen
[10:13:38] godog: sorry, it's not quite clear who to ask about alerting changes made during sprint week, and I think you were tech lead on the alert project. Could we not have the new Systemd alerts (e.g. the 8 here overnight) on this IRC channel, please? I think preference is not to get notified about them at all, but sending them to the wikimedia-data-persistence-feed channel would also be OK...
[10:17:47] Emperor: I'm going to lunch and will get back to you this afternoon
[10:18:25] thanks :)
[10:48:34] (SystemdUnitFailed) firing: minio.service Failed on backup1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:51:18] that's me
[11:14:37] (SystemdUnitFailed) resolved: minio.service Failed on backup1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:21:27] sadly, I have bad news
[11:44:30] all the backups of swift are no longer useful
[11:59:06] Um, could you elaborate on that?
[12:02:01] https://phabricator.wikimedia.org/T306602#8741394
[12:04:42] imagine that someone tells you that Swift2 requires you to reimport all files :-(
[12:05:18] at least on production you have some room to operate, in backups there is no real redundancy or extra resources to do so
[12:06:47] plus precisely, the simplistic storage method was the one thing that made it ideal for backups :-(((
[12:10:13] jynus: so we currently have backups, but can't make new ones, and the only way to migrate to a new backup engine is going to be a new full backup?
[12:10:36] it is not *that* bad
[12:11:02] we have backups, we can update them and use them, as long as we don't upgrade
[12:11:19] but an upgrade will be as costly as migrating to a new engine
[12:12:13] they basically just say- use rclone to upgrade!
[12:12:32] fun times
[12:12:39] indeed
[12:12:50] there goes Q4 for me, just to prepare that
[12:13:07] and waste IO and loooots of time
[12:14:16] the thing is we chose the old format because it was just in plain text on the filesystem- which is not ideal for high performance, but nice for recoverability
[12:14:58] with that gone, plus the extra effort needed to upgrade, there isn't any reason to keep using minio
[12:17:11] Mmm
[12:20:16] I will check some alternatives- some may even avoid a migration if they implement something similar to the old minio format
[12:27:16] most people on internet forums suggest using either minio or Swift for what I want :-(
[12:32:33] anyway, I will finish the current upgrade and that will be a problem for future Jaime
[12:34:42] I will also wait for hw to arrive, as in all scenarios, having additional hardware will help (these were not for this, but anyway)
[12:38:08] on the bright side, nothing depends on minio- everything was built knowing it could be replaced at any time by something else that supports the S3 api
[12:41:03] Emperor: I checked the alerts, SystemdUnitFailed did start as a critical though we changed it to a warning because of spam, currently I see warnings are routed on irc too here
[12:41:39] there's a bunch of things we can do: change the notification interval for warnings so they notify again say once a week (unless other warnings for the same alert come in)
[12:41:51] or route warnings to -feed for example
[12:42:34] regardless of notifications (irc, email, etc) alerts all keep showing up on alerts.w.o too
[12:43:04] I think route warnings to -feed, in that case.
[12:44:06] sounds good to me, I'll send the patch your (team) way
[12:46:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/904525
[12:48:34] Thanks - LGTM, but I think input from someone else on the team before merging would probably be fair :)
[12:53:07] agreed
[12:53:48] after that's done I think I'm going to change the general policy for re-notifications of warnings to be multiple-days as well
[14:01:27] I've given it a look, and my suggestion would be to remove that replication alert completely- it fails frequently because it isn't retried, and there is already a replication check (on icinga) that doesn't even fire
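
Two ideas from the log above, per-section lag thresholds (08:25:26) and retrying before alerting (14:01:27), can be sketched roughly as follows. This is only an illustration, not the actual Icinga or Alertmanager check: the function names, the `attempts` parameter, and the m1 warning value of 50 are all assumptions (the log only mentions a critical value of 100 and leaves the warning value blank). The default thresholds of 1 and 2 seconds are taken from the "(C)2 ge (W)1" text in the alert output.

```python
from typing import Callable

# Ordering used to pick the mildest state seen across retries.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def classify_lag(lag_seconds: float, warning: float = 1.0, critical: float = 2.0) -> str:
    """Map a lag sample to a state using 'greater or equal' thresholds."""
    if lag_seconds >= critical:
        return "CRITICAL"
    if lag_seconds >= warning:
        return "WARNING"
    return "OK"


def check_with_retries(
    read_lag: Callable[[], float],
    warning: float = 1.0,
    critical: float = 2.0,
    attempts: int = 3,
) -> str:
    """Sample the lag up to `attempts` times and report the mildest state seen.

    A single bad sample therefore does not fire the alert on its own.
    `read_lag` is a hypothetical callable returning the current lag in seconds.
    """
    best = "CRITICAL"
    for _ in range(attempts):
        state = classify_lag(read_lag(), warning, critical)
        if SEVERITY[state] < SEVERITY[best]:
            best = state
        if best == "OK":
            break  # no need to keep sampling once a healthy reading is seen
    return best


# Hypothetical per-section overrides; only critical=100 appears in the log.
M1_THRESHOLDS = {"warning": 50.0, "critical": 100.0}

if __name__ == "__main__":
    samples = iter([24.8, 0.5])
    print(check_with_retries(lambda: next(samples)))
    print(classify_lag(24.8, **M1_THRESHOLDS))
```

With looser thresholds for a misc section such as m1, the 24.8 second spike from 01:09:37 would not be classified as critical, and with retries a short spike followed by a healthy sample would not alert either.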