[01:09:37] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 24.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:23] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 17 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:27] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:13:11] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[07:20:27] <Emperor>	 these systemd alerts are still too noisy IMO
[07:26:53] <Emperor>	 I don't mind it logging those things somewhere or having some place I can check them, but I don't think they need to be on this channel. 
[07:29:45] <RhinosF1>	 Maybe move them to the feed channel
[07:30:14] <RhinosF1>	 There's already #wikimedia-data-persistence-feed for wikibugs “spam”
[08:24:37] <jynus>	 the main issue is that the lag check is not configurable
[08:25:26] <jynus>	 ideally there would be a warning_value => / critical_value => 100, but that is not implemented
[08:43:07] <Emperor>	 jynus: I think that might apply to the replica lag alerts, but not the 8 messages about Systemd units of various flavours overnight
[08:44:24] <jynus>	 that's another thing we discussed with obs team
[08:44:37] <jynus>	 those should mostly all be warnings
[08:51:04] <jynus>	 not against moving stuff around, but ideally we should fix the "root causes" if we can, rather than hide them, when possible
[08:52:38] <Emperor>	 I'm guessing these new alerts are as a result of sprint week; is it unreasonable to be a bit grumpy at the prospect of having to work out how to turn them into warnings when I didn't cause them to appear in the first place?
[08:53:19] <jynus>	 I don't think you should do it
[08:53:28] <jynus>	 but the people that added them
[09:56:54] <Amir1>	 The replica lag is basically all on m1. e.g. replag in a mw would be quite important 
[09:57:08] <Amir1>	 but we need to maybe fix something about m1
[09:59:21] <jynus>	 I don't think that is possible- backups are created of 20 million files at a time from ci
[09:59:33] <jynus>	 what we need is that configurable lag for misc servers
[10:01:49] <Amir1>	 yeah, it can be a separate alert for now maybe?
[10:02:35] <jynus>	 don't ask me- I don't handle databases
[10:05:23] <jynus>	 and last time puppet was touched I warned things like this would happen
[10:13:38] <Emperor>	 godog: sorry, it's not quite clear who to ask about alerting changes made during sprint week, and I think you were tech lead on the alert project. Could we not have the new Systemd alerts (e.g. the 8 here overnight) on this IRC channel, please? I think preference is not to get notified about them at all, but sending them to the wikimedia-data-persistence-feed channel would also be OK...
[10:17:47] <godog>	 Emperor: I'm going to lunch and will get back to you this afternoon
[10:18:25] <Emperor>	 thanks :)
[10:48:34] <jinxer-wm>	 (SystemdUnitFailed) firing: minio.service Failed on backup1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:51:18] <jynus>	 that's me
[11:14:37] <jinxer-wm>	 (SystemdUnitFailed) resolved: minio.service Failed on backup1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:21:27] <jynus>	 sadly, I have bad news
[11:44:30] <jynus>	 all the backups of swift are no longer useful
[11:59:06] <Emperor>	 Um, could you elaborate on that?
[12:02:01] <jynus>	 https://phabricator.wikimedia.org/T306602#8741394
[12:04:42] <jynus>	 imagine that someone tells you that Swift2 requires you to reimport all files :-(
[12:05:18] <jynus>	 at least on production you have some room to operate, in backups there is no real redundancy and extra resources to do so
[12:06:47] <jynus>	 plus precisely, the simplistic storage method was the one thing that made it ideal for backups :-(((
[12:10:13] <Emperor>	 jynus: so we currently have backups, but can't make new ones, and the only way to migrate to a new backup engine is going to be a new full backup?
[12:10:36] <jynus>	 it is not *that* bad
[12:11:02] <jynus>	 we have backups, we can update them and use them, as long as we don't upgrade
[12:11:19] <jynus>	 but an upgrade will be as costly as migrating to a new engine
[12:12:13] <jynus>	 they basically just say- use rclone to upgrade!
[12:12:32] <Emperor>	 fun times
[12:12:39] <jynus>	 indeed
[12:12:50] <jynus>	 there it goes Q4 for me, just to prepare that
[12:13:07] <jynus>	 and waste IO and loooots of time
[12:14:16] <jynus>	 the thing is we chose the old format because it was just in plain text on the filesystem- which is not ideal for high performance, but nice for recoverability
[12:14:58] <jynus>	 with that gone, plus the extra effort needed to upgrade, there isn't any reason to keep using minio
[12:17:11] <Emperor>	 Mmm
[12:20:16] <jynus>	 I will check some alternatives- some may even avoid a migration if they implement something similar to the old minio format
[12:27:16] <jynus>	 most people on internet forums suggest using either minio or Swift for what I want :-(
[12:32:33] <jynus>	 anyway, I will finish the current upgrade and that will be a problem for future Jaime
[12:34:42] <jynus>	 I will also wait for hw to arrive, as in all scenarios, having additional hardware will help (these were not for this, but anyway)
[12:38:08] <jynus>	 on the bright side, nothing depends on minio- everything was built knowing it could be replaced by something else at any time that supports S3 api
[12:41:03] <godog>	 Emperor: I checked the alerts, SystemdUnitFailed did start as a critical though we changed it to a warning because of spam, currently I see warnings are routed on irc too here
[12:41:39] <godog>	 there's a bunch of things we can do: change the notification interval for warnings so they notify again say once a week (unless other warnings for the same alert come in)
[12:41:51] <godog>	 or route warnings to -feed for example
[12:42:34] <godog>	 regardless of notifications (irc, email, etc) alerts all keep showing up on alerts.w.o too
[12:43:04] <Emperor>	 I think route warnings to -feed, in that case. 
[12:44:06] <godog>	 sounds good to me, I'll send the patch your (team) way
[12:46:19] <godog>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/904525
[12:48:34] <Emperor>	 Thanks - LGTM, but I think input from someone else on the team before merging would probably be fair :)
[12:53:07] <godog>	 agreed
[12:53:48] <godog>	 after that's done I think I'm going to change the general policy for re-notifications of warnings to be multiple-days as well
[14:01:27] <jynus>	 I've given it a look, and my suggestion would be to remove that replication alert completely- it fails frequently because it isn't retried, and there is already a replication check (on icinga) that doesn't even fire