[01:10:34] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 7.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:12] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:26] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:13:04] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[09:34:37] marostegui, jynus: https://phabricator.wikimedia.org/T207253 seems like it is still worth doing, right? It almost had a working implementation
[09:35:09] It is worth doing, but no one is actively working on it
[09:35:23] noted, that's all I needed
[09:36:21] marostegui: this one I'm less sure about: https://phabricator.wikimedia.org/T160983
[09:36:27] looks like it is outside of the SRE realm?
[09:36:51] Yeah, it would still need some input from us. Let me re-do the tags
[09:36:56] or is it not relevant anymore? :)
[09:45:07] marostegui: last one, but it looks like you're not involved, do you know what's up with https://phabricator.wikimedia.org/T281833 ?
[09:45:29] I would like that to happen, yeah
[09:45:38] it looks still valid and close to completion (maybe just the alerting is missing?)
[09:46:57] XioNoX: I would need to check exactly what is missing
[09:47:09] I can't tell if it's close to completion
[09:47:26] no worries! You already did enough for my triaging
[09:47:32] haha
[10:16:24] Emperor: you will have seen a recent increase in errors from swift - they are not mediawiki clients, but me, doing a second pass over images that previously failed to back up
[10:19:07] I need to do this because many backups were attempted with the backup account and failed wrongly because of https://phabricator.wikimedia.org/T269108#8622358
[10:20:26] jynus: thanks, it's nice not to have to worry about swift errors for a few sweet hours ;-)
[10:21:45] sorry for the noise, but 1) I am not sure how to filter them out, as they are all generated by the same proxy, and 2) the only possible way to differentiate them I cannot use, because that is the source of the original errors
[10:21:53] (using a different account)
[10:23:09] jynus: well, I can see the perms for deleted containers are a bit different
[10:25:52] they are 404 errors, so not something you would normally be worried about anyway
[10:26:16] I thought per the ticket you were seeing 403 errors?
[10:26:28] those are the root cause
[10:26:40] I am running it with the higher-permission account now
[10:27:00] to generate some 404s instead of 403s
[10:27:16] I mean, not with that goal - the goal is to generate 200s when possible, obviously
[10:27:25] Are these then "ghost" objects - they're in a container listing but don't actually exist?
[10:28:07] no, it is more complicated to explain, but they are in the metadata and not in swift, for several reasons
[10:28:23] OK, maybe let us not get side-tracked on that now, then?
[10:28:27] [sorry!]
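[Editor's note: the exchange above distinguishes real 403s (the permission problem from T269108) from the harmless 404s produced by the second backup pass. The sketch below is only an illustration of that kind of pass, assuming python-swiftclient and a higher-privilege account; the auth endpoint, account name, container/object pairs and variable names are placeholders, not the actual mediabackups tooling.]

```python
"""Second-pass check of previously failed media backups against Swift.

A minimal sketch, assuming python-swiftclient; endpoints and credentials
below are placeholders, not the real backup configuration.
"""
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

# Hypothetical credentials for the higher-privilege account mentioned above.
conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",  # placeholder auth endpoint
    user="mw:backup-admin",                         # placeholder account
    key="secret",
    retries=2,
)

# (container, object) pairs that failed during the first backup pass (examples only).
failed_objects = [
    ("wikipedia-commons-local-public.aa", "a/ab/Example.jpg"),
]

really_missing, still_denied = [], []
for container, obj in failed_objects:
    try:
        conn.head_object(container, obj)  # 2xx: the object exists, so the copy can be retried
    except ClientException as exc:
        if exc.http_status == 404:
            # genuinely absent from swift (present only in the MediaWiki metadata)
            really_missing.append((container, obj))
        elif exc.http_status == 403:
            # the permission problem persists for this object
            still_denied.append((container, obj))
        else:
            raise

print(f"missing: {len(really_missing)}, still denied: {len(still_denied)}")
```

[Separating the two outcomes matters because only the 403s point back at the ACL problem; the 404s are entries that exist in the MediaWiki metadata but not in swift, as discussed above and below.]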
[10:29:02] we have around 100K of those
[10:29:58] so they should end soon
[10:42:53] (as a reminder, I am not creating a backup of swift, but of mediawiki files, which may look very similar but are not exactly the same logically)
[10:53:05] Emperor: as promised, the "spike" of 404s has subsided and, while there will still be some more, I have got through the 100K of real "missing files" and am back to normal backup processing
[10:57:37] thanks.
[11:00:34] sorry for the delay with the ACL stuff, I don't really know much about it; hopefully I've given you some useful pointers...
[11:29:50] I am going to shut down db1121 to get its mgmt fixed (which is also part of the sprint week for dc-ops eqiad)
[11:30:02] It is the sanitarium master, so there will be lag on the wikireplicas
[12:45:43] db1121 is back up
[12:47:32] (MysqlReplicationLag) firing: MySQL instance db1121:9104 has too large replication lag (46m 58s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1121&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[12:47:39] ^ known
[12:52:32] (MysqlReplicationLag) firing: (2) MySQL instance db1121:9104 has too large replication lag (34m 57s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[12:57:32] (MysqlReplicationLag) firing: (2) MySQL instance db1121:9104 has too large replication lag (20m 41s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[13:02:32] (MysqlReplicationLag) firing: (2) MySQL instance db1121:9104 has too large replication lag (7m 51s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[13:07:32] (MysqlReplicationLag) resolved: (2) MySQL instance db1121:9104 has too large replication lag (7m 51s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[15:42:33] I know there is ongoing orchestrator work - so it is currently down and known, right?
[15:43:04] yeah
[15:43:17] thanks
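[Editor's note: the MysqlReplicationLag alerts above appear to come from the Prometheus mysqld exporter (port 9104). As an illustration of what the reported lag means, here is a minimal sketch that polls SHOW SLAVE STATUS and compares Seconds_Behind_Master against a threshold; it is not the production check, and the host, credentials and threshold are placeholders.]

```python
"""Poll a replica for lag, roughly what the MysqlReplicationLag alert reports.

A minimal sketch only, assuming PyMySQL; production alerting is done via the
Prometheus exporter, and the host/credentials/threshold here are placeholders.
"""
import time

import pymysql

THRESHOLD_SECONDS = 60  # hypothetical alerting threshold

conn = pymysql.connect(
    host="db1121.eqiad.wmnet",  # the replica being watched (placeholder access)
    user="watchdog",
    password="secret",
    cursorclass=pymysql.cursors.DictCursor,
)

try:
    while True:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
        # Seconds_Behind_Master is None when replication is stopped,
        # and the row is missing entirely if the host is not a replica.
        lag = status["Seconds_Behind_Master"] if status else None
        if lag is None:
            print("replication not running (or not a replica)")
        elif lag >= THRESHOLD_SECONDS:
            print(f"lag {lag}s >= {THRESHOLD_SECONDS}s: would alert")
        else:
            print(f"lag {lag}s: ok")
        time.sleep(30)
finally:
    conn.close()
```

[In the db1121 case above the lag was expected: the host had been shut down for mgmt repairs, so the alert was acknowledged as known and resolved on its own once replication caught up.]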