[01:10:34] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 7.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:12] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:26] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:13:04] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[09:34:37] marostegui, jynus: https://phabricator.wikimedia.org/T207253 seems like it is still worth doing, right? It almost had a working implementation
[09:35:09] It is worth doing, but no one is actively working on it
[09:35:23] noted, that's all I needed
[09:36:21] marostegui: this one I'm less sure about: https://phabricator.wikimedia.org/T160983
[09:36:27] looks like it is outside of the SRE realm?
[09:36:51] Yeah, it would still need some input from us. Let me re-do the tags
[09:36:56] or is it not relevant anymore? :)
[09:45:07] marostegui: last one, but it looks like you're not involved, do you know what's up with https://phabricator.wikimedia.org/T281833 ?
[09:45:29] I would like that to happen, yeah
[09:45:38] it looks still valid and close to completion (maybe just the alerting is missing?)
[09:46:57] XioNoX: I would need to check exactly what is missing
[09:47:09] I can't tell if it's close to completion
[09:47:26] no worries! You already did enough for my triaging
[09:47:32] haha
[10:16:24] Emperor: you will have seen a recent increase in errors from swift - they are not mediawiki clients, but me, doing a second pass over images that previously failed to back up
[10:19:07] I need to do this because many backups were attempted with the backup account and failed wrongly because of https://phabricator.wikimedia.org/T269108#8622358
[10:20:26] jynus: thanks, it's nice not to have to worry about swift errors for a few sweet hours ;-)
[10:21:45] sorry for the noise, but 1) I am not sure how to filter them out, as they are all generated by the same proxy, and 2) the only possible way to differentiate them I cannot use, because that is the source of the original errors
[10:21:53] (using a different account)
[10:23:09] jynus: well, I can see the perms for deleted containers are a bit different
[10:25:52] they are 404 errors, so not something you would normally be worried about anyway
[10:26:16] I thought per the ticket you were seeing 403 errors?
[10:26:28] those are the root cause
[10:26:40] I am running it with the higher-permission account now
[10:27:00] to generate some 404s instead of 403s
[10:27:16] I mean, not with that goal - the goal is to generate 200s when possible, obviously
[10:27:25] Are these then "ghost" objects - they're in a container listing but don't actually exist?
[10:28:07] no, it is more complicated to explain, but they are in the metadata and not in swift, for several reasons
[10:28:23] OK, maybe let us not get side-tracked on that now, then?
[10:28:27] [sorry!]
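[Editor's note: the exchange above distinguishes real 403s (the permission problem from T269108) from the harmless 404s produced by the second backup pass. The sketch below is only an illustration of that kind of pass, assuming python-swiftclient and a higher-privilege account; the auth endpoint, account name, container/object pairs and variable names are placeholders, not the actual mediabackups tooling.]

```python
"""Second-pass check of previously failed media backups against Swift.

A minimal sketch, assuming python-swiftclient; endpoints and credentials
below are placeholders, not the real backup configuration.
"""
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

# Hypothetical credentials for the higher-privilege account mentioned above.
conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",  # placeholder auth endpoint
    user="mw:backup-admin",                         # placeholder account
    key="secret",
    retries=2,
)

# (container, object) pairs that failed during the first backup pass (examples only).
failed_objects = [
    ("wikipedia-commons-local-public.aa", "a/ab/Example.jpg"),
]

really_missing, still_denied = [], []
for container, obj in failed_objects:
    try:
        conn.head_object(container, obj)  # 2xx: the object exists, so the copy can be retried
    except ClientException as exc:
        if exc.http_status == 404:
            # genuinely absent from swift (present only in the MediaWiki metadata)
            really_missing.append((container, obj))
        elif exc.http_status == 403:
            # the permission problem persists for this object
            still_denied.append((container, obj))
        else:
            raise

print(f"missing: {len(really_missing)}, still denied: {len(still_denied)}")
```

[Separating the two outcomes matters because only the 403s point back at the ACL problem; the 404s are entries that exist in the MediaWiki metadata but not in swift, as discussed above and below.]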
[10:29:02] we have around 100K of those
[10:29:58] so they should end soon
[10:42:53] (as a reminder, I am not creating a backup of swift, but of mediawiki files, which may look very similar but are not exactly the same logically)
[10:53:05] Emperor: as promised, the "spike" of 404s has subsided and, while there will still be some more, I have got through the 100K of real "missing files" and am back to normal backup processing
[10:57:37] thanks.
[11:00:34] sorry for the delay with the ACL stuff, I don't really know much about it; hopefully I've given you some useful pointers...
[11:29:50] I am going to shut down db1121 to get its mgmt fixed (which is also part of the sprint week for dc-ops eqiad)
[11:30:02] It is the sanitarium master, so there will be lag on the wikireplicas
[12:45:43] db1121 is back up
[12:47:32] (MysqlReplicationLag) firing: MySQL instance db1121:9104 has too large replication lag (46m 58s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1121&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[12:47:39] ^ known
[12:52:32] (MysqlReplicationLag) firing: (2) MySQL instance db1121:9104 has too large replication lag (34m 57s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[12:57:32] (MysqlReplicationLag) firing: (2) MySQL instance db1121:9104 has too large replication lag (20m 41s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[13:02:32] (MysqlReplicationLag) firing: (2) MySQL instance db1121:9104 has too large replication lag (7m 51s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[13:07:32] (MysqlReplicationLag) resolved: (2) MySQL instance db1121:9104 has too large replication lag (7m 51s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[15:42:33] I know there is ongoing orchestrator work - so it is currently down and known, right?
[15:43:04] yeah
[15:43:17] thanks
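[Editor's note: the MysqlReplicationLag alerts above appear to come from the Prometheus mysqld exporter (port 9104). As an illustration of what the reported lag means, here is a minimal sketch that polls SHOW SLAVE STATUS and compares Seconds_Behind_Master against a threshold; it is not the production check, and the host, credentials and threshold are placeholders.]

```python
"""Poll a replica for lag, roughly what the MysqlReplicationLag alert reports.

A minimal sketch only, assuming PyMySQL; production alerting is done via the
Prometheus exporter, and the host/credentials/threshold here are placeholders.
"""
import time

import pymysql

THRESHOLD_SECONDS = 60  # hypothetical alerting threshold

conn = pymysql.connect(
    host="db1121.eqiad.wmnet",  # the replica being watched (placeholder access)
    user="watchdog",
    password="secret",
    cursorclass=pymysql.cursors.DictCursor,
)

try:
    while True:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
        # Seconds_Behind_Master is None when replication is stopped,
        # and the row is missing entirely if the host is not a replica.
        lag = status["Seconds_Behind_Master"] if status else None
        if lag is None:
            print("replication not running (or not a replica)")
        elif lag >= THRESHOLD_SECONDS:
            print(f"lag {lag}s >= {THRESHOLD_SECONDS}s: would alert")
        else:
            print(f"lag {lag}s: ok")
        time.sleep(30)
finally:
    conn.close()
```

[In the db1121 case above the lag was expected: the host had been shut down for mgmt repairs, so the alert was acknowledged as known and resolved on its own once replication caught up.]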