[01:08:56] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 21.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:09:08] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 26.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:28] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:12] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[05:33:50] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 89 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[05:34:12] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 103.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[05:35:22] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[05:35:42] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[10:48:00] I am going to pool db1176 (mariadb 11) with 1% weight for a few minutes in s1
[10:48:35] 😱
[11:07:40] depooled
[11:23:40] I've opened T327253 about the 27k swift objects that appear in container listings but don't actually exist
[11:23:40] T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253
[11:55:06] ok, I thought it was going to be mostly derived data, but most seem to be commons originals?
[11:57:12] the first thing I would do is to search for them in backups & mediawiki metadata and see if there is a pattern (e.g. files that were never intended to be there)
[11:58:59] ...how?
[11:59:14] I think I can help with that, with https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/mediabackups/+/refs/heads/master/mediabackups/ :-)
[12:01:06] I won't have a script specifically for this, but with the backup automation creating a new script is very easy: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/mediabackups/+/refs/heads/master/mediabackups/cli/update_mysql_metadata.py
[12:01:12] Thanks; is that installed on some suitable host and/or its use documented?
[12:01:44] yep, you can work on my worker hosts ms-backup1* on eqiad and ms-backup2* on codfw
[12:02:07] they are the hosts that generate backups, but don't have local data themselves
[12:03:06] I have abstractions and access set up to read and process both swift and mw metadata
[12:03:52] and this is why I wanted your involvement on backups, as this will be useful for certain swift automations in general
[12:04:17] Cool, I'll have a bit of a play this afternoon
[12:05:01] expect stupid emails when I get stuck :)
[12:05:17] not stupid- this is WIP and I was in the middle of a refactoring
[12:05:32] so code will not be in the best of states
[12:06:16] I am working on https://phabricator.wikimedia.org/T327157 precisely, which is a different kind of batch work for backups
[12:07:02] but the base listing metadata files should work
[12:07:56] sadly, because backups started 1 or 2 years ago, we won't be able to recover anything beyond that
[12:12:59] I can also offer to import that file into a temporary table on the backups db- for me doing an SQL query will be easier than a python script!
[12:20:18] Also opened T327269 for the rclone/repl issue consequent to this, which I think needs addressing first
[12:20:19] T327269: Rclone is fussy about missing objects - https://phabricator.wikimedia.org/T327269
[13:59:56] there are a few Unpollable hosts on codfw- not sure if prometheus needs some kick after yesterday? https://grafana.wikimedia.org/goto/C97ZKnTVk?orgId=1
[14:00:08] *Unpollable db hosts
[16:14:03] Emperor: I think I may have lost the plot w/ T327269. The problem is when say eqiad has objects in the container list, but the corresponding HEAD/GET (also against eqiad) fails, yes? So does `swiftrepl` issue a corresponding delete in codfw for that scenario, or simply skip it and carry on with what it can?
[16:14:04] T327269: Rclone is fussy about missing objects - https://phabricator.wikimedia.org/T327269
[16:15:03] Emperor: does `rclone` otherwise propagate a delete when an item is missing from the source container list, and present in the destination?
[16:15:14] (in this scenario)
[16:35:26] urandom: sorry, paying attention to the vector deployment
[16:35:56] urandom: yes, if an object isn't in the eqiad container and is in the codfw one, then (when eqiad is active), the object is deleted from codfw
[16:37:04] urandom: but if the object is in the eqiad list but not codfw, then rclone will try and copy it [and fail if it's then not present]; it won't do anything to codfw in that scenario (because it's not present in codfw in any case)
[16:47:41] Emperor: what is swiftrepl doing in this scenario? How do options 1 & 3 differ?
[16:48:01] (other than it not spamming logs)
[16:55:32] I think swiftrepl just tries to copy everything it thinks needs copying; and then deletes everything it thinks needs deleting
[16:56:48] Emperor: I would assume that an entry in the source container listing would trigger "needs copying" though
[16:57:04] Yes, I think swiftrepl tries, fails, doesn't care
[16:57:11] so option 1?
[16:57:43] pretty much, AIUI, yes
[16:58:25] gotcha
[16:58:46] (I suspect the behaviours in the event of other failures may be different, but I don't know)
[16:59:06] Emperor: with respect to T327253, these objects show in eqiad container listings but not codfw listings?
[16:59:07] T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253
[17:00:45] I guess I'm still also unclear, where does the risk of erroneously resurrecting an object that was meant to be deleted come from?
[17:01:13] urandom: yes, in eqiad but not codfw
[17:02:02] urandom: if we have a deleted object in eqiad but not codfw; And we don't do something about rclone not propagating deletions; Then when we switch to make codfw primary the un-deleted object will be propagated from codfw to eqiad, undoing the deletion
[17:03:06] Emperor: and it does not do the deletions *because* of the previous errors?
[17:03:29] urandom: yes, rclone doesn't do _any_ deletions if there have been errors with the copying [unless we configure it otherwise]
[17:03:43] auh, ok, I see now
[17:04:07] Emperor: that's indeed Bad™
[17:04:43] urandom: yes; I don't think rclone's behaviour is necessarily wrong in general, but it's not what we want it to do here.
[17:04:58] (there is, I think, a way to change its behaviour)
[17:05:35] Emperor: but yes, that does sound like your option 1 is functionally equivalent to what we have been doing, and generally the lesser of evils
[17:05:54] that's roughly my feeling, yes :)
[17:06:27] T327253 continues to bug me
[17:06:27] T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253
[17:06:36] Mmm :(
[17:06:45] [I'm about to vanish for today]
[17:06:58] no worries; thanks for the explanations!
[17:07:02] YW
[22:39:33] if a file wrongly disappeared from eqiad but it is on codfw, it would be bad™ to delete the copy that we do have on codfw
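
Editor's note: the replication behaviour discussed between 16:14 and 17:05 can be illustrated with a small model. This is only a sketch under stated assumptions, not rclone's or swiftrepl's actual code: the function `sync`, the `delete_after_errors` switch, and the object names are all hypothetical, and containers are modelled as plain Python sets. It shows how a listed-but-missing source object makes a copy fail, and how skipping deletions after a copy error leaves a stale object in the destination.

```python
from dataclasses import dataclass, field


@dataclass
class SyncResult:
    copied: list = field(default_factory=list)
    copy_errors: list = field(default_factory=list)
    deleted: list = field(default_factory=list)


def sync(source_listing, source_objects, dest_objects, delete_after_errors):
    """Model one sync pass from a source container to a destination.

    source_listing: names in the source container listing
    source_objects: names that actually answer HEAD/GET at the source
    dest_objects:   set of names at the destination (mutated in place)
    delete_after_errors: hypothetical switch; if False, no deletions
        are done once any copy has failed
    """
    result = SyncResult()
    # Copy phase: everything listed at the source but absent at the destination.
    for name in sorted(source_listing):
        if name in dest_objects:
            continue
        if name in source_objects:
            dest_objects.add(name)
            result.copied.append(name)
        else:
            # Listed but not extant: the copy fails.
            result.copy_errors.append(name)
    # Delete phase: skipped entirely after errors unless told otherwise.
    if result.copy_errors and not delete_after_errors:
        return result
    for name in sorted(dest_objects - set(source_listing)):
        dest_objects.discard(name)
        result.deleted.append(name)
    return result


# Hypothetical example: "ghost" is listed in eqiad but does not exist;
# "deleted-upstream" was deleted in eqiad but is still present in codfw.
eqiad_listing = {"a", "ghost"}
eqiad_objects = {"a"}

codfw = {"a", "deleted-upstream"}
strict = sync(eqiad_listing, eqiad_objects, codfw, delete_after_errors=False)
print(strict.copy_errors, strict.deleted, sorted(codfw))

codfw = {"a", "deleted-upstream"}
lenient = sync(eqiad_listing, eqiad_objects, codfw, delete_after_errors=True)
print(lenient.copy_errors, lenient.deleted, sorted(codfw))
```

In the first run the stale object survives in codfw, which is the resurrection risk described at 17:02:02 once codfw becomes primary. The second run corresponds roughly to "option 1" as characterised at 16:57:04 (try, fail, carry on), with the caveat raised at 22:39:33 that deleting from codfw is harmful if the eqiad copy vanished wrongly.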