[01:07:30] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 21.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:07:56] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 12.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:56] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:04] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[04:14:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (pc2014:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[06:09:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (pc2014:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[07:15:17] andrewbogott: sorry I was out (and I'm ooo today), it could be due to the changes on the user in https://phabricator.wikimedia.org/T326802 but in that case it would give a different error. Let's create a ticket and I will check
[09:55:07] Emperor: are you aware of https://phabricator.wikimedia.org/T327925 and https://phabricator.wikimedia.org/T327991?
[09:56:18] marostegui: in theory, yes, they're somewhere on my todo list after "understand why we have all these ghost objects in swift and what to do with them", which was definitely not going to take all week :-/
[09:56:48] but I should look, 'cos ISTR at least one of those is when I'm on VAC so I may need to find someone to look after swift
[09:57:26] Emperor: Ah cool, just making sure in case you needed to do something way before that maintenance (like I had to)
[09:58:53] Some frontends need depooling, the pain is all of the "will they boot up OK" and "will some more disks fail" variety afterwards
[09:59:29] but it is only network maintenance, why would disks fail?
[10:00:01] oh, I misinterpreted "hard downtime"
[10:00:29] It is hard network downtime I believe :)
[10:00:39] in that case it's just {de,}pool on the frontends, WCPGW?
[10:00:44] in any case, I guess reads will be depooled
[10:00:52] what is wcpgw? XD
[10:00:53] as in, mw will be
[10:02:04] marostegui: sorry, What Could Possibly Go Wrong
[10:02:15] ah XDDD
[10:02:19] I thought it was something technical XD
[10:02:40] :)
[10:15:32] es2020 data check is almost in the last step, but wikidata is left, which is a big one
[10:16:27] so far so good?
[10:16:37] yeah, no errors, including enwiki
[10:16:47] nice
[10:21:53] I think I have a potential plan for vacations, some days may happen during the network upgrade
[10:22:30] as you can imagine, backups failing one week on one dc will not be a huge issue, but I will leave some notes on how to handle that if there is bandwidth
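The replica-lag alerts at the top of this log compare a measured lag value against warning and critical thresholds ("(C)2 ge (W)1"). A minimal sketch of that style of check, assuming the standard mysqld_exporter metric name mysql_slave_status_seconds_behind_master and a placeholder Prometheus endpoint and instance label (none of these details come from the log), could look like this:

# Sketch of a sustained-replica-lag check: average the lag over a window via
# the Prometheus HTTP API and compare it with warning/critical thresholds.
import requests

PROM_URL = "https://prometheus.example.org/api/v1/query"  # placeholder endpoint
WARNING, CRITICAL = 1, 2  # seconds, matching "(C)2 ge (W)1" above

def sustained_lag(instance, window="10m"):
    # Average replication lag over the window, from the mysqld_exporter metric.
    query = ('avg_over_time(mysql_slave_status_seconds_behind_master'
             '{instance="%s"}[%s])' % (instance, window))
    resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

lag = sustained_lag("db2160:13321")  # placeholder instance label
if lag >= CRITICAL:
    print("CRITICAL: %s ge %s" % (lag, CRITICAL))
elif lag >= WARNING:
    print("WARNING: %s ge %s" % (lag, WARNING))
else:
    print("OK: %s" % lag)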
[11:26:38] jynus: which was the new topology chain you wanted to add to orchestrator?
[11:26:54] db_inventory
[11:27:17] that's db1115 and db2093
[11:27:19] You can try it yourself with https://orchestrator.wikimedia.org/web/discover - add one of the hosts there, and it should work
[11:27:24] ah
[11:27:29] I tried once and it didn't work
[11:27:31] Yeah
[11:27:35] Cause it has the orchestrator db, I think
[11:27:40] ah!
[11:27:45] I see then why
[11:28:37] We had issues with that in the past
[11:28:55] I took a look at the time and couldn't figure out what the issue was
[11:28:59] I can try to check it again next week
[11:29:08] As I just tried and it didn't work
[11:30:38] And I just saw that db2098:3317 isn't being reported due to a grants issue
[11:30:40] That's weird
[11:30:42] I will get that fixed
[11:32:39] foxed
[11:32:40] fixed
[11:39:46] there are definitely some cross-checks needed between puppet - zarcillo - orchestrator - cumin aliases - prometheus
[11:48:04] yeah
[11:48:10] we need the source of truth
[11:48:24] we have tasks for that
[11:57:01] I wasn't really meaning a source of truth, but checks comparing different views
[14:00:20] Amir1: created T328131, thanks
[14:00:21] T328131: Cannot enable 2fa on labtestwiki - https://phabricator.wikimedia.org/T328131
[14:10:39] .. apparently something in that task causes 1password to think that the comment field is actually a 2fa token field, and it really wants me to unlock it to see if it can autofill something instead of letting me type a comment
[14:12:19] well that's annoying!
[14:49:32] sigh, everything is 404
[17:01:54] I've updated T327253 with my current findings, but it's not pretty and I don't think I'm much nearer a clear answer
[17:01:55] T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253
[17:15:49] Emperor: I have some answers that are easy
[17:16:01] Look in your example at: https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AFlying+Seagull.jpg
[17:16:44] Your issue is different from others I detected, but it certainly doesn't help that a file name/path is not stable for an object
[17:19:09] jynus: right, that would explain the behaviour, then, if the 16 June deletion failed to entirely succeed and left a ghost object
[17:19:29] yeah, it doesn't solve all issues
[17:19:52] that is why I say I only had a solution for *some* of the questions, the easiest ones
[17:20:25] and mw behaviour makes things more difficult to debug, sadly
[17:21:57] another thing I suspect
[17:22:14] is that there could be, in some cases, a delay between listing and downloading
[17:22:36] so it is justified that one works and the other doesn't (it is something that happens to me when doing backups)
[17:22:56] obviously it is not the main issue, but it would be a factor when running deletes
[17:31:16] sorry I am not a bringer of good news :-( , but I tried to warn you about digging deeper into this :-P
[17:31:25] jynus: I don't think that's the case here - if there were a request between rclone doing the list and then the subsequent COPY|DELETE, we'd expect to see that in the proxy-access.log
[17:32:21] oh, not as the cause of the issue, but as something we should take into account for a delete from a list gotten a few days ago
[17:32:46] definitely there is something else causing it (swift artifacts)
[17:33:33] Well, it's the weekend now, and I think I've earned a drink...
[17:33:39] indeed
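The orchestrator discovery step discussed above (adding the db_inventory chain, db1115 and db2093) can also be scripted against orchestrator's HTTP API instead of the /web/discover page. This is only a sketch: it assumes the upstream /api/discover/:host/:port endpoint, a default port of 3306, and an already-authenticated client, none of which are stated in the log.

# Sketch: trigger orchestrator discovery over its HTTP API.
import requests

ORCH_API = "https://orchestrator.wikimedia.org/api"

def discover(host, port=3306):
    # Ask orchestrator to (re)discover an instance; it answers with a
    # Code/Message JSON envelope describing the outcome.
    resp = requests.get("%s/discover/%s/%s" % (ORCH_API, host, port), timeout=10)
    resp.raise_for_status()
    return resp.json()

for host in ("db1115", "db2093"):  # the db_inventory hosts mentioned above
    print(host, discover(host).get("Message"))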
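For the T327253 "listed but not extant" investigation, the underlying cross-check can be sketched with python-swiftclient: walk a container listing and HEAD every object, flagging the ones that return 404. The auth URL, credentials, and container name below are placeholders, not production values; re-running the HEAD immediately before any delete would also cover the stale-listing concern raised at 17:22.

# Sketch: find objects that appear in a Swift container listing but 404 on HEAD.
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",  # placeholder auth endpoint
    user="account:user",                            # placeholder credentials
    key="REDACTED",
)

container = "example-container"  # placeholder container name
ghosts = []

# Compare the container listing against per-object HEADs: anything listed
# but answering 404 is a "ghost" object of the kind described in T327253.
_, listing = conn.get_container(container, full_listing=True)
for obj in listing:
    try:
        conn.head_object(container, obj["name"])
    except ClientException as exc:
        if exc.http_status == 404:
            ghosts.append(obj["name"])
        else:
            raise

print("%d ghost objects out of %d listed" % (len(ghosts), len(listing)))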