[08:19:45] Amir1 arnaudb any schema change that ran recently on s4?
[08:19:50] (big one)
[08:20:46] not on masters yet
[08:23:11] there is https://phabricator.wikimedia.org/T348183 this is the one
[09:00:46] Ah cool
[09:00:54] I was checking why clouddb1019 was lagging and I was curious
[09:35:55] long story short, this schema update has been run everywhere except on s4 masters because #bigtable :-) it'll wait until after the production freeze
[09:36:11] yeah, makes sense
[10:02:26] how should we discuss?
[10:03:35] I guess it is: do we want to back them up yearly? do we want to test that yearly backup?
[10:04:12] yes, I mean: is here ok, or should we go private?
[10:04:23] let's do it here
[10:04:29] From my side, both answers are yes
[10:04:49] ok, but testing requires space and destruction
[10:05:08] how should that go?
[10:05:18] maybe I'm missing a point here but could we keep a set of hosts scheduled to be decommissioned for that purpose only?
[10:05:36] No, let's not do that. Hosts are supposed to be decommissioned on a specific date
[10:05:46] the issue is es doesn't fit on regular db hosts
[10:06:38] jynus: If they are logical backups, we can just recover a few wikis
[10:07:01] ok, so that narrows it
[10:07:21] let's say, every year, we take the es ro backups
[10:07:44] then we recover them partially (details to be decided) to a test host
[10:08:07] and after testing, we destroy the long-term backups
[10:08:13] if we can recover a couple of 2TB ones, we can probably assume it is all fine
[10:08:16] and store them again
[10:09:18] that could be it, yes
[10:09:40] the only thing is that timings may not be realistic, as I believe es still uses HDs?
[10:10:03] yeah, but I don't think that is super important
[10:10:47] and what would be a good date in the year to do so?
[10:11:02] I may need to depool 1 es ro host just to be safe
[10:11:10] That is for you to decide, I don't have any specific preference
[10:11:28] but what I mean is, would it be possible and would you collaborate with that?
[10:11:37] what?
[10:11:46] to depool one es server?
[10:11:56] sure, I don't mind
[10:12:03] Just a general heads up is enough
[10:12:09] So it doesn't get restarted, rebooted etc
[10:12:26] so, let's say, we schedule it as early in the year as possible
[10:13:05] and I create a ticket and we coordinate so there is no other maintenance ongoing, ok?
[10:13:20] sure, a ticket and a mention at the team meeting is more than enough I think
[10:13:57] what I wonder is: when hardware replacement is scheduled, could we do a full test?
[10:14:18] Hardware?
[10:14:18] e.g. recover it once on the new hardware to test it, once every 5 years
[10:14:27] on hw replacement of es
[10:14:27] yeah, we can
[10:14:35] I think we are buying HW this FY
[10:14:36] so whenever that happens
[10:14:38] So we can do it
[10:14:54] you let me "test it", and then you can either remove it or keep it
[10:15:00] yeah, no problem
[10:15:02] just on one host, ofc
[10:15:39] it might be tight but es1 is 7.9 and our new hosts (normal db hosts) are 8.6
[10:15:44] so you could even use one of those
[10:15:52] tight?
[10:16:01] ah, I see
[10:16:10] I was talking about es hw replacement only
[10:16:26] so there is a full recovery at least every 5 years, not just partial tests
[10:16:35] Yes, but what I mean is that we could even use those hosts (as we buy them more often) for more tests
[10:17:03] yeah, that was the original intention for db-testing hosts
[10:17:12] but in general, for all dbs, not just es
[10:17:56] ok, so I think we have a plan
[10:18:00] I know, I am just proposing that extra thing in case you want to test es1 more often than every 5 years
[10:18:08] But anyway, to be defined
[10:18:28] anything you want to ask, or are concerned about?
[10:18:56] No, just a heads up whenever you plan to start the backup is enough so we can all coordinate
[10:19:24] I will write the plan down somewhere (maybe on the wiki) and will send it for your review, so we are both in agreement
[10:19:30] sounds good
[10:19:31] thank you
[10:19:49] and we can add the details there (e.g. how much heads-up time you normally need, etc)
[10:20:02] yep
[10:20:32] thanks for this conversation
[10:20:39] thank you!
[10:22:25] do backups do any sort of periodic integrity checking (e.g. "every X days check that the checksum of file X matches what it ought to")?
[10:23:06] Emperor: the building blocks of that are being built: automatic testing (through replication) and checks (which amir is working on atm)
[10:23:49] then there is metadata gathering for each file backed up
[10:25:32] right
[10:26:57] Emperor: in reality, the best check is to use them regularly to provision production, but it is not that easy
[10:27:42] also, as files for databases change continuously, it is not as easy to create checksums
[10:27:52] Mmm (I was thinking of ceph, which "scrubs" each of its placement groups from time to time to check their on-disk contents are still correct)
[10:28:24] yeah, that is something that is planned for mediabackups, which is more static
[10:28:34] 👍
[10:28:51] for dbs, it is better to recover them and replicate and see that it matches production/replicates well
[10:29:02] as the backup files are not the final product
[10:29:12] for mediabackups it is "easier"
[10:29:19] except for the scale
[10:30:33] :)
[10:30:51] as you seem so interested
[10:31:18] let me ask you for your help to close the deployment at: https://phabricator.wikimedia.org/T269108
[10:33:21] I think if we can get it right, an ACL would be the way to go there
[10:35:03] (I think also that tweaking the account ACL for mw:media is the sort of thing not to be done during a production change freeze)
[10:35:10] given we are notified every time a new wiki is created
[10:35:19] plus I scan the mw config, I am not worried about the updates
[10:35:35] I am more lost about the swift-specific commands for a single wiki
[10:35:47] oh, I don't want to do it now
[10:36:22] but it is the last thing blocking a fully automated setup in puppet (I don't want to do it with the wrong account)
[10:36:40] so I would like to have a plan for next year
[10:39:51] OK, maybe put it in as a KR for Q3?
[10:47:04] Done
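The "every X days check the checksum of file X" idea discussed above amounts to a scrub loop over the static backup files. Below is a minimal sketch of what such a periodic check could look like, assuming per-file SHA-256 digests were recorded in a metadata store at backup time; the paths, table and column names are hypothetical, not the real mediabackups layout.

```python
#!/usr/bin/env python3
"""Hypothetical periodic integrity scrub for static backup files.

Assumes a metadata store that recorded each file's SHA-256 at backup
time; the sqlite schema below is illustrative only, not the actual
mediabackups metadata layout.
"""
import hashlib
import sqlite3
from pathlib import Path

BACKUP_ROOT = Path("/srv/backups/media")             # assumed layout
METADATA_DB = "/srv/backups/media-metadata.sqlite"   # assumed location


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB objects don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def scrub() -> int:
    """Compare every recorded checksum against the file on disk."""
    mismatches = 0
    conn = sqlite3.connect(METADATA_DB)
    for rel_path, recorded in conn.execute(
        "SELECT rel_path, sha256 FROM backed_up_files"
    ):
        actual = BACKUP_ROOT / rel_path
        if not actual.exists():
            print(f"MISSING  {rel_path}")
            mismatches += 1
        elif sha256_of(actual) != recorded:
            print(f"CORRUPT  {rel_path}")
            mismatches += 1
    conn.close()
    return mismatches


if __name__ == "__main__":
    raise SystemExit(1 if scrub() else 0)
```

As the conversation notes, this is less useful for the database dumps themselves: those backup files are not the final product, so the meaningful verification there is recovering them and checking that they replicate cleanly against production.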
[11:53:17] PROBLEM - MariaDB sustained replica lag on s1 on db2146 is CRITICAL: 95.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2146&var-port=9104
[11:58:29] RECOVERY - MariaDB sustained replica lag on s1 on db2146 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2146&var-port=9104
[19:24:56] PROBLEM - MariaDB sustained replica lag on s6 on db2158 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[19:26:02] PROBLEM - MariaDB sustained replica lag on s6 on db1213 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=13316
[19:26:16] PROBLEM - MariaDB sustained replica lag on s6 on db2171 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2171&var-port=13316
[19:27:54] PROBLEM - MariaDB sustained replica lag on s6 on db2158 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[19:29:24] RECOVERY - MariaDB sustained replica lag on s6 on db2158 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[19:30:32] RECOVERY - MariaDB sustained replica lag on s6 on db1213 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=13316
[19:30:46] RECOVERY - MariaDB sustained replica lag on s6 on db2171 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2171&var-port=13316
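The alerts above use the usual check output of "value ge threshold", with (C) and (W) marking the critical and warning thresholds (2 and 1 seconds of sustained lag here). As a rough sketch of that threshold logic only, assuming a Prometheus endpoint and the standard mysqld_exporter lag metric rather than the actual production check definition:

```python
#!/usr/bin/env python3
"""Illustrative threshold logic for a "sustained replica lag" check.

The Prometheus URL, query and instance label below are assumptions for
the sketch, not the production configuration. Exit codes follow the
Nagios/Icinga convention: 0 OK, 1 WARNING, 2 CRITICAL.
"""
import sys
import requests

PROMETHEUS_URL = "http://prometheus.example.org/api/v1/query"  # placeholder
WARNING_THRESHOLD = 1.0   # seconds, the (W) in the alert text
CRITICAL_THRESHOLD = 2.0  # seconds, the (C) in the alert text

# Average the lag over a window instead of taking the instantaneous
# value, so a single slow transaction does not page anyone ("sustained").
QUERY = (
    'avg_over_time(mysql_slave_status_seconds_behind_master'
    '{instance="db2146:9104"}[10m])'
)


def check() -> int:
    response = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=10)
    response.raise_for_status()
    results = response.json()["data"]["result"]
    lag = float(results[0]["value"][1]) if results else 0.0

    if lag >= CRITICAL_THRESHOLD:
        print(f"CRITICAL: {lag:.1f} ge {CRITICAL_THRESHOLD:g}")
        return 2
    if lag >= WARNING_THRESHOLD:
        print(f"WARNING: {lag:.1f} ge {WARNING_THRESHOLD:g}")
        return 1
    print(f"OK: (C){CRITICAL_THRESHOLD:g} ge (W){WARNING_THRESHOLD:g} ge {lag:g}")
    return 0


if __name__ == "__main__":
    sys.exit(check())
```

The recovery lines in the log read the same way: "(C)2 ge (W)1 ge 0" means the critical threshold is 2, the warning threshold is 1, and the measured lag is back to 0.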