[00:05:14] PROBLEM - MariaDB sustained replica lag on es5 on es1025 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1025&var-port=9104
[00:06:22] RECOVERY - MariaDB sustained replica lag on es5 on es1025 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1025&var-port=9104
[01:10:28] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 310.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:28:20] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[06:21:32] marostegui: sorry to ping, I'm really confused by this :( https://phabricator.wikimedia.org/T328255#8573172
[06:22:56] I'll check
[07:01:34] I am going to try to switch over m2 in a bit
[09:04:58] jynus: how is the mydumper/new tool progressing? Asking to see if I can come up with a rough estimation of when I can migrate m5 to 10.6 safely
[09:05:48] yes, I closed it because it is no longer a blocker- just a slowdown
[09:06:25] so you feel confident about logical backups for 10.6?
[09:07:14] no one can feel confident about mydumper, but it is ok for now
[09:07:33] yeah, what I mean is, do you feel ok if I migrate m5 to 10.6?
[09:07:48] but please update me, because I may need to set up a higher version for 10.6
[09:07:55] yeah, go ahead
[09:08:13] so I want to migrate m5 to 10.6.10
[09:08:34] that means db1117 (backup source for all misc hosts)
[09:08:38] the thing is that m5 depends on the shared db1117
[09:08:43] yep
[09:08:47] so you will upgrade all misc hosts at the same time
[09:08:50] and db2160 (codfw backup source)
[09:08:57] yes
[09:09:12] but only the m5 master
[09:09:13] it would be ideal if you could start with something less impactful
[09:09:25] such as db_inventory
[09:09:38] or another host we back up that doesn't affect all hosts
[09:09:39] that is db1115 and db2093?
[09:09:46] yeah, as an example
[09:09:57] yeah, that has orchestrator
[09:09:59] any other work that doesn't involve 5 sections
[09:10:00] but that should be fine I think
[09:10:07] just out of caution
[09:10:15] and when we have our first one working for some time
[09:10:24] then doing the 5 misc
[09:10:26] so we do back up those logically too?
[09:10:32] yeah
[09:10:40] ok, I am going to do that then
[09:10:52] I am open also with backup1
[09:11:06] basically, something less interdependent
[09:11:53] even doing s* or x1 with just one section on the backup source would be ok too
[09:13:30] yeah, s* is way too early :)
[09:14:03] so the issue is this- mydumper "works", meaning that I can back up and recover
[09:14:25] but I think a mydumper upgrade is mandatory
[09:14:32] don't worry, I get it, I will go for db_inventory for now and then in a few weeks we can re-evaluate m5
[09:14:46] and that leads to 2 issues- later version, more instability and a new format
[09:14:52] both are non-blockers
[09:15:03] but it is scary to do many at the same time
[09:15:29] https://phabricator.wikimedia.org/T328408
[09:15:56] (there were 2 issues with mydumper, the mariadb one, which was worked around, but mydumper is a bit in question in the future)
[09:16:41] we can do that, then reevaluate so we get comfortable with the changes (including you!)
[09:49:03] thank you, for example, I want to limit issues like the last one at T327155 to a small subset at a time
[09:49:04] T327155: Setup dbprov1004 and dbprov2004 as an expansion of the dbprov (database provisioning) cluster, in preparation of binlog backup implementation - https://phabricator.wikimedia.org/T327155
[09:49:46] What is the issue there in that screenshot?
[09:50:22] See the notes- it is not yellow/red but it will be when some time passes
[09:50:42] Which notes?
[09:50:55] on the right: "Last job for this section XXX failed!"
[09:50:59] I am looking at https://phabricator.wikimedia.org/T327155#8573519
[09:51:08] Aaah, I was focused on the dark line
[09:51:19] sorry, that's where my mouse was
[09:51:31] yeah, I thought there was something wrong with that line specifically :)
[09:51:40] don't worry, it will get red or yellow in a few hours
[09:51:48] so it would be obvious
[09:52:17] I think I give 1 week + 1 day of threshold
[09:52:42] so the backup status doesn't get red every day at 0 hours
[09:53:10] on http://localhost:8000/dbbackups/jobs/?search=failed it is clearer
[09:56:03] I think it is a grant issue- the leftovers from the 10.1 -> 10.4 history grants, I would guess
[09:58:09] BTW, that CSS style (making the current row darker) may look like a silly thing, but I found it super useful when reading lots of cells of data
[10:02:53] also, detecting a failure is always good- as it is better than not realizing it happened at all!
[10:21:51] Amir1: (/me catching up on backlog) when running puppet via cumin on large aliases (like A:db-all) please use a batch size (like -b 20) to avoid overloading the puppetmasters ;)
[12:53:40] Amir1: you can proceed with s4 codfw if you want
[12:53:47] I finished the switchover yesterday
[13:33:52] thanks
[13:48:41] marostegui: FYI T320534, there will be more writes to PC
[13:48:42] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534
[13:48:51] my plan for today is to fix the mobile issue
[13:50:44] Amir1: roger, thanks
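
A minimal sketch of the "1 week + 1 day" freshness threshold discussed at 09:52:17, assuming one directory per section under a latest-dumps path; the path, layout and the shell approach are illustrative assumptions, not the actual dbbackups check:

    # Treat a dump as stale after 8 days (1 week + 1 day), so the status does not
    # flip to red the moment a new day starts. The path is a placeholder.
    THRESHOLD_DAYS=8
    find /srv/backups/dumps/latest -mindepth 1 -maxdepth 1 -type d -mtime +"$THRESHOLD_DAYS" \
        -printf 'STALE: %p (last modified %TY-%Tm-%Td)\n'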
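
To illustrate the leftover-grants hypothesis at 09:56:03, a hedged sketch of how one could inspect accounts on the backup source (run locally on the host, e.g. db1117); the suspect user and host mask below are placeholders, not the actual accounts involved:

    # List every account so pre-10.4 leftovers stand out.
    sudo mysql -e "SELECT User, Host FROM mysql.user ORDER BY User, Host;"
    # Then check the privileges of a suspect account (placeholder name and host mask).
    sudo mysql -e "SHOW GRANTS FOR 'dump'@'10.64.%';"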
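
For the cumin advice at 10:21:51, a sketch of a batched puppet run across the A:db-all alias; only the -b 20 batch size comes from the log, the sleep value and the run-puppet-agent command are assumptions for illustration:

    # Run puppet on all DB hosts 20 at a time, sleeping 30s between batches,
    # so the puppetmasters are not hit by the whole alias at once.
    sudo cumin -b 20 -s 30 'A:db-all' 'run-puppet-agent'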