[03:27:10] PROBLEM - MariaDB sustained replica lag on s4 on db1244 is CRITICAL: 10.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1244&var-port=9104
[03:35:10] RECOVERY - MariaDB sustained replica lag on s4 on db1244 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1244&var-port=9104
[09:11:54] The new test-s4 master is db1176, db1125 is being decommissioned
[09:13:30] one question, I see db2185 is set to read_only false, is that expected?
[09:13:52] it is a codfw db_inventory
[09:14:03] yeah, but it has the master role, so I guess that's a puppet issue there
[09:14:36] but puppet expects read only True, as writes only happen on eqiad
[09:14:51] (do they?)
[09:14:57] But it has the role master on its yaml, that's why it probably has RO=off
[09:15:07] I don't know though why it has role master
[09:15:12] Maybe we can just simply remove it
[09:15:18] Cause no writes happen there
[09:15:20] I think you are understanding it the opposite way
[09:15:29] puppet expects it to be in read only
[09:15:32] it is not
[09:15:42] How can it expect it to be RO if it has master on its role
[09:16:01] because it is a db_inventory host, only writes on one eqiad
[09:16:10] that's configurable
[09:16:16] Then why does it have role master?
[09:16:58] sorry, I am not sure about that. My question is, is that host being written to?
[09:17:17] if not, can it be set as read only so we get rid of the alert?
[09:17:54] ah, I see what you say
[09:18:15] IMHO the config is wrong, but the alert is right
[09:18:39] I have a patch for that: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126042
[09:19:35] with: writeable_dc: 'eqiad' / replication_type: 'unidir' I believe both the config and the alert should be ok
[09:19:50] leaving only eqiad as the master, at least at the moment
[09:20:19] If you don't want to deal with this, I can take over, just if you can confirm codfw is not being written to (that you know)
[09:20:37] and I will get rid of the alert and bad config, now that I understood your complaint
[09:30:06] codfw is not used I think
[09:30:13] let me review that patch
[09:31:18] db2093 was the host db2185 replaced and it also had master role: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895664/1/hieradata/hosts/db2093.yaml
[09:32:08] yeah, I think the problem is the config
[09:32:12] puppet was bad
[09:32:33] I'll fix that and simplify the code, need to test it before deploy
[09:32:44] but I want to make sure that's the intended status
[09:32:49] So your patch looks good, but you can also include removing the master role from it
[09:33:01] ok, I think that's doable
[09:33:47] note that "master" is also on the misc codfw hosts, and those are read only
[09:34:02] so I was just making those and misc the same
[09:34:56] maybe "standalone" would be more appropriate for db_inventory?
[09:35:10] like es1?
[09:35:45] in any case, now that I understand the problem, let me send an updated proposal
[09:35:53] probably after the switchover
[09:36:11] It is just that I was confused before talking to you
[09:36:25] now I think I know what's the issue
[09:38:21] on a separate alert: "FIRING: JobUnavailable: Reduced availability for job mysql-test" (I wonder if it just needs time or needs a db inventory update?)
[09:42:39] They shouldn't be standalone cause they do have replication
[09:42:45] db1215 -> db2185
[09:43:11] mysql-test alert, that may be coming from me decommissioning db1125
[09:43:21] but db1176 is the new master, which I just updated in zarcillo
[09:54:09] I checked and there is something weird: https://phabricator.wikimedia.org/P74202
[09:58:34] Why is it repeated so many times?
[09:59:40] the query returns it 4 times, checking why
[09:59:56] I am checking in zarcillo if there's something strange there
[10:06:48] I think we need to remove "test-s1 | eqiad | db1133"
[10:06:53] from the masters table
[10:07:07] Ok, let me get rid of it
[10:07:15] however, it should happen
[10:07:29] *shouldn't happen anyway
[10:07:46] anyway, the job got resolved, it wasn't that
[10:07:52] done
[10:07:57] what was the issue?
[10:08:16] there was no issue, it just takes 30 minutes for the graphs to update
[10:09:59] I am going to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125114
[10:11:55] go for it
[11:03:26] Emperor: o/ - ok if I work on ms-be1091?
[11:03:31] I need to reboot it most likely
[11:03:43] (IIRC it is one of the standby nodes)
[11:04:07] elukey: please do, I am currently abusing ms-be2088 a bit.
[11:04:16] I saw it yes, good luck :)