[03:27:10] PROBLEM - MariaDB sustained replica lag on s4 on db1244 is CRITICAL: 10.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1244&var-port=9104
[03:35:10] RECOVERY - MariaDB sustained replica lag on s4 on db1244 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1244&var-port=9104
[09:11:54] The new test-s4 master is db1176, db1125 is being decommissioned
[09:13:30] one question, I see db2185 is set to read_only false, is that expected?
[09:13:52] it is a codfw db_inventory
[09:14:03] yeah, but it has the master role, so I guess that's a puppet issue there
[09:14:36] but puppet expects read only True, as writes only happen on eqiad
[09:14:51] (do they?)
[09:14:57] But it has the role master on its yaml, that's why it probably has RO=off
[09:15:07] I don't know though why it has role master
[09:15:12] Maybe we can just simply remove it
[09:15:18] Cause no writes happen there
[09:15:20] I think you are understanding it the opposite way
[09:15:29] puppet expects it to be in read only
[09:15:32] it is not
[09:15:42] How can it expect it to be RO if it has master on its role
[09:16:01] because it is a db_inventory host, only writes on one eqiad
[09:16:10] that's configurable
[09:16:16] Then why does it have role master?
[09:16:58] sorry, I am not sure about that. My question is, is that host being written to?
[09:17:17] if not, can it be set as read only so we get rid of the alert?
[09:17:54] ah, I see what you say
[09:18:15] IMHO the config is wrong, but the alert is right
[09:18:39] I have a patch for that: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126042
[09:19:35] with: writeable_dc: 'eqiad' / replication_type: 'unidir' I believe both the config and the alert should be ok
[09:19:50] leaving only eqiad as the master, at least at the moment
[09:20:19] If you don't want to deal with this, I can take over, just if you can confirm codfw is not being written to (that you know)
[09:20:37] and I will get rid of the alert and bad config, now that I understood your complaint
[09:30:06] codfw is not used I think
[09:30:13] let me review that patch
[09:31:18] db2093 was the host db2185 replaced and it also had master role: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895664/1/hieradata/hosts/db2093.yaml
[09:32:08] yeah, I think the problem is the config
[09:32:12] puppet was bad
[09:32:33] I fix that and simplify the code- need to test it before deploy
[09:32:44] but I want to make sure that's the intended status
[09:32:49] So your patch looks good, but you can also include removing the master role from it
[09:33:01] ok, I think that's doable
[09:33:47] note that "master" is also on the misc codfw hosts, and those are read only
[09:34:02] so I was just making those and misc the same
[09:34:56] maybe "standalone" would be more appropriate for db_inventory?
[09:35:10] like es1?
[09:35:45] in any case, let me now that I understood the problem send an updated proposal
[09:35:53] probably after the switchover
[09:36:11] It is just that I was confused before talking to you
[09:36:25] now I think I know what's the issue
[09:38:21] on a separate alert: "FIRING: JobUnavailable: Reduced availability for job mysql-test" (I wonder if it just needs time or needs a db inventory update?)
[09:42:39] They shouldn't be standalone cause they do have replication
[09:42:45] db1215 -> db2185
[09:43:11] mysql-test alert, that may be coming from me decommissioning db1125
[09:43:21] but db1176 is the new master, which I just updated in zarcillo
[09:54:09] I checked and there is something weird: https://phabricator.wikimedia.org/P74202
[09:58:34] Why is it repeated so many times?
[09:59:40] the query returns it 4 times, checking why
[09:59:56] I am checking in zarcillo if there's something strange there
[10:06:48] I think we need to remove "test-s1 | eqiad | db1133"
[10:06:53] from the masters table
[10:07:07] Ok, let me get rid of it
[10:07:15] however, it should happen
[10:07:29] *shouldn't happen anyway
[10:07:46] anyway, the job got resolved, it wasn't that
[10:07:52] done
[10:07:57] what was the issue?
[10:08:16] there was no issue, just it takes 30 minutes for the graphs to update
[10:09:59] I am going to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125114
[10:11:55] go for it
[11:03:26] Emperor: o/ - ok if I work on ms-be1091?
[11:03:31] I need to reboot it most likely
[11:03:43] (IIRC it is one of the standby nodes)
[11:04:07] elukey: please do, I am currently abusing ms-be2088 a bit.
[11:04:16] I saw it yes, good luck :)
[14:51:05] with permission of the DBAs, I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125114 because the deployment was done hours ago, but I didn't merge puppet because of the outage
[14:51:25] (in any case it won't touch databases, just backup hosts)
[14:51:27] no issues from my side
[14:51:39] but want to make sure reality matches puppet
[14:51:42] So the new Config-J systems you can hot-swap a drive, but it involves a controller reset which pauses I/O to all the spinning disks for ~18s
[14:52:06] ^ marostegui this is relevant to dbas, that would cause a mariadb crash
[14:52:14] We don't use config J
[14:52:17] which is what happens when io timeouts
[14:52:35] well, in case the same controller is to be used for dbs
[14:53:10] Emperor: do you know which controller is that or point me to a procurement task so I can check?
[14:53:15] jynus: I don't think that's proposed, since database systems require proper RAID, battery backup etc
[14:53:31] marostegui: then no impact on you
[14:53:40] sorry
[14:53:55] marostegui: two ticks
[14:54:32] marostegui: https://phabricator.wikimedia.org/T368928
[14:55:45] * Emperor trying to figure out if this is Good Enough for swift nodes
[14:58:16] The beauty of supermicro invoices
[14:59:28] yeah, you need a degree to read them
[14:59:50] most of the spec are embedded in a single line obfuscated config "name"
[15:00:23] yeah I am basically copy and pasting each line in google and check what it is
[15:00:58] but yes, config j and e are different, e has a controller
[15:03:28] I think the controller is a S3908
[15:05:47] (there's a lot of detail in T384003 )
[15:05:48] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003
[15:06:15] Emperor: yeah, but config J doesn't have that one
[15:07:45] marostegui: err, you mean that config E (which you use for databases) doesn't have the S3908? I agree with that.
[15:08:07] ms-be2088 does I think have a S3908, and it is a Config J-10G
[15:08:57] yes, sorry, config E has S3916L
[15:09:30] Sorry, I didn't mean to put you into a rabbit hole
[15:09:31] Emperor: In any case, I will ask DC ops to test that hot swap (as we've not tested the host yet)
[15:09:44] I think that's sensible
[15:10:02] but my main takeaway for that is that that controller was ok for backups, but may not be for dbs
[15:10:12] precisely for that stuff
[15:10:17] if we cannot do hot swaps, it is not, no
[15:11:53] I'm going to test a controller restart on one of the running-swift SM nodes (in eqiad)
[15:15:22] https://phabricator.wikimedia.org/T388684
[15:15:44] sigh, that tells me that ms-be1081 is in the process of losing a disk
[15:15:59] * Emperor hates hardware