[02:05:57] PROBLEM - MariaDB sustained replica lag on s7 on db1174 is CRITICAL: 3.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1174&var-port=9104
[02:06:15] PROBLEM - MariaDB sustained replica lag on s7 on db2182 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2182&var-port=9104
[02:06:21] PROBLEM - MariaDB sustained replica lag on s7 on db2218 is CRITICAL: 4.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2218&var-port=9104
[02:06:57] PROBLEM - MariaDB sustained replica lag on s7 on db1158 is CRITICAL: 6.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1158&var-port=9104
[02:07:01] PROBLEM - MariaDB sustained replica lag on s7 on db1227 is CRITICAL: 6.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1227&var-port=9104
[02:07:07] PROBLEM - MariaDB sustained replica lag on s7 on db1181 is CRITICAL: 8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[02:07:11] PROBLEM - MariaDB sustained replica lag on s7 on db2122 is CRITICAL: 3.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2122&var-port=9104
[02:07:15] PROBLEM - MariaDB sustained replica lag on s7 on db2168 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2168&var-port=9104
[02:07:21] RECOVERY - MariaDB sustained replica lag on s7 on db2218 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2218&var-port=9104
[02:07:57] RECOVERY - MariaDB sustained replica lag on s7 on db1174 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1174&var-port=9104
[02:07:57] RECOVERY - MariaDB sustained replica lag on s7 on db1158 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1158&var-port=9104
[02:08:01] RECOVERY - MariaDB sustained replica lag on s7 on db1227 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1227&var-port=9104
[02:08:07] RECOVERY - MariaDB sustained replica lag on s7 on db1181 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[02:08:11] RECOVERY - MariaDB sustained replica lag on s7 on db2122 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2122&var-port=9104
[02:08:15] RECOVERY - MariaDB sustained replica lag on s7 on db2168 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2168&var-port=9104
[02:08:15] RECOVERY - MariaDB sustained replica lag on s7 on db2182 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2182&var-port=9104
[04:57:50] going to switchover the s6 primary master
[05:35:27] Going to disconnect replication from es4 hosts and make them standalone
[05:47:47] es4 is now standalone, I am going to leave it like this for a bit to make sure nothing is broken
[05:47:52] The old masters are RO at the MySQL level too
[06:18:21] Going to disconnect replication from es5 hosts and make them standalone
[06:49:12] I am a bit puzzled about why es2025 (es5 - standalone) is alerting on read only and expecting it to be False, any help would be appreciated? All standalone external stores have all their hosts in read only, so I am not sure why this host (and es2020, es2021, es2022, es2023) are alerting with expected False. They are the only hosts, as can be seen at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=es20
[06:50:00] Any thoughts?
[06:56:01] found the issue
[07:05:33] what was it?
[07:28:07] I think this is going to fix it https://gerrit.wikimedia.org/r/c/operations/puppet/+/1032290
[07:30:15] yeah, that fixed it
[07:30:21] +command[check_mariadb_read_only_es5]=db-check-health --port=3306 --icinga --check_read_only=true --process
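Context for the fix above: the es5 Icinga check had been generated expecting read_only to be False, while standalone external store hosts run read-only. A minimal sketch of what a read-only check along the lines of that db-check-health command does conceptually, assuming pymysql and a hypothetical "nagios" monitoring account (this is not the actual script):

```python
# Minimal sketch of a read-only health check; NOT the actual db-check-health
# script, only the idea it implements: compare @@global.read_only against
# the value the check was generated to expect.
import sys

import pymysql


def check_read_only(host: str, port: int, expected: bool) -> int:
    conn = pymysql.connect(host=host, port=port, user="nagios")  # hypothetical user
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.read_only")
            actual = bool(cur.fetchone()[0])
    finally:
        conn.close()
    if actual is expected:
        print(f"OK: read_only is {actual}")
        return 0  # Icinga OK
    print(f"CRITICAL: read_only is {actual}, expected {expected}")
    return 2  # Icinga CRITICAL


if __name__ == "__main__":
    # The es5 hosts alerted because the check expected False while standalone
    # external store hosts run with read_only=ON, so expected must be True.
    sys.exit(check_read_only("localhost", 3306, expected=True))
```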
[07:31:07] omg wikitech has a dark mode
[07:44:48] I am going to try to document the process of adding and removing external store sections
[07:44:58] At least referencing the tickets where I've been very verbose
[07:45:54] 👍
[07:47:45] going for https://phabricator.wikimedia.org/T364814 with my spotty internet and a tmux
[07:47:58] I will be ready to take over if you need me to
[09:45:36] marostegui: Are you done with db1173 (the old s6 master)?
[09:45:46] I am
[09:45:50] awesome
[09:45:56] I am adding pcX to dbctl at the moment
[09:46:07] Just pushed the section config, I am going to configure the instances now
[09:46:08] wohoo
[09:48:09] Ha
[09:48:11] dbctl broke
[09:48:13] Damn regex
[09:48:15] fixing
[09:48:51] uh?
[09:49:01] same as this volans https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025699
[09:49:25] oh wait
[09:49:27] it is already there
[09:49:34] I've seen a patch from Scott
[09:49:35] about that
[09:49:55] it is missing the same on the DEFAULT section from what I can see
[09:50:02] checking
[09:50:07] I will send a patch
[09:50:12] and you can review if it makes sense
[09:50:21] sure
[09:51:59] 'pc1' does not match any of the regexes: 'DEFAULT', '^s1[01]$', '^s[124-8]$'
[09:52:11] yep, sending the patch now
[09:52:57] volans: you think this is it? https://gerrit.wikimedia.org/r/1032407
[09:53:39] mmmh I'm not sure, that regex doesn't include es*, which is the approach we're mimicking for pc*, no?
[09:53:43] that's weird, the pc sections should be in external loads
[09:53:47] exactly
[09:53:49] not in sectionloads
[09:53:59] Then I don't know :(
[09:54:17] I am going to undo the dbctl changes for the instances so dbctl is back operative while this gets figured out
[09:54:29] move line 69 to 77
[09:55:09] ok one sec, let me get dbctl back working
[09:55:11] just in case
[09:55:12] oh wait
[09:55:14] that's already there
[09:55:18] it's already volans
[09:55:28] let me check the code quickly
[09:55:30] dbctl back
[10:00:28] marostegui: you set "flavor": "regular",
[10:00:35] should be "flavor": "external",
[10:00:41] https://phabricator.wikimedia.org/T362786#9803692
[10:01:08] oooooooh
[10:01:10] fixing
[10:01:10] so it was trying to validate the section object as if it were a regular section, not an external section
[10:03:17] ok changed
[10:03:21] Let me try to edit pc1011 again
[10:04:20] k
[10:04:27] worked! https://phabricator.wikimedia.org/P62484
[10:04:30] yay
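The 09:51:59 error is the shape of a JSON Schema patternProperties failure: with "flavor": "regular", the pc1 section object was validated against the core-section schema, whose regexes only admit DEFAULT, s10/s11 and s1-s8. A minimal sketch, not the real conftool schema, reproducing the mechanism:

```python
# Minimal sketch, NOT the real conftool schema: a "regular"-flavor section
# is validated against the core-section key regexes, so a parsercache key
# fails exactly as seen at 09:51:59. The "external" flavor would use a
# different schema that admits es*/pc* names.
import jsonschema

regular_flavor_sections = {
    "type": "object",
    "patternProperties": {
        "DEFAULT": {},
        "^s1[01]$": {},
        "^s[124-8]$": {},
    },
    "additionalProperties": False,
}

try:
    jsonschema.validate({"pc1": {}}, regular_flavor_sections)
except jsonschema.ValidationError as e:
    # 'pc1' does not match any of the regexes: 'DEFAULT', '^s1[01]$', '^s[124-8]$'
    print(e.message)
```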
[10:04:52] has the MW side already been merged?
[10:04:55] and deployed
[10:05:06] volans: it depends
[10:05:11] we merged the ignore part
[10:05:12] the part that ignores pc*
[10:05:14] ok
[10:05:15] :D
[10:05:30] XD
[10:05:42] I'm actually feeling spicy, wanna get the read part deployed today?
[10:05:54] on a Thursday? let's do it Monday instead XD
[10:06:01] let's make sure the ignore works too :D
[10:06:02] I still have to add all the instances
[10:06:19] ugh, boring.
[10:06:24] XDDD
[10:06:55] OTOH, I'm about to start using the new replication user
[10:07:10] deploying it will be fun
[10:10:35] wow /var/cache/conftool/dbconfig on cumin1002 is ~180M and covers just since January
[10:11:04] I wonder if we should add some cleanup timer there, or whether it could be useful in some way to keep historical data
[10:11:32] there is plenty of space, so no worries at all
[10:11:32] I don't think we've ever used any of that data since we started using dbctl
[10:11:47] So nice to see pc on https://noc.wikimedia.org/dbconfig/eqiad.json
[10:11:59] going to add the spares now
[10:12:51] volans: we do way more schema changes
[10:13:02] I'm also sorta sad for phabricator paste
[10:13:49] last time I counted, we did around 90 schema changes a year, each needing 200 depools/repools, and each repool is 5 dbctl actions, you get the idea
[10:16:03] I guess we could 1/ delete old pastes and 2/ not save the diff to phab if some conditions are met? (like for example only the weight changes from X to Y with both X and Y > 0) :D
[10:17:49] it's useful to have them somewhere public. but maybe not phab paste?
[10:18:26] like an http server where we could just dump the diff file and then remove the old files from time to time
[10:21:10] no, you can't put them in swift ;p
[10:21:41] we wouldn't even if you insisted :P
[10:21:57] WMCS object storage? :P
[10:22:20] sending them over the wire to wmcs will be fun
[10:22:35] * Emperor briefly read that as "over the wire to emacs"
[10:22:51] * dhinus is sure there is an emacs-to-s3 interface
[10:23:20] https://github.com/mattusifer/s3ed
[10:23:22] of course there is
[10:23:38] :)
[10:23:49] there are multiple actually
[10:25:00] off topic: is it ok if I merge and apply this today? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029709
[10:25:37] dhinus: sure, no issue from our side
[10:26:07] volans Amir1 I think it is all done now https://phabricator.wikimedia.org/T362786#9803820
[10:26:45] marostegui: awesome. Wanna try a "switchover" to replace an existing pc host?
[10:26:55] it's being ignored so it's the safest time to do it
[10:26:55] marostegui: thanks, I'll send a ping here when I start the cookbook
[10:27:06] and we should probably document how to do it somewhere
[10:29:58] Amir1: it should just be a set-master, no?
[10:30:01] dhinus: sure
[10:30:47] I have no idea
[10:30:53] Amir1: I am trying
[10:30:57] Thanks <3
[10:31:03] are the spares usable immediately by any pc section?
[10:31:10] yes
[10:31:19] it's a cache so it can move around
[10:31:20] haha dbctl doesn't have safety checks for pc
[10:31:21] https://phabricator.wikimedia.org/P62494
[10:31:22] XD
[10:31:26] It let me do a stupid thing
[10:31:39] oh oooo
[10:31:57] volans: Should I create a task for this?
[10:32:08] it should have told me I have no pooled replicas
[10:32:22] I guess we just need to extend the current checks to pc
[10:33:00] you set "min_replicas": 0 on purpose?
[10:33:07] shouldn't we add the 2 spares to all sections?
[10:33:21] volans: yeah, because we don't have replicas
[10:33:30] volans: The spares currently replicate from pc1 and pc4
[10:33:34] ah ok
[10:33:45] but they can be added to pc2 if a master dies
[10:33:53] even if they don't replicate from it - it just means they'll be cold
[10:33:57] sure
[10:34:02] better than nothing
[10:34:03] :D
[10:34:24] But I am surprised dbctl let me change the master to a host which isn't pooled, so the section had no master after the commit
[10:34:34] This doesn't happen with sX
[10:34:36] which is good
[10:34:46] are you sure of that?
[10:34:59] wanna try it on s4?
[10:35:07] (as I said, I'm feeling spicy)
[10:35:08] Yes, dbctl doesn't let you promote a host which isn't pooled
[10:35:12] I saw it earlier with arnaudb ;)
[10:35:25] So the host needs to be pooled (even with weight 0)
[10:35:47] which command did you run?
[10:35:50] so I can check the code
[10:36:49] marostegui: running the maintain-views cookbook now
[10:36:58] *update-views
[10:37:05] volans: so earlier during arnaud's switchover, the future master was depooled by mistake; when he tried to change the master he got: https://www.irccloud.com/pastebin/63iDkIYl/
[10:37:11] dhinus: roger, thanks
[10:37:49] that's an explicit check for sectionLoads
[10:38:14] volans: can it be extended to the rest?
[10:38:45] volans: otherwise you can promote a replica which is depooled and then you end up with no master
[10:38:55] yeah that's bad
[10:38:59] I wonder why we didn't do it for all
[10:39:00] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/master/conftool/extensions/dbconfig/config.py#569
[10:39:24] seems a valid assumption for all our sections, no?
[10:39:35] yeah
[10:40:23] feel free to open a task to the conftool maintainers :D
[10:40:33] XDD
[10:40:41] I will create a subtask of the current one
[10:40:47] I'll comment with the places where it will need fixing
[10:40:54] thanks
[10:41:36] thank you for spotting it
[10:44:58] https://phabricator.wikimedia.org/T365123
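For reference, a minimal sketch of the extension requested in T365123, with hypothetical data structures and instance names (the real, sectionLoads-only check lives in conftool's extensions/dbconfig/config.py, linked at 10:39:00): refuse a commit whose master is not pooled, for external sections (es*, pc*) as well as core sX sections:

```python
# Minimal sketch of the safety check discussed above, NOT the real conftool
# code. The idea: a commit must be rejected when a section's designated
# master is not among its pooled instances, regardless of section flavor.
def check_masters_are_pooled(
    sections: dict[str, dict],
    pooled: dict[str, set[str]],
) -> list[str]:
    """sections maps a name ('s4', 'pc1', ...) to a section object carrying
    'master' and 'flavor'; pooled maps a section name to the set of
    instance names currently pooled in it."""
    errors = []
    for name, section in sections.items():
        # Run the check for every flavor, not only "regular": promoting a
        # depooled host must fail for pc1 just as it already does for s4.
        if section["master"] not in pooled.get(name, set()):
            errors.append(
                f"Master {section['master']} is not pooled in section {name}"
            )
    return errors


# Example with hypothetical instance names, mirroring the 10:31:21 mistake
# where dbctl accepted a pc master that was not pooled in the section:
print(check_masters_are_pooled(
    {"pc1": {"master": "pc1011", "flavor": "external"}},
    {"pc1": {"pc1012"}},
))
```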
[10:48:58] marostegui: the update-views cookbook completed successfully
[10:49:08] thanks dhinus!
[11:41:08] I'll be afk for a while; an ISP person is coming by 🤞
[11:57:30] and they're gone, sadly it's a collective issue that will stretch on for a while :/
[11:58:00] I'm impressed by your luck
[11:59:17] at this stage I might be the root cause of my country's current economic decline
[12:00:54] * Amir1 buys more croissants to support arnaudb
[12:01:05] xD
[12:09:21] PROBLEM - MariaDB sustained replica lag on s8 on db1214 is CRITICAL: 3.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[12:11:21] RECOVERY - MariaDB sustained replica lag on s8 on db1214 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[12:21:31] ^ I'm going through the binlog to see what transaction chokes replication from time to time
[12:40:23] PROBLEM - MariaDB sustained replica lag on s7 on db1181 is CRITICAL: 3.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[12:41:13] PROBLEM - MariaDB sustained replica lag on s7 on db1158 is CRITICAL: 4.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1158&var-port=9104
[12:41:25] RECOVERY - MariaDB sustained replica lag on s7 on db1181 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[12:42:13] RECOVERY - MariaDB sustained replica lag on s7 on db1158 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1158&var-port=9104
[14:04:39] PROBLEM - MariaDB sustained replica lag on s3 on db2205 is CRITICAL: 4.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2205&var-port=9104
[14:29:41] RECOVERY - MariaDB sustained replica lag on s3 on db2205 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2205&var-port=9104
[14:47:05] is it safe to run "systemctl status mariadb@s4.service" on a wikireplica? (clouddb1015 in this instance). context: T365164
[14:47:06] T365164: [wikireplicas] clouddb* free memory decreases over time - https://phabricator.wikimedia.org/T365164
[14:47:28] sorry, "restart" not "status" :P
[15:02:25] dhinus: why do you need to restart it?
[15:02:31] Ah
[15:03:00] Well, it is safe to run, but double-check whether replication starts automatically or you need to issue a START SLAVE for that instance
[15:33:56] marostegui: ack. I was in a meeting, I'll try restarting now, and double check with "show slave status"
[15:38:56] restarted, replication was not running, I had to manually do "START SLAVE"
[15:39:23] replication is back in sync
[15:40:03] memory usage is down
[15:45:06] Good!
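A minimal sketch of the restart-then-verify routine above, assuming pymysql and a per-instance socket (the /run/mysqld/mysqld.s4.sock path is an assumption, not a confirmed wikireplica path): after restarting a multi-instance mariadb@s4 unit, replication may stay stopped, so check SHOW SLAVE STATUS and issue START SLAVE if needed:

```python
# Minimal sketch, NOT a WMF tool: after `systemctl restart mariadb@s4`,
# replication does not necessarily resume, so verify the slave threads
# and start them manually, as was done by hand above.
import pymysql


def ensure_replication_running(socket: str = "/run/mysqld/mysqld.s4.sock") -> None:
    conn = pymysql.connect(
        unix_socket=socket,  # assumed per-instance socket path
        user="root",
        cursorclass=pymysql.cursors.DictCursor,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status is None:
                raise RuntimeError("no replication configured on this instance")
            running = (status["Slave_IO_Running"], status["Slave_SQL_Running"])
            if running != ("Yes", "Yes"):
                cur.execute("START SLAVE")  # the manual step from 15:38:56
            # Once started, Seconds_Behind_Master should drop back to 0 as
            # the replica catches up.
    finally:
        conn.close()
```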
[22:19:37] PROBLEM - MariaDB sustained replica lag on s3 on db1189 is CRITICAL: 10.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1189&var-port=9104
[22:19:45] PROBLEM - MariaDB sustained replica lag on s3 on db1212 is CRITICAL: 4.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1212&var-port=9104
[22:20:01] PROBLEM - MariaDB sustained replica lag on s3 on db2205 is CRITICAL: 5.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2205&var-port=9104
[22:20:37] RECOVERY - MariaDB sustained replica lag on s3 on db1189 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1189&var-port=9104
[22:20:37] PROBLEM - MariaDB sustained replica lag on s3 on db1157 is CRITICAL: 3.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1157&var-port=9104
[22:21:01] RECOVERY - MariaDB sustained replica lag on s3 on db2205 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2205&var-port=9104
[22:21:01] PROBLEM - MariaDB sustained replica lag on s3 on db2177 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2177&var-port=9104
[22:21:37] RECOVERY - MariaDB sustained replica lag on s3 on db1157 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1157&var-port=9104
[22:21:45] RECOVERY - MariaDB sustained replica lag on s3 on db1212 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1212&var-port=9104
[22:22:01] RECOVERY - MariaDB sustained replica lag on s3 on db2177 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2177&var-port=9104