[01:08:28] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:10:02] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[08:36:25] marostegui: I'm online now. How can I help with wikireplicas?
[08:37:05] arturo: can you talk to valentin about https://gerrit.wikimedia.org/r/c/operations/puppet/+/924342?
[08:37:23] you two have definitely more context than me
[08:37:29] ok
[08:37:53] And also I would need someone to take a look at this step: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Step_7:_setting_up_metadata
[08:39:53] ok
[08:53:04] marostegui: it seems the maintain-meta_p script is meant to be executed on clouddb10* servers
[08:53:52] there are a bunch of hosts, so I'm not sure where to execute the script
[08:54:00] also, it seems it accepts a $wiki argument
[08:54:07] what should I use there?
[08:54:55] arturo: Those are my doubts too
[08:54:59] I think it only goes to the s7 hosts
[08:55:11] But I don't want to run that with these doubts
[08:55:23] arturo: could you get someone more familiar from your team to look into that?
[08:55:30] ok
[08:55:38] thanks, I appreciate it
[08:56:26] most of the US-based folks on my team are off today
[08:56:57] btullis: are you around?
[09:03:11] marostegui: why do you think running the meta_p script is required?
[09:04:19] https://phabricator.wikimedia.org/T337446#8887064
[09:09:30] ok
[09:09:37] so I think I'm starting to understand
[09:09:46] also, reading this https://gerrit.wikimedia.org/g/operations/cookbooks/+/d22a13d53af3df5986042cf1820a443aaa3907d3/cookbooks/sre/wikireplicas/add-wiki.py
[09:10:11] So I ran maintain-views everywhere
[09:10:14] I didn't run anything else
[09:10:26] ok
[09:11:14] so I think this is what we need at this point
[09:11:19] aborrero@cumin1001:~ $ sudo cumin "P{R:Profile::Mariadb::Section = 's7'} and P{P:wmcs::db::wikireplicas::mariadb_multiinstance}" "/usr/local/sbin/maintain-meta_p --all-databases --debug --dry-run"
[09:11:19] 3 hosts will be targeted:
[09:11:20] clouddb[1014,1018,1021].eqiad.wmnet
[09:11:30] (without the debug and dry-run)
[09:11:56] I've done a manual dry-run just now, and it just inserts a bunch of stuff
[09:11:58] sure, those hosts are fine, they all have s7
[09:12:11] example
[09:12:16] https://www.irccloud.com/pastebin/28yHyhRY/
[09:12:51] meta_p doesn't exist, if the script doesn't create it itself, I can do it
[09:13:13] I mean the database, no idea about the tables inside
[09:13:21] ok, there is this option
[09:13:21] --bootstrap Creates tables, views and dbs if they don't exist
[09:13:33] then that is definitely needed
[09:14:09] it does this
[09:14:10] https://www.irccloud.com/pastebin/8gX4o93W/
[09:14:25] that looks good
[09:14:38] ok, let me run that cumin command
[09:14:54] it will take a while, because it will query every wiki endpoint
[09:15:04] It might fail
[09:15:10] As s5 is currently not up on those hosts
[09:15:12] for the future, I added bootstrap to the docs https://wikitech.wikimedia.org/w/index.php?title=Portal:Data_Services/Admin/Wiki_Replicas&diff=prev&oldid=2080773
[09:15:18] Thanks
[09:15:47] Sorry, not s5, but s2 and s1
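(For context on the --bootstrap step above: the pastebins are not preserved in this log, so the following is a minimal sketch of roughly what bootstrapping meta_p amounts to on a wikireplica host. The real schema and data load are whatever /usr/local/sbin/maintain-meta_p creates; the table and column names below are illustrative assumptions, not the script's actual definitions.)

    -- Illustrative sketch only; the authoritative source is the maintain-meta_p script itself.
    CREATE DATABASE IF NOT EXISTS meta_p;

    -- A per-wiki metadata table along these lines is what tools query;
    -- the exact column set in production may differ.
    CREATE TABLE IF NOT EXISTS meta_p.wiki (
        dbname    VARCHAR(32) PRIMARY KEY,   -- e.g. 'enwiki'
        lang      VARCHAR(12),               -- content language code
        family    VARCHAR(32),               -- wikipedia, wiktionary, ...
        url       VARCHAR(255),              -- canonical URL of the wiki
        slice     VARCHAR(32),               -- which section serves it, e.g. 's7.labsdb'
        is_closed TINYINT(1) NOT NULL DEFAULT 0
    );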
[09:15:50] ok, so I plan to run this
[09:15:51] aborrero@cumin1001:~ 2s 98 $ sudo cumin "P{R:Profile::Mariadb::Section = 's7'} and P{P:wmcs::db::wikireplicas::mariadb_multiinstance}" "/usr/local/sbin/maintain-meta_p --all-databases --bootstrap"
[09:16:05] which will hit clouddb[1014,1018,1021].eqiad.wmnet
[09:16:12] that looks good
[09:16:20] +1
[09:16:28] those seem to be the right hosts per https://orchestrator.wikimedia.org/web/cluster/alias/s7
[09:16:31] ok, running it now!
[09:16:39] arturo: yes
[09:17:20] expect it to take a good 2 minutes per host
[09:17:57] mmmm it returned with success already, looks suspicious?
[09:18:14] can you check if it did something?
[09:18:21] one sec
[09:18:32] there is data
[09:18:49] whether it is what it is supposed to have... I don't know
[09:19:06] I have a potential user online who can test
[09:19:21] it's used for replag and such
[09:19:25] https://replag.toolforge.org/
[09:19:36] the grants are missing I think
[09:19:53] ladsgroup@tools-sgebastion-10:~$ sql meta_p
[09:19:53] ERROR 1044 (42000): Access denied for user 'u3182'@'%' to database 'meta_p'
[09:20:07] let me see if the role is assigned
[09:20:24] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/templates/mariadb/grants/wiki-replicas.sql$44
[09:20:42] no, that's not the grant you need
[09:21:04] oh obviously, sorry
[09:21:23] definitely don't want to grant everything to everyone :D
[09:21:34] try again
[09:21:41] yup
[09:21:46] works weeeeee
[09:21:50] cool
[09:21:58] and replag is back now: https://replag.toolforge.org/
[09:22:20] That's a bit weird, s5 shouldn't have lag
[09:22:32] neither should s7
[09:23:21] what is heartbeat_p?
[09:23:25] I think meta_p is not properly updated for s5 and s7
[09:23:58] we don't have heartbeat_p anywhere so maybe that's something we need to run too?
[09:24:02] I can't access it from labs but let me check if it actually exists
[09:24:07] no it doesn't
[09:24:13] I am telling you it is not anywhere
[09:24:37] yeah, sorry, I wrote it before you sent yours
[09:25:03] it's a view
[09:25:06] see ./modules/profile/files/wmcs/db/wikireplicas/views/heartbeat-views.sql
[09:25:31] yeah, but where does it need to live? on each section?
[09:25:41] -- TODO: add it to maintain-views.py
[09:25:42] nice...
[09:26:26] -- This only has to be run once per host
[09:26:27] let me add it to s5 and see if it fixes the lag
[09:26:32] doesn't help much either
[09:27:24] ok, done, I guess we also need specific grants for it
[09:28:24] I didn't find anything in puppet
[09:28:42] but it's *probably* covered by the general grant on _p
[09:29:09] it keeps showing lag
[09:30:20] could be some caching or something like that, leave the heartbeat mess to me
[09:33:21] k thanks
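(For reference: heartbeat_p, as discussed above, is a thin view layer over pt-heartbeat's heartbeat.heartbeat table; the authoritative definition is modules/profile/files/wmcs/db/wikireplicas/views/heartbeat-views.sql, applied once per host. A minimal sketch of the idea follows; the lag expression and column names are approximations rather than the exact file contents, and the grant is the one applied a few minutes later in the log.)

    -- Sketch only; see heartbeat-views.sql in puppet for the real definition.
    CREATE DATABASE IF NOT EXISTS heartbeat_p;

    -- One row per replicated section, exposing how far behind the local replica is.
    CREATE OR REPLACE VIEW heartbeat_p.heartbeat AS
    SELECT
        shard,                                                         -- section name, e.g. 's7'
        ts AS last_updated,                                            -- last pt-heartbeat tick
        GREATEST(0, TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP())) AS lag -- assumed lag expression
    FROM heartbeat.heartbeat;

    -- Grant quoted verbatim from the log:
    GRANT SELECT, SHOW VIEW ON `heartbeat_p`.* TO `labsdbuser`;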
[09:35:37] marostegui: is there anything else I can help with?
[09:36:18] arturo: Did you and valentin reach any conclusion about dbproxy?
[09:36:27] If I can do 2 transfers at the same time it will speed up the recoveries
[09:36:57] marostegui: reached out to him, but no conclusions
[09:37:08] do you need that patch to make it happen?
[09:37:38] arturo: I don't know what is needed. I actually didn't know it would affect upload.wikimedia.org
[09:37:45] I mean, what do you need and what do we need from valentin?
[09:38:26] What I need is to be able to shut down two sections (say s1) at the same time, which is what I did yesterday, but apparently that has implications for the CPU usage of dbproxy1018 and dbproxy1019, which is something valentin found
[09:40:16] I'm confused, why does valentin care about CPU usage of dbproxy servers?
[09:40:26] arturo: it is probably best if you sync with him :)
[09:40:30] I am just being the messenger here
[09:40:54] I think the 3 of us are online ATM, would you be up for a quick sync meeting?
[09:41:17] arturo: I'd rather do it via IRC
[09:41:27] ok
[09:41:28] I am still working out many things
[09:42:08] Amir1: s5 replag seems fixed, did you do something?
[09:42:12] s7 still showing lag
[09:42:15] (which is not real)
[09:42:17] marostegui: yup
[09:42:21] on s7 now
[09:42:46] added the basic grant: GRANT SELECT, SHOW VIEW ON `heartbeat_p`.* TO `labsdbuser`
[09:43:03] cool
[09:43:58] Amir1: can you add that to all the clouddb hosts that have an "x" there?
[09:44:01] the rest aren't ready yet
[09:44:09] the heartbeat_p db doesn't exist in s7, shall I create it from the file?
[09:44:21] yeah, please create it on all those 3 hosts
[09:44:21] oh sure, definitely
[09:49:25] okay s7 is fixed, going to clouddb1021 in s2 and s3
[09:49:36] no
[09:49:42] those aren't ready
[09:49:47] only the ones that are marked as done
[09:49:53] oh sorry, they had an x in the ticket
[09:50:24] https://www.irccloud.com/pastebin/xQRDHXsR/
[09:51:36] ah, for clouddb1021 yes
[09:51:41] but it is down as I am recloning the other two
[09:51:54] oh okay cool
[09:51:58] I will add it once they come up
[09:53:39] awesome
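(Once the view and grant exist on a host, a replag-style sanity check is just a select against it. A hypothetical example, using the column names from the sketch above rather than the replag tool's actual query; each row corresponds to one replicated section on that host.)

    -- Hypothetical spot-check of per-section lag on a clouddb instance.
    SELECT shard, last_updated, lag AS seconds_behind
    FROM heartbeat_p.heartbeat
    ORDER BY lag DESC;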
[10:00:00] marostegui: I just created T337721 which I hope captures the problem with pybal
[10:00:01] T337721: Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact - https://phabricator.wikimedia.org/T337721
[10:00:17] thanks
[10:02:41] arturo: also maybe https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Step_5:_setting_indexes. I highlighted it to marostegui yesterday on the task when I noticed the meta_p step wasn't done.
[10:03:04] I don't know the details of the pybal issue, but dbproxy used to not be behind confctl and depooling needed a puppet patch. I don't know how hard it is to revert back to that for now
[10:03:20] which would fix the pybal issue
[10:03:52] It was btullis who moved it, I think
[10:06:47] arturo: do you think it'd be doable to move it out of confctl so it would stop hitting pybal for healthchecks?
[10:07:07] RhinosF1: Yeah, that needs to be looked at too by WMCS or Data Engineering
[10:07:12] arturo: ^
[10:07:19] Razzi did the work a long time ago and he left, not sure anyone knows how to do it
[10:07:30] Let me try on s5
[10:07:40] Amir1: I have no idea, I'd need to research and sync with btullis
[10:07:59] marostegui: ACK
[10:08:16] let me see if I can find the ticket
[10:15:48] /usr/local/sbin/maintain-replica-indexes doesn't seem to be working
[10:17:46] marostegui: for the pybal issue, I think I can depool half of the dbs per section from all sections in haproxy so it would stop doing healthchecks, would that help?
[10:18:20] (and switch to the other half once we are done)
[10:18:23] Amir1: That's the point, it is faster if I shut down both of them
[10:18:47] I see
[10:18:52] let me see if I can find a solution
[10:19:35] we can make them point to clouddb1021, at least healthchecks wouldn't fail
[10:24:33] found the ticket: T304478
[10:24:34] T304478: Move wikireplicas dbproxy haproxy config to etcd - https://phabricator.wikimedia.org/T304478
[10:24:48] the whole thing is half done, honestly, we can just remove it
[10:25:18] let me see if I can find a way
[10:38:14] RhinosF1: do you know if the index thing is an optimization that can be done at a later time, or something that we need today?
[10:38:58] arturo: I am sure it is not needed today
[10:39:07] ok, so we can ignore it for now
[10:39:11] yep
[10:40:18] I was about to say the same. It would make some queries in the labs slow but that's the least of our worries right now
[10:40:45] (it adds the indexes defined in /etc/index-conf.yaml)
[10:42:20] Amir1: per your comment on https://phabricator.wikimedia.org/T337721#8888032
[10:42:36] I think I understand (or remember now)
[10:42:57] the dbproxies for the replicas are behind LVS so they can be gracefully reached from cloud VMs
[10:43:11] so, LVS provides a public IPv4 address for them
[10:43:38] if we can switch to DNS load balancing for today and the next couple of days, I'd think it would simplify things for Manuel
[10:44:19] what are the sections that need depooling?
[10:46:28] cc marostegui
[10:46:48] s1, s2, s5, s7
[10:46:53] s4 might be added later
[10:46:53] ok
[10:47:05] wait
[10:47:12] I need to recap all the things I am doing at the same time
[10:47:18] s5 and s7 are also done
[10:47:23] sure
[10:47:30] ok, I'll wait and then I'll write my idea to the phab ticket
[10:47:31] let's not depool things right now
[10:47:34] I need to say yeah
[10:47:51] I have a lot of stuff going on at the same time, so I need a complete picture to get my bearings
[10:48:07] no problem, I'll stand by
[10:48:08] ok, s3 needs to be depooled entirely
[10:48:20] so that means clouddb1013 and clouddb1017
[10:48:50] I have already stopped clouddb1013 but I would like to do the same with 1017, but I didn't, simply because of the dbproxy things
[10:49:02] ok
[10:49:14] s1 probably too, but not now
[10:49:20] I'd say s3 for now
[10:49:26] let me see if I understand correctly
[10:49:44] if we depool s3 (clouddb1013 and clouddb1017), that's OK to unblock you
[10:49:48] arturo: I know very little about databases and even less about the wikireplicas architecture. Probably worth adding to a tracker for this incident though, so it's not forgotten, as a lot is going on.
[10:49:55] we can later repool them and move to the next?
[10:50:40] arturo: So right now clouddb1013:3313 (s3) is down, and clouddb1017:3313 is up. I want to be able to also shut down clouddb1017 to start the transfer. I simply didn't, to leave one of them up and avoid what valentin saw yesterday
[10:51:03] arturo: Once those two are finished, probably by tonight, tomorrow I'd like to do the same thing to s1, yes
[10:51:13] ok, my plan is to just drop the S3 definition from LVS @ puppet for the wikireplicas
[10:51:32] leaving the rest as is
[10:51:42] so an extreme depool :-)
[10:53:19] SGTM
[10:53:41] as long as it doesn't bring down upload.wikimedia.org, it's good
[10:53:46] will try!
[10:54:52] I am going to step away for around 1h, I need to eat
[10:56:21] go go
[10:56:28] thanks for everything <3
[10:57:55] thanks! enjoy the food
[10:58:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/924481
[11:00:47] I believe that's all we need, but will try to have others review it so we don't break the wikis or anything
[12:19:37] I need to rest a bit. Will be back soon
[12:20:05] arturo: Yeah I cannot really give you much input on that patch
[12:20:34] marostegui: valentin confirmed that's right and will deploy it later today
[12:20:40] ok thanks
[13:45:52] Could I get a +1 for https://gerrit.wikimedia.org/r/c/operations/puppet/+/924516 please? Disable setting enable_swiftrepl, which will enable puppet (hopefully!) to run to completion on the bullseye frontend that used to be a swiftrepl node
[13:47:23] thanks :)
[13:48:14] you will have to check python-cloudfiles and time
[13:53:09] those who rarely have someone around for a review should support each other more
[14:02:47] codfw node looks OK, going to give it 24h before doing the eqiad one.
[14:04:46] marostegui: valentin told me they sorted out the LVS/dbproxy thing, and you should be all set to proceed
[14:05:12] yeah, synced with brandon on -operations
[14:05:17] Thanks for the help
[14:07:37] cool
[15:16:15] marostegui: are you fully unblocked for now?
[15:16:37] arturo: yep thanks
[15:16:44] we still have this:
[15:17:03] https://phabricator.wikimedia.org/T337734
[15:17:13] so if you can follow up with your team and see if someone knows more about it, that'd be great
[15:19:18] sure I can do that
[15:19:22] thank you
[15:19:51] other than that, I'm about to leave the laptop for the day, but I'll be available tomorrow morning in case you have anything else where I can help
[15:20:08] thanks
[15:20:23] enjoy your evening