[12:15:48] marostegui: is the m5 proxy enabling in a couple of hours likely to be a usefully-educational thing for me to watch?
[12:15:55] probably not XD
[12:16:15] Emperor: If all goes fine, it is simply merging a DNS change
[12:16:21] cool, ta, just seemed worth checking since it wasn't at 6am local time for once :)
[12:16:23] # host m5-master
[12:16:23] m5-master.eqiad.wmnet is an alias for db1128.eqiad.wmnet.
[12:16:29] That will point to dbproxy1017
[12:16:31] and that's it
[12:16:54] Emperor: I do think the switchovers are educational, at least to see how the workflow is
[12:16:57] And the speed we do them at
[12:17:15] I know it is pretty early, but I would encourage you to attend to one or two if you can :)
[12:17:56] don't suppose you could be bullied into doing one a couple of hours later?
[12:18:22] Emperor: The reason we do them that early is basically to have more room to observe the traffic
[12:18:34] Especially with the big big wikis like s1 (enwiki) and s8 (wikidata)
[12:19:29] le sigh. My other half's away on the 11th, so maybe I'll try and make that one
[12:19:58] haha no rush, you don't have to attend to those specifically yet, we'll have more in the future too (as kormat loves them)
[13:55:01] @marostegui I've been out a lot the last few weeks so haven't been very responsive on T288093 -- I don't think it will affect much but I'm around this morning to notice if it does.
[13:55:03] T288093: Place m5 proxies in codfw and eqiad - https://phabricator.wikimedia.org/T288093
[13:55:10] andrewbogott: thank you :*
[13:55:17] legoktm: you around yet?
[13:58:43] andrewbogott: reminder of which databases live in m5 and might be affected: https://phabricator.wikimedia.org/T288093#7461416
[13:59:44] thx
[13:59:53] Going to merge the dns and deploy
[14:00:01] marostegui: hey I'm here now
[14:00:09] legoktm: hello!
[14:00:20] I am going to merge in a sec, if you want to restart mailman3 in about 2-3 minutes
[14:00:29] sounds good
[14:00:42] deploying!
[14:01:56] deploy done
[14:02:00] toolhub seems happy so far
[14:02:01] m5-master.eqiad.wmnet. 58 IN CNAME dbproxy1017.eqiad.wmnet.
[14:02:31] I'm seeing some breakage but hoping that bd808 runs to the rescue
[14:02:35] :(
[14:02:43] andrewbogott: what do you see?
[14:02:45] Grants missing?
[14:02:52] https://toolsadmin.wikimedia.org/tools/membership/
[14:02:56] I haven't dug in at all
[14:03:00] uuuups
[14:03:20] Let's see if the error can be found if not I will revert
[14:03:41] looks like toolhub broke too
[14:04:05] andrewbogott: I suppose the first thing to try is restarting Striker
[14:04:06] are you able to see the concrete error?
[14:04:08] 14:03:45 [Q] ERROR (2002, "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' (115)")
[14:04:18] mailman3-web is also unable to connect
[14:04:21] let me check the proxy
[14:05:55] I cannot see what could be wrong with the proxy, I can telnet correctly to 3306 and I get redirected fine
[14:06:01] I am going to revert and audit the grants again
[14:06:09] toolsadmin (Striker) is reporting "Access denied for user 'striker'@'10.64.48.43' (using password: YES)"
[14:06:40] thanks bd808 I can check that later
[14:06:43] Going to revert for now
[14:06:56] hm, telnet to dbproxy1017.eqiad.wmnet from lists1001 is failing
[14:07:04] maybe FW issues?
[14:07:22] legoktm: telnet m5-master.eqiad.wmnet 3306 works?
[14:08:02] nope
[14:08:21] so FW issues likely
[14:08:32] bd808: I have fixed striker, does that work now? (I am going to revert anyways, but just checking)
[14:08:51] marostegui: yes! Striker is working
[14:08:56] ok good!
[14:09:00] I had to add https://gerrit.wikimedia.org/g/operations/puppet/+/5121590ae251609f52ac80aee981240df3926b1a/modules/profile/manifests/mariadb/misc.pp#18 to get it to work originally
[14:09:17] deploying the revert now
[14:09:59] legoktm: So maybe that needs to get added to m5 proxies too (dbproxy1017 and dbproxy1021)
[14:10:04] (the change is getting deployed)
[14:10:04] striker is back, toolhub is still upset
[14:10:12] andrewbogott: yep, didn't touch that one
[14:10:27] dns merged
[14:10:55] toolhub back with revert
[14:11:01] toolhub definitely has the grants for dbproxy1017, so might be FW or something
[14:11:08] thanks for confirming bd808
[14:11:12] legoktm: mailman back?
[14:11:37] yep
[14:11:51] cool, I will post all this on the task and will dig on the FW part
[14:11:59] Thank you all for helping and sorry for breaking things :(
[14:12:11] We knew placing the proxy back in m5 was going to be hard :(
[14:12:38] the error in toolhub said "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' (115)". I'll look for a better trace in logstash
[14:12:59] that smells like the same issue legoktm saw from lists1001
[14:13:34] I looked it up, 115 means you can't reach the host
[14:13:41] ok, so setting this aside for a future window?
[14:13:55] andrewbogott: yeah, I will investigate
[14:13:56] thank you
[14:14:07] ok! Time for breakfast :)
[14:14:21] Thanks andrewbogott :**
[14:14:40] legoktm:
[14:14:40] root@lists1001:~# telnet dbproxy1017.eqiad.wmnet 3306
[14:14:41] Trying 10.64.48.43...
[14:14:46] Definitely a FW (or connectivity issue)
[14:14:51] yeah
[14:15:25] I suspect toolhub has the same issue
[14:15:38] bd808: which host could I use to test connectivity?
[14:15:43] A root would need to test that from inside the eqiad k8s cluster. I don't have the super powers to enter a shell there.
[14:15:44] what is the puppet role for the proxy?
[14:16:04] marostegui: maybe kubernetes1011?
[14:16:09] legoktm: mariadb::proxy::master
[14:16:13] bd808: Thanks I will check! :)
[14:16:27] Will talk to Service Ops if I need more help, no worries!
[14:16:30] we need to add the proxy to the egress https://gerrit.wikimedia.org/g/operations/deployment-charts/+/c137819149b98d3b342c6dc1ac0d9aa1cd6edfdc/helmfile.d/services/toolhub/values.yaml#20
[14:17:03] I thought "cidr: 10.64.16.35/32 # T288720: db1132.eqiad.wmnet" was the new one?
[14:17:03] I am so smart, I added the codfw and the new master but not the eqiad proxy, good job manuel
[14:17:04] T288720: Failover m5 master (db1128) to db1132 to upgrade its kernel - https://phabricator.wikimedia.org/T288720
[14:17:21] bd808: that's going to be the new master yeah, but I forgot to add the proxy
[14:17:27] In a brilliant move from my side
[14:17:30] good catch legoktm
[14:17:35] ah
[14:19:07] We still need to find the issue with lists1001
[14:19:23] so it sounds like Striker was a grant on the db side and is fixed/will work & toolhub needs the correct egress for the new proxy in eqiad.
[14:19:49] bd808: exactly, I will review the grants _again_ though just in case I missed something else (you have no idea how messy they are)
[14:20:08] So we'd need to add dbproxy1017 and dbproxy1021
[14:20:12] as those are the active and passive proxies
[14:20:48] ok, found it
[14:20:55] dbproxy1017 has profile::mariadb::proxy::firewall: 'cloud'
[14:21:02] which does
[14:21:03] include ::profile::mariadb::ferm_wmcs
[14:21:17] so we need a mode called 'cloud+lists' :)
[14:21:33] yeah, m5 were originally only cloud services
[14:21:43] well and wikitech
[14:22:16] m5 proxies and misc databases that is
[14:22:43] give me a few minutes for a patch
[14:22:52] no rush, I will need to review all the grants too
[14:24:45] I made T294437 for toolhub. Will work on that today
[14:24:46] T294437: Add egress rules for dbproxy1017 & dbproxy1021 - https://phabricator.wikimedia.org/T294437
[14:25:00] thank you bd808!
[14:25:27] * bd808 sneaks off to have a shower
[14:31:15] marostegui: should I also set it on dbproxy2004 too?
[14:31:29] legoktm: let's see
[14:32:02] Do we have lists200X? I don't think so no?
[14:32:13] mailman only lives in eqiad, right?
[14:32:55] yeah, eqiad only for now. I have an open task for setting up a spare, but that's a dream right now
[14:33:02] haha
[14:33:08] so let's wait to have a host there I would say
[14:33:15] ok
[14:33:27] So we ensure that we don't do cross DC queries for now
[14:33:32] as a hard way to ensure it
[14:33:48] * legoktm nods
[14:34:31] should I merge / roll it out now?
[14:34:43] yeah, we can merge, nothing would be using m5-master
[14:34:50] The dns points directly to db1128
[14:35:08] s/m5-master/dbproxy1017/
[14:37:08] legoktm: it works!
[14:37:26] ^.^
[14:37:31] very nice!
[14:37:35] thank you
[14:40:53] summary https://phabricator.wikimedia.org/T288093#7461553
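
Note on the connectivity checks in this log: the manual "telnet <host> 3306" tests from lists1001 could be scripted roughly as below. This is only a minimal sketch, not tooling anyone in the channel used. The function name can_connect and the 3 second timeout are assumptions made for illustration. It only tests whether a TCP connection opens, so it would catch the firewall and egress problems found here, but not the missing grant that Striker hit.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout.

    Hypothetical helper: a scripted stand-in for `telnet host port`.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (for example a firewall
        # silently dropping packets) and DNS resolution failures.
        return False
```

Run from each client host before the DNS change, a call such as can_connect("dbproxy1017.eqiad.wmnet", 3306) would be expected to return False wherever the proxy firewall or a Kubernetes egress rule is still blocking the path.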