[07:30:08] morning
[07:30:46] o/
[07:32:39] I see the network issues continued yesterday :/
[10:00:46] * dcaro lunch
[11:19:53] * taavi afk for a bit
[11:37:30] hi all, due to bad planning I have run out of resources on the puppet-diffs project. I created https://phabricator.wikimedia.org/T349006 to request additional resources, and I know the process is for these to be approved and resolved in your meeting tomorrow. But if there is any chance I can skip the queue and get things updated today it would be a great help, though I understand if not
[11:37:36] possible, thanks (cc balloons)
[12:43:07] jbond: should be done, if it's a small request we expedite it by getting someone to +1 it
[12:44:18] dcaro: awesome thanks <3
[12:53:47] topranks: let me know when you want to work on T348140, we can do codfw to refine the process and find any issues there before doing eqiad
[12:53:48] T348140: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140
[12:54:11] dcaro: I'm just updating the task now, I'll ping you shortly on it if that's ok
[12:54:20] no problem yes
[12:59:19] dcaro: ok I added a comment there, my best guess as to the commands we need
[12:59:29] but I'm far from certain
[12:59:43] overall the idea doesn't seem too hard, but I suspect we may end up tinkering a bit
[12:59:58] and the question is if we can withstand a longer than "very quick" outage if that happens
[13:00:40] on codfw it should be ok
[13:01:14] ok yeah
[13:02:58] okok :), just rechecked the ids in the commands, I think I'm ready if you are
[13:03:26] @wmcs, I'm messing with codfw, if you are doing something it might break momentarily
[13:03:34] ok give me 5 mins to make coffee?
[13:03:38] and say my prayers :P
[13:03:43] hahahah, sure :)
[13:09:25] ok ready
[13:09:39] do we want to do a call or any other screen-share?
[13:09:44] or are you happy for me to just try?
[13:10:22] I'm ok for you to just try, but if you prefer the call I'm happy with that too :)
[13:10:47] I can just try, if it goes badly we can sync up but that won't happen :)
[13:11:03] of course not! :)
[13:11:28] alright I'll try the first,
[13:11:40] anything you want me to be checking? (ping/tcpdump/...)
[13:12:46] ok the port unset worked, had to supply the current value of it
[13:13:02] I'm running "ip monitor" in the qrouter netns on cloudnet2005
[13:13:15] I can see it deleted the IPs there following the command
[13:13:34] I'll continue on - nothing in particular to run right now
[13:13:40] ack
[13:13:54] it took the subnet delete
[13:14:35] subnet create worked - but the allocation pools seem wrong
[13:14:38] I'll continue for now
[13:14:48] ack
[13:15:05] started seeing `185.15.57.9 dev qg-1290224c-b1 FAILED`, but that's expected I think
[13:18:36] yeah, fwiw the port creation seems to happen automatically when you add the subnet
[13:18:48] ack
[13:18:55] a lot of them started failing now
[13:19:35] they seem to be IPs in the 185.15.57 range
[13:19:42] ok I think it's done
[13:19:59] the cloudnet has the IP on the interface again with the new subnet mask, and the default route
[13:20:22] this works
[13:20:22] 185.15.57
[13:20:28] https://www.irccloud.com/pastebin/4ZNMnhtP/
[13:21:12] I can ssh to instances again
[13:22:22] I think it's trying to probe all the IPs in the 185.15.57.0 - 185.15.57.23 range
[13:22:55] yeah that was already the case
[13:23:04] ack
[13:23:24] it's not an ideal setup, all the IPs are in the same bridge but it was working ok
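[Reference sketch — the "port unset" / "subnet delete" / "subnet create" steps above roughly correspond to openstack CLI commands like the following. This is only an illustration: every ID, name, and address value in angle brackets is a placeholder and not taken from the log; the authoritative command list lives on T348140.]
    # remove the old /30 gateway IP from the router's external port (the current value has to be supplied explicitly)
    openstack port unset --fixed-ip subnet=<old-subnet-id>,ip-address=<router-gw-ip> <router-gw-port-id>
    # drop the old /30 subnet and recreate it as a /29 on the same external network
    openstack subnet delete <old-subnet-id>
    openstack subnet create --network wan-transport-codfw \
        --subnet-range <new-/29-prefix> --gateway <upstream-gw-ip> \
        --allocation-pool start=<first-usable-ip>,end=<last-usable-ip> <new-subnet-name>
    # per the log above, the router's gateway port picks up an IP from the new subnet automatically,
    # so an explicit "openstack port set --fixed-ip ..." may not be needed
    # watch the router namespace on the active cloudnet while the change is applied
    sudo ip netns exec qrouter-<router-uuid> ip monitor address route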
[13:23:38] i.e. it's arping for the "inside" addresses also on the "outside" (to cloudgw) interface
[13:23:44] it's noise but it won't break anything
[13:24:01] ack
[13:24:16] I see some instance traffic going out, reluctant to say it's ok but maybe it looks alright?
[13:24:37] running the network tests....
[13:24:42] ok
[13:25:06] most are failing
[13:25:17] actually half
[13:25:19] https://www.irccloud.com/pastebin/WYmMksF1/
[13:25:30] let me check the first one
[13:25:31] [2023-10-17 13:23:35] WARNING: cmd '/usr/bin/ssh -i /etc/networktests/sshkeyfile -o User=srv-networktests -q -o ConnectTimeout=5 -o NumberOfPasswordPrompts=0 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -o Proxycommand="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -i /etc/networktests/sshkeyfile -W %h:%p
[13:25:31] srv-networktests@bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org" tools-codfw1dev-k8s-worker-2.tools-codfw1dev.codfw1dev.wikimedia.cloud "timeout -k5s 10s ping -c1 172.16.128.1 >/dev/null"', expected return code '0', but got '255'
[13:27:23] I seem to be unable to ssh to instances again, I think they might be failing to reach ldap (using root works)
[13:28:01] routing to the bastion seems ok, and it's answering SSH
[13:28:05] https://www.irccloud.com/pastebin/0u2N7CKh/
[13:28:22] https://www.irccloud.com/pastebin/9HiiHwOT/
[13:28:41] that's a dns failure ok
[13:28:57] yeah
[13:29:00] https://www.irccloud.com/pastebin/eDLgV1Zv/
[13:29:40] I can ping that IP though
[13:29:55] yeah same, odd that it times out
[13:30:33] we are using anycast for that iirc, right?
[13:33:01] pdns-recursor has no logs for the last 20 min on cloudservices2005-dev, let me try to restart it
[13:34:25] the cloudnet is now trying to NAT everything to 185.15.57.10
[13:35:05] I see queries coming in, but no replies
[13:35:12] on the cloudservices2005-dev
[13:35:13] https://www.irccloud.com/pastebin/IA4jb9QG/
[13:35:18] https://www.irccloud.com/pastebin/7KodWyao/
[13:35:20] (I was requesting pimpollo xd)
[13:35:49] Working for me now from the VPS GW IP 172.16.128.1
[13:35:57] I ran this command from Arturo's notes:
[13:35:58] openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw
[13:36:09] oh, now it worked for me from the tools instance
[13:36:23] 2004 replied
[13:36:24] https://www.irccloud.com/pastebin/tfokHaLM/
[13:36:48] The iptables NAT chain no longer has rule 2 from the paste above
[13:36:52] nice!
[13:36:55] I see replies now in tcpdump
[13:37:05] from 2005
[13:37:11] did you change anything?
[13:37:23] this? `openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw`
[13:37:40] we should add it to the list if that was it :)
[13:38:37] yep that was it, it adds a nat rule by default it seems
[13:38:54] ldap seems not to work yet (can't ssh as my user): Oct 17 13:38:10 tools-codfw1dev-k8s-worker-2 sshd[2245101]: Invalid user dcaro from 172.16.128.19 port 45228
[13:38:54] I'll update the task with the full set of commands once we're happy
[13:38:56] looking
[13:39:44] oh, wait, that might just be codfw stuff
[13:40:41] yep, I can ssh to other projects xd
[13:41:17] rerunning network tests \o/
[13:41:19] woot!
[13:41:57] one failure, [2023-10-17 13:41:34] INFO: running: puppetmasters can sync git tree
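[Reference sketch — the DNS breakage above came from neutron re-enabling SNAT when the external gateway was recreated; a rough way to check and re-apply the setting, using the router and network names visible in the log (the qrouter UUID is a placeholder):]
    # confirm whether SNAT is enabled on the router's external gateway
    openstack router show cloudinstances2b-gw -c external_gateway_info
    # inspect the NAT rules neutron programmed inside the router namespace on the active cloudnet
    sudo ip netns exec qrouter-<router-uuid> iptables -t nat -S
    # disable SNAT again while keeping the same external gateway network (the fix from the log)
    openstack router set --external-gateway wan-transport-codfw --disable-snat cloudinstances2b-gw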
[13:42:45] it worked manually... maybe the test is flaky, it seems to look for 'no update', so it might fail when there's actually an update xd
[13:43:02] now it passed
[13:43:11] okok, I think it's working :)
[13:43:22] ok yep maybe it's that
[13:43:26] it did cause an outage though :/
[13:43:32] cool, well let's keep an eye on it
[13:43:47] yeah it is 100% going to cause an outage, you can't delete the default gateway and not have one
[13:43:57] but hopefully we can execute quickly in eqiad
[13:44:14] I think we can keep it for a day, and send an email to cloud-announce saying that there will be a small outage
[13:44:38] hmm, taavi wanted to upgrade k8s tomorrow, not sure if it's a good idea to pair both
[13:45:01] (as in, if it goes well, there's only one maintenance window, but if it goes badly things can get messy xd)
[13:45:13] sure yeah no need to rush it, eqiad has been fairly stable (there is an ARP timeout sync thing which I believe contributes to that, i.e. arp timing out differently on cloudnet/cloudgw, which luckily we haven't hit in eqiad)
[13:45:56] dcaro: I'll let you guys make the call on whether to do the dual change in one window or not
[13:46:03] ack
[13:46:31] certainly we don't want to make changes in parallel, but yeah if we do both and something doesn't work it may be confusing to know what caused it
[13:46:34] I'm tempted to wait until next week though, to make sure the k8s upgrade is stable before the network flap
[13:46:40] I would prefer the k8s upgrade to be the only planned disruptive thing for tomorrow, things can get quite messy otherwise
[13:46:47] sure
[13:47:06] taavi: you think monday next week is enough time to make sure the k8s upgrade is stable?
[13:47:10] at least if we start seeing network issues with the arp cache on cloudgw in eqiad we know how to address it as an emergency change
[13:47:15] otherwise we can leave it a while
[13:47:36] dcaro: for sure yes
[13:48:03] actually we should maybe wait a little longer and then try to fail over the cloudnets
[13:48:05] or reboot them
[13:48:14] oh, good idea
[13:48:16] we want to make sure failover works, and that it comes back ok after
[13:49:04] topranks: want to do it now? we can also wait to see if there's any issues and do it tomorrow/next
[13:49:55] yeah no reason not to, as long as we're happy things are now good, or if we're waiting to see if anything else raises its head
[13:50:32] let me run the trace I had yesterday - takes 15 min - then we give it a shot
[13:50:37] ack
[13:51:09] can we tell openstack to fail over gracefully?
[13:52:08] Yes, I think we can, but do we?
[13:52:13] (if testing a failover)
[13:54:40] Open to suggestions, I'd maybe try a planned failover first, if that works reboot cloudnet2005-dev (will be backup after the failover), and when it comes back reboot cloudnet2006-dev (will be active)
[13:55:09] so we kind of test both - manual failover and also what happens if the active one just reboots/dies
[13:55:50] looking
[13:57:25] last time I checked this it was using the neutron cli, that is deprecated now xd
[13:59:48] I think it's going to be `wmcs-openstack network agent set --disable 73361b68-276d-45a6-87a4-2b704a56dedb` (the 2005 L3 agent)
[14:00:42] it might need to be manually enabled again after the reboot though
[14:01:36] hmm, this is interesting though: `| ha_state | None `
[14:01:50] I think I remember that being different before
[14:02:07] yeah was just looking - they both say that
[14:02:46] maybe the HA is no longer something on openstack
[14:03:01] "disable" is the only param that would seem to trigger a failover, from what it lists
[14:03:26] I think that the None is ok, our cookbook gets the right info:
[14:03:28] https://www.irccloud.com/pastebin/LSQgqOih/
[14:07:32] yeah, it's running "/usr/bin/python3 /bin/neutron-keepalived-state-change"
[14:07:48] and sending its weird vxlan-encapped VRRPs on the wire
[14:08:29] https://www.irccloud.com/pastebin/DGhv9w53/
[14:11:20] did you try anything? should I disable the agent?
[14:19:27] sorry got pulled away
[14:19:34] no I didn't do anything, you want to give it a shot?
[14:19:58] sure, I'll do it
[14:20:04] you checking/monitoring?
[14:21:13] yep fire away
[14:21:48] done
[14:21:58] hmmm
[14:22:05] doesn't seem to have flipped anything over
[14:22:18] oh
[14:22:20] it has now
[14:22:22] https://www.irccloud.com/pastebin/1vALZiSb/
[14:22:54] yeah looks ok at first glance
[14:22:57] my ssh connections still work
[14:23:02] (to vms)
[14:23:10] running tests
[14:24:23] everything went well \o/
[14:24:39] let's reboot cloudnet2005 then?
[14:24:44] woot!
[14:24:47] yep let's go for it
[14:25:50] rebooting!
[14:29:25] ok, it's back up
[14:30:15] tests keep passing
[14:30:53] https://www.irccloud.com/pastebin/ITd7IIov/
[14:30:57] I'll enable the agent
[14:31:40] ok, everything good
[14:31:41] https://www.irccloud.com/pastebin/8HeFbbHM/
[14:31:48] topranks: should we reboot 2006?
[14:32:23] * andrewbogott pops in to say:
[14:32:43] * topranks fears what andrewbogott might be popping in to say
[14:32:46] I'm going to be out this morning but thank you for working on neutron things!
[14:32:53] hahaha
[14:33:08] that's ok, was hoping it wasn't a problem report :)
[14:33:21] Eh, I'm not paying enough attention to know what's broken ;)
[14:33:23] dcaro: yeah let's reboot 2006
[14:33:29] * andrewbogott out again
[14:33:50] andrewbogott: yw!
[14:34:01] rebooting 2006
[14:34:49] hmm, this looks weird
[14:34:50] https://www.irccloud.com/pastebin/nUJcRyoP/
[14:34:55] active-active
[14:35:07] I did not notice any network loss on the VM ssh connection though
[14:35:28] perhaps it's because the agent hasn't been able to communicate with 2006 due to the reboot?
[14:35:57] yep, I think it was just the grace period
[14:35:59] https://www.irccloud.com/pastebin/8DhWeyLv/
[14:36:01] 2005 made itself live due to the keepalives stopping, and told openstack, but it's not showing the other one as standby as it can't talk to it?
[14:36:04] ok
[14:36:42] ok, 2006 back up
[14:37:08] came back as standby, as expected
[14:37:09] https://www.irccloud.com/pastebin/3uPNMnfC/
[14:37:15] running tests
[14:37:47] all passed 🎉
[14:37:55] I think we are good :)
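[Reference sketch — the L3 agent failover above, as plain openstack CLI (the log uses the wmcs-openstack wrapper for the same thing). The agent UUID is a placeholder, and listing the hosting agents with `--router`/`--long` to read the HA state is my assumption about how to inspect it, not something shown in the log:]
    # find the L3 agents hosting the router and their HA state (active/standby)
    openstack network agent list --router cloudinstances2b-gw --long
    # force a failover by disabling the currently active agent (cloudnet2005-dev in the log)
    openstack network agent set --disable <l3-agent-uuid>
    # ...after rebooting the now-standby cloudnet, re-enable its agent once it is back
    openstack network agent set --enable <l3-agent-uuid>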
[14:39:18] nice
[14:40:14] dcaro: yeah all my checks look good :)
[14:40:29] so hopefully eqiad will be fairly straightforward
[14:40:45] topranks: I'd say let's plan for next monday to do eqiad, unless there's more tests you want to do, that gives some time for others to find issues too, wdyt?
[14:40:54] sure that's always sensible
[14:41:05] btw the packet loss I observed last night is now gone:
[14:41:10] https://www.irccloud.com/pastebin/LSBPSDwh/
[14:41:14] oh, nice
[14:42:14] topranks: can you update the task with the fixed commands? I'll send the email to -announce
[14:43:15] dcaro: done!
[14:43:26] thanks :), around 15:00 works for you?
[14:44:11] CEST? yeah that should be ok, I've a meeting at 16:00 CEST but I'm hopeful it'll be ok
[14:44:49] ack
[14:46:49] topranks: this should not affect internal traffic right?
[14:46:57] as in VM<->VM
[14:47:11] no, it won't affect that
[14:47:25] Should only be VM <-> outside
[14:48:18] ack
[15:19:56] looks like cloudvirt1051 just went down?
[15:20:54] mgmt console is also unreachable
[15:21:16] affected VMs https://phabricator.wikimedia.org/P52997
[15:22:10] just noticed
[15:22:54] the alert says that it "requires you to either restore the server or evacuate manually the VMs on it", but doesn't tell me how
[15:23:04] 🤦‍♂️
[15:23:24] should I try the drain node cookbook?
[15:24:43] hmm, I think it might ssh to the cloudvirt
[15:27:57] it does not, so yes, please try
[15:28:04] ok, I'll try
[15:29:06] it just runs this on a cloudcontrol I think: `bash -c 'source /root/novaenv.sh && wmcs-drain-hypervisor {hypervisor_name}'`
[15:29:46] the cookbook fails on cloudcumin1001 due to alertmanager ssh access, I'll try running that on a cloudcontrol directly
[15:30:32] WARNING: Failed to migrate instance bfad7fbd-53db-4604-aa38-19ffa3e3da02 (harbordb): Compute service of cloudvirt1051 is unavailable at this time. (HTTP 400) (Request-ID: req-9dae42b7-800f-442a-8e2e-084868578924)
[15:30:35] ok that's not helpful
[15:33:12] https://docs.openstack.org/operations-guide/ops-maintenance-compute.html#total-compute-node-failure says to update the nodes in the database, which is very scary
[15:35:18] dcaro: I'm going to try that on a single node that I know we can recover if needed
[15:40:56] https://phabricator.wikimedia.org/T349109
[15:47:55] Can someone ack cloudvirt1051 on icinga? I won't be able to deal with it for a while but it's making my phone go crazy.
[15:53:49] andrewbogott: it should be acked on vops already, if it's not please keep complaining :)
[15:54:06] I will ack on icinga as well
[15:54:34] it's acked in victorops so it should've already stopped contacting your phone
[16:03:06] andrewbogott: things seem ok now, taavi moved cloudvirt1051 to the "maintenance" aggregate, more details in the task
[16:17:38] andrewbogott: if you have time later, we should probably start a canary in 1051 because even if it's in the "maintenance" aggregate there's an icinga check that is failing. I downtimed the host in icinga until tomorrow morning.
[16:17:45] dcaro: I fear I accidentally resolved that page, does that mean it'll page again?
[16:17:58] I think so, let me keep an eye out for it
[16:18:00] taavi: resolved in victorops?
[16:18:17] yeah. victorops had a big checkmark and I just clicked on it because it was the most obvious thing to click
[16:18:35] I can ack on icinga, that should avoid it paging
[16:18:44] yep, it's a bit confusing xd
[16:19:23] I'm not sure what's the behaviour in victorops, but I think it will page again yes. There's also a setting in VictorOps that will page again anyway after 24 hours, I think even if it's acked
[16:20:06] I think acking or downtiming in icinga is the way to avoid it
[16:20:20] done
[16:20:26] should not page in 2 weeks xd
[16:20:48] yes, victorops will "clear" the ack in 24h
[16:23:54] * dhinus off
[16:29:05] gtg. cya tomorrow
[17:10:10] I'm briefly back! Sorry if I was curt before about 1051, I'm just superstitious about ack'ing anything directly in victorops.
[17:10:13] thank you for handling!
[17:11:09] did the draining work or is that something I can help with?
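[Reference sketch — for the cloudvirt1051 outage above, a rough illustration of parking a dead hypervisor while the VMs are dealt with, assuming the "maintenance" aggregate mentioned in the log; the actual steps WMCS took are tracked in T349109, and whether the host first needs removing from its current aggregate is not shown here:]
    # list the VMs still scheduled on the dead hypervisor
    openstack server list --all-projects --host cloudvirt1051
    # stop the scheduler from placing new VMs there
    openstack compute service set --disable --disable-reason "host down, T349109" cloudvirt1051 nova-compute
    # move the host into the maintenance aggregate
    openstack aggregate add host maintenance cloudvirt1051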