[08:36:23] Emperor: do you have some time this morning to switch swift to IPIP?
[08:37:10] vgutierrez: could do. What action is required from me?
[08:37:43] keeping an eye over ms-fe nodes
[08:37:49] ack
[08:37:55] and if we are feeling extra paranoid depooling ms-fe@codfw before proceeding
[08:38:49] we already switched several services on other lvs instances, in theory changes in ms-fe instances won't impact its ability to receive traffic at the moment, and the real change in traffic won't happen till we restart pybal
[08:39:18] OK, seems reasonable to me
[08:50:00] Emperor: cool, let me get some coffee first :)
[08:50:08] some things shouldn't be done before coffee
[08:53:18] fair
[09:05:21] so.. before merging I'll proceed to disable puppet on ms-fe nodes, to validate in one of them that everything goes as expected
[09:07:06] OK
[09:21:57] patch needs some love
[09:22:39] MSS rules generated for ms-fe instances are missing IP and PORT :|
[09:22:43] `outerface (enp101s0f0np0 lo) saddr @ipfilter(()) proto tcp sport () tcp-flags (SYN) SYN TCPMSS set-mss 1440`
[09:23:31] it should be something like `outerface (ens13 lo) saddr @ipfilter(( 2620:0:861:ed1a::9])) proto tcp sport (443 80) tcp-flags (SYN) SYN TCPMSS set-mss 1440`
[09:24:12] no harm done, I've detected this on the change catalog produced by PCC
[09:24:55] 👍
[09:30:03] wait.. I was staring at the ms-fe1009 catalog on the codfw change...
[09:32:30] ...
[09:33:49] for ms-fe2009 PCC shows `outerface (enp101s0f0np0 lo) saddr @ipfilter( proto tcp sport (443 80) tcp-flags (SYN) SYN TCPMSS set-mss 1440`
[09:33:52] that looks better
[09:41:34] Emperor: adjusted the ms-fe@codfw CR to just enable IPIP on codfw instances
[09:44:56] +1
[09:50:28] codfw change merged (puppet disabled) running puppet on lvs2014
[09:53:08] and lvs2013.. both lvs looking good :D
[09:53:35] I'll run puppet on ms-fe2009
[09:56:30] Emperor: can you double check that ms-fe2009 looks healthy to you?
[09:58:40] doing so
[09:59:48] the only difference so far is that outgoing traffic is getting MSS clamped
[10:00:53] https://www.irccloud.com/pastebin/elqznGWP/
[10:01:19] and that now the instance has ipip0 and ipip60 devices to handle incoming IPIP traffic
[10:02:08] ms-fe2009 looks OK to me
[10:03:07] thx, enabling puppet on A:swift-fe-codfw
[10:07:17] dashboards still look OK
[10:07:40] I've triggered a staggered puppet run across A:swift-fe-codfw
[10:07:55] next step will be restarting pybal on lvs2014 and lvs2013
[10:18:50] lvs2014 done, ipvsadm looks as expected
[10:19:01] https://www.irccloud.com/pastebin/MDQdgTtf/
[10:19:46] time for lvs2013, if we break something, we break it now :)
[10:21:09] actually I'm not that crazy.. I can validate IPIP traffic getting accepted first :D
[10:23:14] https://www.irccloud.com/pastebin/XX3rM07C/
[10:23:18] looking good :D
[10:23:37] Emperor: I hope you have the popcorn ready
[10:24:02] I have a bucket of hot salty sysadmin tears prepped, is that good enough? :)
[10:24:20] lol
[10:26:10] https://www.irccloud.com/pastebin/lkjN13Bi/
[10:26:14] traffic flowing as expected
[10:26:57] https://www.irccloud.com/pastebin/hAvXWEbs/
[10:27:19] Emperor: I'll brew some coffee while you validate that swift-fe@codfw is still healthy and we can proceed with eqiad :D
[10:28:39] looks good from my end
[10:36:46] Emperor: lovely
[10:37:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1120603 needs your +1 :D
[10:38:06] {{done}}
[10:42:08] thx... same procedure as before, I'll disable puppet on impacted lvs and A:swift-fe-eqiad
[10:53:20] * vgutierrez running puppet on ms-fe1009
[10:56:25] ms-fe1009 is accepting IPIP traffic against ports 80 and 443 of the VIP https://www.irccloud.com/pastebin/MVDc4GQa/
[11:05:39] all looks good to me thus far
[11:07:47] cool
[11:07:54] restarting lvs1020 and lvs1019 :D
[11:10:31] Emperor: all good from my PoV, traffic flowing as expected
[11:12:15] looks good from here, too
[11:22:55] Emperor: final one https://gerrit.wikimedia.org/r/c/operations/puppet/+/1121336
[11:23:03] dropping wrr in favor of maglev
[11:25:34] +1 (on the basis that it looks to do what the comment says it does)
[11:35:17] vgutierrez: you planning on deploying that change soonish?
[11:35:27] Emperor: in a few minutes
[11:37:33] 'k
[11:38:00] nice work vgutierrez!
[11:40:34] Emperor: do you have a dashboard where we can see the number of requests per ms-fe instance?
[11:42:08] Amir1, federico3: sorry one wrong info I gave you in the call, the new structured diff in conftool is not yet merged/released, somehow I thought it was in the last release. Last update in https://phabricator.wikimedia.org/T383760#10525151
[11:43:14] ipvsadm on lvs2014 shows that mh is configured with the expected flags (mh-port) https://www.irccloud.com/pastebin/bpAAyVOr/
[11:43:31] proceeding with lvs2013
[11:43:36] vgutierrez: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-site=All&viewPanel=34 ?
[11:45:00] Emperor: cool.. I'll keep an eye to doublecheck that we aren't overloading any specific realserver
[11:45:42] elukey: it looks like the luck is still on my side :D
[11:46:22] vgutierrez: grand. Assuming no fire, are we done with this work now?
[11:46:28] Emperor: yes :D
[11:46:41] thanks for the assitance
[11:46:46] *assistance even
[11:50:06] cool :)
[15:18:47] Emperor: I am not saying "gooood" until we re-test the hot swap procedure, but https://phabricator.wikimedia.org/T384003#10568007 looks a nice step forward
[15:19:00] now my fear is that a reboot would have fixed it anyway
[15:19:04] this is why I want to re-test
[15:45:56] +1 to retest
[16:00:24] if it works we just need apply the fix and roll reboot all the new supermicro ms-be(s)
[16:00:36] I also asked a quote for the other controller so we can compare
[16:01:05] TY :)
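
Note on the `set-mss 1440` value in the MSS clamping rules quoted earlier in this log: the figure is consistent with a 1500-byte MTU and IPv4-in-IPv4 encapsulation, where the extra outer IP header has to fit in the same packet. The log does not state the MTU or the reasoning, so the 1500-byte MTU and the header sizes below (no IP options, no IPv6 extension headers) are assumptions, and this is only a minimal arithmetic sketch.

```python
# Header sizes in bytes. Assumption: no IP options or IPv6 extension headers.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def clamped_mss(mtu, inner_ip_header, outer_ip_header):
    """Largest TCP payload that still fits in one packet after encapsulation."""
    return mtu - outer_ip_header - inner_ip_header - TCP_HEADER


# Assumed 1500-byte MTU; the log only shows the resulting 1440.
print(clamped_mss(1500, IPV4_HEADER, IPV4_HEADER))  # IPv4 in IPv4
print(clamped_mss(1500, IPV6_HEADER, IPV6_HEADER))  # IPv6 in IPv6
```

Under these assumptions the IPv4-in-IPv4 case gives 1440, matching the rule in the log, while IPv6-in-IPv6 would leave less room.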
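
Note on "dropping wrr in favor of maglev": Maglev hashing picks a realserver from a precomputed lookup table, so a given flow maps to the same backend no matter which load balancer handles it. The sketch below follows the table-population scheme described in the Maglev paper; it is not the IPVS `mh` scheduler's code. The SHA-256 hashing, the table size of 13, the flow-key format, and every backend name other than ms-fe2009 are illustrative assumptions.

```python
import hashlib


def _hash(name, salt):
    """Stable 64-bit hash (SHA-256 is an arbitrary choice for this sketch)."""
    digest = hashlib.sha256((salt + name).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def maglev_table(backends, size=13):
    """Build a Maglev lookup table; size must be prime and > len(backends)."""
    if not backends:
        raise ValueError("need at least one backend")
    offsets = [_hash(b, "offset") % size for b in backends]
    skips = [_hash(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        # Backends take turns claiming the next free slot in their own
        # permutation, which keeps the table shares nearly equal.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_idx[i] * skips[i]) % size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == size:
                return table


def pick_backend(table, flow_key):
    """Map a flow identifier (hypothetical format) to a backend."""
    return table[_hash(flow_key, "flow") % len(table)]


# ms-fe2010 and ms-fe2011 are hypothetical names used only for illustration.
backends = ["ms-fe2009", "ms-fe2010", "ms-fe2011"]
table = maglev_table(backends)
print(pick_backend(table, "198.51.100.7:51234"))
```

Because the table depends only on the backend list, two load balancers with the same list build identical tables. The flow key here includes a source port, in the spirit of the `mh-port` flag mentioned in the log.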