[06:52:45] Traffic, SRE: A poor internet connection should not result in a HTTP 503 error - https://phabricator.wikimedia.org/T356025 (Bugreporter)
[06:53:36] Traffic, SRE: Cannot edit wikipedia from my work computer - https://phabricator.wikimedia.org/T356799 (Bugreporter)
[07:05:41] Traffic, SRE: A poor internet connection should not result in a HTTP 503 error - https://phabricator.wikimedia.org/T356025 (Vgutierrez) sadly varnish is not able to tell between a client that goes away earlier than expected (by poor Internet access) triggering a backend fetch error from an actual backend...
[10:07:16] hello, I'd like to restart pybal today to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/951545 - any objections?
[11:32:17] vgutierrez: FYI newest mtail has proper histograms https://github.com/google/mtail/issues/675#event-11730210918 cc fabfur
[11:45:33] godog: thanks, I'm trying to quit with mtail ... :)
[12:00:12] any thoughts on a pybal restart today? :)
[12:15:53] hnowlan: no problem for me but considering I'll be OOO in a short time maybe other people should ack it ...
[12:17:09] it's a reasonably quick change but I'll hold sure
[12:19:46] hnowlan: err do you have some context for that? :)
[12:22:37] IIRC we have some networking maintenance around 16 UTC.. so I'd avoid that time window
[12:22:40] vgutierrez: sure - currently thumbor uses a custom pool named thumbor that previously contained the metal thumbor nodes, which over time included more and more k8s workers as we switched over. Now that we're all k8s for thumbor, this pool is a liability as we need to update it independently of the main k8s pool
[12:23:57] hnowlan: yeah.. so no big restrictions from our side
[12:25:21] great - ideally I'd like to do it now
[12:28:18] hnowlan: if you do it now I'm available if needed
[12:28:40] thanks! merging now
[12:30:22] do you have any dashboard ready to check anomalies ?
[12:31:01] yep https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=1m&from=now-1h&to=now big drops in responses is the main thing I'm looking for
[12:31:09] doing secondaries
[12:32:01] tnx
[12:33:21] Done - moving to primaries unless there's any objection
[12:33:53] ok
[12:37:11] all done, looks okay to me. rps looks stable on thumbor
[12:38:16] thanks!
[12:57:22] np
[12:57:26] :)
[12:57:29] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[12:57:55] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (Clement_Goubert) Open→Resolved
[12:58:25] (SystemdUnitFailed) firing: (2) ifup@ipip0.service Failed on ncredir2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:48] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[13:55:17] topranks: we're having some issues with ipip0 / ipip60 devices in ncredir instances
[13:55:55] contents of /etc/network/interfaces https://www.irccloud.com/pastebin/j42RSIBa/
[13:56:17] networking.service triggers an ifup -a and ipip0 and ipip60 come around nicely
[13:56:50] but ifup@ipip0.service && ifup@ipip60.service attempt to bring the device up again for some reason I don't grasp
[13:57:05] and fail with
[13:57:08] https://www.irccloud.com/pastebin/8zEdltZx/
[13:57:45] the offending line being: ip link add name ipip60 type ip6tnl external
[13:57:45] RTNETLINK answers: File exists
[13:58:28] I noticed that ifup appends "|| true" to the "ip link set dev" stanza but not to the
other one
[13:58:47] so appending || true would do the trick but it looks hackish to me
[14:08:43] vgutierrez: hmm yeah
[14:08:51] ifupdown causing more headaches
[14:09:21] that config is added to the /etc/network/interfaces line by puppet is it?
[14:19:50] there may be other ways to approach the config in /etc/network/interfaces
[14:20:09] I'm wondering should the commands be "pre-up", and have a "post-down" to delete the interface?
[14:20:29] although not sure that will fix the issue with the systemd units
[14:20:49] there is also a "mode tunnel" that ifupdown supports natively, but I'm not sure if we can make it do what we want
[14:21:39] I cannot use mode tunnel cause it requires an endpoint
[14:21:55] you're right, that config is managed via puppet
[14:21:55] ok yeah I was gonna test that, we can't put "any" in there no?
[14:22:13] it requires an IP AFAIK
[14:22:38] interface/manifests/ipip.pp is the puppet side of things
[14:23:15] the pre-up could make sense
[14:23:49] the other way I've tackled such things - but again very hacky - is to use a shell script of some sort
[14:23:59] let me test manually on ncredir2001
[14:24:15] and invoke that with the up command, and have the script check if it exists or whatever
[14:24:29] I'm just doing a quick test on sretest1001 actually
[14:24:33] oh thanks
[14:25:38] actually a question - ultimately it probably makes sense all our servers get these ipip interfaces defined ?
[14:26:02] on a liberica world all realservers will have these interfaces
[14:26:11] or does it? I know all won't be realservers behind a LB, but many will, the easiest approach might just be to have them on all?
[14:26:24] I guess it means the ipip kernel module gets loaded where it's not needed then sometimes
[14:26:44] in 1 or in 1000 hosts, we need to bring the devices anyways
[14:27:25] yeah, I'm just thinking in terms of config automation, might be easier that everything has it
[14:27:34] but that's for another day
[14:28:50] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Fabfur) Some updates about the ongoing work: Currently our Benthos configuration produces this output, when fed with HAPr...
[14:29:48] vgutierrez: yeah I think you're right about the "tunnel" mode natively in the interfaces file
[14:30:08] on sretest1001 I added this but it didn't do the trick
[14:30:11] auto ipip0
[14:30:12] iface ipip0 inet tunnel
[14:30:12] mode ipip
[14:30:12] address 127.0.0.42
[14:30:12] netmask 255.255.255.255
[14:30:12] endpoint any
[14:30:13] dstaddr any
[14:30:40] it issued a command "ip tunnel add ipip0 mode ipip remote any", rather than the "ip link " type command we need
[14:35:18] vgutierrez: another option might be to not define the interfaces in /etc/network/interfaces at all
[14:35:36] but simply add them with "post-up" commands under the definition for the primary interface
[14:36:06] given everything is being done by iproute2 commands called from "up" probably doesn't make much difference
[14:36:49] what would happen @reboot time?
[14:40:42] it would just issue the commands when bringing the main interface up
[14:43:00] did augeas create the systemd services for each interface?
[14:43:18] they seem superfluous tbh, it's not clear to me why they are there
[14:43:23] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (klausman)
[14:43:46] topranks: nope afaik
[14:43:52] dunno what triggered the service creation
[14:44:10] topranks: did you test the pre-up approach?
[14:44:51] vgutierrez: can we do that on ncredir2001 maybe?
[14:44:57] sure
[14:45:02] it's already depooled
[14:45:12] I'll disable puppet there and reboot the host
[14:45:16] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (klausman)
[14:45:31] pre-up should work fine, and I think is better, but just adding the config for an ipip interface on sretest server doesn't create the extra systemd units, which is what is tripping us up
[14:45:57] some kind of debian magic in place :)
[14:47:20] are you testing it?
[14:47:35] Hey peeps, quick question re: the network migration, on the 28th, conf2004 will be affected. According to hiera that's the one the codfw lvs::balancer read from. I'm unsure as to what we should do to mitigate impact.
[14:48:08] claime: switch to another etcd host and reboot pybal in codfw
[14:48:33] we should probably do that well in advance yeah?
[14:48:38] reboot pybal?
[14:48:54] I assume restart the service
[14:48:55] restart pybal service
[14:48:58] yes.. restart pybal
[14:49:04] pybal is a service, not a hostname
[14:49:10] sorry for using the wrong verb
[14:49:15] lol
[14:49:16] haha ok :)
[14:49:42] topranks: you already modified /etc/network/interfaces on ncredir2001, right?
[14:50:07] yeah, this is what I have now, although it's likely we'll hit same issue
[14:50:17] https://www.irccloud.com/pastebin/tRnLV2DJ/
[14:50:19] post-down p link del dev ipip60 --> small typo there
[14:50:25] ugh
[14:50:27] good spot
[14:50:40] fixed
[14:50:47] ack, rebooting the host now
[14:50:51] thanks
[14:51:16] I also suppose we shouldn't use the one that has etcdmirror on it, so that would mean using the same conf node as eqsin and ulsfo
[14:52:51] topranks: systemd is happy but the network devices aren't there
[14:53:25] (SystemdUnitFailed) resolved: (2) ifup@ipip0.service Failed on ncredir2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:53:52] https://www.irccloud.com/pastebin/ai4wcECL/
[14:54:41] ^^ this was after I issued the "ifup ipip0" command manually judging by the time
[14:54:50] oh :_)
[14:55:04] so what's going on?
[14:56:23] not sure tbh
[14:56:28] topranks: hmm it looks like ip link set up dev ipip60 should be "up" and not "pre-up"?
[14:56:52] running ifdown does delete the interface, and subsequent "ifup" brings it back up and re-creates again
[14:57:05] let me try that
[14:58:26] sure.
pre-up makes more sense in that we want to create the device before we set it to "up", but we are manually issuing the "ip link set dev ipip0 up" command too so maybe it doesn't matter
[14:58:48] I'm not sure what failed on boot with pre-up in there though
[14:59:59] vgutierrez: seems to be the same outcome
[15:00:08] yep
[15:00:10] * vgutierrez puzzled
[15:00:32] let's try with "up" to create the interface, and re-add the "post-down" to see if that helps with the systemd services
[15:00:54] you're already editing the file
[15:00:55] (SystemdUnitFailed) firing: (3) ifup@ens13.service Failed on ncredir2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:20] vgutierrez: heh yep, but let me know if we should try something else
[15:02:02] ok to reboot?
[15:02:07] yep
[15:04:27] ok so same story - no interfaces. which is very strange
[15:04:51] only difference from when we began is the "post-down" - which makes me think that is getting triggered on boot somehow ?
[15:04:57] I'm wondering if puppet triggered the interfaces coming up
[15:05:08] rather than the reboot
[15:05:45] right..
[15:05:54] https://www.irccloud.com/pastebin/8QTYpDkx/
[15:05:55] (SystemdUnitFailed) resolved: (3) ifup@ens13.service Failed on ncredir2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:40] that's kinda depressing :)
[15:08:50] oh and...
Feb 07 12:36:37 ncredir2001 systemd[1]: Started ifup@ipip0.service - ifup for ipip0
[15:09:13] the first execution of that service when we saw the error got triggered by puppet adding the interface as well
[15:11:11] there are a lot of things getting in each other's way here
[15:11:25] puppet/ifup/systemd etc
[15:11:51] part of me thinks the easiest would be to do it all with "up" commands under 'ens13' device definition in /etc/network/interfaces
[15:12:11] but probably lots of work to re-do the puppet code to do that
[15:12:25] perhaps adding that "|| true" is the easiest way forward
[15:12:50] we will be looking to replace ifupdown this year, so hopefully it would just need to be temporary
[15:23:38] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (Jhancock.wm) This rack is physically ready
[15:24:39] topranks: so are we sure that ifupdown is attempting to bring the device up on system start?
[15:27:07] vgutierrez: that's a good question.... 'allow-hotplug' brings a detected int up, but given it doesn't exist until it's brought up, I suspect maybe it won't
[15:27:14] perhaps 'auto' is required instead
[15:27:37] topranks: I don't see any auto stanza for ens13
[15:28:09] it's a physical interface so it will get created because it's on the pci bus, and then the "allow-hotplug" will tell the kernel to bring it UP
[15:28:25] (SystemdUnitFailed) firing: ifup@ens13.service Failed on ncredir2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:20] topranks: now.. that fixed it :)
[15:31:31] huh ok
[15:31:58] I never really got the "allow-hotplug" thing but I guess it makes sense
[15:32:33] it was introduced to deal with ephemeral interfaces I think, i.e.
usb ethernet adapters
[15:33:25] (SystemdUnitFailed) resolved: ifup@ens13.service Failed on ncredir2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:26] where you don't want it "auto" bringing them up unless they are present
[15:33:42] yeah.. definitely we want them in auto for this kind of device
[15:34:25] unless we have a way of triggering the hotplug event for them
[15:38:04] reading the docs I think it only makes sense for a real hardware device to have allow-hotplug
[15:38:07] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (Andrew) There's no need to coordinate with us for cloudbackup2001, it might cause us to get a transient alert...
[15:42:28] vgutierrez: so 'auto' rather than 'allow-hotplug' definitely makes sense.
[15:42:53] I'm still slightly confused about the systemd services, I'm wondering if the 'allow-hotplug' was affecting those too?
[15:43:02] but if they aren't failing then great :)
[15:51:58] https://puppet-compiler.wmflabs.org/output/998438/1323/
[15:52:24] actually https://gerrit.wikimedia.org/r/c/operations/puppet/+/998438
[21:50:09] (LVSHighRX) firing: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[21:55:09] (LVSHighRX) resolved: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
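
Note on the ipip0 / ipip60 thread (13:55 to 15:52): the direction the discussion settled on, 'auto' rather than 'allow-hotplug' for a virtual device, could look roughly like the stanza below. This is only a sketch under stated assumptions. The real stanza is generated by interface/manifests/ipip.pp and is visible only in the pastebins and in Gerrit change 998438, none of which are quoted in this log. The "inet6 manual" method, the split between pre-up and up, and the "ip link show ... ||" existence guard are assumptions added for illustration; the guard is one possible form of the "check if it exists" idea raised at 14:24:15. Only the "ip link add name ipip60 type ip6tnl external", "ip link set dev ... up" and "post-down ip link del dev ipip60" commands come from the conversation itself.

```
# Hypothetical /etc/network/interfaces fragment, NOT the puppet-generated one.
# 'auto' makes "ifup -a" create the device at boot. Per the discussion,
# 'allow-hotplug' appears to wait for a device that does not exist yet.
auto ipip60
iface ipip60 inet6 manual
    # Assumed guard: skip creation if the link already exists, so a second
    # ifup does not fail with "RTNETLINK answers: File exists".
    pre-up ip link show dev ipip60 >/dev/null 2>&1 || ip link add name ipip60 type ip6tnl external
    up ip link set dev ipip60 up
    post-down ip link del dev ipip60
```

Compared with appending "|| true" to the creation command, a guard of this kind would still surface genuine failures of "ip link add", at the cost of a slightly longer line.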