[07:12:11] someone knows what's up with "MaxConntrack: Max conntrack at 85% on krb1001" ?
[07:14:26] Puppet, Infrastructure-Foundations, Keyholder, SRE: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10650455 (fgiunchedi) >>! In T374711#10649030, @jhathaway wrote: > @fgiunchedi should we consider this issue resolved, since the arming step for keyhold...
[07:21:28] also how can people read IRC without muting the pywikibug bot ?
[07:21:33] hi, I have a patch for Hiera lookup of `abuse_networks` to have its values merged instead of overridden. That is to ease banning IP ranges in deployment-prep. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128859
[07:21:59] Puppet, Infrastructure-Foundations, Keyholder, SRE: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10650469 (fgiunchedi)
[07:22:17] XioNoX: some /ignore it, others have clients that let you split bot traffic out of the main discussion. But most probably we could make it not notify on every single comment/vote etc :)
[07:23:02] yeah I do /ignore it
[08:39:15] netops, Infrastructure-Foundations, Traffic: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731#10650600 (cmooney) Doing a bit of an audit here to assess the current situation, we have the following cables in place which need to be removed: |Site|LVS|Cable|Sw...
[08:55:49] Puppet, Data-Persistence, database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#10650693 (jcrespo) Open→Resolved a:jcrespo This has not happened since, the rate of backup errors is very low. It...
[09:16:09] SRE-tools, Spicerack: Allow to discover/test in more isolation spicerack features - https://phabricator.wikimedia.org/T389329 (Volans) NEW p:Triage→Medium
[09:47:39] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10650888 (ayounsi) Port moved and still the same issue. I asked them (in French) if the patch got properly changed, and to call me on my mobile to discuss it in more detail.
[10:21:07] moritzm: I'll need to add an entry to ganeti7001/3's /etc/network/interfaces to add the sandbox vlan. In theory it should not have any impact, but maybe safer to drain one then the other ?
[10:22:11] more exactly this https://www.irccloud.com/pastebin/YAI5g6z6/
[10:24:01] is this only needed for 7001/7003 or also for 7002/7004?
[10:24:11] only 7001/7003
[10:29:28] let me quickly drain 7001, better to be on the safe side, then we can also simply reboot to pick up the new network config
[10:33:26] I'm getting a traceback from pynetbox, did something change for the network config of 7001? https://paste.debian.net/hidden/f42933ae/
[10:37:19] not that I'm aware of
[10:38:58] moritzm: I think I know
[10:39:00] one sec
[10:39:09] the bridge interfaces are not in netbox
[10:41:28] ah, thx
[10:41:46] topranks: there might be a small bug in the puppetdb import script
[10:42:18] https://netbox.wikimedia.org/extras/scripts/results/166810/ "Set asw1-b3-magru et-0/0/9 tagged vlans to [] matching eno12399np0"
[10:44:11] XioNoX: you ran it without the "apply changes" ticked?
[10:44:21] topranks: yeah
[10:44:24] I'm wondering does it not work properly on trial run
[10:46:08] moritzm: your error seems unrelated
[10:46:19] https://netbox.wikimedia.org/ipam/ip-addresses/16698/ the cluster IP is duplicated in netbox
[10:47:19] moritzm: I deleted the "rogue" one, https://netbox.wikimedia.org/extras/changelog/215329/ can you give it another try ?
[10:48:07] I'm wondering about this line in the log
[10:48:08] ganeti7001: removing child interface no longer in puppet 711
[10:48:24] that is due to code added recently to deal with the lvs reimage issue valenti n hit
[10:48:53] trying
[10:49:07] yes, it worked now!
[10:49:42] I think probably we need to review the automation around our ganeti nodes in light of this new vlan, definitely some stuff has hard-coded that only public/private is needed
[10:50:03] moritzm: I looked at all the other VIPs and no duplicated ones
[10:50:35] topranks: I'll have a look after we edit e/n/i
[10:50:47] there is definitely a problem
[10:51:13] puppetdb has interface 711 in the list
[10:51:19] but script seems to try and remove it
[10:51:22] I'll need to take a look
[10:52:25] for now don't run the puppetdb import script - I'll try to work out what's going on on -next and retry on production netbox once we have a fix to verify
[10:52:43] ack
[10:53:38] XioNoX: ganeti7001 is drained, you can make the config change whenever it works and then simply reboot. then I'll next swap the Ganeti master and proceed with draining 7003, ok?
[10:56:45] moritzm: `ifup sandbox` seemed to have worked
[10:56:51] rebooting just in case
[10:58:44] ack
[11:00:00] oh, convenient cookbook named reboot-single :)
[11:04:23] the mistake removing the 711 int was simple enough
[11:04:50] I'll wait until puppetdb has updated with the new sandbox vlan int / bridge device and re-run to check there aren't any other gotchas
[11:05:02] hmm, forgot to add sandbox to `auto lo private public`... rebooting again :)
[11:16:18] topranks: looks better, but not sure about the same line (23 and now 24) https://netbox.wikimedia.org/extras/scripts/results/166823/
[11:16:27] moritzm: all good for ganeti7001
[11:16:46] XioNoX: puppetdb has the new int in it
[11:16:57] I'll let you know when 7003 is ready
[11:17:20] re-running the import script works ok....
[11:17:21] https://netbox-next.wikimedia.org/dcim/devices/5215/interfaces/
[11:17:42] Let me submit the patch
[11:27:06] XioNoX: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1129216
[11:28:34] topranks: +1
[11:30:00] moritzm: is there something special I need to do for Ganeti to be aware of that new vlan, or can I just use makevm with `--network sandbox` ? Or should I just give it a try?
[11:31:08] I think makevm network sandbox should be all that's needed
[11:31:33] cool!
[11:58:12] XioNoX: ganeti7003 is ready
[11:58:41] moritzm: cool, also the VM got created properly
[12:00:21] nice
[12:00:26] moritzm: fwiw I updated my script for making the e/n/i config to also add the sandbox vlan if it exists for the given row
[12:00:29] and moved it here:
[12:00:29] https://github.com/topranks/ganeti_network
[12:00:41] rebooting 7003
[12:01:02] topranks: thx!
[12:09:32] moritzm: all done with 7003
[12:13:57] ganeti7001 somehow acquired a second IP on the private bridge?
[12:14:01] cmooney@ganeti7001:~$ ip -br -4 addr show dev private
[12:14:02] private UP 10.140.0.11/24 10.140.0.15/32
[12:14:10] topranks: yeah it's the cluster's VIP
[12:14:27] ok yeah and it's now the master
[12:14:28] gotcha
[12:14:50] and now I understand where this "private:0" in puppetdb we sometimes see comes from
[12:15:24] looks like puppetdb uses a simple string for the "ip" var under an interface
[12:15:29] i.e. not a list/array
[12:15:53] if there are two IPs of the same family on an int it creates an ":0" and records the second IP against that
[12:16:58] I don't think it should cause us any issue. The netbox import script does create the additional interface, however it doesn't attach the cluster VIP to it as its status==vip
[12:18:22] ack, I'll rebalance the cluster in a bit
[12:21:28] done
[12:48:15] moritzm: I'll need to do the same dance on all of eqsin's ganeti
[12:49:16] sure thing, I need to switch these to nftables anyway (which also requires a reboot, so we can kill two birds with a stone)
[12:49:37] I need to look into something else now, will get the first one ready in an hour or so
[12:49:45] that's not very vegetarian of you
[12:49:54] moritzm: no rush at all!
[12:51:09] I have some meetings, and then there is a dcswitchover during which we might want to stay quiet
[13:16:15] yeah, good point. we'll do it tomorrow instead
[13:45:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[13:55:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[14:13:16] Puppet, Infrastructure-Foundations, Keyholder, SRE: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10652054 (jhathaway) >>! In T374711#10650455, @fgiunchedi wrote: > There's two parts to keyholder, `-proxy` and `-auth`. You are correct the latter requ...
[14:29:55] moritzm: switchover is done, so I'm fine doing that now if you prefer
[14:31:11] let's do it tomorrow, I'm poking at maps/bookworm currently and that's simply too beautiful to interrupt
[14:32:15] hahahha
[14:47:47] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10652276 (RobH) I saw your reply and was about to ping in IRC to thank you for discussing in French with them directly. My fear is there is a language barrier and perhaps...
[15:04:20] Puppet, SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10652513 (MoritzMuehlenhoff) >>! In T388629#10648788, @jhathaway wrote: > Un...
[15:36:57] quick question about https://wikitech.wikimedia.org/wiki/Vlan_migration
[15:37:33] I am helping the ML team to upgrade their ml-serve hosts (k8s workers), what is the follow up needed for BGP after the reimage?
[15:52:24] elukey: usually run homer on the local core routers to remove the now down sessions
[15:53:03] XioNoX: ack thanks!
[16:34:04] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10653129 (RobH)
[17:05:41] XioNoX: I ran homer and indeed it removed the old neighbor, but I don't see any new one now for ml-serve2001
[17:06:03] https://netbox.wikimedia.org/dcim/devices/2963/ has the bgp flag
[17:06:11] elukey: ah yeah, you need to run it on its ToR switch too
[17:06:19] ahhh okok
[17:06:24] lsw1-a5-codfw in that case
[17:06:34] ack thanks
[22:51:55] FIRING: MaxConntrack: Max conntrack at 81.92% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[23:41:55] RESOLVED: MaxConntrack: Max conntrack at 81.49% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[23:42:55] FIRING: MaxConntrack: Max conntrack at 81.36% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack