[06:41:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:06:56] (EdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[09:57:28] netops, Infrastructure-Foundations, SRE, SRE-tools, and 3 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (ayounsi) Stalled→In progress Finally merged!
[11:01:40] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou)
[11:23:05] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) Adding question here in addition to the CR: For druid ingestion we have 2 jobs, the first ingests all c...
[13:08:31] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (BTullis) I believe that it is not necessary to refine this data.
[13:24:56] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) In theory there should not be any PII data, but it would be safer to sanitize is nonetheless. As the data i...
[13:35:22] bblack, mmandere: were you able to run makevm in the end? (/me saw a PASS in the last run)
[13:36:33] hey
[13:36:55] yeah the makevm passed, I'm trying to debug the OS installer part now (no console output, likely some instance network setup issue)
[13:37:27] next thing on my list to check, is whether the trunking is set up on the switch side, since this first instance is on the public side and would rely on that :)
[13:37:32] mmmh that rings a bell though, maybe check with Mori.tz, I think he encountered some issue similar to that (no console)
[13:37:48] ack
[13:37:53] the instance IP doesn't even ping, so I don't think it's getting dhcp regardless of any console-specific issue
[13:39:33] omg, that interface name: enp175s0f0np0
[13:39:45] predictable™
[13:40:26] fyi, switch port have both vlans configured
[13:40:32] with private as native
[13:45:38] yeah
[13:45:50] can confirm that we never got a DHCP req on install1003 for the instance
[13:46:11] so either the instance is sending DHCP and it never gets there, or the instance isn't even functional enough to try that when it starts, one of the two :)
[13:46:44] instance mac is AA:00:00:4B:65:48
[13:47:30] I'm at the step in the instance-creation docs where you "start" the instance for the first time and watch the console, but I've never gotten any console output, and gnt-instance list claims the instance is "running"
[13:48:52] 6003 (the primary for the instance) does show a tap0 interface that I assume is for the VM
[13:49:13] maybe I need to go look for some ganeti-specific logs on those hosts (this is all fairly new to me, playing with ganeti!)
[13:50:39] kvm log for bast6001 is empty
[13:52:09] anyways, I gotta run out for a bit, will debug this more later!
[13:52:42] moritzm: ^ I think you solved the "no console" issue not long ago?
[13:56:09] bblack: `sudo brctl show public` only shows tap0 and not the server's uplink
[14:01:15] "iface enp175s0f0np0.611 inet manual" is present in etc/networks/interfaces, so maybe it's just need to be brought up, but dunno if ganeti is supposed to do it
[15:19:57] XioNoX: yeah that seems to be relevant for sure. Just not sure what the correct way to fix it is, yet
[15:23:22] notably on ganeti5001 as a reference, which has a similar-enough /e/n/i, "ifup -a" returns no error, while in drmrs it spits out:
[15:23:25] Error: argument "enp175s0f0np0.611" is wrong: "name" not a valid ifname
[15:23:28] ifup: ignoring unknown interface enp175s0f0np0.611=enp175s0f0np0.611
[15:23:38] maybe we've hit an iface name length limit heh
[15:24:54] that seems longer than IFNAMSIZ that should be 16
[15:25:15] way to go linux :P
[15:25:36] not sure if recent kernels have make it larger
[15:26:06] my eqsin comparison host has a much shorter one with ens1f0np0.510
[15:27:19] https://phabricator.wikimedia.org/T209707
[15:28:18] supposed to be ok in buster though
[15:30:02] either way, I think it's the iface.vlanid naming format which implies the correct trunking config
[15:30:11] I don't think I can arbitrarily rename that one in a simple way
[15:35:06] maybe some create ip link commands as in: https://serverfault.com/questions/424522/arbitrary-vlan-interface-name
[15:45:52] yeah I think I found a way, with some /e/n/i creativity maybe
[15:50:10] woot, yeah, I see console output this time
[15:51:15] so what I did was: I removed the two early reference to the enp175s0f0np0.611 interface (the "auto" and "inet manual" stuff earlier in /e/n/i)
[15:51:43] and in the bridge stanza, I replaced the "bridge_stp enp175s0f0np0.611"
[15:51:46] with:
[15:51:50] pre-up ip link add name vlan611 link enp175s0f0np0 type vlan id 611
[15:51:53] post-down ip link delete dev vlan611 type vlan
[15:51:55] bridge_ports vlan611
[15:52:26] erro sorry, replace "bridge_ports enp175s0f0np0.611" (not bridge_stp)
[15:53:23] (and then ran "ifdown public; ifup public" to get the normal network scripts to reconfigure things)
[15:53:44] fixing the other ganeti cluster now while waiting on the bast6001 installer
[16:04:22] so the issues was the length of the iface name?
[16:04:25] yes
[16:04:33] interesting!
[16:04:44] /e/n/i is manually-edited/managed on ganeti nodes anyways, so this was just further site-specific customization in the end
[16:05:28] the end result of the interface config is functionally the same as before, but virtual interface name for the trunk port "enp175s0f0np0.611" is now just "vlan611"
[16:06:37] not sure how that will look on the next puppetdb->netbox import from the ganeti6 hosts, we'll see!
[19:27:28] next in line is prometheus6001, but I gotta eat now, etc
[19:27:57] the next new wrinkle for that one will be: does the switch of all the install-time stuff to the new install6001 break something that was working fine for installs before? :)
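
Note on the length limit discussed above: IFNAMSIZ is 16 on Linux and that count includes the terminating NUL, so an interface name can be at most 15 bytes. "enp175s0f0np0.611" is 17 characters, which would explain the "not a valid ifname" error, while "ens1f0np0.510" (13) and "vlan611" (7) fit. A minimal sketch of that check follows. The function names are hypothetical, and the fallback to "vlan<id>" simply mirrors the workaround used in this log; it is an assumption, not an established naming convention.

```python
# Sketch only: checks VLAN sub-interface names against the Linux limit.
IFNAMSIZ = 16  # kernel buffer size for interface names, including the trailing NUL


def fits_ifnamsiz(name: str) -> bool:
    """Return True if the name is non-empty and short enough (at most 15 bytes)."""
    return 0 < len(name.encode()) < IFNAMSIZ


def vlan_iface_name(parent: str, vlan_id: int) -> str:
    """Prefer the dotted parent.vlanid form; fall back to vlan<id> if it is too long.

    The fallback is an assumption modelled on the "vlan611" rename in the log.
    """
    dotted = f"{parent}.{vlan_id}"
    return dotted if fits_ifnamsiz(dotted) else f"vlan{vlan_id}"


if __name__ == "__main__":
    for parent, vid in [("enp175s0f0np0", 611), ("ens1f0np0", 510)]:
        dotted = f"{parent}.{vid}"
        print(dotted, len(dotted), fits_ifnamsiz(dotted), vlan_iface_name(parent, vid))
```

A name chosen this way still has to be created explicitly, as with the pre-up "ip link add ... type vlan id 611" line quoted in the log, because a name like vlan611 no longer encodes the parent interface the way the dotted form does.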