[01:16:21] * bd808 off
[08:42:41] morning
[10:05:47] morning
[11:01:23] o/
[12:20:31] arturo: hi, i have a question about https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network#Network_/_Vlan_usage
[12:20:58] yes
[12:21:08] that doc says that cloud-gw-transport is used between cloudgw<->cloudsw, and cloud-instance-transport is between cloudnet<->cloudgw
[12:21:42] however, only the cloud-gw-transport-eqiad (vlan1107) has hops on cloudnet servers, and cloud-instance-transport1-b-eqiad (1120) does not
[12:22:42] it may be that the names are swapped in the docs, and are the other way around, instance transport being between cloudsw and cloudgw, and gw-transport between cloudgw and neutron
[12:24:07] ok. I'll update the docs to match the current reality
[12:24:23] please double check now that we're at it
[12:26:49] I did https://wikitech.wikimedia.org/w/index.php?title=Portal:Cloud_VPS/Admin/Network&diff=prev&oldid=2155364, as far as I can tell it now matches both the reality and the diagram at the top of the page
[12:27:21] ack
[12:28:11] I believe it was easier to do it like that from the routing PoV when we introduced the additional subnet, that's why the instances-transport is "outer"
[12:28:41] i see, that explains why the naming is 'backwards' compared to what I expected it to be
[12:29:36] I also think that cathal wants to refresh all that, and reuse cloud-private for the transit, so another change may be happening at some point
[12:35:39] for context, I'm doing some research into how we could replace neutron-linuxbridge-agent with neutron-openvswitch-agent, and I'm not seeing an obvious way of getting rid of a shared transit VLAN between all the cloudgw and cloudnet devices (which might be in different racks)
[12:36:35] why would you like to get rid of that? how else would you do ingress/egress?
[12:37:29] no, you can't get rid of that entirely
[12:37:37] but I think netops prefers if VLANs are not shared between racks
[12:37:40] or at least ingress, because egress you can do in a distributed fashion
[12:40:03] taavi: that's why using cloud-private for that transit is something to consider
[12:44:06] would you like me to elaborate?
[12:48:36] thanks, but i think i understand what you're saying
[12:50:02] the problem is mostly that the networks in our current setup don't directly map to what an ovs setup requires, and i'm playing with a few different ideas in my notebook to see what our deployment would possibly look like
[12:52:03] ok
[12:52:47] the picture I had in my head was to replace the linuxbridge agent with the ovs agent for the L2 side. For the L3 or ingress/egress, try to retain a similar setup to what we have now
[13:02:38] the main thing I'm wondering is how to implement the required 'external provider' network. that's the one that the per-tenant OVS networks connect to, and based on my understanding the most straightforward way to implement it is with a VLAN between all L3 nodes (cloudnet) and cloudgw nodes
[13:11:45] I see
[13:12:46] I think we should explore how to hook this external provider network up with cloud-private
[13:45:11] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/26
[13:45:40] done
[13:45:44] LGTM
[13:47:04] dcaro: also, please push the tag for webservice 0.103.2
[13:47:33] oh, I forgot, yes
[13:48:10] thanks!
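As a rough, hedged illustration of the 'external provider' network mentioned at 13:02: in a stock Neutron/OVS deployment that network is defined as a provider network bound to a physical network label that every L3 agent node can reach at L2. The commands below are a minimal sketch using the standard openstack CLI; the names (physnet1, example-external-net), the VLAN id and the subnet range are made-up placeholders, not the actual WMCS configuration.

```
# Hedged sketch, not the WMCS setup: define an external provider network on a
# physical network label. The OVS agents on each node map that label to a real
# bridge via bridge_mappings in openvswitch_agent.ini (e.g. physnet1:br-provider).
openstack network create \
    --external \
    --provider-network-type vlan \
    --provider-physical-network physnet1 \
    --provider-segment 1120 \
    example-external-net

# Subnet used by the Neutron routers on the L3 (cloudnet) nodes for their
# gateway ports; upstream routing happens beyond the gateway address.
openstack subnet create \
    --network example-external-net \
    --subnet-range 192.0.2.0/24 \
    --gateway 192.0.2.1 \
    --no-dhcp \
    example-external-subnet
```

The practical consequence is that every cloudnet (L3 agent) host needs to sit on that provider segment, which is exactly the shared transit VLAN concern raised above.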
[13:50:36] hmm, I think that the bug filtering script for bumping the version does not work well xd
[13:51:50] hmm, there are also weird tags on the gitlab repo: 7817f4d92d41f8787b427a351927ac0441004086 refs/tags/debian/0.68^{}
[13:51:58] with `^{}` at the end
[13:52:03] (they are duplicated)
[13:53:27] ohhh, I think that's what's confusing the script:
[13:53:27] ```
[13:53:27] dcaro@urcuchillay$ git ls-remote --tags origin | awk '{print $2}' | sort -V | tail -n 1
[13:53:27] refs/tags/debian/0.103^{}
[13:53:27] ```
[13:53:48] huh
[13:54:32] oh, those are signed tags maybe?
[13:54:53] (they point to the same commit, but the ones with `^{}` show the signature too)
[13:55:48] * taavi afk, be back later
[13:58:28] yep, that's it :) https://git-scm.com/docs/git-ls-remote#_output
[14:03:37] another quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/28
[14:10:48] approved
[14:10:50] * arturo food
[14:13:39] thanks
[14:13:45] I think it's not enough though :/
[14:27:06] it was :)
[16:09:17] possibly stupid question: does the 'b' in 'cloud-instances2-b-eqiad' stand for row B where we formerly were, or is there some other meaning in that?
[16:32:48] no it stands for row b, and probably should be changed
[16:33:09] if anyone knows what the '2' is all about I'd love to know :)
[16:34:48] taavi: I note there are no hosts on that vlan any longer on the WMF production switches in eqiad row B
[16:35:51] I will do some work to clean up the config / links from asw2-b-eqiad (prod) to cloudsw1-c8-eqiad
[16:36:28] after that we can probably rename the vlan. I'm fairly sure that's a no-hit on the switch but I'll try to test first to make sure
[16:37:00] topranks: presumably the 2 is there to differentiate it from the one used for nova-network back in the days :-)
[16:37:28] opinions on this quota request? T358477 1TB of increase for volumes is quite a lot, but the use case seems legit
[16:37:31] T358477: Request for more compute and storage for the GLAMS dashboard project - https://phabricator.wikimedia.org/T358477
[16:37:45] but that's helpful, thanks. so I assume that if I need to allocate a new subnet to be used for openvswitch experiments, it should be called 'cloud-instances3-{DC}'?
[16:38:07] dhinus: do you need more thoughts than what I just commented?
[16:38:17] * dhinus refreshes :)
[16:38:36] based on previous meeting it's a hard no on the storage request :P
[16:38:44] topranks: LOL
[16:39:10] taavi: thanks for the list of ports on the cloudgw fw rules patch
[16:39:19] can I take it they are all TCP apart from 53/DNS ?
[16:40:42] topranks: yep
[16:40:52] great thanks I'll submit a new patch
[16:41:01] thanks!
[16:41:14] taavi: topranks: correct: 2b means "second in row b"
[16:41:43] there was another instances vlan in row b previously, was there?
[16:41:54] if there was, it predates me!
[16:41:56] we name all our vlans like that: private1-a-codfw, private1-e4-eqiad etc.
[16:42:09] in case we ever need a 'private2-a-codfw'
[16:42:23] but I don't think we have any such instances on the prod side
[16:43:27] anyway I guess we should rename it 'cloud-instances-eqiad'
[16:43:51] taavi: maybe we should talk about the openvswitch setup / topology?
[16:44:20] maybe set up a quick meeting for next week?
[16:44:26] or tomorrow even
[16:44:29] ^ +1
[16:47:07] topranks: arturo: sounds good, after my doc reading session today I think I have enough information to talk about it
[16:47:20] ok
[16:47:39] great! if you have any relevant links fire them on, we could catch up tomorrow maybe while it's fresh in people's heads?
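On the `^{}` suffix from the 13:53 snippet: `git ls-remote` prints annotated/signed tags twice, once as the tag object and once "peeled" to the commit it points at, marked with `^{}`. A minimal sketch of the same pipeline with the peeled entries filtered out (the `origin` remote and the version sort come from the snippet above):

```
# Hedged sketch: same pipeline as in the pasted snippet, but dropping the
# peeled '^{}' entries that git ls-remote emits for annotated/signed tags,
# so each tag ref is counted only once before picking the newest one.
git ls-remote --tags origin \
    | awk '{print $2}' \
    | grep -v '\^{}$' \
    | sort -V \
    | tail -n 1
```

Alternatively, `git ls-remote --refs --tags origin` omits the peeled entries at the source.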
[16:48:52] works for me
[16:49:46] * bd808 predicts taavi will be very bored if he ends up in a networking class during future course work
[16:49:54] https://docs.openstack.org/neutron/latest/admin/deploy-ovs.html is the main ovs-specific doc, and the neutron docs in general are useful
[16:50:27] taavi: cool, I looked at some of that before but will refresh
[16:50:49] does 11am CET work? I'll throw something in the calendar
[16:50:54] I tend to forget about everything neutron the moment I close the browser tab
[16:51:16] topranks: a bit later works better for me
[16:51:33] same, but 11 CET works
[16:52:00] sure - my calendar is free so feel free to suggest a time
[16:58:25] topranks: Added a section with HA awareness on the capacity needed for ceph https://grafana-rw.wikimedia.org/d/DpbFWWCGk/wmcs-ceph-eqiad-capacity?forceLogin&orgId=1&var-num_new_servers=10&var-days_to_get_servers=60&var-ha_domains=4 , for 6 racks we can last almost 2 years with the 30d estimation, with 4 racks we need to expand in 34 weeks (add 8 hosts, that will get us to 1 year, but we would need to add more hosts then)
[16:58:45] topranks: arturo: is 13:00 CET fine?
[16:58:51] yes
[16:58:53] dcaro: thanks that's good info
[16:59:01] https://usercontent.irccloud-cdn.com/file/TzRcK1yc/image.png
[17:01:17] my gut instinct is if it can scale for 12 months by adding hosts - and not new network equipment/design - we can work on some of the bigger picture / longer term stuff about how we grow the overall platform before we revisit it next year
[17:02:19] taavi: 13:00 CET works for me
[17:02:24] I'll send an invite
[17:02:37] or ack yours even :)
[17:12:31] * arturo offline
[17:44:41] dcaro: do you know if different ceph pools can be throttled to different performance rates? So that we could have slow pools and fast pools running on the same cluster in order to distribute bandwidth strategically?
[17:45:14] not from the top of my head, I can investigate if you want
[17:46:21] no need. I'll just leave that out for now.
[17:46:53] * andrewbogott scribbling on https://docs.google.com/document/d/1mb1LCxuHS8USMMrpXeX2cvrWtCP_0uMifMc1zo2gojc/edit?usp=sharing
[17:59:22] * dhinus doesn't have access to that gdoc
[17:59:52] oh? It was supposed to be anyone with the link...
[18:01:37] now I see it
[18:02:56] I added you explicitly. I'm always surprised by how the 'anyone with link' permissions work (or don't work)
[18:03:17] i don't see it either
[18:03:46] taavi, how about now? https://docs.google.com/document/d/1mb1LCxuHS8USMMrpXeX2cvrWtCP_0uMifMc1zo2gojc/edit?usp=sharing
[18:04:05] i can read now
[18:04:10] ok, good
[18:04:12] I was still seeing "Restricted", I tried changing it again to "Anyone with the link"
[18:04:25] taavi, I'll add you as editor
[18:04:38] I can see that doc in read only so it's definitely open to all
[18:05:05] for some reason the first attempt by andrewbogott was not working, but my second attempt did...
[18:05:07] a bit ago I could 'suggest' but now it dropped to read-only again
[18:05:18] that's me playing with the settings sorry :D
[18:05:24] I'll move it back to "comment"
[18:05:55] (you can select "view", "comment" or "edit" under the "anyone with the link" option)
[18:13:29] dcaro: the `--command 'sh -c "exec 1>>$TOOL_DATA_DIR/recompile.out; exec 2>>$TOOL_DATA_DIR/recompile.err; recompile"' --mount=all` hack in T319883 is clever. :)
[18:13:30] T319883: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883
[18:15:04] bd808: it's the shell wrap that the jobs-api does, but it's not yet merged+deployed for everyone to use with `--filelog`
[18:21:48] * dcaro gtg
[18:21:52] cya tomorrow
[18:40:43] * andrewbogott -> cook lunch
[19:46:00] * bd808 lunch
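For reference, a commented sketch of the exec-redirection wrapper quoted at 18:13; the `recompile` command and the $TOOL_DATA_DIR paths come from that example, and this only illustrates the shell trick itself, not the jobs-api implementation:

```
# Hedged sketch of the pattern from the quoted --command argument.
sh -c '
  exec 1>>"$TOOL_DATA_DIR/recompile.out"   # append all further stdout to a log file
  exec 2>>"$TOOL_DATA_DIR/recompile.err"   # append all further stderr to a separate file
  recompile                                # run the real command; it inherits both redirections
'
```

`exec` with only redirections and no command re-points the current shell's file descriptors in place, so anything run afterwards in that shell writes to the log files instead of the original stdout/stderr.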