[10:05:50] arturo: hey Antoine was asking yesterday evening about packet drops on cloudgw1002
[10:06:14] yeah, I saw the backscroll. I have been reading a few things this morning, and scanning a few graphs
[10:06:18] it appears there has been an uptick in packets dropped by the cloudgw on receipt
[10:06:19] yeah
[10:06:33] where do you see that?
[10:06:47] whatever tuning might be possible, it seems correlated with a large uptick in usage over the past few weeks
[10:06:49] https://usercontent.irccloud-cdn.com/file/NxFrfQj2/image.png
[10:07:06] this is the dashboard I've been playing with
[10:07:06] https://grafana-rw.wikimedia.org/d/VLFehqB4z/node-detail-test-cathal?orgId=1&var-instance=cloudgw1002:9100&var-Mountpoint=All&var-netdev=enp101s0f0np0&var-disk=All&var-num_cpus=16&var-chip=All&from=now-13d&to=now&refresh=30s
[10:07:46] bottom blue measure is "node_network_receive_drop_total"
[10:08:05] which are basically packets that were received ok by the system, but dropped in the inbound path
[10:08:27] as opposed to, say, errors inbound (bad CRC on the packets or something, which could be a problem with the far side, cable, hardware etc)
[10:08:53] the usual cause of this is that the system/application/upper layers cannot keep up and the kernel is dropping received packets
[10:10:30] there are also a bunch of weird messages from the NIC firmware
[10:10:30] bnxt_en 0000:65:00.0 enp101s0f0np0: Received firmware debug notification, data1: 0xdd67, data2: 0x0
[10:10:52] yeah I've seen those
[10:11:12] worth working out what it means... it could just be due to a full rx ring or something (and thus the same as the drops)
[10:12:55] none of the individual CPUs seem to be maxing out
[10:13:08] the system only has a single socket/NUMA node so we at least don't have that headache
[10:13:44] I'm reading we can set `ethtool -s <dev> msglvl N` to try to translate the debug messages into more information from the driver
[10:14:17] yeah
[10:14:44] is it safe to do that for a hot interface?
[10:14:44] probably worth doing
[10:15:42] sry,... I honestly do not know what the risk is
[10:15:51] "debug ip packet" on a cisco is not a good idea :P
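For reference, a minimal sketch of those checks on the box itself. The interface name is taken from the dashboard linked at 10:07; the ring size and msglvl value shown are illustrative assumptions, not commands that were actually run (run as root; growing the ring can briefly reset the link on some drivers).

```bash
DEV=enp101s0f0np0

ip -s link show dev "$DEV"                      # kernel-level RX dropped/error counters
ethtool -S "$DEV" | grep -iE 'drop|miss|err'    # driver/firmware counters (bnxt_en exposes many)
ethtool -g "$DEV"                               # current vs. maximum RX/TX ring sizes

# If the RX ring is the bottleneck it can be grown, e.g.:
# ethtool -G "$DEV" rx 2047

# Driver message level: more verbose logging to the kernel log, which may
# translate the "firmware debug notification" lines. Check the current value
# first so it can be restored afterwards; 0xffff enables everything.
ethtool "$DEV" | grep -i 'message level'
# ethtool -s "$DEV" msglvl 0xffff
```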
[10:16:27] I notice the CPU wait time jumped up too, but that was at the start of the month before things got busy
[10:16:43] https://grafana.wikimedia.org/goto/2IXa9u7Ng
[10:16:53] so probably not the cause - but worth knowing
[10:17:20] I see in htop that conntrack is the big userspace user of CPU... though the number of total conntracks has not increased along with the usage
[10:17:36] In general I think we need to:
[10:18:04] 1) determine if there is something not working right (my guess is everything is working ok)
[10:18:22] 2) determine if there are tunings we can do to get more performance (almost certainly yes but it's a rabbit hole)
[10:18:41] 3) get a view on what is causing the increased usage and decide if there is anything to be done there
[10:19:12] 4) start working out how we scale the system in general (more cloudgw in parallel? more NICs/CPUs? something else?)
[10:19:14] in htop we have conntrackd consuming lots of CPU. That's the daemon syncing conntrack entries to the other node, it is not involved in the data plane
[10:19:33] ah ok
[10:19:41] yeah probably fine
[10:20:23] there is a potential issue with memory access when a packet arrives and is compared against the conntrack table, if that's taking too long, but it seems not to be the case; makes sense that it's the sync
[10:20:33] the number in total isn't up, so that hasn't changed
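To double-check that conntrack angle, a quick sketch (assumes the conntrack CLI from conntrack-tools is installed on the cloudgw):

```bash
# Table occupancy vs. the configured limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Per-CPU conntrack stats: insert_failed, drop, early_drop, search_restart, ...
conntrack -S

# Same count as the sysctl, as a single number
conntrack -C
```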
[10:23:46] if we are having issues in cloudgw, we should be having them on the cloudnets too, basically they see the same ingress/egress traffic
[10:29:16] two observations
[10:30:24] well one actually
[10:30:34] is that the cloudnet has 40 CPUs versus the cloudgw's 16
[10:31:07] cloudnet is also a two-socket system (so NUMA back in play)
[10:31:12] but you're right it's not showing drops
[10:31:34] but we aren't comparing the same things, totally possible cloudgw is running out of steam but better hw in cloudnet can keep up
[10:31:50] let me open a ticket, so we can capture all this
[10:31:52] plus exactly what functions are being performed for each packet makes a difference (filtering, NAT etc)
[10:33:05] T381078
[10:33:05] T381078: cloudgw: suspected network problems - https://phabricator.wikimedia.org/T381078
[10:38:23] is this concerning?
[10:38:25] https://www.irccloud.com/pastebin/DDwkRKEp/
[10:38:40] similar numbers in cloudgw1001
[10:38:42] https://www.irccloud.com/pastebin/ovGnjl3H/
[10:41:58] https://www.irccloud.com/pastebin/qY2z4l7F/
[10:46:37] they're not incrementing though
[10:47:39] well... I'm assuming they'd be reflected in the prometheus error stats
[10:47:59] but these are L4... so not really related to the packet itself I think
[10:54:32] do you know if VM DNS problems correlate with network usage peaks in cloudgw or cloudnet?
[10:55:08] the overall rate of drops here is low enough I'd be surprised any user noticed in terms of DNS
[10:55:20] yes - some queries may have timed out - but unlikely a re-try would have as well
[10:55:40] we've had no drops in almost 48 hours either, not sure exactly what the DNS issues reported were like
[11:10:27] * arturo nods
[11:11:18] so overall I'm not sure if the problems hash.ar observed are related to the things I picked out (inbound discards)
[11:11:45] but the inbound discards are something we need to consider as a sign the system is hitting performance bottlenecks
[11:14:20] arturo: perhaps we could discuss it in today's netbof meeting? cloudgw in general and how to scale it?
[11:14:49] Joanna was going to send an invite related to active/active openstack but she didn't send it on yet so we can maybe use the time for this instead?
[11:18:03] yeah, sure
[11:18:55] topranks: oh, sorry, I just noticed, I have a conflicting meeting today and won't make it to the netBoF meeting
[11:37:35] topranks: cloudgw is 10G, what would be the next stage regarding speed? 20G, 40G?
[11:37:54] ok np
[11:38:20] 25G is the next step up in terms of Ethernet speeds
[11:39:03] how difficult is that upgrade?
[11:39:04] and it's more typically used server-side - typical top-of-racks these days are 48x25G (for servers) and 8x100G (for network uplinks)
[11:39:31] but you need to consider that if the system/CPU etc. is unable to keep up with current peaks, giving it a 25G NIC is not going to help
[11:39:56] the upgrade for us is tricky as the switches in C8/D5 are older and only have 10G and 40G ports
[11:40:00] softirq has never been over 25% in the last few weeks
[11:40:02] and all the 40G ports are used
[11:40:16] it's dropping packets
[11:40:28] occasionally, but it is doing it
[11:40:46] all I'm saying is you need to consider the whole system
[11:40:56] we couldn't just drop a 400G NIC in there and it'd do 400G
[11:41:30] sure, makes sense
[11:41:50] but because the boxes do NAT
[11:42:03] 2x10G, enabling the other link on the existing NIC, might also be an option
[11:42:23] it is very unlikely that we will see 10G as the peak on the graphs, the server most likely won't get to that line rate because of NAT
[11:42:28] we'd have to sneak it in while Arzhel was off :P
[11:42:40] therefore 6G may be a practical limit, and that's what we are hitting
[11:42:49] possibly yes
[11:43:05] the packets/sec is the real measure when thinking about the processing power of the box
[11:43:19] Gb/sec is just our upper limit in terms of traffic
[11:43:25] ok
[11:43:33] until we hit that the question is how many packets/sec is the box dealing with
[11:44:13] usually one is a good indicator for the other of course, the mix of packet sizes is probably not changing here
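Since packets/sec and packet size keep coming up, a rough sampler like the following could confirm whether the packet-size mix really is stable. A sketch only, assuming the same interface name as the dashboard above; interval and defaults are arbitrary.

```bash
#!/bin/bash
# Sample RX packets/sec, bits/sec and average packet size over a short window.
DEV=${1:-enp101s0f0np0}
INTERVAL=${2:-10}

stats() { cat "/sys/class/net/$DEV/statistics/rx_packets" "/sys/class/net/$DEV/statistics/rx_bytes"; }

read -r p1 b1 < <(stats | tr '\n' ' ')
sleep "$INTERVAL"
read -r p2 b2 < <(stats | tr '\n' ' ')

pkts=$(( p2 - p1 ))
bytes=$(( b2 - b1 ))
echo "rx: $(( pkts / INTERVAL )) pkt/s, $(( bytes * 8 / INTERVAL )) bit/s, avg $(( bytes / (pkts > 0 ? pkts : 1) )) bytes/pkt"
```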
[11:44:23] though we should try to understand what the recent uptick is because of
[11:48:18] does it matter though? if we have random users doing random things in Cloud VPS and Toolforge, and suspect capacity problems, for me that's enough reason to think about increasing capacity
[11:49:48] I would say it's worth knowing
[11:50:04] but I totally agree, we are seeing potential signs of hitting some performance limits (very little right now)
[11:50:14] so it's time to think about how we scale up regardless
[11:50:27] the peaks are reflected in the outbound traffic to our CRs
[11:50:44] it is definitely also worth knowing what this traffic is and whether it is going to the internet or not
[11:50:59] 6Gb/sec is a good proportion of our overall internet capacity in eqiad
[11:51:19] so if this traffic is going out to the internet we need to factor it into our circuit sizes, commit rates etc
[11:51:28] I see
[11:51:33] alternatively perhaps it's going somewhere in wiki-land, in which case it doesn't factor in to that
[11:51:39] I assume a good chunk is hitting the wikis
[11:52:06] we don't have a lot of visibility at the moment regarding what each cloud vps tenant is doing network-wise, but we have this graph here
[11:52:06] https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad-summary?orgId=1&viewPanel=65&from=now-7d&to=now
[11:52:24] it shows toolforge alone can take 3GiB/s
[11:56:45] yeah there is some good data there
[11:56:53] we can possibly drill down into it
[11:57:17] unfortunately the d5/c8 switches are an older model where we don't have sflow data export
[11:57:19] it seems the increases in integration, deployment-prep and toolforge spike together
[11:58:05] at least based on those broad descriptions they are probably not all going to the internet
[11:58:27] yeah
[12:01:58] looking at the eqiad overall internet use it has risen in the past month
[12:02:05] but I've no way to say what the cause is
[12:02:17] at least without digging deeper
[12:42:31] quick review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097450
[12:57:46] arturo: thanks! addressed the comments
[13:25:08] arturo: I understand your comment about the parameter is a must-have?
[13:26:05] well, I truly dislike templated scripts, I believe they represent a bad pattern
[13:26:42] ok, strong feelings trump loose opinions, I'll change it
[13:49:15] topranks: I have created this dedicated dashboard to show project traffic, in case it is useful https://grafana-rw.wikimedia.org/d/ded9b969-7207-4bde-9077-5f81457625c4/wmcs-openstack-eqiad-project-network-usage?orgId=1&var-project=tools&var-project=paws&var-project=integration&var-project=deployment-prep
[14:00:32] * arturo food, then meetings
[14:34:00] that dashboard is very nice. could the PAWS spikes be the main source of the increase in traffic?
[14:57:32] I am not sure how the integration project generates ~ 20MB/s of traffic when it is idling
[14:58:23] then again I have never looked at its network traffic :b
[14:58:59] you _thought_ it was idling :P
[15:05:20] I would appreciate a quick review of this incident report (the smaller incident that happened on Monday) https://wikitech.wikimedia.org/wiki/Incidents/2024-11-25_WMCS_proxy_nginx_failure
[15:49:01] andrewbogott: late reply, but I'd support removing "%{::wmcs_project}.eqiad.wmflabs" from search everywhere except tools
[15:55:32] taavi: thanks for reverting my erroneous wikitech link. I found this other page but I think it's out of date? https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/New_cluster#front_proxy_%28haproxy%29
[15:56:26] dhinus: that's also about a different proxy
[15:57:05] (i've been meaning for ages to consolidate those two toolforge proxies into one)
[15:59:25] haha ok
[15:59:49] I'm creating a task to add some more docs
[16:01:16] is the "toolforge webproxy" the replacement of https://wikitech.wikimedia.org/wiki/Obsolete:Portal:Toolforge/Admin/Dynamicproxy ?
[16:01:57] "toolforge webproxy" is an ambiguous term, as you might have noticed
[16:02:28] yep :D
[16:02:47] I mean the one in the page that you reverted
[16:02:48] arturo: thanks for that page, looks super useful :)
[16:03:27] so there are the tools-proxy-N nodes, which previously indeed ran dynamicproxy and are what that failover section you tried to edit talks about
[16:03:36] nowadays they just proxy everything to the k8s haproxies
[16:05:25] taavi: gotcha, thanks!
[16:05:54] basically the only meaningful thing that currently happens on them is the toolviews counting based on the access logs
[16:06:58] otherwise I would have gotten rid of them a long time ago
[16:50:16] arturo: I was looking in grafana. Prometheus node_exporter is sending all the tap-interface stats from the cloudvirts
[16:50:40] So we have per-VM usage stats, though correlating the names -> tap ints is not so simple
[16:51:27] yeah, that info is not even in the openstack DB
[16:51:45] it's not too hard to find
[16:51:56] you get the "ip neigh show" output from cloudnet in the right netns
[16:52:07] and then the "bridge fdb show" from each cloudvirt
[16:52:34] the latter has the "mac : tap interface" association
[16:52:42] in the grafana dashboard that I mentioned earlier, I just added cloudgw/cloudnet panels so we can correlate
[16:52:43] and you can search the first one for a mac to find the IP
[16:52:50] ok
[16:52:54] https://grafana-rw.wikimedia.org/d/ded9b969-7207-4bde-9077-5f81457625c4/wmcs-openstack-eqiad-project-network-usage?orgId=1&forceLogin&var-project=deployment-prep&var-project=integration&var-project=paws&var-project=tools&var-cloudgw=cloudgw1002:9100&var-cloudnet=cloudnet1006:9100
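The VM-to-tap mapping described above (ip neigh on cloudnet, bridge fdb on the cloudvirt) could be strung together roughly like this. A sketch with placeholder names: the qrouter netns naming and the example VM IP are assumptions, not values taken from this deployment.

```bash
# On the active cloudnet, in the neutron router netns (qrouter-<uuid> is the
# usual convention; check the exact name with ip netns list):
sudo ip netns list
sudo ip netns exec qrouter-<uuid> ip neigh show          # IP <-> MAC

# On the cloudvirt hosting the VM:
sudo bridge fdb show | grep -i tap                       # MAC <-> tap interface

# Chaining the two for a single (placeholder) VM IP:
MAC=$(sudo ip netns exec qrouter-<uuid> ip neigh show \
      | awk -v ip="<vm-ip>" '$1 == ip {print $5}')       # run on cloudnet
sudo bridge fdb show | grep -i "$MAC"                    # run on the cloudvirt
```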
[16:53:42] what's "paws" ?
[16:54:01] a Cloud VPS project hosting a public instance of jupyter notebooks
[16:56:11] taavi: I edited again, hopefully this time more correctly :P https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#WebProxy
[16:57:15] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/24
[16:58:22] dcaro: +1d
[16:58:31] thanks!
[17:01:53] ah ok
[17:02:01] paws is what's changed yeah
[17:02:01] dcaro: I might not get the emailer review fully done today, is it okay with you if I merge it when you're away?
[17:02:16] arturo: also I adjusted the cloudgw/cloudnet graphs, hope that's ok
[17:02:31] I filtered for only vlan ints, to stop counting packets twice (on physical/vrf/vlan)
[17:02:33] topranks: I think I just overwrote your changes because I added DNS panels
[17:02:42] and ah no worries
[17:02:45] blancadesal: no problem for me 👍, don't feel rushed though, it can wait if you don't have time
[17:02:51] I have it open still, I can get the JSON
[17:03:06] topranks: please add them back, I will go offline now, so no more changes today from me :-)
[17:03:20] ok cool thanks
[17:03:30] the big diff also is that I multiplied the bytes * 8 to get bits
[17:03:47] ok
[17:03:58] dcaro: it's mostly that we don't have much code that explicitly deals with threading/coroutines so it's interesting to take a closer look :)
[17:04:23] feel free to play with it, using time.sleep around helps you find blocking points
[17:05:11] but anyway paws is the thing that's changed daily use from ~1Gb/sec to 4-5Gb/sec in the past few weeks
[17:06:12] they've gone from a few hundred pps max to 1.2 million
[17:06:54] * arturo offline
[17:07:14] paws is a public resource we provide? is there any documentation on it?
[17:07:31] https://wikitech.wikimedia.org/wiki/PAWS
[17:07:39] dhinus: thanks!
[17:07:51] that's, uh, rather worrying
[17:07:54] the main expert is r.ook who is away today
[17:07:55] am I reading also that the pps on the integration project are quite low, even if it's doing some relatively high bps?
[17:08:10] (mostly curious, seems like big packets if so)
[17:08:34] dcaro: if you still have a moment today, these should (hopefully) be quick: https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/5
[17:08:34] https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/43
[17:08:48] we frequently get abusers running {stuff} in PAWS, so it's possible there is something dodgy running there
[17:09:01] I was only thinking that
[17:09:09] yeah I'm trying to take a look but someone's taken my admin rights away in the UI
[17:09:23] I don't know much about Jupyter, but it occurred to me "can you execute code there to talk to the internet?"
[17:09:35] ok well it's not an emergency to look at now
[17:10:53] taavi: I also don't have admin rights unfortunately
[17:11:14] me neither :/
[17:11:17] and the kubectl commands in the documentation seem to be missing something to make it use the right config
[17:14:11] dcaro: integration does a lot of `git clone`, I guess that is a lot of big packets yes
[17:14:35] there are some npm/composer installs but afaik they should mostly hit a local cache and barely do any network traffic
[17:14:43] (cause we have lock files)
[17:15:24] and Jenkins in production retrieves the build artifacts from the instances
[17:17:02] thanks! it's interesting how that's reflected in the big-packet network traffic
[17:17:50] blancadesal: +1d
[17:17:54] taavi: kubectl wfm from bastion.paws.eqiad1.wikimedia.cloud, using KUBECONFIG=/home/rook/paws/tofu/kube.config
[17:18:03] not sure if that config is also somewhere else
[17:18:04] where is that documented?
[17:18:12] nowhere :(
[17:18:20] but I had some memories from playing with quarry
[17:18:42] I'm pretty sure I asked for that to be documented several years ago at this point :/
[17:18:58] (the last time I had to urgently look at a PAWS thing because no-one familiar with Magnum was available)
[17:19:02] I think there's another way to access the cluster, but I don't remember it right now
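A sketch of getting kubectl talking to the PAWS cluster, based on the path mentioned at 17:17 (a later message notes tofu can regenerate the kube.config if it is missing). Whether metrics-server is deployed for `kubectl top` is an assumption.

```bash
# On bastion.paws.eqiad1.wikimedia.cloud
export KUBECONFIG=/home/rook/paws/tofu/kube.config
kubectl get nodes -o wide

# Recently created pods are a reasonable first place to look for abuse:
kubectl get pods -A --sort-by=.metadata.creationTimestamp | tail -n 20

# Requires metrics-server in the cluster, which may or may not be available:
kubectl top pods -A --sort-by=cpu | head -n 20
```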
[17:19:57] I think for production I had a dashboard showing the bandwidth, packets per second and the ratio/size of packets
[17:20:25] so you could see the surges of small packets :)
[17:20:30] and large downloads
[17:23:23] if I'm reading https://prometheus-paws.wmcloud.org/graph?g0.expr=sum%20by%20(namespace)%20(irate(container_network_receive_bytes_total%5B5m%5D))&g0.tab=0&g0.display_mode=lines&g0.show_exemplars=0&g0.range_input=30m correctly, most of the traffic is coming from the kubernetes internals which seems odd
[17:26:29] even the paws.bastion has high CPU pressure https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=paws&var-instance=All&from=now-30d&to=now&viewPanel=609 :)
[17:26:52] that Pressure Stall Information is a more detailed version of Load Average (more or less)
[17:27:11] the top of the graph has a link to the ref https://docs.kernel.org/accounting/psi.html
[17:27:43] oh, nice
[17:27:48] yeah
[17:27:57] it is awesome, I should port that panel to the production host-overview
[17:27:58] dcaro: thanks!
[17:28:15] also paws-nfs1 has a /srv/paws disk that has been filling up since 11/11
[17:28:34] which is more or less when it started having more CPU usage
[17:29:05] Out today, haven't read everything. Hmm... I guess the kube.config file isn't documented. At any rate if you need one tofu will generate one for you. There are a lot of abusive accounts still appearing on paws, they haven't diminished as much as they normally do. Many of them are running DOS junk which could be part of the networking stuff. Though if it is mostly coming out of kube-system and other places that doesn't explain it
[17:29:08] hmm, I wonder if it might be logging stuff on a loop
[17:29:12] why is the paws bastion having so much CPU? Well I don't know :)
[17:30:13] How is the paws bastion showing CPU pressure? I'm not seeing it on the system or in the linked chart
[17:31:06] https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=paws&var-instance=bastion&from=now-30d&to=now&viewPanel=609 ?
[17:31:09] well it did at some point
[17:31:37] then the spike was at 0.025 so probably very low, and I don't know what that unit is
[17:36:08] anyway. I am off for dinner
[17:36:45] bon app!
[17:37:03] merci
[17:37:04] !
[17:44:49] * dhinus off
[17:53:49] * dcaro off
[17:53:50] cya!
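A closing note on the PSI unit question at 17:31: the pressure values can be read straight from the kernel on the host, and per the psi.html doc linked above the averages are percentages of wall-clock time spent stalled. If the Grafana panel plots a rate of node_exporter's pressure counters, a reading of 0.025 would be roughly 2.5% of time stalled; that mapping is an assumption about how the panel is built, but either way it is very low.

```bash
# "some" = at least one task was stalled on the resource, "full" = all non-idle
# tasks were stalled; avg10/avg60/avg300 are percentages over those windows,
# total is cumulative stall time in microseconds.
cat /proc/pressure/cpu
cat /proc/pressure/io
cat /proc/pressure/memory
```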