[07:17:07] 10netbox, 10DC-Ops, 10Infrastructure-Foundations: Netbox: investigate custom status - https://phabricator.wikimedia.org/T310594 (10ayounsi) @wiki_willy please let us know if DCops have any preference on this topic.
[07:39:51] 10netbox, 10Infrastructure-Foundations, 10Patch-For-Review: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696 (10ayounsi) > this is probably too blunt a check but i wonder if we could infer this based on the puppet role. e.g. `if role not in ['insetup', 'spa...
[09:12:22] 10netbox, 10Infrastructure-Foundations: netbox scap script fails to create cacert bundle - https://phabricator.wikimedia.org/T320718 (10jbond) 05Open→03Resolved a:03jbond I have now merged 3-2-2 into master and deleted it and will fix up the documentation next time there is a
[15:18:01] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10dcaro) > But still fairly comfortably within the 10G NIC capacity. What throughput limits were hit? Sorry if I missed them on the dashboard you linked, I d...
[15:19:14] hey topranks -- out of curiosity, is nic_saturation_exporter installed on WMCS servers?
[15:22:56] yeah, we're using it fleet-wide on any baremetal hosts
[15:30:17] CDs is:
[15:31:26] cdanis: even. I'm totally ignorant about that, must have a look in, assume the question is related to observed discards / reports of bottlenecks?
[15:34:40] topranks: not your fault, it's not really documented 😅 yeah, it came out of a past incident where we were seeing 'microbursts' of tx packet drops on memcached hosts because of hot keys
[15:35:22] Prometheus-reported rates only have a temporal resolution of ... I forget if we're at 60s or 30s fetch interval here, but a lot can be hiding in that in-between time
[15:36:27] https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/prometheus/files/usr/local/bin/prometheus-nic-saturation-exporter.py
[15:38:45] hm
[15:39:00] I don't see those metrics being exported by anything matching cloudceph.*
[15:39:04] so perhaps it's not running on those hosts
[15:41:59] cdanis: thanks! I'll definitely look into it
[15:42:22] yeah the lack of granularity is always a problem, and very much so with the 5-min averages on the LibreNMS stuff. So this could be quite useful :)
[15:42:25] erm also https://phabricator.wikimedia.org/T319184
[15:42:28] sorry
[15:42:32] https://phabricator.wikimedia.org/F35565980
[15:42:34] this graph
[15:42:53] I think 'Gib' in Grafana means Gibibits?
[15:42:55] not gigabits
[15:43:39] 7.45 Gibit/sec == 8Gbit/sec
[15:43:46] so that's interesting as well
[15:46:25] "gibibits" ???
[15:46:39] is that counting in 1024s rather than 1000s?
[15:46:42] yes
[15:46:49] 2^30 bits
[15:47:00] it is unfortunately a thing: https://en.wikipedia.org/wiki/Gibibit
[15:47:21] TIL
[15:47:31] It's already a thing I guess, best it has a name!
[15:49:22] `role::wmcs::ceph::osd` includes `profile::base::production` which includes `profile::monitoring` which includes `profile::prometheus::nic_saturation_exporter` if `! $facts['is_virtual']`
[15:49:35] so maybe it's running but not being scraped?
[15:50:07] otoh the scraping is generated by a puppetdb query
[15:50:09] hm.
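The microburst problem described above (tx drops hiding between 30s or 60s Prometheus scrapes) is what motivated nic_saturation_exporter. A minimal sketch of the underlying idea, in the spirit of `ifstat 1`: poll the kernel's per-interface byte counters every second and flag any second that approaches line rate. The interface name, threshold, and print-based output here are illustrative assumptions, not the actual exporter's behavior; the real script is linked above in operations/puppet.

```python
#!/usr/bin/env python3
# Sketch of per-second NIC saturation sampling (illustrative only; see
# prometheus-nic-saturation-exporter.py in operations/puppet for the real thing).
import time

IFACE = "eno1"       # assumption: interface name varies per host
THRESHOLD = 0.9      # assumption: flag seconds above 90% of line rate

def read_counter(iface: str, name: str) -> int:
    # Kernel-maintained interface counter, in bytes.
    with open(f"/sys/class/net/{iface}/statistics/{name}") as f:
        return int(f.read())

def link_speed_bits(iface: str) -> int:
    # /sys/class/net/<iface>/speed is reported in Mbit/s.
    with open(f"/sys/class/net/{iface}/speed") as f:
        return int(f.read()) * 1_000_000

speed = link_speed_bits(IFACE)
prev = read_counter(IFACE, "tx_bytes")
while True:
    time.sleep(1)
    cur = read_counter(IFACE, "tx_bytes")
    bits_per_sec = (cur - prev) * 8
    prev = cur
    if bits_per_sec > THRESHOLD * speed:
        # A real exporter would increment a Prometheus counter here rather
        # than print, so a 30-60s scrape still sees how many seconds were hot.
        print(f"{IFACE} saturated: {bits_per_sec / 1e9:.2f} Gbit/s")
```

The point of sampling at 1s is exactly the in-between-scrapes problem: a rate() over a 60s scrape interval averages away any burst shorter than a minute, while a per-second counter of "saturated seconds" survives the averaging.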
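For the unit arithmetic behind the "7.45 == 8" observation: Grafana's IEC 'Gib' is 2^30 bits, while the SI gigabit is 10^9 bits, so a link running at 8 Gbit/s renders as roughly 7.45 Gibit/s. A quick check:

```python
# IEC vs SI: 7.45 Gibit/s and 8 Gbit/s are the same rate within rounding.
gibit = 2**30   # 1 Gibit = 1,073,741,824 bits (IEC binary prefix)
gbit = 10**9    # 1 Gbit  = 1,000,000,000 bits (SI decimal prefix)

print(7.45 * gibit / gbit)  # -> 7.999..., i.e. ~8 Gbit/s
print(8 * gbit / gibit)     # -> 7.450..., i.e. 8 Gbit/s shown in Gibit
```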
[15:52:24] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) > I did some tests in the past and that was more or less the maximum network throughput I got, so I was expecting for that to be the same (thinki...
[15:59:10] cdanis: to your point the stats are exposed as raw counter bytes
[15:59:43] the graph panel allows you to set units as "IEC" or "SI", I believe the latter should mean 1,000 bits to a kilobit
[15:59:55] Although adjusting these doesn't seem to make much difference on a sample graph
[15:59:59] hmm
[16:00:08] Gigabit is in fact an SI unit :)
[16:01:15] Yep, and I stand corrected, made the change and it shows closer to 8Gb/sec
[16:01:18] https://usercontent.irccloud-cdn.com/file/te6Rpk4F/image.png
[16:03:20] the number being *that* close to precisely 8Gbit/s makes me a little suspicious
[16:04:19] I'm not overly concerned about that though, typical usage is a lot lower; the above is a new machine coming online I believe, which will always tend to max out the link. It's not a situation where links are normally saturated. Obviously if we need them to do the initial sync faster than the 3 hours shown above then perhaps they need to move to faster NICs etc., but I'm not sure that's the case
[16:04:43] cdanis: yeah, I share your suspicions, but I do also try to not read too much into these things sometimes
[16:04:48] yeah that's very fair :)
[16:05:10] and yeah, your observation about how any 'reasonable' syncing system will behave this way and eat everything it is allowed, is the right point of view, too
[16:05:33] But yeah that type of operation is gonna try and max everything out; the relatively flat 8Gb means something is at its limit. I'm not sure that's the NIC/switch port though.
[16:06:27] warrants further checks. I gotta run out now but I'll look into the nic_saturation_exporter stuff next week, do we have a Grafana dash for that already?
[16:07:06] yeah the metrics are actually built into the host and cluster dashboards
[16:07:17] it's just that it's either not running on or not being scraped from cloudceph hosts
[16:07:26] I'll take a quick peek later today as well
[16:07:34] it is supposed to be running on all bare-metal hosts in the fleet, as moritz said
[16:08:05] cool I'll take a look
[16:08:19] yeah we should get it going for them if it's not, definitely good to have that additional info
[16:09:20] I suspect we definitely have quite bursty traffic patterns from the WMCS hosts in general; the discards on the switch->cr uplinks are rather large given the average peaks shown in LibreNMS graphs are only at about 1G on the 10G links
[16:24:44] yeah, that was the same pattern we saw with the memcached hosts
[16:25:10] the tool btw was inspired by elukey running `ifstat 1` on those hosts while it was happening :)
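To settle the "running but not scraped?" question raised above (the scrape config is generated from a puppetdb query), one approach is to ask Prometheus which exporter targets it actually has and whether any match cloudceph.*. The server URL and job label below are assumptions for illustration, not the production names:

```python
#!/usr/bin/env python3
# Sketch: list nic_saturation_exporter scrape targets matching cloudceph.*.
# PROM and the job label are assumed names, not the real production config.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.org:9090"      # assumed Prometheus address
QUERY = 'up{job="nic_saturation_exporter"}'      # assumed job label

url = f"{PROM}/api/v1/query?query={urllib.parse.quote(QUERY)}"
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)["data"]["result"]

# If no instance matches cloudceph.*, the hosts are either not running the
# exporter or absent from the puppetdb-generated scrape configuration.
for sample in result:
    instance = sample["metric"]["instance"]
    if instance.startswith("cloudceph"):
        print(instance, "up" if sample["value"][1] == "1" else "down")
```

An empty result distinguishes the two failure modes discussed in the log: `up == 0` means Prometheus knows the target but can't reach the exporter (running/firewall problem), while no matching series at all points at the scrape config generation.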