[08:19:42] netbox, Infrastructure-Foundations: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696 (ayounsi)
[08:23:53] netbox, Infrastructure-Foundations: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696 (ayounsi)
[08:31:02] netbox, Infrastructure-Foundations: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696 (Volans) > As a first step I suggest that we identify on https://wikitech.wikimedia.org/wiki/File:Server_Lifecycle_Statuses.png which transitions are manual vs. automat...
[08:49:16] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (cmooney) @Andrew I think it's wise to proceed cautiously alright. And I've no objection to us keeping the Ceph host "public" and "cluster" NICs separate...
[08:50:02] netbox, Infrastructure-Foundations: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696 (ayounsi) I agree on the MOTD, it's not a significant change, the last of the 3 options in terms of priority/usefulness, but maybe also a low hanging fruit. Maybe a bit...
[09:25:11] I'm trying out tying the mgmt data from netbox into prometheus, hosts in "decommissioning" status are not in mgmt dns, is that expected?
[09:25:27] case in point wtp1039 https://netbox.wikimedia.org/dcim/devices/1146/
[09:25:54] to be clear I'm ok with that, I'll need to exclude said status from mgmt data export
[09:28:36] intuitively it makes sense to me that we want to stop caring/probing mgmt on decommissioning hosts
[09:31:11] godog: they still have their asset tag record: wmf7058.mgmt.eqiad.wmnet has address 10.65.5.52
[09:32:00] XioNoX: good point, it didn't occur to me
[09:32:24] godog: https://phabricator.wikimedia.org/T310266#7991769 "Do we need to monitor hosts that are Decommissioning in Netbox?"
[09:32:36] I don't remember the conclusion if any, just remember the thread
[09:33:29] nice find, I forgot about it, moving to asset tags complicates things a bit but nothing we can't do for sure
[09:33:48] "complicates" as opposed to excluding decommissioning status from the hiera-data netbox export script
[09:34:21] can be a first iteration vs. future iteration too
[09:35:15] agreed, ok I'll go with excluding decommissioning for now so we know all fqdns in hiera are in dns too
[09:36:23] sigh, which I realised now will introduce some races unless the hiera-export script is run as part of the decom workflow
[09:38:47] (problem for future iteration)
[09:39:04] change is https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/842359
[09:41:48] I'll let the other reviewers voice their opinion too. The change itself is trivial.
[09:42:05] +1, thanks!
[09:53:01] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) Open→Resolved a:cmooney Gonna close this one, device still showing ok and no alarms for FPC errors. We can re-open if problem happens again. ` cmooney@re0.cr2-esams>...
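The first iteration agreed above (leave hosts in "decommissioning" status out of the mgmt data export so every exported FQDN also exists in DNS) could be sketched roughly as follows. This is a minimal illustration, not the change in the linked Gerrit patch: the function name, the `EXCLUDED_STATUSES` set, and the dictionary keys `status` and `mgmt_dns_name` are all assumptions made for the example.

```python
from typing import Iterable

# Assumed status slugs to skip; "decommissioning" is the one discussed in the chat.
EXCLUDED_STATUSES = frozenset({"decommissioning"})


def exportable_mgmt_fqdns(devices: Iterable[dict]) -> list[str]:
    """Return the sorted mgmt FQDNs of devices that should be exported.

    Each device is a plain dict with hypothetical keys "status" and
    "mgmt_dns_name"; devices without a mgmt name are skipped as well.
    """
    return sorted(
        device["mgmt_dns_name"]
        for device in devices
        if device.get("mgmt_dns_name")
        and device.get("status") not in EXCLUDED_STATUSES
    )
```

As noted in the chat, a filter like this only stays consistent with DNS if the export runs whenever a host changes status, otherwise there is a window where the two disagree.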
[10:25:10] godog: yes decomm'ed have only the asset tag ones
[10:25:27] and the idea was to make the hiera cookbook run every time the dns cookbook is run
[10:25:34] and that's called by the decommission one too
[10:25:38] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (dcaro) > Perhaps something like cluster re-syncing when nodes are added/remove or in failure etc? Yep, that is when we hit the throughput limit yes, if...
[10:26:25] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/804575
[10:27:09] volans: *nod* thank you, running the hiera netbox script at decom time SGTM
[10:27:37] volans: what do you think re: excluding decom hosts from mgmt data at least for now ? i.e. https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/842359
[10:29:25] godog: I didn't get what the problem is, in the hiera now you have things like wmf5174.mgmt.eqiad.wmnet, that is totally resolvable in DNS
[10:30:28] ah, maybe I know
[10:31:07] so yeah, the host with decomm status has the dns_name in netbox still with the hostname, but the dns cookbook generates only the asset tag record
[10:31:10] and not the hostname one
[10:31:33] we have different options
[10:32:07] 1) change the hiera export to use the asset tag when the host status is decommissioned and the dns_name starts with "$hostname."
[10:32:25] 2) keep the dns records for hostname even when decomm'ed
[10:33:52] *nod* I don't feel strongly about either, maybe whichever is easiest to do? 2) seems fine to me as the hostname is still in netbox anyways
[10:34:10] "fine" as in intuitively makes sense
[10:35:34] the reason behind that was that once decommissioned a host could be renamed and used with a different hostname, and that the only attached bit to it is the asset tag at that point and not anymore the hostname... that said... wouldn't be a big deal, but we should ask dcops and potentially traffic what they think about it
[10:35:39] (1) is fairly easy too
[10:38:25] re: reuse, once that happens the new hostname will be in netbox and dns mgmt updated accordingly with the new hostname (?) once the host is in service again that is
[10:38:39] status "staged" perhaps?
[10:40:50] (I have to go shortly for lunch, will read later)
[10:46:16] yes, in case of renames all will be fixed in netbox and so all will follow accordingly without issues
[10:47:39] I guess the other question that we need to ask ourselves is: is it ok for a check and hence its metrics to change from $hostname.mgmt... to $asset_tag.mgmt... when the host is decommissioned? Only to then disappear a few days/weeks later when it's offlined (renames and re-purposes are very few compared to actual physical decom)
[10:48:41] among the options there is also 3) change the decomm cookbook to rename the dns record directly in netbox
[11:20:30] for the record i think 3 is the best option. specifically once the box is decommissioned we should refer to it via its $assettag.mgmt name everywhere and kill its old hostname everywhere. this would mean that metrics for the mgmt name would stop at a similar time as metrics for the production name, and we would have parity between the $host.$site name and the
[11:20:36] $host.mgmt.$site name.
[11:25:12] to be noted that the host will stay with its hostname in netbox (as device name) as decom tasks are done via hostname, not asset tags
[11:32:51] ahh ok, ultimately it might be best for dc-ops to make the call on this one
[11:33:21] agree
[11:56:05] netbox, Infrastructure-Foundations: Netbox: manage VRRP priorities - https://phabricator.wikimedia.org/T319301 (cmooney) Good stuff! I was just actually looking at this for the cloudsw, to try to get better balancing of the uplinks from c8/d5 (as currently d5 being used for both cloud and prod realm and...
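Option 1 from the discussion above (have the hiera export use the asset tag when the host is being decommissioned and its dns_name still starts with "$hostname.") might look roughly like the sketch below. The function name and its parameters are hypothetical, the status value "decommissioning" is assumed from the Netbox status mentioned earlier in the chat, and lowercasing the asset tag is an assumption based on the `wmf7058.mgmt.eqiad.wmnet` record quoted above.

```python
def mgmt_fqdn_for_export(
    hostname: str, asset_tag: str, status: str, dns_name: str
) -> str:
    """Pick the mgmt FQDN to export for one device.

    For a device in "decommissioning" status whose dns_name still begins
    with "<hostname>.", swap the hostname label for the lowercased asset
    tag. In every other case return dns_name unchanged.
    """
    prefix = f"{hostname}."
    if status == "decommissioning" and dns_name.startswith(prefix):
        # Keep everything after the hostname label (e.g. "mgmt.eqiad.wmnet").
        return f"{asset_tag.lower()}.{dns_name[len(prefix):]}"
    return dns_name
```

With the values borrowed from the chat for illustration, `mgmt_fqdn_for_export("wtp1039", "WMF7058", "decommissioning", "wtp1039.mgmt.eqiad.wmnet")` yields the asset tag name, while a host in any other status keeps its hostname-based record. Options 2 and 3 would instead change the DNS or decommission cookbooks and leave the export untouched.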
[12:17:46] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (cmooney) Ok. Yeah there are some larger spikes around that time. Biggest on cloudcephosd1026 on Aug 18th. {F35565980} But still fairly comfortably wit...
[12:18:20] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: replace getstats.GetDeviceStats with ntc-netbox-plugin-metrics-ext - https://phabricator.wikimedia.org/T311052 (ayounsi) After many iterations (see patches above) the pypi version ended up being deployed in prod... I manually applied the...
[12:36:45] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: replace getstats.GetDeviceStats with ntc-netbox-plugin-metrics-ext - https://phabricator.wikimedia.org/T311052 (jbond) nice work and sounds good to me > The only blocker so far to me is on how to cleanly "backport" the fix from https://gi...
[12:47:41] netbox, Infrastructure-Foundations: netbox scap script fails to create cacert bundle - https://phabricator.wikimedia.org/T320718 (jbond)
[12:47:53] netbox, Infrastructure-Foundations: netbox scap script fails to create cacert bundle - https://phabricator.wikimedia.org/T320718 (jbond)
[12:52:01] netbox, Infrastructure-Foundations: netbox scap script fails to create cacert bundle - https://phabricator.wikimedia.org/T320718 (jbond) the fix mentioned above is only on the master branch however the last deploy was from the 3-2-2 branch. I still need to update the documentation for deploying netbox a...
[13:11:26] ack re: dcops making the call, I'll open a task
[13:22:14] {{done}} T320721
[13:22:15] T320721: Decide whether decom'ing hosts mgmt DNS entry should stay or not - https://phabricator.wikimedia.org/T320721
[13:31:37] thx
[13:38:10] godog: I've integrated with an example
[13:41:24] cheers volans, that's useful
[16:07:54] netops, Infrastructure-Foundations, SRE: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) Just FYI I've adjusted one of the links on the row E/F switches now. Quick run-down of process: # Drain link by changing OSPF interface cost both sides: ** `set protocols ospf area 0.0....
[16:08:07] netbox, Infrastructure-Foundations, Patch-For-Review: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696 (jbond) > The main one is the necessity of Manual status changes, as defined in https://wikitech.wikimedia.org/wiki/Server_Lifecycle (Eg. "The ser...
[17:24:16] netops, Infrastructure-Foundations, SRE: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) Ok I've fixed the MTUs for all the underlay / switch to switch links in the new cage now. All that remains on those are the uplink sub-ints to the CRs, which for some reason are at 9174...
[17:28:17] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Jclark-ctr) @Papaul I can make myself available if @Cmjohnson can't
[18:24:19] netops, Infrastructure-Foundations, SRE: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) Actually I've discovered something odd on those sub-interfaces between switches and cr's. Firstly the value I was seeing was the protocol mtu (i.e. payload mtu) as I was looking at the...
[21:01:48] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, GitLab (Auth & Access): migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (demon)
[23:22:08] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, GitLab (Auth & Access): migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (jbond)