[08:48:24] Hey! Did Mark change his nick in the freenode to libera transition? [08:49:04] Yes, it is question_mark now [08:57:31] marostegui: thanks! I'll update the contact list page [09:55:25] uh oh [09:55:26] ;) [11:56:18] there are some stale references to a retired service showing up here - these are representations of what's in etcd rather than pybal right? pybal itself doesn't have this service configured any more https://config-master.wikimedia.org/pybal/codfw/restbase [11:57:19] actually no I'm wrong, there are other hosts showing up in conftool newer than the ones above [12:08:26] <_joe_> hnowlan: I think old files should be removed, but I'm not 100% sure [12:09:11] <_joe_> yeah so [12:09:26] <_joe_> we do remove templates that generate those files [12:09:32] <_joe_> but apparently not the files heh [12:09:58] ahhh makes sense [12:11:10] <_joe_> so one thing we could do is remove them all, and restart confd [12:11:15] <_joe_> on puppetmaster1001 [12:11:23] <_joe_> doing it now [12:12:16] do I remember wrongly that puppet-merge was taking care of those? [12:13:28] <_joe_> volans: we're talking files on disk [12:13:38] <_joe_> that are generated by confd [12:14:41] ack got it [12:14:48] <_joe_> hnowlan: https://config-master.wikimedia.org/pybal/eqiad/restbase yep that fixed it [12:15:01] <_joe_> the codfw url is still in the varnish cache I guess? [12:15:02] _joe_: nice, thank you! [12:25:03] hi! 
if anyone is familiar with mypy 0.900+'s new type system, docker-pkg is broken because of it and meanwhile I have proposed to pin mypy<0.900 https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/747104 [12:25:25] but maybe there is an actual fix to support 0.900 properly :] [12:26:42] hashar: just add to the requirements for that tox env the stub packages [12:27:18] like 'types-requests' [12:27:32] as suggested in the error itself [13:05:22] I'm in a dilemma re: blackbox probes and labels, currently we have the "instance" label for probes set to host:port to match the usual convention. I need to add per-service state (e.g. production) as a metric, the natural choice would be to write the metrics locally on the filesystem to be picked up by node-exporter. However I can't set 'instance' directly that way [13:05:57] I'm considering writing a mini exporter just to listen on localhost on the prometheus hosts though, to be able to override 'instance' [13:06:52] so the dilemma is going forward with that idea, or switching to a different label than 'instance' [13:07:32] godog: what sort of "per-service state as a metric" are you thinking of? [13:08:13] case in point is the service::catalog 'state' field as a 1/0 gauge [13:08:21] to e.g. page only on production services [13:08:49] on services in 'production' state, more accurately [13:12:27] so, that sounds like you want to add a label to scraped metrics? [13:14:39] kormat: looking at a thing now, will reply in a bit [13:19:43] volans: thank you :] [13:22:00] kormat: I don't want to lose/change the metric when the state changes [13:22:07] the metric history rather [13:22:44] although state doesn't change that often heh [13:24:37] godog: i am confuse. do you have an example timeseries? [13:28:08] kormat: sure, e.g.
https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22toolhub%3A4011%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=12h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D [13:30:05] kormat: I'd like to augment that with paging alerts though only if the service is in production [13:31:42] godog: one way to achieve that is by adding scrape-time labels to the prom config [13:33:02] yeah that's probably the simplest, though there would be a new metric when the service changes state [13:33:27] new _timeseries_, yes [13:33:58] you'd filter out the state= label depending on what you're doing [13:35:01] yeah the other thing is that state is IMHO interesting only when talking about paging alerts [13:35:16] all other times we don't really care I think, e.g. when browsing [13:36:11] hi all i plan to set up an onboarding Q&A session for jesse covering request flows, app layer, logging, server lifecycle and spicerack/cumin. please let me know if you would also like to be added (cc mmandere kwakuofori arnoldokoth Emperor) [13:38:35] godog: is this state at the instance-level, or at the service-level, or something else? [13:39:31] e.g. if we were talking about databases, is this something that would change when we depooled one? [13:46:31] kormat: at the service level in this case, but no, state wouldn't change when e.g. we depool a service from a site [13:46:49] hum. [13:47:21] this is the 'state' field in 'service::catalog' in hiera, to be clear [13:47:50] oh good. something with a nice generic name and a non-generic meaning. :D [13:48:02] the other thing that occurred to me is that in general not all blackbox probes might come from service::catalog and therefore might not have 'state' [13:48:10] lol [13:54:19] kormat: thanks for the idea bouncing though! very useful [13:54:54] godog: just don't rate the conversation as state=production :) [13:55:14] I won't, or I might get ...
irate [13:55:24] :D [13:58:35] but on to happier news, the checks are working for a few services I've opted in, e.g. https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview [13:59:05] and found real issues too, like kartotherian not having its discovery entry in SAN [14:00:09] if it looks like many data points it is because the checks run every 15s unlike icinga's ~60s [15:21:19] kormat: I've eventually convinced myself and implemented a "mini-textfile" exporter in half a page of python, as it might come in handy for other generic "info type" metrics we want (e.g. the pooledness of a site) [15:21:34] godog: hah, ack. [15:27:20] volans: (or anyone else that knows) - so for all the excess/tagged interfaces on lvses and in netbox (the ones other than the host primary)... [15:27:34] * volans here [15:27:44] they're explicitly puppet-configured with "interface:IP" per vlan+host in modules/lvs/manifests/configuration.pp [15:28:04] yes, I was looking at it this morning as topranks was asking around in general how they are configured [15:28:12] I kind of assume the flow for a new host is: configure them there, let puppet apply them to the host, then let puppetdb feed that towards netbox+homer [15:28:33] (netbox for the ip allocations, and homer for the switch vlan trunking config) [15:28:55] I think so, the only limitation that I see is that the reimage cookbook runs the puppetdb import script for you at the end, but if we reimage into insetup [15:28:58] and then move the role [15:29:07] we need to re-run the netbox script [15:29:11] right, might have to manually re-run that in this case [15:29:13] that's fine [15:29:32] not a problem but I'm a bit worried about the general case of hosts that are reimaged into insetup and then change afterwards [15:29:41] my only other question was: how should I "reserve" these IP addresses in netbox (so that while my puppet patch is pending, they don't get allocated in netbox to someone else with the GUI) [15:29:45] off topic for your
changes though, so ignore it and go on ;) [15:30:02] In terms of that last question, "reserving" them. [15:30:27] Is a better pattern not to create an IP object in Netbox (like we do and associate a DNS name before a host is built), but not tie it to any interface? [15:30:48] And then have the Netbox puppet import script import that and actually bind the IP object to the device interface? [15:30:59] yeah I figured something like that should happen, just not sure what the details of that operation should be [15:31:05] if they exist already the import script should attach them unless they are VIPs [15:31:10] VIPs are not attached to ifaces [15:31:19] otherwise I could pick a bunch of IPs for this in my puppet patch, then deploy it tomorrow, but someone already stole one of those IPs for some other new host in the meantime [15:31:34] if I pick lvs1013 as an example https://netbox.wikimedia.org/dcim/devices/121/interfaces/ [15:31:49] it seems they don't follow any specific pattern [15:31:51] as long as you set the IP's status as "reserved" it won't be assigned to any other host by automation [15:32:02] you can reserve them in Netbox in 2 ways [15:32:19] 1) as XioNoX said, mark the IP as reserved in netbox [15:32:28] 2) set it up all the way, as you want [15:32:36] for both https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox [15:32:39] the former seems simpler [15:33:04] just skip the VIP parts [15:33:07] yeah it would avoid the possibility of it being assigned somewhere else in the meantime [15:33:08] switch ports for special cases (like trunking) need to be manually configured. To save time you can configure the vlans (and everything else), but keep the interface disabled. So it won't do anything when running homer. And then set it to enabled when ready and it will configure it all.
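For context on the "mini-textfile" exporter godog mentioned implementing earlier, a minimal sketch of what such a thing could look like follows. Everything here is an assumption, not the deployed code: the directory path, port, and metric names are invented; the only premise taken from the log is "serve locally-written textfile metrics over HTTP so the scrape config can control the 'instance' label".

```python
# Minimal sketch of a "mini-textfile" exporter: serve *.prom files over HTTP
# so Prometheus can scrape them with whatever 'instance' label the scrape
# config assigns. Directory, port, and metric names are invented.
import glob
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

TEXTFILE_DIR = "/var/lib/prometheus/mini-textfile"  # invented path


def render_textfiles(directory):
    """Concatenate every *.prom file in `directory` into one scrape payload."""
    chunks = []
    for path in sorted(glob.glob(os.path.join(directory, "*.prom"))):
        with open(path) as fh:
            chunks.append(fh.read().rstrip("\n"))
    return "\n".join(chunks) + "\n" if chunks else ""


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_textfiles(TEXTFILE_DIR).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main(port=9099):  # port invented
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

A service::catalog state gauge would then just be a file dropped into that directory, e.g. a line like `service_state{service="toolhub",state="production"} 1`.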
[15:33:33] oh I thought homer had rules for the lvs vlan trunking setup [15:33:52] nop [15:33:57] ok [15:34:12] for most of these interfaces, we're re-using the existing ones that are already correctly-configured anyways [15:34:21] So XioNoX just saying this to see if my understanding is correct [15:34:22] it's just the new host-primary to deal with, as that's a new port [15:34:37] The ports connected to the LVS should be configured something similar to this, right: [15:34:39] https://netbox.wikimedia.org/dcim/interfaces/16685/ [15:34:55] topranks: yep [15:34:59] Is there any reason, on a new build, to delay configuring the port that way? [15:35:08] as per your example to leave it disabled? [15:35:22] no, you can pre-configure it [15:35:26] well, I think netbox basically imports that from the switch though, right? [15:35:36] you could even enable it and let it sit, but then it will alert [15:35:43] (well, from puppetdb after it sees all this config, indirectly) [15:36:43] bblack: yeah, it will fix it from lldp if something is incorrect, but for a new server, as there is no LLDP, DCops needs to explicitly define which switch port and which vlan a new host is going to [15:37:16] using https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/ [15:37:34] but that only supports servers with 1 uplink and 1 vlan [15:38:27] IIUC (please correct me!)
- given the lvs1016->lvs1020 scenario (we're re-using all the existing, enabled, configured x-row links, but we have a new port for the host primary): [15:38:58] 1) I need to configure that new host primary port on the switch manually in the usual lvs primary way (untagged on the local private vlan, with trunking for the public) [15:39:19] 2) Configure all the interface:IP stuff for the trunked interfaces in puppet in modules/lvs/manifests/configuration.pp [15:39:32] 1) configure it on Netbox, then run Homer to configure the switch [15:40:20] 3) Puppet the host with this new config (which will set all of those ports and IPs and vlans on the host side, and export what it sees from LLDP on all these links to puppetdb) [15:40:35] 4) re-run the puppetdb import script, and it will fix up everything else in netbox for me [15:41:26] well and I guess I skipped 0) Reserve these IPs in netbox so they don't get accidentally allocated out before I get to (1) [15:42:15] bblack: is that before moving the X-row links? [15:42:16] (and ack, step 1 - use netbox to drive homer, not truly manual switch config) [15:43:17] XioNoX: on the physical process: lvs1020 will already be up and running in an "insetup" role with no real config. We'll downtime lvs1016, stop pybal, and unplug all its xrow links. Then plug them into lvs1020, then puppetize the machine into the real role (with all the interface:IP:vlan config from puppet) [15:43:45] ok [15:44:01] so yeah that should work :) [15:44:02] does this all make basic sense? 
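Step 0 of the plan above (pre-reserving the IPs) could in principle also be scripted against the Netbox API rather than done in the GUI. A hypothetical sketch, assuming a pynetbox-style client; the URL, token, addresses, and description are all invented, and the wikitech procedure linked earlier remains the documented route:

```python
# Hypothetical sketch: pre-reserve IPs in Netbox so the allocation automation
# won't hand them out in the meantime. All concrete values are invented.
def reservation_payloads(addresses, description):
    """Build one ipam.ip_addresses.create() payload per address,
    marking each as 'reserved' so allocation scripts skip it."""
    return [
        {"address": addr, "status": "reserved", "description": description}
        for addr in addresses
    ]


def reserve(nb_api, addresses, description):
    """Create the reservations through a pynetbox-style api object."""
    return [
        nb_api.ipam.ip_addresses.create(**payload)
        for payload in reservation_payloads(addresses, description)
    ]


# Usage would look something like (not run here, values invented):
#   import pynetbox
#   nb = pynetbox.api("https://netbox.example.org", token="REDACTED")
#   reserve(nb, ["10.64.1.20/22"], "lvs1020 pending puppet patch")
```

The key detail from the discussion is the `status="reserved"` field: as XioNoX notes, an IP in that state won't be assigned to any other host by the automation.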
[15:44:05] From my understanding that sounds workable, open to correction [15:44:48] fyi, I have an interview starting in 15min so I won't be able to help [15:45:08] I think it will be a little later today before we can actually try the process, or possibly tomorrow, depending on dcops schedule [15:45:26] I'm about if I can help with any of it [15:45:33] for now I'm just gonna do the manual reservations, and publish the puppet patch containing all the interface:IP mappings for lvs1020 (and not merge it yet) [15:46:07] and I guess, fix the primary interface for lvs1020 to have the trunk/vlan info it needs in netbox and export that via homer [15:48:05] I'm still trying to work out how step 3 works internally, i.e. what puppet does to take the source data and add the necessary to /etc/network/interfaces [15:49:20] so modules/lvs/manifests/configuration.pp has the original source data [15:51:22] oh [15:51:29] something has been refactored since I last looked, maybe [15:51:53] sorry, no, I confused myself, let's start over: [15:52:25] hieradata/common/lvs/interfaces.yaml has the raw input data for mapping vlan:lvs_host:interface_name:IP [15:52:46] this data is under the top-level hieradata key "lvs::interfaces::vlan_data" [15:53:11] modules/profile/manifests/lvs.pp (which is used by lvs servers) has in its parameters: [15:53:18] Hash[String, Hash] $vlan_data = lookup('lvs::interfaces::vlan_data'), [15:53:31] and has this block near the bottom: [15:53:33] # Set up tagged interfaces to all subnets with real servers in them [15:53:33] profile::lvs::tagged_interface {$tagged_subnets: [15:53:33] interfaces => $vlan_data [15:53:33] } [15:54:05] and then modules/profile/manifests/lvs/tagged_interface.pp has the code that executes that [15:54:20] eventually invoking a bunch of data-driven interface::tagged { $tag: [15:54:46] and modules/interface/manifests/tagged.pp is what edits e/n/i [15:54:56] (with augeas, ewww) [15:55:32] and then if the switch side is up, puppet will 
collect up lldp info from the live links and also knows all the host side config, and thus all this goes to puppetdb and can configure netbox [15:58:21] what I don't like about this arrangement, in the abstract, is that the "real" source of truth here is actually hieradata, not netbox (although we will make matching manual pre-reservations in netbox to avoid races on allocations) [15:58:48] jbond: yes please (re onboarding) [15:59:13] I think we could flip it around and make it work "right", but it might not be worth the hassle just for the special case of LVS, if we're destined to replace this with a far simpler solution in the future that doesn't have all these multiple tagged interfaces [15:59:23] bblack: I think these are the first lvs we're setting up since the netbox automation, so please feel free to gather requirements to improve the situation. We can put them in a task to improve things [15:59:33] bblack: thanks... I got most of the way there, I think I need to re-look at modules/interface/manifests/tagged.pp [16:00:38] Emperor: ack :) [16:01:22] Reason I'm asking is WMCS have a similar requirement for connectivity to multiple vlans, and the alternative to tagging is using twice the switchports, which doesn't make sense just for the lack of a simple way to configure the sub-interfaces on a host. [16:02:43] well the flow described above for lvs config, is the product of natural selection and evolution over the course of many years of refactorings of all related things. [16:02:55] I think with a clean slate, it might not have exactly its current shape :) [16:03:15] My like myself :) [16:03:18] *Much [16:03:48] But yeah I'm sort of trying to figure out is there an existing easy-to-reuse workflow or should we try to make something more generic from scratch.
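For reference, the hieradata shape bblack walks through above might look roughly like this. The vlan name, host, interface name, and address below are all invented; only the file path and the top-level key `lvs::interfaces::vlan_data` come from the discussion, so check the real file before relying on the nesting:

```
# hieradata/common/lvs/interfaces.yaml -- illustrative shape only
lvs::interfaces::vlan_data:
  private1-a-eqiad:        # vlan name (invented)
    lvs1020:               # lvs host
      iface: ens2f0np0     # parent interface for the tagged sub-iface (invented)
      ip: 10.64.1.20       # per-vlan address (invented)
```

profile::lvs then hands this hash to profile::lvs::tagged_interface, which fans it out into one interface::tagged resource per vlan, as described above.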
[16:04:52] an ideal workflow would probably start with manually configuring the vlan/trunk stuff in netbox, and somehow having that affect puppetization of the host, instead of the other way around like LVS is now [16:05:09] but I haven't really thought through how that should work [16:06:01] for that matter, there's lots of imperfections about how we manage /e/n/i in general, even in the simpler case of a single host primary interface on a single vlan [16:06:58] yeah in a previous life I'd have just had the automation read from netbox, and create the files based on the data there with a template of some kind. [16:07:10] it's not really idempotent in any real sense. We're relying on the OS installer to set up the initial /e/n/i (which in the common case puppet doesn't mess with afterwards), and then we have various bits that augeas-edit that, which can get confused and duplicate things and/or don't revert easily either [16:07:21] and then possibly some scripting to do the initial allocations/assignments in Netbox for a specific class of device [16:07:58] yeah I'm just looking at augeas for the first time here trying to grok what is going on, it's starting to make sense. [16:08:06] so all of this about multiple interfaces + tagging in /e/n/i is riding atop that mess, too [16:08:39] One option may be to make use of the interfaces.d directory, and put all "additional" interface definitions in separate files there, never touching the original file created by the debian installer. [16:09:09] (see also the tuning done via profile::lvs::interface_tweaks which also ends up doing more augeas-edits of /e/n/i for perf parameters, etc) [16:09:28] yeah I seen that [16:10:20] _joe_, jayme: could I get a review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/747150?
we need to disable x-request-id generation on envoy for the CDN TLS termination layer, (pcc attached on the CR with an envoy NOOP for mw and mw api) [16:10:48] so in a much-simpler case (eqsin, with only one set of public+private vlans and one physical interface needed on lvs) [16:11:00] we end up with this for /e/n/i contents for that interface: [16:11:04] https://phabricator.wikimedia.org/P18230 [16:11:16] <_joe_> vgutierrez: why do you need to disable it? I think it was a desirable effect to generate them at the edge [16:11:24] all of those bits are coming from various interface::foo using various augeas edits, it's kind of a nightmare [16:13:27] and most of the augeas-driven interface config is very fragile. If you change the wrong thing in puppet (whatever the lines aren't keyed on), it will end up duplicating some of those lines with conflicting settings, and in general puppet can't re-create the file if e.g. you just rm /e/n/i and re-run the agent. [16:13:45] yeah, the resulting file isn't pretty. But the various sources for how all the bits get there are a bit tricky to wrap your head around. [16:13:53] _joe_: we should agree on that explicitly IMHO [16:14:16] _joe_: and it impacts on our current comparison between haproxy and ats-tls as UUIDv4 generation isn't that cheap CPU wise [16:14:19] hmm. definitely food for thought. [16:14:33] some of that could be attacked from the perspective of considering the installer-time /e/n/i just an initial config, and having puppet replace it wholesale and manage the whole thing. [16:14:38] <_joe_> vgutierrez: generating X-request-id at the edge was always the plan but if you need that for fairness of tests, sure [16:15:05] but what's at the core there is that the way the host config ever gets its initial IP config is from DHCP->installer->/e/n/i [16:15:39] that's the flow from the source of truth (netbox) to how the host's live config is later.
there's no other pathway: from there on it's a native fact coming from the host, set by the installer, basically. [16:16:20] so if you mess that file up, a reimage is the only universal way out of that mess [16:17:26] <_joe_> vgutierrez: note that "the edge" for me might also mean ats-backend [16:17:28] _joe_: for what it's worth I think it makes a lot of sense to generate x-request-id on the cp servers but it should be a conscious choice rather than a side effect [16:17:43] that was sort of the reason I'd used the interfaces.d directory before. [16:17:56] _joe_: right, and in terms of requests per second it's quite different :) [16:17:59] But there can be problems with that due to dependencies, I was using ifupdown2 in that scenario. [16:18:08] and then once you start staring at attacking that problem (initial interface config and how it works today via DHCP+installer) [16:18:49] you will run into the related mess of how we configure our matching host IPv6 at install-time through into /e/n/i as well, and the great debate about RA vs DHCPv6 and everything else related, and those "ip token" lines that are part of half a transition, etc.... [16:19:36] oh yeah I need to get back to that and work out what's happening there. [16:19:41] I think j.bond has worked on that a lot more recently than I have, and so I'm not even sure what he's fixed about it since :) [16:22:33] from my 2 mins reading of the man page on "ip token" it seems that we shouldn't need to add that, and also add the full v6 address via "ip addr add" [16:23:08] one or the other surely should be done? but it may be to do with what order all this happens during the installer/initial config. [16:29:09] jbond: perhaps you could confirm if my thinking is right here.
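For reference, the static-v6 plus "ip token" combination being puzzled over would look roughly like this in /etc/network/interfaces. The addresses, interface name, and exact stanza layout are illustrative, not copied from a real host:

```
# hypothetical /etc/network/interfaces fragment
iface eno1 inet6 static
    address 2620:0:861:102:10:64:1:20/64
    gateway fe80::1
    # belt-and-braces: also pin the low-order bits used if an RA ever
    # autoconfigures this interface, so both paths yield the same address
    pre-up /sbin/ip token set ::10:64:1:20 dev eno1
```

Since `ip token` only affects SLAAC-derived addresses, having both it and the static `address` line is exactly the "half a transition" being described: either mechanism alone should be enough, but which one wins depends on ordering at install/boot time.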
[16:29:26] topranks: how much back scroll do i need to read :/# [16:29:43] WMCS - or anyone else - defining a new service should be able to make use of this file to add vlan sub-interface definitions to /etc/network/interfaces: [16:29:49] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/interface/manifests/tagged.pp [16:30:27] topranks: honestly I don't remember the details anymore. But I know we already had the "ip up ..." for IPv6 being configured the way it is today, and I added the "ip token" part afterwards to hack around some problem we had [16:30:28] To do so they'd need to define the source data in hieradata somewhere, and then define interface::tagged like it's done for LVS servers here: [16:30:29] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/lvs/tagged_interface.pp [16:31:06] I think it had to do with a race condition on host boot and when the RA arrived and accidentally ending up with unintended RA-configured IPs on the interface that messed with some other part of this jenga game [16:31:23] jbond: Hopefully none, literally just my question right now. If you don't know off hand leave the rest :) [16:31:51] topranks: in relation to the tokenisation there is a comment in the puppet code (probably from b.black) which hints at an issue i have seen before. namely that if you just rely on RAs it can take some time between systemd-networkd emitting up and the ip address actually being applied [16:31:54] ("ip token" is a linux-ism to say "if we're going to autoconfigure an IPv6 based on an RA, use this for the low-order bits, instead of whatever you'd do by default using the macaddr or randomization or whatever") [16:32:05] this can lead to daemons failing to start because there is no ipv6 to bind to [16:33:31] ok yeah makes sense. [16:33:39] jbond: sorry I realise there is a lot of chatter here.
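A usage sketch of the interface::tagged define being discussed. The parameter names below follow the pattern of the LVS code but are an assumption; verify them against modules/interface/manifests/tagged.pp before relying on this:

```
# hypothetical declaration for one vlan sub-interface (values invented)
interface::tagged { 'eno1.1102':
    base_interface => 'eno1',
    vlan_id        => '1102',
    address        => '10.64.1.20',
    netmask        => '255.255.252.0',
}
```

Each such resource ends up as an augeas edit of /etc/network/interfaces on the host, which is the part of the mechanism being criticised later in the discussion.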
[16:34:14] jbond: none of it's too important, I was just trying to give a rundown of all the related messes in our existing state of affairs for topranks to catch him up :) [16:34:15] The question I was hoping you might know about is about using the modules/interface/manifests/tagged.pp file [16:34:19] in relation to interface::tagged i had a quick look earlier and yes that seems correct (although we could improve the puppetisation of the network in general) [16:34:35] yeah, I think from the discussion there is definitely scope for improvements. [16:34:57] i have not dug too much in that part of puppet myself [16:35:32] But also it seems not too difficult to make use of some of the existing pieces, and might help save on switch ports by using tagged interfaces. [16:35:33] but as always feel free to add me to a CR, and we can take it from there [16:35:48] * jbond goes to read the backlog [16:36:22] Cool. I'll point artur.o in that direction and work with him to get something going. [16:36:40] thanks for the info :) [16:46:24] topranks: so after reading the backlog i think brandon has given a good picture of the current landscape. i have made some changes but nothing to the core of the system. For improvements my preference would be to move to a place where all hosts had their network configured via puppet using data from netbox. this would be done with the puppet netbox integration, however in the first iteration of that [16:46:30] we will just pass meta data (e.g. rack/row name) ... [16:46:33] ... and not configuration details like ip addresses (of course the line between these two is subjective). As such i think the current system i.e. relying on dhcp will probably be around for a while.
[16:47:42] I would also say that in theory we can derive and write a static config file based solely on the IPv4 addresses given out during d-i and disable ra/dhcpv6, but of course this all fails when we move to having ipv6 only hosts [16:47:57] so im not sure its worth exploring [16:48:39] its also worth noting that i have this holding task https://phabricator.wikimedia.org/T234207 to consider moving away from /etc/network/interfaces to something else [16:48:41] thanks John appreciate the detailed response. [16:49:08] We can probably get away with keeping the private IPv4 just to bootstrap new hosts for a couple of years after the v4 switch off [16:49:21] also true [16:49:31] On the future direction - puppet configured from data in Netbox, that does seem like the right direction yes. [16:49:46] and as you say we've a task there to begin doing that, even just for rack/row info. [16:50:32] yeah I can't say I've stared at this much lately or have a strong opinion, but, as a general philosophical rule for this kind of thing: [16:51:06] I'd start with envisioning how you'd like this to work ideally, somewhere down the road. Lay out what that whole plan would look like, without regard to the iterative steps it would take to get there. [16:51:33] in relation to interface::tagged it uses augeas which im not a big fan of. i prefer to manage the whole file or make use of $thing.d folders. as such if you wanted to update/add some new puppet resource that dropped things in interfaces.d i would welcome reviewing it :) [16:51:39] and then start thinking about what steps you can take to move towards that goal in a relatively-straightforward path, without taking missteps off in eventually-wrong directions. [16:51:50] * jbond but if it works perhaps just use it [16:52:26] hmm yeah, my last place we moved to ifupdown2, I've seen recently there is ifupdown-ng also (two alternatives to netplan, or systemd-networkd).
To some extent it's a matter of "choose your poison" [16:52:34] I think we (not just this team, but the whole industry) often get mired in the trap of "this is an iterative step in the right direction" sort of improvements that don't actually end up being useful in the eventual long-road plan once it's known. [16:52:36] bblack: I'm 100% in agreement. [16:53:11] But there may be a short-term requirement to get something working, or be stuck with certain hosts using 2 interfaces, so there is the short term vs medium term outlook. [16:53:20] yeah that too for sure [16:53:43] but then you can at least scope that as a temporary hack of some imaginable duration and factor that accordingly into how much effort to put into making it pretty [16:53:55] <_joe_> also I think we (not just this team, but the whole WMF engineering) often get mired in the trap of "this is an iterative step which is kinda ugly in itself, but my final goal that I'll reach with the next 25 steps is glorious" and then only get to execute step 5 of 25 ::) [16:54:14] yeah, it's all shades of the same problem [16:54:19] <_joe_> so I think dividing work towards a goal needs to have "acceptable milestones" too [16:54:24] llol yes very much what joe said [16:54:33] +1 [16:54:41] the reality is, if we really had the final goal really nailed down, it wouldn't be 25 steps out probably. It's 25-steps because our ability to see ahead is so limited that we wander on the path a lot :) [16:54:53] the 3 of you have the same irccloud color (blue), it makes scrollback hard to read! [16:55:19] can't you shuffle their colors? [16:55:52] here comes a different shade of blue [16:55:55] hmm.. can I change that? you're probably better off without reading all the junk anyway.
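The interfaces.d pattern floated a few times above would keep the installer-written file untouched and give each extra interface its own drop-in, something like the following (path and values invented):

```
# /etc/network/interfaces.d/eno1.1102 -- hypothetical drop-in file
auto eno1.1102
iface eno1.1102 inet static
    address 10.64.1.20/22
    vlan-raw-device eno1
```

On recent Debian the stock /etc/network/interfaces already carries a `source /etc/network/interfaces.d/*` line, so drop-ins like this get picked up without puppet ever augeas-editing the main file, and removing an interface is just deleting its file.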
[16:56:31] heheh [18:37:51] warning: about to attempt downtiming lvs1016, and moving cables, and bringing lvs1020 into service [18:37:59] there's a lot of new moving parts to this process, and there will be some delays [18:38:07] lvs config changes on hold please! [21:06:54] howdy, we are seeing some spurious email warnings from icinga related to frmx2001. it is reporting "passive check is awol - see T196336" but everything is fine on the host and the icinga page looks clear. [21:06:54] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [21:07:42] this started about 0923 PST / 1723 UTC and is only happening with this one host. in the past when we have seen this behavior it was more widespread. [21:30:11] dwisehaupt: frmx2001 checks have active checks enabled in icinga, while they should be disabled [21:30:41] from the UI it looks like it was done by a human, but if I was you I would also check the config to make sure it's correct [21:30:49] see https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=frmx2001&service=check_load&scroll=61 [21:30:52] compared to [21:30:55] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=frdata1002&service=check_procs [21:31:03] (random fr host that is working fine) [21:31:48] you can go over all frmx2001 checks and hit the Reset Modified Attributes button on the right to reset any modified attribute [21:33:13] interesting. thanks. [21:50:11] volans: thanks. i have disabled the active checks and it's looking good again. neither myself nor Jeff_Green made that change so we're a little baffled.
[22:16:50] dwisehaupt: so, your server is called frmx2001 and there was an incident about mx2001 a few days ago and from SAL I can see: [22:16:56] "icinga - re-enabling active monitoring checks on mx2001 (T297128)" from mutante [22:16:56] T297128: Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 [22:17:27] so I guess that it might have been a human error of matching all mx2001 checks and that also included those from frmx2001... [22:17:48] oh, hello. this could be my bad then [22:17:54] for non-fr hosts it's the exact opposite [22:18:11] people sometimes use "disable active checks" when they should stay active [22:18:20] while for fr hosts they should stay passive [22:18:28] looks at icinga UI [22:19:32] dwisehaupt: it's what volans said, I searched for "mx2001" in Icinga and it matched both mx2001 and frmx2001 and I selected them [22:19:46] sorry about that. right now it looks all as it should be though [22:20:22] it's because mx2001 came back into service, so re-enabled monitoring [22:20:44] thanks for fixing it, volans [22:22:14] not related to that old ticket then [22:33:08] mutante: cool. that explains it. thanks!