[06:58:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:03:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[09:04:28] 10Traffic, 10SRE, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster
[09:11:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3063:9331 is unreachable - https://alerts.wikimedia.org
[09:15:12] ^^ expected
[09:21:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3063:9331 is unreachable - https://alerts.wikimedia.org
[09:40:47] 10Traffic, 10SRE, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster executed with errors: - cp30...
[09:41:42] 10Traffic, 10SRE, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster
[09:56:26] PXE boot is pretty slow today from esams :(
[09:56:36] and the previous attempt crashed during the install
[10:22:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3063:9331 is unreachable - https://alerts.wikimedia.org
[10:32:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3063:9331 is unreachable - https://alerts.wikimedia.org
[10:58:53] 10Traffic, 10SRE, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster completed: - cp3063 (**WARN*...
[12:07:03] godog: somehow prometheus is having a hard time fetching envoy metrics from cp3063
[12:08:08] but the target file seems to be as expected on /srv/prometheus/ops/ @ prometheus3001
[12:12:19] tcpdump doesn't show any kind of connection attempt against port TCP 9631 @ cp3063
[12:24:43] 10netops, 10Infrastructure-Foundations: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney) p:05Triage→03Medium
[12:24:57] 10netops, 10Infrastructure-Foundations: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney)
[12:26:39] godog: it looks like a manual reload of prometheus@ops.service on prometheus3001 has fixed the issue, is that expected?
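(A rough sketch of the checks and workaround described above — the unit name, port, and /srv/prometheus/ops/ path are taken from the log; everything else is illustrative, not authoritative:)

    # on cp3063: confirm whether Prometheus is even attempting to scrape envoy on TCP 9631
    sudo tcpdump -ni any 'tcp port 9631'
    # on prometheus3001: if the file_sd target file under /srv/prometheus/ops/ looks correct
    # but no scrapes arrive, a manual reload of the instance picked the targets up again:
    sudo systemctl reload prometheus@ops.service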
[12:55:59] vgutierrez: interesting, no not expected :(
[12:59:41] IIRC prometheus does pick up changes by itself to existing target files, and puppet does a reload on config changes, I'm guessing it got stuck somehow
[13:00:19] hmm nope it doesn't
[13:00:21] at least not at the moment
[13:00:25] drops the file and nothing else
[13:01:55] 10netops, 10Infrastructure-Foundations, 10SRE: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney) Currently waiting on T299759 to be completed to gain console access to these devices and begin the process.
[13:07:59] (in the kumospace now)
[14:48:35] 10Traffic, 10SRE, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez)
[15:42:56] 10Traffic, 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron)
[17:20:07] 10Traffic, 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron) Services have been moved to lvs_setup, but there are some pybal icinga alerts still open e.g. ` lvs1015 PyBal IPVS diff check CRITICAL 2022-01-2...
[17:24:14] hello! could I ask a pybal/lvs expert to have a look at https://phabricator.wikimedia.org/T299700#7640829 when they have a moment? trying to clear a few icinga alerts, and retire those services. I think the next step is a pybal restart but want to confirm that before proceeding
[17:46:38] herron: pong
[17:47:02] bblack: hey
[17:47:12] herron: so you've merged the patches that should remove the service from pybal config, but not yet restarted pybal, correct?
[17:47:23] (and I assume also not done any manual removal at the LVS level?)
[17:47:42] bblack: so far I've merged the patches to remove the discovery record and switch state to lvs_setup
[17:47:55] ok
[17:49:46] and yeah, no manual removal at the lvs level yet, have not touched pybal or lvs there yet
[17:50:07] ok
[17:50:16] so yeah, the discovery-dns part and the pybal part are kinda separate matters
[17:50:22] lets look at the disc-dns part first
[17:50:53] ok
[17:51:46] so, it might be poorly documented (and maybe we should fix that!)
[17:52:25] but there's some in-tree docs here on the disc-dns part:
[17:52:26] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/utils/mock_etc/discovery-metafo-resources
[17:52:41] basically, disc-kibana also needs to be removed in that file
[17:52:52] (before the lvs_setup stuff)
[17:53:44] ah ok, missed that one, I'll write a patch now
[17:56:11] and then after that, I guess we can catch up on the steps in https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service
[17:56:48] basically, do the authdns-update for the mock_etc patch you just made
[17:56:56] sounds good, doing
[17:56:57] then make sure the agent has run on authdns+icinga
[17:57:07] sudo cumin 'A:icinga or A:dns-auth' run-puppet-agent
[17:59:05] authdns-update just finished, running cumin/puppet now
[18:02:21] 100.0% (16/16) success ratio (>= 100.0% threshold) for command: 'run-puppet-agent'.
[18:02:22] done
[18:02:31] ok
[18:02:50] so next up on the list:
[18:02:54] -- Change state: lvs_setup to state: service_setup, and remove the service stanza from profile::lvs::realserver::pools.
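(Condensing the disc-dns portion above into one place — a sketch only, built from the commands quoted in the conversation:)

    # 1) in operations/dns, drop the disc-kibana entries from utils/mock_etc/discovery-metafo-resources
    # 2) deploy the DNS change and refresh puppet on the authdns + icinga hosts:
    authdns-update
    sudo cumin 'A:icinga or A:dns-auth' run-puppet-agent
    # 3) then continue with the LVS removal steps: change state: lvs_setup to state: service_setup
    #    and remove the service from the realserver-side hieradata (pools or realserver_ips, see below)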
[18:03:32] ok
[18:04:11] it's that latter part I'm a little confused on, as I don't know all the new+old names and roles/profiles
[18:04:45] it may be that you've already removed the correct entries for p::lvs::realserver::pools
[18:06:06] it's "kibana7" that's staying, and just the "kibana" that's going away?
[18:06:24] or?
[18:07:10] yes that's right. kibana is going away, along with several logstash entries
[18:07:57] there don't seem to be any realserver entries in the puppet repo for "kibana"
[18:08:02] maybe I have a bad grep
[18:08:02] kibana7 will stay, and logstash.wm.o should still map to kibana7.svc as the trafficserver backend
[18:08:18] yeah I was noticing that as well
[18:08:42] where would we have expected them? they had to be there for it to work before
[18:08:59] or... is it that kibana and kibana7 share addresses?
[18:09:48] oh I get it now I think
[18:10:06] there's two different ways to configure the realserver side, and the instructions are only talking about one of them
[18:10:17] on the legacy elk5 hosts, so logstash100[789] and logstash200[456]
[18:10:34] I notice there are lvs::realserver::realserver_ips set, but yeah not seeing pools
[18:10:36] lvs::realserver::realserver_ips
[18:10:38] yeah
[18:11:24] so basically we need to remove those data from hieradata/role/eqiad/logstash.yaml + hieradata/role/codfw/logstash.yaml
[18:11:56] I guess logstash was eqiad-only
[18:12:10] ok
[18:12:17] ok, making the patch
[18:15:11] looking ahead in the instructions a bit, they're not very clear about a number of important details heh
[18:16:21] anyways, probably want to disable puppet on various related hosts before merging this, so we can watch it work in a controlled fashion and deal with any puppet failures appropriately, etc
[18:16:38] ha, that makes me feel better
[18:16:53] ok sounds good
[18:17:08] it's sort of implicit in the fact that it asks you to manually run puppet in various places in the steps, which kind of implies it shouldn't have already run automagically while you were busy
[18:17:19] so basically all the logstash hosts mentioned earlier, and the affected LVSes
[18:17:40] which would be the "low-traffic" and backup lvses in codfw+eqiad
[18:18:25] the easiest way to know those for sure is looking in modules/lvs/manifests/configuration.pp
[18:18:27] lvs1015, lvs1020, lvs2009, lvs2010 yea?
[18:18:34] 'lvs1015' => 'low-traffic',
[18:18:34] 'lvs1020' => 'secondary',
[18:18:40] 'lvs2009' => 'low-traffic',
[18:18:40] 'lvs2010' => 'secondary',
[18:18:42] yeah
[18:18:48] ok great
[18:19:30] re: the logstash hosts, I may have prematurely moved them to role::spare::system, they don't have the logstash role anymore
[18:19:40] oh nice
[18:19:43] well, kinda
[18:19:48] not sure if that's a good thing or a bad thing heh
[18:19:55] did you reimage them too?
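(One way to do the puppet-disable step just discussed, assuming the usual WMF disable-puppet/enable-puppet wrappers — the host expression and reason string are only examples, not taken from the log:)

    sudo cumin 'lvs1015* or lvs1020* or lvs2009* or lvs2010*' 'disable-puppet "T299700: removing kibana/logstash LVS services"'
    # after puppet-merge, re-enable with the same reason and run the agent per host, secondaries first:
    sudo enable-puppet "T299700: removing kibana/logstash LVS services"
    sudo run-puppet-agent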
[18:20:06] no not yet, they are VMs so prepping for removal
[18:20:30] but as-is they are running with puppet assigning them role(spare::system)
[18:21:03] so the "problem" with moving them to spare without going through the parts we're doing now first, is it leaves the addresses configured on these spare hosts:
[18:21:06] root@logstash1007:~# ip -4 addr|grep /32 inet 10.2.2.33/32 scope global lo:LVS inet 10.2.2.36/32 scope global lo:LVS
[18:21:12] hmmm bad paste
[18:21:23] but basically the addresses are still there; the spare role doesn't clean them
[18:21:33] which technically doesn't create any real problem for us in practice
[18:22:04] (but in theory, one might re-role this again without reimage to profile::foobar, and it might try to access a new different service which has been assigned the reuse of 10.2.2.33, and fail)
[18:22:57] anyways
[18:23:01] gotcha, ok. I'm planning to nuke the old hosts soon, but could manually remove those addresses too
[18:23:23] but since they're eventually being nuked and they're VMs, not reimaged to anything new/different, it's kinda moot
[18:23:29] ok
[18:23:50] so yeah no point in even puppet-disabling the service hosts, we just need to merge that patch and step through the lvs hosts
[18:25:27] sounds good, I've just disabled puppet on those 4 lvs hosts
[18:25:56] right, so puppet-merge, and then run the agent on the secondaries: 1020 and 2010
[18:26:08] it should show removing some lines from pybal.conf when the agent runs
[18:26:30] doing
[18:28:29] and then do a "systemctl restart pybal.service" on both of those same secondary hosts
[18:28:57] and then make sure pybal is still running and healthy afterwards. I guess that's what the "wait 300 seconds" is about, maybe. Time to see more unexpected icinga alerts or something.
[18:29:46] ok, just about done with puppet runs
[18:29:50] but you can also just confirm the process is alive ("systemctl status pybal") afterwards, and maybe check tail -f /var/log/pybal.log and see that it's outputting normal things again (which is often cluttered by a stream of monitoring failures)
[18:30:13] puppet runs look good, doing pybal restarts on the secondaries now
[18:30:54] and of course checking the logs before restarting to get a feel for pre-existing errors
[18:31:18] lots of logstash related ones heh
[18:31:57] [kibana7_443 ProxyFetch] WARN: logstash1030.eqiad.wmnet
[18:32:03] [search-psi-https_9643 ProxyFetch] WARN: elastic1043.eqiad.wmnet
[18:32:13] I guess these are known/expected?
[18:32:20] I only ask since they're related services
[18:32:22] 😅
[18:32:33] yeah I *think* logstash1030 is known, at least is one of many backends
[18:32:38] ok
[18:33:28] alright, finished pybal restarts on the secondaries
[18:33:41] and seeing recoveries too, that's nice
[18:34:35] the BGP alert is a little concerning, but maybe it just has a hair trigger
[18:35:26] I did a re-sched in icinga to recheck it
[18:36:22] ok, yeah seems ok in the log?
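(The restart/verify sequence described above, per secondary LVS — lvs1020 and lvs2010 — a sketch assembled from the commands given in the conversation:)

    sudo run-puppet-agent                 # should show pybal.conf losing the kibana/logstash stanzas
    sudo systemctl restart pybal.service
    systemctl status pybal                # confirm the process stayed up
    sudo tail -f /var/log/pybal.log       # watch for BGP ESTABLISHED lines and normal monitoring output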
[18:36:30] pybal log
[18:37:17] yeah it claims so
[18:37:19] Jan 21 18:33:15 lvs1020 pybal[30179]: [bgp.FSM@0x7f6f8b743750 peer 208.80.154.197:52724] INFO: State is now: ESTABLISHED
[18:37:22] Jan 21 18:33:15 lvs1020 pybal[30179]: [bgp.BGPFactory@0x7f6f8b644d40] INFO: BGP session established for ASN 64600 peer 208.80.154.197
[18:37:25] Jan 21 18:33:15 lvs1020 pybal[30179]: [bgp.FSM@0x7f6f8841af10 peer 208.80.154.196:60454] INFO: State is now: ESTABLISHED
[18:37:28] Jan 21 18:33:15 lvs1020 pybal[30179]: [bgp.BGPFactory@0x7f6f8b649c20] INFO: BGP session established for ASN 64600 peer 208.80.154.196
[18:37:31] but the icinga check is still WARNING
[18:37:47] hmmm
[18:38:29] it's not critical anymore, like it was when it hit -ops channel though
[18:39:09] the warning seems to be about an amazon ASN though
[18:39:12] so not related to us
[18:39:48] huh ok
[18:40:08] I guess it was already in WARNING over the Amazon ASN, and the pybal blip temporarily upgraded it to critical
[18:40:17] but we don't get an IRC recovery because it never went back to OK either :)
[18:40:25] bblack: let me check that alert, what CR is it for?
[18:40:30] cr2-eqiad
[18:40:41] makes sense re: recovery
[18:40:48] We did have an issue with some down peering to AWS I was emailing them during the week about, so it's probably just that
[18:40:49] warning about a stuck Connecting state for BGP to 16509
[18:40:54] (i.e. alert was already in warning)
[18:41:00] yeah ok, thanks
[18:41:29] Yep that is the same pre-existing problem, so alert state is back to what it was before this.
[18:41:39] herron: so moving on - basically go through the same on the other two LVSes now: agent + pybal restart/check
[18:41:44] And in general it's in progress, waiting for Amazon to get back to me.
[18:42:05] bblack: sounds good, moving on to the primaries now
[18:42:33] this step fails over the live traffic, very briefly, to the secondaries while pybal is restarting
[18:42:43] so that's why checking out the pybal state on the secondaries first is critical :)
[18:43:08] ack
[18:45:20] [oh also, it might be a good idea to log some of these steps, like the pybal restarts, but I'm probably saying that too late. It does help if things end up going off the rails and we have to figure out what happened]
[18:45:29] true!
[18:47:55] the next step is going to be all the manual deletes, for the specific UDP and TCP ports affected, which is quite a few
[18:49:03] pybal restarts are done
[18:49:21] ok, yeah there's a handful of them
[18:50:17] the logstash part is eqiad only, and seems to be: udp/12201 tcp/11514 udp/11514 udp/8324
[18:50:36] so the commands would be:
[18:51:54] ipvsadm -du 10.2.2.36:12201; ipvsadm -dt 10.2.2.36:11514; ipvsadm -du 10.2.2.36:11514; ipvsadm -du 10.2.2.36:8324
[18:52:04] (on both eqiad lvses)
[18:52:27] ok
[18:52:36] and then there's the kibana bit, which is tcp only on ports 80 and 443 of 10.2.[12].33 as appropriate in each DC
[18:53:09] so codfw is: ipvsadm -dt 10.2.1.33:80; ipvsadm -dt 10.2.1.33:443
[18:53:20] eqiad is: ipvsadm -dt 10.2.2.33:80; ipvsadm -dt 10.2.2.33:443
[18:53:31] You need to supply the 'real-server' option for the 'delete-server' command
[18:53:41] oh sorry
[18:53:57] uppercase D for the option
[18:54:08] replace all those -du -dt with -Du -Dt :)
[18:54:12] ahh, ok
[18:54:29] that's what I get for trying to save all that typing. more typing
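(The corrected deletes for reference — uppercase -D removes the whole virtual service, while lowercase -d removes a single real server and therefore wants -r, hence the error above. The final grep is just an illustrative way to confirm nothing is left:)

    # eqiad (lvs1015 + lvs1020): the logstash UDP/TCP services plus kibana
    sudo ipvsadm -Du 10.2.2.36:12201; sudo ipvsadm -Dt 10.2.2.36:11514; sudo ipvsadm -Du 10.2.2.36:11514; sudo ipvsadm -Du 10.2.2.36:8324
    sudo ipvsadm -Dt 10.2.2.33:80; sudo ipvsadm -Dt 10.2.2.33:443
    # codfw (lvs2009 + lvs2010): kibana only
    sudo ipvsadm -Dt 10.2.1.33:80; sudo ipvsadm -Dt 10.2.1.33:443
    # verify those VIPs are gone:
    sudo ipvsadm -L -n | grep -E '10\.2\.[12]\.(33|36)'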
[18:54:50] haha
[18:54:58] if any of them say "Memory Allocation Problem", we probably typo'd something
[18:55:17] so far so good, logstash part done
[18:56:39] and kibana eqiad/codfw done too
[18:56:42] nice
[18:56:56] so at this point the runtime/functional parts are done, what's left is basically cleanup commits
[18:57:25] as documented in the LVS removal instructions: kill the whole service_catalog entries, and kill the related conftool-data too
[18:57:48] excellent, ok something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/755480 ?
[18:57:51] as not mentioned there: should probably clean up other host-level leftovers (e.g. the other hieradata nearby where we removed the realserver IPs?)
[18:58:12] and probably should also clean up / deallocate those IPs from DNS as well (the entries for e.g. kibana.svc.eqiad.wmnet and such)
[18:58:21] (that's going to be a fun patch to rebase)
[18:59:04] yeah that one
[19:00:24] and yeah auto-rebase will fail because we already changed the state from "production" if nothing else
[19:00:52] CLI git rebase might figure it out easier
[19:02:02] alright, that wasn't so bad after all
[19:03:39] alright, I think that patch is ready to merge as well
[19:04:35] yeah
[19:07:04] 10Traffic, 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron)
[19:07:36] thanks so much for the help bblack
[19:07:44] np
[19:07:48] it's mostly our mess :)
[19:07:55] do clean up the DNS part too at least
[19:08:14] the old .svc records?
[19:08:18] yeah
[19:08:21] kk doing
[19:08:37] logstash.svc.eqiad.wmnet + kibana.svc.{codfw|eqiad}.wmnet, and their reverse entries in the 10.x files
[19:11:04] hmmm seeing some new icinga alerts, probably from the last patch
[19:12:44] herron: I think the cleanup patch merged earlier is actually missing a piece, causing all that spam on -ops
[19:12:52] ahh good times
[19:12:55] let's see
[19:13:05] there's a conftool-data/discovery/services.yaml
[19:13:20] probably gotta remove kibana and logstash lines there
[19:13:47] ah sure enough, ok uploading patch
[19:14:51] all of this config seems to suffer from a lot of multiply-defined metadata :P
[19:16:14] haha
[19:16:16] alright patch uploaded
[19:17:43] that patch does make me wonder: why is there no "kibana7" in there? :)
[19:18:22] maybe that's a question for another day!
[19:21:48] hmmm mysteries deepen
[19:21:58] was the conftool stuff already alerting about kibana7 before all this decom?
[19:22:38] kibana7 doesn't have a discovery setup afaik
[19:22:50] hmm, no don't think so
[19:22:52] yeah but this isn't discovery
[19:22:59] this is just basic per-site LVS stuff
[19:23:54] in any case, we seem to have broken conftool for kibana7 now, they're somehow inter-related here
[19:24:07] hmmm
[19:25:15] either that or those entries were already not working right
[19:26:49] Compilation of file '/srv/config-master/pybal/eqiad/kibana7' is broken -- wonder what broken means specifically? the file is there and looks fine
[19:27:36] well it's stale
[19:27:50] in any case, I think there's some naming confusion going on here with the cluster names and service names
[19:28:13] in service terms, we were removing logstash-* and kibana from the services metadata in hieradata
[19:28:41] but both were in the cluster "logstash"
[19:28:49] neither was in the cluster "kibana"
[19:29:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/756046/1/conftool-data/discovery/services.yaml
[19:29:36] ^ I think this didn't want/need to remove the "kibana" entry
[19:30:03] what's still puzzling, though, is that the kibana7 service does seem to explicitly state in the service config that it uses cluster "kibana7" (which doesn't exist?)
[19:30:36] but maybe let's start with restoring that kibana line we just removed
[19:30:42] and then solving this other mystery
[19:30:52] ok
[19:31:00] there is e.g. conftool-data/node/codfw.yaml: kibana7
[19:31:17] putting the kibana line back now
[19:31:37] yeah I donno
[19:31:55] there's also no matching kibana7 line where we removed the kibana one, though
[19:35:09] yeah, it should be independent
[19:36:44] yeah something isn't though, I just can't put my finger on where this went off the rails at
[19:37:23] :q
[19:38:01] while looking, stumbled on something else related and fishy:
[19:38:10] hieradata/role/common/logging/opensearch/collector.yaml :
[19:38:16] # reusing kibana.discovery.wmnet to squelch PCC missing secret() errors
[19:38:19] profile::tlsproxy::envoy::global_cert_name: "kibana.discovery.wmnet"
[19:38:25] ^ we just killed off that name in DNS I think
[19:40:05] hmm that should be running with ats mapping logstash.wm.o to https://kibana7.svc.codfw.wmnet
[19:40:22] although the cert may have that name too
[19:41:58] yeah that's the CN, with SANs DNS:kibana.discovery.wmnet, DNS:kibana.svc.eqiad.wmnet, DNS:kibana.svc.codfw.wmnet, DNS:logstash.wikimedia.org, DNS:cas-logstash.wikimedia.org, DNS:kibana-next.svc.eqiad.wmnet, DNS:kibana-next.svc.codfw.wmnet, DNS:logstash-next.wikimedia.org, DNS:kibana7.svc.eqiad.wmnet, DNS:kibana7.svc.codfw.wmnet
[19:42:07] have there been any errors during all this with puppet-merge, when it does its conftool actions?
[19:42:55] re: the cert stuff, yeah, I guess file that as further future cleanup (to remove refs to the dead names)
[19:43:19] here's the puppet-merge conftool output https://phabricator.wikimedia.org/P18974
[19:43:50] I mean the earlier ones during the removals
[19:45:13] added that to the paste too
[19:45:55] we're clearly missing something somewhere
[19:46:45] kibana7 doesn't have discovery at all, right?
[19:47:23] yes that's right
[19:48:31] hmmm
[19:50:54] I don't really get what's going on with the confd generation stuff here
[19:52:33] did you run puppet on the alert hosts? maybe they had some checks that should have been removed
[19:52:58] there's a bunch of empty .err files in /var/run/confd-template/
[19:53:35] yeah
[19:53:54] there's also still live output files, like it doesn't clean up dead entries?
[19:54:19] e.g. /srv/config-master/pybal/eqiad/logstash-udp2log still exists, with a timestamp from two days ago
[19:54:42] even though no puppet config or etcd data has "logstash-udp2log" to drive its existence
[19:55:01] thanks taavi I'll kick a puppet run off there just to be sure
[19:55:02] maybe they need manual cleanup
[19:55:10] wondering about that too, manual cleanup
[19:56:22] I'll try cleaning out the .err files and see
[19:56:32] oh they are gone now :)
[19:57:05] they're still there
[19:57:16] removed them all now
[19:57:20] ok*
[19:57:22] (the errors)
[19:57:27] thanks yeah
[19:58:20] apparently the kibana7 errors went away too, even though there were no kibana7 error files
[19:58:44] ahahah
[19:58:49] -rw-r--r-- 1 root root 0 Jan 21 19:19 .kibana734769812.err
[19:58:52] yeah, that
[19:58:55] lol
[19:59:02] .kibana363806690.err
[19:59:13] if one of them happens to start with a 7, it matches an error check somewhere
[19:59:14] lucky number 734769812
[19:59:27] yeah the glob in the icinga check is name*
[19:59:52] ok, so, mysteries seem resolved
[19:59:58] that's hilarious, but at least solves the mystery I guess yeah
[20:00:01] probably should revert that one revert
[20:00:11] ok coming up
[20:01:27] the stale files in /srv are still there of course, but I guess meaningless now
[20:01:32] will remove just for my own sanity
[20:01:42] ok good idea thanks
[20:02:28] alright, ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/756001 ?
[20:03:11] yeah
[20:03:39] ok, take two here we go
[20:04:05] our tooling for adding/removing services in lvs+discovery is pretty appalling
[20:04:23] https://phabricator.wikimedia.org/P18975
[20:04:26] s/tooling/process/ or whatever
[20:04:30] looks good re: puppet-merge
[20:06:02] this is still outstanding on the other patch, trivial: https://gerrit.wikimedia.org/r/c/operations/dns/+/756045/2/templates/10.in-addr.arpa#43
[20:06:16] ah thanks, updating
[20:10:59] some of it comes down to puppet lacking native support for the 4th dimension
[20:11:28] which the whole "state: lvs_setup" and related bits is meant to cover over
[20:11:43] but it's a generally-problematic pattern
[20:12:05] ha yeah things will be configured eventually, when puppet gets around to it
[20:12:24] I mean more that puppet doesn't have any notion of git history, only what the current head rev says
[20:12:39] when you delete a thing, it ceases to exist for puppet
[20:13:04] and so "state: lvs_setup" and also various "file asdf { ensure: absent }" related logic
[20:13:09] all these things are hackarounds for that
[20:13:35] yes true, some kind of graceful diffing/handling of resources that actually became absent as opposed to being ensured absent
[20:14:19] the puppet repo is in theory just CM data, not state
[20:14:32] but we add state-as-config to make up for the lack of state in how the agent handles the OS
[20:15:44] anyways
[20:16:19] most of these problems will go poof if most services move to k8s and k8s services move away from lvs and lvs ceases to exist
[20:17:05] which should all eventually happen, I think
[20:18:38] yeah I guess a model of more regularly (re)building OSes, either within containers or VMs, would help solve that config skew too
[20:18:54] but for now I am happy to not be showering in icinga alerts
[20:21:49] 10Traffic, 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron) 05Open→03Resolved These have been removed with much help from @BBlack thank you!
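(For posterity, a sketch of the false kibana7 alert and the manual cleanup — the exact glob pattern is an assumption modeled on the "name*" mentioned above, and the paths are the ones from the listing:)

    ls /var/run/confd-template/.kibana7*.err
    # -> .kibana734769812.err   (this is the error marker for the removed "kibana" template;
    #    its random numeric suffix happens to start with a 7, so the kibana7 check's glob
    #    matched it too, making kibana7 look broken even though its own template was fine)
    sudo rm /var/run/confd-template/.*.err                     # clear the stale empty error markers
    sudo rm /srv/config-master/pybal/eqiad/logstash-udp2log    # and the stale generated output confd never cleans up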
[20:23:44] thanks again for all the help, have a good weekend 👋
[20:23:51] you too! :)