[09:57:05] mutante: as you mention, install is the proxy and the alert is likely from the puppetmaster downloading the update before the cron was removed ~4 April. As for geoip-database, it is ultimately brought in by dnsutils, which depends on liblwres161 -> libgeoip1 -> geoip-database
[10:29:24] hi all, fyi: planning to rebuild sretest1001 in a few minutes to test something
[13:17:04] just want to quickly point out that last night's (rather confusing and spurious) pages also wouldn't've happened if we had gotten around to T230733
[13:17:05] T230733: Expose pooled status of gdnsd and conftool managed services as metrics - https://phabricator.wikimedia.org/T230733
[13:44:17] ?
[13:44:25] you mean the cp5002 issue?
[13:44:33] cdanis: ^
[13:46:09] it's kind of unique, in the sense that the "site" was in fact pooled, and even the server was pooled (intentionally, for a while). But even when it was depooled, it was the only thing left of its type in the world, so there still would've been some anomaly somewhere.
[13:49:05] to recap the important parts of what happened, in sequence, starting from the distant past:
[13:49:35] 1) We started a process of replacing ats-tls with haproxy on all the nodes, reimaging them into a new role as we went through the list (cache::upload -> cache::upload_haproxy)
[13:50:03] 2) cp5002 had a hardware memory issue, so it was depooled and taken out of the flow of normal operations, including these upgrades.
[13:50:28] 3) We finished the upgrade process, but left this one dead host sitting there in the old role (which has a different set of healthchecks and icinga alerts than the new one)
[13:51:02] 4) A long time later, remote hands replaced the DIMM, and we re-pooled it afterwards (which clearly was not what should've happened in the state it was in)
[13:51:40] at that point the ats-tls cluster/service/whatever didn't really functionally exist anymore, but suddenly a new host tried to come online and into service running that role and using those old alerts
[13:52:19] so suddenly cp5002 was the only ats-tls node in the world, and eqsin was the only site with such a service in the world, and it had a broken puppet config that failed to bring up services anyway.
[13:53:38] there were multiple points along this timeline where we could've done better :)
[13:54:14] we probably could/should have switched it to the new role in site.pp, even if it was dead at the time, when we got to the end of the process
[13:54:35] we probably could/should have removed the old role and its trailing leftovers from puppet in general by now
[13:58:48] as general policy (without having to remember any of these lingering details): if a node has been out for a hardware failure for a while and is now coming back after a fix, it's probably smart to start with an immediate reimage, for two reasons: (1) the hardware failure itself could've introduced various corruption of software on disk that puppet won't correct, and (2) avoiding any chaos from the many
[13:58:54] software changes that may have happened while it was out of the loop; it's going to violate some expectations of those who have been applying serial changes over long periods
[14:00:04] and of course in the most-immediate sense: there's no way that node was icinga-healthy, so it shouldn't have been pooled in at all (but its pooled-ness isn't what caused the alerting; merely booting up in the old config did that, regardless of pool state)
[14:00:26] at the very least, it wasn't even successfully running the puppet agent
[14:05:05] bblack: right, but a depooled server shouldn't be contributing to 'availability' metrics
[14:05:15] I think this will start to matter more as we start actually caring about SLOs
[14:05:17] but what would in this case?
[14:05:39] the global availability metric we're talking about here didn't exist, until this node became the only contributing node to it
[14:06:17] I'm suggesting that the global availability metric would only aggregate from pooled hosts
[14:06:21] and so it would continue to not exist
[14:07:05] yeah, but if we're making things more ideal, some other metric or alert should probably tell us that the fraction of pooled hosts in a site or in the world is below a threshold, too
[14:07:20] either way, something should logically have paged when we suddenly defined a service with only one dead host in the world
[14:07:45] I'm not sure about paged, but I'll agree with you that something should have alerted, yes
[14:14:45] in this particular timeline, if nothing else changed, we still would've had one page, when it actually did get pooled
[14:14:59] we just wouldn't have had the follow-on confusing ones once it was re-depooled
[16:42:13] because of the reimage + hw issues, backups are running a bit late this week, which may cause some (real, but expected) alert noise later in the day
[16:50:35] thanks jbond, ACK
[17:03:21] does anyone here know how i make contributions to operations/software/tegola?
[17:04:08] seems like it's hosted on github but also depends on our pipeline jobs via zuul/jenkins to publish images
[17:04:35] (i'm trying to migrate everything off of the old service-pipeline-test* jobs)
[17:04:55] dduvall: isn't it https://gerrit.wikimedia.org/r/q/project:operations/software/tegola and github is just a mirror of that?
[17:05:08] i'm not sure
[17:05:49] ah, maybe you're right
[17:05:52] I think it is; there shouldn't be mirroring from github to gerrit, only the other way around
[17:06:06] got it, ok
[17:06:07] thanks!
[18:53:20] su: warning: cannot change directory to /nonexistent: No such file or directory
[18:53:34] ^ during 'apt-get update'. seems new
[21:51:06] not even remotely new :-) https://phabricator.wikimedia.org/T216832
[21:52:27] but most hosts should no longer print this (if they have been installed after the debmonitor system user was converted to systemd-sysusers)
[21:52:27] oh, haha alright
[21:52:42] that at least explains why I did not see it everywhere, ack
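[Editor's note] To make the T230733 idea discussed around 14:05-14:07 concrete, here is a minimal, purely illustrative Python sketch; it is not any existing Wikimedia tooling, and the `Host` record, its fields, and the example host names are invented for the example. It shows the two signals from the discussion: an availability figure aggregated only from pooled hosts (so a depooled, broken host cannot create or distort the metric on its own), and a separate pooled-fraction signal that should alert when too few hosts are pooled.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    pooled: bool   # conftool-style pooled state (assumed to be exposed as a metric/label)
    healthy: bool  # e.g. the result of the role's healthcheck (hypothetical field)


def availability(hosts: list[Host]) -> float | None:
    """Fraction of pooled hosts that are healthy; None if nothing is pooled."""
    pooled = [h for h in hosts if h.pooled]
    if not pooled:
        # The metric "continues to not exist" rather than reading as 0% available.
        return None
    return sum(h.healthy for h in pooled) / len(pooled)


def pooled_fraction(hosts: list[Host]) -> float:
    """Separate signal: how much of the fleet is pooled at all."""
    return sum(h.pooled for h in hosts) / len(hosts) if hosts else 0.0


if __name__ == "__main__":
    eqsin = [
        Host("cp5001", pooled=True, healthy=True),    # hypothetical healthy peer
        Host("cp5002", pooled=False, healthy=False),  # the depooled, broken host
    ]
    print(availability(eqsin))     # 1.0 -- cp5002 is ignored while depooled
    print(pooled_fraction(eqsin))  # 0.5 -- this should trip its own threshold/alert
```

Returning None when nothing is pooled mirrors the "it would continue to not exist" point above: the aggregate simply disappears instead of paging as a broken service, while the low pooled fraction is what should raise the separate alert suggested in the discussion.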