[09:57:05] mutante: as you mention, install is the proxy and the alert is likely from the puppetmaster downloading the update before the cron was removed ~4 April. As for geoip-database, it is ultimately brought in by dnsutils, which depends on liblwres161 -> libgeoip1 -> geoip-database
[10:29:24] hi all, fyi: planning to rebuild sretest1001 in a few minutes to test something
[13:17:04] just want to quickly point out that last night's (rather confusing and spurious) pages also wouldn't've happened if we had gotten around to T230733
[13:17:05] T230733: Expose pooled status of gdnsd and conftool managed services as metrics - https://phabricator.wikimedia.org/T230733
[13:44:17] ?
[13:44:25] you mean the cp5002 issue?
[13:44:33] cdanis: ^
[13:46:09] it's kind of unique, in the sense that the "site" was in fact pooled, and even the server was pooled (intentionally, for a while). But even when it was depooled, it was the only thing left of its type in the world, so there still would've been some anomaly somewhere.
[13:49:05] to recap the important parts of what happened, in sequence, starting from the distant past:
[13:49:35] 1) We started a process of replacing ats-tls with haproxy on all the nodes, reimaging them into a new role as we went through the list (cache::upload -> cache::upload_haproxy)
[13:50:03] 2) cp5002 had a hardware memory issue, so it was depooled and taken out of the flow of normal operations, including these upgrades.
[13:50:28] 3) We finished the upgrade process, but left this one dead host sitting there in the old role (which has a different set of healthchecks and icinga alerts than the new one)
[13:51:02] 4) A long time later, remote hands replaced the DIMM, and we re-pooled it afterwards (which clearly was not what should've happened in the state it was in)
[13:51:40] at that point the ats-tls cluster/service/whatever didn't really functionally exist anymore, but suddenly a new host tried to come online and into service running that role and using those old alerts
[13:52:19] so suddenly cp5002 was the only ats-tls node in the world, and eqsin was the only site with such a service in the world, and it had a broken puppet config that failed to bring up services anyway.
[13:53:38] there were multiple points along this timeline where we could've done better :)
[13:54:14] we probably could/should have switched it to the new role in site.pp, even if it was dead at the time, when we got to the end of the process
[13:54:35] we probably could/should have removed the old role and its trailing leftovers from puppet in general by now
[13:58:48] as general policy (without having to remember any of these lingering details): if a node has been out for a hardware failure for a while and is now coming back after a fix, it's probably smart to start with an immediate reimage, for two reasons: (1) the hardware failure itself could've introduced various corruption of software on disk that puppet won't correct, and (2) avoiding any chaos from the many
[13:58:54] software changes that may have happened while it was out of the loop; it's going to violate some expectations of those who have been applying serial changes over long periods
[14:00:04] and of course in the most-immediate sense: there's no way that node was icinga-healthy, so it shouldn't have been pooled in at all (but its pooled-ness isn't what caused the alerting; merely booting up in the old config did that, regardless of pool state)
[14:00:26] at the very least, it wasn't even successfully running the puppet agent
[14:05:05] bblack: right, but a depooled server shouldn't be contributing to 'availability' metrics
[14:05:15] I think this will start to matter more as we start actually caring about SLOs
[14:05:17] but what would in this case?
[14:05:39] the global availability metric we're talking about here didn't exist, until this node became the only contributing node to it
[14:06:17] I'm suggesting that the global availability metric would only aggregate from pooled hosts
[14:06:21] and so it would continue to not exist
[14:07:05] yeah, but if we're making things more ideal, some other metric or alert should probably tell us that the fraction of pooled hosts in a site or in the world is below a threshold, too
[14:07:20] either way, something should logically have paged when we suddenly defined a service with only one dead host in the world
[14:07:45] I'm not sure about paged, but I'll agree with you that something should have alerted, yes
[14:14:45] in this particular timeline, if nothing else changed, we still would've had one page, when it actually did get pooled
[14:14:59] we just wouldn't have had the follow-on confusing ones once it was re-depooled
[16:42:13] because of the reimage + hw issues, backups are running a bit late this week, which may cause some (real, but expected) alert noise later in the day
[16:50:35] thanks jbond, ACK
[17:03:21] does anyone here know how i make contributions to operations/software/tegola?
[17:04:08] seems like it's hosted on github but also depends on our pipeline jobs via zuul/jenkins to publish images
[17:04:35] (i'm trying to migrate everything off of the old service-pipeline-test* jobs)
[17:04:55] dduvall: isn't it https://gerrit.wikimedia.org/r/q/project:operations/software/tegola and github is just a mirror of that?
[17:05:08] i'm not sure
[17:05:49] ah, maybe you're right
[17:05:52] I think it is; there shouldn't be mirroring from github to gerrit, only the other way around
[17:06:06] got it, ok
[17:06:07] thanks!
[18:53:20] su: warning: cannot change directory to /nonexistent: No such file or directory
[18:53:34] ^ during 'apt-get update'. seems new
[21:51:06] not even remotely new :-) https://phabricator.wikimedia.org/T216832
[21:52:27] but most hosts should no longer print this (if they have been installed after the debmonitor system user was converted to systemd-sysusers)
[21:52:27] oh, haha alright
[21:52:42] that at least explains why I did not see it everywhere, ack
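[Editor's note] To make the T230733 idea discussed around 14:05-14:07 concrete, here is a minimal, purely illustrative Python sketch; it is not any existing Wikimedia tooling, and the `Host` record, its fields, and the example host names are invented for the example. It shows the two signals from the discussion: an availability figure aggregated only from pooled hosts (so a depooled, broken host cannot create or distort the metric on its own), and a separate pooled-fraction signal that should alert when too few hosts are pooled.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    pooled: bool   # conftool-style pooled state (assumed to be exposed as a metric/label)
    healthy: bool  # e.g. the result of the role's healthcheck (hypothetical field)


def availability(hosts: list[Host]) -> float | None:
    """Fraction of pooled hosts that are healthy; None if nothing is pooled."""
    pooled = [h for h in hosts if h.pooled]
    if not pooled:
        # The metric "continues to not exist" rather than reading as 0% available.
        return None
    return sum(h.healthy for h in pooled) / len(pooled)


def pooled_fraction(hosts: list[Host]) -> float:
    """Separate signal: how much of the fleet is pooled at all."""
    return sum(h.pooled for h in hosts) / len(hosts) if hosts else 0.0


if __name__ == "__main__":
    eqsin = [
        Host("cp5001", pooled=True, healthy=True),    # hypothetical healthy peer
        Host("cp5002", pooled=False, healthy=False),  # the depooled, broken host
    ]
    print(availability(eqsin))     # 1.0 -- cp5002 is ignored while depooled
    print(pooled_fraction(eqsin))  # 0.5 -- this should trip its own threshold/alert
```

Returning None when nothing is pooled mirrors the "it would continue to not exist" point above: the aggregate simply disappears instead of paging as a broken service, while the low pooled fraction is what should raise the separate alert suggested in the discussion.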