[07:12:00] jhathaway: o/ I noticed that on mirror1001 nginx seems in a broken state, I saw some recent changes in modules/profile/manifests/mirrors.pp but I am not sure what the final status should be (nginx is un, nginx-common is ii)
[07:13:28] (IIUC we can just purge nginx* in there since httpd is doing all the work, but it is early morning for me so I am not taking actions for a bit :D)
[08:38:24] elukey: Jesse switched mirrors.w.o. from nginx to Apache for T300985 (and because everyone loves the httpd devs!), that
[08:38:24] T300985: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985
[08:38:33] that's some leftover, I'll clean it up
[08:40:30] ack thanks!
[13:13:22] heads up, I'm merging ttps://gerrit.wikimedia.org/r/c/operations/puppet/+/772869 which might mean some http checks will fire, I'm here and I'm keeping an eye out
[13:13:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/772869 even
[13:22:33] moritzm: thanks
[14:10:53] jbond: want me to merge 'John Bond: PKI: double escape, one for puppet one for icinga'?
[14:11:51] andrewbogott: i have the lock now ill merge yours :)
[14:11:55] thanks
[16:33:10] Do we have a page anywhere that lists things like # of servers, types of servers, etc? Just big picture statistics? Someone just asked me what kind of hardware we run and I don't want to point them to netbox :)
[16:36:40] I don't think so - I remember seeing some on meta, or somewhere, but years out of date
[16:37:55] ok, that's what I suspected. thx
[16:41:29] andrewbogott: you should be able to pull that data out of puppetboard https://puppetboard.wikimedia.org/fact/bios_vendor or netbox but neither are public
[16:42:07] jbond: ok. It was just an idle question so I'll just tell them it isn't public
[16:42:23] ack
[16:47:02] thx all
[16:56:57] it wouldn't be too hard to get such data in Prometheus, and might be useful for graphing stuff over time
[16:58:39] there is a DMI module in node_exporter, but it doesn't look like we enable it, and we'd probably want to filter out serial numbers etc at scrape time
[16:58:59] example output https://github.com/prometheus/node_exporter/commit/9def2f9222d61babbcaeb95a407d1558601cb4d1#diff-08cee1542e86d8dff4b4c65779ce084f7b2b667a8e754b823d3d36bd905e0b72R511
[17:08:08] another way I've estimated it quickly before, just for total host counts, is just looking at cumin dry-run outputs
[17:08:34] it's an internal source, but it's not hard to infer that stuff from public data anyways
[17:09:02] you could round it off as ~1600 physical servers
[17:09:08] yeah, https://grafana.wikimedia.org/d/000000377/host-overview is public and you can count how many hosts there are
[17:09:26] I guess the selector only goes up to L but it still has to be doable from there
[17:09:27] but of course that doesn't include some things, like fr-tech infra and so-on
[17:10:54] the kernel deployment dashboard is also a good overview https://grafana.wikimedia.org/d/000000302/kernel-deployment
[17:11:42] the 'kernel distribution' used to be useful in the ubuntu days of course, now it is host count really
[17:19:17] there's also https://debmonitor.wikimedia.org/kernels/ with a breakdown of what else is installed
[17:37:50] debmonitor is private, isn't it moritzm ?
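On the puppetboard suggestion above (16:41): puppetboard is a frontend for PuppetDB, so the same hardware breakdown can be pulled from PuppetDB's v4 query API directly. A minimal sketch, assuming access to the internal PuppetDB service; the base URL below is a placeholder, not the real host:

```python
# Rough sketch only: tally hardware vendors from the bios_vendor fact via
# PuppetDB's v4 query API (the same data puppetboard renders). The service
# is internal-only, so the base URL here is a placeholder.
import collections

import requests

PUPPETDB = "https://puppetdb.example.internal:8081"  # placeholder, not the real host

resp = requests.get(f"{PUPPETDB}/pdb/query/v4/facts/bios_vendor", timeout=30)
resp.raise_for_status()

vendors = collections.Counter(fact["value"] for fact in resp.json())
for vendor, total in vendors.most_common():
    print(f"{total:5d}  {vendor}")
```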
[17:38:09] I'm also not convinced we have a Prometheus metric that corresponds to "is a physical machine"
[17:54:58] cdanis: I think perhaps "node_hwmon_sensor_label" will give you that, don't see any VMs when I query for the instances that expose that
[17:55:05] ahh good shout
[17:56:09] whether it's complete/authoritative I can't say, would merit more investigation before relying on it
[17:57:39] `count(count by (instance) (node_hwmon_sensor_label))` gives you the number of physical hosts known to Prometheus then, although it's more expensive to compute than I expected
[20:21:10] https://grafana.wikimedia.org/d/ppq_8SRMk/netbox-device-statistic-breakdown?orgId=1
[20:48:50] seems like with the new alerting_hosts() change, cookbooks now disable downtime after the debian install but before the initial puppet run. Lots of alerts!
[22:43:28] andrewbogott: what do you mean? There has been no change in when the downtime/silence is done
[22:46:44] if the host is not new, the downtime/silence is done before starting, then when the host gets removed from puppetdb the icinga checks (and related downtimes) will be deleted at the next puppet run, while the alertmanager silence will still be in effect
[22:47:27] then once the first NOOP run of puppet is completed, populating puppetdb for the exported resources, the downtime cookbook is called, which will set the downtime on icinga and the silence on alertmanager.
[22:48:03] after that the initial alertmanager silence gets deleted, because the new one is now in effect anyway
[22:49:29] if that was for cloudvirt104[1-2] I see from SAL that the downtime cookbook failed, that's most likely the cause for the alerts
[22:55:05] and those failed because the downtime cookbook is called with the option to force a puppet run on the icinga host (to get the new exported resources) and that timed out
[22:55:29] [ERROR clustershell.py:398 in failed_nodes] 100.0% (1/1) of nodes timeout to execute command 'run-puppet-agent...et --attempts 30': alert1001.wikimedia.org
[22:56:35] and that run is done with a timeout of 300 seconds
[22:57:27] that also includes any time spent waiting for the puppet lock by run-puppet-agent
[22:58:48] the run around 22:24:13 (from the log above) started at 22:21:45 and ended at 22:24:26, that's almost 3 minutes for a puppet run...
[22:59:07] andrewbogott: ^^^
[23:01:34] if that should keep being a problem the 'timeout' parameter can be added to the call to the puppet run in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hosts/downtime.py#118 with a higher value
[23:05:17] a quick look at puppetboard shows that alert1001 applies a catalog with no changes in ~70 seconds and one with changes in ~85 seconds, but then the total run time (start to end) is higher, getting closer to 3 minutes
[23:09:33] * volans|off off
[23:46:54] volans|off: ok, that makes sense, we'll see if it continues to be a problem
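On the 23:01:34 suggestion: a minimal sketch of what passing a higher timeout to that forced puppet run could look like, assuming the spicerack puppet accessor used by the cookbooks; the function name, variable names and the 600-second value are illustrative, not the actual downtime.py code:

```python
# Illustrative sketch, not the real cookbooks/sre/hosts/downtime.py code:
# force a puppet run on the Icinga host so the freshly exported resources
# are picked up, with a timeout larger than the 300s default mentioned above.
from spicerack.remote import RemoteHosts


def force_icinga_puppet_run(spicerack, icinga_host: RemoteHosts) -> None:
    """Refresh the exported Icinga checks with a more generous timeout."""
    puppet = spicerack.puppet(icinga_host)  # puppet wrapper for the Icinga host
    puppet.run(attempts=30, timeout=600)    # attempts=30 matches the log above; 600s is an example value
```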