[09:28:34] OH: "I wonder if it was easier for Cisco to just buy Splunk than pay their Splunk bill."
[09:47:37] I'm getting this error from "cookbook sre.hosts.reboot-single -t T344590 {host}", am I making some silly mistake? "phabricator.APIError: ERR-CONDUIT-CORE: Monogram "T344590" does not identify a valid object."
[09:48:26] dhinus: You can't use a security ticket in the `-t` option, because it can't find it.
[09:48:43] You can mention the ticket number in the reason field instead.
[09:48:57] ah-ha, thanks :)
[09:49:03] yw
[10:43:31] related T335879
[10:43:32] T335879: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879
[10:43:56] feedback on preferred behaviour is welcome!
[14:03:37] Hi folks, the monthly SRE happy hour is on in case you’d like to join: https://www.kumospace.com/hangops
[17:17:11] herron: cwhite: puppetdb just had an issue, the widespread puppet alert may fire but I'm running cumin now so it should also clear shortly, just a heads up
[17:17:31] jbond: gotcha thanks for the heads up
[17:17:41] np
[17:45:17] if anyone's still around to look at a new partman recipe, LMK. https://gerrit.wikimedia.org/r/c/operations/puppet/+/960114
[17:45:32] * inflatador misses kickstart
[21:34:16] cwhite: it seems 'coal' is still running on webperf1003. I guess we didn't absent it first, and/or we intentionally just removed it from Puppet with the intention of cleaning up by hand, but haven't yet?
[21:34:26] ref T335242
[21:34:28] T335242: Decommission 'coal' and 'coal-web' services - https://phabricator.wikimedia.org/T335242
[21:36:12] I'm looking for 'excimer' in Logstash but not seeing it yet under program:excimer in the last 2 days. That's probably normal since errors are so far only theoretical (never managed to trigger a real 500 in prod), but you mentioned "seeing" it in logstash at https://phabricator.wikimedia.org/T339137#9189024 - Is there some kind of internal/health signal to see that it is working?
[21:38:18] I did a manual test for now, and do see it in the local journal under `systemctl status apache2`. Including (to my surprise) a seemingly real PDOException from yesterday.
[21:38:25] I don't see it in Logstash though
[21:38:49] Krinkle: does this link take you to the excimer saved search? https://logstash.wikimedia.org/app/discover#/view/3124ebd0-5990-11ee-a629-9f3cbf56e658
[21:39:04] it does :)
[21:39:10] ecs-*
[21:40:00] https://logstash.wikimedia.org/app/dashboards#/view/5f2fda90-e8ad-11eb-81e9-e1226573bad4
[21:40:06] Was expecting it there under host:webperf*
[21:42:08] ah, yes the legacy schema is incompatible with ecs so it has to live in separate indexes. it looks like all there is to migrate from that dashboard is navtiming, then they can live on the same dashboard?
[21:43:14] right. we're pending an ownership decision on that one. still unclear.
[21:44:00] in any event, glad to see they're in logstash now. I'll leave it to you to decide whether to switch the dashboard without navtiming, or to find the excimer logs another way when you need them.
[21:45:40] Regarding coal, it seems there was no intermediate "absent" step. I think it's worth reopening that case and doing a manual cleanup given the puppet code is long gone.
[21:46:30] For my clarity, the intention was to eliminate coal altogether, not relocate it?
[23:11:40] cwhite: correct yeah, we ran it alongside navtiming.py to bypass statsd and bypass graphite's xff logic, and thus have something that writes directly to graphite and then stays there untouched. we had very little confidence in it at first, still quite unreliable for this purpose, but prometheus takes the last part of that concern away, and we weren't really using it in practice anyway
[23:12:05] in it = it being the "nroma' statsd/graphite pipeline
[23:12:09] normal*
[23:12:30] given the average of averages and how it gets messier the older the data gets https://timotijhof.net/posts/2018/measuring-wikipedia-page-load-times/
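(editor's note) The "average of averages" concern in the last message can be illustrated with a minimal sketch. The sample values and bucket names below are hypothetical and are not taken from navtiming, coal, or Graphite; the point is only that averaging per-bucket averages weights every bucket equally, regardless of how many samples each one holds.

```python
from statistics import mean

# Hypothetical page-load samples (ms) in two time buckets of unequal size.
bucket_a = [100, 100, 100, 100, 100, 100, 100, 100]  # 8 fast samples
bucket_b = [900, 1100]                               # 2 slow samples

# Mean over all raw samples: each sample counts once.
true_mean = mean(bucket_a + bucket_b)

# Mean of the per-bucket means: each bucket counts once,
# so the 2 slow samples weigh as much as the 8 fast ones.
mean_of_means = mean([mean(bucket_a), mean(bucket_b)])

print(true_mean, mean_of_means)
```

Each further round of downsampling an already-averaged series repeats this effect, which is one plausible reading of "messier the older the data gets".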