[07:08:18] errand
[09:51:42] lunch
[13:15:01] o/
[14:05:17] \o
[14:12:22] o/
[14:13:13] \moti wave2
[14:13:18] .o/
[14:13:41] low priority CR for fixing some alert runbook links: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1074247
[14:28:10] not our fault in any way, but it's a little annoying how when we link to a header in a page, the header is just off the top of the screen (checking the links in patch ^)
[14:28:30] or well, i guess it's probably underneath the page header
[14:33:04] independently, from some content on categorizing content and why our categories system just doesn't seem that useful to us: Ideally, the categories should be coherent, distinctive, and exhaustive.
[14:33:30] i don't know if our categories meet any of those requirements :P
[14:34:44] definitely not :)
[14:40:45] yeah, that's definitely annoying
[15:07:43] w
[16:02:08] workout, back in ~40
[16:20:32] heading out, have a nice weekend
[16:21:30] i still wonder why we get so many 503s in the updater... the saneitizer finding so many oldVersionInIndex problems has me thinking it's related to these 503s, but sampling the last 10 failed requests and requesting them locally... they all render nearly instantly. It's not simply giant pages that are too slow
[16:21:48] (maybe sampling is the wrong word there :P)
[16:22:46] They are mostly `Empty body: 503 Service Unavailable`
[16:27:51] fwiw codfw has way more, even though it has half the number of updaters: 20k fetch failures from codfw, 8.8k from eqiad, 7.3k from cloudelastic
[16:28:14] (for september)
[16:50:09] perhaps also curious: the number of mw logs containing url:cirrusbuilddoc doesn't seem to have any correlation with high error rates on our side.
[16:55:28] back, but heading to lunch/office, back in ~90m
[17:22:35] * ebernhardson is still having no luck correlating 503 Service Unavailable in the fetch_failure topic to anything in logstash...
[17:23:26] it's weird how we can have 1300 such failures in a single minute, but no other spikes anywhere
[17:23:45] (all from consumer-search-eqiad)
[17:36:16] i suppose the best guess would be that apache hc5 is imputing the 503 no-content responses; i wonder how we can distinguish those
[18:36:21] i dunno, i guess i'll give up on this for now :P Even more mysterious, we explicitly log in FetchFailureRouter::processElement before sending something to the fetch output tag, and the busiest minute of fetch failure events doesn't have corresponding logs :S
[19:20:17] back
[19:21:09] Bah, getting more categories-related alerts on graph split hosts ;)
[20:35:29] hmm, helm-lint diffs in ci aren't too useful... can't tell which release each diff is for :S
[20:35:38] i can guess from the changes, but that's not great :P
[20:47:24] * inflatador wonders how many other teams are using releases
[20:48:01] i'm poking around the Rakefile bits to see if there are any ways to change the diff output... there are options to helm-lint, but it's not clear yet if any would be helpful
[20:53:57] hmm, might require restructuring instead :P At the bottom of it all we run: diff --show-function-line=kind -au8 --color=always '#{orig.path}' '#{change.path}'
[21:13:40] * ebernhardson is a little confused why commenting out the line that removes the files that were diffed (so i can play with diff options) results in no files being left behind :S
[21:18:18] curiously, it seems like since they use Tempfile.new there is no reason to unlink the files. But they do it anyway, to be extra safe
[21:37:22] hmm, i can make it do what i want, but it's not clearly better all around :S I guess i can be satisfied to have local options to generate a reasonable diff
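A note on the "can't tell which release each diff is for" complaint above: GNU diff's `--label` option replaces the temp-file paths in the `---`/`+++` hunk headers, so the output could name the release instead of a throwaway path. A minimal Ruby sketch, not the actual Rakefile code; the release name "my-release" and the YAML content are invented for illustration:

```ruby
require "tempfile"
require "open3"

# Write two small rendered manifests to temp files, as the Rakefile does.
orig = Tempfile.new("orig")
orig.write("kind: Deployment\nreplicas: 1\n")
orig.flush

change = Tempfile.new("change")
change.write("kind: Deployment\nreplicas: 2\n")
change.flush

# --label is given twice: first for the "---" side, then for the "+++" side.
# With it, the hunk header shows the (hypothetical) release name rather
# than the Tempfile paths.
out, _status = Open3.capture2(
  "diff", "-au8", "--color=never",
  "--label", "my-release (before)",
  "--label", "my-release (after)",
  orig.path, change.path
)
puts out
```

Whether this is "clearly better all around" depends on the Rakefile knowing the release name at the point where it shells out to diff.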
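The Tempfile puzzle above (commenting out the cleanup line still leaves no files behind) has a likely explanation: Ruby's Tempfile registers a finalizer that deletes the backing file when the object is garbage-collected or the interpreter exits, so the explicit unlink really is just belt-and-braces. A small demonstration of the on-disk lifecycle:

```ruby
require "tempfile"

f = Tempfile.new("demo")
path = f.path
puts File.exist?(path)  # true: the backing file exists on disk

f.close                 # closes the fd; the file itself stays on disk
puts File.exist?(path)  # still true

f.unlink                # the explicit removal, "to be extra safe"
puts File.exist?(path)  # false

# Even without the explicit unlink, Tempfile's finalizer removes the
# backing file once the object is collected or the process exits.
```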