[01:03:00] serviceops: httpbb random read timeout on cumin2002 - https://phabricator.wikimedia.org/T323707 (RLazarus) a:RLazarus Good find, thanks. It looks like this page is just a slow parse (in the HTML comment I see `Real time usage: 9.438 seconds`), so usually we get lucky and it's in parsercache, but when we...
[01:13:34] serviceops: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (RLazarus)
[01:55:57] serviceops, Patch-For-Review: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (RLazarus) Changed my mind on this -- still going to look into other solutions, but I did bump the deadline to 60s so that it doesn't spuriously alert in the meantime.
[09:13:12] serviceops: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (Joe) Maybe if the page we're trying to fetch is that cumbersome, we should switch to a different, lighter one?
[09:33:30] hnowlan: I don't recall how many replicas you wanted to run in prod
[09:33:56] ultimately I mean. The current value of 2 seems a tad too low :)
[09:43:16] serviceops: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (Volans) >>! In T323707#8418963, @RLazarus wrote: > Changed my mind on this -- still going to look into other solutions, but I did bump the deadline to 60s so that it doesn't spuriously alert...
[10:12:14] <_joe_> heads up: I'm converting most charts to using modules, I'll go on the whole day basically
[11:12:35] <_joe_> btullis: around?
[11:12:54] Certainly am.
[11:13:00] <_joe_> I am converting datahub and I wanted to make sure you knew
[11:13:10] <_joe_> because while I'm confident from the diffs it will be a noop
[11:13:12] Yes, I've just had a chat about converting my spark-operator CR to use modules as well.
[11:13:24] <_joe_> datahub is vaguely "peculiar"
[11:13:55] <_joe_> btullis: the rake task chart_to_modules[spark-operator] should do the basics for you
[11:14:14] Re: datahub, many thanks. Feel free to ask if I can help at all, but also I trust you to proceed at will too.
[11:14:15] <_joe_> the modules have some additional candy to shorten your charts verbosities
[11:14:25] <_joe_> I'm going to staging rn
[11:14:32] Ah, great. I didn't know about the rake task.
[11:14:55] <_joe_> yeah I saw the sheer size of the conversion and decided I needed to automate it
[11:15:22] <_joe_> took ~ 2 work days to perfect it, and it already paid the effort back :P
[11:16:04] 'Ti'n seren' - as we say in Welsh.
[11:16:55] <_joe_> ok, datahub is failing in production but not in CI
[11:17:05] <_joe_> I'll let you know after some more debugging
[11:17:19] <_joe_> if we need to deploy it, though, we can rollback my changes ofc
[11:19:30] Yep, should be low impact to anyone. I can drop a note in the #data-engineering and #product-analytics channels on Slack if we expect any extended downtime for it.
[11:27:48] <_joe_> there is no downtime expected, I just need to figure out why some stuff doesn't seem to be working
[11:28:25] <_joe_> sigh nevermind
[11:28:39] <_joe_> it's a problem of charts not being updated on disk lol
[11:30:51] <_joe_> btullis: can you imagine a reason why datahub would try to use datahub-frontend 0.0.14 when the latest one is 0.0.15?
[11:30:58] <_joe_> probably I'm missing something dumb
[11:31:16] <_joe_> as in, I'm being very dumb :P
[11:32:13] <_joe_> I mean I could force it with "version: ">=0.0.15"
[11:32:34] <_joe_> uhm, actually, I *am* dumb
[11:32:38] <_joe_> sorry, nevermind
[11:32:39] I think I might have left too many `appVersion` parameters in there?
[11:33:04] <_joe_> btullis: no, but the PARENT chart was built *before* I bumped the subcharts
[11:33:13] <_joe_> told you, it was me being very dumb :P
[11:33:22] Aha, gotcha.
[11:34:13] <_joe_> :)
[11:36:52] I'm having a bit of trouble with the command `sextant vendor charts/spark-operator` - It's just this gem is it? https://rubygems.org/gems/sextant/versions/0.2.4
[11:37:40] Huh, no
[11:37:44] sextant is not the rubygem
[11:37:51] https://gitlab.wikimedia.org/repos/sre/sextant
[11:38:23] btullis: ^
[11:38:42] Ah, thanks claime. I wasn't aware of that. :-)
[11:39:26] I'll put something on Wikitech, shall I?
[11:44:58] <_joe_> btullis: sorry
[11:45:03] <_joe_> pip install sextant :)
[11:45:35] <_joe_> btullis: I intended to add instructions to use sextant once I was done with the conversion, thanks for starting something :)
[11:45:36] All good, thanks again :-)
[11:45:52] <_joe_> ok, I deployed datahub to staging
[11:47:26] <_joe_> now going with codfw
[11:49:22] Ack, thanks.
[11:53:47] <_joe_> btullis: uh it seems that something is not working
[11:54:20] <_joe_> the diffs were only whitespace, so I'm not sure what is wrong here
[11:54:31] In codfw? I'll have a look.
[11:55:08] <_joe_> no I mean
[11:55:19] <_joe_> I was getting errors from datahub.wikimedia.org
[11:55:32] <_joe_> but apparently I had to log out and log in again, and everything works
[11:56:05] OK, cool. Yep, seems fine to me too. Thanks.
[11:56:16] <_joe_> sorry for the noise
[11:57:18] Never an issue.
[14:01:46] <_joe_> hnowlan: re changeprop, I'll merge the change but not deploy it right now I think
[14:02:10] <_joe_> a redeployment is mildly disruptive for changeprop, so I'd spare it
[14:03:46] _joe_: ack
[14:05:01] <_joe_> aand I'd say same for echostore/sessionstore
[14:05:28] <_joe_> they don't use the envoy service proxy so the risk of errors is minimal and the diffs seem null
[14:05:53] <_joe_> I'll just ensure the diffs are as advertised
[14:07:34] sgtm
[14:07:51] There might be more patches in the future for sessionstore/echostore so they won't sit undeployed for very long I think
[14:16:04] <_joe_> cool
[16:28:02] serviceops, SRE, Thumbor, Thumbor Migration, Platform Team Workboards (Platform Engineering Reliability): tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (hnowlan)
[16:30:33] serviceops, SRE, Thumbor, Thumbor Migration, Platform Team Workboards (Platform Engineering Reliability): tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (hnowlan) Open→In progress
[16:30:37] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[16:50:24] just tested https://istio.io/latest/docs/tasks/policy-enforcement/rate-limit/#local-rate-limit in the istio mesh, works really nicely
[16:50:48] (basically a per-pod rate limit for basic traffic volume protection)
[16:51:19] nice
[16:58:56] serviceops, SRE, Thumbor, Thumbor Migration, Platform Team Workboards (Platform Engineering Reliability): byte/str mismatch TypeError when converting any STL file - https://phabricator.wikimedia.org/T323781 (hnowlan)
[16:59:36] serviceops, SRE, Thumbor, Thumbor Migration, Platform Team Workboards (Platform Engineering Reliability): byte/str mismatch TypeError when converting any STL file - https://phabricator.wikimedia.org/T323781 (hnowlan)
[17:44:14] serviceops, Maps, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban), User-jijiki: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (hnowlan) https://gerrit.wikimedia.org/r/c/operations/puppet/+/860634/ is a simpler attempt at this work - there are...