[00:20:19] 10serviceops, 10Performance-Team (Radar), 10Sustainability: Remove mod_unique_id from app servers - https://phabricator.wikimedia.org/T253675 (10Legoktm) >>! In T253675#7430192, @Krinkle wrote: > As I understand it, the reason we used mod_security for this is that Apache does not allow unsetting or changing...
[05:32:21] 10serviceops, 10Patch-For-Review, 10User-jijiki: Roll out remote gutter pool - https://phabricator.wikimedia.org/T258779 (10jijiki)
[08:58:07] 10serviceops, 10Infrastructure-Foundations, 10netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (10cmooney) I'm gonna resolve this task for now rather than have it sitting around. The strong suspicion is still that the drops on our core switching fabric are responsib...
[08:58:25] 10serviceops, 10Infrastructure-Foundations, 10netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (10cmooney) 05In progress→03Resolved
[09:06:55] hello folks, I have created the kafka rebalance plan for main-eqiad, will probably execute it between this afternoon and Monday
[09:08:08] 10serviceops, 10SRE: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Joe)
[09:08:18] 10serviceops, 10SRE: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Joe) p:05Triage→03Medium
[09:10:34] 10serviceops: Allow coexisting php version in our puppet code - https://phabricator.wikimedia.org/T293450 (10Joe)
[09:12:31] 10serviceops: Allow sending traffic to php 7.2 or 7.4 selectively in the apache configuration for MediaWiki - https://phabricator.wikimedia.org/T293451 (10Joe)
[09:13:23] <_joe_> legoktm, effie, mutante, jayme: I created a few tasks related to the php 7.4 migration; I might take on T293450 myself today as it's the thorniest issue at hand
[09:13:44] _joe_: isn't that for Q3 though?
[09:14:24] <_joe_> effie: not sure, it seemed like it made it to our OKRs?
[09:15:03] alright, I wasn't sure either
[10:09:24] 10serviceops, 10MW-on-K8s, 10SRE: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10JMeybohm) 05Resolved→03Open a:05Joe→03JMeybohm I'll reopen this one as it has more context on the topic of "which API to use for configuration". I've created a simple is...
[10:09:29] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10JMeybohm)
[11:15:16] <_joe_> jelto: lmk when you want to transition mwdebug to use helm3; it's a perfect candidate for an early migration IMHO
[11:33:07] joe: I created a proposal for the helm3 migration for blubberoid (just an arbitrary choice) here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/721301 If it makes sense I can refactor this for mwdebug and write down some steps of what I would like to do (purge old release, change some values, redeploy and so on).
[11:42:05] <_joe_> no I think it's ok to start with blubberoid
[11:42:15] <_joe_> but uuuh, we'll need to destroy the old release?
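For context, "destroy the old release" here would roughly mean deleting the Tiller-managed helm 2 release and then doing a clean install of the same chart with helm 3. A minimal sketch of that, assuming a release and namespace both named blubberoid, a helm 3 binary called helm3, and a local chart path and values file (all illustrative; the actual rollout would go through the deployment-charts tooling linked above):

    # helm 2: remove the Tiller-managed release and its stored history
    helm --tiller-namespace blubberoid delete --purge blubberoid

    # helm 3: clean install of the same chart into the same namespace
    helm3 install blubberoid ./charts/blubberoid --namespace blubberoid -f blubberoid-values.yaml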
[11:42:46] <_joe_> I was just ranting in private with jayme about how the more I look into helm internals, the more I hate it :P
[11:43:22] <_joe_> I'll take a look after lunch is over, probably
[11:44:23] _joe_: yeah, plan is to *not* do the migration
[11:44:41] and instead just re-deploy with helm3
[11:44:48] that will not work for mwdebug ofc :/
[11:45:03] <_joe_> jayme: oh thankfully no one uses it now
[11:45:08] yep
[11:45:26] <_joe_> as in, it's expected and explicitly declared as experimental and no promise of uptime is made
[11:46:31] While there is a migration path from helm2 to helm3, I would very, very much like us to not take it. The option of just depooling a cluster and re-initializing everything is way more appealing
[11:46:48] <_joe_> sure I get it's easier
[11:46:58] <_joe_> but is the actual migration that bad?
[11:47:07] <_joe_> I mean I fully expect it to be by now
[11:47:16] maybe not...
[11:47:33] but it requires several steps to be done for every service
[11:49:06] we could potentially automate that - but as we have a way to do it on a per-cluster basis I think we can get even more out of it using that route.
[11:49:59] clean state, no potential leftover stuff, a more tested approach to cluster re-init, maybe even some automation around that that will be of use later
[11:50:25] <_joe_> sure I get the advantages
[11:50:35] <_joe_> I thought at first we were forced to take this route
[11:50:41] <_joe_> it seems we're not really, but almost
[11:50:43] <_joe_> :P
[11:52:01] it's more like a deliberate decision at gunpoint :p
[11:52:57] 10serviceops, 10Prod-Kubernetes: Better scaffolding for helm charts / releases - https://phabricator.wikimedia.org/T292818 (10akosiaris) To be honest, I am not understanding what the solution exactly is. I gather that the point is to not have a lot of boilerplate code in the charts? Because the rest sounds a l...
[11:53:53] lol, nice way to put it
[11:54:52] <_joe_> the better metaphor is that you have to choose between one path that's a brittle wooden bridge over a pool full of alligators, or a German autobahn.
[11:55:01] <_joe_> and surprisingly, you chose the latter :P
[11:55:38] <_joe_> it requires razing the existing environment to the ground, but who likes alligators
[11:56:00] lol
[13:29:49] as promised, I am going to rebalance some low traffic topics in main-eqiad
[14:01:32] 10serviceops, 10SRE: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jijiki) 05Open→03Invalid Since we have no mcrouter proxies, and we won't have any scap proxies in the future, closing.
[14:15:01] If we wanted to increase the concurrency on a job in changeprop-jobqueue, is that a no-go on a Friday?
[14:17:57] the change in question would be https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/731098 fwiw
[14:41:47] 10serviceops, 10SRE, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[14:42:05] 10serviceops, 10SRE, 10Performance-Team (Radar), 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10jijiki) 05Open→03Declined We are not using proxies anymore, but some TKOs we see every now and then could be related to T291385, not much we can d...
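As a rough illustration of what a topic rebalance like the one above involves (the actual plan and tooling used for main-eqiad are not shown in the log), a partition move with the stock Kafka tools looks roughly like this, assuming a topics-to-move.json listing the low-traffic topics, placeholder broker IDs, a $ZOOKEEPER connection string, and a generated plan saved as plan.json:

    # Generate a candidate reassignment plan for the listed topics
    kafka-reassign-partitions.sh --zookeeper "$ZOOKEEPER" \
        --topics-to-move-json-file topics-to-move.json \
        --broker-list "1001,1002,1003" --generate

    # Execute the plan with a replication throttle (bytes/sec) to limit broker impact
    kafka-reassign-partitions.sh --zookeeper "$ZOOKEEPER" \
        --reassignment-json-file plan.json --throttle 10000000 --execute

    # Verify completion; this also clears the throttle once all partitions have moved
    kafka-reassign-partitions.sh --zookeeper "$ZOOKEEPER" \
        --reassignment-json-file plan.json --verify

The --verify pass matters because it is what removes the throttle config again; skipping it would leave replication permanently capped.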
[16:31:25] moved a lot of topics today in main-eqiad, will finish the rest (~30 topics) on Monday
[16:32:09] I just noticed that on main-codfw we have ~2500 partitions, on main-eqiad ~3150
[16:32:30] the diff is probably old topics that are not used anymore
[16:32:58] all the topics and their move timings are in the task if needed later on (hopefully not)
[20:57:00] I am running into the 2m timeout when pulling from the docker registry while deploying in staging, even though the image is only about 1.5GB (with many small files in it). I hear that once we get SSDs in staging this shouldn't be an issue anymore.
[20:58:14] On another matter, I did a debugging session with Arnold about gitlab-backup-restore. It works when run manually but breaks things (restarts fail) when run by systemd. Not solved yet, but for now that stuff is disabled again on both servers in a clean way, without having to disable puppet.
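On the gitlab-backup-restore issue, "works manually but fails under systemd" often comes down to a different environment (PATH, HOME, user, working directory) or stricter unit settings. A sketch of how one might compare the two, assuming a unit named gitlab-backup-restore.service and a script at /usr/local/bin/gitlab-backup-restore (both hypothetical names, the real ones may differ):

    # Inspect what environment and settings the unit actually runs with
    systemctl show gitlab-backup-restore.service \
        -p User,Group,WorkingDirectory,Environment,EnvironmentFiles,ExecStart

    # Re-run the same command as a transient systemd unit to reproduce the failure
    # under systemd's environment rather than an interactive shell
    systemd-run --pty --wait --collect --property=User=root \
        /usr/local/bin/gitlab-backup-restore

Diffing the output of env between the interactive run and the transient unit is usually enough to spot what the manual invocation was silently relying on.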