[00:20:19] 10serviceops, 10Performance-Team (Radar), 10Sustainability: Remove mod_unique_id from app servers - https://phabricator.wikimedia.org/T253675 (10Legoktm) >>! In T253675#7430192, @Krinkle wrote: > As I understand it, the reason we used mod_security for this is that Apache does not allow unsetting or changing...
[05:32:21] 10serviceops, 10Patch-For-Review, 10User-jijiki: Roll out remote gutter pool - https://phabricator.wikimedia.org/T258779 (10jijiki)
[08:58:07] 10serviceops, 10Infrastructure-Foundations, 10netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (10cmooney) I'm gonna resolve this task for now rather than have it sitting around. The strong suspicion is still that the drops on our core switching fabric are responsib...
[08:58:25] 10serviceops, 10Infrastructure-Foundations, 10netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (10cmooney) 05In progress→03Resolved
[09:06:55] hello folks, I have created the kafka rebalance plan for main-eqiad, will probably execute it between this afternoon and Monday
[09:08:08] 10serviceops, 10SRE: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Joe)
[09:08:18] 10serviceops, 10SRE: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Joe) p:05Triage→03Medium
[09:10:34] 10serviceops: Allow coexisting php version in our puppet code - https://phabricator.wikimedia.org/T293450 (10Joe)
[09:12:31] 10serviceops: Allow sending traffic to php 7.2 or 7.4 selectively in the apache configuration for MediaWiki - https://phabricator.wikimedia.org/T293451 (10Joe)
[09:13:23] <_joe_> legoktm, effie, mutante, jayme: I created a few tasks related to the php 7.4 migration; I might take on T293450 myself today as it's the thorniest issue at hand
[09:13:44] _joe_: isn't that for Q3 though?
[09:14:24] <_joe_> effie: not sure, it seemed like it made it to our OKRs?
[09:15:03] alright, I wasn't sure either
[10:09:24] 10serviceops, 10MW-on-K8s, 10SRE: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10JMeybohm) 05Resolved→03Open a:05Joe→03JMeybohm I'll reopen this one as it has more context on the topic of "which API to use for configuration". I've created a simple is...
[10:09:29] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10JMeybohm)
[11:15:16] <_joe_> jelto: lmk when you want to transition mwdebug to use helm3; it's a perfect candidate for an early migration IMHO
[11:33:07] joe: I created a proposal for the helm3 migration for blubberoid (just an arbitrary choice) here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/721301 If it makes sense I can refactor this for mwdebug and write down some steps of what I would like to do (purge old release, change some values, redeploy and so on).
[11:42:05] <_joe_> no I think it's ok to start with blubberoid
[11:42:15] <_joe_> but uuuh, we'll need to destroy the old release?
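For context, "destroy the old release" here would roughly mean deleting the Tiller-managed helm 2 release and then doing a clean install of the same chart with helm 3. A minimal sketch of that, assuming a release and namespace both named blubberoid, a helm 3 binary called helm3, and a local chart path and values file (all illustrative; the actual rollout would go through the deployment-charts tooling linked above):

    # helm 2: remove the Tiller-managed release and its stored history
    helm --tiller-namespace blubberoid delete --purge blubberoid

    # helm 3: clean install of the same chart into the same namespace
    helm3 install blubberoid ./charts/blubberoid --namespace blubberoid -f blubberoid-values.yaml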
[11:42:46] <_joe_> I was just ranting in private with jayme about how the more I look into helm internals, the more I hate it :P
[11:43:22] <_joe_> I'll take a look after lunch is over, probably
[11:44:23] _joe_: yeah, plan is to *not* do the migration
[11:44:41] and instead just re-deploy with helm3
[11:44:48] that will not work for mwdebug ofc :/
[11:45:03] <_joe_> jayme: oh thankfully no one uses it now
[11:45:08] yep
[11:45:26] <_joe_> as in, it's expected and explicitly declared as experimental and no promise of uptime is made
[11:46:31] While there is a migration path from helm2 to helm3, I would very, very much like us to not take it. The option of just depooling a cluster and re-initializing everything is way more appealing
[11:46:48] <_joe_> sure I get it's easier
[11:46:58] <_joe_> but is the actual migration that bad?
[11:47:07] <_joe_> I mean I fully expect it to be by now
[11:47:16] maybe not...
[11:47:33] but it requires several steps to be done for every service
[11:49:06] we could potentially automate that - but as we have a way to do it on a per-cluster basis I think we can get even more out of it using that route.
[11:49:59] clean state, no potential leftover stuff, a more tested approach to cluster re-init, maybe even some automation around that that will be of use later
[11:50:25] <_joe_> sure I get the advantages
[11:50:35] <_joe_> I thought at first we were forced to take this route
[11:50:41] <_joe_> it seems we're not really, but almost
[11:50:43] <_joe_> :P
[11:52:01] it's more like a deliberate decision at gunpoint :p
[11:52:57] 10serviceops, 10Prod-Kubernetes: Better scaffolding for helm charts / releases - https://phabricator.wikimedia.org/T292818 (10akosiaris) To be honest, I am not understanding what the solution exactly is. I gather that the point is to not have a lot of boilerplate code in the charts? Because the rest sounds a l...
[11:53:53] lol, nice way to put it
[11:54:52] <_joe_> the better metaphor is that you have to choose between one path that's a brittle wooden bridge over a pool full of alligators, or a German autobahn.
[11:55:01] <_joe_> and surprisingly, you chose the latter :P
[11:55:38] <_joe_> it requires razing the existing environment to the ground, but who likes alligators
[11:56:00] lol
[13:29:49] as promised, I am going to rebalance some low traffic topics in main-eqiad
[14:01:32] 10serviceops, 10SRE: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jijiki) 05Open→03Invalid Since we have no mcrouter proxies, and we won't have any scap proxies in the future, closing.
[14:15:01] If we wanted to increase the concurrency on a job in changeprop-jobqueue, is that a no-go on a Friday?
[14:17:57] the change in question would be https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/731098 fwiw
[14:41:47] 10serviceops, 10SRE, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[14:42:05] 10serviceops, 10SRE, 10Performance-Team (Radar), 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10jijiki) 05Open→03Declined We are not using proxies anymore, but some TKOs we see every now and then could be related to T291385, not much we can d...
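As a rough illustration of what a topic rebalance like the one above involves (the actual plan and tooling used for main-eqiad are not shown in the log), a partition move with the stock Kafka tools looks roughly like this, assuming a topics-to-move.json listing the low-traffic topics, placeholder broker IDs, a $ZOOKEEPER connection string, and a generated plan saved as plan.json:

    # Generate a candidate reassignment plan for the listed topics
    kafka-reassign-partitions.sh --zookeeper "$ZOOKEEPER" \
        --topics-to-move-json-file topics-to-move.json \
        --broker-list "1001,1002,1003" --generate

    # Execute the plan with a replication throttle (bytes/sec) to limit broker impact
    kafka-reassign-partitions.sh --zookeeper "$ZOOKEEPER" \
        --reassignment-json-file plan.json --throttle 10000000 --execute

    # Verify completion; this also clears the throttle once all partitions have moved
    kafka-reassign-partitions.sh --zookeeper "$ZOOKEEPER" \
        --reassignment-json-file plan.json --verify

The --verify pass matters because it is what removes the throttle config again; skipping it would leave replication permanently capped.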
[16:31:25] moved a lot of topics today in main-eqiad, will finish the rest (~30 topics) on Monday
[16:32:09] I just noticed that on main-codfw we have ~2500 partitions, on main-eqiad ~3150
[16:32:30] the diff is probably old topics that are not used anymore
[16:32:58] all the topics and their move timings are in the task if needed later on (hopefully not)
[20:57:00] I am running into the 2m timeout when pulling from the docker registry while deploying in staging, even though the image is only about 1.5GB (with many small files in it). I hear that once we get SSDs in staging this shouldn't be an issue anymore.
[20:58:14] On another matter, I did a debugging session with Arnold about gitlab-backup-restore. It works when run manually but breaks things (restarts fail) when run by systemd. Not solved yet, but for now that stuff is disabled again on both servers in a clean way, without having to disable puppet.
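On the gitlab-backup-restore issue, "works manually but fails under systemd" often comes down to a different environment (PATH, HOME, user, working directory) or stricter unit settings. A sketch of how one might compare the two, assuming a unit named gitlab-backup-restore.service and a script at /usr/local/bin/gitlab-backup-restore (both hypothetical names, the real ones may differ):

    # Inspect what environment and settings the unit actually runs with
    systemctl show gitlab-backup-restore.service \
        -p User,Group,WorkingDirectory,Environment,EnvironmentFiles,ExecStart

    # Re-run the same command as a transient systemd unit to reproduce the failure
    # under systemd's environment rather than an interactive shell
    systemd-run --pty --wait --collect --property=User=root \
        /usr/local/bin/gitlab-backup-restore

Diffing the output of env between the interactive run and the transient unit is usually enough to spot what the manual invocation was silently relying on.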