[04:33:08] 10serviceops: parse1002 down - https://phabricator.wikimedia.org/T342298 (10Joe) This is the server that had hardware issues a few weeks ago, see https://sal.toolforge.org/production?p=0&q=parse1002&d= I'll set it as inactive, then open the task for dcops I guess.
[04:34:55] 10serviceops: parse1002 down - https://phabricator.wikimedia.org/T342298 (10Joe) p:05High→03Medium Also, please don't set priorities in tasks unless you're going to triage and work on them. A server failure is a minor issue that we're used to; it isn't urgent and doesn't need a priority.
[04:38:30] 10serviceops: parse1002 down - https://phabricator.wikimedia.org/T342298 (10Joe) 05Open→03Resolved a:03Joe Resolving as the hardware problem will be treated elsewhere.
[06:36:49] hi folks, starting the rebalances in a bit
[06:41:59] <_joe_> elukey: <3
[07:50:00] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (10JMeybohm)
[08:21:07] 10serviceops, 10Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10Joe) Correcting myself, the edit that is causing all these pages to be re-rendered is a template edit: https://commons.wikimedia.org/w/index.php?title=Template%3ANA...
[08:25:12] 10serviceops, 10Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10fgiunchedi) Today the PHPFPMTooBusy alert for parsoid paged, and it is bound to do so again I think, we've been hovering around the threshold basically this is issue...
[08:36:09] hi folks, I could use some assistance re: ^ specifically around adding capacity to parsoid even if temporarily, what do you think ?
[08:37:53] 10serviceops, 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10JMeybohm) >>! In T297314#9019664, @JMeybohm wro...
[08:42:44] I guess we'll be taking the parsoid outage if it comes to that ?
[08:45:56] maybe a couple of nodes from apis could be moved via confd's settings to parsoid?
[08:46:21] basically one to replace parse1002 and an additional one for extra capacity
[08:46:52] I am not 100% sure if a confd change is enough for this kind of repurpose, or if pybal needs a restart to update the vips -> backends map
[08:47:06] but it should give us time to investigate and possibly resolve
[08:47:54] indeed, in that case shuffling hosts in conftool-data/node/eqiad.yaml should be enough I think, IIRC no pybal restart needed
[08:48:10] more wishful thinking tbh
[08:50:39] <_joe_> akosiaris, jayme can you take a look please?
[08:50:44] <_joe_> I'm busy with other stuff
[08:51:16] * jayme looking
[08:51:24] <_joe_> the main thing to be careful about when moving a server (which amounts to changing the "servergroup") is that IIRC we need to list the parsoid servers in mediawiki-config
[08:51:41] I'm not sure about the repurpose thing - didn't clem do that recently?
[08:51:59] that was towards api IIRC
[08:52:16] <_joe_> parsoid => jobrunner
[08:52:19] ah
[08:52:46] so we lost parsoid capacity there as well
[08:52:55] we could target mw135[6,7] (less ram but same nproc as the mw14XX series, so we add capacity to parsoid but we remove the two least powerful from apis)
[08:53:17] checking mediawiki config
[08:53:36] <_joe_> elukey: actually I'd prefer better CPUs for parsoid if possible
[08:53:47] <_joe_> cpu generation counts a lot towards latency there
[08:53:53] yeah parse1002 isn't well, one out of 24 hosts
[08:54:01] <_joe_> godog: 24?
[08:54:16] sorry, 20
[08:54:26] <_joe_> yeah
[08:54:32] _joe_: yep the nproc is the same, 14xx has only more ram
[08:54:33] <_joe_> so we can throw back in another 2-3
[08:54:43] so I'd leave the more ram-beefy in the api pool
[08:54:48] <_joe_> ack
[08:54:56] <_joe_> elukey: same processor?
[08:55:05] <_joe_> it's not nproc, but the processor generation
[08:55:26] I doubt it is the same, checking
[08:56:07] Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz vs Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
[08:56:54] I don't find parse1xxx in mediawiki's config atm
[08:59:42] ok it looks like wgLinterSubmitterWhitelist has been removed in 49076ac8d15d
[08:59:46] afaics it shouldn't be necessary to update wmf config
[08:59:47] yeah
[08:59:48] so no changes to mediawiki-config needed
[09:02:02] let's brainbounce for the hosts repurpose
[09:02:15] I'm prepping a patch in the meantime with the proposed move
[09:02:27] <_joe_> godog: oh great
[09:02:33] 1) assumption - we don't need to do anything on the api nodes before repurposing them to parsoid since they run the same stuff (to be verified?)
[09:02:41] 2) set the target apis inactive
[09:02:46] 3) merge Filippo's patch
[09:02:54] <_joe_> yes
[09:03:04] 4) check ipvs on pybal to verify
[09:03:08] <_joe_> then set them up in the parsoid cluster
[09:03:12] 5) set the nodes active in parsoid
[09:03:21] <_joe_> you need to set weight and enable them
[09:03:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/940095
[09:03:30] yes yes
[09:03:36] no puppet role changes right?
[09:04:18] afaik no
[09:05:10] <_joe_> eh, wait a sec
[09:05:11] what about the cpu family concern? mw13 is skylake while 14 is cascade
[09:05:17] <_joe_> I'm checking a couple of things
[09:05:57] jayme: no idea, but to me we could proceed with the selected ones, seems a good compromise
[09:06:33] fine by me. We can switch to something else if it does not seem to be enough
[09:06:41] yes exactly
[09:08:28] ok elukey's plan LGTM, are we ok to proceed? _joe_ jayme ?
[09:08:48] <_joe_> wait a sec, writing a CR
[09:09:36] ok
[09:09:48] <_joe_> done, basically I'd move them in site.pp
[09:10:00] <_joe_> there will be slight differences in configuration
[09:10:41] ok so changing the role will change the cluster too, I'll try that
[09:10:42] <_joe_> most importantly, the certificate SAN
[09:10:50] <_joe_> yes
[09:11:04] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (10JMeybohm)
[09:11:07] <_joe_> let's run pcc on them to make sure
[09:11:17] <_joe_> so yeah the plan would be
[09:11:27] <_joe_> 1) set nodes to inactive in conftool
[09:11:37] <_joe_> 2) merge the patch, run puppet on the hosts
[09:12:20] <_joe_> 3) set weight=10, pooled=active in conftool
[09:13:29] (going afk for an errand, bbl!)
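For reference, a rough sketch of the depool / role change / repool sequence agreed above. The hostnames, the cumin expression and the run-puppet-agent wrapper are illustrative assumptions, and the confctl flag syntax is from memory and may differ slightly from what was actually run:

  # 1) set the target api nodes to inactive in conftool (hostnames are examples)
  sudo confctl select 'name=mw1356.eqiad.wmnet' set/pooled=inactive
  sudo confctl select 'name=mw1357.eqiad.wmnet' set/pooled=inactive

  # 2) merge the site.pp role change, then run puppet on the hosts so they
  #    pick up the parsoid role (and the new certificate SAN)
  sudo cumin 'mw135[6-7].eqiad.wmnet' 'run-puppet-agent'

  # 3) enable them in the parsoid cluster with an explicit weight
  #    (pooled=yes is conftool's value for "active")
  sudo confctl select 'cluster=parsoid,name=mw1356.eqiad.wmnet' set/weight=10:pooled=yes
  sudo confctl select 'cluster=parsoid,name=mw1357.eqiad.wmnet' set/weight=10:pooled=yes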
[09:13:41] ok
[09:15:39] <_joe_> running pcc on the patch
[09:16:07] I did too, there's a link
[09:16:09] sigh, that commons template change is still hammering parsoid?
[09:16:43] I thought it was something different after subb.u's comment
[09:17:02] <_joe_> yes
[09:17:09] <_joe_> and no, subbu's comment was correct
[09:17:18] <_joe_> it's "traffic" to commons
[09:17:26] <_joe_> it's just coming from changeprop
[09:17:50] it's what now? 5 days ?
[09:18:00] <_joe_> july 14th was the edit
[09:18:13] sigh, temporary is the new permanent I guess
[09:18:37] <_joe_> akosiaris: but basically, it's a lethal mix of us wanting to pregenerate everything in restbase
[09:18:41] alright I'm going ahead with the plan, i.e. start to set inactive in conftool
[09:18:43] <_joe_> more than we do in mediawiki
[09:19:06] <_joe_> godog: yes, I have to go afk for ~ 10 minutes tops
[09:19:14] _joe_: ack, thanks
[09:19:16] <_joe_> I'll be back before puppet has run on the hosts :)
[09:19:53] heheh
[09:20:50] ok hosts depooled, going ahead with the merge
[09:24:27] puppet ran, I'm going ahead and pool the hosts
[09:26:45] ok all done, I'll force a puppet run on prometheus to pick up the new config
[09:28:22] we can also lower the concurrency of the job in changeprop if the above doesn't end up being sufficient
[09:28:59] * godog nods
[09:29:57] <_joe_> akosiaris: sure, but the problem with these damn pages is that they take on average 20 seconds to parse
[09:30:06] jeremiah-johnson-nod-of-approval.flv
[09:30:11] lol
[09:30:15] <_joe_> ahah
[09:31:22] hehehe ok waiting for a little, see if things improve
[09:33:33] <_joe_> we also have the option to re-edit the template removing the references to all the other pages
[09:33:35] <_joe_> :)
[09:34:23] 10serviceops, 10Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10daniel) >>! In T342085#9030518, @Joe wrote: > Correcting myself, the edit that is causing all these pages to be re-rendered is a template edit: > > https://commons....
[09:35:41] 10serviceops, 10Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10fgiunchedi) >>! In T342085#9030521, @fgiunchedi wrote: > Today the PHPFPMTooBusy alert for parsoid paged, and it is bound to do so again I think, we've been hovering...
[09:40:50] we're back at ~40% idle workers for parsoid
[09:41:29] going afk for 10
[09:41:43] thanks godog!
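A possible way to verify the move performed above, sketched here for reference; the parsoid VIP placeholder, the cumin aliases and the run-puppet-agent wrapper are assumptions rather than commands taken from the log:

  # check that the new realservers show up behind the parsoid VIP
  # (run on the relevant lvs host)
  sudo ipvsadm -L -n | grep -A 30 <parsoid-vip>

  # confirm the conftool state of the parsoid backends
  sudo confctl select 'cluster=parsoid' get

  # force a puppet run on the prometheus hosts so the new targets get scraped
  sudo cumin 'A:prometheus and A:eqiad' 'run-puppet-agent'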
[09:47:39] sure np
[11:58:11] 10serviceops, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10SRE, 10Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) 05Open→03In progress p:05Triage→03Medium
[11:58:17] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe)
[11:58:52] 10serviceops, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10SRE, 10Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) a:05Clement_Goubert→03Joe
[11:59:34] 10serviceops, 10MW-on-K8s: Max upload size on k8s is 2M - https://phabricator.wikimedia.org/T341825 (10Joe) 05In progress→03Resolved
[11:59:43] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe)
[12:01:12] 10serviceops, 10MW-on-K8s: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (10Joe) 05Open→03In progress p:05Triage→03Medium a:03Joe
[12:01:15] 10serviceops, 10MW-on-K8s: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 (10Joe)
[12:23:34] 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10JMeybohm)
[12:47:30] 10serviceops, 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) A basic check of the orchestra...
[13:08:49] 10serviceops, 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10cmassaro) That error is from our top-level, las...
[14:13:21] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10MSantos)
[14:24:12] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10Joe) >>! In T342085#9030615, @daniel wrote: >>>! In T342085#9030518, @Joe wrote: >> Correcting myself, the edit that is causing all these pages to be re-r...
[14:50:40] 10serviceops, 10MW-on-K8s: Allow mediawiki on k8s to support ingress - https://phabricator.wikimedia.org/T342356 (10Joe)
[14:51:05] 10serviceops, 10MW-on-K8s: Allow mediawiki on k8s to support ingress - https://phabricator.wikimedia.org/T342356 (10Joe) p:05Triage→03Medium
[14:52:31] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10daniel) >>! In T342085#9031409, @Joe wrote: >>>! In T342085#9030615, @daniel wrote: > Ok that explains it. Can't we make restbase just invalidate content...
[14:58:20] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10Joe) >>! In T342085#9031569, @daniel wrote: >>>! In T342085#9031409, @Joe wrote: >>>>! In T342085#9030615, @daniel wrote: >> Ok that explains it. Can't we...
[15:38:41] hello folks, I completed the first round of kafka main eqiad moves, but the first three brokers are clearly still overloaded and I didn't see any big move in the related metric
[15:38:49] very weird, I'll keep investigating
[15:39:41] <_joe_> elukey: did you restart changeprop in the process?
[15:39:48] <_joe_> yes it's weird indeed
[15:42:55] _joe_: nope, it seems to only dislike when we increase the partitions, not if we change where they are
[15:43:42] I concentrated on the topics with the most traffic, but I could generate a plan for all the topics and topicmappr would create a config that leads to a perfect balance
[15:44:15] it will take more time but maybe worth it
[15:44:24] I expected some differences after today though
[15:44:36] <_joe_> I would suggest to try to restart changeprop before calling it a loss
[15:44:55] <_joe_> but yeah, strange
[15:44:58] sure, lemme do it
[15:45:04] <_joe_> I can try to dig into it with you tomorrow
[15:47:59] we are still heavily unbalanced so maybe the overload is not only related to the topics that generate traffic
[15:48:56] but this is great though: https://phabricator.wikimedia.org/T338357#9031668
[15:49:34] <_joe_> ok so there is the effect we were mostly interested in
[15:49:52] <_joe_> what is not moving much ?
[15:51:22] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-12h&orgId=1&to=now&var-cluster=kafka_main&var-datasource=thanos&var-disk_device=All&var-kafka_broker=All&var-kafka_cluster=main-eqiad&viewPanel=75
[15:51:33] this is basically how busy the request handler threads are
[15:51:54] (in reverse, this shows idleness)
[15:52:13] usually it is due to too many partitions assigned to few brokers, like we have now
[15:52:35] today I was able to move only some 0.X %, expected a little more
[15:52:43] upstream suggests to not go below 20%
[15:52:55] roll restarts completed :)
[15:53:23] (we gained some wins in other metrics, like how data is spread, partition leaders, etc..)
[15:56:41] anywayyy, I am logging off for today, have a nice rest of the day folks!
[16:11:28] 10serviceops, 10MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (10Lucas_Werkmeister_WMDE) I hope that `mwscript-k8s` will finish with the right `kubectl logs --follow` command, so that deployers can see the output. But also: if this system allows any deployer to...
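As a footnote to that last comment, following the output of a one-off script run as a Kubernetes Job would look roughly like this; the job name and namespace are invented for illustration only:

  # stream the logs of the pod created by the Job until it completes
  kubectl logs --follow job/mwscript-example --namespace mw-script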