[04:33:08] 10serviceops: parse1002 down - https://phabricator.wikimedia.org/T342298 (10Joe) This is the server that had hardware issues a few weeks ago, see https://sal.toolforge.org/production?p=0&q=parse1002&d= I'll set it as inactive, then open the task for dcops I guess.
[04:34:55] 10serviceops: parse1002 down - https://phabricator.wikimedia.org/T342298 (10Joe) p:05High→03Medium Also, please don't set priorities in tasks unless you're going to triage and work on them. A server failure is a minor issue that we're used to; it isn't urgent and doesn't need a priority.
[04:38:30] 10serviceops: parse1002 down - https://phabricator.wikimedia.org/T342298 (10Joe) 05Open→03Resolved a:03Joe Resolving as the hardware problem will be treated elsewhere.
[06:36:49] hi folks, starting the rebalances in a bit
[06:41:59] <_joe_> elukey: <3
[07:50:00] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (10JMeybohm)
[08:21:07] 10serviceops, 10Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10Joe) Correcting myself, the edit that is causing all these pages to be re-rendered is a template edit: https://commons.wikimedia.org/w/index.php?title=Template%3ANA...
[08:25:12] 10serviceops, 10Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10fgiunchedi) Today the PHPFPMTooBusy alert for parsoid paged, and it is bound to do so again I think, we've been hovering around the threshold basically this is issue...
[08:36:09] hi folks, I could use some assistance re: ^ specifically around adding capacity to parsoid even if temporarily, what do you think ?
[08:37:53] 10serviceops, 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10JMeybohm) >>! In T297314#9019664, @JMeybohm wro...
[08:42:44] I guess we'll be taking the parsoid outage if it comes to that ?
[08:45:56] maybe a couple of nodes from apis could be moved via confd's settings to parsoid?
[08:46:21] basically one to replace parse1002 and an additional one for extra capacity
[08:46:52] I am not 100% sure if a confd change is enough for this kind of repurpose, or if pybal needs a restart to update the vips -> backends map
[08:47:06] but it should give us time to investigate and possibly resolve
[08:47:54] indeed, in that case shuffling hosts in conftool-data/node/eqiad.yaml should be enough I think, IIRC no pybal restart needed
[08:48:10] more wishful thinking tbh
[08:50:39] <_joe_> akosiaris, jayme can you take a look please?
[08:50:44] <_joe_> I'm busy with other stuff
[08:51:16] * jayme looking
[08:51:24] <_joe_> the main thing to be careful about when moving a server (which amounts to changing the "servergroup") is that IIRC we need to list the parsoid servers in mediawiki-config
[08:51:41] I'm not sure about the repurpose thing - didn't clem do that recently?
[08:51:59] that was towards api IIRC
[08:52:16] <_joe_> parsoid => jobrunner
[08:52:19] ah
[08:52:46] so we lost parsoid capacity there as well
[08:52:55] we could target mw135[6,7] (less ram but same nproc as the mw14XX series, so we add capacity to parsoid but we remove the two least powerful from apis)
[08:53:17] checking mediawiki config
[08:53:36] <_joe_> elukey: actually I'd prefer better CPUs for parsoid if possible
[08:53:47] <_joe_> cpu generation counts a lot towards latency there
[08:53:53] yeah parse1002 isn't well, one out of 24 hosts
[08:54:01] <_joe_> godog: 24?
[08:54:16] sorry, 20
[08:54:26] <_joe_> yeah
[08:54:32] _joe_: yep the nproc is the same, 14xx has only more ram
[08:54:33] <_joe_> so we can throw back in another 2-3
[08:54:43] so I'd leave the more ram-beefy in the api pool
[08:54:48] <_joe_> ack
[08:54:56] <_joe_> elukey: same processor?
[08:55:05] <_joe_> it's not nproc, but the processor generation
[08:55:26] I doubt it is the same, checking
[08:56:07] Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz vs Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
[08:56:54] I don't find parse1xxx in mediawiki's config atm
[08:59:42] ok it looks like wgLinterSubmitterWhitelist has been removed in 49076ac8d15d
[08:59:46] afaics it shouldn't be necessary to update wmf config
[08:59:47] yeah
[08:59:48] so no changes to mediawiki-config needed
[09:02:02] let's brainbounce for the hosts repurpose
[09:02:15] I'm prepping a patch in the meantime with the proposed move
[09:02:27] <_joe_> godog: oh great
[09:02:33] 1) assumption - we don't need to do anything on the api nodes before repurposing them to parsoid since they run the same stuff (to be verified?)
[09:02:41] 2) set the target apis inactive
[09:02:46] 3) merge Filippo's patch
[09:02:54] <_joe_> yes
[09:03:04] 4) check ipvs on pybal to verify
[09:03:08] <_joe_> then set them up in the parsoid cluster
[09:03:12] 5) set the nodes active in parsoid
[09:03:21] <_joe_> you need to set weight and enable them
[09:03:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/940095
[09:03:30] yes yes
[09:03:36] no puppet role changes right?
[09:04:18] afaik no
[09:05:10] <_joe_> eh, wait a sec
[09:05:11] what about the cpu family concern? mw13 is skylake while 14 is cascade
[09:05:17] <_joe_> I'm checking a couple of things
[09:05:57] jayme: no idea, but to me we could proceed with the selected ones, seems a good compromise
[09:06:33] fine by me. We can switch to something else if it does not seem to be enough
[09:06:41] yes exactly
[09:08:28] ok elukey's plan LGTM, are we ok to proceed? _joe_ jayme ?
[09:08:48] <_joe_> wait a sec, writing a CR
[09:09:36] ok
[09:09:48] <_joe_> done, basically I'd move them in site.pp
[09:10:00] <_joe_> there will be slight differences in configuration
[09:10:41] ok so changing the role will change the cluster too, I'll try that
[09:10:42] <_joe_> most importantly, the certificate SAN
[09:10:50] <_joe_> yes
[09:11:04] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (10JMeybohm)
[09:11:07] <_joe_> let's run pcc on them to make sure
[09:11:17] <_joe_> so yeah the plan would be
[09:11:27] <_joe_> 1) set nodes to inactive in conftool
[09:11:37] <_joe_> 2) merge the patch, run puppet on the hosts
[09:12:20] <_joe_> 3) set weight=10, pooled=active in conftool
[09:13:29] (going afk for an errand, bbl!)
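For reference, a rough sketch of the depool / role change / repool sequence agreed above. The hostnames, the cumin expression and the run-puppet-agent wrapper are illustrative assumptions, and the confctl flag syntax is from memory and may differ slightly from what was actually run:

  # 1) set the target api nodes to inactive in conftool (hostnames are examples)
  sudo confctl select 'name=mw1356.eqiad.wmnet' set/pooled=inactive
  sudo confctl select 'name=mw1357.eqiad.wmnet' set/pooled=inactive

  # 2) merge the site.pp role change, then run puppet on the hosts so they
  #    pick up the parsoid role (and the new certificate SAN)
  sudo cumin 'mw135[6-7].eqiad.wmnet' 'run-puppet-agent'

  # 3) enable them in the parsoid cluster with an explicit weight
  #    (pooled=yes is conftool's value for "active")
  sudo confctl select 'cluster=parsoid,name=mw1356.eqiad.wmnet' set/weight=10:pooled=yes
  sudo confctl select 'cluster=parsoid,name=mw1357.eqiad.wmnet' set/weight=10:pooled=yes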
[09:13:41] ok
[09:15:39] <_joe_> running pcc on the patch
[09:16:07] I did too, there's a link
[09:16:09] sigh, that commons template change is still hammering parsoid?
[09:16:43] I thought it was something different after subb.u's comment
[09:17:02] <_joe_> yes
[09:17:09] <_joe_> and no, subbu's comment was correct
[09:17:18] <_joe_> it's "traffic" to commons
[09:17:26] <_joe_> it's just coming from changeprop
[09:17:50] it's what now? 5 days ?
[09:18:00] <_joe_> july 14th was the edit
[09:18:13] sigh, temporary is the new permanent I guess
[09:18:37] <_joe_> akosiaris: but basically, it's a lethal mix of us wanting to pregenerate everything in restbase
[09:18:41] alright I'm going ahead with the plan, i.e. start to set inactive in conftool
[09:18:43] <_joe_> more than we do in mediawiki
[09:19:06] <_joe_> godog: yes, I have to go afk for ~ 10 minutes tops
[09:19:14] _joe_: ack, thanks
[09:19:16] <_joe_> I'll be back before puppet has run on the hosts :)
[09:19:53] heheh
[09:20:50] ok hosts depooled, going ahead with the merge
[09:24:27] puppet ran, I'm going ahead and pool the hosts
[09:26:45] ok all done, I'll force a puppet run on prometheus to pick up the new config
[09:28:22] we can also lower the concurrency of the job in changeprop if the above doesn't end up being sufficient
[09:28:59] * godog nods
[09:29:57] <_joe_> akosiaris: sure, but the problem with these damn pages is that they take on average 20 seconds to parse
[09:30:06] jeremiah-johnson-nod-of-approval.flv
[09:30:11] lol
[09:30:15] <_joe_> ahah
[09:31:22] hehehe ok waiting for a little, see if things improve
[09:33:33] <_joe_> we also have the option to re-edit the template removing the references to all the other pages
[09:33:35] <_joe_> :)
[09:34:23] 10serviceops, 10Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10daniel) >>! In T342085#9030518, @Joe wrote: > Correcting myself, the edit that is causing all these pages to be re-rendered is a template edit: > > https://commons....
[09:35:41] 10serviceops, 10Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10fgiunchedi) >>! In T342085#9030521, @fgiunchedi wrote: > Today the PHPFPMTooBusy alert for parsoid paged, and it is bound to do so again I think, we've been hovering...
[09:40:50] we're back at ~40% idle workers for parsoid
[09:41:29] going afk for 10
[09:41:43] thanks godog!
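A possible way to verify the move performed above, sketched here for reference; the parsoid VIP placeholder, the cumin aliases and the run-puppet-agent wrapper are assumptions rather than commands taken from the log:

  # check that the new realservers show up behind the parsoid VIP
  # (run on the relevant lvs host)
  sudo ipvsadm -L -n | grep -A 30 <parsoid-vip>

  # confirm the conftool state of the parsoid backends
  sudo confctl select 'cluster=parsoid' get

  # force a puppet run on the prometheus hosts so the new targets get scraped
  sudo cumin 'A:prometheus and A:eqiad' 'run-puppet-agent'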
[09:47:39] sure np
[11:58:11] 10serviceops, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10SRE, 10Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) 05Open→03In progress p:05Triage→03Medium
[11:58:17] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe)
[11:58:52] 10serviceops, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10SRE, 10Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) a:05Clement_Goubert→03Joe
[11:59:34] 10serviceops, 10MW-on-K8s: Max upload size on k8s is 2M - https://phabricator.wikimedia.org/T341825 (10Joe) 05In progress→03Resolved
[11:59:43] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe)
[12:01:12] 10serviceops, 10MW-on-K8s: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (10Joe) 05Open→03In progress p:05Triage→03Medium a:03Joe
[12:01:15] 10serviceops, 10MW-on-K8s: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 (10Joe)
[12:23:34] 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10JMeybohm)
[12:47:30] 10serviceops, 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) A basic check of the orchestra...
[13:08:49] 10serviceops, 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10cmassaro) That error is from our top-level, las...
[14:13:21] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10MSantos)
[14:24:12] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10Joe) >>! In T342085#9030615, @daniel wrote: >>>! In T342085#9030518, @Joe wrote: >> Correcting myself, the edit that is causing all these pages to be re-r...
[14:50:40] 10serviceops, 10MW-on-K8s: Allow mediawiki on k8s to support ingress - https://phabricator.wikimedia.org/T342356 (10Joe)
[14:51:05] 10serviceops, 10MW-on-K8s: Allow mediawiki on k8s to support ingress - https://phabricator.wikimedia.org/T342356 (10Joe) p:05Triage→03Medium
[14:52:31] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10daniel) >>! In T342085#9031409, @Joe wrote: >>>! In T342085#9030615, @daniel wrote: > Ok that explains it. Can't we make restbase just invalidate content...
[14:58:20] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10Joe) >>! In T342085#9031569, @daniel wrote: >>>! In T342085#9031409, @Joe wrote: >>>>! In T342085#9030615, @daniel wrote: >> Ok that explains it. Can't we...
[15:38:41] hello folks, I completed the first round of kafka main eqiad moves, but the first three brokers are clearly still overloaded and I didn't see any big move in the related metric
[15:38:49] very weird, I'll keep investigating
[15:39:41] <_joe_> elukey: did you restart changeprop in the process?
[15:39:48] <_joe_> yes it's weird indeed
[15:42:55] _joe_: nope, it seems to only dislike when we increase the partitions, not if we change where they are
[15:43:42] I concentrated on the topics with the most traffic, but I could generate a plan for all the topics and topicmappr would create a config that leads to a perfect balance
[15:44:15] it will take more time but maybe worth it
[15:44:24] I expected some differences after today though
[15:44:36] <_joe_> I would suggest to try to restart changeprop before calling it a loss
[15:44:55] <_joe_> but yeah, strange
[15:44:58] sure, lemme do it
[15:45:04] <_joe_> I can try to dig into it with you tomorrow
[15:47:59] we are still heavily unbalanced so maybe the overload is not only related to the topics that generate traffic
[15:48:56] but this is great though: https://phabricator.wikimedia.org/T338357#9031668
[15:49:34] <_joe_> ok so there is the effect we were mostly interested in
[15:49:52] <_joe_> what is not moving much ?
[15:51:22] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-12h&orgId=1&to=now&var-cluster=kafka_main&var-datasource=thanos&var-disk_device=All&var-kafka_broker=All&var-kafka_cluster=main-eqiad&viewPanel=75
[15:51:33] this is basically how busy the request handler threads are
[15:51:54] (in reverse, this shows idleness)
[15:52:13] usually it is due to too many partitions assigned to few brokers, like we have now
[15:52:35] today I was able to move only some 0.X %, expected a little more
[15:52:43] upstream suggests to not go below 20%
[15:52:55] roll restarts completed :)
[15:53:23] (we gained some wins in other metrics, like how data is spread, partition leaders, etc..)
[15:56:41] anywayyy, I am logging off for today, have a nice rest of the day folks!
[16:11:28] 10serviceops, 10MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (10Lucas_Werkmeister_WMDE) I hope that `mwscript-k8s` will finish with the right `kubectl logs --follow` command, so that deployers can see the output. But also: if this system allows any deployer to...
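As a footnote to that last comment, following the output of a one-off script run as a Kubernetes Job would look roughly like this; the job name and namespace are invented for illustration only:

  # stream the logs of the pod created by the Job until it completes
  kubectl logs --follow job/mwscript-example --namespace mw-script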