[07:25:17] serviceops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 5 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9554129 (Marostegui)
[09:12:27] serviceops, Add-Link, Growth-Team, Prod-Kubernetes, and 2 others: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122#9554282 (Clement_Goubert) It doesn't look like linkrecommendation logs a request_id that would correspond to t...
[10:14:24] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9554373 (taavi)
[12:41:04] serviceops, Add-Link, Growth-Team, Prod-Kubernetes, and 2 others: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122#9554845 (Urbanecm_WMF) >>! In T357122#9554282, @Clement_Goubert wrote: > It doesn't look like linkrecommendati...
[12:58:33] hey hnowlan should i start gradually disabling storage on enwiki?
[12:58:54] nemo-yiannis: could it wait an hour?
[12:59:06] we can limit scap to a few servers at a time to monitor jobs load
[12:59:07] sure
[13:30:33] serviceops: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661#9554968 (MoritzMuehlenhoff)
[13:34:05] serviceops: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661#9554982 (MoritzMuehlenhoff) >>! In T356661#9520597, @MoritzMuehlenhoff wrote: > This leaves the buster hosts: alert (in the process of being reimaged to bookworm currently), cloudweb and the deployment servers. I'll update the...
[14:19:28] serviceops, MW-on-K8s: Migrate remaining internal MW API traffic to k8s - https://phabricator.wikimedia.org/T357907#9555117 (kamila)
[14:20:19] serviceops, MW-on-K8s, Patch-For-Review: Migrate remaining internal MW API traffic to k8s - https://phabricator.wikimedia.org/T357907#9555129 (Clement_Goubert)
[14:20:29] serviceops, Data-Engineering, MW-on-K8s, SRE, Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555132 (Clement_Goubert)
[14:20:44] serviceops, Data-Engineering, MW-on-K8s, SRE, Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555135 (Clement_Goubert) duplicate→In progress
[14:20:51] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9555136 (Clement_Goubert)
[14:27:02] serviceops, Data-Engineering, MW-on-K8s, SRE, Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555154 (Clement_Goubert) We're all set for this, according to [[ https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment#Upgr...
[14:28:12] serviceops, MW-on-K8s, Patch-For-Review: Migrate remaining internal MW API traffic to k8s - https://phabricator.wikimedia.org/T357907#9555162 (Clement_Goubert)
[14:31:07] nemo-yiannis: we should be all good on the jobrunner front. how much of an increase in requests to mw-api-int are you anticipating?
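A minimal sketch of the "limit scap to a few servers at a time" approach discussed above: deploy to a small batch of hosts per cluster, then pause so the jobrunner/jobqueue dashboards can be checked before continuing. The host names, batch size, soak time, and the deploy_to() helper are illustrative assumptions, not the actual scap invocation or inventory used.

```python
#!/usr/bin/env python3
"""Batched rollout sketch: push a change to a few hosts per cluster at a
time, then pause so the dashboards can be checked before continuing.
Everything here (host names, batch size, soak time, deploy_to) is an
illustrative assumption, not the real scap setup."""

import subprocess
import time

# Assumed inventories: ~13 restbase hosts per cluster, with made-up names.
CLUSTERS = {
    "eqiad": [f"restbase-eqiad-{n:02d}.example" for n in range(1, 14)],
    "codfw": [f"restbase-codfw-{n:02d}.example" for n in range(1, 14)],
}
BATCH_SIZE = 4            # roughly the "another 4 per dc" cadence used later
SOAK_SECONDS = 15 * 60    # impact was visible ~10 min after a deploy last time


def deploy_to(host: str) -> None:
    """Placeholder for a per-host deploy, e.g. a scap run limited to `host`."""
    subprocess.run(["echo", "deploying to", host], check=True)  # stand-in command


def main() -> None:
    total = max(len(hosts) for hosts in CLUSTERS.values())
    for batch_start in range(0, total, BATCH_SIZE):
        for cluster, hosts in CLUSTERS.items():
            for host in hosts[batch_start:batch_start + BATCH_SIZE]:
                deploy_to(host)
                print(f"{cluster}: deployed {host}")
        if batch_start + BATCH_SIZE < total:
            print(f"batch done, soaking {SOAK_SECONDS}s while dashboards are checked")
            time.sleep(SOAK_SECONDS)


if __name__ == "__main__":
    main()
```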
[14:33:52] last time i checked turnilo, enwiki traffic is 1/5th of the total page/html traffic
[14:36:02] so 20% more traffic
[14:41:12] ah the requests go to parsoid and not mw-api-int
[14:41:13] hnowlan: https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/1004683
[14:43:59] hnowlan: yes
[14:44:34] lgtm
[14:46:31] should i start with 1 server per cluster?
[14:46:46] sure
[14:46:50] https://etherpad.wikimedia.org/p/restbase-parsoid-storage-rollout
[14:46:51] ok
[14:46:59] if you want to keep track ^
[14:47:11] ok
[14:48:01] oh hm, there are hosts in conftool that aren't in scap again
[14:48:44] which hosts?
[14:49:01] i remember we removed some targets from scap config because they were deprecated
[14:49:31] restbase2034 and restbase2035
[14:50:16] and a whole load of restbase104*. maybe they're not ready yet, let me confirm. You can still scap to the first server numerically anyway in the meantime
[14:50:43] these are the ones i removed: https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/998488
[14:51:27] these are new rather than old hosts
[14:52:28] okay, the two new codfw hosts are in conftool but the eqiad ones aren't yet
[14:53:05] I am marking the restbase nodes I already deployed with strikethrough
[14:53:12] cool
[14:53:30] https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/1004729
[14:55:21] do you have a link to the jobs dashboards we were following last time?
[14:56:27] for the jobrunners: https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-release=main&var-container_name=All&var-site=&from=now-1h&to=now&refresh=5m
[14:56:31] for the jobqueue itself: https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s
[14:57:32] ok thanks
[14:57:55] I'm just gonna reformat the pad a bit
[14:58:03] i think last time the impact was visible ~10 mins after the deployment
[15:03:11] dunno how much impact I'd expect to see from two servers
[15:06:02] yeah
[15:06:02] I'd say go ahead and do another 3 in each DC
[15:15:04] <_joe_> again, I'd keep an eye on parsoid more than on the jobqueue
[15:15:39] <_joe_> but it looks mostly unimpressed
[15:16:49] <_joe_> we went from 10% to 15% busy workers
[15:17:10] <_joe_> but that looks like organic traffic
[15:24:36] in total i have deployed 4 nodes on each cluster
[15:24:49] out of ~13
[15:29:06] nemo-yiannis: I'd say keep going, another 4 per dc
[15:29:11] ok
[15:38:04] done
[15:48:36] I'd do another 4 in both again
[15:48:53] still looks like fairly small impact
[15:49:29] ok
[15:49:46] yeah overall looks uneventful
[16:03:18] do the rest whenever
[16:04:06] o
[16:04:12] ok
[16:04:26] given the comparative impact of the last migration I am a little suspicious tbh
[16:05:09] yeah me too
[16:05:17] we had a ~600rps jump for the last wikis and we're not seeing anything like that here. We're obviously smoothing things out by splaying things
[16:05:40] can we verify that the enwiki requests are bypassing storage?
[16:06:43] and/or that we definitely were not bypassing for enwiki before this by accident
[16:11:31] yup, we also deployed enwiki last week
[16:12:14] the config didn't handle the combination of default/except domain we had
[16:12:30] oof
[16:12:33] eh
[16:12:36] ok
[16:14:31] heh, oh well. Nothing broke
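A rough sketch of the conftool-vs-scap consistency check implied by the discrepancy above (restbase2034/restbase2035 and the restbase104* hosts): diff the set of pooled hosts against the scap deploy targets and flag hosts present in only one of them. The input files and their one-hostname-per-line format are assumptions for illustration; the real data would come from confctl and the deploy repo's scap config.

```python
#!/usr/bin/env python3
"""Sketch of a conftool-vs-scap drift check: compare the pooled host list
with the scap deploy target list and report hosts present in only one.
Input files and format are assumed for illustration."""

from pathlib import Path


def read_hosts(path: str) -> set[str]:
    """One hostname per line; '#' comments and blank lines are ignored."""
    hosts = set()
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.add(line)
    return hosts


def main() -> None:
    conftool_hosts = read_hosts("conftool_restbase_hosts.txt")  # assumed dump
    scap_targets = read_hosts("scap_restbase_targets.txt")      # assumed dump

    only_conftool = sorted(conftool_hosts - scap_targets)
    only_scap = sorted(scap_targets - conftool_hosts)

    if only_conftool:
        print("in conftool but missing from scap targets:", ", ".join(only_conftool))
    if only_scap:
        print("scap targets not (yet) in conftool:", ", ".join(only_scap))
    if not only_conftool and not only_scap:
        print("conftool and scap target lists agree")


if __name__ == "__main__":
    main()
```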
[16:15:51] also no complaints for a week
[16:15:54] success
[16:17:50] :D
[16:18:16] put some slides together, sell this as Radical QA to a conference
[16:44:19] serviceops, RESTBase Sunsetting, Epic, Parsoid (Tracking): Prepare amount of workers to handle enwiki traffic for parsoid endpoints - https://phabricator.wikimedia.org/T357504#9555675 (hnowlan) Open→Resolved a: hnowlan
[16:44:26] lol
[16:44:37] now we need to figure out how to handle lint jobs without pregeneration and then we can disable it too
[17:21:21] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796#9555793 (hnowlan)
[17:21:29] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796#9352065 (hnowlan)