[03:10:13] my apologies, m.utante, I missed this earlier - that's odd, as I spot checked their disk utilization earlier this morning ...
[03:18:54] ah, ok - I must have somehow managed to skip mw1446 when checking the host-overview dashboard for filesystem utilization. I know c.laime has a cleanup command for this, I'll flag it.
[08:05:27] serviceops, Infrastructure-Foundations, Patch-For-Review: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9925107 (elukey) Next steps: * Get a list of Docker images running Jessie/Stretch from the registry (somehow, not sure how to do it right now)...
[09:04:12] claime: o/
[09:04:41] yo
[09:04:50] I noticed that php7.4 docker images have versions like 7.4.33-1-s1, what does the 's' stand for?
[09:05:35] special
[09:05:41] basically I am reviewing https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1021922 and the new version proposed seems not correct, but not sure what we should use
[09:05:44] kidding, I don't know
[09:06:06] ah wait is it part of the php versioning?
[09:06:19] I think we just bump that when we make a change that isn't a php version change actually
[09:06:32] yeah
[09:07:06] s2 to s3 is wikimedia-buster to buster
[09:08:23] from https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1021922/5/images/php/7.4/cli/changelog I see that, at least for php-cli, Janis bumped to s2 when a new ICU was introduced
[09:08:35] so I guess that bumping to bookworm could be s4?
[09:08:54] yep
[09:09:10] ack thanks :)
[09:10:39] but erm, I'm not sure we're ready to bump to bullseye
[09:10:54] or bookworm
[09:11:17] Moritz told me that all the php components should be ready for it, already tested etc..
[09:11:23] The problem is right now, if you bump production-images, it'll rebuild mediawiki on it the next scap run
[09:11:28] And deploy it everywhere at once
[09:11:52] yes yes I think it is written in the commit msg, I think that James will only do it when everybody gives the green light
[10:26:32] <_joe_> elukey: that change would break production, tbw
[10:26:46] <_joe_> there's many changes that clearly weren't accounted for
[10:27:06] <_joe_> the -sN naming was originally for security rebuilds
[10:42:37] looks like there's nothing running garbage-collect on the docker registry
[10:42:52] That could explain some of the awful performance cc elukey
[11:23:54] we're definitely not able to use bookworm (our custom PHP 7.4 build hasn't been built for it), but the component/php74 itself is running on cloudweb
[11:43:09] serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9926076 (MSantos)
[12:32:48] _joe_ ahhh okok so we shouldn't use -s unless we upgrade for security, and this is not the case. For the actual changes, this is of course something that serviceops needs to validate first, I was just reviewing James' changes, didn't want to push for them
[13:08:15] on registry1003:
[13:08:28] "GET /v2/bullseye/manifests/latest HTTP/1.1" 200 3242 "-" "check_http/v2.3.3 (monitoring-plugins 2.3.3)" rt=0.196 uct="0.000" uht="0.196" urt="0.196"
[13:08:59] 'rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"'
[13:09:41] all seconds
[13:09:52] everything looks fine, I'll roll it out to the other nodes
[13:23:16] serviceops, MW-Interfaces-Team, Traffic: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9926374 (Bmueller)
[13:24:14] serviceops, MW-Interfaces-Team, Traffic: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9926376 (Bmueller) @daniel thank you for all the prep work! This is good to go :-)
[13:36:26] serviceops, MW-on-K8s, Observability-Logging, SRE: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076#9926423 (kamila) I believe the errors are unrelated (they are due to T340935 and we've had bad messages before and they didn't cause the p...
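Editor's note: the timing fields pasted at 13:08 (rt, uct, uht, urt, all in seconds) can be pulled out of an access-log line with a few lines of Python. This is a minimal sketch, not the tooling used on the registry hosts: the function name `parse_timings` and the regular expression are illustrative assumptions, and only the field names and the sample line come from the log above. It also keeps only the first value if nginx logs several comma-separated upstream times.

```python
import re

# Field names follow the log_format fragment quoted in the channel:
# rt=$request_time uct="$upstream_connect_time"
# uht="$upstream_header_time" urt="$upstream_response_time"
TIMING_RE = re.compile(r'\b(rt|uct|uht|urt)="?([0-9.]+|-)"?')


def parse_timings(line):
    """Return the timing fields of one access-log line as floats (seconds)."""
    timings = {}
    for key, value in TIMING_RE.findall(line):
        # nginx writes "-" when a variable is empty (e.g. no upstream used).
        timings[key] = None if value == "-" else float(value)
    return timings


# Sample line copied from the 13:08:28 message.
sample = (
    '"GET /v2/bullseye/manifests/latest HTTP/1.1" 200 3242 "-" '
    '"check_http/v2.3.3 (monitoring-plugins 2.3.3)" '
    'rt=0.196 uct="0.000" uht="0.196" urt="0.196"'
)
timings = parse_timings(sample)
```

Here `timings["rt"]` and `timings["urt"]` are equal, which matches the "everything looks fine" reading: nearly all of the request time was spent waiting on the upstream response.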
[13:40:34] elukey: let's talk about upgrading the docker-registry vms to bullseye here
[13:40:42] so the others can weigh in :)
[13:41:25] claime: I know that you don't like to chat with me in pvt, you can say that
[13:41:40] :(
[13:41:42] not true
[13:41:54] :D :D
[13:42:17] python3-docker-report is our package, and it's installed on build hosts that are already bullseye
[13:42:18] I'd be bold and propose to jump to bookworm directly, if possible
[13:42:36] so we'll not need to upgrade again soon-ish
[13:42:55] we could maybe reimage one of the eqiad ones
[13:43:00] see what shakes
[13:43:03] serviceops, MW-Interfaces-Team, Traffic: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9926454 (Joe) >>! In T364400#9780622, @BBlack wrote: >>>! In T364400#9779996, @hnowlan wrote: >> Could we implement this remapping at the ATS layer rather than the Apache one, in a...
[13:43:08] exactly yes, I thought the same..
[13:43:34] <_joe_> the real risk is nginx
[13:43:50] <_joe_> we do perverse things in nginx on the registries
[13:44:13] <_joe_> so I would suggest to prepare a set of tests to run
[13:44:14] <_joe_> like
[13:44:24] <_joe_> - publish a non-restricted image from build2001
[13:44:34] <_joe_> - publish a restricted image from deploy1002
[13:45:07] <_joe_> - pull a non-restricted image without credentials from the public interface (.wikimedia.org)
[13:45:21] <_joe_> - pull a restricted image without credentials and see it fail
[13:45:32] <_joe_> - check the pipeline on gitlab still works
[13:45:44] <_joe_> some of the above is in httpbb tests IIRC
[13:47:26] not sure about that
[13:47:55] oh, yeah, in templates, of course
[13:55:27] yeah, I wrote some back in the days because it
[13:55:42] it's hard to type on a new keyboard...
[13:56:01] it's hard to reason about what is "correct" in terms of the registry
[14:09:08] serviceops, MW-Interfaces-Team, Traffic: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9926617 (daniel) >>! In T364400#9926454, @Joe wrote: > Even more to @bblack's comment, I would just have apache funnel anything under `/api` it receives to an endpoint in mediawiki...
[14:17:48] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T368079#9926632 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:19:47] can you guys add the brainstorm to https://phabricator.wikimedia.org/T332016 ?
[14:21:17] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9926653 (jijiki) Unless something else pops up, we shall be retiring the old hosts (aka the VMs) next week
[14:28:43] serviceops: Migrate docker registry hosts to bullseye - https://phabricator.wikimedia.org/T332016#9926687 (Clement_Goubert) * Necessary packages `docker-registry` and `python3-docker-report` are available for bullseye in the right versions * Summarizing from irc, the real risk is the nginx config. Tests woul...
[14:30:49] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9926695 (VirginiaPoundstone) @SGupta-WMF and @Scott_French Thank you for you...
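Editor's note: two of the checks _joe_ listed at 13:44 (an anonymous pull of a non-restricted image succeeds, an anonymous pull of a restricted image fails) could be scripted roughly as below. This is only a sketch under stated assumptions: it uses the same `/v2/<image>/manifests/<tag>` request shape seen in the registry1003 log line, the function names are invented for illustration, the registry host and the restricted image name are placeholders, and treating 401 or 403 as "denied" is an assumption about how the nginx config answers. It does not cover the publish or GitLab pipeline checks.

```python
from urllib import error, request


def manifest_status(base_url, image, tag="latest"):
    """Return the HTTP status of an anonymous GET on a v2 manifest URL."""
    url = f"{base_url}/v2/{image}/manifests/{tag}"
    try:
        with request.urlopen(url) as resp:
            return resp.status
    except error.HTTPError as exc:
        # 4xx/5xx responses are raised as exceptions; report the code instead.
        return exc.code


def check_anonymous_pulls(base_url, public_image, restricted_image,
                          status_of=manifest_status):
    """Evaluate the two unauthenticated pull checks from the test list.

    status_of is injectable so the logic can be exercised without a registry.
    """
    public = status_of(base_url, public_image)
    restricted = status_of(base_url, restricted_image)
    return {
        "public_pull_ok": public == 200,
        # Assumption: a denied anonymous pull is answered with 401 or 403.
        "restricted_pull_denied": restricted in (401, 403),
    }
```

A call would look like `check_anonymous_pulls("https://registry.example.org", "bullseye", "restricted/example")`, where both the host and the restricted image name are hypothetical and need to be replaced with real values before the result means anything.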
[14:34:07] /14
[14:34:10] err :)
[14:46:01] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544 (Vgutierrez) NEW
[14:46:21] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9926761 (Vgutierrez) p:Triage→Medium
[14:51:00] serviceops, Infrastructure-Foundations, netops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545 (Vgutierrez) NEW
[14:51:13] serviceops, Infrastructure-Foundations, netops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#9926781 (Vgutierrez) p:Triage→Medium
[14:53:27] serviceops, MoveComms-Support, MW-on-K8s, SRE, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9926798 (Clement_Goubert)
[15:12:34] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9926889 (cmooney) > IPIP encapsulation has a 20 bytes overhead that needs to be accounted somehow, in high-traffic[12] services we chose...
[15:29:54] serviceops, MoveComms-Support, MW-on-K8s, SRE, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9926945 (Jdforrester-WMF) Should we call this Resolved and track the remaining migrations in the parent, T290536?
[15:34:10] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9926991 (Scott_French)
[15:38:32] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9927038 (Scott_French) @VirginiaPoundstone - I believe there was one tick ma...
[16:12:52] serviceops, MW-on-K8s, Release-Engineering-Team, Scap: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes - https://phabricator.wikimedia.org/T341441#9927255 (akosiaris) Just to point out that this is probably not from the network. We don't have networking rate limitin...
[16:15:48] serviceops, MW-on-K8s, Release-Engineering-Team, Scap: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes - https://phabricator.wikimedia.org/T341441#9927265 (Clement_Goubert) It's possible it's to do with docker using single-threaded gzip for compression on push https...
[18:56:04] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9927952 (Scott_French) @SGupta-WMF - Ahmon merged [0] this morning, so you s...
[19:30:08] rzl: any issue with me pushing out this revert https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1050032?tab=checks
[19:30:14] I believe the underlying issue is fixed
[19:33:42] jhathaway: nope fire away
[19:33:49] nod, thanks
[22:11:43] serviceops: Alerting on under-scaled deployments - https://phabricator.wikimedia.org/T366932#9928800 (Scott_French) In progress→Resolved Alright, this is now live, and should be sufficient to catch future instances of the scenario that originally motivated this task (missing quota) and more gener...
[22:14:25] serviceops, conftool: requestctl should fail with error if fails parsing yaml file - https://phabricator.wikimedia.org/T355256#9928809 (Scott_French) @Clement_Goubert FYI, this was released last week.