[05:54:14] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Marostegui)
[08:42:12] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[08:43:00] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[08:52:34] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 2 others: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (dcausse) @ItamarWMDE the patch finally got deployed, th...
[09:18:10] serviceops, Machine-Learning-Team: docker-pkg fails to upload big Docker images to the registry - https://phabricator.wikimedia.org/T335177 (elukey) @akosiaris thanks for the in depth answer! I figured it was something nginx-related but I didn't think to check the max upload size (TIL for the next time)....
[09:56:23] serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[09:56:46] serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[09:58:03] serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (klausman)
[09:58:38] serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (klausman)
[09:59:05] serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[10:42:29] We're getting an alert on parsoid errors
[10:43:11] timeouts according to the dashboard
[10:44:43] Almost all on /w/rest.php/commons.wikimedia.org/v3/page/pagebundle/User%3AA._Wagner%2Fgallery/574545713
[10:46:09] Seems to have passed now
[10:46:45] https://grafana.wikimedia.org/goto/vecRFzs4z?orgId=1
[11:17:05] serviceops, Infrastructure-Foundations, WikimediaDebug, Performance-Team (Radar): Upgrade php-excimer package from 1.0.4 to 1.1.1 - https://phabricator.wikimedia.org/T332964 (MoritzMuehlenhoff) Open→Resolved The ICU67 build has also been updated to 1.1.1, marking as resolved.
[11:17:10] serviceops, Arc-Lamp, Performance-Team, WikimediaDebug, Patch-For-Review: Add per-request flamegraph option to WikimediaDebug - https://phabricator.wikimedia.org/T291015 (MoritzMuehlenhoff)
[11:52:50] serviceops, MW-on-K8s, Recommendation-API, Patch-For-Review: Migrate recommendation-api to mw-api-int - https://phabricator.wikimedia.org/T334062 (Clement_Goubert) Without further input, I will be relying on our monitoring and `service-checker-swagger` which checks the `x-amples` from the `/?spec...
[11:59:22] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (JMeybohm) While refactoring kubernetes puppet code I came across the fact that we place credentials t...
[12:31:00] serviceops, API Platform, ChangeProp, EventStreams, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (VirginiaPoundstone)
[12:31:15] serviceops, API Platform, SRE, Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (VirginiaPoundstone)
[12:31:25] serviceops, API Platform, Patch-For-Review: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (VirginiaPoundstone)
[13:17:09] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[13:21:42] hi folks! I'd like to add a new user called 'ml-runner' to role::builder (https://gerrit.wikimedia.org/r/c/operations/puppet/+/912240), seems easy enough - shall I proceed or is there a specific process?
[13:23:13] elukey: it's fine - if you add it to the deployment-charts repo config.yaml as well
[13:23:22] serviceops, RESTbase Sunsetting, API Platform (RESTbase Deprecation Roadmap), Epic, Platform Engineering Roadmap: Survey RESTBase services and find which ones accesses Parsoid via RESTBase - https://phabricator.wikimedia.org/T333536 (VirginiaPoundstone)
[13:24:38] serviceops, ChangeProp, EventStreams, Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (VirginiaPoundstone)
[13:24:53] serviceops, API Platform (RESTbase Deprecation Roadmap), Patch-For-Review: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (VirginiaPoundstone)
[13:25:27] serviceops, SRE, API Platform (RESTbase Deprecation Roadmap), Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (VirginiaPoundstone)
[13:26:40] jayme: ack thanks!
[13:29:30] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[13:39:04] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[14:13:29] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[14:15:07] serviceops, SRE: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (RLazarus)
[14:15:25] serviceops, SRE: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (RLazarus)
[14:15:35] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (RLazarus)
[14:16:32] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Marostegui)
[14:21:57] serviceops, Infrastructure-Foundations, SRE, Datacenter-Switchover: sre.discovery.datacenter breaks on services not in "production" state - https://phabricator.wikimedia.org/T335341 (Clement_Goubert) Open→Resolved
[14:22:01] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert)
[14:38:16] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) Open→Resolved
[14:38:23] serviceops, SRE, Datacenter-Switchover: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (Aklapper)
[14:45:29] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Clement_Goubert) Everything went great, thanks for your support!
[14:46:26] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[14:46:46] serviceops, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (Clement_Goubert) In progress→Resolved
[14:47:02] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 3 others: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE)
[14:47:15] serviceops, SRE, Datacenter-Switchover: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (Clement_Goubert)
[14:47:18] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert)
[14:47:47] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 3 others: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE) Thank you for the ping, we will prioritze t...
[15:32:51] serviceops, Deployments, Release-Engineering-Team, Patch-For-Review, Performance-Team (Radar): MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 (dancy) Open→Resolved
[15:37:46] serviceops, Machine-Learning-Team: docker-pkg fails to upload big Docker images to the registry - https://phabricator.wikimedia.org/T335177 (elukey) Open→Resolved a:elukey
[17:03:24] claime: yea, so.. I _could_ just arm it and resolve the alert. but the motd says "don't deploy from here" and not arming the keyholder is a good way to make sure that is not happening. So feels more like maybe the fix would be in monitoring itself.
[17:03:40] (follow-up from -operations, re: keyholder on deploy2002 alert)
[17:06:02] Yeah, the alert should probably only be for the one tagged deployment_server
[17:06:35] in hiera
[17:07:12] +1
[17:09:16] will need a bit of puppet refactoring though
[17:09:45] keyholder::monitoring is included by profile::keyholder::server, which is itself included by role::deployment_server
[17:09:53] And we can't do hiera lookups in roles
[17:11:12] I gotta go, but it's something to discuss afa does the secondary deployment server needs the keyholder armed
[17:11:46] I made https://phabricator.wikimedia.org/T335435
[17:15:23] serviceops, SRE: keyholder monitoring should not alert on inactive deployment server - https://phabricator.wikimedia.org/T335435 (Dzahn)
[17:17:53] serviceops, SRE: keyholder on inactive deployment server - https://phabricator.wikimedia.org/T335435 (Dzahn)
[19:37:07] serviceops, MW-on-K8s, SRE, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (dancy) >>! In T288629#8807158, @JMeybohm wrote: > I don't see helm defaults being installed to releases or ci nodes since t...
[21:08:44] serviceops, SRE-OnFire, Traffic, conftool, Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (BBlack) Probably needs subtasks for two things: 1. Fix "safe-service-restar...