[07:12:22] serviceops, Parsoid (Tracking), Patch-For-Review: parsoidtest1001 implementation tracking - https://phabricator.wikimedia.org/T363402#10209487 (akosiaris) Open→Resolved `/etc/envoy/envoy.yaml` was empty on the new host. Deleting it and running puppet fixed it, and it seems fine now ` ak...
[07:41:11] serviceops, Datacenter-Switchover, Patch-For-Review: Steady-state sizing of mw-web and mw-api-ext - https://phabricator.wikimedia.org/T376519#10209530 (akosiaris) In an ideal world, this process would inform the upper and lower bounds of an HPA and we wouldn't need to come up with exact numbers, but...
[10:06:40] serviceops, Data-Engineering, SRE-OnFire, Event-Platform: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994#10209907 (Clement_Goubert) Open→Resolved No more action needed on this incident.
[10:10:18] serviceops, conftool: requestctl should fail with error if fails parsing yaml file - https://phabricator.wikimedia.org/T355256#10209943 (Clement_Goubert) In progress→Resolved Patch has been merged, resolving. Feel free to reopen if we encounter this issue again.
[10:13:40] serviceops, Patch-For-Review: Remove tls-proxy cpu limits on eventgate - https://phabricator.wikimedia.org/T345244#10209947 (Clement_Goubert) In progress→Resolved Abandoning in favor of the less service-specific parent task's approach.
[10:16:55] serviceops, Patch-For-Review: Remove tls-proxy cpu limits on eventstreams - https://phabricator.wikimedia.org/T345243#10209960 (Clement_Goubert) Open→Resolved Abandoning in favor of the less service-specific parent task's approach.
[10:25:33] serviceops, Add-Link, Growth-Team, Observability-Tracing, and 3 others: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122#10210034 (Clement_Goubert) In progress→Stalled >>! In T357122#9554845, @Urbanecm_WMF wrote: >>>!...
[10:25:40] serviceops, Data-Engineering, EventStreams, Observability-Tracing, and 3 others: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005#10210044 (Clement_Goubert) Maybe some internal OpenTelemetry instrumentation could help shed some light on thi...
[10:27:16] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671#10210055 (Clement_Goubert) In progress→Declined As we have moved almost completely away from puppetized appservers, abandoning this.
[10:30:55] serviceops: Ensure configcluster bootstraps cleanly - https://phabricator.wikimedia.org/T318699#10210061 (Clement_Goubert)
[10:30:57] serviceops: Ensure configcluster bootstraps cleanly - https://phabricator.wikimedia.org/T318699#10210063 (Clement_Goubert) In progress→Stalled
[10:30:58] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671#10210062 (Clement_Goubert)
[10:31:02] serviceops: Ensure role::mediawiki::appserver bootstraps cleanly - https://phabricator.wikimedia.org/T319168#10210064 (Clement_Goubert) In progress→Declined As we have moved almost completely away from puppetized appservers, abandoning this.
[10:43:25] serviceops, MW-on-K8s, Patch-For-Review: Allow running periodic jobs for mw on k8s - https://phabricator.wikimedia.org/T341555#10210171 (Clement_Goubert) p:Triage→High
[10:44:30] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10210193 (JMeybohm)
[10:45:38] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10210196 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm
[10:54:17] serviceops, MW-on-K8s, SRE: Update Parsoid wikitech documentation following mw-on-k8s migration - https://phabricator.wikimedia.org/T370646#10210213 (Clement_Goubert) Open→Resolved p:Triage→Low a:Clement_Goubert
[11:18:33] I'm having a quick look at why mobileapps-tls-proxy containers (mostly in eqiad, 25 of them) are flirting with their memory limit, and it would seem that *something* happened during the night after the switchover https://grafana.wikimedia.org/goto/HXtrLUkNR?orgId=1
[11:19:06] I'm tempted to just roll restart the service in eqiad and see if it comes back
[11:19:46] (or maybe 600M of ram isn't enough for envoy in this use case but I doubt it)
[11:20:33] I have witnessed some similar high memory usage in other services at times
[11:20:40] and after a restart it doesn't complain for a long time
[11:20:56] my gut feeling says "some very small memory leak"
[11:21:59] yeah same, although this time it just jumps up instantly for a bunch of containers
[11:23:15] we can also take the laid back approach and let oomkiller handle it, it'll kill the envoys that use more than their limit and they should not climb back to such high values again
[11:30:44] serviceops, Prod-Kubernetes, Kubernetes: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10210251 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm completed: - kubestage200...
[11:35:21] serviceops, Prod-Kubernetes, Kubernetes: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10210265 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with OS bookworm
[12:19:52] serviceops, Prod-Kubernetes, Kubernetes: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10210332 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with OS bookworm completed: - kubestage200...
[12:48:01] serviceops, LPL Essential, MinT, Community Wishlist (Translations), Community-Tech (Jackal (not a fox) Fox (Sept 23 - Oct 4)): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#10210414 (akosiaris) @Pginer-WMF @santhosh, I've tried below to summarize our conversati...
[13:00:01] hey folks, I'd need to re-run provision on parsoidtest1001
[13:00:07] it will reboot the host
[13:00:24] should I do it during the mw infra window?
[13:00:30] same for deploy1003 :(
[13:00:48] (they are supermicros and some virt-related bios settings are enabled)
[13:01:45] by provision, you don't mean reimage, right?
[13:01:50] just bios settings
[13:03:08] correct, but they need a host reboot
[13:04:11] so deploy1003 isn't even the primary one, go ahead
[13:04:40] and parsoidtest1001 hasn't been used successfully yet, I just fixed an issue with envoy today, so go ahead too
[13:05:37] ooook
[13:08:28] serviceops, MW-on-K8s: Evaluate running a statsd-exporter in the mw-script namespace - https://phabricator.wikimedia.org/T376714 (akosiaris) NEW
[13:09:04] akosiaris: last one and I'll stop - shall we wait until next week for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078380 ?
[13:14:57] elukey: well, you got a data persistence member +1ing it
[13:15:19] so go ahead?
[13:15:54] akosiaris: sure, you mentioned Matthew so I asked :)
[13:16:15] yeah, my bad. I should have said a Data Persistence member.
[13:26:48] perfect, I'll schedule it for tomorrow's mw infra window
[13:26:56] just-to-be-sure-tm
[14:41:08] serviceops, Infrastructure-Foundations, SRE, Patch-For-Review: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10211044 (elukey) Next steps: * Apply https://gerrit.wikimedia.org/r/1078380 during tomorrow's MW Maintenance Window and retest...
[15:45:05] serviceops, Infrastructure-Foundations, SRE: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10211194 (elukey) I dumped all the files stored in swift in a text file on ms-fe1009, and ran the following: ` from pprint import pprint pr...
[16:40:26] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10211486 (Jhancock.wm)
[16:59:13] serviceops, MW-on-K8s, Datacenter-Switchover: Control mw-on-k8s periodic maintenance jobs with an etcd value - https://phabricator.wikimedia.org/T367118#10211554 (Scott_French) a:Scott_French→None Unassigning this, as it's not something I'm planning to work on in the near future (i.e., it was...
[17:45:10] serviceops, MW-on-K8s, Patch-For-Review: Evaluate running a statsd-exporter in the mw-script namespace - https://phabricator.wikimedia.org/T376714#10211814 (RLazarus) Thanks! > And if for whatever reason, we end up with a different namespace from the currently implemented as systemd timers recurring...
[17:51:26] serviceops, Patch-For-Review: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604#10211837 (Scott_French) LVS setup is done and mwdebug-next.svc.(codfw|eqiad).wmnet work as expected.
[19:01:48] serviceops, Patch-For-Review: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604#10212154 (Scott_French) Alright, with the exception of a couple of minor follow-ups, I think that's about as far as we can get for now without the 8.1-based images.
[21:36:27] serviceops, Prod-Kubernetes, Traffic, Kubernetes, Patch-For-Review: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171#10212679 (CDanis)
[21:39:12] serviceops, Prod-Kubernetes, Traffic, Kubernetes, Patch-For-Review: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171#10212682 (CDanis) ===== Works in prod now: {P69502} == Remaining work to do: [ ] {T376291} [ ] {T376762}
[22:44:36] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 (Scott_French) NEW
[22:44:49] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10212798 (Scott_French) p:Triage→High
[23:29:58] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10212860 (Scott_French) Alright, this //seems// relatively straightforward, with one potential gotcha: `utils/create_ecdsa_cert` uses `/usr/local/bin/puppet-ecdsacert`, which only exists on `role::puppe...