[07:16:52] serviceops, MW-on-K8s, SRE, observability: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (Joe) Open→Resolved
[07:26:02] serviceops, Machine-Learning-Team, Observability-Metrics, Kubernetes: Don't scrape every containerPort for metrics - https://phabricator.wikimedia.org/T318707 (Joe) I think the current solution works well. Basically: * If your pod contains `prometheus.io/scrape: true` prometheus will pick up the...
[08:11:21] serviceops, SRE, Traffic, envoy: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (JMeybohm)
[08:11:37] serviceops, SRE, Traffic, envoy: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (JMeybohm) p:Triage→Low
[08:49:54] serviceops, SRE, Traffic, envoy: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (Joe) It would be great if envoy fixed the TLS 1.3 to work well when two envoys talk to each other - we should check if that's been solved in the latest versions.
[09:32:22] serviceops, SRE, Traffic, envoy: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (JMeybohm) >>! In T337453#8879233, @Joe wrote: > It would be great if envoy fixed the TLS 1.3 to work well when two envoys talk to each other - we should check if tha...
[10:13:34] serviceops, Thumbor, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability): Upgrade Thumbor to bullseye - https://phabricator.wikimedia.org/T336881 (hnowlan)
[10:20:56] serviceops, SRE, Datacenter-Switchover: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (Clement_Goubert) p:Triage→Medium
[10:41:30] serviceops: operations/docker-images/production-images contains references to non-existent image python3 - https://phabricator.wikimedia.org/T336682 (Clement_Goubert) Reconstructing the dependency tree: - `python3` -- `prometheus-nutcracker-exporter` -- `python3-devel` --- `python3-build-stretch` It looks l...
[10:42:26] serviceops: operations/docker-images/production-images contains references to non-existent image python3 - https://phabricator.wikimedia.org/T336682 (Clement_Goubert) p:Triage→Low
[10:45:34] serviceops, Data-Engineering, SRE, Shared-Data-Infrastructure, Patch-For-Review: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (elukey) Open→Resolved a:elukey
[10:55:22] <_joe_> o/ kamila_ and I were looking at the benthos 'official' helm chart, https://github.com/benthosdev/benthos-helm-chart
[10:55:56] <_joe_> and it seems to me like the typical chart that goes all the way in the direction of flexibility, which translates to "bring your own kubernetes yaml everywhere"
[10:56:08] <_joe_> with a lot of stuff we won't / can't use
[10:56:19] <_joe_> and lacking other stuff we would probably love to have
[10:57:00] <_joe_> I would err on the side of inventing one chart here, it won't be overly complex, and maybe the only complex thing we'll try to do is run unit tests and e2e tests using "helm test"
[12:01:23] serviceops, MW-on-K8s: Better naming for mw-on-k8s pods - https://phabricator.wikimedia.org/T325071 (Clement_Goubert) Open→Resolved Pod names are now: `$namespace.$datacenter.$release-` ex: `mw-web.eqiad.main-64df5b98b5-4btrm`
[12:42:24] serviceops, Patch-For-Review, Service-deployment-requests: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 (TheresNoTime)
[13:12:45] jayme: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922874 is ready for review/deploy i think
[13:12:52] operator in eqiad + codfw
[13:12:53] ?
[14:51:22] serviceops, SRE, API Platform (RESTbase Deprecation Roadmap), Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Jdforrester-WMF)
[15:11:02] serviceops, MW-on-K8s: Coordinate testing of testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (Clement_Goubert)
[15:17:27] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Clement_Goubert)
[15:17:57] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Clement_Goubert) p:Triage→High
[15:18:09] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Clement_Goubert)
[15:28:00] serviceops, SRE, Traffic, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (elukey) The ML team is serving its Lift Wing model servers via the API gateway, so we'd benefit as well to have edge caching :)
[15:56:07] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[15:56:16] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo) p:Triage→High
[15:57:32] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[15:58:18] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[16:08:38] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[16:09:36] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[16:10:00] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[16:13:00] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[16:13:12] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[16:13:41] serviceops, PyBal, Release-Engineering-Team, Scap, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[16:29:21] serviceops, PyBal, Release-Engineering-Team, SRE, and 4 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo)
[17:14:14] serviceops, Abstract Wikipedia team, Abstract Wikipedia Fix-It tasks: Please hide from the docker registry two no-longer-used Abstract Wiki images (now moved to GitLab) - https://phabricator.wikimedia.org/T337505 (Jdforrester-WMF) p:Triage→Low