[06:28:10] serviceops, Prod-Kubernetes, Kubernetes: Make more use of Calico network policy features - https://phabricator.wikimedia.org/T331894 (JMeybohm) So is {T340780}
[07:20:19] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[07:55:53] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm) > #wikimedia-operations: removed docker-registry.discovery.wmnet/envoy-future:1.26.1-1 - T300324 Since 1.24, envoy required libc 2.29 and buste...
[10:07:16] serviceops, Add-Link, GrowthExperiments-NewcomerTasks, SRE, Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (Urbanecm_WMF)
[10:10:25] serviceops, Add-Link, GrowthExperiments-NewcomerTasks, SRE, Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (akosiaris) Since this is fixed, should we resolve this? Do we need a f...
[10:18:11] serviceops, Add-Link, GrowthExperiments-NewcomerTasks, SRE, Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (Urbanecm_WMF) >>! In T340780#8980138, @akosiaris wrote: > Since this i...
[10:42:28] serviceops, Data-Engineering, Event-Platform (Sprint 14 B): Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments - https://phabricator.wikimedia.org/T340059 (gmodena) > Did that happen in DSE as well? Are there logs (from the operator, k8s events etc.)? f/up to what @Ott...
[10:54:10] serviceops, CX-deployments, MinT, Language-Team (Language-2023-July-September): Remove Flores key from production - https://phabricator.wikimedia.org/T337284 (Pginer-WMF) p:Triage→Medium
[11:28:48] serviceops, Kubernetes, Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubestagemaster1002.eqiad.wmnet with OS bullseye
[11:29:02] serviceops, Kubernetes, Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubestagemaster2002.codfw.wmnet with OS bullseye
[11:44:46] serviceops, Machine-Learning-Team, MinT, SRE, and 2 others: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (Pginer-WMF)
[11:59:42] serviceops, Kubernetes: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubestagemaster1002.eqiad.wmnet with OS bullseye completed: - kubestagemaster1002 (**WAR...
[12:20:07] serviceops, Kubernetes: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubestagemaster2002.codfw.wmnet with OS bullseye executed with errors: - kubestagemaster...
[13:18:06] serviceops, Data-Persistence: Investigate how to abstract misc Mariadb clusters host/ip information so that no manual action is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris)
[13:19:02] serviceops, Add-Link, GrowthExperiments-NewcomerTasks, SRE, Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (akosiaris) Open→Resolved a:akosiaris >>! In T340780#898014...
[13:19:32] serviceops, Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris)
[13:21:34] serviceops, Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris) The easy way out is ofc to follow the #mw-on-k8s way, i.e. put in appl...
[13:29:25] serviceops, Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris) Some of the ideas above can be implemented using ideas from T331894, e...
[13:30:38] serviceops, Add-Link, GrowthExperiments-NewcomerTasks, SRE, Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (Urbanecm_WMF) a:akosiaris→Urbanecm_WMF
[13:47:31] serviceops, SRE, SRE-Access-Requests, Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (akosiaris) `deployment` is the group to be used for deploying to k8s. Initially we had targetted `wikid...
[15:53:44] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 0), Event-Platform (Sprint 14 B): Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments - https://phabricator.wikimedia.org/T340059 (JArguello-WMF)
[16:00:45] serviceops, Data Engineering and Event Platform Team (Sprint 0), Event-Platform (Sprint 14 B): Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments - https://phabricator.wikimedia.org/T340059 (JArguello-WMF)
[16:30:33] serviceops, Data Engineering and Event Platform Team, Data-Engineering, SRE-OnFire, Event-Platform: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (JArguello-WMF)
[16:32:41] serviceops, Data Engineering and Event Platform Team, Data-Engineering, SRE-OnFire, and 3 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (JArguello-WMF)
[16:33:03] serviceops, Data Engineering and Event Platform Team, Data-Engineering, SRE, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (JArguello-WMF)
[16:33:47] serviceops, Data Engineering and Event Platform Team, Data-Engineering, SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (JArguello-WMF)