[08:12:47] serviceops, Machine-Learning-Team, MinT, SRE, and 2 others: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (akosiaris)
[08:51:25] ottomata: is there another flink-operator dashboard apart from https://grafana.wikimedia.org/d/H-sRgqLVk/flink-kubernetes-operator ? That seems a little thin on actual operator metrics tbh
[08:52:12] (and it only works after the first flink cluster has been created ;))
[09:00:40] serviceops, MW-on-K8s, SRE, Traffic: Add configuration file support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336037 (Joe) Open→In progress p:Triage→High
[09:00:51] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe)
[09:37:22] serviceops, Prod-Kubernetes, Kubernetes: Selected IPv6 service-cluster-up ranges are to big - https://phabricator.wikimedia.org/T335285 (akosiaris) Wow, I did not see that coming! Anyway we don't use them for anything yet, it's fine to switch to /116 address. I had a look at the allocations in netbo...
[10:03:41] serviceops, MW-on-K8s, SRE, Traffic: Add traffic sampling support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336038 (Joe) Open→In progress p:Triage→High
[10:03:53] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe)
[10:06:22] serviceops, Service-deployment-requests: New Service Request 'IPoid' - https://phabricator.wikimedia.org/T325147 (jijiki) p:Triage→High
[10:07:40] serviceops, Security-API-Service, Kubernetes: Create helm chart for IPoid in operations/deployment-charts - https://phabricator.wikimedia.org/T336163 (jijiki) p:Triage→High a:jijiki
[10:35:30] serviceops, Service-deployment-requests: New Service Request 'IPoid' - https://phabricator.wikimedia.org/T325147 (jijiki)
[11:31:51] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog-Deprecated: Introduce versioning in PCS output and static assets - https://phabricator.wikimedia.org/T336251 (Jgiannelos)
[11:32:40] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog-Deprecated: Introduce versioning in PCS output and static assets - https://phabricator.wikimedia.org/T336251 (Jgiannelos)
[11:33:12] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog-Deprecated: Introduce versioning in PCS output and static assets - https://phabricator.wikimedia.org/T336251 (Jgiannelos)
[11:33:29] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog-Deprecated: Introduce versioning in PCS output and static assets - https://phabricator.wikimedia.org/T336251 (Jgiannelos)
[12:43:24] jayme: flink operator not really? the operator really doesn't do very much
[12:43:43] well...it operates the flink clusters
[12:43:48] but for flink app, yes https://grafana.wikimedia.org/goto/ZfN9gW84k?orgId=1
[12:45:38] jayme: yes but the operator is mostly just handling deployment / lifecycle stuff. once the app is running it doesn't do much aside from i dunno, restarting jobmanagers if they fail, etc.
[12:46:53] but ya the operator dash could def use more work, been mostly focused on app deployment recently
[12:49:00] I would like to have a proper dashboard for that as lots of things can go wrong with operators (errors during reconciliation, problems with the k8s api, work that's been queuing up for whatever reason...)
[12:50:16] We should be able to get an idea if that thing is healthy and can keep up with its work before we start relying on it
[12:51:00] as we don't have webhook support, deployers will also not get feedback on errors during deployment. So it's important to make those visible elsewhere
[12:58:15] okay cool, i can try and improve this dash.
[12:58:29] are there any particular metrics at https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/ that you think should def go in there?
[13:06:04] serviceops, Prod-Kubernetes, Kubernetes: Selected IPv6 service-cluster-up ranges are to big - https://phabricator.wikimedia.org/T335285 (JMeybohm) >>! In T335285#8836067, @akosiaris wrote: > Wow, I did not see that coming! > > Anyway we don't use them for anything yet, it's fine to switch to /116 ad...
[13:11:18] ottomata: you probably know the system better than me. But I would say the resource and lifecycle metrics do look pretty important (even to derive alerts from). An overview of errors of API requests might be helpful as well and usually the operator sdk provides some insights into the workqueue and reconciliation
[13:48:44] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 3 others: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE) **Task Review**: - This may produce false...
[13:50:56] serviceops, Wikidata, Wikidata Dev Team, Wikidata-Query-Service, and 4 others: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE)
[13:53:43] serviceops, Prod-Kubernetes, Kubernetes: Selected IPv6 service-cluster-up ranges are to big - https://phabricator.wikimedia.org/T335285 (akosiaris) >>! In T335285#8836849, @JMeybohm wrote: >>>! In T335285#8836067, @akosiaris wrote: >> Wow, I did not see that coming! >> >> Anyway we don't use them fo...
[14:05:10] okay jayme thanks i'll add some stuff. i mostly put that dash together so I could get you resource usage last time you asked.
[14:39:09] serviceops, Wikidata, Wikidata Dev Team, Wikidata-Query-Service, and 4 others: [SW] Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE)
[17:19:30] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[17:59:16] serviceops, Content-Transform-Team-WIP, RESTBase, SRE, and 5 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (Kappakayala) @KOfori, Could you please have someone from your team to help with consultation. Based on my chat with Frant...
[18:02:07] serviceops, Content-Transform-Team-WIP, RESTBase, SRE, and 5 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (KOfori) @Kappakayala indeed. Had a quick chat earlier with @FJoseph-WMF and briefly with the team. We'll set something up...
[18:28:10] jayme: etc. which envoy service proxy mw endpoint should we be using instead of api-ro.discovery.wmnet?
[18:28:54] oh wait, i guess we talked about this before? https://phabricator.wikimedia.org/T333575
[18:29:41] ottomata: exactly, yes
[18:30:05] based on https://phabricator.wikimedia.org/T333120 i will use mwapi-async
[18:30:24] ty
[18:38:40] jayme: after using service proxy, I won't have any specific network policy, do I still need to set egress true?
[18:38:52] hmm, that's not true, i do have kafka brokers
[18:38:52] hmm
[18:39:32] but i don't have any values for dst_nets?
[19:06:32] think i got it, i don't need dst_nets, kafka_brokers will fill it in
[19:29:20] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (bking)
[19:31:34] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (bking)
[19:32:59] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (bking)