[08:09:50] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JMeybohm) >>! In T333464#8884003, @Ottomata wrote: > [[ https://grafana-rw.wikimedia.org/d/H-sRgqLVk/flink-kuberne...
[10:30:38] serviceops, MW-on-K8s: Coordinate testing of testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (Clement_Goubert) p:Triage→Medium
[11:08:35] serviceops, MW-on-K8s: Coordinate testing of testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (Clement_Goubert)
[11:08:46] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:10:29] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[12:16:05] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) > Apart from the managed flink clusters in staging-eqiad being empty I agree Ah, the value was 0 (?) so...
[12:31:42] jayme: o/ would you be okay with me proceeding with flink operator in eqiad + codfw?
[12:50:21] I can't seem to find the issue where I asked the question of there the tombstone for the latest state in swift will be stored if not in zookeeper
[12:51:07] *of where the
[13:08:19] jayme: that is an app specific setting, but for now we can't use zookeeper, so we've disabled JobManager HA and will come back to that issue
[13:08:31] https://phabricator.wikimedia.org/T331283
[13:52:10] yes, that's why I was asking where that data lives instead
[13:52:45] my understanding was that there must be a reference to the latest snapshot in swift stored somewhere
[13:53:02] that thing which rdf-streaming-updater stores in config maps
[14:07:57] yes, for JobManager HA. https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/overview/
[14:10:02] so for non-ha this thing is not required? that feels strange
[14:10:41] ah, because without HA you're basically doomed when the jobmanager dies
[14:10:59] right ;)
[14:11:17] tbh. I'm very inclined to not want this in a prod cluster
[14:11:28] we can do it via configmaps as search does for now
[14:11:31] if you prefer
[14:12:07] jayme: if okay with you, we'd be okay with deploying in wikikube without HA, but having 0 SLO/support for the thing. mostly trying to undeploy in DSE so we can remove our "POC" namespace
[14:12:13] I'm just not sure what happens, how to recover, who will recover, will we know etc. etc. in case of issues
[14:12:20] and to make it possible to move forward
[14:12:36] for now no recovery needed? would that be okay if documented?
[14:13:05] undeploying from dse should already be possible as you run in staging, no?
[14:13:57] staging uses kafka test so no real data
[14:14:35] jayme: if you like, perhaps we could move forward with operator deployment, but we can block mw page content change deployment until we figure out HA?
[14:15:22] yeah, that works for me :)
[14:16:41] Also if we're very clear (in wikitech) about what to do in case of issues (e.g. delete and recreate release) and who to call out to, non-HA might be fine ... if it is for your usecase
[14:16:48] but I'll have to bring that to the team
[14:18:24] okay, gmodena is working on SLO doc via SRE's template. he'll be moving to wikitech soon. https://phabricator.wikimedia.org/T333833 https://docs.google.com/document/d/1U2bYVqmEsn7ryP0dtFUr-S5xPqF9_plLIFdzk883HBc/edit
[14:19:39] nice
[14:26:04] okay so! if you are okay with operator deployment... https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922874 :D
[14:29:09] I left yet another nit about the egress rules, apart from that I think we're gtg
[14:44:15] done
[15:05:06] ottomata jayme https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment/SLO/Mediawiki_Page_Content_Change_Enrichment
[15:15:06] serviceops: Some httpbb checks are flapping - https://phabricator.wikimedia.org/T336590 (Clement_Goubert)
[15:15:30] serviceops, MW-on-K8s: httpbb fails requesting mw-web during deployments - https://phabricator.wikimedia.org/T331609 (Clement_Goubert)
[15:16:30] serviceops, MW-on-K8s: httpbb fails requesting mw-on-k8s during deployments - https://phabricator.wikimedia.org/T331609 (Clement_Goubert) p:Medium→High
[15:16:54] serviceops, MW-on-K8s: httpbb fails requesting mw-on-k8s during deployments - https://phabricator.wikimedia.org/T331609 (Clement_Goubert)
[15:20:51] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[15:21:09] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[15:34:03] yeehaw
[15:34:23] jayme: okay done that, i got a bit of time before some other meetings, would you be okay if I deployed the operator?
[15:35:15] gmodena: ottomata: Feedback from the team is that we need clear expectations in the SLO ("nobody is going to even look at this off hours" etc.). If we have that and some operational guidelines like "delete and recreate will be fine", then we can go with it being on wikikube
[15:35:40] jayme: is that re operator or mw-enrich app?
[15:35:57] that is re: enrich app
[15:36:00] aye k
[15:36:03] ottomata: operator is +1ed
[15:36:05] ah and I see your +1, thank you
[15:36:43] i will apply in all 4 clusters, including staging eqiad and codfw to remove zk egress
[15:40:06] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (akosiaris) @Trizek-WMF, should we resolve this?
[15:40:52] ottomata: sounds good
[15:46:10] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF) In progress→Resolved
[15:46:14] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Trizek-WMF)
[15:49:12] actually there are no diffs for staging-codfw for flink-operator. there is an unapplied calico diff though
[15:49:39] i'll apply for flink-operator only
[15:51:13] calico diff is probably a version bump only, but please only deploy -l name=flink-operator everywhere
[15:51:17] okay
[15:51:39] -crds first then
[15:51:51] ah, yeah. sure
[15:52:00] hm, it didn't like -crds on its own in codfw
[15:52:06] Release "flink-operator-crds" does not exist. Installing it now.
[15:52:06] Error: create: failed to create: namespaces "flink-operator" not found
[15:52:19] eheh, sorry
[15:52:29] you need to deploy -l name=namespaces forst
[15:52:31] *first
[15:52:33] hm okay
[15:52:35] for the namespace to be created
[15:52:54] should have been the same in staging, though
[15:53:02] in staging i didn't -l
[15:53:08] ah, okay
[15:53:08] because there were no others diffs
[16:03:42] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata)
[16:04:36] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) Deployed in all wikikube clusters. We'll have to re-enable operator egress to Zookeeper when we figure...
[16:09:00] thanks jayme deployed.
[16:27:13] serviceops: Some httpbb checks are flapping - https://phabricator.wikimedia.org/T336590 (RLazarus) Thanks for the task and sorry for the slow reply -- per @Clement_Goubert's merge, these failures are concurrent with deployments. We could paper over them with retries on the httpbb side, but we think it's not...
[17:40:56] jayme ack
[17:49:07] jayme ottomata I updated the SLO draft.
[18:49:47] serviceops, Documentation: Create template on Wikitech for documenting production services - https://phabricator.wikimedia.org/T336354 (kostajh) @jijiki fwiw https://wikitech.wikimedia.org/wiki/Service/Etcd looks like it could become the basis for a template without too much trouble.