[07:38:15] errands, back in 60'
[08:59:11] dcausse: I'll be 2' late, smoking a coffee before our 1:1
[08:59:22] gehel: np!
[10:01:11] lunch
[11:47:54] lunch
[13:00:59] o/
[13:18:56] dcausse do you still need me to bring down the operator/clean up the namespace as described at https://phabricator.wikimedia.org/T342149#9234883 ?
[13:19:37] Happy to do so, but I was going to start migrating the staging app to flink-operator if not
[13:21:43] hmm, maybe we need a new namespace for staging? Looks like staging/prod use the same helm chart, not sure if it's possible to use different charts
[13:22:38] or maybe we change it and just don't apply prod until we're ready?
[13:35:19] inflatador: for simulating the k8s upgrade test, it's up to you, I'm fine to skip it and figure this out once we need it or spend some time now
[13:37:05] for the session -> k8s operator migration, it's certainly a lot easier to have at least another service folder I think
[13:37:24] we could possibly re-use the same namespace, I don't know
[13:38:16] but perhaps for simplicity reasons we should go for a new namespace...
[14:56:49] dcausse re: new namespace I'm more inclined to just update the existing services helmfile to point to the flink-app chart and only apply to staging. We can roll back the chart to session mode if we need to do a prod release before prod deployment. Open to feedback if you'd rather do it a different way
[14:59:05] inflatador: the difficulty is that the helmfile is shared by 3 envs: staging/codfw/eqiad, and if the update procedure is: 1/ undeploy using helmfile destroy, 2/ merge the deployment-prep patch to switch to flink-app, 3/ helmfile apply
[14:59:59] doing this for staging is OK, but doing this again for eqiad or codfw might fail at the undeploy step since the deployment-chart repo would no longer have the flink-session chart
[15:00:22] or we use a separate copy of the deployment-chart git repo pointing at an older version
[15:00:47] like what we've done with Ben when migrating the release name from wdqs to wikidata
[15:01:26] dcausse: could you invite me to that RDF streaming meeting?
[15:01:39] gehel: sure
[15:01:45] dcausse thanks, great feedback. Definitely let's think about this and I'll reach out to Ben, Luca et al as well
[15:01:51] dcausse: I haven't managed to send that community communication yet. I'm probably going to do it end of my day, ~22:00
[15:02:12] ok, I'll send mine tomorrow morning then
[15:02:14] I'll start documenting our options
[15:02:57] thanks!
[15:05:26] \o
[15:05:44] o/
[15:18:01] hmm, after double checking i'm not sure it's the firewall that puppet installs :( It looks like staging is already part of PRODUCTION_NETWORKS: https://github.com/wikimedia/operations-puppet/blob/production/modules/network/data/data.yaml#L136C9-L136C40
[15:18:08] but then why do the connections time out :S
[15:18:32] could it be kafka-test?
[15:18:43] no, the error messages include the topic name
[15:19:22] could we have mixed topic names with the wrong kafka cluster?
[15:20:13] hmm, it's configured as source and sink, i guess i can verify the code didn't mix them up
[15:20:35] also the consumer comes up and talks to kafka-test without failing
[15:20:56] so it's not that :/
[15:21:51] perhaps the k8s network policies then?
[15:22:08] is there a way to enter a pod and prevent it from dying? I guess i probably wouldn't have the tools necessary, but it would be nice to type `telnet foo.bar 1234` and get a rejection, to get a very clean answer if it's open or not :)
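(The `telnet foo.bar 1234` wish above can be approximated without telnet or netcat. A minimal sketch of that kind of check from inside the pod, assuming bash and/or python3 are present in the image; the broker name and port are the ones that show up later in this log:)

```
# bash's /dev/tcp redirection gives a quick open/closed answer without telnet/netcat;
# the timeout turns a silently dropped connection into a clear failure
timeout 5 bash -c 'echo > /dev/tcp/kafka-main1001.eqiad.wmnet/9093' \
  && echo "OPEN : kafka-main1001.eqiad.wmnet:9093" \
  || echo "CLOSED/FILTERED : kafka-main1001.eqiad.wmnet:9093"

# python3 fallback if /dev/tcp is not available in the container's shell
python3 -c 'import socket; socket.create_connection(("kafka-main1001.eqiad.wmnet", 9093), timeout=5); print("OPEN")'
```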
[15:22:08] might be the first time we do multi-release with the flink-app chart
[15:22:50] that'd be too easy :)
[15:23:50] the network policy looks reasonable, via `kubectl get -o yaml networkpolicy` i verified yesterday it has all the hosts in the connection string
[15:28:03] * ebernhardson separately wonders about all the configuration flink says is supplied but "isn't a known config"... but ignoring that until things are working :P
[15:28:57] yes this one is annoying, I think they mix up AdminClient and the "normal" client, both have different options
[15:29:34] oh ok, that makes some sense
[15:31:51] I wonder how the egress is tied to a particular pod
[15:33:03] should be the top level podSelector section
[15:34:06] it uses a matchLabels on `app: flink-app-producer` and `release: proucer`. I mostly pretend those do what they say, i haven't looked closely :P
[15:34:19] `release: producer` even
[15:34:57] kubectl get pod -l app=flink-app-producser -l release=consumer
[15:35:11] ah but it's not up obviously
[15:35:43] there's a typo
[15:35:46] i see a container running for 19h, seems like flink just keeps retrying in the same one
[15:35:50] dcausse: oh?
[15:36:34] ah no it's me :(
[15:36:56] :) i type badly all the time
[15:36:57] kubectl get pod -l app=flink-app-producer -l release=producer
[15:37:08] I see it
[15:39:20] oh nifty, you can get a shell
[15:39:42] with: KUBECONFIG=/etc/kubernetes/cirrus-streaming-updater-deploy-staging.config kubectl exec --stdin --tty flink-app-producer-64ccdf75dc-kknnp -- /bin/bash
[15:39:59] oh, but it's the tls-proxy... more selectors needed
[15:40:04] huh? nice! I thought that was disabled for us
[15:41:10] the trick is that KUBECONFIG, i had asked how to be able to see rendered resources since i was blocked on secrets and was told that magic config file is the way to get full "deploy" access
[15:41:57] interesting, naively that kube_env would grant the "deploy" access but apparently this is yet another account
[15:42:04] *I thought*
[15:43:31] * ebernhardson has not yet figured out how to get the main container instead of the tls one though :P
[15:44:22] ahh, it needed a `-c flink-main-container`. But we don't have telnet or netcat here :P
[15:44:30] but we have python, close enough i guess
[15:44:52] :)
[15:50:58] sigh
[15:51:01] OPEN : kafka-main1001.eqiad.wmnet:9093
[15:51:19] so, the network is open, python can talk to kafka-main. but flink fails with timeouts :P
[15:51:40] ebernhardson: Stumbled across this ancient patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/856655
[15:51:59] I'm not super familiar w/ how the cirrus dumps work. Do we have a lot of sharded dumps and then there's also this non-sharded dump that we want to clean up?
[15:53:10] ryankemper: sorta, there is one dump run per mysql cluster (even though we don't dump those, it was just a convenient split), then there is cirrus metadata that gets dumped
[15:53:45] ryankemper: that should be safe to remove, that is from the transition between a single job and a job-per-cluster
[15:55:04] kk I'll merge the patch
[15:55:05] yea all 5 kafka-main hosts are available from the producer container... sigh, i'm not really sure what threads to pull on then
[15:55:25] * ebernhardson abandons firewall patch, another wrong path :P
[15:55:39] ebernhardson: do we set ssl options on the kafka properties?
[15:56:22] dcausse: oh! yea that's a good thing to check, maybe source and sink didn't end up using the same config
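(The SSL question turns out to be the lead that pans out below: the client ended up on PLAINTEXT while the broker port being dialed, 9093, expects TLS. A hedged sketch of how that difference shows up with kafkacat, assuming a standard CA bundle path; this is not the job's actual configuration, just an illustration of the symptom:)

```
# with TLS properties, a metadata listing against the TLS port succeeds quickly
kafkacat -L -b kafka-main1001.eqiad.wmnet:9093 \
  -X security.protocol=SSL \
  -X ssl.ca.location=/etc/ssl/certs/ca-certificates.crt

# the same listing with librdkafka's default (PLAINTEXT) against the TLS port
# just hangs until it times out, matching the timeouts flink was reporting
kafkacat -L -b kafka-main1001.eqiad.wmnet:9093
```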
[16:06:23] "Note: SSL is deprecated; its use in production is not recommended.". I'm going to guess that might technically be true, but is misleading
[16:06:30] (from kafka AdminClient docs)
[16:07:30] but indeed it looks like security.protocol = PLAINTEXT, so that might be our problem. looking
[16:22:42] workout, back in ~40
[16:24:18] producer claims to be producing checkpoints
[16:27:21] \o/
[16:28:10] consumer not happy though. It gives a thread dump instead of a stack trace even :P but progress is something :)
[16:28:35] :/
[16:29:42] ahh: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "flink-app-consumer-search-taskmanager-1-3" is forbidden: exceeded quota: quota-compute-resources, requested: limits.memory=2000Mi, used: limits.memory=9000Mi, limited: limits.memory=10
[16:33:13] hmm, it's been 5 minutes so data should be flowing out, but not seeing anything in kafkacat
[16:46:49] also some things transition to failed, and then it resets a bunch of stuff, restarts the various pieces, but doesn't print an exception to the kubectl logs...
[16:49:01] oh it does, my tooling was just ignoring it. Unknown field name namespace_id from UpdateEventconverters.fromRevisionScoring
[16:52:43] ryankemper, inflatador: I'll skip the pairing session today, I need to get the wdqs communication out
[16:52:48] And interview after that
[16:53:06] gehel: inflatador: ack
[17:22:57] ACK
[17:23:14] back, but going to hit lunch, back in time for pairing
[17:58:12] back
[18:00:34] ebernhardson just curious, where does the consumer run? It's not part of the flink app, is it?
[18:01:04] inflatador: it runs as another release in k8s, one for each destination cluster (eqiad, codfw, cloudelastic)
[18:01:11] same service
[18:01:21] it is a flink app
[18:02:29] ebernhardson ACK, is that true of the WDQS streaming updater? I always thought the consumer was the streaming updater service on the wdqs hosts
[18:15:59] inflatador: i'm not entirely sure, but a quick look through the rdf streaming updater consumer code suggests to me that it is standalone and not a flink thing
[18:19:02] ACK, been grepping around there myself
[18:19:38] it makes more sense to run standalone on WDQS hosts, since they're all independent
[18:47:14] WDQS comm sent to the Wikidata ML and on wiki (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/October_2023_scaling_update)
[18:47:48] Let's see how much noise this makes...
[19:00:10] dcausse: I looked through the scholia email, LGTM!
[20:39:14] bah, the wdqs updater is in a crashloop with org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined offset with no reset policy for partitions: [eqiad.mediawiki.page-delete-0] ...
[20:40:29] I guess that somehow it needs to reset to earliest or latest on streams that are not receiving any events...
[20:48:45] ahhh... I was wondering about that
[20:51:50] going to depool EQIAD and start an incident report I guess!
[20:53:14] dcausse ryankemper ^^ just depooled eqiad wdqs
[20:53:23] yes... shipping a quick fix to staging to see if that works
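(For the NoOffsetForPartitionException above: the consumer group no longer has a committed offset for the idle partition and has no reset policy to fall back on. A sketch of how to confirm that from a Kafka client host, assuming the plaintext port and a hypothetical group name; the real group name isn't in this log:)

```
# partitions whose CURRENT-OFFSET column shows "-" have no committed offset left;
# a consumer configured without an offset reset policy fails on exactly those
kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
  --describe --group wdqs_streaming_updater
```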
[20:53:50] codfw is likely to fail soon on the same bug I guess
[20:54:52] ack, finishing up at gym, will be back in 40' to take over if needed
[20:58:34] the fix seems to work on staging, will apply to eqiad
[21:05:21] the job seems back, deploying the fix to other jobs (wdqs@codfw, wcqs@eqiad, wcqs@codfw) that did not fail yet
[21:07:24] dcausse ACK, watching lag for eqiad hosts
[21:08:03] wow, they caught up quickly. Repooling...
[21:31:44] d-causse thanks for addressing this in the middle of the night!
[21:33:10] np, it's a bug in what I wrote
[21:36:28] my guess so far is that it's related to the offset retention mechanism, would not have happened with kafka >= 2.1 I guess (https://issues.apache.org/jira/browse/KAFKA-4682)
[21:37:10] assuming that the old FlinkKafkaConsumer was committing frequently on idle sources while the new KafkaSource is not
[21:38:13] all jobs have been patched
[21:47:08] OK, created https://phabricator.wikimedia.org/T349147 to follow up, we can go over it tomorrow
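(One way to sanity-check the offset-retention theory above: if a topic like eqiad.mediawiki.page-delete is idle, its log-end offsets won't move between runs, and a job that only commits while processing records can let its committed offsets age out, the pre-2.1 behaviour tracked in KAFKA-4682. A sketch, assuming the standard Kafka CLI tools and the plaintext port:)

```
# print the latest (log-end) offset per partition; run this twice a few minutes apart —
# if the numbers don't move the topic is idle, and on pre-2.1 brokers committed offsets
# are only kept for offsets.retention.minutes after the last commit
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list kafka-main1001.eqiad.wmnet:9092 \
  --topic eqiad.mediawiki.page-delete --time -1
```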