[05:24:20] Neat WDQS-powered browser extension I came across when looking at https://phabricator.wikimedia.org/T362570: https://www.wikidata.org/wiki/Wikidata:Entity_Explosion
[10:04:53] added notes to the staff meeting slides, please double check and fix/add anything I would have missed (cc pfischer, gehel, dr0ptp4kt)
[10:05:53] hm... seems like the slides publicly editable so not sharing the link here
[10:08:08] lunch
[12:24:15] pfischer / dcausse: do we have a number for the updates / second on SUP?
[12:31:21] gehel: at the top of https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer&var-flink_job_name=cirrus_streaming_updater_producer_eqiad&var-operator_name=All
[12:32:05] consume rate total is the number of events ingested, produce rate total is the number of updates we will apply to elastic
[12:33:04] minus the updates triggered by the saneitizer
[13:08:46] gehel: additionally, if you are just interested in updates sent to ES, our dedicated SUP dashboard has a metric for that:
[13:08:46] https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?orgId=1&var-k8sds=eqiad%20prometheus%2Fk8s&var-opsds=eqiad%20prometheus%2Fops&var-site=eqiad&var-app=flink-app-consumer-cloudelastic&var-app=flink-app-consumer-search&var-operator_name=rerender_fetch_failure_sink:_Writer&var-fetch_operator_name=fetch&var-fetch_operator_name=rerender_fetch&var-quartile=0.5&var-fetch_by=authority&var-top_k=1&from=now-24h&to=
[13:08:47] now&viewPanel=50
[13:10:22] But, as dcausse pointed out, this includes updates from more sources than just page_change(-inflicted)
[13:35:22] ryankemper nice extension!
[13:35:26] o/
[17:11:01] dinner
[19:12:57] dcausse: thank you for the review and added tests, I’ll adapt the implementation tomorrow
[20:01:59] inflatador: o/
[20:02:15] ottomata I'm gonna try `helmfile destroy/apply` as opposed to `rollout restart` (already tried that and it failed)
[20:02:43] oh okay yes.
[20:02:50] sorry yeah, i think rollout restart is just going to delete pod like I did
[20:04:03] OK, destroy/apply finished, let's see what happens now
[20:05:13] better so far!
[20:05:16] we have task managers at least
[20:07:11] failed. looks like a json error?
[20:07:17] at least i have a cause now!
[20:08:30] back to having no task mgrs
[20:08:58] Ah, what are the next steps? Roll back a code change somewhere?
[20:10:38] no we haven't deployed anything in a long time
[20:10:40] i'm not sure
[20:10:44] trying to keep https://phabricator.wikimedia.org/T368667 updated
[20:14:19] We might have to wait for g-modena to come in tomorrow?
[20:14:38] The other question is why DPE SRE didn't get these alerts, at least I haven't found them yet. I can probably fix that
[20:15:53] dcausse do the errors listed above look familiar? Just curious if we have had similar issues with SUP
[20:16:46] i see them going to sre-observability@wikimedia.org
[20:16:46] cc: data-engineering-alerts@wikimedia.org
[20:17:28] that 2nd email addr looks right. Not sure if the list addrs are considered sensitive data though
[20:18:09] wait.
[20:18:12] i think i know.
[20:18:14] oh, I did get the alert btw
[20:18:20] started about an hr ago
[20:19:13] this is my bad
[20:19:14] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1047995
[20:19:16] needs deployed....
[20:19:20] the train rolled out to this wiki today
[20:19:24] where this problem is happening!
[20:19:29] on it...
[20:20:38] I did not follow through
[20:20:38] https://phabricator.wikimedia.org/T367923#9913264
[20:21:18] applying
[20:21:31] NP, glad you have the solution!
[20:23:33] inflatador: um, staging did not deploy for me
[20:23:40] helmfile was happy
[20:23:52] Error creating: pods "flink-app-main-6844786bb8-w2btg" is forbidden: violates PodSecurity
[20:23:55] in kubectl get events
[20:24:18] assuming it's weird staging stuff, trying codfw...
[20:24:45] I know we've been adding PodSecurity contexts lately, but I don't think they're obligatory?
[20:25:17] The git log on deployment-charts probably has some examples if we do end up needing to do that
[20:26:46] failed in codfw, but also my fault.
[20:26:51] need schemas in image
[20:26:55] releasing new image now...
[20:27:19] https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/pipelines/60820
[20:43:24] how's it looking?
[20:43:51] better
[20:43:52] Job status changed from CREATED to RUNNING
[20:44:48] gotta catch up from lag now
[20:53:02] ACK, ping me if you need anything else
[21:00:47] ottomata FYI if you need a user that has perms to `helmfile destroy` but you don't wanna use root, you can do `kube_env ${APP_NAME}-deploy $ENV`. Note the "deploy"
[21:01:25] sorry, should be `${NAMESPACE}-deploy`
[21:23:02] really!?
[21:23:11] cool, inflatador could you add that to the k8s wikitech page?
[21:23:20] inflatador: all looks better now!
[21:28:32] ottomata added to https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments . Glad everything looks good!