[08:35:54] to trigger a restart, restartNonce (even via helmfile --set) can be used I think: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#application-restarts-without-spec-change
[08:59:05] kafka-jumbo might need the same partition count increase
[10:54:10] dcausse: yes, they need re-partitioning too. I wonder if it would make sense to reduce the retention, though, so replicated topics take up less space in jumbo.
[10:56:41] dcausse: would you have some time after lunch? I’d like to re-deploy the SUP producer (now consuming page_rerender)
[10:56:43] yes, might depend on how fast we react when the kafka ingestion pipeline jumbo -> hive has issues
[10:57:09] pfischer: sure
[10:58:53] pfischer: we can do it now if you have time, should not take much time
[10:59:24] Sure, one moment, I’ll trigger the container build
[11:01:35] Hm, looks like the dry-run test currently fails during CI, so if you want to leave, feel free
[11:02:26] …and it passes again, snow-flaky behaviour, it’s winter after all
[11:03:15] dcausse: To join the video meeting, click this link: https://meet.google.com/qsw-hfyj-air
[11:42:40] dcausse: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/980360
[11:51:47] pfischer: +2ed
[11:52:18] Thanks, I’ll ping Ben to see if he knows how to get +2 permissions.
[12:06:02] lunch
[12:06:56] pfischer: you should probably make a task for the records
[12:07:21] It will be either an ldap group or direct in gerrit
[12:07:47] I imagine for operations/* the 'ops' ldap group does
[12:16:08] RhinosF1: thanks, I’ll create a task.
[12:16:10] lunch
[13:36:59] dcausse: deployment now fails with “There is no operator for the state …”. Is there a workaround or do I have to start and explicitly ignore save points?
[14:05:27] test test...bouncer problems again?
[14:05:51] looks like we had a couple of WDQS alerts in eqiad just now...one has already recovered
[14:06:42] o/
[14:06:46] inflatador: I can read you (bouncer working?)
[14:08:04] inflatador: might be good to take a few stacktraces on one of the stuck servers in case we want to do further analysis
[14:09:52] gehel ACK, will take a look
[14:10:49] looks like they've already all recovered
[14:12:10] gehel: Could you approve my gerrit permission request, please (just in case): https://phabricator.wikimedia.org/T352767
[14:13:05] inflatador: would you know how I can make a flink deployment that ignores save points?
[14:13:50] Do I have to clear the stored savepoints or is there a flag I can use to ignore existing save points?
[14:15:51] pfischer I think if we update the helm chart to have no save points, it will ignore them
[14:16:31] dcausse ^^ didn't we do this with commons staging?
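The “There is no operator for the state …” failure generally means the savepoint holds state for an operator id that no longer exists in the new job graph; the remedies discussed below are restoring with allowNonRestoredState or, longer term, pinning an explicit uid on every operator. A minimal, generic Flink sketch of the uid approach (made-up operator names and a toy source, not the actual SUP topology), including the fail-fast option for missing uids that comes up later in the conversation:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExplicitUids {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fail at submission time if any operator would fall back to an auto-generated uid;
        // auto-generated ids change whenever the job graph changes, which is what breaks
        // savepoint restores.
        env.getConfig().disableAutoGeneratedUIDs();

        DataStream<String> events = env
                .fromElements("a", "b", "c")
                .uid("example-source");          // hypothetical, stable uid

        events.map(String::toUpperCase)
                .uid("example-map")              // stateful operators especially need a fixed uid
                .print()
                .uid("example-sink");

        env.execute("explicit-uid-example");
    }
}
```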
[14:19:40] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/rdf-streaming-updater/values-staging-commons.yaml#17 I think this is all that needs to be done
[14:19:44] yes, can happen when the job graph changes, here it might mean that we forgot to set some operator uids
[14:20:07] we can also delete the save points, but if the helm chart is pointing to missing savepoints, it will probably not run
[14:21:10] we don't have to delete the savepoint in this case
[14:22:46] pfischer: you can try to redeploy with --set app.job.allowNonRestoredState=true
[14:22:59] or --set app.job.allowNonRestoredState true
[14:23:06] can't remember the syntax
[14:32:13] dcausse: I checked, the only operator added would be the page_rerender source and that gets a uid assigned. Do I have to pass the --allowNonRestoredState flag along with the launch command inside the helm chart?
[14:33:23] pfischer: that has to go either in the values.yaml or as an additional arg to helmfile -e staging -i apply --set app.job.allowNonRestoredState=true
[14:34:01] but that means an operator had its uid changed so it's likely that we forgot to assign some uids manually
[14:37:50] to discard the state the procedure is different, you have to destroy the deployment and deploy it again
[14:46:41] dcausse: Okay, I’ll inspect the graph locally to see where we're missing a UID. I ended up passing the flag via --set "app.job.args={…,--allowNonRestoredState}"
[14:47:11] oh weird...
[14:47:58] I believe we can activate that option to force a failure
[14:48:22] I was not sure the flag would have been picked up under app.job and all the instructions I could find told me to pass it as a CLI arg
[14:49:39] we use it here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/rdf-streaming-updater/values.yaml#32
[14:50:02] I should disable it now tho, since we're done migrating away from the old kafka apis
[14:50:46] Okay, so that's a flink-operator convention then, to look them up under app.job. Good to know. Thanks!
[14:50:57] dcausse: yep, disabling auto-generated UIDs reveals the missing one
[14:54:34] I think we assigned a uid to most operators with a state but I'm sure it's not that obvious to determine which operator has a state
[14:55:21] the annoying part is even if that state is completely empty flink will complain...
[14:57:22] interesting to see the checkpoint size of the producer graph increasing from a couple of KBs to several MBs now that we have page-rerender on big wikis
[15:41:19] writing to the update stream seems more spiky than I was expecting (https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=test-eqiad&var-kafka_broker=All&var-topic=DataHubUpgradeHistory_v1&var-topic=eqiad.cirrussearch.update_pipeline.update.rc0)
[15:42:22] looks like the window timers are not spread as the doc suggests
[15:52:31] dcausse: no, I was wondering that too: I would expect the event-time-based windows to be spread instead of emitting a batch every 5m
[15:52:54] s/no/yes
[15:54:53] I thought that the WindowStagger.NATURAL would have a better effect than that :/
[15:55:12] perhaps it's not being used?
[15:55:22] and defaults to ALIGNED?
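For reference, the stagger strategy is selected when the tumbling assigner is constructed, via the three-argument factory; if only the window size is given, ALIGNED is the default. A minimal sketch, assuming a 5-minute event-time window and a wiki+page_id style key roughly like the producer's (the Tuple2 element type, key format, and timestamping here are illustrative only, not the real pipeline):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.WindowStagger;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowStaggerExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Long>> updates = env
                .fromElements(Tuple2.of("enwiki:123", 1L), Tuple2.of("dewiki:456", 1L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                                .withTimestampAssigner((event, ts) -> System.currentTimeMillis()));

        updates
                .keyBy(t -> t.f0) // wiki + page_id style key
                // The third argument selects the stagger; NATURAL shifts the window start by the
                // subtask's first-seen processing time, so the offset is per subtask, not per key.
                .window(TumblingEventTimeWindows.of(Time.minutes(5), Time.seconds(0), WindowStagger.NATURAL))
                .sum(1)
                .print();

        env.execute("window-stagger-example");
    }
}
```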
[15:55:38] gehel is there an official page for Data Platform? I found https://wikitech.wikimedia.org/wiki/Shared_Data_Platform . This is for the IRC topic in the new room
[15:57:41] inflatador: it's in flux :)
[15:58:03] For the teams: https://www.mediawiki.org/wiki/Data_Platform_Engineering
[15:58:21] gehel works for me
[15:58:27] for the systems: https://wikitech.wikimedia.org/wiki/Data_Engineering
[15:59:21] we still haven't entirely updated all the docs (far from it). But the goal is to have team documentation (processes, members, contacts, ...) on mw.o and technical documentation and systems on wt.o
[16:04:20] \o
[16:09:07] o/
[16:09:21] .o/
[16:23:18] o/
[16:23:48] oh wow, we missed a lot of uids
[16:24:01] nice finding an option to turn them off
[16:25:31] Yeah, I ran into flink complaints about unmapped state today, so hopefully that helps to prevent that in the future.
[16:26:09] i didn't get a chance to dig into why the metrics don't seem to add up yesterday, will look today
[16:26:34] dcausse: regarding the spiky natural window staggering: According to the docs, the staggering happens per parallel operation. So if we only have one active window in parallel, this is to be expected.
[16:27:38] ebernhardson: cool, thanks, I was about to look into that.
[16:30:27] should we make those UID changes to rdf-streaming-updater?
[16:30:35] since we key by wiki+page_id I was expecting plenty of windows
[16:41:40] that's what i was expecting too, flink docs about the TumblingTimeWindow aren't all that clear and the doc on windows manages to not even mention WindowStagger.NATURAL
[16:47:44] reading closer, it seems like if you expect to receive hundreds of events per second, WindowStagger.NATURAL might almost do nothing compared to WindowStagger.ALIGNED
[16:50:28] dr0ptp4kt ryankemper may be 5-10m late to the split graph mtg
[16:50:44] it makes me wonder what the point of natural stagger is, in that case if you are receiving a new event every couple of ms then ALIGNED and NATURAL are almost the same
[16:56:34] :/
[16:59:22] ryankemper: as discussed https://docs.google.com/spreadsheets/d/1EYK0x4GCxv1fG77Co-S-uBiPqrvXRiLzIW34kOAJWws/edit#gid=460610916
[17:02:23] maybe we could use a custom window assigner, but not sure
[17:04:06] hm... metrics on the number of timers are only available indirectly from rocksdb state metrics :/
[17:05:54] ryankemper is it okay to meet in 25 minutes? inflatador: dcausse and i thought we were meeting 5 minutes ago, but looking at the cal i had actually set that meeting to 25 minutes from now, after ryankemper is finished talking with bblack. timeception
[17:05:57] I thought the window was bound to the key, here it suggests that a single timer is re-used for multiple keys
[17:06:13] reading the code makes this a little more obvious, the WindowStagger interface doesn't accept any info about the key being processed, it only gets provided the current time and the window size
[17:06:29] so WindowStagger.NATURAL couldn't do what we want
[17:07:06] TumblingTimeWindow itself is reasonably simple, so we could probably put together something.
[17:07:15] sure
[17:07:56] is it worth it? I feel like for our use case it's important. But maybe i'm just stuck on a previously decided solution :)
[17:09:15] spikes of 300+ evts/s every 5mins does not seem ideal but who knows? better to use more timers to spread this a little more no?
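A rough sketch of the "custom window assigner" idea mentioned above: assignWindows() never sees the key, but it does receive the element, and the key fields (wiki + page_id) are part of the element, so a deterministic per-key offset can be derived from it and the firing times spread per page rather than per subtask. This is not code from the pipeline, just one possible shape of it; the class name is made up, and using element.hashCode() assumes the element's hash is computed over the key fields only:

```java
import java.util.Collection;
import java.util.Collections;

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
import org.apache.flink.streaming.api.windowing.triggers.EventTimeTrigger;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * Tumbling event-time windows whose boundaries are shifted per element, so windows for
 * different pages end (and fire) at different times instead of all firing on the same
 * aligned 5-minute boundary.
 *
 * Hypothetical usage: keyedStream.window(KeyStaggeredTumblingWindows.of(Time.minutes(5).toMilliseconds()))
 */
public class KeyStaggeredTumblingWindows extends WindowAssigner<Object, TimeWindow> {

    private final long size;

    private KeyStaggeredTumblingWindows(long size) {
        this.size = size;
    }

    public static KeyStaggeredTumblingWindows of(long sizeMillis) {
        return new KeyStaggeredTumblingWindows(sizeMillis);
    }

    @Override
    public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
        // Derive a stable offset from the element's key fields (assumed to drive hashCode()).
        // floorMod keeps the offset in [0, size) even when hashCode() is negative.
        long stagger = Math.floorMod(element.hashCode(), size);
        long start = TimeWindow.getWindowStartWithOffset(timestamp, stagger, size);
        return Collections.singletonList(new TimeWindow(start, start + size));
    }

    @Override
    public Trigger<Object, TimeWindow> getDefaultTrigger(StreamExecutionEnvironment env) {
        return EventTimeTrigger.create();
    }

    @Override
    public TypeSerializer<TimeWindow> getWindowSerializer(ExecutionConfig executionConfig) {
        return new TimeWindow.Serializer();
    }

    @Override
    public boolean isEventTime() {
        return true;
    }
}
```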
[17:09:25] BTW: ES bulk metrics: https://grafana-rw.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?orgId=1&var-k8sds=eqiad%20prometheus%2Fk8s-staging&var-site=eqiad&var-service=cirrus-streaming-updater&var-opsds=eqiad%20prometheus%2Fops
[17:09:56] pfischer: nice! those numbers are much better than what i saw in the flink ui
[17:10:24] i wonder what was up with the UI then, i left it attached for an hour and it made it up to like 4 noops
[17:11:16] Good question. But they did change over time, just slowly?
[17:13:02] i added 5 metrics for each of the rev_based_update options, only the noop one ever incremented. It was from the operator metrics tab
[17:14:15] way fewer page_rerender noops than I was expecting, it's ~50% noops here
[17:14:54] looking at this graph, there are ~40 rev_based_update "updated" increments in the last 5 minutes. I just reopened it and displayed the same metric in the flink ui and it's reporting a flat 0
[17:15:23] as long as it's in grafana i suppose that's plenty, just curious :)
[17:15:28] s/grafana/prometheus/
[17:16:58] How do you expose the web UI? Do you tunnel?
[17:17:43] pfischer: yes, ssh tunnel to deployment host, then i use the following (could probably be a simple grep, but i was playing with jq :P) to do the port forwarding: kubectl port-forward "$(kubectl get -o json pods | jq -r '.items[].metadata.name | select(. | contains("consumer")) | select(. | contains("taskmanager") | not)')" 8081:8081
[17:18:11] mostly it's ` port-for` and press enter
[17:19:48] curiously the ui is also showing the writer at 100% busy
[17:21:41] dr0ptp4kt: dcausse: inflatador: alright finished up traffic meeting
[17:22:12] did yall wanna meet now or stick with the original time (8 mins from now)
[17:23:42] available now if you all want to do it earlier
[17:24:31] i'm ready. i'll hop on. looks like brian may be a bit, but that's okay, we'll catch him up when he joins
[17:24:44] kk joining
[17:25:00] ^ ryankemper
[17:56:29] * ebernhardson never thought about how `-5 % 7` is a negative result, for some reason always thought of modulus as emitting a positive value
[17:59:23] dinner
[18:34:13] wow, I never thought about mod on a negative either
[18:38:30] lunch, back in time for pairing
[18:38:34] dinner
[19:12:58] back
[19:34:13] ryankemper, inflatador: I'll skip pairing once more (sorry for the late message)
[19:34:51] gehel oops, reminded me to join
[19:35:01] :)
[20:09:56] ryankemper ebernhardson ldf changes are up on the prometheus servers now...let's see if we get any phab tickets ;)
[20:14:47] aaaand the answer is "yes"
[20:15:38] https://phabricator.wikimedia.org/T352807
[20:16:56] inflatador: so I gather the check isn't working as intended then?
[20:17:37] ryankemper correct...I silenced it, will try and figure out why it's still alerting. Also need to figure out why it's still bugging ServiceOps
[20:17:46] it created a phab task for them as well
[20:23:09] Merging the regex patch
[20:37:59] Damn, we need to add x-real-ip to our nginx config...annoying to troubleshoot the pollers, but it looks like they need a blackbox service reload to apply changes
[20:49:48] * ebernhardson wonders what he did wrong...my local run of the producer socks proxied to a real kafka server is now exiting without any error messages :S
[20:51:51] nope, still a problem...I noticed that we are the only check that uses "body"...others appear to have done it manually
[20:52:06] errr..automatically. Looks like their bb checks come from service catalog entries
[20:53:22] * ebernhardson is dumb and provided an old kafka-source-end-time :P
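On the `-5 % 7` observation earlier: in Java the remainder operator keeps the sign of the dividend, so any window offset derived from a hashCode() needs Math.floorMod to stay non-negative. A tiny standalone illustration (nothing here is project code):

```java
public class FloorModExample {
    public static void main(String[] args) {
        System.out.println(-5 % 7);                // -5: remainder takes the dividend's sign
        System.out.println(Math.floorMod(-5, 7));  //  2: always in [0, 7)
        // Relevant when deriving an offset from a hashCode(), which can be negative.
    }
}
```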
[21:00:48] Hmm, seems like this might work. I took EventTimeSessionWindows, removed a single line from the implementation so the end of a session stays fixed, and it looks to now be regularly emitting events instead of a giant batch every 5 minutes
[21:00:52] but now i need tests :P
[21:12:24] Now I'm really confused...if I curl 'https://query.wikidata.org/bigdata/ldf' from prom1006 it ends up in the nginx access log. But if I add a query, it doesn't make it into the access log at all
[21:15:13] now we're getting puppet errors...time to revert
[21:26:30] inflatador: I'm around for the revert
[21:26:47] looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/980468 will need a rebase
[21:27:13] ryankemper ACK, I'm trying to revert the regex patch first...hoping that fixes it but ready to revert the other if not
[21:28:01] inflatador: could even revert them both in the same patch. the first iteration wasn't working anyway
[21:29:00] ryankemper both are broken so reverting the other too...oddly enough, codfw works
[21:29:52] inflatador: I find that a bit surprising cause I see a few codfw hosts on https://puppetboard.wikimedia.org/nodes?status=failed
[21:33:03] ryankemper yeah, that's confusing. PCC worked with it and I ran puppet on 2007 prior to the revert and it worked too
[21:34:01] ryankemper it's reverted/fixed now, but I wonder if it had something to do with the different tiers
[21:36:47] !log bking@prometheus1006 re-enable puppet T347355
[21:36:47] inflatador: Not expecting to hear !log here
[21:36:48] T347355: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355
[21:42:17] ryankemper confirmed, it was the tiers https://phabricator.wikimedia.org/T347355#9384929
[21:44:44] inflatador: 1016 and 2015 were also listed as failing on the puppetboard though
[21:45:06] those are missing from the list and they are `wdqs-internal` servers
[21:46:11] inflatador: oh wait I have it backwards
[21:46:14] it's only internal hosts failing, I see
[21:47:05] inflatador: yeah you'd need a default lookup in https://gerrit.wikimedia.org/r/c/operations/puppet/+/979983/23/modules/profile/manifests/query_service/wikidata.pp that returns a string that isn't `wdqs1015.eqiad.wmnet`
[21:48:52] ryankemper ACK, gotcha. Looking at the various yaml files in hieradata/role/common/wdqs/ , it looks like we aren't too DRY ... so we'd probably just want to resubmit that part of the patch with the values for all the YAML files
[21:49:30] inflatador: we don't need entries in hiera for the non-public hosts actually
[21:49:55] we can just add `{'default_value' => ""}` to the lookup
[21:50:06] oh good call
[21:51:58] the blackbox check is still borked though. Do you know if we send our nginx logs to logstash? I'm seeing some major weirdness where if I add a query string to the URL, it gets a 200 and says it's served by wdqs1015, but it doesn't appear in the nginx log
[21:52:10] works fine w/out query string
[21:52:40] interesting
[21:52:47] no guesses here on that one haha
[21:53:06] I'll get a patch up with the default_value thing though. (We don't need to merge it now but just wanted to get it up)
[21:56:08] cool. I think a check will work as long as we omit the body params, but I'd still like to hold up until we have more info
[21:57:22] ebernhardson any idea if we send our nginx logs to logstash? I don't see anything explicit in rsyslog.d
[21:57:33] WDQS nginx logs, that is
[21:58:37] inflatador: hmm, i doubt it
[22:04:15] ebernhardson no worries. I don't see the traffic going to any other hosts' nginx, and I see many other log entries hitting the LDF endpoint with a query string, so I dunno
[22:18:20] Meh can't get `Hosts: A:wdqs-all` to work in https://gerrit.wikimedia.org/r/c/operations/puppet/+/980499 (`[ 2023-12-05T22:14:59 ] CRITICAL: Unexpected error running run_host: Unable to find fact file for: A:wdqs-all under directory /var/lib/catalog-differ/puppet`)
[22:18:44] But I've seen it used here https://gerrit.wikimedia.org/r/c/operations/puppet/+/855962
[22:20:56] Weird, I wonder if it has something to do w/how cumin aliases are created
[22:23:17] meanwhile, I'm thinking the ldf stuff might only be a problem when connecting with the wdqs-ldf.discovery.wmnet hostname
[22:25:41] ryankemper: it 100% still exists https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/templates/cumin/aliases.yaml.erb#L410
[22:29:12] ryankemper: I don't see any mention of Aliases on https://wikitech.wikimedia.org/wiki/PCC
[22:30:45] P:queryservice::wikidata might work?
[22:30:46] RhinosF1: no, the documentation's slightly out of date. I know for sure you can do stuff like `P:trafficserver::backend` now but not sure what the deal is with aliases
[22:31:01] ryankemper: P will work 100%
[22:31:40] Yeah, I was gonna do something like that. I might need 3 different `P::` to get it to actually test one each of `wdqs-public`, `wdqs-internal`, and `wdqs-test`
[22:36:31] I guess envoy is cutting around nginx? I don't really understand why a curl gets a 200, but no log lines from nginx
[22:38:59] * RhinosF1 groans about CI being slow on his test patch
[22:40:43] https://gerrit.wikimedia.org/r/c/operations/puppet/+/980470 also fails so it's not because it's an alias with aliases
[22:42:52] I've also created https://gerrit.wikimedia.org/r/c/operations/puppet/+/980503/ to revert our trafficserver LDF change...should make it easier to troubleshoot
[22:45:59] Headed out, have a good one all
[23:47:31] I wonder how intellij decides what static imports to offer when asking it to find functions i can import (alt-enter). For whatever reason when typing `verify` it only offers Verify.verify, but not Mockito.verify. But typing `never` does get the option to import Mockito.never. I've just manually written the correct static import, but kinda tedious
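For the record, the two candidates the IDE is choosing between are unrelated static methods that happen to share a simple name: Guava's com.google.common.base.Verify.verify(boolean) and Mockito's org.mockito.Mockito.verify(mock). A small hypothetical test fragment with the Mockito imports written out by hand (assumes Mockito is on the classpath; the mocked List is just an example target):

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.never;
import static org.mockito.Mockito.verify;

import java.util.List;

class StaticImportExample {
    void exercise() {
        List<?> target = mock(List.class);
        // Mockito's verify(T mock, VerificationMode), not Guava's Verify.verify(boolean):
        // same simple name, different classes, hence the ambiguous import suggestions.
        verify(target, never()).clear();
    }
}
```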