[09:27:39] unsure if expected, but the SUP producer has been down since yesterday ~16 UTC, failing with "Caused by: java.lang.IllegalStateException: There is no operator for the state 9618cf011d954175ffd5bd1895aa7d5b"
[09:59:12] and on the consumer side, if I trust the flink metrics, it seems to have been unable to keep up with the update rate, which could possibly explain the divergence we see
[09:59:32] can't correlate that without kafka metrics tho...
[09:59:38] s/with/without
[09:59:50] meh can't type
[10:00:37] I mean I can't seem to see the same shape in the kafka lag metrics, but I'm not even sure I'm looking at the right graph
[10:08:49] ah, it's because we don't seem to export lag metrics (with kafka burrow) for test-eqiad
[10:33:17] errand+lunch
[12:46:40] dcausse: I noticed it was down this morning but didn't look into it. Will do that now.
[12:52:31] dcausse: I tried passing in app.job.allowNonRestoredState=true but that does not solve it.
[13:17:36] pfischer: ok, that's annoying, it's probably the command I mentioned that's wrong, I'll check
[13:23:12] I tried both app.job.allowNonRestoredState and app.job.args={…,--allowNonRestoredState}
[13:26:27] I see "allowNonRestoredState":null in the arg list now, I wonder if it's causing a problem
[13:27:23] no, that's the last stable spec, my bad
[13:36:00] hm... I manually forced a restart with: helmfile -e staging --selector name=producer -i apply --set app.job.allowNonRestoredState=true --set app.restartNonce=2 --context 5
[13:36:29] and it resumed... perhaps we waited too long?
[13:36:39] still curious tho...
[13:40:50] it had a hard time recovering and failed a dozen checkpoints, but now it seems stable
[13:41:38] dcausse: thanks
[13:44:12] But as you said, it's curious to see the numbers for ES requests go up as if nothing happened: https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?forceLogin=&forceLogin=&from=1701870179007&orgId=1&to=1701956579007&var-k8sds=eqiad+prometheus%2Fk8s-staging&var-opsds=eqiad+prometheus%2Fops&var-service=cirrus-streaming-updater&var-site=eqiad
[13:44:28] (Even between yesterday 16:00 and now)
[13:45:06] pfischer: only the producer side failed, but the consumer is lagging behind, so it was still ingesting its backlog
[13:46:02] the consumer might need some tuning on the asyncIO capacity to handle the pressure
[13:46:09] Sure, maybe curious was not the right term. It's alarming, and we have to increase the throughput
[13:46:30] yes
[14:03:58] Okay, I added a calculated rate metric for the bulk ingestion rate to the dashboard: 16 ops/s. If that's correct, it's bad.
[14:05:41] …and would explain the lag
[14:06:16] some metrics on the fetch operator might be useful as well, so that we get a sense of the MW response time
[14:07:37] Sure, good point.
[14:16:00] o/
[14:16:53] dcausse: But shouldn't we be able to derive the request duration from the I/O rate of the fetch operator? If that does not show back pressure we can assume that ES bulk is the bottleneck.
[14:44:28] pfischer: oh, I was assuming that the fetch operator was the bottleneck, but it could well be elastic, it's relforge after all
[14:49:56] hmm.. I see only 2 flink_taskmanager_job_task_busyTimeMsPerSecond and I think flink is collapsing many operators into a single slot, so it might be hard to see what's blocking... the one being pressured is "enrich_page_change_with_revision____Process_____Sink:_Writer____Sink:_Committer__Filter____Sink:_Writer__Filter____Sink:_Writer__Filter____Sink:_Writer"
[14:50:15] which is basically everything (I think?) but the kafka source
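[Editor's note] For context on the "There is no operator for the state …" failure and the collapsed busyTime metrics: Flink maps savepoint state to operators by uid, so state written for an auto-generated uid that no longer exists in the new job graph can only be dropped via allowNonRestoredState, while pinning explicit uids avoids the mismatch in the first place; un-chaining an operator also makes flink_taskmanager_job_task_busyTimeMsPerSecond show up per task instead of once for the whole chain. A minimal sketch of that wiring, with placeholder source/function/sink names that are not the actual SUP code:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// kafkaSource, watermarks, EnrichPageChangeWithRevision and elasticSink are hypothetical stand-ins
env.fromSource(kafkaSource, watermarks, "cirrussearch_update_pipeline_source")
    .uid("update-pipeline-source")          // explicit uid: savepoint state stays mappable across releases
    .process(new EnrichPageChangeWithRevision())
    .uid("enrich_page_change_with_revision")
    .disableChaining()                      // break the chain so busyTimeMsPerSecond is reported per task
    .sinkTo(elasticSink)
    .uid("cirrus-elastic-sink");

env.execute("cirrus-streaming-updater");
```

With stable uids in place, app.job.allowNonRestoredState=true (as passed in the helmfile restart above) only discards state for operators that were genuinely removed from the graph.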
[15:11:37] https://grafana-rw.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?forceLogin=&forceLogin=&forceLogin=true&from=1701918677452&orgId=1&to=1701961877452&var-k8sds=eqiad+prometheus%2Fk8s-staging&var-opsds=codfw+prometheus%2Fops&var-service=cirrus-streaming-updater&var-site=codfw&var-operator_name=enrich_page_change_with_revision&var-operator_name=Source%3A_cirrussearch_update_pipeline_update_rc0_source&var-app=flink-app-consumer-search&var-app=flink-app-producer
[15:12:20] Looks like we see a rate of <=20 ops/s throughout the whole consumer app
[15:18:10] Commuting, back in 30' 🚲
[15:21:11] can't seem to find a metric regarding backpressure that's granular enough to distinguish the sink and the io operator; guessing that backpressure is only monitored when there's buffering between the operators (shuffling)
[15:29:56] we don't seem to set the asyncio capacity from the chart, so it should be the default of 100, which seems particularly high...
[15:31:11] 100 concurrent requests cannot produce only 20 ops/s, and relforge should be able to ingest more than 20 docs/s... something's not correct
[15:46:09] pfischer: in case you missed it, the flink-app dashboard has plenty of neat flink metrics already (https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app)
[16:02:39] if we want to remove relforge as a bottleneck, we could probably stand up a super simple http server that just responds with the "ok" response but doesn't actually write anything
[16:04:07] 20 seems very low even for relforge, but I haven't looked at the server to see if it was struggling
[16:04:22] hmm, 20 writes per sec? that does sound really low
[16:04:55] yes...
[16:27:45] We are publishing to update_pipeline at ~50 records/s but at the consumer side I only see a rate of 20 records/s. So I would expect an increasing consumer lag, but can't see that in the dashboard: https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=test-eqiad&var-topic=eqiad.cirrussearch.update_pipeline.update.rc0&var-consumer_group=All
[16:28:10] hmm, is it all being loaded into flink's state for the async operator somehow?
[16:28:29] i was expecting that to fill and backpressure, but i have no clue what limits the "full" state
[16:28:58] the async operator only buffers the inflight events
[16:29:40] hmm, ya that seems sensible and shouldn't make that difference then
[16:29:41] pfischer: we don't capture consumer lag metrics for the kafka cluster test-eqiad
[16:30:47] the only sensible metric I found is flink_taskmanager_job_task_busyTimeMsPerSecond, which shows that the problem is in or after the async operator
[16:31:45] flink does capture the consumer lag on its own, should be visible in the flink-app dashboard
[16:33:53] https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s-staging&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search&var-flink_job_name=cirrus_streaming_updater_consumer_search_staging&var-operator_name=All&viewPanel=73&orgId=1
[16:34:28] the bump in the middle is when the producer job was restarted
[16:36:04] the backlog is still in kafka, not flink
[16:36:41] Hmm, how many items do we allow in parallel in the async operator? Our typical job run is a few hundred ms, could probably assume that's mostly doc building; i suppose i'm wondering what our estimated throughput would be
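[Editor's note] As a rough estimate, an async operator with capacity C and average request latency L tops out around C / L requests per second, so a capacity of 100 at ~250 ms allows roughly 400 req/s; a sustained ~20 ops/s therefore points at a much smaller effective concurrency somewhere. A sketch of where the capacity parameter lives, with hypothetical stream and function names rather than the real SUP classes:

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;

// updateStream is a DataStream<UpdateEvent>, fetchFunction an AsyncFunction calling the MW API (both hypothetical)
int capacity = 100;       // max in-flight fetches admitted by the operator
long timeoutMs = 60_000L; // per-element timeout before the request counts as failed

DataStream<EnrichedUpdate> enriched = AsyncDataStream.orderedWait(
    updateStream, fetchFunction, timeoutMs, TimeUnit.MILLISECONDS, capacity);
```

Note that orderedWait emits results in input order, so a single slow request can hold back everything behind it even when the capacity is far from exhausted.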
[16:37:14] should be 100 concurrent requests by default, and this also does not make sense :/
[16:37:33] hmm, yea. 100 concurrent requests at a few hundred ms should not result in 20/s :)
[16:37:55] do we have a metric for in-flight requests?
[16:38:05] * ebernhardson is probably headed down the wrong path though
[16:38:28] we use ordered output, perhaps that's causing some throughput slowdowns, but I'd not expect only 20/s
[16:38:59] was searching for some asyncIO metrics but could not find any yet
[16:43:37] As long as flink is not smart, that is, it's not throttling at the beginning of the graph when a downstream operator cannot withstand a certain rate, then I don't understand why we see 20 ops/s starting from the kafka source. If fetching was the bottleneck, then I'd expect a high out on the source but a low out (+ back pressure). But let me include a metric for the in-flight requests.
[16:44:19] …low out [for the fetch operator]
[16:48:36] pfischer: my understanding is that once an operator is backpressured, everything should run at the same rate (the slowest link in the chain)
[16:49:45] if the kafka source was still consuming more msg/s then they would have to be buffered somewhere
[16:50:00] or dropped
[16:50:27] does it seem worthwhile to set up a fake elasticsearch? Or maybe we could plug a /dev/null sink into kafka? Basically to remove elasticsearch from the equation and verify which side the problem is on
[16:50:32] s/into kafka/into flink/
[16:51:05] if it's easy? sure, personally I'm more suspicious of something fishy happening on the fetch side
[16:52:00] it's basically swapping the sink for BlackHoleSink, i can work that up. But indeed it might not be relevant
[16:52:37] oh yes sure, that seems like an interesting feature to have, I thought that you were going to set up a fake elastic service
[16:53:08] dcausse: that was a thought, a little python script that counts the number of input lines and emits a 200 response for input lines / 2, but I realized a black hole is easier
[16:53:17] +1
[16:54:23] also we can force a shuffle in the graph to separate the async op and the sink, that should help to see more granular metrics in flink_taskmanager_job_task_busyTimeMsPerSecond
[17:01:23] workout, back in ~40
[17:04:47] the http client seems to be set with DEFAULT_MAX_CONNECTIONS_PER_ROUTE = 5, so that'd be 5 concurrent requests, not 100?
[17:05:15] dcausse: oh! that would add up. 5 at a couple hundred ms comes close
[17:09:06] good news is that being this far behind might explain why relforge was having so many different docs compared to prod indices
[17:09:15] ahh, yea that would make sense too
[17:10:00] also the count was going down most of the night, then started going up recently. looks like the minimum value was at 0900 UTC
[17:10:58] very possible that page rerenders were fixing things (yet unseen rev updates)
[17:17:39] making a quick patch and tempted to reduce the default from 100 to 25
[17:17:50] 100 seems particularly aggressive
[17:19:20] makes sense
[17:21:54] Do you still want the pending_requests metric? https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/72
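[Editor's note] To illustrate both points above (the per-route connection cap and the in-flight request metric), here is a sketch of an async fetch function that raises the client's per-route limit explicitly and exposes a pending-requests gauge. The class names, the metric name and the use of Apache HttpAsyncClient are assumptions for illustration, not the code in the MR linked above:

```java
import java.util.Collections;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.concurrent.FutureCallback;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;

// Hypothetical fetcher: input/output types and the URL handling are placeholders, not the SUP implementation.
public class InstrumentedFetcher extends RichAsyncFunction<String, String> {
    private transient CloseableHttpAsyncClient client;
    private transient AtomicInteger inFlight;

    @Override
    public void open(Configuration parameters) {
        inFlight = new AtomicInteger();
        // raise the per-route cap; the async client otherwise keeps a small default
        client = HttpAsyncClients.custom()
            .setMaxConnPerRoute(25)
            .setMaxConnTotal(25)
            .build();
        client.start();
        getRuntimeContext().getMetricGroup()
            .gauge("pending_requests", (Gauge<Integer>) () -> inFlight.get());
    }

    @Override
    public void asyncInvoke(String revisionUrl, ResultFuture<String> resultFuture) {
        inFlight.incrementAndGet();
        client.execute(new HttpGet(revisionUrl), new FutureCallback<HttpResponse>() {
            @Override public void completed(HttpResponse response) {
                inFlight.decrementAndGet();
                resultFuture.complete(Collections.singleton(response.getStatusLine().toString()));
            }
            @Override public void failed(Exception ex) {
                inFlight.decrementAndGet();
                resultFuture.completeExceptionally(ex);
            }
            @Override public void cancelled() {
                inFlight.decrementAndGet();
                resultFuture.complete(Collections.emptyList());
            }
        });
    }

    @Override
    public void close() throws Exception {
        client.close();
    }
}
```

With 25 connections per route and requests taking a couple hundred ms, this would top out somewhere near 100 req/s, comfortably above the observed ~20 ops/s.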
[17:22:18] pfischer: i think that could still be useful just for understanding how things are going, +1
[17:22:28] +1 too
[17:24:17] I didn't adapt the test to properly initialise the `BypassingCirrusDocFetcher` that is now a `RichFunction` (to have access to operator metrics). I would add that if we see value in the metric.
[17:25:56] …and will keep it.
[17:26:36] also have https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/73 for the null sink, it does very little but might make it easier to verify backfilling capacity as well
[17:26:57] I suppose i could find a way to add some tests, but it didn't seem too important
[17:29:51] hm... for some reason tests are now failing because of my locale (Connection reset by peer vs Connexion ré-initialisée par le correspondant)
[17:29:58] no big deal tho
[17:29:58] heh
[17:39:26] dinner
[17:48:28] back
[17:49:14] ryankemper inflatador able to join Meet for the Transfer.py and loadData.sh piece?
[17:49:52] dr0ptp4kt sorry, I forgot...omw
[18:13:36] dr0ptp4kt ryankemper I started the transfer from wdqs1024 instead of the stat machines this time...10Gbps vs 1Gbps ethernet, so it should be faster
[18:13:42] thx!
[18:14:04] pfischer: was wondering if we could bridge something like: https://github.com/dropwizard/metrics/tree/release/4.2.x/metrics-httpclient5/src/main/java/com/codahale/metrics/httpclient5
[18:14:35] nice
[18:15:58] Zbyszko wrote a wrapper at some point: https://gerrit.wikimedia.org/g/wikidata/query/rdf/+/0922c4c4dead1e4937f3c3b4afeb89eb79bd808d/streaming-updater-producer/src/main/java/org/wikidata/query/rdf/updater/DropwizardToFlinkListener.java might be useful perhaps?
[18:17:35] dinner
[18:32:47] dr0ptp4kt ryankemper the transfer is finished, y'all should be good to start loadData.sh on wdqs1024
[18:42:47] lunch, back in time for pairing
[18:43:23] inflatador: thx. do you mean wdqs1023?
[18:44:36] dcausse: Sure, I'm in bridging mode for flink+ES anyways 😉 I'll look into that tomorrow
[18:48:43] oh yeah, i see inflatador on wdqs1023:
[18:48:45] dr0ptp4kt@wdqs1023:/srv$ ls -a T350106/*/*/*/nt_wd_schol | head -3
[18:48:45] .
[18:48:45] ..
[18:48:45] part-00000-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.txt.gz
[18:48:46] T350106: Implement a spark job that converts a RDF triples table into a RDF file format - https://phabricator.wikimedia.org/T350106
[18:50:19] dr0ptp4kt: oh dear, my phone died overnight...up now
[18:50:54] so the thing will be to (1) stop the wdqs blazegraph on 1023, then (2) delete the wikidata.jnl file there, then (3) run with -s 0 -e 0 and the updated file pattern (for part-00000), then (4) run without the -s and -o flags (or just -s 1 and an implicit high-numbered -e). g'morning ryankemper !
[18:52:13] huh, consumer deploy failed with: Hash collision on user-specified ID "filter-by-cluster_group". Most likely cause is a non-unique ID. Please check that all IDs specified via `uid(String)` are unique.
[18:53:02] oh, that makes perfect sense, it's in a for loop :) fixing
[18:54:20] dr0ptp4kt: alright, getting rdy to kick it off for the first file
[18:54:27] ryankemper: you wanna just get this thing going now or would it be better to meet at 1:30 pm pt? i put a thing on the calendar for then (which i think works for inflatador, and it may be good fun for him to see the steps to do the thing)
[18:54:40] oops, you beat me to it. okay if we hop on a Meet and record maybe?
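[Editor's note] On the uid hash collision from 18:52: Flink requires every `uid(String)` in the job graph to be unique, so a uid assigned inside a loop has to incorporate the loop variable. A minimal sketch, where `clusterGroups`, `ClusterGroupFilter` and `buildSink` are hypothetical placeholders rather than the actual SUP code:

```java
// one filter + sink per cluster group; including the loop variable keeps the uids unique
for (String clusterGroup : clusterGroups) {
    updates
        .filter(new ClusterGroupFilter(clusterGroup))
        .uid("filter-by-cluster_group-" + clusterGroup)
        .name("filter by cluster_group " + clusterGroup)
        .sinkTo(buildSink(clusterGroup))
        .uid("sink-" + clusterGroup);
}
```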
[18:55:23] dr0ptp4kt: sure, couple mins for me and I'm rdy to get started
[18:56:18] okay, i'll ready a meet link in a couple minutes. i gotta fill the water bottle, and our softened water is gross and so we have bottled water, so i need to put in a new 5gal jug
[19:01:23] ryankemper: see gcal for meet. see you there in a minute
[19:01:29] ack
[19:53:00] going to my appointment, back in ~90
[20:08:57] thanks ryankemper once again! okay, we have the import going for the "wikidata main" side on wdqs1024.eqiad.wmnet (this is the node that's been wonky, but hopefully it's done with its irritability) and the "scholarly articles" side on wdqs1023.eqiad.wmnet. i.nflatador you'll find a copy of the recordings on the meeting invitation. you can just watch the first few minutes of the first one then stop, then the second recording. for bonus:
[20:11:35] hmm, consumer-search is back up now. Total throughput seems slightly higher, from ~20 earlier to ~32 now. Doesn't quite align with the update from 5 to 25 concurrent requests. The new pending_requests gauge goes up to a max of 20, but is fluctuating from single digits up to 20 and back
[20:11:41] er, not bonus, but if wdqs1024 flakes out i think our idea collectively is to use another node to do the wd_schol import (probably wanting to backup the wikidata.jnl on wdqs1023 where the import of the full graph succeeded)
[20:13:08] oops, mistyped. i meant to say: "if wdqs1024 flakes out i think our idea collectively is to use another node to do the *wd_main* import (probably wanting to backup the wikidata.jnl on *wdqs1022* where the import of the full graph succeeded)"
[20:14:21] actual bonus, here are some notes of some of the commands we ran today: https://phabricator.wikimedia.org/P54284
[20:30:58] 👏
[20:49:59] pfischer: lol, but not sure it's worth the effort so please feel free to ignore :)
[20:52:36] ebernhardson: yes, I was expecting more... could it be because we ask the operator to return ordered results and it's sometimes blocked on a slow request?
[20:53:50] wondering if we could run 2 fetch operators, one being ordered for rev_based requests and the other unordered for rerenders
[20:54:17] hmm, maybe
[20:56:22] was hoping to see envoy telemetry (k8s) but for some reason the cirrus updater seems to be the sole flink app not available :/
[20:56:49] dcausse: yes, i was currently looking into the same. Oddly, even if i `kubectl exec ...` into the tls-proxy and request http://localhost:9999, i get flink metrics
[21:29:42] back
[21:33:42] fetching the stats directly from envoy...i don't understand :P After running for 90m the consumer-search-tls-proxy reports `cluster.mw-api-int-async-ro.upstream_rq_completed: 2` and `http.mw-api-int-async-ro_egress.downstream_rq_completed: 2`
[21:34:32] oh, i'm a dummy, I have to ask the taskmanager, not the jobmanager. But I didn't realize that until 15 minutes after saying something :P
[21:35:01] much better, reports 166k requests
[21:36:52] But now this metric doesn't make any sense, documented as "Total requests pending a connection pool connection": cluster.mw-api-int-async-ro.upstream_rq_pending_total: 3556
[21:41:54] i guess rdf-streaming-updater reports basically the same, at 2.5k. Maybe it's some unimportant oddity
[21:50:50] ahh, that's overall. Out of 186k requests, 3.5k weren't able to immediately get a connection and had to go in the pending pool
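[Editor's note] Picking up the 20:53 idea of running two fetch operators, a sketch of splitting the stream so revision-based updates keep ordered emission while rerenders go through an unordered operator; the event type, the predicate and the fetch function are hypothetical placeholders:

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;

// updates is a DataStream<UpdateEvent>; fetchFunction is the async MW fetcher (both hypothetical)
DataStream<UpdateEvent> revBased = updates.filter(e -> e.isRevisionBased());
DataStream<UpdateEvent> rerenders = updates.filter(e -> !e.isRevisionBased());

// revision-based updates keep their emission order
DataStream<EnrichedUpdate> revEnriched = AsyncDataStream.orderedWait(
    revBased, fetchFunction, 60_000L, TimeUnit.MILLISECONDS, 25);

// rerenders don't need ordering, so one slow MW request no longer stalls the rest
DataStream<EnrichedUpdate> rerenderEnriched = AsyncDataStream.unorderedWait(
    rerenders, fetchFunction, 60_000L, TimeUnit.MILLISECONDS, 25);

DataStream<EnrichedUpdate> enriched = revEnriched.union(rerenderEnriched);
```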
[22:10:40] deployed a consumer-devnull to staging to see if it behaves differently
[22:25:58] hmm, i think quota is stopping it from standing up a taskmanager
[22:30:54] inflatador: if you have a chance before leaving today, this should update cirrus-streaming-updater limits in staging to match rdf-streaming-updater: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/981403
[22:31:17] school run, back in a bit
[23:05:57] back
[23:07:53] ebernhardson: gonna have to briefly step out in a few, but let's get those limits fixed
[23:10:31] thx!
[23:16:27] ebernhardson: is it just a `helmfile -e staging -i apply` from `ryankemper@deploy2002:/srv/deployment-charts/helmfile.d/services/cirrus-streaming-updater`?
[23:17:05] The proposed diff is showing a lot of stuff, so I might be doing it wrong
[23:17:07] ryankemper: this is admin_ng which is a different level
[23:17:14] Oh, doh
[23:17:18] lemme find docs, because i've never done admin_ng stuff myself :)
[23:17:52] this isn't quite exactly the same, but i imagine the commands are the same:
[23:17:54] https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng
[23:18:12] so, from the admin_ng directory: helmfile -e staging-eqiad -i apply
[23:22:10] Diff looks perfect, rolling
[23:22:23] https://www.irccloud.com/pastebin/fZ0fP1cO/ns_diff
[23:22:44] excellent, will try redeploying the consumer-devnull in a moment
[23:24:44] looks to have worked, consumer came up this time
[23:30:03] devnull is also processing in the mid 30's, which basically suggests the sink is not the bottleneck here
[23:30:16] excellent
[23:30:18] quick errand, 10m
[23:42:43] undeploying consumer-devnull, seems it's proven well enough that it has the same throughput as consumer-search and we don't need to leave it running.
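[Editor's note] For reference, the kind of black-hole sink the consumer-devnull experiment relies on can be as small as a writer that drops every record. A sketch against the Flink Sink V2 interface (method names differ slightly between Flink versions, and this is an illustration rather than the sink from MR 73):

```java
import java.io.IOException;
import org.apache.flink.api.connector.sink2.Sink;
import org.apache.flink.api.connector.sink2.SinkWriter;

/** Stateless sink that accepts every record and discards it. */
public class BlackHoleSink<T> implements Sink<T> {
    @Override
    public SinkWriter<T> createWriter(InitContext context) throws IOException {
        return new SinkWriter<T>() {
            @Override
            public void write(T element, Context context) {
                // intentionally drop the record
            }

            @Override
            public void flush(boolean endOfInput) {
                // nothing is buffered, so nothing to flush at checkpoints
            }

            @Override
            public void close() {
                // no resources to release
            }
        };
    }
}
```

Wiring it in is a one-liner such as `stream.sinkTo(new BlackHoleSink<>())`, which keeps the rest of the graph (source, fetch, serialization) identical while removing elasticsearch from the equation.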