[07:56:43] o/
[07:57:48] pfischer, gehel: going through emails and backlog but ping me if there's anything I can help with
[08:16:34] Hi David, welcome back!
[08:17:08] thanks! :)
[08:47:57] dcausse: Have you heard back from ottomata regarding the (compressed) kafka topics? I could not see them on kafka-test1006.eqiad.wmnet (kafkacat -L -b kafka-test1006.eqiad.wmnet:9092 | grep topic) last Friday.
[08:48:50] pfischer: looking in the thread where I asked
[08:50:53] he seems to suggest that if we autocreate a topic with "compression.type: gzip" on the producer settings the topic should use this compression at the broker level
[08:51:22] we might need to create a quick java app to prove this
[08:51:53] he's not against creating these topics tho, he just wanted us to try that out first
[08:52:50] actually a quick python client might be faster to write, I think I have some code at hand
[08:52:52] dcausse: Okay, that's what I thought: by default the topics assume compression.type: producer and it's up to the clients to compress batches.
[09:01:11] dcausse: Clients must configure linger.ms > 0 and batch.size > 0 for compression.type to show an effect.
[09:05:33] pfischer: sure but I suspect that with message sizes around 10k it should show an effect regardless?
[09:06:27] The question still is: how much does compression buy us if it is only applied to batches? Any message larger than batch.size will not be sent as a batch and therefore will not be compressed. So if we are worried about large messages (for large pages) we have to raise batch.size to multiple megabytes, and that increases the memory consumption on the broker side
[09:06:31] max.request.size might have to be tuned most certainly tho
[09:08:11] https://www.conduktor.io/kafka/kafka-producer-batching - my mistake, memory consumption rises on the client side, as the client allocates batch.size per partition
[09:09:33] Do we have usual suspects of large pages? Just so we can get a feeling for a corresponding kafka message.
[09:09:47] sure
[09:09:56] largest should be on test wiki
[09:10:05] so it'll be easy to test
[09:10:35] it's https://test.wikipedia.org/wiki/Template:Long
[09:10:50] should be around 4mb of json cirrus doc
[09:14:12] Do we have a distribution of page sizes for (en)wiki?
[09:17:43] pfischer: not a detailed distribution but some top percentiles are there: https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&editPanel=58
[09:17:55] you don't get the details per wiki
[09:19:04] you have more stats there https://docs.google.com/spreadsheets/d/1NTOhfw5pRPZBxZ017G-SblfSvvJ7hLIpSNtA_IUO2iQ/edit#gid=0 but unsure that's helpful now
[09:20:15] Okay, so if we set the batch size to 100kb we cover 95% of all pages. Unless the remaining 5% are responsible for the majority of all revisions.
[09:22:01] no, the grafana link shows the updates we see in production so it should be very much representative of what we'll have in the topic
[09:31:02] Even better, so we already have a value for batch.size. Looking at the update rate I see a sum of 600 op/s over all monitored wikis. So roughly every 2ms we see an update. Hence if we set linger.ms to 4ms we'd have to increase batch.size to 200kb.
[09:36:10] sure, but for now the update rate you'll see will be much lower (I expect ~20/s) as we only cover revision based updates at the moment
[09:43:30] Makes sense, so unless we increase linger.ms we'll send fewer messages per batch but they would still be compressed. I'll create a patch for a separate kafka output topic with (configurable) batching (+compression)
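A minimal sketch of the quick test app discussed above, assuming the plain Java producer client; the topic name is made up for the experiment, and the batch.size/linger.ms/max.request.size values simply mirror the back-of-the-envelope numbers from the conversation rather than a vetted config:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressionProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-test1006.eqiad.wmnet:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // compression is applied per batch on the producer side; the broker keeps it
        // as long as the topic's compression.type stays at the default "producer"
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
        // ~200kb batches with a 4ms linger, per the numbers discussed above
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 200 * 1024);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 4);
        // large cirrus docs (Template:Long is ~4mb of json) need room in a single request
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 8 * 1024 * 1024);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // hypothetical auto-created topic name, only for this experiment
            producer.send(new ProducerRecord<>("test.cirrus-update-compression", "Template:Long", "{...}")).get();
        }
    }
}
```

If the auto-created topic keeps the default compression.type of producer, the broker should store the batches exactly as the client compressed them, which can be double-checked on the broker side (e.g. with kafka-dump-log, which prints the compression codec per batch).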
[09:44:29] +1
[09:47:45] dcausse: welcome back!
[09:47:51] thanks! :)
[09:48:05] While you're around (for a short time), if you could have another pass at https://docs.google.com/document/d/17tY05WoaT_BloTzaIncR939k3hvhcVQ-E-8DBjo284E/edit# that would be great!
[09:48:46] Also, I think that Ryan should have scheduled some time with you to go over the WDQS SLO. We're almost there (for the SLO), but your review is key to validating that the numbers we have do make sense.
[09:50:19] sure!
[09:52:08] Note that the final SLO dashboard will be backed by grizzly templates. There is a CR ready to be merged: https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/862178 (no need to review, just fyi).
[09:52:22] It should have mostly the same numbers as https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1
[09:52:43] nice!
[09:53:11] we had another incident last Friday. It shows up quite nicely in the graph: https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1&from=1669992823947&to=1670013983131
[09:53:37] yes I can see that :)
[09:54:45] The strategy was to count 429 and 403 as success. But in cases where the service is slowing down, we can expect 429s to increase for everyone, due to degraded quality of service and not due to a specific client overloading us.
[09:55:26] So our SLO is probably overly optimistic. That still seems like a good trade-off between complexity and accuracy, but your opinion will be welcome.
[09:59:52] makes sense!
[10:09:45] I can confirm that compression.type is producer on kafka-test
[10:37:17] Thanks!
[11:00:47] lunch
[11:02:32] lunch 2
[11:41:45] lunch
[12:19:25] errand
[14:41:32] errand
[15:47:29] Stuck in traffic: I'll be late for triage
[16:49:33] dcausse: any ideas on the ltr thing? In particular the 0's should be gated on a terms HashSet populated via Weight::extractTerms, returning 0. My only plan at the moment is to pull it up in IDEA's debugger and step through
[16:51:05] https://github.com/o19s/elasticsearch-learning-to-rank/blob/218f077a2003c05cae6f3b0206f554b9b16966ad/src/main/java/com/o19s/es/explore/ExplorerQuery.java#L117-L124
[16:51:43] it suggests there are terms, but searcher.termStatistics(term, ctx) returned null for them
[16:53:03] i can simply add a boolean that's only true when at least one term has tStats != null, but maybe more is needed
[16:53:16] hm so "terms" is not empty but does not have any term stats?
[16:53:46] i've only run the code in my head, but i think the only way to get our current output is if terms has at least one term, but no term returned statistics
[16:54:26] what about changing StatisticsHelper to output 0 instead?
[16:55:04] could flag if anything is seen in the helper i suppose
[16:55:13] Daniel wrote this so he might perhaps disagree, not sure
[16:55:43] i suppose i could try a newer version, but pretty sure this still exists in LTR's head branch
[16:55:56] either way I'm fine with guarding with a boolean "addedSomething" only if we added something to the stats
[16:55:57] so should probably upstream a patch and deploy a fix locally
[16:56:07] yes
[16:56:30] i wonder what changed in 7.10 to cause this though, the ltr code didn't change in the 6->7 migration
[16:56:37] so something changed in what elastic returns
[16:57:55] i suppose will find out by stepping through :) i reproduced this by searching for a term that doesn't exist in the title, is a bit odd
[16:57:58] ebernhardson, ryankemper, inflatador: I'll skip our pairing sessions tomorrow. For once, I'll be out having a few beers with a few friends...
[16:58:07] gehel: nice! have fun
[16:58:17] gehel how dare you have a normal social life ;P
[16:58:50] there's some assert !data.isEmpty(); in StatisticsHelper so it expects something
[16:58:57] I don't know about normal. But at least somewhat social
[16:59:22] LOL
[16:59:32] so the if (tStats != null) { and then the if (terms.size() > 0) { do not seem to be consistent
[16:59:59] indeed, it seems at a minimum we should align those two conditions, probably not check terms.size() anymore and instead use a bool flag
[17:00:09] yes
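A rough, self-contained illustration of that bool-flag idea; the names below (termDocFreq, score, termsWithStats) are simplified stand-ins, not the actual ExplorerQuery/StatisticsHelper code, so this only shows the shape of the guard:

```java
import java.util.List;
import java.util.Set;

class ExplorerGuardSketch {
    // Stand-in for searcher.termStatistics(term, ctx): returns null when the
    // searcher has no statistics for the term.
    static Long termDocFreq(String term, Set<String> termsWithStats) {
        return termsWithStats.contains(term) ? 42L : null;
    }

    static float score(List<String> extractedTerms, Set<String> termsWithStats) {
        float sum = 0f;
        boolean addedSomething = false; // the flag proposed in the discussion
        for (String term : extractedTerms) {
            Long docFreq = termDocFreq(term, termsWithStats);
            if (docFreq != null) {       // mirrors the existing tStats != null check
                sum += docFreq;
                addedSomething = true;
            }
        }
        // Gate on the flag instead of terms.size(), aligning the two conditions:
        // a non-empty term set where no term returned statistics now behaves
        // like the empty case and yields a plain 0.
        return addedSomething ? sum : 0f;
    }

    public static void main(String[] args) {
        // extracted terms are non-empty, but none of them have statistics
        System.out.println(score(List.of("missing"), Set.of("present"))); // 0.0
    }
}
```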
[17:14:33] errand
[18:06:39] Exercising, back in an hour
[18:22:30] hmm, omega on 1089 is doing the same weird thing that cloudelastic 1006 was doing. The instance has a 10G heap, 6G is used for old but the reported max allowed is 7.8G, 2.3G max for young, .3G max for survivor. They don't actually add up to the 10G heap :S
[18:23:05] well, the maxes add up to a bit more than 10G, but for some reason the old pool stops almost 2G short
[18:23:49] i guess that's the new settings in elastic that try to maintain free memory. Not sure if that's too aggressive, whether we should give it more memory, or something else
[18:28:20] ebernhardson I noticed the alert, I guess the "reported values" are coming from prometheus by way of log4j?
[18:29:25] I restarted the service but didn't check in on it after that...doesn't look like the alert cleared
[18:29:31] probably prometheus via the various APIs elasticsearch exposes. Probably via https://search.svc.eqiad.wmnet:9243/_nodes/stats/jvm
[18:32:09] ah, I guess I need to look directly at those APIs then, maybe that will help us understand why the stats don't add up
[18:33:08] my best guess is it's related to `8-13:-XX:CMSInitiatingOccupancyFraction=75` in the jvm.options file, but i'm not 100% certain. That basically says run the old GC if it's 75% full
[18:34:21] but i suppose that's elasticsearch ensuring there is always enough memory for whatever it wants to do, and avoiding the circuit_breaker_exceptions we see
[18:35:41] there is always the option to re-tune the alerts as well, hard to determine how "bad" it is that the old gc constantly runs. Might have to look at how latency changes (https://grafana.wikimedia.org/d/000000486/elasticsearch-per-node-percentiles) when a node transitions from not running the gc to constantly running it
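If it helps while digging into this, the per-pool maxes in _nodes/stats/jvm presumably come from the JVM's own memory pool MXBeans, so a tiny sketch like this (run with the same heap size and CMS flags as the node) should reproduce the old/young/survivor maxes that don't quite sum to -Xmx:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class HeapPoolMaxes {
    public static void main(String[] args) {
        long heapMax = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getMax();
        System.out.printf("heap max: %.1fG%n", heapMax / 1e9);
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                // under CMS the heap pools are "CMS Old Gen", "Par Eden Space", "Par Survivor Space"
                System.out.printf("%s max: %.1fG, used: %.1fG%n",
                        pool.getName(),
                        pool.getUsage().getMax() / 1e9,
                        pool.getUsage().getUsed() / 1e9);
            }
        }
    }
}
```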
[18:38:40] 9111 is the omega port?
[18:39:50] hmm, should be 9243, 9443, and 9643 for the three clusters. I always forget the order :( it will tell you if you request the banner page though
[18:39:56] banner == /
[18:40:02] I just mean as it appears on the grafana page
[18:40:10] pretty sure it is 9111 FWiW
[18:40:21] ahh, no clue there. We should change those to display something more useful :)
[18:40:31] yeah....also now I'm starting to doubt that
[18:40:49] since there is no process listening on 9111 on 1089
[18:41:07] can log into an instance and check the ps list for the prometheus exporter, see what port matches which instance
[18:41:10] but there's not one on 9109 either, so hmm
[18:41:30] yeah, working on it...sounds like we do need to fix some labels
[18:42:42] based on the unit file for prometheus, the omega port is indeed 9111
[18:43:13] and there is a python script listening there, so I was wrong
[18:45:41] but there's also an exporter running on 9110 as well, probably the official elastic exporter, and 9111 looks to be WMF's custom exporter
[18:46:33] yea these latency metrics come from our exporter. The official exporter is used for most stats, our custom one only reports index stats. That's because we wanted to report based on the indexes' aliased names for long term consistency, instead of the index names suffixed with timestamps
[18:47:17] yeah, totally makes sense
[18:47:32] popping a task to update our monitoring docs and clean up those prometheus labels
[18:47:53] i wrote a quick patch which will add a cluster name label to those metrics: https://gerrit.wikimedia.org/r/c/operations/puppet/+/864829
[18:48:05] once it's collecting then need to figure out how to improve the dashboard to use it
[18:48:50] the annoying thing is the cluster names are really long, production-search-eqiad, bit annoying to fit into a graph legend
[18:51:20] you can tweak them via regex in grafana
[18:51:31] yeah, we could probably ditch "search" but maybe that's important to distinguish from other ES clusters
[18:51:55] anyway, I'll take a look at the puppet patch, here is the new task https://phabricator.wikimedia.org/T324500
[19:07:51] stepping out for lunch, will continue working on above patch once I get back
[19:27:18] inflatador: we had some great help from the jclarity folks in tuning GC for WDQS (a while back). There hasn't been much activity on their Google group (https://groups.google.com/a/jclarity.com/g/friends) since they were acquired by Micro$oft, but if you want to try, that might be a good option
[19:27:48] They are going to need GC logs to analyze. So let's make sure we have some and they are easily published.
[19:30:59] gehel: run went a little long, 3’ late
[19:31:06] ack
[19:32:43] inflatador: at that time, I did talk to Martijn (you can find his email on https://www.jclarity.com/) and another guy from jclarity whose name I have now forgotten
[19:33:22] Seems that it was Kirk Pepperdine. See T178271
[19:33:22] T178271: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271
[19:36:17] and some of the conversations we had: https://groups.google.com/a/jclarity.com/g/friends/c/dyq92j15Zmg/m/kCNtUabIAQAJ
[19:58:43] back
[20:16:25] PCC doesn't detect any changes with the above patch. Guessing it has something to do with the way we package and deploy the script
[20:17:42] inflatador: the patch itself just changes a file we put on disk, i suppose since it's a direct file copy and not a templated thing PCC doesn't recognize the change
[20:17:46] doesn't matter much for such a small change. Will go ahead and merge
[20:20:58] OK, it's merged...I guess we should see the new label appear on the percentiles dashboard once we run puppet?
[20:22:09] or do we need to adjust the dashboard as well
[20:22:51] it won't show on the dashboard immediately, since the legend is set to {{instance}}, but we should be able to edit the dashboard and drop the legend and perhaps see the new bits
[20:23:03] or see it in the grafana explorer
[20:25:21] not sure how grafana will treat metrics with an extra label, will find out :)
[20:26:43] hmmm, will we need to update the grafana-grizzly repo as ryankemper did here? https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/862178/
[20:27:10] inflatador: grizzly is specific to SLO dashboards, for everything else it's just manual grafana dashboards
[20:27:12] I've run puppet on 1089 so we can check grafana explorer I think
[20:27:16] i doubt it, most grafana dashboards are simply edited in place via the web ui
[20:27:16] grizzly lets us do it in a templated way
[20:27:53] i see exported_cluster set on several metrics now from grafana explore
[20:28:19] so seems to be working (i think the prometheus collector side renames cluster -> exported_cluster)
[20:31:05] yup, the query `elasticsearch_per_node_latency{exported_cluster=~".+"}` is returning more things
[20:34:27] Ah thanks!
[20:35:14] I think I need to go back to Grafana Explorer 101...seems I forget how to do this every time ;(
[20:35:41] i usually have to open 5+ tabs searching for promql things every time i use it :)
[20:42:51] inflatador: i saved a new version of that dashboard, it should now say something like "omega - elastic1056:9111"
[20:43:10] but only for the newly collected metrics of course, historical data will be "- elastic1056:9111"
[20:43:50] it mostly amounted to wrapping the existing expression in label_replace to shorten the reported cluster name with regex
[20:49:22] excellent
[20:49:56] it's probably run everywhere already, but I'm going to go ahead and run it fleet-wide
[20:50:02] (puppet that is)
[20:50:49] inflatador: o/ i'm making some good progress on the flink image stuff. any word on the helmy bits?
[20:54:41] ottomata, nothing new to report. I've never written a helm chart personally. I think David and I will have to meet with someone more helm-y (you? Janis? Ben?) before we can move fwd
[21:12:23] hmm, okay! would you mind if i started working on it? we can do it together, i can add you for reviews and walkthroughs?
[21:16:25] ottomata not at all, would be a great learning opportunity for me. I can schedule us some pairing time this wk if that is OK (although of course, feel free to work on it before/during/after)
[21:18:26] okay cool! i've never written a helm chart for an operator before; I think serviceops is gonna be much stricter. Luckily ben's work on the spark operator paves the way :)
[21:18:51] I think mostly it will be taking this helm chart and stripping away stuff we don't want: https://github.com/apache/flink-kubernetes-operator/tree/main/helm/flink-kubernetes-operator
[21:30:51] * ebernhardson isn't sure why the local phpunit runner doesn't want to run WBCS tests ...
[21:32:28] well, i mean they run..but they error with things like 'Expected: Persian Actual: فارسی'
[21:32:43] which is Persian, but not expected :P
[21:45:40] inflatador: we should get ben to give us a walkthrough of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/855674/. There's lots I don't know about, especially in the networking and access control areas
[21:49:03] ACK, I've been looking at that PR, let me get a mtg set up
[21:50:55] ah, you beat me to it