[08:37:09] inflatador: seems like we disable access_log, see /etc/nginx/sites-available/production-search-eqiad for instance
[09:29:45] we're still having throughput issues with the SUP consumer, this time it was not backfilling :/
[09:30:32] something's still not right, ~100 concurrent requests should be more than enough...
[10:39:34] errand+lunch
[12:58:41] dcausse: How did you notice? Is it back pressure? I checked consumer group lag for the update topic and that's a flat 0.
[13:02:22] The back pressure is pretty constant too (~1s for the page_rerender update processing)
[13:17:32] relforge1003 has a full root partition
[13:18:32] ah, there's a 60G file named stuff.tcpdump in Erik's home that's causing it
[13:19:03] running since Dec 09, so maybe forgotten
[13:31:52] pfischer: looking at https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app?from=now-24h&to=now&var-datasource=eqiad%20prometheus%2Fk8s-staging&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search&var-flink_job_name=cirrus_streaming_updater_consumer_search_staging&var-operator_name=All
[13:33:44] the consumer was 300k messages behind ~yesterday
[13:33:48] moritzm: thanks for the heads up
[13:34:57] and the backpressure is not great, rerender_enrich being busy is causing the source to be backpressured
[13:39:34] tempted to blame resource constraints on the flink pod itself here (looking at young GC timing and also k8s CPU throttling)
[13:56:16] going to delete Erik's tcpdump file on relforge1003
[13:56:39] and kill tcpdump as well
[14:11:37] o/
[14:42:06] o/
[14:43:14] dcausse: Okay, I'll bump the resource requests and see what happens. I just finished running additional metrics tests (exposing via prometheus locally revealed some bugs). Once that is merged, we can deploy a new release.
[14:44:22] inflatador: I released a new version of the ES extra plugin today. Could we roll that out to the cloudelastic instances that are currently used by the SUP?
[14:45:17] pfischer: sounds good
[14:45:38] currently the SUP should be writing to relforge
[14:46:16] Right, should be even safer to update that instance.
[14:55:54] pfischer: was checking the httpclient metrics MR to merge it, do you still want to go with (maxConnection / 2) per route?
[15:01:50] pfischer Y, LMK if there is a task. I can make one if not. It'll probably be a few hrs before we can finish
[15:05:57] There's no task, I'll create one.
[15:10:42] inflatador: https://phabricator.wikimedia.org/T353270
[15:21:41] pfischer ACK, assigned it to myself
[15:29:55] hmm, looks like there's a new way to build deb pkgs: https://wikitech.wikimedia.org/wiki/Debian_packaging_with_dgit_and_CI
[15:31:53] letting CI build the deb would be nice indeed
[15:53:41] So we'd have to port https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/ to gitlab and could leverage their CI script?
[15:55:22] dcausse no worries re: access logs, just wanted to make sure it was intentional
[15:56:10] pfischer Y, I'm still wrapping my head around the docs...`dgit` is mentioned several times, but it's not installed on any hosts in our entire infra
[15:56:39] https://debmonitor.wikimedia.org/ keeps track of which pkgs are installed
[15:57:11] maybe the instructions are for the CI pipeline itself, as opposed to manual steps
[16:01:21] That was my understanding. Move the repo to gitlab, add a .gitlab-ci.yml and `include:` their script
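For reference, the `include:` approach described above usually boils down to a very small `.gitlab-ci.yml` in the packaging repo. The sketch below is only an illustration of the shape such a file could take; the project path, template file name, and variable are placeholders, not taken from this log or from the wikitech page (the swift repo linked just below is the concrete example to crib from).

```yaml
# Hedged sketch of a .gitlab-ci.yml that delegates the deb build to a shared
# packaging CI template, per the "Debian packaging with dgit and CI" approach.
# The include: target and the variable below are placeholders for illustration.
include:
  - project: 'repos/sre/debian-packaging-ci'   # placeholder project path
    file: 'templates/build-deb.yml'            # placeholder template file

variables:
  # Target distribution for the package build; value assumed here.
  TARGET_DISTRO: 'bullseye-wikimedia'
```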
[16:09:39] Looks like swift is using this pipeline already: https://gitlab.wikimedia.org/repos/data_persistence/swift
[16:10:16] inflatador: Jelto seems to know how to use that. Maybe he can onboard you?
[17:01:08] going to ship https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/982434 to see if this fixes envoy telemetry
[17:04:18] dcausse that might fix our issue with having to manually specify k8s master IPs
[17:04:37] ah, perhaps?
[17:05:42] Well, maybe not ;) I'm hopeful though
[17:05:47] Workout, back in ~40
[17:05:49] :)
[17:06:01] I +1'd your patch as well
[17:06:07] thx!
[17:16:01] hm... deploying will decrease taskManager mem from 3000m to 2000m, going to assume that Erik deployed a mem increase without a repo change for testing
[17:23:04] we have the metrics now (https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s-staging&var-app=flink-app-consumer-search&var-kubernetes_namespace=cirrus-streaming-updater&var-destination=All)
[17:37:00] dcausse: interesting, thanks!
[17:44:10] inflatador: could you estimate when you'll be able to work on the debian package?
[17:49:09] pfischer just got back, working on it now
[17:49:39] I'm looking at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/plugin_urls.lst ... does the jar file you linked earlier correspond to any of these files? Or is this something different?
[17:51:02] One moment, looking into it.
[17:51:05] maybe https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/plugin_urls.lst#8 ?
[17:51:43] Yep, line https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/plugin_urls.lst#8
[17:52:19] Let me check if it's on Maven Central already; if not, we use the sonatype host
[17:52:29] OK
[17:53:31] It hasn't arrived yet, so we use https://oss.sonatype.org/service/local/repositories/releases/content/org/wikimedia/search/extra/7.10.2-wmf10/extra-7.10.2-wmf10.zip instead
[17:54:07] pfischer ACK...will update. It'll probably take at least an hour to get the new packages deployed
[17:57:37] pfischer do you know if sonatype signs the release? re: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/README.txt#8
[17:58:28] Yes it does, one second.
[18:01:55] That should be my key ID (I deployed this release as well as the last one, so that should not change)
[18:03:15] got it, do you have a changelog message?
[18:03:32] last one says `Add max_size parameter to extra plugin's super_detect_noop set handler`
[18:04:44] Sure: Accept document as script.source in addition to script.params.source (deprecated)
[18:13:14] pfischer OK, patch is up at https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/982444 if you wanna review a la https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/README.txt#37
[18:18:11] dinner
[18:18:25] inflatador: +1
[18:19:23] ACK, just +2'd
[18:20:24] oops, formatting issues and a confusing message from jenkins...need to clean up the formatting, but another +1 is not necessary
[18:53:17] inflatador: thanks! So now it takes a while until the debs are available via apt, and then you can update the relforge instance?
[18:53:55] pfischer yeah, still working on it... see https://phabricator.wikimedia.org/P19522 for an idea of the workflow
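Not from the log itself, but as a rough idea of the kind of manual steps "moving to relforge" could involve once the deb reaches apt: the package name below is an assumption, and in practice this would more likely go through the usual cumin/debdeploy tooling than plain apt on a single host.

```sh
# Hedged sketch: upgrading the plugin package on a relforge host once the new
# version is visible in the Wikimedia apt repo. The package name is an assumption.
sudo apt-get update
apt-cache policy wmf-elasticsearch-search-plugins       # confirm the new version is available
sudo apt-get install --only-upgrade wmf-elasticsearch-search-plugins
sudo systemctl restart elasticsearch                    # restart so the new plugin jars are loaded
```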
[18:55:34] https://apt-browser.toolforge.org/bullseye-wikimedia/component/elastic710/ OK, the package is published, moving on to relforge
[19:00:13] Awesome! Sadly the apt upload is not part of the CI pipeline script we talked about earlier, that's still a work in progress.
[19:04:41] Tried to apply the changes to relforge...elastic crashed on 1004. Investigating...
[19:14:06] `java.lang.IllegalStateException: failed to load plugin extra due to jar hell`
[19:15:29] OK, fixed...for some reason the package update did not remove the old jars...haven't seen that before
[19:20:22] Weird. Thank you for fixing it.
[20:00:45] pfischer relforge is ready...next step is preparing the Docker image
[21:32:25] pfischer or anyone else, the MR for the elastic dev-images change is here: https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/54
[21:37:41] quick break, back in ~15
[21:55:09] ^^ went ahead and self-merged
[22:05:04] Just saw it 👍
[22:05:45] OK, the dev-images repo is updated...asked releng to run their script to update the repo. pfischer are you able to work with this as is, or does the image need to be up on the docker registry to be useful?
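On that closing question: whether the image has to be on the registry depends on how it is consumed. A sketch of the two options follows, with the image name, tag, and build-context path assumed purely for illustration.

```sh
# Hedged sketch, not from the log: two ways to consume the updated dev image.
# Option 1: build it locally from a dev-images checkout (no registry needed).
git clone https://gitlab.wikimedia.org/repos/releng/dev-images.git
cd dev-images
docker build -t elasticsearch-dev:local ./elasticsearch   # build context path is an assumption

# Option 2: pull the published image once the registry has been updated.
docker pull docker-registry.wikimedia.org/dev/elasticsearch:latest   # image name/tag assumed
```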