[08:53:25] o/
[08:55:14] o/
[09:08:22] o/
[09:57:41] dcausse: I'll be 3' late
[09:57:50] np!
[10:21:59] I've just scheduled a meeting with Srishti from the Language team for later today. gmodena: I've invited you, but this will be late, only join if you are super interested!
[10:22:20] Trey314159: if you can join that meeting, I suspect that you're the most needed person in there!
[10:31:57] gehel thanks! I might be afk at that time tonight, but if I'm around i'll drop by :)
[11:01:03] lunch
[11:33:40] ebernhardson renaming the package name to wmf_mjolnir fixed CI
[11:34:13] weirdly, it does not work locally. But tbh i don't know if wmf_airflow_utils is meant to :)
[11:35:20] I ran a manual pipeline to publish a 2.5.0 env (`main` is now versioned as 2.6.0dev)
[11:37:09] https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/jobs/465205
[11:37:10] OSError: [Errno 28] No space left on device
[11:37:12] oof
[11:42:02] this happens sometimes. Usually it's just a matter of waiting for the runner to clean up / be assigned a runner with enough resources
[13:01:55] I managed to release a conda env for mjolnir 2.5.0, and marked https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1105/ (search: query_clicks: increase data retention) as ready to review
[13:04:39] looking
[13:10:48] thanks!
[13:11:17] lunch
[13:12:57] o/
[13:30:21] Waiting for pod gitlab-runner/runner-9e2abdumz-project-93-concurrent-0-hxx8v1qh to be running, status is Pending, 71 minutes 30 seconds :/
[13:31:43] if y'all are having repeated problems w/a runner you might wanna hit up releng in IRC. They're usually pretty responsive
[13:32:02] o/
[13:32:07] .o/
[13:32:37] \o
[13:32:43] * pfischer has an external appointment and will be back in 90’
[13:36:16] o/
[13:53:49] gehel: I'll be there at the Language team meeting
[13:58:09] DC switchover in 2m...
[14:03:27] quick CR to add some master-eligibles to cloudelastic if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128874
[14:11:39] \o
[14:12:09] o/
[14:17:32] dcausse oof. The airflow-dags CI failed on the build envs step(s) :|
[14:17:39] gitlab is not having a good day today
[14:17:53] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/465298
[14:17:55] yes... just retried them, we'll see
[14:18:16] disk issues again? It surprises me how we keep running out when disk is (relatively) cheap
[14:18:39] i suppose DC grade disks are significantly more expensive than consumer disks, but still...
[14:18:49] yeah, I'm not sure where the trusted runners "live"...if it's on prem, maybe we can talk to them about using ceph
[14:19:21] wouldn't hurt to aggressively delete stuff too
[14:22:13] seems to have passed this time
[14:23:08] doing some cleanups on relforge to get it back green
[14:31:52] looks like puppet is unhappy with the symlink setup ;( https://phabricator.wikimedia.org/P74245
[14:32:17] inflatador: doh, of course it wants /usr/bin/test
[14:32:21] The ready-fire-aim puppet workflow strikes again :(
[14:32:23] * ebernhardson doesn't do enough puppet :P
[14:32:38] no worries, I can get a patch up
[14:34:35] inflatador: probably ln needs a full path too
[14:36:22] ebernhardson ACK, will amend
[14:44:19] q: do we have some baseline of :morelike results to use as a benchmark to compare with vector search?
[14:45:00] gmodena: we have a log of search requests along with results
[14:45:06] sec
[14:45:43] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128880 puppet path patch
[14:46:43] gmodena: event.mediawiki_cirrussearch_request has the logs, they are very much operational style logs though and not super easy to ingest
[14:47:09] ebernhardson ack. I'll take a look. Thanks for the pointer
[14:47:34] i've been doing manual inspections, but wanted to try and quantify results a bit
[14:48:40] ebernhardson I also like the suggestion you had a while ago to try to use an LLM to compare results. I hope to be able to give it a go :)
[14:48:45] gmodena: the general structure is 1 event per web request, with each individual request to the backend in the `elasticsearch_requests` array. We try to filter things up to the top level so you don't have to dig in elasticsearch_requests, but sometimes it can't be helped
[14:49:17] ebernhardson got it - thanks. Having a look rn
[14:51:08] gmodena: i think you can filter for array_contains(elasticsearch_requests.query_type, 'more_like')
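(A minimal PySpark sketch of the filter suggested above, for anyone following along. It assumes a Spark session with access to the event database; the table name and the array_contains() predicate come from the chat, while the partition filter and the rest is illustrative and depends on the table's actual layout.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One event per web request; backend queries live in the
# elasticsearch_requests array, so filter on its query_type field.
more_like = spark.sql("""
    SELECT *
      FROM event.mediawiki_cirrussearch_request
     WHERE array_contains(elasticsearch_requests.query_type, 'more_like')
       AND year = 2025 AND month = 3 AND day = 10  -- assumed partition layout
""")

more_like.show(5, truncate=80)
```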
[14:52:36] thanks for the review, moritz-m!
[14:53:21] inflatador: i just realized...it's probably going to complain the directory doesn't exist :(
[14:53:40] nothing is ever as simple as we would hope :P
[14:54:08] ah, I thought the `onlyif` would catch that? Puppet did run without errors FWiW
[14:54:25] Since that code also touches the observability hosts, I probably should've added one to PCC ;(
[14:56:43] inflatador: the onlyif checks if the source file exists
[14:56:55] inflatador: but the part that will complain is that the ${config_dir}/sudachi directory doesn't exist
[14:58:11] I just ran puppet against a logstash host with no errors, so it seems they won't be affected
[14:58:13] inflatador: plausibly this could also move into profile::opensearch::cirrus::server.pp inside the $::profile::opensearch::server::filtered_instances.each bit
[14:58:36] inflatador: yea the o11y servers should skip based on the onlyif
[14:59:24] What creates the sudachi dir? Is it from installing the plugin or somewhere in puppet?
[15:00:09] inflatador: /usr/share/opensearch/config/sudachi is from the wmf-opensearch-search-plugins .deb, but right now nothing creates ${config_dir}/sudachi
[15:01:08] ohh...OK. Sorry for being slow on this one. Maybe we just symlink the dir instead of the dic file?
[15:01:23] inflatador: oh! yea that will work
[15:01:48] cool, will get a patch up
[15:07:54] OK, let me know if https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128884 looks OK
[15:09:29] looking at sudachi I wonder if we should do something... I don't fully understand kotlin but it seems to eagerly create its analyzers...
[15:09:32] inflatador: in the title you need to use $instance_dir
[15:09:37] whoops!
[15:09:56] dcausse: i agree that it needs a bug filed upstream, it shouldn't open the dictionary on every shard
[15:10:09] * ebernhardson hopes it closes it again and doesn't go on and read it in :P
[15:10:28] inflatador: but otherwise seems reasonable...but on the other hand i don't have a great track record of predicting puppet success...good luck :P
[15:11:19] No one does, that's why we have PCC ;>. Something that is wholly unnecessary with **COUGH** other config mgmt software
[15:14:31] OK, PCC looks like we expect now (I think)
[15:15:55] inflatador: looks about right
[15:19:39] * inflatador is on pace to set a puppet-merge record today
[15:19:54] nah, i bet it's well over 30 :P
[15:20:29] If I continue at this pace, I'll get there ;P
[15:28:20] OK, everything looks good on cloudelastic. Will start changing masters after I'm done w/mtgs
[15:41:13] wow, i wasn't even close. 30 doesn't make the top 10 :P
[15:41:28] the winner had 84 commits on 2020-11-06
[15:42:19] via `git log --date=short --format="%ad %an" | sort | uniq -c | sort -nr | head -n 10`
[15:42:49] LOL, I love that you looked that up
[15:43:02] I guess I'll be content to set a personal record ;P
[15:44:58] the data is curious, it says your current record is only 5
[15:45:07] a few days ago :P
[16:00:57] I think it's because it does not override "requiresAnalysisSettings" to true but there might be something else
[16:04:14] did better than me :) I looked briefly but came to zero conclusions
[16:07:01] seems like they worked around some misunderstanding of the elastic analysis API: https://github.com/WorksApplications/elasticsearch-sudachi/pull/112
[16:07:50] AnalysisCacheService is scary :(
[16:08:28] i suppose they are trying to avoid loading the 200mb dictionary multiple times for multiple shards?
[16:09:19] perhaps?
[16:11:08] no it's caching analysis results apparently
[16:11:14] oh, weird
[16:11:57] https://github.com/WorksApplications/elasticsearch-sudachi/blob/develop/src/main/java/com/worksap/nlp/elasticsearch/sudachi/plugin/AnalysisCache.kt
[16:12:43] :/
[16:13:39] err...why?
[16:14:55] not really sure
[16:16:04] perhaps one of their use-cases is analysing many identical small strings
[16:17:33] i suppose that's possible. My imagination often isn't good enough to think of all the crazy things people can do with language analysis
[16:23:12] was hoping to ignore that since we're not going to use the analyzer but the tokenizer does seem to have caching as well...
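(A conceptual Python sketch of the two behaviours being untangled above. This is not the plugin's actual code — all names here are made up — it just illustrates the difference between sharing one expensively loaded dictionary across shards and memoizing per-input analysis results, which is what AnalysisCache.kt appears to do.)

```python
from functools import lru_cache

class Dictionary:
    """Stand-in for the ~200 MB dictionary; pretend constructing it is expensive."""
    def tokenize(self, text: str) -> list[str]:
        return text.split()

_DICTIONARY = None

def get_dictionary() -> Dictionary:
    # Sharing: load once per process instead of once per shard.
    global _DICTIONARY
    if _DICTIONARY is None:
        _DICTIONARY = Dictionary()
    return _DICTIONARY

@lru_cache(maxsize=1024)
def analyze(text: str) -> tuple[str, ...]:
    # Memoizing whole results only helps when many identical small strings
    # are analyzed -- hence the "err...why?" above.
    return tuple(get_dictionary().tokenize(text))
```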
[16:42:28] building sudachi is going to be a pain with gradle I guess... error: cannot find symbol Plugin ... :(
[16:45:07] curious, building was one thing i didn't have trouble with
[16:45:24] i think it was just `JAVA_HOME=/usr/lib/jvm/java-1.17.0-openjdk-amd64 ./gradlew -PengineVersion=os:1.3.20 build`
[16:46:34] ah java 17 is the one I did not try :)
[16:46:43] tried 8, 11 and 21
[16:47:19] somehow 17 is the newest one i have installed system wide, chosen by luck
[16:47:57] will try a PR upstream to see if we get a response then backport to os 1.3 on our repo
[16:48:21] awesome, thanks!
[17:45:05] errand
[17:49:29] workout/lunch, back in ~60
[18:19:53] dinner
[18:58:22] back
[19:22:08] low priority, but if anyone has time to try and run `promtool test rules team-data-platform/rdf_streaming_updater_global.yaml` against the latest master on alerts (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/) LMK. It seems to be failing linting, just wondering if it's my promtool version or something. I have not made any changes to the file yet
[19:23:56] inflatador: passes locally, i'm using `docker run -it --rm --entrypoint /bin/promtool -v $PWD:/prometheus:ro prom/prometheus test rules rdf_streaming_updater_global_test.yaml` with promtool 2.52.0
[19:25:27] ebernhardson thanks for checking, I'm on promtool 2.5.0, let me try updating
[19:30:29] turns out i was behind by 82 commits, but latest master also passes
[19:31:53] nice. I guess it's a PEBKAC then
[19:32:16] Meeting with Language team moved to Thursday
[19:49:53] quick break, ~20
[20:13:06] back
[20:14:54] i don't understand how k8s miscweb works :S Like i can curl into it, but i don't see where in deployment-charts it was configured: curl -ik --resolve query.wikidata.org:30443:10.2.2.70 https://query.wikidata.org:30443/
[20:19:41] I think there's a cname from query.wikidata.org to the k8s ingress address, lives in both the DNS repo and Puppet?
[20:20:42] oh and ATS is involved https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#239
[20:21:45] i'm actually looking at the internal side, post-ATS. What i mean is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1118074 seems to claim https://webserver-misc-sites.discovery.wmnet will end up in k8s
[20:22:05] but in deployment-charts miscweb there are a number of sites and images with the static assets listed, but query service is not in there
[20:22:28] so where are the assets coming from? I suspect that it still ends up at miscweb1003.eqiad.wmnet and similar, and not k8s
[20:22:49] interesting! I thought miscweb was 100% on k8s now
[20:23:20] I guess not
[20:23:36] maybe, i'm not certain that I understand what's happening here :P
[20:24:48] Pretty sure I don't either ;<)
[20:26:09] i also don't understand why in that patch trafficserver (ats) is being pointed at https://webserver-misc-sites.discovery.wmnet, but the gui_url provided in puppet is being changed from webserver-misc-sites to miscweb.discovery.wmnet .... many questions :P
[20:26:47] on the upside, miscweb.discovery.wmnet resolves to k8s-ingress-wikikube-ro.discovery.wmnet, and i can fetch query service through that by providing query.wikidata.org in the SNI. But where are the assets coming from?
[20:28:12] must still be going to miscweb1003? At least, I see healthchecks hitting the Apache logs there
[20:28:27] but how does k8s-ingress-wikikube-ro end up at miscweb1003?
[20:29:21] not sure. You wanna curl with a weird user agent and I'll see if it ends up on miscweb1003? I assume it will since CODFW's off and I don't see other miscweb hosts
[20:29:34] 10.2.2.70 is k8s-ingress-wikikube, and this successfully fetches the expected content through it: curl -ik --resolve query.wikidata.org:30443:10.2.2.70 https://query.wikidata.org:30443/
[20:30:12] inflatador: i just sent one with the user agent 'hello inflatador'
[20:31:10] I don't see it, I guess it is going somewhere else!
[20:33:01] Is this image referenced in deployment charts anywhere? https://docker-registry.wikimedia.org/repos/wmde/wikidata-query-gui/tags/
[20:33:29] inflatador: oh, that's interesting. There is a separate helmfile for wikidata-query-gui vs miscweb
[20:33:34] i suppose that would make sense
[20:33:47] (for why it's not in miscweb)
[20:34:17] it seems very awkward, they have 6 separate releases that all serve the exact same content on different hostnames
[20:34:31] i guess this is the new k8s world :P
[20:35:18] i guess they have separate custom-config.json on each domain, although it's not easy to see how they very
[20:35:26] s/very/vary/
[20:36:12] Microservices!
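(A hypothetical Python helper mirroring the `curl --resolve` trick used above: open the TCP connection to the ingress IP, but present the site's hostname in the TLS SNI — which is what the k8s ingress routes on — and in the Host header. Skipping cert verification is the `-k` equivalent for a quick poke. The hostnames and IP come from the discussion; the function itself is just an illustration.)

```python
import socket
import ssl

def fetch_via_ingress(host: str, ip: str, port: int = 30443, path: str = "/") -> bytes:
    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # curl -k equivalent:
    ctx.verify_mode = ssl.CERT_NONE  # don't verify the cert for a quick poke
    with socket.create_connection((ip, port)) as raw:
        # server_hostname sets the SNI, which drives the ingress routing
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            request = (
                f"GET {path} HTTP/1.1\r\n"
                f"Host: {host}\r\n"
                "Connection: close\r\n"
                "\r\n"
            )
            tls.sendall(request.encode())
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)

# e.g. print(fetch_via_ingress("query.wikidata.org", "10.2.2.70")[:200])
```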
[21:04:24] sre pairing?
[21:04:32] was hoping to ship above mentioned patch.
[21:16:22] ebernhardson OMW
[21:16:48] inflatador: https://gerrit.wikimedia.org/r/1118074
[21:43:48] curl -ik --resolve commons-query.wikimedia.org:30443:10.2.2.70 https://commons-query.wikimedia.org:30443/
[21:55:18] inflatador: ryankemper: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128987
[22:05:29] spring is deceptive...blue skies, sun everywhere. 61 degrees :P
[22:21:09] heading out