[07:32:46] guess that jvmquake was built against the java11 include/jvmti.h, not java8 (-3 means wrong version per https://docs.oracle.com/javase/8/docs/technotes/guides/jni/jni-12.html)
[09:01:28] unmeeting in https://meet.google.com/hvn-zxxd-xrb if anyone wants to join...
[10:33:12] lunch
[13:10:07] o/
[13:19:25] Is this interesting? https://twitter.com/CohereAI/status/1649097293201547264
[13:19:53] ^ slack message in #no-stupid-question
[13:29:11] yes, I remember having a quick look; there's a demo using data indexed in weaviate https://github.com/cohere-ai/notebooks/blob/main/notebooks/Wikipedia_search_demo_cohere_weaviate.ipynb haven't played with it yet but it sounds interesting
[14:27:56] Trey314159: have to cancel, dcausse and I are working on some package building
[14:28:27] inflatador: no worries! I'll catch you later
[14:40:12] o/ might be 10min late for retro
[15:00:09] @team: retro starting in 1'. Olja is going to be there to chat about any organizational changes: https://meet.google.com/eki-rafx-cxi
[15:01:50] inflatador, dcausse: ^
[15:21:17] pfischer: about T335399, Peter Wangai might be able to help
[15:21:18] T335399: Add SonarQube Integration for CirrusSearch Update Pipeline CI - https://phabricator.wikimedia.org/T335399
[16:30:41] could we try to ship https://gerrit.wikimedia.org/r/c/operations/puppet/+/911940 for the puppet deploy window? It's an extra metric from our prometheus elasticsearch reporter
[16:41:58] ebernhardson: haven't tested the script but +1 (deleted the labtestwiki completion index on both clusters)
[17:14:33] dinner
[18:24:58] ebernhardson: I missed the window but I can get that out today
[18:26:54] cool! It goes along with https://gerrit.wikimedia.org/r/c/operations/alerts/+/911945, which adds a new alert on that metric
[18:31:28] hmm, i seem to have remembered the wrong release process for search-extra :S it was released to archiva instead of central
[18:34:04] ryankemper: pairing session? https://meet.google.com/eki-rafx-cxi
[18:34:39] gehel: internet keeps going up and down, will join when i can in a few
[18:34:59] if it doesn't work, let's cancel
[18:37:28] gehel: ack, all I had was merging erik's patches, and I also wanted to ask if you'd looked over the SRE email draft
[18:37:34] the grizzly performance work is done, so I'm looking to send it out today
[18:37:45] \o/ (about grizzly)
[18:38:10] I completely forgot about the email draft, looking right now
[18:42:06] ryankemper: this is probably backward, but given we've been struggling to get the right metrics / queries, would it make sense for you to get a prometheus training? There are a few that seem reasonable on https://prometheus.io/support-training/
[18:43:19] gehel: some of the modules in https://training.promlabs.com/ definitely look relevant, could be worth checking out
[18:43:35] minor comment on the draft, otherwise LGTM, send it out whenever you want!
[18:44:09] ryankemper: I'll add a note to discuss this training in our next 1:1
[18:44:43] Excellent
[18:44:44] thanks!
[18:50:39] ryankemper: minor comment on https://grafana.wikimedia.org/d/slo-WDQS/wdqs-slo-s?orgId=1 looks like the scale is wrong on the right-side panel. It goes to 1.0000%, which should probably be 100%?
[18:50:45] ratio vs percentage?
[18:56:37] gehel: I threw a patch up, take a look at this preview dashboard: https://grafana.wikimedia.org/dashboard/snapshot/ck7Ph8nsdkushCwG0M07j14ML4A5OqVh
[18:56:41] (https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/912944)
[18:57:21] looks good!
[19:05:08] (deployed)
[19:20:10] huh, apparently `-DreleaseProfiles=deploy-central` doesn't tell it to use the deploy-central profile from discovery-parent-pom...
[19:28:19] ebernhardson: you need to activate a regular profile, not a releaseProfile
[19:28:57] gehel: ahh, ok. lemme try that
[19:29:05] example at the bottom of https://github.com/wikimedia/wikimedia-discovery-discovery-parent-pom#release
[19:29:21] for once, it is even properly documented!
[20:16:08] seems to have worked, although it's only showing on oss.sonatype.com and not central.sonatype.com. Going to guess there is some caching/delay and check later
[20:45:57] ebernhardson: I'm glancing at https://gerrit.wikimedia.org/r/c/operations/puppet/+/911940 now. I'm a bit confused about what `batch_id` represents though
[20:46:37] The comment says `The batch_id of a titlesuggest index marks when it was last updated`, which is pretty straightforward. However I don't understand this part:
[20:46:47] > Aggregate to find the lowest batch_id across the cluster. If this value grows it indicates the update process is failing somewhere.
[20:47:24] ebernhardson: if I think of the batch_id as a timestamp then it *not growing* would be a failure. So it must sort of be the inverse of a timestamp, but I'm having trouble visualizing what the actual value looks like
[20:48:03] ryankemper: the batch_id is indeed a timestamp, the max(batch_id) of a titlesuggest index says the last time something was indexed.
[20:48:27] ryankemper: the growing refers to now - batch_id, the age of the index. If the age increases then something is wrong
[20:48:51] i mean it should fluctuate a bit, but generally < 1 day. if it's a couple days then the process isn't updating the indices
[20:49:20] and i suppose worth noting that indexing here isn't related to edits, every day should replace all the docs
[20:49:23] okay I see
[20:50:17] i suppose that comment could be clearer, i'm mixing later concepts in :P
[20:50:34] How does amending `If this value grows` -> `if current time - this value grows...` sound?
[20:51:12] Or slightly more verbose, `if the distance from current time until the batch_id...` not sure if that reads better tho
[20:51:29] i suppose paraphrasing above: If the difference between this value and the current time grows beyond a day, it indicates the update process is failing somewhere.
[20:51:47] Ooh I like that. I'll amend to that wording
[20:54:19] ebernhardson: okay, how's the wording in https://gerrit.wikimedia.org/r/c/operations/puppet/+/911940/8/modules/prometheus/files/usr/local/bin/prometheus-wmf-elasticsearch-exporter.py#115 look? if good I'm ready to give my +1 and merge
[20:55:45] lgtm
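To make the semantics above concrete, here is a minimal sketch of the check being described, assuming batch_id is a unix timestamp in seconds; the function and sample index names are hypothetical, not the exporter's actual code:

```python
import time

ONE_DAY = 24 * 60 * 60

def oldest_index_age_seconds(index_batch_ids):
    """Age of the stalest titlesuggest index, in seconds.

    index_batch_ids is a hypothetical mapping of index name ->
    max(batch_id), where batch_id is a unix timestamp (seconds)
    of that index's last completed rebuild.
    """
    # The laggiest index drives the alert: if any index stops being
    # rebuilt, min(batch_id) stops advancing and now - min(batch_id)
    # grows without bound.
    return time.time() - min(index_batch_ids.values())

# Illustrative inputs: one index rebuilt an hour ago, one ~25h ago.
now = int(time.time())
batch_ids = {
    "enwiki_titlesuggest": now - 3600,
    "testwiki_titlesuggest": now - 25 * 3600,
}
age = oldest_index_age_seconds(batch_ids)
print(f"stalest titlesuggest batch is {age / ONE_DAY:.1f} days old")
if age > ONE_DAY:
    print("update process is likely failing somewhere")
```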
[21:06:11] ebernhardson: cool, merged. one more question, I don't fully understand what's going on with the "hack" that sets `settings.index.creation_date` to `batch_id * 1000`
[21:07:31] ryankemper: well, it's a hack because i'm injecting custom values into an elasticsearch response
[21:08:15] ryankemper: but the age_days metric down on line 242 uses that dataset and reports a value for each index in that index_settings response
[21:08:31] it wouldn't work if there were more metrics using index_settings, because we didn't inject all the data for our fake *_titlesuggest index, just the one value
[21:08:58] ebernhardson: ah I'm starting to understand. and why the `* 1000` specifically?
[21:09:35] we have a unix timestamp, but java reports in ms since 1970 for whatever reason
[21:10:01] we are just matching how elasticsearch represents the creation_date value
[21:19:52] ah, I see. that makes sense
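In other words, a toy illustration of the scaling (not the exporter's actual code): batch_id is in epoch seconds, while Elasticsearch reports `settings.index.creation_date` in epoch milliseconds, so the injected fake value has to be multiplied by 1000 to look like a genuine creation_date:

```python
import time

def fake_creation_date(batch_id):
    # batch_id is a unix timestamp in seconds; Elasticsearch stores
    # settings.index.creation_date in milliseconds since the epoch
    # (the Java convention), so scale by 1000 to match.
    return batch_id * 1000

batch_id = int(time.time())
creation_date = fake_creation_date(batch_id)
# Downstream age math can then treat the injected value exactly like
# a real creation_date, both sides being in milliseconds:
age_days = (time.time() * 1000 - creation_date) / (1000 * ONE_DAY := 1000 * 24 * 60 * 60) if False else (time.time() * 1000 - creation_date) / (1000 * 24 * 60 * 60 * 1000 / 1000)
age_days = (time.time() * 1000 - creation_date) / (24 * 60 * 60 * 1000)
print(batch_id, creation_date, age_days)
```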
[22:21:33] curious, prometheus has that value increasing every minute
[22:30:53] oh duh, it's supposed to. We cache the oldest timestamp, but we report now - timestamp, so it's always moving.
[22:35:11] ryankemper: we'll need to restart the collector for all the other clusters, but the one cluster that's collecting now looks good
[22:35:56] ebernhardson: collector meaning the prometheus wmf elasticsearch exporter yeah?
[22:37:21] ryankemper: yea, i suppose i'm assuming only one was restarted because i only see one in prometheus
[23:17:05] ebernhardson: I think puppet should have already reloaded them all by now
[23:17:41] random example, elastic1067: `Active: active (running) since Thu 2023-04-27 21:16:43 UTC; 1h 59min ago`
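A toy model of the behavior noticed at 22:21 above (hypothetical class, assuming the exporter caches the oldest batch_id between collections): the cached timestamp only changes when the cluster state is re-read, but the reported value is recomputed as now minus that timestamp on every scrape, so the series climbs steadily between rebuilds:

```python
import time

class AgeGauge:
    """Toy model: cache the oldest batch_id, report its age per scrape."""

    def __init__(self):
        self.cached_oldest = time.time()

    def on_collect(self, oldest_batch_id):
        # Only refreshed when the cluster state is re-read.
        self.cached_oldest = oldest_batch_id

    def on_scrape(self):
        # Recomputed against the current time on every scrape,
        # hence the value keeps moving between collections.
        return time.time() - self.cached_oldest

gauge = AgeGauge()
gauge.on_collect(time.time() - 3600)  # pretend the last rebuild was 1h ago
print(gauge.on_scrape())              # ~3600
time.sleep(1)
print(gauge.on_scrape())              # ~3601: a second later, a second older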