[07:22:05] dcausse: (and cc gehel) did some digging into the wdqs update lag dashboards question...I think based off looking at the grafana explore (the first link after the `-> ` in https://phabricator.wikimedia.org/P52490 ) output the non-grizzly version is wrong, likely a missing xor extra paren
[07:24:28] ryankemper: thanks for taking a look into this! so this means we're actually red? :(
[07:24:37] Basically it looks like ultimately the thing is evaluating to ` < bool 600` which is always returning 1...that's assuming my reading of the explain output is correct, which it very well might not be
[07:26:02] I must confess that it's too early for me to understand this prometheus query yet but I'll try harder :P
[07:26:08] dcausse: I think so. The sizeable dip around `2023-08-29` makes me feel like there's probably a host or two that's pooled that shouldn't be, or something like that
[07:26:22] ok
[07:29:55] Looking at https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&viewPanel=8 I don't see any hosts listed with `lag > 10m`
[07:30:34] I was thinking of the sli query as operating on pooled hosts, but I think it's actually just "hosts that got at least 1 request in the last 5 minutes". So perhaps that could be counting hosts that aren't actually in service
[07:32:20] I think the idea was: "was the rate of public queries > 1qps in the last 5min"
[07:33:21] ah, that would make sense. hmm, that would definitely only catch actual in-service servers then, since just liveness probes or w/e wouldn't trip that threshold
[07:36:06] We can probably do some prometheus magic and get a query to actually spit out the hostnames of the hosts that it thinks are in violation. that would probably go a long way to figuring out if it's behaving as expected
[07:36:10] * ryankemper will pick this up again tmrw morning
[07:36:15] thanks!
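(For illustration, a minimal sketch of the kind of "prometheus magic" mentioned above, assuming hypothetical metric names like `blazegraph_lastupdated` and `wdqs_external_requests_total` rather than the real SLI expression. The relevant PromQL detail: a plain comparison such as `> 600` acts as a filter and keeps each series' `instance` label, whereas `< bool 600` only yields 0/1 per series and can never say *which* host is lagging.)

```sh
# Hedged sketch: list hosts whose update lag exceeds 600s while they are actually
# serving public traffic. Metric names, label names and the Prometheus URL are
# assumptions, not the production configuration.
curl -sG 'https://prometheus.example.org/api/v1/query' \
  --data-urlencode 'query=
    ((time() - blazegraph_lastupdated) > 600)
    and on (instance)
    (sum by (instance) (rate(wdqs_external_requests_total[5m])) > 1)' \
  | jq -r '.data.result[].metric.instance'
```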
[07:49:09] wow seems like we have the updater running again in dse-k8s!
[07:50:04] there are no leader-election k8s configmaps so I suspect it must be using zookeeper, otherwise it'd fail
[07:50:24] I might be wrong and we could inspect zookeeper to be sure I suppose
[07:50:38] inflatador: congrats!
[07:52:50] another easy test is to force a restart of the jobmanager to see if it picks up the latest checkpoint
[09:28:09] o/ dcausse: would you have time to discuss merging redirect updates? I ran into something while working on the late fetch patch.
[09:45:49] lunch
[09:49:19] pfischer: sure
[09:50:00] After lunch (if you are about to leave)?
[09:50:38] I have a couple minutes now but ~2pm is also good
[09:53:17] meet.google.com/itj-yojk-ehe
[10:45:39] lunch
[13:14:11] o/
[13:14:43] o/
[13:14:53] errands, back in 30’
[17:05:32] workout/lunch, back in ~90
[17:34:54] going offline
[18:31:50] back
[19:13:16] tiny puppet patch for the new search-loaders if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/957336/
[19:16:54] ryankemper merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/954350
[19:17:33] inflatador: ty, had forgotten about that one
[19:17:52] np, just cruising the board
[19:35:13] what creds does https://integration.wikimedia.org take? wikitech dev?
[19:39:01] should be wikitech. ya
[19:54:16] * ebernhardson wonders why CI passes helm-linter but when i run the same container locally it fails validation :P
[20:05:47] (╯°□°)╯︵ ┻━┻
[20:07:59] the answer is...weird magic they have going on regarding diffs and choosing the parent to diff against? rebasing the patch made it stop doing weird things (it was also changing a version number from 0.4.5 to 0.4.4 in the diffs, even though the patch had no related change)
[20:09:12] i guess it might be diffing against the prod chart available at https://helm-charts.wikimedia.org/stable
[20:13:29] * ebernhardson is still figuring out how exactly the test suite here works. I was surprised to find the test classes are all hidden under .rake_modules/, i had somehow assumed those were vendored
[20:26:14] unrelated but I'm going to merge both patches in this chain shortly: https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/956445 . We already added puppet support in https://phabricator.wikimedia.org/T343856
[20:27:36] per https://phabricator.wikimedia.org/T344284#9161787 it looks like the java side is ready too, so ryankemper and I should be able to roll a WDQS deploy for these changes
[20:27:55] inflatador: you'll need to do the release process (through jenkins), but yea should be ready
[20:30:29] ebernhardson understood, I guess this'll be our first deploy with the new canary
[20:43:59] OK, buildin' https://integration.wikimedia.org/ci/job/wikidata-query-rdf-maven-release-docker/116/
[20:44:24] Hmm, found a relevant comment but i still don't understand... "# Write the generated fixtures. Please note that this won't affect admin fixtures"
[20:58:29] ???
[21:11:17] oh, i was still trying to understand why the fixtures didn't work for david, or more generally how exactly this repo assembles and runs its test suite
[22:18:27] oh yeah, I got that...just giving a general "WHA?" to that comment
[22:18:54] one more host to go on the wdqs deploy
[22:20:34] wdqs1016.eqiad.wmnet failed because it's a test server, we might need to fix that in the cumin aliases
[22:22:10] nah, that's applied by puppet roles and they're correct. hmm
[22:22:27] anyway, a mystery for another day...see ya tomorrow
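(One possible way to chase that last mystery, sketched with placeholder alias names — the real cumin aliases and roles would need to be checked against the cumin config; `--dry-run` only resolves hosts and runs nothing:)

```sh
# Hedged sketch: compare what the deploy alias resolves to against the test hosts.
# 'wdqs-all' and 'wdqs-test' are placeholder alias names, not the production ones.
sudo cumin --dry-run 'A:wdqs-all and not A:wdqs-test'   # hosts the deploy should target
sudo cumin --dry-run 'A:wdqs-test'                      # test hosts (e.g. wdqs1016) to leave out
```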