[08:03:01] o/
[08:11:52] o/
[08:19:17] o/
[08:49:51] o/
[08:50:08] gerrit issues? https://downforeveryoneorjustme.com/gerrit.wikimedia.org
[08:53:25] works for me?
[08:57:33] works for me again
[10:09:09] gehel: is it possible to move the "2025.03.01 - 2025.03.21" column to the left of the search backlog so that it's visible without scrolling to the right?
[10:14:49] oops, I think I moved it while messing with the phab buttons, please let me know if that does not work for you and I'll move it back to the right
[10:19:49] It works for me! Thanks for the improvements!
[10:39:38] errand+lunch
[10:43:46] together with Trey314159 I'm investigating some weirdness with session lengths reported in a/b test results. It boils down to events with `client_dt` set years in the past wrt their meta.dt and dt (e.g. client_dt=2006-02-21T16:00:19.674Z for events recorded in 2025-01). Do you have any pointers wrt how this field is recorded? I have a couple of hypotheses I want to validate.
[10:44:20] For now, I'm off to CodeSearch. :)
[11:42:38] lunch+errand
[13:10:00] gmodena: pretty sure it's from the client browser, so something we should expect to have some weird values in there. Erik might know more, but I thought we were doing some cleanups there
[13:11:02] looking at a past A/B test https://people.wikimedia.org/~ebernhardson/T377128/T377128-AB-Test-Metrics-WIKI=dewiki.html#session-length and the one you ran https://people.wikimedia.org/~gmodena/search/mlr/ab/2025-02/T385972-AB-Test-Metrics-2025-02-WIKI=dewiki.html#session-length it seems like we don't remove these outliers
[13:17:48] hm... apparently we switched traffic to 100% eqiad, search latencies were not super happy :/
[13:19:41] morelike cache hit rate dropped to 29% from ~47%
[13:21:50] ah no wait, that was 100% codfw
[13:26:26] actually the perf in codfw (handling 100% of the traffic) was reasonable...
[13:26:49] the alert that fired was on eqiad... seems there were still a couple of slow queries running there
[13:27:10] dcausse I've modified the a/b test code a bit to remove outliers. I don't think those records invalidate the analysis, but they make the plot a bit messier to read
[13:27:16] perhaps we should have a qps threshold on the p95 latencies alert?
[13:27:39] gmodena: thanks!
[13:29:15] dcausse re latency: was it just a false positive? +1 on qps thresholds (in general).
[13:30:33] gmodena: I think so, eqiad fired an alert but it was serving something like 5qps compared to the usual 550qps
[13:30:48] ah!
[13:30:52] I'm not sure what these queries are if eqiad was depooled
[13:49:58] speaking of eqiad, is there something to look out for - or manual action to take - during the DC switchover period?
[13:51:34] gmodena: not really, I think what could go wrong generally is search latencies and overloading the cluster
[13:52:22] generally latencies might get worse for a little bit while the morelike query cache fills up
[13:52:58] but apparently today everything went well, I barely noticed a slowdown in p95 in codfw when serving 100%, so that's good news
[13:53:12] but that's not the busiest time of the day
[13:53:59] the morelike query cache is visible in the elasticsearch percentiles dashboard under "Per-Query Type Metrics"
[14:31:41] dcausse ack. nice!
[14:43:50] o/
[14:46:31] how can we get in touch with Scholia? Was wondering if we should ping them to test the new full graph legacy endpoint (ref T384422)
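A scripted check could confirm the endpoint is answering before Scholia is pinged. This is a minimal sketch only: it assumes query-legacy-full.wikidata.org exposes the usual WDQS-style /sparql path and SPARQL JSON results, neither of which is confirmed in the discussion above.

```python
# Hedged smoke test for the legacy full-graph endpoint. Assumptions: the
# service exposes a standard /sparql path and honors the SPARQL JSON
# results media type; adjust the path, query, and User-Agent as needed.
import requests

ENDPOINT = "https://query-legacy-full.wikidata.org/sparql"  # assumed path
QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5"  # tiny query

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "search-platform-smoke-test/0.1 (example)",
    },
    timeout=60,
)
resp.raise_for_status()
rows = resp.json()["results"]["bindings"]
print(f"got {len(rows)} rows, e.g. {rows[0]['item']['value']}")
```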
[14:46:33] T384422: Provide a low availability / scalability full graph endpoint to ease the transition to a split graph - https://phabricator.wikimedia.org/T384422
[14:47:42] inflatador: gehel or myself can contact them, please let us know once it's available
[14:49:47] I think it's ready now, but we can verify that at pairing. https://query-legacy-full.wikidata.org/
[14:50:10] nice!
[14:52:34] they might join us at our office hours!
[14:52:55] Otherwise, I hope to send the general communication on Monday
[15:04:40] \o
[15:04:49] o/
[15:07:09] gmodena: re meta.dt, essentially we have two timestamps and they both suck. the top level dt comes from our infra and is probably correct, but it includes artifacts of event delivery in the timing. So if the browser sendBeacon holds an event for 20s before sending, that 20s is added in. The other is meta.dt which comes from the client clock, but click clocks are notoriously unreliable
[15:07:34] essentially if we want a delta, it seems like meta.dt is the right one. But we need to simply throw out outliers at some point i suppose
[15:08:34] s/click clock/client clock/
[15:09:15] o/
[15:09:33] I can’t make it to the Wednesday meeting today.
[15:09:51] kk
[15:20:02] ebernhardson ack. In this case `meta.dt` and `dt` seem correct wrt the a/b session, it's `client_dt` (a third timestamp) that is sometimes way off. Is that one the top-level browser timestamp?
[15:20:36] actually, how do we record `dt`? Afaik it should be producer time (the browser in this case)
[15:21:25] gmodena: hmm, i guess it's been a few months, perhaps i'm mixing them up, there is certainly one from the browser and one from our infra
[15:21:35] there is a third as well? hmm
[15:22:58] hmm, so indeed dt, client_dt, and meta.dt.
[15:23:21] might have to poke otto to remember the distinction between dt and meta.dt
[15:23:42] i'm pretty sure one of them comes from the infra on our side that receives the events from the beacon
[15:23:49] meta.dt (infra) and dt are event platform conventions (https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas#Timestamps:_meta.dt_and_dt). client_dt should be the browser one (based on how we use it for reporting)
[15:23:52] the server side ingestion basically
[15:24:19] that should be meta.dt
[15:24:49] dt should be event time (by convention), that is set by the event producer
[15:25:03] in general, i think i've been trying to use the client side ones for any deltas, because when i looked into the server side timestamps they had artifacting
[15:25:13] so the producer and our infra agree. But the browser is way off.
[15:25:40] seems reasonable to me, modulo outliers
[15:25:48] i suppose we could throw out ones where the diff between the client side delta and the server side delta is too large?
[15:25:53] but with client generated timings that's to be expected
[15:26:27] like if the client says 12s and the server says 42s, that's fine. If the client says 42 days and the server says 12 minutes, throw it out
[15:26:32] ebernhardson that's my hunch too. I'm fiddling with session logs to find reasonable thresholds.
[15:27:30] looking at client_dt deltas alone is not enough. I found a couple of examples of reasonably short sessions, but with client times set to 1996 :)
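A rough sketch of the filter suggested above (drop sessions whose client-side and server-side session lengths disagree badly), assuming a pandas frame of events with hypothetical `session_id`, `dt`, and `client_dt` columns already parsed as timestamps. The column names and the tolerance are placeholders, not what the actual a/b test code uses, and the threshold would need tuning against real session logs.

```python
# Hedged sketch: keep only sessions where the client-reported session length
# roughly agrees with the server-side one. Column names and the 5-minute
# tolerance are assumptions for illustration.
import pandas as pd

def plausible_sessions(events: pd.DataFrame,
                       tolerance: pd.Timedelta = pd.Timedelta("5min")) -> pd.DataFrame:
    spans = events.groupby("session_id").agg(
        server_len=("dt", lambda s: s.max() - s.min()),        # infra timestamps
        client_len=("client_dt", lambda s: s.max() - s.min()), # browser timestamps
    )
    ok = (spans["server_len"] - spans["client_len"]).abs() <= tolerance
    return events[events["session_id"].isin(spans.index[ok])]

# usage: filtered = plausible_sessions(events_df)
```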
[15:29:13] wikidatawiki_content replicas is set to 0 on cloudelastic :/
[15:30:27] there are a couple of indices with replicas=0 there :(
[15:35:10] :S how could that happen
[15:35:22] i guess it could have been manually set at some point, but cirrus shouldn't be configured that way
[15:35:59] ebernhardson: if you're around we're in https://meet.google.com/aod-fbxz-joy?authuser=0
[15:36:01] we brought the host back, so no data loss
[15:36:21] but it might be safer to reset replicas to 1 on all indices
[15:36:26] dcausse: i actually have to do a school run in ~10min, but back by 8
[15:36:29] np!
[15:50:48] the 0-replica shards turned out to be orphan aliases
[16:00:15] hmm, i guess we need better cleanup for those. Or maybe they are from before we last improved the cleanup? Cirrus used to not delete failed reindex attempts but should now (but maybe not 100%)
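For reference, if replicas ever do need to be reset in bulk, something like the sketch below could do it. It only relies on the standard `_cat/indices` and index settings APIs (present in both Elasticsearch 7 and OpenSearch); the base URL is a placeholder and auth/TLS handling is left out.

```python
# Hedged sketch: find indices reporting 0 replicas and bump them back to 1.
# The base URL is hypothetical; point it at the right cloudelastic cluster
# and add credentials/CA settings as needed.
import requests

BASE = "https://cloudelastic.wikimedia.org:9243"  # placeholder cluster endpoint

indices = requests.get(
    f"{BASE}/_cat/indices",
    params={"h": "index,rep", "format": "json"},
    timeout=30,
).json()

for idx in indices:
    if idx["rep"] == "0":
        print(f"resetting replicas on {idx['index']}")
        r = requests.put(
            f"{BASE}/{idx['index']}/_settings",
            json={"index": {"number_of_replicas": 1}},
            timeout=30,
        )
        r.raise_for_status()
```

(In this particular case the 0-replica shards turned out to be orphan aliases, so deleting them was the actual fix; the sketch is only for the "reset replicas to 1 everywhere" fallback mentioned above.)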
[16:01:05] Office Hours: https://meet.google.com/vgj-bbeb-uyi
[16:17:05] gmodena & ebernhardson: I was in another meeting while you guys were discussing timestamps... but it looks like you figured out the source and have a reasonable plan for getting rid of the ridiculous outliers. Thanks for looking into it!
[16:29:18] Trey314159 np! It's been an interesting rabbit hole.
[17:05:14] heading out
[17:05:22] .o/
[17:06:05] cloudelastic1007 finished reimaging, puppet is failing with what looks like tmpfiles issues, will check once I get back from my workout
[17:15:04] https://query-legacy-full.wikidata.org/ working properly now
[18:02:16] back
[18:12:31] re: cloudelastic, looks like we're getting all the same failures we did in relforge. Not sure why, I thought we fixed 'em. Ref https://wikimedia.slack.com/archives/C055QGPTC69/p1739975790077999?thread_ts=1739962604.172969&cid=C055QGPTC69
[18:17:24] ah, one of the failures is due to the S3 credentials, just updated private puppet
[18:22:58] OK, cloudelastic1007 has joined the cluster as an OpenSearch node. Y'all may want to keep an eye out for plugin-related log messages. I haven't seen any so far, but that's what I'm more worried about when running a mixed cluster
[18:33:42] lunch, back in ~40
[19:04:47] back
[19:16:57] Not sure why we still have problems with tmpfiles. The module claims to run the correct command: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/systemd/manifests/tmpfile.pp#38
[19:17:37] and this is where we pull it in: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/opensearch/manifests/instance.pp#305
[19:18:23] I guess I'll check how we do it on Elastic. I remember it being a problem, but we must've fixed it
[19:21:27] Looks like the elastic code is identical: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/elasticsearch/manifests/instance.pp#294
[19:34:26] I'm going to physical therapy for an hour or so, if anyone has time to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124863 it would be appreciated! The PCC failure is because PCC doesn't know about newly-reimaged hosts unless you manually regenerate the facts
[20:28:12] +1'd
[21:05:13] back
[21:34:29] hmm, i deployed a change to both the flink-app chart and the SUP to utilize the new functionality, but upon applying it the consumer didn't restart. Tried setting restartNonce=2, which is a change, and it still didn't restart the consumer :S
[21:34:43] maybe i'm simply forgetting something due to lack of recent practice :P
[21:36:40] maybe something doesn't pass the CRDS validation? but i'm not sure where to look, and i might have expected that to fail during apply
[21:45:04] ahh, it is buried somewhere. paging through `kubectl get -o yaml flinkdeployment` finds a bit where the operator complains "Invalid log config key: log4j-overrides.properties. Allowed keys are [log4j-console.properties, logback-console.xml]". Which is a bit tedious... i would have expected that to be in the CRDS definition if it only accepts two keys. The CRDS say it's basically a Map
[21:45:28] * ebernhardson was hoping not to have to overwrite the entire log4j-console.properties, but i guess we have to
[21:46:46] Booo
[22:13:28] starting to see some failed shard allocations on cloudelastic, looks like it's b/c you can't migrate shards back to ES 7 after they go over to opensearch
[22:13:32] https://etherpad.wikimedia.org/p/cloudelastic-2-opensearch
[22:13:48] inflatador: yes, that sounds expected
[22:13:52] agreed
[22:15:00] * ebernhardson will someday remember to update chart versions when updating a chart...
[22:56:06] * ebernhardson is just breaking the logging even more... now it's not json output :P
[23:08:13] ryankemper it looks like the cookbook reimage for cloudelastic1008 is gonna fail. I have to head out soon, but if you wanna try it again, it's running in a tmux window called 'cloudelastic-reimage' on my user@cumin2002
[23:12:12] the cloudelastic consumer also seems to be having issues, i think it's this bit: Caused by: org.elasticsearch.client.ResponseException: method [POST], host [https://cloudelastic.wikimedia.org:9443], URI [/_bulk?timeout=120000ms], status line [HTTP/1.1 503 Service Unavailable]
[23:13:30] maybe it's using the elastic library and opensearch doesn't understand it?
[23:14:28] that probably wouldn't cause a 50x though
[23:14:40] not really sure sadly :S
[23:14:55] it seems possible, but i thought they did some related testing with relforge?
[23:15:25] I think so, but we'd have to ask d-causse to be sure
[23:15:59] all right, gotta go get my son. Will peek in later tonight on the reimage progress
[23:19:27] hmm, along with the 503 error we get: java.lang.IllegalStateException: Unsupported Content-Type: text/plain
[23:19:32] would be nice if it would log the content :P
[23:19:46] maybe i can get it out of the sockets on one of the servers...
[23:25:16] it's not completely failing of course, since most servers are fine, seems to depend on how lucky it gets...
[23:50:27] hmm, well, tcpdumping 9(2|4|6)00 on cloudelastic1007 (the only opensearch instance in the cluster) doesn't seem to have captured any non-http-200 responses during a failure :S Maybe I'll try again tcpdumping all cloudelastic hosts
[23:54:09] i suppose the other answer is that it came from nginx instead of elastic, but that's harder to snoop on since it's TLS
[23:54:20] (and it is common for nginx errors to come out as text/plain)
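Since tcpdump hasn't caught one yet, another option might be a client-side probe that logs the body the Java client throws away: poll the same host:port from the error message and print the status, Content-Type, and body of anything that isn't a 200. This is a rough sketch under the assumption that a harmless read-only GET through the LVS/nginx endpoint takes the same path as the failing /_bulk writes (which may not hold) and that no extra auth is required.

```python
# Hedged sketch: repeatedly probe the cloudelastic endpoint and dump any
# non-200 response with its Content-Type and body, to help tell whether the
# text/plain 503s come from nginx/LVS or from elastic/opensearch itself.
# The path is a placeholder: the real failures are on POST /_bulk, but a
# read-only endpoint avoids writing data during the probe.
import time
import requests

URL = "https://cloudelastic.wikimedia.org:9443/_cluster/health"

for attempt in range(1000):
    try:
        resp = requests.get(URL, timeout=10)
    except requests.RequestException as exc:
        print(f"{attempt}: transport error: {exc}")
        continue
    if resp.status_code != 200:
        print(f"{attempt}: HTTP {resp.status_code} "
              f"Content-Type={resp.headers.get('Content-Type')!r} "
              f"body={resp.text[:500]!r}")
    time.sleep(0.2)
```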