[09:31:17] dcausse: we're in https://meet.google.com/nac-nzao-gpm with Leszek. Gabriele: if you want to have a last discussion about Search, feel free to join
[09:31:27] oops
[09:31:39] dcausse: you're not late yet! (cc: gmodena)
[10:44:45] lunch
[11:31:33] gehel: I tried to pull a report from phabricator (using conduit) of things we've been working on over the last quarter, but apparently that is not a trivial task. How do you usually gather such a report?
[11:36:49] pfischer: mostly from memory, with some help from https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates
[12:10:58] pfischer: want to pair on preparing those slides?
[12:30:20] Sure, I have to pick up Luise from daycare but I'll be back by 3:15
[12:31:09] ping me when back!
[13:13:54] o/
[13:20:13] I added rack/row awareness to relforge puppet code, but I'm guessing it doesn't take effect until after a service restart? checking...
[13:26:42] dcausse: Is anyone working on SUP at the moment? I see unstable warnings
[13:26:52] pfischer: it's me
[13:45:33] I'm in a weird state.... producer & consumer-search failing with "Caused by: java.io.IOException: Target file s3://cirrus-streaming-updater.wikikube-eqiad/consumer-search/checkpoints/1c39201d2b4f315e6efd4f5b56c03af4/chk-219389/_metadata already exists."
[13:46:30] suspended both jobs to get a better view
[13:54:06] dcausse: Shall we pair? Could we drop the state?
[13:54:41] pfischer: sure https://meet.google.com/aqs-cxnf-tor
[13:55:38] ryankemper: https://phabricator.wikimedia.org/T380934 is unblocked if you wanna work on it when you get in
[13:56:30] I also just popped T390565 for the relforge decommission, but that will take a bit more work
[13:56:31] T390565: decommission relforge100[34] - https://phabricator.wikimedia.org/T390565
[14:09:32] I'm roll-restarting cloudelastic now to apply the new plugins...relforge will take a while since it still has 1G hosts
[14:09:59] I'm banning them to decommission at the same time
[14:12:51] \o
[14:13:34] o/
[14:29:35] Trey314159: chrome is crashing. Hopefully we were done...
[14:30:09] gehel: I think so. good luck resurrecting chrome
[14:53:07] .o/
[14:55:52] Cloudelastic restart is done, so the vector search plugin should be available. I'm about to do relforge as well
[14:56:28] nice! i still haven't managed to convince CI to build the image :S But i suspect it's something to do with the build host, going back to older blubber versions didn't help
[14:57:31] it has such a useless error message, failing very early in the build (looks like while reading the dockerfile): `error: failed to solve: exit code: 2`
[14:57:53] booo
[14:58:17] if you still need help this afternoon LMK. I doubt I'd be much use, but happy to take a look
[14:58:26] dcausse: switching to savepoints as upgrade mode is one of the recommendations after schema changes
[14:59:57] yes I think it might be safer to use that all the time
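A minimal sketch of the stop-with-savepoint / restore-from-savepoint flow the recommendation above refers to, expressed with the plain Flink CLI. The SUP jobs are managed through the Flink Kubernetes operator in production, so this only illustrates the mechanism; the savepoint path, job id and jar name are placeholders, not the real deployment commands.

```bash
# Stop the job and write a savepoint first, instead of relying on the last
# retained checkpoint (placeholder bucket path and job id).
flink stop --savepointPath s3://<bucket>/savepoints <job-id>

# Resubmit the (possibly schema-changed) job from that savepoint.
flink run -d -s s3://<bucket>/savepoints/<savepoint-dir> <job-jar>.jar
```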
[15:00:38] my best guess is the build hosts are doing something wrong, maybe perms? The next message in the log should have been about transferring the local context into the build (which historically was 2 bytes, so basically nothing)
[15:01:52] :/
[15:02:28] i should probably just file a ticket, all the info i could find about this basically says it's a buildctl problem, maybe invocation args or permissions
[15:02:51] but that's all done by kokkuri
[15:04:55] yeah, sounds like a ticket for relforge is probably the best
[15:05:43] oh not relforge, this is our docker image failing to build and publish in CI. It builds fine in test, but then when trying to publish an image to the docker-registry it fails to even start the build
[15:06:10] sorry...I meant to say 'releng'
[15:43:36] relforge restart is finished
[15:59:29] curiously, reverting all the way back to blubber 0.16.0 let the image build. 1.3.20-5 is on docker-registry.wm
[16:01:39] weird...
[16:03:12] dcausse: pfischer! from staff meeting. "modernized Flink job with wikimedia-event-utilities". Is WDQS using event-utilities with the EventRowTypeInfo stuff?
[16:03:22] ottomata: yes
[16:03:24] no more case classes?!
[16:03:30] still some :)
[16:03:43] wow I did not know yall were doing that! very cool!
[16:03:58] * ebernhardson is waiting for our LLM overlords to migrate the scala to java
[16:03:59] would love to peruse a patch or ticket, got a link?
[16:04:09] sure
[16:05:39] ottomata: mainly https://gerrit.wikimedia.org/r/q/topic:%22output-schema-v2%22
[16:21:46] ryankemper: for when you have time, could you deploy https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/1112251?
[16:27:57] Trey314159: for some reason I thought you had written some notes on the broken lucene boolean logic but can't find anything in your notes, do you have something about that somewhere?
[16:28:42] oddly, i also looked for that last week and didn't find it
[16:38:47] dinner
[16:47:50] dcausse & ebernhardson: the broken boolean stuff is not in my notes, it's here: https://www.mediawiki.org/wiki/Help:CirrusSearch/Logical_operators
[16:54:44] ahh, i suppose that makes sense. I was searching in the user namespaces with intitle filters
[17:05:58] * ebernhardson will someday remember there is a 'run puppet compiler' button before sending a manual `check experimental` code review
[17:32:12] curious graph, a low level of consistent cirrus failed requests since 12:50 yesterday. Not sure it's meaningful, but curious: https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=now-2d&to=now&viewPanel=9
[17:36:53] what, the bot?
[17:38:58] perhaps, but bad queries should be classified as errors, failures should be internal things.
[17:39:29] or i guess we call them "rejected" in the graph, not errors
[17:40:26] hmm, logstash has some not happy things: Elasticsearch response does not have any data. upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[17:40:47] much higher than 2/s though :P
[17:41:30] oh, well actually maybe. 2286 messages in the last 15 minutes, which gives 2.5/s
[17:49:58] it all forwards through envoy, in a quick test of 10000 requests to the banner on each port (localhost:610{2,3,4}), 6102 failed 5 times, 6103 failed 4 times, 6104 failed 0 times.
[17:52:34] i suppose plausibly (hopefully!) envoy is reusing connections, so the ratios aren't quite right. The envoy dashboard gives a 10-20% connection fail rate
[17:52:49] looks like some host has a bad TLS cert, will see if i can figure out which one
[17:53:10] envoy might know, but not sure how to find out :P poking
[17:55:42] i guess envoy doesn't know, it talks to LVS
[18:00:59] might have been 1068, i've just depooled it. Will see if the errors stop
[18:09:01] hmm, well 1068 is certainly failing tls, but depooling it didn't change the error rate :(
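Since requests through the LVS service address only hit the bad backend occasionally, checking each backend's certificate directly narrows things down faster than repeating curl against the VIP. A sketch, assuming the hosts terminate TLS on port 9243 and the usual naming; the host range in the loop is an example, not the real pool.

```bash
# Inspect the cert served by one specific backend rather than the service IP.
echo | openssl s_client -connect elastic1068.eqiad.wmnet:9243 2>/dev/null \
  | openssl x509 -noout -subject -dates

# Or loop over a (hypothetical) range of pooled hosts to find the stale one.
for h in elastic10{50..80}; do
  printf '%s: ' "$h"
  echo | openssl s_client -connect "$h.eqiad.wmnet:9243" 2>/dev/null \
    | openssl x509 -noout -enddate
done
```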
[18:10:56] annoyingly, curl will print subject/start/expire date on a successful connection. But on a failed connection it just says "bad, no connection for u"
[18:11:05] s/failed connection/expired cert
[18:12:14] inflatador: not sure what to do with it, elastic1068 has an expired TLS cert: https://phabricator.wikimedia.org/P74506
[18:12:34] i ran `depool` on elastic1068, but i'm still able to connect to it when using https://search.svc.eqiad.wmnet:9243/ (just repeat enough times till it fails)
[18:13:05] test with: curl -vvkI https://search.svc.eqiad.wmnet:9443/ 2>&1 | grep -e 'subject:' -e 'start date:' -e 'expire date:'
[18:15:59] shrug, ran the same depool command again. Maybe this time :P
[18:23:48] * ebernhardson realizes it's because the right command is `sudo depool`, `depool` fails silently with no error message
[18:24:57] that does appear to have silenced the messages...will make a ticket to get the TLS fixed and the server repooled
[18:25:56] Trey314159: thanks for the link!
[18:26:16] dcausse: sure thing!
[18:26:32] ebernhardson: good catch!
[18:27:30] I wonder if we got alerted by some other means for this cert expiration
[18:28:51] it's partially odd because the certs are only good for ~30 days, so there must be an auto-renew thing. Also most of the other certs expire Apr 16, only this host expired Mar 30
[18:29:18] dcausse: while you're here, which months is your sabbatical?
[18:29:26] thinking about summer vacations :)
[18:29:38] ebernhardson: june and july
[18:29:42] kk, thanks
[18:41:49] there is still a (much much lower) rate of "Elasticsearch response does not have any data. upstream connect error or disconnect/reset before headers. reset reason: connection failure"...not sure what to do with that
[18:42:02] feels like when the two sides disagree on keeping the connection open
[18:42:33] but more like 15/minute
[19:09:32] * ebernhardson tries to understand the difference between envoy TCP proxy idle_timeout (1hr), and TCP protocol idle_timeout (10min)
[19:17:40] i also wonder why keepalive is either not specified, or set to 4s, 4.5s, or 10s. I suppose i was expecting more like 600s
[20:37:51] back
[20:39:42] ebernhardson: thanks for reporting this. I guess we need to look at our health checks, IMHO we should automatically depool hosts under these circumstances
[20:41:37] inflatador: i thought so too, it's in the ticket AC :)
[20:41:48] although i guess i said "Consider" instead of a requirement
[20:42:52] yeah, I'm still waking up, but I guess SSL certificate verification doesn't cause an HTTP error...so that probably wouldn't be the load balancer's job. Hmm
[20:49:57] The cert on elastic1068 is current: `Not Before: Wed Mar 19 12:19:00 UTC 2025 Not After: Wed Apr 16 12:19:00 UTC 2025`. reloading nginx fixed the problem, but I'm pretty sure Puppet should be doing that
[20:56:52] hmm, that's curious. So it suggests nginx wasn't properly restarted when the certificate was refreshed. Maybe the puppet run failed early, then the next run didn't know to restart nginx?
[20:57:09] checking if puppet logs go that far back
[20:57:22] Yeah, that's my guess as well
[20:57:48] hmm, no, the oldest puppet log is Mar 24
[20:58:21] kinda surprised that we don't get SSL certificate checks "for free", but that may be a consequence of us using nginx for TLS termination. I think every other role besides cirrus is on envoy
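A quick way to confirm the "renewed on disk, but nginx still serving the old cert" theory is to compare the certificate on the wire with the one on disk. A sketch only: the on-disk path is a placeholder and would need to match wherever Puppet drops the cert on these hosts.

```bash
# Fingerprint/expiry of the cert nginx is actually serving...
echo | openssl s_client -connect localhost:9243 2>/dev/null \
  | openssl x509 -noout -enddate -fingerprint

# ...versus the cert on disk (placeholder path).
sudo openssl x509 -noout -enddate -fingerprint -in /etc/ssl/localcerts/<cert-name>.crt

# If the fingerprints differ, the renewed cert never made it into the running nginx.
sudo systemctl reload nginx
```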
[21:04:00] hmm, puppet ran at 11:53, then 12:23, can see the certificate renewal ran. Says it scheduled a refresh of Exec[nginx-reload]
[21:04:28] then it failed with: (/Stage[main]/Profile::Tlsproxy::Instance/File[/etc/systemd/system/nginx.service.d/security.conf]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/tlsproxy/nginx-security.conf: Request to
[21:04:29] https://puppetserver1003.eqiad.wmnet:8140/puppet/v3/file_metadata/modules/profile/tlsproxy/nginx-security.conf?links=manage&checksum_type=sha256&source_permissions=ignore&environment=production failed after 0.002 seconds: Failed to open TCP connection to puppetserver1003.eqiad.wmnet: 8140 (Connection refused - connect(2) for "puppetserver1003.eqiad.wmnet" port 8140)
[21:04:57] which caused the rest to give an error about skipping because of failed dependencies
[21:05:15] (from syslog.12.gz)
[21:05:33] not sure what to do with that :P
[21:05:36] ah, we do have SSL certificate alerts BTW
[21:06:23] seems like it should have retried that connection a few times, instead of failing once in 2ms
[21:07:31] And we did get an alert on Mar 27th, so I guess I need to pay more attention to those ;(
[21:12:47] seems like SSL cert issues should be fairly loud, maybe worth creating a ticket?
[21:28:06] I'd be OK with that. The bigger issue is that we have a lot of noisy alerts in IRC that drown out the important ones
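For what a louder check could look like: a minimal expiry probe against the served certificate (an on-disk check would not have caught the stale-nginx case). Host, port and the 7-day threshold are examples, not the actual alerting setup.

```bash
# Exit non-zero if the cert presented on the wire expires within 7 days;
# openssl x509 -checkend returns 1 in that case, including already-expired certs.
if ! echo | openssl s_client -connect elastic1068.eqiad.wmnet:9243 2>/dev/null \
    | openssl x509 -noout -checkend "$((7 * 24 * 3600))" >/dev/null; then
  echo "CRITICAL: served TLS certificate expires within 7 days (or connection failed)"
  exit 2
fi
echo "OK: certificate valid for more than 7 days"
```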