[07:54:37] o/
[07:57:02] oo
[07:57:04] o/
[11:37:45] lunch
[11:48:34] lunch
[11:48:46] It seems that the issue we had yesterday is happening again... Cirrus has not been blamed yet ;)
[11:49:24] See -sre for details
[12:45:08] metrics on our end look ok (or no different than yesterday at least)
[12:57:07] need to go pick up Lukas from school. They called to tell me he's sick. I'll be afk for 20ish min
[12:58:28] gmodena: np, take care!
[13:06:36] o/
[13:30:51] back
[13:30:52] o/
[13:34:25] dcausse I got some vector search working locally (on dummy data / embeddings). Now it's meetings for the rest of the day =)
[14:08:30] \o
[14:17:11] o/
[15:16:53] ebernhardson looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126653 worked! (sample size of 1, but still...nice)
[15:17:00] workout, back in ~40
[15:18:43] nice!
[15:50:26] Cloudelastic is now 100% Opensearch (not counting the one failed node)
[15:56:55] gmodena: nice!
[15:57:17] \o/
[16:02:03] pfischer: retrospective in https://meet.google.com/eki-rafx-cxi
[16:52:23] dinner+kiddos. I'll prob be around later tonight
[17:01:00] * ebernhardson wishes spotless was better at word-wrapping comments...minor annoyance but still :P
[17:06:56] ebernhardson: scanning the fetch_error topic I'm seeing failures where the URI is auth.wikimedia.org; is this the same kind of redirect you found with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1127158?
[17:11:12] dcausse: hmm, in this case what was happening is when rendering the response to cirrus-sanity-check it was failing due to Title::assertProperPage failing (because the page can't exist in the db)
[17:11:33] might be different from what I'm seeing...
[17:11:38] the only ones i saw specifically were redirects to Special:*, but i could imagine them also being external wiki redirects, or even just invalid titles
[17:12:22] what I'm seeing is a bit weird but might come from page_change...
[17:12:55] zhwiki:User_talk:芯菱 ends up having a meta.uri=https://auth.wikimedia.org/zhwiki/wiki/User_talk:%E8%8A%AF%E8%8F%B1
[17:13:07] hmm, that's very weird
[17:14:57] dcausse: i think that the uris come from EventFactory::getArticleURL, i'm not sure how that would get auth.wikimedia.org. It should be resolving CanonicalServer
[17:15:14] i wonder if it has some interaction with the recent work to allow auth from any wiki? I remember seeing that somewhere...looking
[17:15:14] filing a task
[17:15:40] possibly related to T363695?
[17:15:41] T363695: Create a Wikimedia login domain that can be served by any wiki - https://phabricator.wikimedia.org/T363695
[17:15:56] (i don't understand how it works, but it seems at least plausibly in the same realm)
[17:16:13] sure
[17:21:49] not sure we use the uri to fetch the content but we certainly use the domain...
[17:26:12] filed T388825
[17:26:12] T388825: Some events in mediawiki.page_change.v1 refers to auth.wikimedia.org in meta.uri and meta.domain - https://phabricator.wikimedia.org/T388825
[17:30:37] sourcing the most recent event and checking logstash for the reqId, it finds messages from ptwiki where a captcha was submitted, a new global account was created, then some jobqueue stuff happens. Doesn't explain much, but does suggest auth is involved
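A minimal sketch of the kind of check being discussed: scan mediawiki.page_change.v1 from Kafka and pull out events whose meta.domain is auth.wikimedia.org (dcausse tallies namespaces this way a few messages below). The topic name, broker address, and the page.namespace_id / page.page_title field paths are assumptions, not verified against the schema.

```python
import json
from collections import Counter

from kafka import KafkaConsumer  # pip install kafka-python

# Placeholder topic/broker; the real stream is mediawiki.page_change.v1
consumer = KafkaConsumer(
    'eqiad.mediawiki.page_change.v1',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
    consumer_timeout_ms=10_000,  # stop iterating once we're caught up
)

ns_counts = Counter()
for msg in consumer:
    event = msg.value
    meta = event.get('meta', {})
    if meta.get('domain') == 'auth.wikimedia.org':
        page = event.get('page', {})  # field names assumed from page_change
        ns_counts[page.get('namespace_id')] += 1
        print(meta.get('uri'), page.get('page_title'))

# Expectation from the discussion: everything lands in namespace 3 (User talk)
print(ns_counts)
```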
[17:30:43] via reqId:"d12e1c67-8e6a-4853-bb7a-0b039f0ea00f"
[17:34:34] the few I've seen were welcome messages
[17:34:55] checking the last million events in kafka, it's always namespace 3 (ns_user_talk)
[17:35:08] so plausibly all welcome messages
[17:37:09] first is on 2025-03-06T17:00:28Z, req_id ab7cf506-1c0f-4544-b0c9-d573fc67d1c0
[17:37:58] well... that's 7 days ago, so perhaps that's simply kafka retention and not actually the first one
[17:38:55] cooking time, back later tonight
[17:42:55] Plausibly this is from the NewUserMessage extension
[17:45:09] don't really see anything fancy in the extension though, it's quite a simple job. Pulling a few jobs from kafka, the jobs look plenty normal
[17:47:50] Apparently having zero cloudelastic hosts in the elastic role caused a weird puppet failure
[17:48:51] which broke the deployment server puppet, which broke deployments...
[17:50:32] so I had to roll back https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127565 and run Puppet on cloudelastic1012 to get a clean Puppet run on the deployment server
[17:52:21] which means I installed Elasticsearch on top of opensearch ;). Amazingly, Puppet did not throw an error when installing Elastic. Now we'll still have to reimage again, but that's one for the Cabinet of Curiosities
[18:03:44] which part breaks? we can probably fix the cause
[18:05:05] oh i see it in wikimedia-sre, the problem is basically "${es_hosts[0]['certname']}"
[18:05:29] the naive solution would be to wrap everything related in an if statement that only does this if $es_hosts is non-empty
[18:05:47] it's a common pitfall: accessing the first element of an array without knowing whether the array has contents
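The pitfall ebernhardson describes, transposed into Python for illustration (the real code is Puppet; es_hosts stands in for the host query that came back empty once no hosts carried the elastic role):

```python
# es_hosts mimics the Puppet query result; empty after the OpenSearch migration
es_hosts = []

# Naive version: raises IndexError here, and fails the catalog compile in Puppet
# certname = es_hosts[0]['certname']

# Guarded version, i.e. "wrap everything related in an if statement":
if es_hosts:
    certname = es_hosts[0]['certname']
    # ... write out the yaml fragment that helmfile reads in ...
# else: skip emitting the fragment entirely when the role is empty
```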
[18:06:18] I think we can remove it since (AFAIK) nothing is using it
[18:07:33] But it'd be better to understand the intended use cases, and whether this (or something similar) might be needed in the future
[18:09:57] Like, is this just to allow us to access a specific Elastic host from a k8s pod? And is that something we want/need?
[18:10:30] those were added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024613 for T331894
[18:10:30] T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894
[18:11:33] it writes out yaml files that can then be read in by helmfile
[18:11:42] yeah, it makes sense for kafka or ZK, where you're connecting to multiple hosts. Not sure if that's really needed for Elastic
[18:11:58] they are needed for opening egress routes
[18:12:03] or at least, they were
[18:12:38] Do you think it would break the SUP if we removed this?
[18:13:04] depends how we get that info now...looking
[18:14:41] cool, Chesterton's Fence says we should probably keep it unless we're really sure
[18:16:38] looking at the networkpolicy that kubectl has, it's not clear we are listing all servers there
[18:18:35] * ebernhardson is not seeing how we open the egress to elasticsearch :P
[18:19:20] i suspect that is the `external_services` key in helmfile, from which we only reference kafka or zookeeper
[18:19:57] maybe via discovery.listeners
[18:21:14] which should source from `services_proxy`, which i think is just https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/hieradata/common/profile/services_proxy/envoy.yaml
[18:22:04] inflatador: my guess would be that it is unused
[18:24:56] i will be shipping a SUP update later today, i suppose if you want to rip it all out (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127588) i can run the deploy and see what happens in cloudelastic
[18:25:55] ebernhardson ACK, I should've told you about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127560 ;(
[18:26:18] :) it's like 5 minutes to write the patch, no big deal
[18:27:27] if it ends up being needed, i can always change that function to source from elastic + opensearch
[18:27:43] just have to change the source query
[18:28:10] Cool, I'm going to merge this now
[18:42:05] OK, that's merged/deployed. Taking a quick lunch and will work on the reimage after that
[18:58:11] seems reasonable, no change in helmfile diff from the removal
[19:16:54] Cool, let me know if you see any connection issues
[19:18:17] redeployed all the consumers to pick up a new container version, everything looks to still be happy
[19:22:11] Nice! Sounds like the calico stuff is working then
[19:33:54] heads up: I was talking with tchin earlier today, and he plans to start adding data lineage capabilities to (some of) our airflow dags
[19:34:22] this should be a no-op for us, and boil down to enabling a setting in SparkSqlOperator
[19:34:46] he asked what would be a good start, and I suggested query_clicks_daily and query_clicks_hourly
[19:35:07] Recency bias, because I did some troubleshooting there yesterday :)
[19:35:23] Holler if you disagree, or have better suggestions
[19:36:50] seems reasonable
[19:38:26] cool!
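For context, a hedged sketch of what "enabling a setting in SparkSqlOperator" might look like in the query_clicks_hourly dag. The import path and the `emit_lineage` flag are placeholders for whatever tchin actually implements, not real parameters of the WMF operator:

```python
import pendulum
from airflow import DAG

# Hypothetical import path; WMF's airflow-dags ships its own SparkSqlOperator
from wmf_airflow_common.operators.spark import SparkSqlOperator

with DAG(
    dag_id='query_clicks_hourly',
    start_date=pendulum.datetime(2025, 3, 1, tz='UTC'),
    schedule='@hourly',
) as dag:
    query_clicks_hourly = SparkSqlOperator(
        task_id='query_clicks_hourly',
        sql='hql/query_clicks_hourly.hql',  # placeholder query path
        emit_lineage=True,  # hypothetical flag; the real setting name is tchin's
    )
```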
[19:47:14] inflatador ryankemper i saw a few cloudelastic1012 alerts the last couple hours, but they appear to be transient and maybe have vanished at this point. All clear? I can reply back on the alerts list that these were transient if so.
[19:50:59] dr0ptp4kt yeah, I am reimaging cloudelastic1012 so this is expected
[19:51:07] sorry for the noise BTW
[19:57:42] no no, noise is fine! thanks inflatador !
[20:04:52] OK, cloudelastic1012 is reimaged again, and back in the cluster
[20:06:10] \o/
[20:29:03] * ebernhardson tries to figure out what the SUP will do if it asks for a document to be built but cirrus returns nothing...but not yet figuring it out :P
[20:29:27] context being, apparently cirrusbuilddoc is happy to build a document for a redirect, when those should not result in a document
[20:30:22] maybe it needs some sort of error response to tell SUP to throw away the attempted update, but still pondering
[21:27:59] inflatador: cloudelastic1011 puppetzeroresources - same thing there? i see it must have come out of downtime and did a cert renewal according to the sal. just scanning messages, so figured i'd check. sorry to bug if it's just transient noise to be expected
[21:29:04] dr0ptp4kt Not sure about that one, but I'm guessing it's transient
[21:29:51] cool, let's see if it goes away
[21:30:18] thx inflatador !
[22:22:18] I have been poking around in a hadoop table full of Action API data (event.mediawiki_api_request) and found something I did not expect, which I thought someone here might be able to reason about.
[22:22:20] The table records the ip, user-agent, and lots of details about each Action API request as measured inside MediaWiki itself. The surprising thing is that the "WMF/cirrus-streaming-updater-consumer-search" and "WMF/cirrus-streaming-updater-consumer-cloudelastic" User-Agents show up with ip=127.0.0.1
[22:23:16] I think these UAs are from flink jobs. So I guess my question is if it seems reasonable that a flink job is seen by MediaWiki as coming from the localhost?
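A sketch of the kind of look-up behind this observation, as a Spark SQL query over event.mediawiki_api_request. The http.client_ip / http.request_headers field paths and the year/month/day partition columns are assumptions about the api-request schema, not verified:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Field paths and partition columns are assumed, not verified against the schema
spark.sql("""
    SELECT http.client_ip,
           http.request_headers['user-agent'] AS user_agent,
           COUNT(*) AS requests
    FROM event.mediawiki_api_request
    WHERE year = 2025 AND month = 3 AND day = 13
      AND http.request_headers['user-agent'] LIKE 'WMF/cirrus-streaming-updater-consumer-%'
    GROUP BY http.client_ip, http.request_headers['user-agent']
    ORDER BY requests DESC
""").show(truncate=False)
```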