[10:18:21] lunch
[13:25:05] o/
[13:48:28] o/
[13:52:04] dcausse: configuring the max bulk request size for the ES sink required a code change. Whenever you have a moment: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/112
[13:52:33] pfischer: looking
[14:04:59] pfischer: lgtm!
[14:05:56] dcausse: thanks!
[14:33:51] If anyone has time to review https://wikitech.wikimedia.org/wiki/Search#Adding_new_masters/removing_old_masters LMK. Documenting our Elastic master change procedure
[14:37:23] errand
[14:41:46] Just asking questions to see if this spikes a memory from some of you: on wikibase cloud, using a variant of the polling updater, we have just discovered a wiki (or set of entities in that wiki) that seems to DoS the queryservice (not the updater), basically making CPU spike to the limit, and it won't recover without killing the queryservice instance and restarting it. By inspection some of the ttls of the entities are like 300k but not huge
[14:43:02] It *seems* to be lexemes rather than items or properties. We're still very much in the middle of investigating and trying to ensure restored service for our other users before resolving this oddity, but if you ever saw something like this before or had breadcrumbs it would be interesting to see
[14:59:46] \o
[15:04:56] o/
[15:14:22] tarrow: not really :( the two kinds of ddos we see are gc overhead and thread count explosion, but we always assumed that it was due to query usage, not the updater. gc overhead was mitigated with jvmquake and we have nothing yet against thread count explosion
[15:14:55] dcausse FWIW I'm 98% sure this is
[15:14:59] accidental
[15:15:14] but unclear to us what the cause is
[15:15:29] the only thing special regarding the updater for lexemes is that it fetches forms & senses for building the update query, could be a giant lexeme with many forms and/or senses
[15:15:51] sounds highly possible
[15:16:05] but we no longer use this updater so we might not see this problem on our side
[15:16:27] perhaps reducing the batch size of the updater might help?
[15:17:09] dcausse: very meta question really, but would it make sense to try harder to collaborate on putting this kind of logic into Wikibase core and not have it in the updater?
[15:17:17] or does the streaming updater already do this
[15:18:12] e.g. I never really understood why (AFAICT) fetching the forms and senses should be done in the updater rather than in php. Same thing with generating entity diffs
[15:19:13] tarrow: it's the nature of the old updater that tries to "reconcile" the state of blazegraph on every update, so that's why it wants to see what's inside blazegraph first
[15:19:55] ah, I thought the old updater just nuked all the old triples and then wrote the new ones
[15:20:05] (for each entity)
[15:20:25] for normal entities it's able to do this with a single "UPDATE WHERE" query and a DELETE, but for lexemes it apparently can't, and that's why it needs to ask blazegraph for some info prior to building the update query
[15:21:02] that's actually super helpful!
[15:21:21] that process sounds super suspicious for us being stuck in this loop
[15:21:34] the streaming updater is able to detect what has changed without asking blazegraph, so it's only in one direction, updater -> blazegraph, and not a conversation
[15:22:20] dcausse: but am I right in thinking that you had to reimplement RDF diffing in java for the change for each revision?
[15:22:39] rather than asking for an RDF diff from a php endpoint?
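As a rough illustration of the "ask blazegraph first" step described above for lexemes, here is a minimal sketch that looks up which form/sense subjects the triple store currently holds before an update query gets built. The endpoint URL, the query shape, and the helper name are illustrative assumptions, not the wdqs updater's actual code.

```python
import requests

# Assumed local blazegraph SPARQL endpoint; not the production updater's config.
SPARQL_ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"


def existing_sub_entities(lexeme_uri: str) -> list[str]:
    """Ask the triple store which form/sense URIs it currently holds for a lexeme."""
    # Wikidata form/sense IRIs extend the lexeme IRI (e.g. .../L42-F1, .../L42-S1),
    # so a prefix filter is enough for this sketch.
    query = f"""
    SELECT ?sub WHERE {{
      <{lexeme_uri}> ?p ?sub .
      FILTER(STRSTARTS(STR(?sub), "{lexeme_uri}-"))
    }}
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [b["sub"]["value"] for b in resp.json()["results"]["bindings"]]


# A giant lexeme makes this round trip return a very large result, which is the
# kind of conversation the streaming updater avoids by diffing revisions itself.
print(len(existing_sub_entities("http://www.wikidata.org/entity/L42")))
```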
[15:23:36] tarrow: yes, this was required for multiple reasons: 1/ we do transform the RDF in java (the infamous munger) 2/ the previous revision might not be the parent revision
[15:24:05] 1 could in theory be moved to wikibase but 2 requires a state
[15:25:05] where is that state stored if not in mediawiki?
[15:25:17] in flink (the stream processor)
[15:25:20] isn't flink "just" a stream?
[15:25:30] it's a stateful stream processor
[15:26:08] so you keep in flink state the last revision id for each page?
[15:26:14] it's stored on the local disk attached to the flink containers and stored durably in an S3-compliant object store
[15:26:19] tarrow: yes
[15:28:26] coolio; all a little bit beyond me; I had thought that playing through the stream was all that was needed
[15:30:14] tarrow: we did this mainly to solve scaling issues given the update rate of wikidata, perhaps not something you need yet. if you can narrow down the kind of updates that break blazegraph please let me know and I'd be happy to help
[15:30:45] dcausse: thanks! I was going to come to your office hours today; but it turns out it's only Tuesday :D
[15:30:59] lol :)
[15:33:13] and yeah; I hope we can isolate which entities are the cause; I can confirm that the raw ttl from wikibase *doesn't* DoS the service, so it's probably something to do with the updater, and the "suck in the sub-entity data" step sounds like a good place to investigate
[16:01:34] dcausse — meeting?
[16:01:43] oops
[16:01:53] workout, back in ~40
[16:11:42] ebernhardson: I was about to deploy a version of the SUP with the new ES sink, but for some reason the deployment-charts repo does not get updated (and I don’t have permission to do so manually). git status reveals two modified files in other subdirectories but I don’t know if this is preventing the automated (?) git pull
[16:12:18] Would you know how I can get the latest version of the charts on deploy1002?
[16:23:45] pfischer: in meeting now, but can check in a moment. I'm not 100% sure how it gets updated, i suppose i expect a systemd timer
[16:38:30] looks like it made it eventually. For reference it looks like it is a systemd timer, git_pull_charts in modules/helmfile/manifests/repository.pp. It claims to do a git pull every minute, can see the next/last invocations from `systemctl show git_pull_charts.timer`, but not sure how to see why it wasn't pulling earlier
[16:40:32] (journalctl could probably see, but i don't have that access)
[16:49:20] back
[16:52:44] ebernhardson: Thanks! I forced an update via a --set app.version override in the meantime. However, now the app is running into a circular redirect that does not get handled properly:
[16:52:57] Circular redirect to 'https://donate.wikimedia.org/w/api.php?action=query&format=json&cbbuilders=content%7Clinks&prop=cirrusbuilddoc&formatversion=2&format=json&revids=39547'
[16:55:05] hmm :S
[16:56:12] pfischer: i don't think i quite follow where the circular redirect is getting stuck
[16:57:12] ebernhardson: me neither, if I perform that GET request, I do get a 200 OK
[16:58:24] pfischer: i'm wondering if it's something special with donate wiki, i don't know anything about it but i wouldn't be surprised if it was special
[16:58:24] I wouldn’t expect envoy to inject redirects.
[17:00:08] sorry, ben back
[17:06:40] Dinner, back in 180’
[17:11:55] I'm still checking, but we may be able to use Debian's upstream version of the elastic curator pkg. It's old, but newer than the custom pkg we have in our repo
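Back on the updater discussion from earlier in the log, here is a toy stand-in for that per-page state: remember the last revision id handled for each page and drop anything older or duplicated. This is plain Python purely to show the idea; in the real pipeline the equivalent lives in Flink keyed state on the task managers' local disks and is checkpointed to an S3-compatible object store, as described above.

```python
# Toy model of per-page "last revision seen" state: skip events that arrive
# late or duplicated, so only forward progress reaches the downstream sink.
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (page title, revision id)


def deduplicate(events: Iterable[Event]) -> Iterator[Event]:
    last_seen = defaultdict(int)
    for page, rev_id in events:
        if rev_id <= last_seen[page]:
            continue  # stale or duplicate revision; the state already covers it
        last_seen[page] = rev_id
        yield page, rev_id


stream = [("Q1", 10), ("Q2", 7), ("Q1", 9), ("Q1", 11), ("Q2", 7)]
print(list(deduplicate(stream)))  # [('Q1', 10), ('Q2', 7), ('Q1', 11)]
```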
[17:22:45] Still checking logstash... is there a way to get it to run its pipelines manually? I was going to install the upstream package, run the pipeline, and see if it works
[17:25:09] hmm, i'm sure there is some way but not sure how :S
[17:34:24] That's OK, I'll take a look once I get back from lunch in ~40
[17:36:29] ok, so curious thing. We use mw-api-int-async-ro, which is port 6500. If i port forward that back to a deploy host and then make requests through it we get the 301 moved. Don't have the same port available on deploy hosts, but the same request through the `mwapi` service which is available works fine :S
[17:38:10] the same request, when made to en.wikipedia.org instead of donate.wikipedia.org, works fine. I suspect something related to https?
[17:39:29] is donate in a separate network? iirc fundraising has its own things
[17:41:47] i don't think so, it looks like any attempt to donate redirects the user to payments.wikimedia.org which probably is separate
[17:45:43] yea, it's https. If i use telnet, a request with only GET and a Host line fails, but add `X-Forwarded-Proto: https` and it works fine
[17:46:09] the question is, where should that be set?
[17:57:36] * ebernhardson realizes after too much time the reason curl wouldn't reproduce the errors i see in telnet is a default NO_PROXY list that includes our sites :P
[18:00:39] back
[18:01:10] ahh, i suspect if we change our config to use https://localhost:6500 instead of http://localhost:6500 it might just work
[18:11:46] hmm, no
[18:18:23] OK, looks like the curator actions are on a timer that runs on the active host only
[18:19:28] going to run the ExecStart command for `curator_actions_apifeatureusage_codfw.service` manually
[18:24:12] looks to be working
[18:33:23] hmm, poking through the configmaps i see something that suggests envoy is supposed to add x-forwarded-proto: https for us. so that's...curious
[18:37:20] oh, interesting. I see the x-forwarded-proto configured in flink-app-producer-envoy-config-volume, but not flink-app-consumer-cloudelastic-envoy-config-volume
[18:50:37] found the problem, a setting was mistakenly removed from the service definition. fix is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016422
[19:10:04] * ebernhardson notes that with k8s this won't auto-push from puppet, it will update the config on the deploy host and we will have to deploy to pick up the change
[19:47:25] quick break, back in ~15-20
[20:51:27] Hmm, per https://phabricator.wikimedia.org/T345337#9658807 it looks like using the newer version of curator doesn't fix the dependency hell problem. I'm wondering if it's worth it to use the curator library at all? Could we get the same effect with the rest API? https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/456322/41/spicerack/elasticsearch_cluster.py has the code
[20:56:51] hmm, i'm not sure what all we do with curator. apifeatureusage uses it for deleting old things, a quick grep suggests that might be that case. re-implementing that narrow use case wouldn't be all that hard
[20:57:13] s/might be that case/might be the only case/
[20:57:34] i see mention of opensearch curator in puppet as well, do they have a different version?
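Since the only curator use found so far is "deleting old things" for apifeatureusage, here is a hedged sketch of that narrow use case done directly against the REST API with requests instead of the curator library. The endpoint, index-name pattern, and retention window are placeholders for illustration, not the values from the real curator action files.

```python
from datetime import datetime, timedelta, timezone

import requests

ES = "http://localhost:9200"    # assumed cluster endpoint
PREFIX = "apifeatureusage-"     # assumed daily index naming: apifeatureusage-YYYY.MM.DD
RETENTION = timedelta(days=90)  # assumed retention window


def stale_indices() -> list[str]:
    """List dated apifeatureusage indices older than the retention window."""
    cutoff = datetime.now(timezone.utc) - RETENTION
    resp = requests.get(
        f"{ES}/_cat/indices/{PREFIX}*",
        params={"h": "index", "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    stale = []
    for row in resp.json():
        try:
            day = datetime.strptime(row["index"][len(PREFIX):], "%Y.%m.%d")
        except ValueError:
            continue  # ignore indices that don't follow the dated pattern
        if day.replace(tzinfo=timezone.utc) < cutoff:
            stale.append(row["index"])
    return stale


for index in stale_indices():
    requests.delete(f"{ES}/{index}", timeout=30).raise_for_status()
```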
[20:57:45] or maybe it's just old duplicated code
[20:59:02] looks like opensearch::curator is using the same package we are, and it's very specific about the version
[20:59:23] Oh, to be specific I'm just talking about the spicerack code above
[21:00:46] we can keep curator everywhere else, but our spicerack library uses curator for some cluster routing stuff
[21:00:59] hmm, yea that's just an http request
[21:01:04] maybe some error handling
[21:01:18] rather than cut a debian package, I wonder if it's easier to just reimplement as rest calls
[21:02:24] probably
[21:06:21] I had high hopes for the new CI-based debian package building stuff, but it's not quite there yet
[21:06:41] during the meeting with Research this morning, there was talk of a language-agnostic named-entity recognition approach that may be useful if one were to generalize the approach in https://arxiv.org/pdf/2109.00835.pdf . Diego shared with me https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task and I checked about sharing here.
[21:10:14] the link on meta shows how the original model was refined to perform better. there's basic parity between the string in an article that is to get a link and the actual title of the target page, at least based on a couple tries against https://api.wikimedia.org/service/linkrecommendation/apidocs/
[21:11:42] i would think that synonymous terms from a natural language query would require a bit more augmentation, although that would mainly be for the case where the result set came back as insufficient
[21:13:01] (here a natural language query is actually more like a factual claim, although i can see how it might apply when posed as a question for certain classes of question)
[21:26:35] ebernhardson: thanks for looking into the redirect issue. So, do we have to wait for a +2 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016422 or can we override this config inside our pod?
[21:27:51] pfischer: i don't think we can override it, the mesh sets all that up based on the config from puppet. either ryankemper or inflatador will need to merge, and then we re-deploy
[21:28:40] \o
[21:28:43] looking
[21:31:05] hmm, as an aside https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006892/2/hieradata/common/profile/services_proxy/envoy.yaml is a good example of how misleading git's diff algo can be sometimes :P
[21:31:52] just puppet-merged
[21:32:30] You mean, because of the xfp property that slipped through?
[21:32:33] Thanks!
[21:33:06] yeah exactly
[21:33:28] s/how misleading git's diff algo can be sometimes/how misleading diff algorithms can be in general
[21:35:54] ryankemper: Do I have to wait some time before re-deploying, or how soon does the pp patch become effective?
[21:37:04] pfischer: (cc ebernhardson) generally puppet needs to be manually run on the relevant hosts and/or wait up to a maximum of 30 minutes for the auto run, but I gather from the above context that this puppet code is fetched at helm deploy time
[21:37:39] so it sounds like you should be good to go
[21:37:47] * ryankemper has very little context though :D
[21:38:01] Thanks, I’ll give it a shot. :rocke
[21:39:51] 10% project idea: shell alias that forks two equal-size tmux windows and runs different git diff algorithms in each. looks like myers (default) and patience would be sufficiently dissimilar to surface an issue like this
[21:40:00] would I ever remember to use that command? probably not but a man can dream :P
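In the spirit of the 10% project idea above, a small sketch that skips the tmux panes entirely and just asks git for the same diff under each built-in algorithm, flagging when they disagree. The revision range is a placeholder; the file path is the envoy.yaml from the change linked earlier in the log.

```python
# Run the same git diff under every built-in algorithm and report how many
# distinct renderings come back; more than one means the algorithms disagree
# and the change is worth a closer look.
import subprocess

ALGORITHMS = ["myers", "minimal", "patience", "histogram"]


def diffs_for(rev_range: str, path: str) -> dict:
    out = {}
    for algo in ALGORITHMS:
        result = subprocess.run(
            ["git", "diff", f"--diff-algorithm={algo}", rev_range, "--", path],
            capture_output=True, text=True, check=True,
        )
        out[algo] = result.stdout
    return out


diffs = diffs_for("HEAD~1..HEAD", "hieradata/common/profile/services_proxy/envoy.yaml")
distinct = len(set(diffs.values()))
print(f"{distinct} distinct rendering(s) across {len(ALGORITHMS)} algorithms")
```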
[21:41:18] :D
[21:41:51] Can you get the tmux panes to scroll synchronously?
[21:42:34] ryankemper: doesn’t look like re-deploying alone does the trick
[21:44:21] At least the application still fails due to the circular redirect
[21:44:25] pfischer: hmm, might need to wait for a puppet run on the deploy host. Can see when it's done when this has it in it: grep -A 15 mw-api-int-async-ro /etc/helmfile-defaults/general-eqiad.yaml
[21:46:14] Running puppet on `deploy1002`
[21:46:29] Thanks!
[21:47:45] p.fischer: and yes the synchronous scrolling is something that would need to be figured out as well. hmm
[21:47:57] there it is, should be able to redeploy now
[21:49:42] WIP
[21:51:13] Hmm there's actually no difference between the 4 diff algos in this case, at least assuming I'm using the syntax correctly:
[21:51:15] * pfischer remembers seeing that envoy config removed in the pre-deploy-diff during a past
[21:51:17] https://www.irccloud.com/pastebin/e3wnGuZ5/
[21:52:27] yea, it almost needs to be a data-structure diff that understands yaml
[21:55:29] * inflatador is learning about git diff algorithms
[21:56:22] The good news: the circular redirect is no longer an issue. The bad news: now we run into OOMs
[22:00:36] :S
[22:06:18] not completely clear from the logs, but expect that's the taskmanager. It looks like we have the default of 2gb per task manager, giving it more seems plausible
[22:09:03] was curious, this is what a yaml-aware diff looks like for the mistaken removal: https://phabricator.wikimedia.org/P59237
[22:21:23] ebernhardson: have you heard about ? "a structural diff that understands syntax"
[22:51:38] ebernhardson: I capped the ES bulk request size from 100mb to 25mb; seems stable so far (TM uses < 1.5GB of 2.15GB) but I can’t see spikes in memory usage via prometheus. We can now try to find a sweet spot that balances bulk request size and request duration. I added the metrics in the Elasticsearch row: https://grafana-rw.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?forceLogin&from=now-15m&orgId=1&refresh=5m&to=now
[22:56:51] I am out, see you tomorrow!
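For the "data-structure diff that understands yaml" idea above, here is a rough sketch: parse both revisions of a file and compare the flattened structures, so a re-ordered block cannot hide a dropped key the way the line-based diff did. The file names are placeholders and PyYAML is the only assumed dependency.

```python
# Structural yaml diff: flatten both documents into path -> value maps and
# print every path whose value changed, appeared, or disappeared.
import sys

import yaml


def flatten(node, path=""):
    """Yield (dotted.path, value) pairs for nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}.{key}" if path else str(key))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from flatten(value, f"{path}[{i}]")
    else:
        yield path, node


def yaml_diff(old_file: str, new_file: str) -> None:
    with open(old_file) as f:
        old = dict(flatten(yaml.safe_load(f)))
    with open(new_file) as f:
        new = dict(flatten(yaml.safe_load(f)))
    for key in sorted(old.keys() | new.keys()):
        if old.get(key) != new.get(key):
            print(f"{key}: {old.get(key, '<absent>')} -> {new.get(key, '<absent>')}")


if __name__ == "__main__":
    yaml_diff(sys.argv[1], sys.argv[2])
```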