[07:53:32] ejoseph: I have to be at Oscar's school at 11am. Ping me when you arrive if you want to have our meeting earlier, otherwise I'll move it to this afternoon.
[08:17:39] Hi all, wanted to let you know that my arch docs / diagrams have had some more review on our side now and I'd love another glance by one of you! https://c2d26c0110--wikidata-wikibase-architecture.netlify.app/systems/query/01-introduction
[08:18:19] One thing that came up is that it would be great to include a runtime view or 2 of the new streaming updater, but I don't feel like I could draw that right now
[08:26:48] addshore: I like the changes, thanks!
[08:26:58] how much detail do you want in that view or 2?
[08:32:30] hmm, so I think I'd like to understand the difference between the updaters, and what it does differently
[08:32:51] ie, gets things from a stream, compares that to stuff it already knows somehow? outputs something to a stream
[08:33:03] the middle bit I guess is the most interesting and the bit we don't know about
[08:33:18] I had a quick glance through the code repos to see if I could create it myself but didn't know where to start
[08:34:18] addshore: it's heavily async and a bit hard to describe as a runtime view/sequence diagram
[08:35:04] what I came up with originally is https://people.wikimedia.org/~dcausse/updater_v2r2.png
[08:35:12] oooooooooo
[08:35:46] Is that still mostly accurate?
[08:35:54] addshore: yes
[08:35:59] neat
[08:36:15] I think maybe then that could live quire resonable as another building block breakdown view :)
[08:36:31] *quite reasonably (needs another coffee)
[08:36:59] Mind if I recreate that in draw.io?
[08:37:31] addshore: no, please do, I have the dia source if you need
[08:37:45] oooh yeah, let's see if I can do some magic conversion etc :)
[08:37:54] ok :)
[08:38:08] it also makes use of swift, nice, didn't know that!
[08:38:30] yes, that's the cold storage for the flink state
[08:38:36] what exactly does "rdf import" represent?
[08:38:48] the "consumer" part in your diagram
[08:38:53] otcha
[08:38:55] *gotcha
[08:39:18] I'll probably rename some of these parts then, in line with other parts of these docs so far
[08:39:36] but obviously we can continue working on terminology if folks want etc
[08:41:16] addshore: please name things the way you want if it makes them clearer with the rest of the doc
[08:41:19] https://people.wikimedia.org/~dcausse/updater_v2.dia
[08:41:23] thanks!
[09:45:19] I have to go to Oscar's school. Back later
[09:48:36] zpapierski: let me know when you are available
[09:48:54] I'm available now, at least for the next 25min
[09:48:59] ok cool
[09:49:07] code with me?
[09:49:49] and guys I might close early today, I have a train to catch
[09:50:03] yep, hit me with a link
[09:50:10] ah, wait
[09:50:26] I haven't checked why it's broken for me now, let's go with zoom
[09:50:32] I have a meeting with gehel in 10 minutes
[09:50:40] ah, ok
[09:50:50] really - didn't he just go to Oscar's school?
[09:51:00] Oh
[09:51:12] Let's start then
[12:33:28] lunch
[13:01:17] ejoseph: I'm back, ping me when available
[13:18:26] forgot to mention in standup but I'm out on pto today
[13:18:41] (will be pretty much off the grid till sunday)
[13:27:03] ryankemper: enjoy!
[13:51:14] dcausse: elastic has switched to 2.7.1, I wonder if we should do the same
[13:51:35] mind giving a little more context?
[13:51:52] ejoseph: this is about keeping dependencies up-to-date
[13:52:22] if elastic has switched from randomized.testing 2.7.0 to 2.7.1 then we might want to do the same
[13:52:44] ok I see
[13:53:51] while it seems to work for us, it's generally good to keep deps up-to-date so that you don't fall too far behind; you discover a few incompatibilities sooner rather than hitting a huge set of problems at once
[13:54:14] gehel: are you available?
[13:55:40] ejoseph: https://meet.google.com/vpn-uqkz-obr
[13:56:26] (yes, I'm available)
[14:14:17] ejoseph: I see that you have created a string of CRs (737026 is based on 737024). This is great and exactly what you should be doing! Addressing the comments on the parent CR will require rebasing the child CR.
[14:14:44] I don't know how familiar you are with git, but this might get messy. Feel free to reach out to me (or zpapierski) if things look strange!
[14:17:04] ejoseph: I had a look at those 2 CRs. It all looks good (except the comments that dcausse already mentioned).
[14:17:15] Congratulations on your first Java CR!
[14:19:52] ejoseph: I always use git rebase -i (interactive rebase), see https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History
[14:20:48] the most important thing is to keep the Change-Id in the commit message intact so that gerrit can keep track of what you do
[14:24:28] ejoseph: I'm back around if you need me
[15:02:22] \o
[15:03:36] ejoseph: o/
[15:06:38] and don't `git commit --amend` when fixing a rebase conflict from `git rebase -i` :) I've lost so much work that way...
[15:13:28] o
[15:13:32] /
[15:14:09] going out early, have a nice weekend!
[15:14:27] dcausse: enjoy!
[15:14:39] I'll be out a bit earlier as well, I'll miss the gaming unmeeting
[15:36:51] I think we actually haven't set any game this time ;)
[15:46:51] zpapierski: are you available?
[15:47:16] yep
[15:47:55] need help?
[15:57:21] ok, need to hunt down some food before unmeeting
[16:36:04] time to start the weekend! have fun!
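The stacked-CR rebase workflow discussed above can be sketched in a throwaway repository. The repo, commit messages, and Change-Ids below are made up for illustration; the point is that Gerrit matches a new patch set to an existing change by the Change-Id footer, so the footer must survive the rewrite (the interactive-rebase mechanics are standard git, not anything Gerrit-specific):

```shell
#!/bin/sh
set -e

# Throwaway repo with a parent CR and a child CR stacked on top of it.
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email dev@example.org
git config user.name dev

echo a > file && git add file
git commit -qm "parent CR" -m "Change-Id: Iaaaa0001"   # hypothetical Change-Id
echo b > file2 && git add file2
git commit -qm "child CR" -m "Change-Id: Ibbbb0002"    # hypothetical Change-Id

# Address review comments on the parent: stop at it with `git rebase -i`.
# GIT_SEQUENCE_EDITOR rewrites the todo list non-interactively so this
# script can run unattended (interactively you would just change the
# first "pick" to "edit" in your editor).
GIT_SEQUENCE_EDITOR='sed -i "1s/^pick/edit/"' git rebase -i --root
echo a2 >> file && git add file
git commit -q --amend --no-edit    # amend the parent only
git rebase --continue              # replay the child on the new parent

# The child commit was rewritten (new SHA) but keeps its message,
# so the Change-Id footer is intact and Gerrit sees a new patch set:
git log --format=%B -1 | grep Change-Id
```

The same idea applies to longer stacks: amend the commit the review comments target, let the rebase replay everything above it, and never edit the Change-Id lines.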
[17:17:13] same here, have fun y'all
[17:17:58] (in my head, y'all always sounds like it's spoken with a Texas accent, don't know why)
[18:12:43] FYI, it is the "pioneer anomaly" not the "voyager anomaly" that I was trying to recall during the unmeeting: https://en.wikipedia.org/wiki/Pioneer_anomaly — TL;DR: a small amount of heat loss created a minuscule deceleration that adds up over years and bajillions of miles.
[18:26:49] i wonder if it will matter... reading the streaming updater bootstrap process, we need a revisions file that comes from entity_revision_map, but afaict we only build the revision map for wikidata. Might be as simple as running it
[19:17:59] Hi, someone just reported search is broken on CheckUser wiki in #wikimedia-tech
[19:18:11] 19:15:36 I don't know if this is significant, but I'm getting "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." on some searches on https://checkuser.wikimedia.org/
[19:18:32] Could you have a look?
[19:18:36] ebernhardson: ^
[19:22:16] (I can see some `index_not_found_exception: no such index` dramah for those searches, but can't replicate)
[19:22:59] It's apparently intermittent
[19:23:19] I wonder if it's broken partly and needs a rebuild
[19:35:14] hmm
[19:35:36] intermittent is odd, they only talk to a single cluster, the index should exist or not
[19:35:42] reads at least
[19:40:47] there are some suspicious timeouts around the pool counter, starting ~7 hours ago we've been putting out 200-300 pool counter timeouts per 10 minutes
[19:41:30] (that's really low compared to the overall traffic, and doesn't seem to be wiki-specific though)
[19:42:29] fwiw, it seems to be 1 in ~10 searches that error...
[19:45:50] looks like i have a few logged requests with that error, will try re-running a few times
[19:58:49] well, good news: i can reproduce an intermittent problem.
But bad news: i have no clue why yet :P Will look
[20:17:26] something seems off in the cross-cluster communication between 9643 and the other clusters but it's not clear what. I can reproduce every time against elastic1044:9643, and it seems to work every time against elastic1045:9643. But there is no way for these settings to differ between instances, they are part of the cluster state
[20:33:51] ebernhardson: is there envoy in the mix there? Like is envoy locally on elastic1044 to bridge to the rest of the cluster with TLS?
[20:34:49] bd808: envoy only from mediawiki to the cluster, but these inter-cluster comms go over elastic's inter-node port and skip envoy and such
[20:35:16] *nod*
[20:35:34] i'm restarting the psi cluster on 1043 (was reproducing the error) to see if a fresh instance makes any difference
[20:35:43] and that does fix that instance... hmm
[20:36:19] ryankemper: while it's a sad answer, feel like rolling a restart across psi? I dunno if we should invest much in this if we are rolling 6.8 in a month and then 7.10 a few after
[20:37:52] strictly guessing, each instance is maintaining its own view of cross-cluster state and some of them have fallen out. Would like to know why, but not sure where to look
[20:59:04] logging is curious. There has been a low level of logging that looks like the same issue, ~400 events a day since 9-14. 9-14 is when we switched primary traffic from codfw to eqiad, so the problem would have been introduced to the eqiad cluster some time prior to that, but no clear point when
[21:01:33] That's fun
[21:01:43] Like 6 weeks ago
[21:05:17] yea, not going to be finding great details on this one
[21:52:30] i've restarted the instances i identified as failing in the psi cluster, dunno if it means anything but only instances in 1030-1050 had issues, 1050-1067 were all fine
[21:55:26] curiously i can't reproduce the issue in any other direction, each cluster has two remote clusters configured but only psi->chi is reproducing in my setup
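For readers following the debugging above: Elasticsearch exposes each node's view of its configured remote clusters, which is roughly how this kind of per-instance divergence gets narrowed down. A sketch of the checks, illustrative only: it needs a live cluster, and the host, port, remote-cluster alias `chi`, and index name used here are assumptions, not the actual values from this incident.

```shell
# Ask one specific node (not the cluster at large) how it sees its
# remote clusters: seed nodes, whether a connection is established, etc.
# Comparing this output across nodes reveals instances whose
# cross-cluster state has drifted.
curl -s 'http://elastic1044:9643/_remote/info?pretty'

# A cross-cluster search prefixes the index with the remote alias.
# Re-running this against individual nodes is one way to catch an
# intermittent index_not_found_exception coming from only some of them.
# ("chi" and "some_index" are placeholder names.)
curl -s 'http://elastic1044:9643/chi:some_index/_search?size=0'
```

Restarting an affected instance rebuilds its remote-cluster connections from the shared cluster state, which matches the observation above that a restart made the error disappear on that node.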