[07:17:14] o/ gehel: I got feedback on our request to use kafka-main for our search update pipeline: https://phabricator.wikimedia.org/T341625. Giuseppe asks if we have SLOs for this project and I couldn't find any described in the (google) documents I checked. Do we have any SLOs for cirrus-search that we'd have to assume?
[07:21:48] We don't have a formal SLO at the moment, but for the update pipeline, we should probably aim for lag < 10 minutes with 3 or 4 nines of availability
[07:52:03] Thanks! A “3 or 4 nine availability” - what does that mean? Giuseppe asked specifically for retention periods (and their link to SLOs), but I'm not sure if we want to/have to give any guarantees on what we'll be able to restore and for how long. After all, we're based on existing event streams, at least for the critical revision-based events. So whatever they can restore, we'd be able to restore, too.
[07:52:16] AFK, back in 30'
[09:28:54] 3 nines of availability: we reach our threshold 99.9% of the time (in our case, we evaluate our SLO quarterly).
[09:29:36] So it would mean: we have more than 10 minutes of update lag for less than 2h10' per quarter - https://uptime.is/
[09:30:30] 4 nines (99.99%) would be less than 13 minutes per quarter, which is probably hard to reach and our users are unlikely to care much at that point.
[09:32:01] As for retention, that might be a question for David and Erik. I don't think we care all that much. I don't expect that we'll recover from Kafka in case of catastrophic failure, so we need enough retention to cover transient issues with the update pipeline. Which I doubt would be much over a few days, if that.
[09:40:27] If need be, we can probably have a way to backfill our page stream from RecentChanges, or something similar.
[09:49:33] ebernhardson: let's try to do our ITC during our 1:1 today!
[09:56:01] lunch
[12:43:37] First day of school today for kids... will be ~15-30m late
[13:13:46] back
[13:28:27] seeing some flapping alerts for the streaming updater in both DCs, anyone aware of any issues there?
[13:33:42] I see what looks like pods being restarted. Probably something to keep an eye on but not an emergency at this point
[15:21:38] ebernhardson: If you have thoughts on the kafka storage topic (no pun intended), please let me know
[15:22:50] pfischer: not much, the numbers there all look plausible
[15:23:49] Alright, thanks for having a look!
[15:24:49] pfischer: do you want to respond to joe in the ticket? or i can
[15:31:44] So far, I've only started a comment with updated numbers. So please, go ahead.
[15:37:09] ebernhardson: I just did the SLO math for deriving retention from availability: with 2 nines we would accept an outage of ~4 days; with 3 nines we would already be down to ~0.4 days. As we do not have anyone on call 24/7, we should account for long weekends etc. Do you think a retention period of 4 days is too short?
[15:37:38] pfischer: 4 days feels short. If it fails on Friday, Monday will be not-fun
[15:37:57] and we have three-day weekends at times, it just doesn't seem enough
[15:38:27] i suppose with that many nines we can't ignore it for a weekend anyways though :P
[15:43:44] * ebernhardson randomly wonders if local_sites_with_dupes is wrong anywhere, excluding pages from search results without having a local file with the same name
[15:50:05] ebernhardson: I'm happy with less nines. Would it hurt anybody to start low, let's say with 99%?
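A quick back-of-the-envelope sketch of the availability math in the discussion above: gehel's figures are quarterly, while pfischer's retention figures line up with a yearly budget. This is for illustration only, not project code.

```java
// Downtime budget implied by an availability target ("nines"), as a sketch.
public class SloBudget {
    // e.g. 3 nines -> 99.9% -> 0.1% of the period may be spent above the lag threshold
    static double downtimeHours(int nines, double periodDays) {
        double unavailability = Math.pow(10, -nines);
        return periodDays * 24 * unavailability;
    }

    public static void main(String[] args) {
        // Quarterly budget (~90 days), matching the 2h10' / 13' figures above:
        System.out.printf("3 nines, quarterly: %.1f h%n", downtimeHours(3, 90));        // ~2.2 h
        System.out.printf("4 nines, quarterly: %.0f min%n", downtimeHours(4, 90) * 60); // ~13 min
        // Yearly budget (365 days), matching the ~4 day / ~0.4 day retention figures:
        System.out.printf("2 nines, yearly: %.1f days%n", downtimeHours(2, 365) / 24);  // ~3.7 days
        System.out.printf("3 nines, yearly: %.1f days%n", downtimeHours(3, 365) / 24);  // ~0.4 days
    }
}
```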
[15:52:29] ebernhardson: “randomly wonders if local_sites_with_dupes is wrong anywhere, excluding pages from search results without having a local file with same name” - I just reindexed the local general index to a new one (that I then referenced as an extra index) with the same settings to manually test deduplication locally
[15:52:52] pfischer: hmm, off the top of my head it seems fine. For our readership, the update frequency almost doesn't matter. If it updated twice a week most readers would be happy. The question comes down to what editors need, and we really don't know
[15:53:31] * inflatador is always happy with less nines
[15:53:32] some editors would certainly be put out if search updates failed, we know some use search as a kind of queue of work
[15:57:00] Does anyone object to me cleaning up the Search Update Pipeline Doc? https://docs.google.com/document/d/17tY05WoaT_BloTzaIncR939k3hvhcVQ-E-8DBjo284E/edit . There's some stuff we're probably not going to do anytime soon (like Kafka stretch) and I think it would be easier to get an up-to-date picture without this kind of thing
[16:00:53] inflatador: seems probably ok, iirc google docs maintains a history so that is all still findable if we needed it in the future?
[16:01:19] otherwise i might be tempted to re-shuffle the doc and move things we aren't going to do to a section at the bottom that says that
[16:01:37] inflatador: yes, that's what I'd prefer.
[16:01:41] ebernhardson that's not a bad idea, might help reduce confusion. And yeah, I think there is a history
[16:01:57] pfischer you'd prefer removing the stuff we're not going to do? Or moving it to the bottom?
[16:02:39] inflatador: move it to the bottom. It's easier to follow for someone new to the topic.
[16:02:50] pfischer ACK, will do
[16:03:00] inflatador: thanks!
[16:03:47] Also, please feel free to revert/update if my changes don't work for you
[16:07:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/948615 is up for review if anyone has time (ZK cluster puppet stuff)
[16:08:06] inflatador: 👀
[16:09:50] ebernhardson: Just wondered: when we start the application after downtime, we'd consume Kafka events at the maximum rate to reduce the backlog. Would we keep the same window length for deduplication in that case? Or would we start with a shorter window and restart later with the original length? I would assume that the window size cannot be changed at runtime.
[16:14:14] pfischer: iiuc the windowing time is based on event time, so even though the window is 10 minutes or whatever, the window won't live for 10 real minutes; it will close based on ingesting an event with a timestamp 10m greater than the prior event
[16:14:52] which is to say, i think we won't need to change the window?
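A minimal sketch of the point being made here, assuming the Flink DataStream API; this is not the actual cirrus updater code, and `UpdateEvent` is a hypothetical record type. Because the deduplication window is defined in event time, it closes as soon as the watermark (derived from event timestamps) passes its end, so catching up on a backlog at full speed does not stretch the window in wall-clock time and the window length needs no adjustment.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class DedupSketch {

    // Hypothetical event shape; the real update events carry more fields.
    public record UpdateEvent(long pageId, long eventTimeMillis) {}

    static DataStream<UpdateEvent> deduplicate(DataStream<UpdateEvent> updates) {
        return updates
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<UpdateEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((event, recordTs) -> event.eventTimeMillis()))
            .keyBy(UpdateEvent::pageId)
            // Window length is measured in event time, so it needs no change during catch-up.
            .window(TumblingEventTimeWindows.of(Time.minutes(10)))
            // Keep only the latest event per page within each window.
            .reduce((a, b) -> b.eventTimeMillis() >= a.eventTimeMillis() ? b : a);
    }
}
```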
[16:15:13] in theory it could use a larger window, but probably not worth the complication
[16:16:05] I'm not sure if this helps any, but here's an example of what David did to quickly catch up after Flink broke: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Running_from_YARN
[16:16:24] Not advocating this, just maybe some context about what happens when the job is down for a long time and can't catch up
[16:16:50] inflatador: yarn is slightly different, i think we do it in yarn because we weren't able to get enough resources on short notice in flink, but getting more resources in yarn is simple
[16:17:19] i guess we've done it a few times for slightly different reasons, but last time around iirc the problem was we didn't have enough memory in the flink workers to process
[16:19:16] ebernhardson understood. I assume we'd have to try something similar for the Search Update Pipeline if we had a large backlog?
[16:21:14] inflatador: it's hard to say, i don't understand what about the wikidata updater required it to use more memory. In theory I don't think the current cirrus updater will use much more memory processing a backlog, the memory usage should be constrained by the event time (but perhaps i'm being over-optimistic). I wouldn't discount it as a possibility, but we also don't have much to suggest it will be necessary yet
[16:22:24] OK, sounds like a non-issue then
[16:39:24] Hm, one thing that didn't make it into the calculation is that David's new page_rerender stream has to be counted twice: once as a topic of its own, plus the updates resulting from it. I would expect a very low retention for page_rerender though. If we fetch late, we wouldn't need to retain them at all.
[16:40:21] dinner
[16:56:58] pfischer: another random thought (for later i suppose) while writing up an answer to joe: we've talked about running the main flink application per-datacenter, or running the flink application in a single datacenter and consuming its output from each. IIRC we're leaning towards one per datacenter? That would suggest we have two copies of each update in separate topics
[16:59:29] meh, modals are the devil. I couldn't click anywhere in my browser, because on another screen some tab popped up a modal
[16:59:51] no argument here
[17:54:04] lunch, back in ~45
[18:43:44] sorry, been back
[18:49:14] ebernhardson: Hm, that's right. Are services aware of running in an active vs. passive DC? If so, the passive producer could simply drop updates so that both the active and the passive consumers are fed by the updates coming from the active producer.
[18:57:03] pfischer: hmm, I'm not sure how we expose that. At a high level i believe we manage the state of which DC is primary inside etcd; there is tooling that can essentially update config files on the machine when a change happens in etcd
[19:02:52] ebernhardson: 1:1?
[19:03:14] gehel: doh! omw
[20:09:16] * ebernhardson puzzles over how to understand the commonswiki file upload issue ...
[20:31:14] my best guess currently is a race condition where we generate the doc before the metadata has processed... but i don't even have proof yet that the metadata is async :P
[20:41:43] Quite the conundrum
[21:19:28] afk, school run
[22:10:38] back
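Closing the loop on the retention thread above: whatever retention period is eventually agreed on for the update topics maps to the standard Kafka `retention.ms` topic config. A minimal sketch using the Kafka AdminClient follows; the broker address, topic name, and the 7-day value are placeholders for illustration, not decisions from this discussion.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-main-placeholder:9092"); // placeholder broker
        // "a few days" of transient-issue coverage plus a long-weekend buffer (placeholder value)
        long retentionMs = Duration.ofDays(7).toMillis();

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(
                ConfigResource.Type.TOPIC, "cirrus-update-topic-placeholder");
            AlterConfigOp setRetention = new AlterConfigOp(
                new ConfigEntry("retention.ms", Long.toString(retentionMs)),
                AlterConfigOp.OpType.SET);
            // Apply the retention override to the topic.
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention)))
                 .all().get();
        }
    }
}
```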