[08:13:55] Hi all!
[08:13:55] The docs I have been writing (with diagrams) continue to iterate, and I wonder what you think of this so far https://c2d26c0110--wikidata-wikibase-architecture.netlify.app/systems/Query/01-Introduction.html
[08:14:44] I'd be interested to also include the WCQS as a deployment view, as it's slightly different with the whole auth stuff somewhere
[08:28:28] addshore: nice! one diagram that describes well where the streaming updater is integrated is https://wikitech.wikimedia.org/wiki/Event_Platform#Platform_Architecture_Diagram
[08:30:23] addshore: the streaming updater consumer does not get query results other than write request responses
[09:18:42] addshore: looks very nice! I don't really understand that "Write Queries" part, though - any updater will write/delete triples (or update entities, for a more general description). I think you might be doing yourself a disservice here - https://c2d26c0110--wikidata-wikibase-architecture.netlify.app/systems/query/05-building_block_view.html with the Kafka block
[09:19:47] dcausse: awesome, so does flink essentially diff the RDF of a current and previous revision in order to decide the RDF changes to put into the next stream?
[09:19:50] this suggests a weird workflow for the streaming updater - things go out from Kafka, then back to Kafka, but also to the backend as well, which suggests that it has two main outputs, which isn't true
[09:20:45] addshore: for revision-create, that's exactly true (not counting the first revision or undelete)
[09:20:50] addshore: exactly
[09:21:03] cool, I'll definitely make that change then
[09:21:49] addshore: as for Kafka - it's basically a transport medium - it has more in common with communication arrows than with other blocks
[09:22:27] zpapierski: yeah, I wonder if I can tweak how that is visualized, it did feel odd doing it
[09:22:54] I'd remove Kafka from that view entirely, it's a different level of abstraction
[09:23:08] Then perhaps at this level https://c2d26c0110--wikidata-wikibase-architecture.netlify.app/systems/query/05-building_block_view.html#updating the streaming updater block would actually have an arrow going out and back to itself that says kafka
[09:23:39] yeah that could be good, and only include it at the level below
[09:23:46] I don't know, Kafka feels weird here, you don't actually mention Blazegraph or Flink here
[09:24:04] so I dive into the streaming updater the level below https://c2d26c0110--wikidata-wikibase-architecture.netlify.app/systems/query/05-building_block_view.html#streaming-updater
[09:24:41] I quite like the idea of not including kafka as a box, and just as arrows
[09:24:45] I'd replace Kafka on the first view with EventGate
[09:25:00] the event gate service?
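A minimal sketch of the revision diffing confirmed above ("flink essentially diffs the RDF of a current and previous revision"), assuming triples can be treated as plain (subject, predicate, object) tuples; the function name and example values are illustrative, not the actual Flink job:

```python
# Illustrative only: given the RDF triples of the previous and current
# revision of an entity, compute the triples to delete and the triples to
# insert -- roughly the kind of diff the streaming updater would emit.
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def diff_revisions(previous: Iterable[Triple], current: Iterable[Triple]):
    """Return (to_delete, to_insert) between two revisions of an entity."""
    prev_set: Set[Triple] = set(previous)
    curr_set: Set[Triple] = set(current)
    to_delete = prev_set - curr_set  # triples gone in the new revision
    to_insert = curr_set - prev_set  # triples new in the new revision
    return to_delete, to_insert

# e.g. a label change produces one deletion and one insertion
prev = {("wd:Q42", "rdfs:label", '"Douglas Adams"@en')}
curr = {("wd:Q42", "rdfs:label", '"Douglas Noel Adams"@en')}
print(diff_revisions(prev, curr))
```

The deletions and insertions would then be what lands on the output stream for the consumer to apply against the backend (per the log above, the consumer only sees write request responses, not query results).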
[09:25:21] Yeah, I could do
[09:25:22] I think that's what it's called (I frequently confuse the naming here)
[09:25:30] at a super high level that is also in https://c2d26c0110--wikidata-wikibase-architecture.netlify.app/systems/Repository/03-Context_and_Scope.html#technical-context a bit
[09:25:42] also, glossary for event gate ;) https://c2d26c0110--wikidata-wikibase-architecture.netlify.app/Glossary.html#eventgate
[09:26:04] event-gate makes sense for wikibase -> event-gate -> kafka
[09:26:18] not for the two updaters
[09:26:34] I'd at most mention Kafka on arrows, next to what data is transported
[09:26:49] I almost removed kafka entirely on my last pass for this anyway
[09:26:54] but it's fine to repeat a "kafka" block to show the "one-way" of the update process
[09:27:00] as it's an implementation detail, at least for the flink part, as many backends could be used
[09:27:23] I think it is also an implementation detail for the event gate service, as that can write to multiple backends too
[09:27:41] sure
[09:27:48] I'll have a go at removing it and then send you all some more links
[09:27:52] thanks for the feedback :)
[09:27:59] as for the streaming updater breakdown, I'm confused by the consumer/flink terminology here
[09:28:32] might it make sense to call the "Flink" block a Producer and mention the Flink backend in the description?
[09:28:37] I'm not sure about it though
[09:29:11] I could get behind that!
[09:29:21] any thoughts on what to call the old recent-changes-based updating?
[09:29:40] "soon hopefully extinct" has a nice ring to it :)
[09:29:45] I'd say "stream processor" but fine with producer as well
[09:29:45] xD
[09:30:09] it's called the RCPoller
[09:30:16] perfect! thanks!
[09:30:37] How about the older version of the kafka updater?
[09:30:44] KafkaPoller :P
[09:30:48] perfect :)
[09:31:08] Right, I'll get to making some changes
[09:31:27] summary of what we just discussed https://github.com/wmde/wikidata-wikibase-architecture/pull/150#issuecomment-958779625
[09:32:07] thanks!
[09:36:33] addshore: what framework is that doc made with?
[09:39:31] vuepress
[09:39:52] to be honest I'm undecided if it should be in vuepress, but I'm 98% sure it should be in git and all versioned together etc
[09:39:58] rather than on mw.org for example
[09:40:27] Well, sorry, the JS site-building framework is vuepress, the architectural framework is arc42 (ish)
[09:40:45] diagrams with the draw.io/diagrams.net VS Code integration, and mermaidjs
[09:49:12] And the old KafkaPoller still used the revision-create stream right?
[09:52:28] addshore: yes + page-delete/undelete
[09:52:37] ty
[10:08:01] ejoseph: I'm back!
[10:12:47] Both KafkaPoller & RCPoller get the last update timestamp (or something) from blazegraph right?
[10:42:53] zpapierski: do you have time?
[10:50:29] I'll have some soon - 15m
[10:57:28] ejoseph: 11:15? need to get something to drink and we can start
[10:58:25] addshore: yes, they use blazegraph as storage for their own "polling state"
[10:58:38] Lunch time!
[10:58:45] bon ap!
[11:00:51] ejoseph: of course, I meant 12:15
[11:10:20] lunch
[11:14:33] ejoseph: I'll hang around https://meet.google.com/pxb-zyzq-xqf , join when convenient
[12:15:38] break
[12:39:07] lunch!
[13:06:02] @ejoseph : Up for the meet we scheduled for now?
[13:48:42] are you available now?
[13:49:44] tanny411: my other meeting got canceled
[13:50:43] ejoseph: Hi, sorry, can't make it now.
[13:50:53] Ok
[13:51:16] Reschedule from your end
[13:51:18] thanks
[13:51:22] will do!
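A hypothetical sketch of the poller pattern covered above - RCPoller (and the old KafkaPoller) keeping a last-update timestamp as "polling state" in Blazegraph and fetching only changes newer than it. The classes and callbacks here are illustrative stand-ins, not the real updater code:

```python
# Hypothetical sketch: the poller persists its own "polling state" (a
# last-seen timestamp) and repeatedly fetches everything newer than it.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Change = Tuple[int, str]  # (timestamp, entity id) -- stand-in for a change event

@dataclass
class PollingState:
    """Stand-in for the state the pollers keep in the triple store."""
    last_update: int = 0

@dataclass
class Poller:
    state: PollingState
    fetch_since: Callable[[int], List[Change]]  # e.g. RecentChanges API or a Kafka topic
    apply_change: Callable[[Change], None]      # e.g. re-fetch entity RDF and update the store

    def run_once(self) -> None:
        changes = self.fetch_since(self.state.last_update)
        for change in changes:
            self.apply_change(change)
        if changes:
            # advance the persisted polling state only after applying the batch
            self.state.last_update = changes[-1][0]

# toy usage
events = [(1, "Q42"), (2, "Q64"), (5, "Q1")]
poller = Poller(
    state=PollingState(),
    fetch_since=lambda ts: [e for e in events if e[0] > ts],
    apply_change=lambda change: print("updating", change[1]),
)
poller.run_once()
print("polling state now at", poller.state.last_update)  # 5
```

The real pollers differ in where the changes come from (RecentChanges vs the revision-create and page-delete/undelete streams), but the persisted polling state is the part the log above describes.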
[13:51:38] zpapierski: let me know if you are ready to continue
[13:59:32] I am set
[13:59:43] same link?
[14:35:56] sure
[14:36:30] sorry for being late, I thought we were supposed to reconvene at 3:30
[14:40:01] ah, we were, I see
[15:01:02] Meeting with Maryana starting: https://meet.google.com/vgj-bbeb-uyi
[15:01:02] (cc: ryankemper, ejoseph)
[16:50:45] ryankemper: in case you missed it in the backlog: T294865
[16:50:46] T294865: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865
[16:51:32] I did silence the servers until today, they will start to complain again soon. Can you either investigate or extend the silence?
[16:51:34] thanks, will take a look today after interview
[16:51:39] thanks!
[17:09:53] heh, I can't even ctrl-c htop on wcqs1001 anymore
[17:14:11] dmesg from failing instance: https://phabricator.wikimedia.org/P17664
[17:27:44] dinner
[17:40:56] the wcqs hosts are all alerting again btw
[17:41:42] either an ack or downtime would be appreciated :)
[17:41:54] hmm, I thought g silenced them earlier today?
[17:42:22] legoktm: ryan is in an interview, I imagine he can silence after
[17:42:57] ok :)
[17:43:22] looks like hardware problems, or bad IO drivers... they hang in the kernel :(
[17:57:04] legoktm: downtimed, thx
[17:57:12] ty :)
[17:57:56] curious about the kernel issue, is wcqs significantly different from wdqs? I had naively assumed the software was the same but with a different dataset
[17:58:12] legoktm: the software is the same, the failure looks more like: https://phabricator.wikimedia.org/P17664
[17:58:45] huh
[17:59:19] I would guess bad hardware, but it looks like all 6 instances have the same problem. So I have to maybe guess hardware drivers
[17:59:23] but no clue really yet :)
[18:05:25] ryankemper: since the instances are falling over, I collected the relevant dmesg from the 3 machines that haven't fully fallen over yet: P17667 P17668 and P17669
[20:48:34] ryankemper: looks like we're going to want to roll lvs on wcqs back a stage, with all instances down it's making various monitoring unhappy
[20:49:15] ebernhardson: ack, just got back from lunch and catching up on the #operations backlog and I was thinking the same
[20:49:19] okay will get a patch up real quick
[20:49:28] ryankemper: thanks
[20:50:40] ebernhardson: Do we want to be in `lvs_setup` or `monitoring_setup`? I'm thinking the latter but not entirely sure
[20:50:51] I'm guessing monitoring_setup will have it be in lvs but not alert, which is what we'd want
[20:51:09] ryankemper: reading, but that sounds reasonable
[20:51:09] https://upload.wikimedia.org/wikipedia/labs/b/bf/Lvs_state.png for a refresher
[20:51:27] ryankemper: yea, then monitoring_setup should be reasonable
[20:54:32] ebernhardson: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736564
[20:59:59] I have no idea about lvs & pybal, but it definitely shouldn't be production
[21:00:09] It's caused quite a few alerts
[21:01:33] RhinosF1: I could be mistaken, but my understanding is we can't pool servers and actually test anything unless that says production
[21:17:24] Right
[21:17:49] I mean at the moment the servers are broken anyway right?
[21:19:10] RhinosF1: right, they locked up overnight
[21:20:20] for the moment they certainly shouldn't be set to production... until we figure out what's going on with the bottom end
[23:08:46] ryankemper: the pastes are a little long for a ticket, if it's more than half a page, maybe a page, I generally prefer to use phab's paste utility
[23:22:33] afaict I have the streaming updater producing diffs to kafka. No clue if they are right, but they look plausible :)
[23:25:54] ebernhardson: ack, replaced with pastes
[23:26:14] As for the `wcqs` stuff, unfortunately it's not enough to just go back to `monitoring_setup` or even `lvs_setup`
[23:26:35] We have to go all the way back to `service_setup` which will require a pybal restart (and naturally will require another restart when we're ready to put it back)
[23:27:49] ryankemper: :S
[23:29:01] ryankemper: I wonder if we had a better option... I'm not sure what though; nothing is built for a prod service to disappear without it complaining, for good reasons :)
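A hedged illustration of the service states discussed above (see the Lvs_state.png refresher linked earlier); it only encodes what this log says - monitoring_setup keeps the service in LVS without alerting, and dropping all the way back to service_setup (and later returning) needs a pybal restart. The enum and helper below are hypothetical, not the actual puppet/pybal configuration:

```python
# Hypothetical model of the service states mentioned in the log above.
# The comments reflect only what is said here, not authoritative docs.
from enum import IntEnum

class ServiceState(IntEnum):
    SERVICE_SETUP = 1     # service exists, not yet handled by LVS/pybal
    LVS_SETUP = 2         # in LVS
    MONITORING_SETUP = 3  # in LVS, monitoring defined but not alerting
    PRODUCTION = 4        # fully pooled and alerting

def needs_pybal_restart(old: ServiceState, new: ServiceState) -> bool:
    """Per the discussion above: crossing the service_setup boundary
    (in either direction) requires restarting pybal."""
    return old != new and ServiceState.SERVICE_SETUP in (old, new)

# the wcqs rollback being discussed: production -> service_setup, then back later
print(needs_pybal_restart(ServiceState.PRODUCTION, ServiceState.SERVICE_SETUP))    # True
print(needs_pybal_restart(ServiceState.PRODUCTION, ServiceState.MONITORING_SETUP)) # False, which is why that was the first idea
```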