[08:28:11] huh, the python kafka client is missing a somewhat crucial method that the java one contains - describe cluster, apparently the only way to get the cluster id :(
[08:31:27] zpapierski: I've zero knowledge of this library, but they do mention a cluster_id in the metadata, I'm wondering if you could access that through kafka.cluster.ClusterMetadata somehow
[08:32:08] https://github.com/dpkp/kafka-python/blob/f19e4238fb47ae2619f18731f0e0e9a3762cfa11/kafka/protocol/metadata.py#L59
[08:32:40] you also just won 3 nerd-sniping points :-)
[08:33:13] heh, I'll start the tally :)
[08:33:18] thx for the suggestion
[08:33:41] the documentation didn't mention that, but otoh the documentation barely mentions structs
[08:34:57] eh...
[08:37:14] well, as is often the case, the source code is the best documentation
[08:39:34] nothing actually returns this MetadataResponse - https://github.com/dpkp/kafka-python/search?q=MetadataResponse
[08:40:07] but maybe something overwrites it...
[08:40:53] I thought "It simply updates internal state given API responses (MetadataResponse,"
[08:40:56] could help
[08:42:12] I thought as much, but I see no way to actually retrieve that
[08:42:25] otoh I'm probably mistaken, it would make sense otherwise
[08:42:45] cluster_id doesn't sound like something the user should assign
[08:43:01] yeah, or dynamic
[09:42:08] break
[09:46:38] SyntaxError: 'break' outside loop
[09:56:34] # lunch
[10:09:57] lunch 2
[11:42:56] ha! I found the cluster id
[11:43:09] and I only had to reach into internal kafka client objects to do it
[12:44:00] zpapierski: did you want a brief overview of spicerack / cookbooks?
[12:44:32] I did, but it took me half a day to get the cluster id and I'm still completing my scripts
[12:45:21] I'd rather complete this one first, if that's ok with you?
[12:46:02] all good for me. Ping me if you need me!
[12:46:37] zpapierski/dcausse: did we formally decide which of you is going to mentor our new SWE?
[12:46:58] I'm not sure there was any formality to it, but I volunteered
[12:47:17] dcausse: any objection to Zbyszko being the primary contact?
[12:49:12] gehel: no objections!
[12:49:37] Deal !
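For reference, a minimal sketch of the kind of workaround mentioned at 11:43 ("reach into internal kafka client objects"): kafka-python's ClusterMetadata does carry a cluster_id field populated from MetadataResponse v2+, but there is no public accessor, so this goes through private attributes (_client, .cluster). This is an assumption about the library's internals at the linked revision, not a supported API, and not necessarily the exact script that was written.

```python
# Sketch (assumption, not a supported API): pull the Kafka cluster id out of
# kafka-python's internal metadata object. _client and .cluster are private
# attributes of KafkaConsumer and may change between library versions.
from kafka import KafkaConsumer


def get_cluster_id(bootstrap_servers):
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    try:
        consumer.topics()  # force a metadata refresh so cluster_id gets populated
        return consumer._client.cluster.cluster_id
    finally:
        consumer.close()


# e.g. (broker name/port are illustrative placeholders):
# print(get_cluster_id(["kafka-main1001.eqiad.wmnet:9092"]))
```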
[13:27:56] dcausse: correct me if I'm wrong, but we don't really transfer offsets by partitions, right?
[13:28:02] as in - we shouldn't?
[13:28:28] zpapierski: I'm not sure what you mean
[13:28:41] I mean we can get offsets per TopicPartition
[13:29:24] yes, offsets are per partition
[13:29:42] but those partitions won't necessarily transfer cross-DC, right?
[13:30:25] no, cross-DC transfers need to use timestamp approximation, same-DC transfers must use actual offsets
[13:30:39] sure, but we may still have more than one offset per topic
[13:30:58] unless we assume we always index in a single thread?
[13:31:02] ?
[13:31:27] offsets are per topic per partition, so you can have more than one offset per topic
[13:31:48] (I should probably just pick the biggest one)
[13:32:09] you mean to approximate the timestamp?
[13:32:34] yeah
[13:32:49] sorry, I'm using timestamp and offset interchangeably, I really shouldn't
[13:33:36] anyway - will we have more than one timestamp per topic?
[13:34:24] so in case we have multiple partitions (we don't, but just in case) the data-transfer cookbook must be used in normal conditions (no backlog), so the timestamps should be almost identical
[13:34:45] ok, for timestamps it holds
[13:34:57] since we'll have some buffer
[13:35:08] for offsets it's the same cluster, so the partitions should be the same
[13:35:34] (unless I don't understand what partitions are in kafka, which is highly possible)
[13:35:35] thx
[13:35:41] yes, for offsets it's just a plain copy
[13:36:54] I wonder how to cleanly deal with the prefixes for topics, I think I'll create a map for each cluster_id
[13:37:01] just need to know what they are...
[13:37:24] I really hope cluster_ids are stable :D
[13:38:37] yes, I hope so too, but you should be given the broker list from the command line too
[13:39:25] so same for topics, I suppose
[13:45:33] e.g. data-transfer --from wdqs1009 --to wdqs2010 --fromBrokers kafka-main1001.eqiad --toBrokers kafka-main2001.eqiad --fromTopic eqiad.rdf-streaming-updater.mutation --toTopic codfw.rdf-streaming-updater.mutation
[13:46:21] s/kafka-main2001.eqiad/kafka-main2001.codfw
[13:59:32] jeez, per https://discovery.wmflabs.org/wdqs/#wdqs_usage if wdqs has a peak of 25.14 million requests in a day, that is 290 per second D:
[14:02:17] addshore: I think this dashboard was made out of webrequest logs (so hopefully some queries are cached)
[14:02:39] true! and yes, I hope so :D
[14:03:17] We do track the cache hit rate a bit daily here https://grafana.wikimedia.org/d/79S1Hq9Mz/wikidata-reliability-metrics?viewPanel=15&orgId=1&from=now-6M&to=now
[14:03:39] Looks like it is never more than a 50% hit rate, more normally 10%
[14:03:50] or 10-15%
[14:04:14] thanks, I didn't know about this graph!
[14:05:19] per-server metrics indicate ~30 qps per server
[14:20:39] I'll paste some other mildly interesting numbers here as I put them in some slides
[14:20:44] Regular peaks of 700 entity updates per minute ≅ 20k triple updates per minute or 330 per second
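A quick sanity check of the back-of-envelope numbers quoted above (nothing here beyond the figures already given in the log):

```python
# Back-of-envelope check of the traffic numbers quoted above.
peak_requests_per_day = 25_140_000
print(peak_requests_per_day / 86_400)   # ~291 requests/second, matching "290 per second"

triple_updates_per_minute = 20_000
print(triple_updates_per_minute / 60)   # ~333 triple updates/second, matching "330 per second"
```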
[14:21:27] dcausse: is that how you want it? I was thinking of basically providing a list of topics, e.g. without a prefix, and constructing the full topics based on the cluster id
[14:22:04] if the cluster id during the update is the same as in the offset file, use offsets; if not - go with timestamps
[14:22:23] in any case, use the correct topic names based on the cluster id
[14:23:25] zpapierski: I don't have preferences, it's just that some information can be passed by the ops through the command line if it's difficult to determine
[14:23:45] no, I don't think so
[14:24:04] we can connect to the main clusters from analytics, right?
[14:24:18] yes
[14:24:19] even if not, it shouldn't be difficult to find, basically a single run of my script
[14:24:34] ok, in that case it's trivial, I just need to provide the correct brokers
[14:24:46] the script needs to connect to 2 kafka clusters for a cross-DC transfer
[14:25:00] gehel: we need to ship https://gerrit.wikimedia.org/r/c/operations/puppet/+/720251
[14:25:24] ah, I didn't know we were going to do this with a single execution (I assumed some other means of copying, don't know why)
[14:25:31] it changes nothing for me, though
[14:26:11] this kafka offsets/timestamp preset has to happen during the cookbook runs
[14:26:51] I know, for some reason I assumed that one script would be run to get the offsets, and another, somewhere else, to update them
[14:27:10] but I have no idea what made me think this way
[14:29:55] https://www.irccloud.com/pastebin/KpXLyXZR/
[14:30:15] I see something like this - with prefixes assigned per cluster beforehand
[14:30:15] lego has merged the patch, will restart the updaters to pick up the new topic
[14:31:29] zpapierski: sounds good to me, but the topic name can be hardcoded imo (it's just one: rdf-streaming-updater.mutation)
[14:31:53] no prefixes?
[14:32:09] yes, the prefix is needed too, sorry
[14:32:28] but yeah, you're right, I can just hardcode it for each cluster
[14:33:19] hmm, wcqs
[14:34:08] I think I'll leave the topic, we'll use a different one for wcqs
[14:34:21] also, maybe it would make sense to have a different topic name for wdqs?
[14:34:45] or a different partition
[14:35:05] we generally don't have project names in topics
[14:36:00] a different partition would be better?
[14:36:35] I don't know :)
[14:37:25] dcausse: seems to be already merged. Thanks to legoktm !
[14:37:25] I know it's my limited perspective, but old habits force me to think about partitions in a different way
[14:38:01] and it's not really about the project name
[14:38:20] one topic is for wikidata entities, the other is for SDC entities
[14:40:17] :)
[14:44:56] I don't have strong objections to having another topic, but I wonder what's wrong with a static partitioning scheme based on the project
[14:51:36] probably nothing, I just need to understand it better, and the consequences of it when it comes to stuff we do, like data transfer
[14:52:54] with a static partitioning scheme you need to ask for a specific partition to work on (the updater producer/consumer have that option)
[14:53:22] they explicitly produce to/consume from partition 0
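To pin down the scheme sketched above, a rough illustration assuming kafka-python, with placeholder cluster ids, broker names and offsets-file layout (this is not the actual cookbook code): topic names are built from a per-cluster-id prefix map, stored offsets are reused only when the recorded cluster id matches the current one, and otherwise the position is approximated on the destination cluster with offsets_for_times.

```python
# Rough sketch of the transfer logic discussed above. Cluster ids, the offsets
# file layout and broker names are hypothetical placeholders.
from kafka import KafkaConsumer, TopicPartition

TOPIC = "rdf-streaming-updater.mutation"

# Map each Kafka cluster id to the datacenter prefix used in topic names.
PREFIX_BY_CLUSTER_ID = {
    "<eqiad-main-cluster-id>": "eqiad",
    "<codfw-main-cluster-id>": "codfw",
}


def full_topic(cluster_id):
    return "{}.{}".format(PREFIX_BY_CLUSTER_ID[cluster_id], TOPIC)


def positions_for_transfer(dest_brokers, dest_cluster_id, saved):
    """saved = {'cluster_id': ..., 'offsets': {partition: offset}, 'timestamp_ms': ...}"""
    topic = full_topic(dest_cluster_id)
    if saved["cluster_id"] == dest_cluster_id:
        # Same cluster: offsets are a plain copy (partition 0 in practice, since
        # the updater explicitly produces to/consumes from partition 0).
        return {TopicPartition(topic, p): o for p, o in saved["offsets"].items()}
    # Cross-DC: approximate the position on the destination cluster by timestamp.
    consumer = KafkaConsumer(bootstrap_servers=dest_brokers)
    try:
        parts = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
        found = consumer.offsets_for_times({tp: saved["timestamp_ms"] for tp in parts})
        # offsets_for_times returns None for partitions with no record at/after the timestamp.
        return {tp: o.offset for tp, o in found.items() if o is not None}
    finally:
        consumer.close()
```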
[14:56:21] oops, I missed the QS & wikidata checkpoint
[14:56:49] I missed 4 minutes of it, it took 2 minutes more :)
[14:57:31] and heard that not many people cared about the wdqs degraded QoS, which is both encouraging and depressing at the same time :)
[14:58:30] ryankemper: seems that https://gerrit.wikimedia.org/r/c/operations/puppet/+/720667/ is ready to be merged
[15:01:27] ryankemper: can you check the status on T281989 ? The ticket blocking it has been closed
[15:01:27] T281989: Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989
[15:02:11] ack
[15:02:13] Trey314159: triage meeting? https://meet.google.com/qho-jyqp-qos
[15:17:35] addshore: another ping on T289770
[15:17:36] T289770: Add hints in response headers for 404 responses in Special:EntityData - https://phabricator.wikimedia.org/T289770
[15:18:31] Thanks, pinging the team again!
[16:17:46] addshore: I'll keep pinging !
[16:23:55] sigh... i guess if we are keeping wcqs-beta for 3 months i do have to make puppet generate the different kinds of configuration :S
[16:25:45] :/
[16:26:48] possible, just ugly :P I was trying to remove all the gui references to make it clear what's going on, but wcqs-beta keeps the gui :P
[16:27:09] ebernhardson: do we actually update the puppet repo on wdqspuppet?
[16:27:18] zpapierski: puppet updates the puppet repo afaik :)
[16:27:58] zpapierski: the puppetmaster instance is owned by the standard wmfcloud puppet, and that puppet should be triggering git rebases or some such
[16:28:22] i suppose it can get stuck depending on what patches we've landed there
[16:28:31] yeah, I was thinking about that
[16:28:39] but from what I see, it is up to date
[16:29:17] * ebernhardson wonders if a puppetmaster can be its own puppetmaster. Must be?
[16:29:22] I guess it will work until it conflicts with our private commit there
[16:29:48] physician, heal thyself!
[16:29:58] :)
[16:30:07] but I suppose that it could
[16:31:47] i guess my new plan is to keep query_service::gui, copy it into query_service::proxy, and only apply query_service::gui in wcqs-beta via some hiera conf
[16:34:58] I'm not sure I understand - basically duplicate the current class and leave the old one running with wcqs-beta while working on the new one?
[16:35:30] the new one deletes about half of what was there, and installs a different nginx template
[16:36:10] makes sense, no need for 5-minute flexibility
[16:36:16] certainly we could keep it all mixed together, but i guess going through this stuff my biggest difficulty has been how intermingled things are, so i've been trying to make boundaries
[16:37:15] that and things with misleading names :)
[16:37:30] I thought we enjoyed misleading names!
[16:37:41] I mean, since we use them so often :D
[16:38:07] I thought up the WCQS acronym based on this assumption :)
[16:38:16] i puzzled for a few minutes about what a gui_vars.sh that is imported into cron tasks that load categories has to do with the gui
[16:38:27] maybe more than a few :P
[16:38:57] I actually wondered about the same file, probably even discussed it here at some point a year ago :)
[16:39:02] in the end, i think what it has to do with the gui is that the paths we needed were in query_service::gui, so it was called gui_vars :P
[17:07:51] hmm, if i can't find a deploy_mode of autodeploy in horizon or hiera... safe to remove the related functionality?
[17:09:23] it was regularly used on wdqs1009 before, it seems... i guess keep it for the moment
[18:30:00] * ebernhardson just realized he can't have pcc execute in the wmflabs context to tell him what actually changes in the end
[18:59:22] ryankemper: thanks for the incident report! https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-13_cirrussearch_restart
[19:34:41] ryankemper: I sent you a meeting invite with Erik for tomorrow to move forward (and hopefully merge) the puppet patches for WCQS
[19:35:24] I'm not going to be there, but you probably don't need me
[19:36:26] Feel free to do it on IRC or in meet (I think meet would provide higher bandwidth and help move those patches further, but your choice !)
[19:45:33] hmm, it seems that as fallout from that incident codfw lost the transient cluster settings
[19:45:38] so we started spamming deprecation logs
[19:45:57] re-applying the settings from eqiad to codfw
[21:41:17] ebernhardson: thanks
[21:41:29] I don't remember the context on the deprecation logs too well... is it something we want to make a permanent setting?
[21:41:39] i.e. how long are the transient settings supposed to stick around - is it until ES 7?
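For context, "re-applying the settings from eqiad to codfw" amounts to copying the transient block of the cluster settings from one cluster to the other. A rough sketch with the elasticsearch-py client (7.x-style body= API) and placeholder hostnames — an illustration, not the command that was actually run:

```python
# Sketch: copy transient cluster settings from one Elasticsearch cluster to
# another. Host names are placeholders; this is an illustration, not the
# actual procedure used.
from elasticsearch import Elasticsearch

src = Elasticsearch(["https://search.svc.eqiad.wmnet:9243"])
dst = Elasticsearch(["https://search.svc.codfw.wmnet:9243"])

transient = src.cluster.get_settings().get("transient", {})
if transient:
    dst.cluster.put_settings(body={"transient": transient})
    print("copied transient settings:", sorted(transient))
```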