[01:20:10] \o/
[07:32:22] sup consumer@codfw in a sort of crash loop, seems to have trouble with swift
[07:59:35] going to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1071805
[08:12:09] pfischer: Cormac is looking at some refactoring of the Image Suggestion pipeline. He will reach out to you to get more context on the new SUP.
[08:17:08] timeout increase deployed, let's see if this fixes the issue...
[08:52:10] gehel: He already did, we have a meeting later today.
[08:53:41] dcausse: flink was not able to read/write snapshots?
[08:54:43] pfischer: it was having frequent timeouts when writing, the issue is that when writing it does not retry the request, so the job restarts
[08:56:02] pfischer: unsure if you saw or if you're available, but we have a quick meeting with Gabriele to discuss shipping events from spark (relates to weighted tags and image suggestions)
[08:57:14] dcausse: I saw but forgot to answer. I’ll be there.
[08:57:24] great, thanks!
[09:04:21] dcausse: The swift dashboard does not show anything suspicious, only increased network I/O shortly after SUP restarts, but I'm not sure this is necessarily correlated.
[09:04:33] (and it would not explain the timeout upfront)
[09:07:32] yes... not sure what's causing these timeouts tbh... I haven't dug too much into the dashboards, it just reminded me of an issue we had previously (T362508)
[09:07:33] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508
[09:07:52] where we bumped the s3.socket-timeout from 5s to 30s
[10:13:57] lunch
[12:31:26] it's now the sup producer@eqiad failing... InvalidMWApiResponseException: Not Found still, so similar to last time with labswiki events
[12:42:50] going to catch this exception and let another component fail (hopefully the fetcher, so that we get a fetch failure) or later, but that'll fail the pipeline too
[12:58:17] scratch that, it'll fail later... fetching is done in the consumer
[12:58:32] tempted to filter them out with a warn...
[13:13:19] But for which API call does the producer get that response?
[13:13:51] It only fetches search config to build an index map
[13:15:20] hmm, looks like we had some WDQS maxlag alerts a few hrs ago...wonder what that was about
[13:15:41] pfischer: yes, that's this API
[13:15:53] inflatador: just saw it but haven't looked yet
[13:16:23] pfischer: if you have a sec: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/162
[13:17:29] the issue, I believe, is that we receive events with a domain that's not supported by the mediawiki mesh endpoint we use
[13:17:36] o/
[13:18:08] dcausse np, can take a look in ~45 if you're busy w/other things
[13:18:17] thx!
[13:18:18] dcausse: approved
[13:18:33] thanks, will deploy to unblock the pipeline
[13:19:08] dcausse: thanks!
[13:46:51] \o
[13:50:30] .o/
[13:53:44] o/
[13:55:20] waiting for the backport window to end before shipping https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1071875
[14:46:22] dcausse just saw your comment re: reimaging wdqs2021 ( https://phabricator.wikimedia.org/T373791#10116233 ). OK if I start the reimage now?
[14:47:07] more importantly, CODFW maxlag is spiking to ~1h. Guessing our "friend" is back again
[14:48:28] hmm, actually the worst was a few hrs back. Still, it's over 10m
[14:55:45] hmmm....maybe not? wcqs is falling behind too
[14:56:07] something going on w/wikikube in codfw maybe?
[14:57:36] or one of the other dependencies...hmm
[15:12:46] looks like codfw kafka might be to blame
[15:15:20] weird...2024-09-10T15:14:48 WARN: Failed to fetch settings for wiki labswiki, domain wikitech.wikimedia.org
[15:15:43] it's clearly in the wikiids filter
[15:16:24] ebernhardson: thanks for the deploy, was distracted by meetings
[15:16:39] no worries
[15:30:42] something's wrong with our deduplication graphs, on restarting it reports a 2000% deduplication ratio
[15:34:58] ouch
[15:36:23] inflatador: re wdqs2021 yes, please go ahead, you depooled this host already IIRC
[15:37:09] dcausse ACK, starting now
[15:40:26] Also, it seems the w[cd]qs maxlag alerts were related to Kafka, they've already cleared and we should hopefully be OK
[15:42:42] hmm, maybe two problems with the dedup, i suspect the high numbers are because flink already had records in state, so they don't report as coming in, only out? And we are using the total count since start, not the rate
[15:43:57] when starting up the rate going out is very elevated for the first ~1 minute, but the rate in is normal
[15:48:09] ah, wikiids: -labswiki,-labtestwiki should be wikiids: -labswiki;-labtestwiki (; is used to separate array values I think)
[15:48:23] oh sigh...for some reason i always forget that
[15:48:38] arg parsing does it with the ; because , means something in a map conversion
[15:48:43] iirc
[15:49:51] ; is kind of surprising for separating array values tbh
[15:49:59] i guess dedup rate should perhaps be a rate of the records shifted by the window length? then on startup it would simply report nothing, but might be
[15:50:09] rate of the records coming in
[15:50:49] i suppose shifting is also not right, because the windows don't extend...
[15:51:01] maybe just have the updater report more directly instead of using record in/out counts
[15:51:03] ebernhardson: is it in the sup dashboard?
[15:51:06] dcausse: yea
[15:51:22] i suppose it's not a big deal, it's just that 2000% was very noticeable :)
[15:51:29] I thought we reported the dedup/merged number itself
[15:51:46] hmm, maybe lemme check. The graph is from the in/out counts of the operator
[15:52:00] iirc the merger is computing it
[15:52:44] hmm, indeed we have metrics for the duplicate and merged count, will see if i can make a more direct graph
[15:52:52] sure
[15:53:25] would be something like records out / (out + merged + deduplicated) i suppose
[15:53:39] I believe so?
[16:01:04] replaced all the records in metrics with out+merged+deduplicated, looks much better
[16:02:24] workout, back in ~40
[16:03:54] and using 2m rates, so it's now showing what's happening at that moment, rather than the aggregate since start
[16:11:02] thanks!
[16:11:08] shipping the wikiids fix
[16:15:57] I do wonder if prometheus is smart enough to cache the results between queries, or if we should try to be nicer to prom by running the separate queries and doing the math in an expression so it happens client side
[16:16:15] this one graph basically queries the same metrics 3+ times
[16:17:37] not sure I've ever done such things
[16:18:46] yea probably doesn't matter
[17:40:01] back...forgot I had to get a couple of vaccines
[17:46:16] hmm, surprised phpcs doesn't enforce the `): [return_type] {` being on its own line in a multi-line function declaration. Is that just something we do locally?
[17:52:30] hmm, wdqs2021 reimage failed. Haven't had much luck with reimaging lately ;(
[17:53:01] you all do some batch processing to improve ranking in search indexes, yes?
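(Sketch, for reference: the corrected dedup ratio discussed above, expressed as a PromQL query over 2m rates. The operator and counter names below are assumptions — Flink's standard numRecordsOut operator metric plus hypothetical SUP counters for merged and deduplicated records — so the real dashboard panel may differ.)

  sum(rate(flink_taskmanager_job_task_operator_numRecordsOut{operator_name="update-merger"}[2m]))
  /
  (
      sum(rate(flink_taskmanager_job_task_operator_numRecordsOut{operator_name="update-merger"}[2m]))
    + sum(rate(flink_taskmanager_job_task_operator_merged_updates_total[2m]))
    + sum(rate(flink_taskmanager_job_task_operator_deduplicated_updates_total[2m]))
  )

This is also the shape that queries the same metric several times in one panel, per the caching question above; Prometheus recording rules, or doing the division client-side in a Grafana expression, would be the usual ways around that.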
[17:53:07] dinner [17:53:10] how does that get pushed to elastic search? [17:53:23] is that outside of SUP? [17:54:42] ottomata: yes, we write files (one per wiki) of pre-formatted elasticsearch bulk indexing requests into swift, then there is a daemon called mjolnir-bulk-daemon that reads those files and pushes them into the correct cluster [17:54:51] ottomata I think it's this guy: https://gitlab.wikimedia.org/repos/search-platform/mjolnir ...we have some VMs that run this workload but it will probably be k8s before too long [17:54:56] ottomata: there is an event that gets shipped to tell the daemon about newly available files [17:55:38] ottomata: i suppose in theory it could write directly these days, but that was necessary when it was implemented because there was a firewall (vlan) between analytics and elasticsearch [17:55:58] okay, so not SUP. is there a desire to do this more like SUP? or is this totally different because they are lower level elasticsearch API stuff, and not like event/page based updates of data? [17:56:17] ottomata: hmm, i think the issue is reconciling bulk vs stream [17:56:24] ottomata: the bulk updates can take 24h+ to index [17:57:52] and thats pushing at a rate of ~1k docs/sec, so it would overwhelm the normal metrics. This weeks push: https://grafana.wikimedia.org/goto/G8m-D16Ig?orgId=1 [17:58:49] hm. [17:59:28] i know very little about elasticsearch. is the bulk update a specific API, or are you just iterating over a large batch individual document updates? [17:59:45] it's a specific API [17:59:56] got it okay, so that makes it very different than SUP, ya? [17:59:56] https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html [18:00:10] basically jamming a bunch of documents into one API call [18:00:12] ottomata: it's a specific api, we actually send all updates through the bulk api. There are other api's for doing single document updates but we decided awhile ago to send everything through the same path to reduce variations [18:00:19] oh [18:00:22] ottomata: SUP also batches updates together [18:00:56] so in principal they could be the same, its just that a separate pipeline is nice so that the mjolner stuff doesn't overwhelm the regular updates? [18:01:17] ottomata: yea, thats my current thought [18:01:20] got it hanks [18:01:39] i'm mostly asking because i'm putting together a slide deck about data platform techs, and am using different data piplines as the narrative to show the techs [18:01:51] trying to decide if i want to mention the batch updates for elastic search [18:01:53] maybe not :) [18:02:17] also the name is a misnomer, mjolnir is our machine learned ranking bits, we built the daemon to do things like model updates, but then reused the same daemons for the updates we calculate in bulk in the hadoop cluster [18:03:26] aye, i have no idea hat mjolnir stands for so misnomer or not, it is some noun I am familiar with :p [18:03:28] maybe it's not a tech you want to advertise :P There is a second mjolnir-msearch-daemon which is properly insane (uses request/response/control topics to do api over kafka :P) [18:03:43] wut ha [18:04:03] we needed to do millions of elasticsearch search requests from hadoop, and there was a firewall [18:04:12] (to collect feature vectors for ML) [18:04:15] sounds cool to me ;P [18:04:27] oh sounds familiar [18:04:39] you get a response in hadoop through kafka? 
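(For reference: the pre-formatted bulk indexing requests described above — the files mjolnir-bulk-daemon pulls from swift, in the shape the _bulk endpoint expects — are newline-delimited JSON, with an action line followed by a document line. A minimal sketch; the index name, document ids, and fields here are illustrative, and the real files may use different update handlers:)

  { "update": { "_index": "enwiki_content", "_id": "12345" } }
  { "doc": { "popularity_score": 0.0004 } }
  { "update": { "_index": "enwiki_content", "_id": "67890" } }
  { "doc": { "incoming_links": 42 } }

A whole file of these can be streamed to POST /_bulk in chunks, which is the "jamming a bunch of documents into one API call" bit.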
[18:04:40] OK lunch, back in time for pairing
[18:06:21] ottomata: yes, each request/response is a single event on the appropriate topic, then there is a control topic which reflects a special message so the bits receiving the responses know the last offset to read up to.
[18:06:41] it basically sends all the requests, then waits for a message on the control channel to tell it the final offset, then reads from where it started until the final offset
[18:07:15] we basically send a custom message into every partition after shipping all the requests, and that gets reflected back on the control topic
[18:07:51] it's properly silly :P
[18:08:18] nice!
[18:08:53] ebernhardson: back to mjolnir bulk: what is the source data? eventlogging_SearchSatisfaction ?
[18:09:30] ottomata: eqiad.swift.search_updates.upload-complete
[18:09:42] no i mean, to generate the bulk ranking updates
[18:10:34] ottomata: ahh, the source queries come from a data pipeline that reads webrequests and search satisfaction and transforms them into the discovery.query_clicks table
[18:10:46] perfect thank you
[18:11:26] those aren't really bulk ranking updates though, those are used to train the ML models
[18:11:46] what ML models?
[18:12:23] ottomata: i might be confusing a few things, the bulk ranking updates are things like popularity_score, which is sourced from webrequests, we also calculate the incoming links count from our own dumps of the search index
[18:12:33] ottomata: the ML models are what do the final ranking on the top 18 wikis
[18:12:44] they are decision trees using xgboost / lambdamart
[18:12:53] they are shipped and run by elastic search?
[18:13:08] ottomata: yes, we train the models in hadoop and ship them to elastic using the same mjolnir-bulk-daemon
[18:13:12] amazing
[18:13:13] k
[18:15:00] ebernhardson: is this about right?
[18:15:02] Ranking and ML model training and serving with ElasticSearch:
[18:15:03] Webrequest & EventLogging SearchSatisfaction instrumentation data in Hive:
[18:15:03] -> Airflow & Spark batch generate bulk ElasticSearch updates to improve ranking and train ML models
[18:15:03] -> Swift storage
[18:15:03] -> [EventGate -> Kafka] notification event on new upload
[18:15:03] -> Daemon process on ElasticSearch nodes gets notification
[18:15:03] -> download data from Swift and bulk update ElasticSearch
[18:15:46] ottomata: yea that looks right
[18:15:50] ty
[18:15:58] ottomata: except it might skip eventgate, i would have to check
[18:16:14] I think it doesn't, i kind of remember this...? unless it changed
[18:16:17] but ya
[18:16:28] not super important, i think that might pre-date eventgate
[18:16:32] really? hm
[18:16:47] that would be 2017/2018
[18:16:48] but then you'd have to maintain kafka producer configs in your airflow configs somewhere?
[18:16:49] hm!
[18:17:10] hmm, indeed i haven't changed those in years, suggests maybe not
[18:17:16] swift*.upload-complete streams are handled by eventgate-analytics
[18:17:43] could be bypassed though?
[18:17:48] ottomata: looks like we pass explicit broker lists in, currently jumbo1007,jumbo1008,jumbo1010
[18:17:51] oh my
[18:17:59] okay i'll just remove it, detail doesn't really matter
[18:19:23] ottomata: looked closer, it looks like it's split. the uploads go to eventgate via the swift_upload.py script, the search request/response fake-api talks to jumbo directly
[18:19:45] ah k. i will omit the fake-api in this pipeline overview :D
[18:19:51] fair
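(For reference, a rough client-side sketch of the request/response-over-Kafka pattern described above. Topic names, message shapes, and the sentinel handling are assumptions for illustration only; the real client and daemon live in the mjolnir repo and differ in the details.)

"""Sketch of the mjolnir-msearch style API-over-Kafka handshake, client side.
All topic names, message shapes, and broker addresses are made up for illustration.
"""
import json
import uuid

from kafka import KafkaConsumer, KafkaProducer, TopicPartition

BROKERS = "localhost:9092"                    # assumed; real code points at kafka-jumbo
TOPIC_REQUEST = "mjolnir.msearch-request"     # hypothetical topic names
TOPIC_RESPONSE = "mjolnir.msearch-response"
TOPIC_CONTROL = "mjolnir.msearch-control"


def run_batch(requests):
    run_id = str(uuid.uuid4())
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda d: json.dumps(d).encode("utf8"),
    )

    # 1. Ship every search request as its own event on the request topic.
    for req in requests:
        producer.send(TOPIC_REQUEST, {"run_id": run_id, "request": req})

    # 2. Send a sentinel into every partition of the request topic. The daemon
    #    copies each sentinel onto the control topic once everything produced
    #    before it in that partition has been processed.
    partitions = producer.partitions_for(TOPIC_REQUEST)
    for p in partitions:
        producer.send(TOPIC_REQUEST, {"run_id": run_id, "sentinel": True, "partition": p}, partition=p)
    producer.flush()

    # 3. Wait on the control topic until one reflected sentinel per request
    #    partition has been seen; at that point all responses have been written.
    control = KafkaConsumer(
        TOPIC_CONTROL,
        bootstrap_servers=BROKERS,
        group_id="msearch-client-" + run_id,
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf8")),
    )
    seen = set()
    for msg in control:
        if msg.value.get("run_id") == run_id and msg.value.get("sentinel"):
            seen.add(msg.value["partition"])
            if seen == set(partitions):
                break
    control.close()

    # 4. Read the response topic up to its current end offsets. (A real client
    #    would record its starting offsets before step 1 instead of reading each
    #    partition from the beginning.)
    reader = KafkaConsumer(
        bootstrap_servers=BROKERS,
        value_deserializer=lambda b: json.loads(b.decode("utf8")),
    )
    assignments = [TopicPartition(TOPIC_RESPONSE, p)
                   for p in reader.partitions_for_topic(TOPIC_RESPONSE)]
    reader.assign(assignments)
    end_offsets = reader.end_offsets(assignments)
    reader.seek_to_beginning(*assignments)
    remaining = {tp: off for tp, off in end_offsets.items() if off > 0}
    responses = []
    for msg in reader:
        if msg.value.get("run_id") == run_id and "response" in msg.value:
            responses.append(msg.value["response"])
        tp = TopicPartition(TOPIC_RESPONSE, msg.partition)
        if msg.offset + 1 >= remaining.get(tp, 0):
            remaining.pop(tp, None)
            if not remaining:
                break
    reader.close()
    return responses

The daemon side would be the mirror image: consume the request topic, run each search against elasticsearch, write results to the response topic, and copy any sentinel it sees onto the control topic.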