[07:58:57] ottomata: yes please
[08:11:38] Hi search folks, if I update my own elastic from 6.5.x to 6.8, I should just be able to update the code, right? won't have to rebuild any indexes or anything?
[08:12:07] mainly thinking from an elasticsearch-internal kind of point of view, i guess this is a minor version change, so the new code should just run TM
[08:12:14] addshore: hopefully not unless you have indices created with elastic5
[08:12:20] cool!
[08:12:49] it's good to re-create your indices every once in a while (esp. after major version bumps)
[08:12:53] also I noticed that there is inconsistent naming with the plugins now? the syntax highlighting one is `6.8.23` but the extra one is `6.8.23-wmf1`, no problem, just thought it worth mentioning
[08:13:04] addshore: yes we missed that
[08:13:14] cool, just flagging it up, we adjusted our code to cope :)
[08:13:21] *drinks more coffee
[08:13:22] we use wmfX when we have to iterate over the same ES version
[08:13:30] :)
[08:43:20] ejoseph: we discussed yesterday that it might be valuable to put the 2 forks of HebMorph & elasticsearch-analysis-hebrew under gitlab.wikimedia.org instead of our personal github repo
[09:00:41] do we have a project space on gitlab already?
[09:08:22] gehel: quickly filed T301444 but please adapt it if I missed something
[09:08:22] T301444: Create new GitLab project group: search-platform - https://phabricator.wikimedia.org/T301444
[09:10:23] dcausse: do we want the path to be `repos/search-platform` or just `repos/search`?
[09:11:07] damn I was hoping you would not ask a question about naming :P
[09:11:19] I can help - I'm great at naming!
[09:11:29] We just use "search" in gerrit at the moment. It makes sense to me, as our team is responsible for both wikidata/query and search, and those 2 groups don't have much in common except the accidental constraint that they are both maintained by our team
[09:12:18] I don't know if we have naming guidelines yet. But it makes more sense to me to organize by something that looks like a project / product than by team
[09:12:25] but yes, naming is hard!
[09:14:20] is that per-team grouping really necessary?
[09:14:46] sounds super arbitrary, esp. considering potential external contributions
[09:15:25] I assume the groups make it easier to manage permissions. They don't have to reflect teams directly, but should reflect a cohesive set of repos
[09:15:26] As a potential soon-to-be external contributor, I vote no team in the path :)
[09:16:07] maybe, I don't think it helped me a lot with gerrit in the past, though
[09:17:02] but should there be "something" in the path? Is "search" in the path better than "search-platform"?
[09:17:26] I think it might reinforce the idea of wdqs being somehow related to search
[09:17:59] I would propose to NOT have wdqs under the same path, same as we have now in gerrit
[09:18:27] wdym? /wikidata/query/rdf doesn't share a path with search
[09:18:38] wdqs should probably be in `repos/wikidata/query/rdf` or something similar
[09:19:04] yep, it does not share a path with search now, and should not share a path in the future
[09:19:32] ah, was confused - we were talking about per-team grouping, so I assumed you wanted wdqs there as well
[09:20:49] in that case - will we put any of the cluster configuration stuff inside? pipelines, etc.? if so, I think search-platform makes sense
[09:20:55] not as a team, but as a function
[09:28:11] So, decision? We keep it to `search-platform`, the function, not the team?
[09:28:27] And we forget that I ever asked the question!
[09:28:39] IRC never forgets
[09:28:48] actually, that's not remotely true
[09:28:53] no one reads IRC history
[09:29:07] there's no such thing as IRC history, so even better
[09:29:23] there are IRC logs
[09:32:38] they contain messages?
[09:32:53] https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-search/
[09:33:05] ah, didn't know about them
[09:39:48] it's in the channel topic
[09:40:08] I never once read the channel topic completely, I'm afraid
[09:40:24] yeah, I know, it's pretty long
[09:40:29] and now I know, thank you :)
[09:42:04] errand
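A note on the 6.5.x → 6.8 question at [08:11:38]: a minor-version upgrade normally needs no reindexing, but indices created under Elasticsearch 5.x are worth flagging before any later move to 7.x. A minimal sketch of that check, assuming plain HTTP access on localhost:9200 (the endpoint is a placeholder):

```python
# Sketch: flag indices whose index.version.created shows they were created under 5.x.
# Assumes the cluster is reachable at http://localhost:9200; adjust as needed.
import requests

resp = requests.get("http://localhost:9200/_all/_settings/index.version.created")
resp.raise_for_status()

for index, body in sorted(resp.json().items()):
    created = int(body["settings"]["index"]["version"]["created"])
    major = created // 1_000_000  # version id encodes major*1000000 + minor*10000 + ...
    if major < 6:
        print(f"{index}: created under {major}.x, reindex before the next major upgrade")
```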
[09:42:16] Good morning
[10:00:47] o/
[10:02:26] I can change the name for the group to "wmf-team-search-platform"
[10:03:04] search-platform sounds OK for project space, it does not have to mean the "team"
[10:05:03] makes sense to not put wdqs-related stuff under the same search-platform project space
[10:06:03] hopefully we don't have anything wdqs-related to put in gitlab at the moment so we can postpone finding a good name for these :)
[10:19:08] sounds all good to me!
[10:23:58] WDQS will be a doozy, after introducing WCQS
[10:39:17] Lunch
[11:10:18] dcausse: I know I had this issue before, but I forgot how to solve it - how do you make the https proxy work with flink pipelines on YARN?
[11:10:58] zpapierski: you need to set up http routes in the updater config
[11:11:11] ah
[11:11:13] or change the flink yaml to add the proxy
[11:11:20] weird I didn't have them already
[11:11:24] thx
[11:11:45] I think you might be able to retrieve that in the flink-job.py script from the deploy repo
[11:23:18] lunch
[11:31:21] ah, good advice, thx
[11:31:27] lunch+errand
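Aside on the proxy question at [11:10:18]: independent of the updater http routes or the flink yaml change suggested above, it can help to confirm the proxy path itself from the stat host first. A minimal sketch; webproxy.eqiad.wmnet:8080 is the usual analytics webproxy, but treat the host, port, and target URL as assumptions:

```python
# Sketch: verify outbound HTTPS through the webproxy from an analytics host.
# Proxy host/port and the target URL are assumptions, not prescribed values.
import requests

PROXIES = {
    "http": "http://webproxy.eqiad.wmnet:8080",
    "https": "http://webproxy.eqiad.wmnet:8080",
}

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    proxies=PROXIES,
    timeout=10,
)
resp.raise_for_status()
print("proxy OK,", len(resp.content), "bytes received")
```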
[13:35:31] zpapierski: dcausse speaking of webproxy: https://phabricator.wikimedia.org/T300977
[13:35:36] add comments there if you have them plz!
[13:36:03] sure, looking
[13:56:03] ebernhardson: I commented on this proxy task ^. But I might have missed something.
[14:02:45] dcausse, zpapierski: blazegraph meeting: https://meet.google.com/dyi-sopm-ihj
[14:02:52] oops
[14:19:10] greetings
[14:31:14] o/
[14:32:17] o/
[14:50:51] errand, be back by retro
[15:42:02] I ran the pipeline on flink 1.14.3 (commons one for now). It doesn't fail but doesn't push anything either, so I'm guessing I misconfigured something, I'll try the wikidata config - dcausse, is your updater-job.properties on stat1004 valid?
[15:45:36] going to try my luck at the driving building soon; will miss retro today
[15:56:15] zpapierski: perhaps? but I'd do a quick check to not mess up prod topics
[15:56:43] of course
[16:02:19] @team: retrospective time: https://meet.google.com/ssh-zegc-cyw
[16:02:47] ejoseph, mpham, ebernhardson ^
[16:08:04] ottomata: i guess it's getting off track of the ticket so i won't post there to distract, but for ref https://phabricator.wikimedia.org/T120281#1895374
[16:15:38] ebernhardson: reading
[16:16:49] ottomata: i guess mostly i'm just wondering if i can kill this complexity, it's a bunch of moving pieces whose purpose is to go around that firewall
[16:17:12] ebernhardson: iirc you push to swift right now?
[16:17:13] or to kafka?
[16:17:14] or both?
[16:17:16] both right?
[16:17:22] ottomata: swift and kafka are open, it's elasticsearch that's closed
[16:17:23] kafka for notification that data has been put in swift?
[16:18:04] ottomata: right, for ~7 years now we put data in swift and notify over kafka. Could we just let analytics talk to elasticsearch now?
[16:18:22] I don't think anyone except me even knows how the prod side of the daemons works :)
[16:19:20] ebernhardson: it is possible, from that ticket it looks like the SREs' main objections were maintenance of the special rules more than anything else
[16:19:56] but, imo I think what you are doing is better than pushing directly to elasticsearch; you are kinda using swift as an async queue
[16:20:03] would be better if it all could just go through kafka
[16:20:13] but, either way, the produce is decoupled from the consume
[16:21:14] ottomata: i suppose i already think of it as decoupled, because the outputs are already in hdfs. I guess i think of this kafka+swift bit as an extra decoupling on top of something already decoupled
[16:22:59] with your current architecture the updates are pull, instead of push
[16:23:52] ottomata: sure, but the push is behind airflow with all the normal retry handling, scheduling, etc. I feel like we could run a minimal version of the current daemon code directly in yarn and get the same result
[16:24:10] (but not as a daemon, as an invokable)
[16:29:34] ebernhardson: ya i think it could be fine, you should comment on that parent task
[16:29:53] from a thousand-foot view, i prefer the event-sourced model over what you are trying to do
[16:30:05] my hammer is to make all updates be in a stream
[16:30:14] :)
[16:30:44] q: would it ever be useful to have the data you generate used by anything but the ES servers that eventually are updated by it?
[16:30:51] like, what if you wanted to pull the updates into a test ES cluster?
[16:30:59] without having to make your hadoop job push there?
[16:31:44] ottomata: not really, i mean everything is json lines and could be read but the formatting of everything is pre-done to be piped directly into elasticsearch _bulk apis
[16:32:25] but for testing elsewhere? maybe testing a new version of the es bulk api?
[16:39:47] ottomata: not really, at least not in the last number of years that I can remember. I suppose we moved relforge into the analytics network so that we could send test data directly to elastic without involving the kafka+swift bridge
[16:39:57] it was so much easier :)
[16:40:55] aye
[16:41:43] ebernhardson: btw, i'm hoping one day these kinds of jobs won't run in analytics hadoop. spark is fine, but hopefully on k8s and not using hdfs. using some object store for storage
[16:41:51] buuut that is super future so
[16:41:55] proceed! :)
[16:42:24] ottomata: maybe someday, although i'll say the primary reason we used yarn so much wasn't spark, it was the ability to request resources (sometimes stupidly excessive amounts) on demand
[16:42:37] yeah
[16:43:17] i hope to get that with k8s, but i also worry it will be locked away to SREs where i can't do things like yarn lets us
[16:44:06] i'll ponder some more, there are valid reasons for both directions i suppose
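For readers following the architecture described from [16:18:04] onward: the hadoop side writes pre-formatted JSON lines to swift and announces them over kafka, and a consumer inside the production network streams those lines into the Elasticsearch `_bulk` API. The sketch below only illustrates that handoff shape; it is not the real daemon, and the topic name, notification fields, and endpoints are all invented placeholders:

```python
# Schematic sketch of the swift + kafka handoff described above; NOT the real daemon.
# Topic name, notification fields, and endpoints are placeholders.
import json
import requests
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "search.bulk-import-available",                # hypothetical topic
    bootstrap_servers="kafka.example.wmnet:9092",  # placeholder broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for record in consumer:
    note = record.value
    # Assumed field: a URL pointing at newline-delimited _bulk actions stored in swift.
    bulk_body = requests.get(note["swift_url"]).text
    resp = requests.post(
        "http://elastic.example.wmnet:9200/_bulk",  # placeholder cluster endpoint
        data=bulk_body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()
```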
[17:15:41] err, hmm. Running the saneitize enqueue job works for some wikis, and for other wikis: Elastica\Exception\ResponseException from line 182 of /srv/mediawiki/php-1.38.0-wmf.21/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php: blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]
[17:15:56] we don't use read-only indices :S
[17:16:08] ouch
[17:16:12] i guess that must be the metastore?
[17:16:16] I hope
[17:18:48] I don't see any closed indices in either eqiad or codfw
[17:19:44] while checking logs, i see that search-loader instances are repeating: urllib3.exceptions.NewConnectionError: : Failed to establish a new connection: [Errno -2] Name or service not known
[17:19:56] * ebernhardson hopes it turns out one thing wrong and not 14 :P
[17:21:12] is it happening now or could it be due to the decom?
[17:22:28] dcausse: hmm, last message is 17:22 and it's currently 17:22 :(
[17:22:35] mediawiki is failing ElasticaWrite jobs as well
[17:25:37] it looks like elastic will lock an index to read-only if at any time the index went red, until an operator unlocks the index (until 7.4, when they changed that). Could that be happening? Not seeing any red alerts though
[17:26:41] on codfw dewiki_content_1624301897 has blocks.read_only_allow_delete=true set, i don't think we ever set that ourselves
[17:27:04] (not clear that actually sets read-only state either though, i think it's just conf)
[17:27:38] elastic sets that when disk is low
[17:28:02] nothing is anywhere close to full in codfw though :S
[17:30:20] ConnectTransportException[[][10.64.0.235:9500] connect_exception]; nested: AnnotatedNoRouteToHostException[No route to host: elastic1034.eqiad.wmnet/10.64.0.235:9500]; nested: NoRouteToHostException[No route to host];
[17:30:47] i don't understand :(
[17:30:48] mw only reports codfw
[17:30:55] but might be eqiad as well
[17:31:45] are the new nodes perhaps unable to talk to the old nodes and we are getting a split?
[17:32:01] hmm, i guess play with nc a bit
[17:32:41] it seems plausible elastic would do things that appear entirely bizarre if only some machines can talk to each other but they are all in a cluster together (able to talk to the same master)
[17:32:59] but i would expect yellow indices too :S
[17:33:53] ebernhardson: dcausse: so we're seeing the problem in codfw currently?
[17:34:17] ryankemper: updates failing with read-only exceptions, nodes failing with transport exceptions because they can't talk to each other, search-loader instances failing to connect
[17:34:19] mw job failures seem to be from codfw
[17:34:27] ack
[17:34:33] but I might be reading logstash badly
[17:34:42] FWIW no decom work has been done in codfw, only eqiad
[17:35:02] dcausse: the only read-only index i've tracked down so far is dewiki_content in codfw, but there are probably more (i kinda stopped looking after finding one)
[17:35:20] agreed on this btw:
[17:35:20] > it seems plausible elastic would do things that appear entirely bizarre if only some machines can talk to each other but they are all in a cluster together (able to talk to the same master)
[17:35:31] yes there are more but mostly dewiki indeed
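One way to enumerate every index carrying the block found on dewiki_content_1624301897 at [17:26:41], rather than checking them one at a time; a minimal sketch, run once per cluster, with the endpoint as a placeholder:

```python
# Sketch: list indices that have index.blocks.read_only_allow_delete set.
# Assumes the cluster is reachable at http://localhost:9200; run per cluster.
import requests

resp = requests.get(
    "http://localhost:9200/_all/_settings/index.blocks.read_only_allow_delete"
)
resp.raise_for_status()

for index, body in sorted(resp.json().items()):
    blocks = body["settings"].get("index", {}).get("blocks", {})
    if blocks.get("read_only_allow_delete") == "true":
        print(index)
```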
[17:36:00] elastic logs don't help on the master@codfw :/
[17:36:16] As for the transport exception, are those old logs or recent? `elastic1034` should certainly be gone now
[17:37:09] ryankemper: curiously, those messages only come from elastic1073 trying to talk to the omega cluster
[17:37:18] https://logstash.wikimedia.org/app/dashboards#/view/default?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(description:'The%20default%20landing%20page%20for%20Kibana.',filters:!(),fullScreenMode:!f,options:(darkTheme:!f),query:(language:lucene,query:%22AnnotatedNoRouteToHostException%22),timeRestore:!f,title:'-%20Home',viewMode:view)
[17:37:23] ouch, i thought it would shortlink :S
[17:37:36] transport failures: https://logstash.wikimedia.org/goto/f9a5c800f7c7d88156a574ab9d66b84c
[17:38:02] I wonder if 1034 or other old hosts are still listed as seeds for the purposes of the cross-elastic-cluster replication, checking
[17:38:41] ryankemper: oh i remember now, i think cross-cluster communication is set with the masters as the things to talk to in the remote cluster, we need to update it for the new masters
[17:38:48] https://www.irccloud.com/pastebin/U34jZKLR/
[17:38:58] (probably not our underlying problem, then)
[17:39:00] yeah, the above snippet is me looking at the main cluster (9200)'s settings
[17:39:19] but it should be fixed regardless, to have fewer things breaking at the same time :)
[17:39:53] ebernhardson: do you know if those seeds are manually set? cause it seems the `cirrus.yaml` master settings didn't auto update those seeds
[17:40:18] ryankemper: i think scripts/push_cross_cluster_conf.py in the cirrus repo but i should read it
[17:40:21] yeah, I'd like to fix it real quick to cut down on noise / in case cross-cluster replication is fully broken to omega, because that might make things weird
[17:41:29] ryankemper: it looks like that script mostly lets you define the cross-cluster config once and then it slices it to send the appropriate bit to each separate cluster (so the cluster doesn't try to remote itself)
[17:41:41] maybe, i didn't write it :)
[17:42:34] this script does not even have a doc
[17:42:37] ebernhardson: any guesses on where the seedfile comes from? is that something we manually create and then run this script?
[17:43:32] last chance is that I wrote something in the ticket
[17:43:46] ryankemper: T213150
[17:43:46] T213150: Configure elasticsearch crosscluster on production search servers - https://phabricator.wikimedia.org/T213150
[17:48:31] not seeing anything interesting in the health status transitions, just the typical daily shuffle as it creates titlesuggest indices
[17:49:35] yes same
[17:49:40] (wrt the seed stuff) heh too bad the contents of `psi_codfw_masters.lst` aren't included in the ticket
[17:49:47] based on the script code though the format is super simple
[17:49:55] i feel like elastic shouldn't have been able to set an index to read-only without logging, but i can't find the logs :S
[17:49:58] i'll construct the files w/ the new masters as the seeds
[17:50:18] ryankemper: it's one master per line
[17:50:40] i guess we can loop over all the indexes and re-open them, any reason not to?
[17:50:55] well, it's not even re-opening, it's removing the index block setting
[17:52:15] ebernhardson: you see them closed?
[17:52:30] dcausse: not closed, but with `index.blocks.read_only_allow_delete=true` set
[17:52:39] oh
[17:53:14] yes we need to remove this
[17:53:17] the only reference i can find to that so far is that elasticsearch will auto-set that when disks are too full, but we've run disks way more full before and never seen anything similar
[17:53:23] yes
[17:53:48] we can use _all perhaps?
[17:54:00] not sure if we allow that
[17:54:07] i'm not sure either :)
[17:56:39] updating the index settings is blocked by a ClusterBlockException :P
[17:57:22] oops trying too
[17:57:36] dcausse: go ahead, i'll poke logstash some more
[17:57:54] done
[17:58:03] _all seemed to have passed
[17:59:08] seems more calm
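For reference, the `_all` fix applied between [17:53:48] and [17:58:03] amounts to nulling the block setting on every index in a single settings update; a minimal sketch of the same call, with the endpoint as a placeholder:

```python
# Sketch: clear index.blocks.read_only_allow_delete on every index in one call,
# the same idea as the _all update applied above. Endpoint is a placeholder.
import requests

resp = requests.put(
    "http://localhost:9200/_all/_settings",
    json={"index.blocks.read_only_allow_delete": None},  # null removes the block
)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```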
[18:01:05] err, hmm. I was going to check the CirrusSearchChangeFailed rate to see if things are fixed. We are already at the lowest point of errors in the last 2 weeks (at 1.7M failures per 12 hours)
[18:01:05] maybe needed for the smaller cluster
[18:01:06] sigh
[18:01:30] i suppose i don't remember, but i don't feel like we were ever failing millions of updates a day before
[18:01:30] wow
[18:01:38] no
[18:01:51] for doc missing?
[18:02:48] not sure yet, the recent ones are ReadOnly but need to agg over the rest
[18:02:59] running that on psi & omega
[18:03:21] we're definitely missing some alerting
[18:03:42] indeed
[18:03:53] it's codfw, so that's why it remained unseen I guess
[18:04:02] by users I mean
[18:04:27] yea makes sense, users won't notice when the secondary cluster is wrong
[18:06:40] wow started on jan 16 :(
[18:07:00] i guess that's why i couldn't find any logs, didn't even think to look that far back :(
[18:07:10] yes
[18:07:14] so, codfw needs a month of catchup?
[18:07:21] looks like it :/
[18:08:33] well, change failed is clearly fixed now. I suppose we need an incident report? Is it an incident if it lasted a month?
[18:08:53] :)
[18:09:26] well I don't know what qualifies as an incident, I feel like since no users were impacted it's not really an incident?
[18:10:37] hmm, perhaps. Looks like we also still need to look into the search-loaders, still seeing them try to talk to 1038.
[18:10:59] i guess this was two separate things, the decom turned up a few pieces of hardcoded conf, and whatever caused codfw to go read-only
[18:11:19] need to take care of the kids, will be back later
[18:11:23] kk, thx!
[18:14:22] Okay, seeds should be all fixed on chi/omega/psi eqiad
[18:21:56] lunch, back in ~1h
[18:24:53] ryankemper: will probably want a ticket or come up with some way that that stays aligned to the puppet-configured masters in the future
[18:25:19] ebernhardson: agreed, will write up a ticket
[18:26:28] for the search-loader instances, i think we can just restart them and call it good enough. What happened is the elasticsearch client there is "smart" and instead of talking to the LVS endpoint it sources a nodelist from the first node it talks to
[18:26:36] i suppose that needs another ticket to turn off the magic and use LVS
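The "smart" behaviour described at [18:26:28] is client-side sniffing: the Python client asks the first node it reaches for the full node list and then connects to nodes directly, bypassing LVS. A minimal sketch of pinning elasticsearch-py to a single endpoint instead; the hostname is a placeholder, and whether search-loader actually exposes these options is an assumption:

```python
# Sketch: keep the Python client on the LVS endpoint instead of sniffing node lists.
# Hostname is a placeholder; the sniff_* options are elasticsearch-py's.
from elasticsearch import Elasticsearch

client = Elasticsearch(
    ["https://search.example.wmnet:9243"],  # LVS endpoint (placeholder)
    sniff_on_start=False,            # don't fetch the node list at startup
    sniff_on_connection_fail=False,  # ...or when a connection fails
)
print(client.cluster.health())
```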
[18:27:43] is there a nice way to restart a bunch of systemd units that use @? or do i need a `for i in $(seq 0 7); ...`
[18:29:10] ebernhardson: you mean like `sudo systemctl restart elasticsearch_6*`?
[18:29:26] we can use *? TIL
[18:29:49] ebernhardson: yeah, made my day when I found that out. systemd is really nice about that
[18:30:07] I'm always using the globs with `list-units`, `status`, and `restart`
[18:30:32] (`list-units` is great to sanity-check that your restart glob is gonna restart what you think it will and no more)
[18:30:45] or `status` I suppose, it's just noisier
[18:31:38] as for the way to make sure we stay aligned with puppet-configured masters, interestingly this ticket implies we should have had alerts firing for settings drift: https://phabricator.wikimedia.org/T218932#5100726
[18:35:32] ryankemper: we have /etc/elasticsearch/production-search-eqiad/cirrus_check_settings.yaml, but the command line refers to /etc/elasticsearch/${title}/cirrus_settings.yaml
[18:37:22] huh, no i'm reading old code :P
[18:39:20] https://github.com/wikimedia/puppet/blob/f3920513905975e7bce93b21c59f5e30fc3eb5b3/modules/icinga/manifests/monitor/elasticsearch/cirrus_settings_check.pp#L34
[18:42:02] Interestingly I don't see the file getting placed? despite this block https://github.com/wikimedia/puppet/blob/f3920513905975e7bce93b21c59f5e30fc3eb5b3/modules/icinga/manifests/monitor/elasticsearch/cirrus_settings_check.pp#L21-L28
[18:42:28] Nevermind I do, scratch that
[18:42:36] https://www.irccloud.com/pastebin/es0CUpSN/
[18:43:46] So yeah unsure why the alert wasn't firing, because these contents look right, implying puppet had the info it needed for the cirrus settings check to detect the mismatch
[18:43:48] https://www.irccloud.com/pastebin/ASPQeLZR/
[18:44:25] ryankemper: i'd agree, the deployed check seems legit. The only thing i could suggest is adding an extra (valid) host to codfw and trying to convince icinga to complain
[18:44:52] but we already figured that won't (likely) complain... hmm
[18:49:37] It seems like things are mostly under control, i'm not feeling 100% and taking the rest of the day off. Sent a mail to -private.
[19:09:43] 👍🏿 thanks for all your help erik
[19:27:05] just catching up on the scrollback. Nice work all, if I can help follow up on anything LMK
[19:41:37] ryankemper: there's chatter about an incident that just ended in the security channel, just checking to see if any of the earlier ES stuff could be related
[19:46:32] seems unlikely based on what I'm reading, but just wanted to verify
[20:07:32] gehel: inflatador: Here's an example check command for the main (chi) cluster `/usr/lib/nagios/plugins/check_cirrus_settings.py --url http://localhost:9200 --settings-file /etc/elasticsearch/production-search-eqiad/cirrus_check_settings.yaml`
[21:15:14] found the issue with the cirrus settings check
[21:15:18] turns out it has never worked
[21:15:18] https://phabricator.wikimedia.org/T301511#7702414
[23:26:52] Since I forgot to post it earlier, here's the log of the changes made to the cluster seed settings to bring them into alignment: https://phabricator.wikimedia.org/T294805#7701855
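For context on the check command at [20:07:32] and the T301511 finding that it never worked: the intent is to compare the live cluster settings with the puppet-shipped YAML and alert on drift, which is exactly what the stale cross-cluster seeds should have tripped. A simplified illustration of that comparison, not the real check_cirrus_settings.py; the YAML layout and the flat-settings shape are assumptions:

```python
# Simplified illustration of a cirrus settings drift check: compare expected settings
# from a YAML file against GET _cluster/settings. Not the real check_cirrus_settings.py;
# the expected-file layout is an assumption.
import requests
import yaml

EXPECTED_FILE = "/etc/elasticsearch/production-search-eqiad/cirrus_check_settings.yaml"
CLUSTER_URL = "http://localhost:9200"

with open(EXPECTED_FILE) as f:
    # assumed shape: {"persistent": {"some.setting.name": value, ...}, ...}
    expected = yaml.safe_load(f)

live = requests.get(f"{CLUSTER_URL}/_cluster/settings?flat_settings=true").json()

drift = []
for section, settings in expected.items():
    for name, want in settings.items():
        got = live.get(section, {}).get(name)
        if str(got) != str(want):
            drift.append(f"{section}/{name}: expected {want!r}, found {got!r}")

if drift:
    print("CRITICAL: cluster settings drift detected")
    print("\n".join(drift))
    raise SystemExit(2)  # nagios CRITICAL exit code
print("OK: cluster settings match " + EXPECTED_FILE)
```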