[07:45:40] ejoseph: I need to take care of some errands, so I'll be out 10AM-~12:30PM - let's connect before that
[07:46:35] zpapierski: Emmanuel is meeting Maryana from 9 to 10
[07:46:55] that's some nasty example of bad timing :(
[07:51:41] Yh
[07:51:52] I have a meeting
[07:54:23] Let me know when you are back
[07:57:16] I am not sure what to do about this, any recommendation? https://usercontent.irccloud-cdn.com/file/ltDRw7tj/Screen%20Shot%202021-11-10%20at%208.56.01%20AM.png
[07:58:04] ejoseph: that's an automated comment from SonarCloud (https://sonarcloud.io/organizations/wmftest/projects)
[07:59:07] RuntimeException is a very generic exception, which should almost never be used directly. In this case, an IllegalStateException might be better at representing what's going wrong
[07:59:36] I was just hitting enter on a similar answer :D
[08:00:56] this generally suggests creating a more specifically named exception, like SearchLatencyException, that inherits from RuntimeException
[08:01:48] For a simple use case like this I would probably not create a new Exception, but try to reuse a standard one if there is one
[08:01:54] we don't always pay attention to that advice, but in this case there are at least two things - IOException has its unchecked version you can use
[08:02:11] called UncheckedIOException (surprise)
[08:03:15] actually, ok single thing - I imagine only OSService can throw IOException here
[08:52:18] errands (as mentioned, be back around 11:30AM UTC)
[10:49:12] lunch
[11:01:23] ejoseph: I got confused, I'm busy from 12:30PM, I have half an hour now
[11:01:48] if you've got time, we can connect?
[11:03:09] i'll send a code with me link shortly
[11:03:14] ok
[11:33:21] lunch
[13:28:45] ejoseph: I have 30' right now if you want to pair on https://gerrit.wikimedia.org/r/c/search/extra-analysis/+/737024 ?
[13:31:13] ok sure
[13:31:40] ejoseph: meet.google.com/nmv-idwx-xgb
[13:32:00] you can send me a code with me link
[13:34:09] ok
[15:01:23] break until the next meeting
[15:18:27] I heard search has issues in commons https://commons.wikimedia.org/w/api.php?action=query&format=json&uselang=de&generator=search&gsrsearch=filetype%3Abitmap%7Cdrawing%20CAT&gsrlimit=40&gsroffset=0&gsrinfo=totalhits%7Csuggestion&gsrprop=size%7Cwordcount%7Ctimestamp%7Csnippet&prop=info%7Cimageinfo%7Centityterms&inprop=url&gsrnamespace=6&iiprop=url%7Csize%7Cmime&iiurlheight=180&wbetterms=label
[15:18:34] in slack, engineering-all
[15:24:39] Amir1: thanks for the ping
[15:24:55] something's weird happening on the filename namespace
[15:28:04] please ignore any 'test' email messages some of you might have received. test is now over! :)
[15:31:05] dcausse: do you need any help on that production issue for Commons?
[15:31:15] I let the SD team know, they'll look at it now too
[15:31:25] still trying to understand what changed...
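A minimal sketch of the UncheckedIOException advice from the 07:59-08:03 exchange above. `OSService` is named in the chat, but the interface, wrapper class, and method shown here are hypothetical stand-ins, not the code the SonarCloud comment was actually about:

    import java.io.IOException;
    import java.io.UncheckedIOException;

    // Stand-in for the real OSService from the chat; only its checked exception matters here.
    interface OSService {
        byte[] readFile(String path) throws IOException;
    }

    class RuleLoader {
        private final OSService osService;

        RuleLoader(OSService osService) {
            this.osService = osService;
        }

        byte[] loadRules(String path) {
            try {
                // hypothetical call standing in for whatever I/O actually throws IOException
                return osService.readFile(path);
            } catch (IOException e) {
                // Rather than the generic `throw new RuntimeException(e)` that SonarCloud flags,
                // wrap the checked IOException in its standard unchecked counterpart.
                throw new UncheckedIOException("failed to load rules from " + path, e);
            }
        }
    }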
[15:31:41] dcausse: we'll be in the hiring meeting, scream if you need us
[15:31:56] I can join a couple minutes
[15:32:58] seems to have started to fail around 1pm UTC today
[15:33:41] this was reported in T295480 too
[15:33:41] T295480: Searching for files on Commons returns error - https://phabricator.wikimedia.org/T295480
[15:37:08] it's odd... Seddon says nothing has been merged in 9 days to WikibaseMediaInfo, and there wasn't a train today
[15:40:23] seems elastic related
[15:40:24] elastic is having major issues in eqiad, i'd probably switch to codfw
[15:40:44] +1
[15:40:56] /_cat/indices | grep commonswiki_file says an index exists, /commonswiki_file/_search says no index
[15:42:33] let's switch and then try to recover
[15:42:38] kk
[15:42:42] you send a patch?
[15:42:46] not yet, i'll make one
[15:42:55] ok
[15:43:43] do we have a ticket?
[15:44:23] * ebernhardson idly wonders if there is some magic barrier just over 4tb, but that wouldn't be a 32 bit limit or some such... dunno
[15:44:58] yes, T295480
[15:44:58] T295480: Searching for files on Commons returns error - https://phabricator.wikimedia.org/T295480
[15:45:59] I UBN'ed it, hope that's fine
[15:46:48] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/737938
[15:49:52] * cormacparle waves
[15:49:56] can't see backscroll
[15:50:02] \o
[15:50:07] o/
[15:50:12] anyone have a tl;dr for me on what's up on commons?
[15:50:16] cormacparle: elastic in eqiad is in a weird shape
[15:50:16] we're moving traffic to the backup cluster... but i worry that cluster will start exhibiting the same symptoms
[15:50:33] ok cool
[15:50:45] commonsfile seems empty
[15:50:58] yeah, just looking at that
[15:51:14] so weird
[15:51:23] it's there in codfw
[15:53:03] eqiad->codfw synced out
[15:53:49] seems like the index was created but the mapping was left out
[15:54:02] the index timestamp is from june though
[15:54:52] on codfw?
[15:55:25] no eqiad
[15:55:53] commonswiki_file_1623767607@eqiad is empty and has no alias
[15:55:59] no, i mean we name indexes foo_content_12345, the commonswiki index in eqiad has a timestamp of june 15th, so it's the same index we've been using for months
[15:56:24] i don't understand how it could have 1 primary and 2 replicas, but exist since june
[15:56:33] * ebernhardson feels like corrupted cluster state, but how?
[15:56:41] "number_of_shards": "1", so weird...
[15:56:55] we were running queries against it earlier today via a tunnel from mw-vagrant to test a bug
[15:57:10] could we have fucked it up somehow?
[15:57:20] * cormacparle hopes the answer is "no"
[15:57:56] I doubt it, the oddity here is that it has a single replica, the index is months old, and on this version of elastic you can't change the primary count of a live index
[15:58:05] "creation_date": "1636552790304" which is now
[15:58:10] oh!
[15:58:15] so the index was recreated
[15:58:22] (but with old name, so not through the scripts)
[15:58:38] double checking auto-creation settings, i don't think we allow auto-create
[15:59:33] yea auto create is disabled here, something had to issue the create request then: action.auto_create_index: +apifeatureusage-*,+glent_*,-*
[16:00:21] elastic probably logs something about index creation, have to find it though
[16:00:44] [commonswiki_file_1623767607/nAeX-xvfRDGpny497k_mxg] create_mapping [page]
[16:00:47]
[16:00:56] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[commonswiki_file_1623767607]
[16:00:59] :/
[16:01:11] when? What else can we correlate at that timestamp?
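For readers trying to square the 15:40:56 observation with the 15:55:53 one: searches go through the `commonswiki_file` alias, which normally points at the timestamped index (`commonswiki_file_1623767607`); the recreated index kept the concrete name but lost its alias, so `_cat/indices` still listed an index while alias-based searches failed with index_not_found. A rough diagnostic sketch of that check using the Elasticsearch low-level Java REST client; host and port are placeholders, not the production endpoints:

    import org.apache.http.HttpHost;
    import org.apache.http.util.EntityUtils;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    public class AliasCheck {
        public static void main(String[] args) throws Exception {
            // placeholder endpoint; in the incident this was the eqiad search cluster
            try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
                // concrete indices matching the name: this still listed commonswiki_file_1623767607
                Response indices = client.performRequest(new Request("GET", "/_cat/indices/commonswiki_file*?v"));
                System.out.println(EntityUtils.toString(indices.getEntity()));

                // aliases for the name CirrusSearch actually queries: this came back empty,
                // which is why /commonswiki_file/_search reported index_not_found
                Response aliases = client.performRequest(new Request("GET", "/_cat/aliases/commonswiki_file?v"));
                System.out.println(EntityUtils.toString(aliases.getEntity()));
            }
        }
    }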
[16:01:28] hmm, correlate wrong word :P but what else happened then
[16:01:29] fuck! it probably was me!
[16:01:39] [commonswiki_file_1623767607/fDEiqTRaRJiCYPjfmO6wJA] failed to delete index
[16:01:49] I ran a provision on vagrant earlier, might not have closed the tunnel
[16:01:52] :(
[16:01:53] oh sorry
[16:01:54] cormacparle: tis ok, if so that means we don't have to worry about codfw falling over. Much better :)
[16:02:05] timestamps are not copied
[16:02:30] cormacparle: good to know, no worries! :)
[16:02:33] if we have relevant timestamps, I can check my bash log
[16:02:49] Nov 10, 2021 @ 13:59:14.389
[16:03:46] still weird that it was re-created with the same name tho
[16:04:06] yea that doesn't seem like something it normally does
[16:04:31] meh my camera won't come up... gotta restart
[16:04:33] but the shape looks like a default index (e.g. autoexpand: -1)
[16:04:51] * ebernhardson tries to not `sudo reboot` while logged into the cluster :)
[16:04:59] * ebernhardson has before...
[16:06:12] Nothing around that timestamp in my logs, I'm afraid (though IDK how complete they are with multiple shell instances open)
[16:12:00] my logs aren't timestamped, but I did run a `vagrant provision` at some stage before I had lunch
[16:12:11] and I have had a tunnel to production open on and off this morning
[16:14:33] FYI: I also ran vagrant provision (11:16 UTC+1), hopefully with tunnel off, but who knows (in case you see another weird thing around that time)
[16:28:03] hello folks! Do you connect to Kafka Main or Kafka Jumbo via TLS?
[16:29:47] elukey: if you mean for wdqs, they connect to main
[16:33:38] what he said
[16:33:56] and yep, we do TLS now
[16:34:04] (in Streaming Updater I mean)
[16:34:12] actually wait, I'm not sure
[16:34:25] incident should be resolved now, will need followup to get everything back in a happy state
[16:34:27] (cookbooks do that, I need to verify the cluster itself)
[16:34:51] actually, no - Streaming Updater is using http
[16:35:31] elukey: hmm, in our airflow stuff we use whatever the python-kafka package does by default. Can look into whether it's used by default or not
[16:37:48] ack thanks for the info.. the context is in https://phabricator.wikimedia.org/T291905, I am moving all clients to a new ca bundle
[16:38:06] it needs to be done before moving the various kafka clusters to the new PKI
[16:38:13] going offline
[16:38:20] (and eventually we'll be able to turn on hostname verification as well)
[16:39:58] > incident should be resolved now, will need followup to get everything back in a happy state
[16:40:02] hooray!
[16:40:16] sorry again :(
[16:40:39] let me know if you need me/us to help out with (or do) any of the follow up
[18:12:20] Is munging the WCQS dump supposed to take a long time? I set the entity count to 50,000 for each dump file and it's still on the first file since the process started yesterday. Or are the entities really big? This is the command I used: ./munge.sh -c 50000 -f ./data/commons-20211107-mediainfo.ttl.gz -d ./data/commonsMungeOut
[18:27:11] hare: looking at the munge outputs on one of our servers, they are dated 17:38 - 22:08, so 4-5 hours
[18:27:24] hare: final output was 1413 files
[18:27:53] If it's spending forever trying to dump to one file, could that mean something is going wrong? The CPU is definitely keeping busy but I don't know if anything is coming from that.
[18:28:28] hare: hmm, possibly? On the prod host it looks like it emits a couple files per minute
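To make the 16:28-16:38 TLS/CA-bundle exchange above concrete: whether a Kafka client talks TLS comes down to a handful of security settings. The chat mentions python-kafka and the Flink-based Streaming Updater; the sketch below uses the standard Java client instead, and the broker, topic, group, and truststore values are placeholders rather than any real WMF configuration:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class TlsConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // placeholder broker address, not the real kafka-main/kafka-jumbo endpoints
            props.put("bootstrap.servers", "kafka.example.org:9093");
            props.put("group.id", "example-group");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            // switch from plaintext to TLS and point the client at the CA bundle it should trust
            props.put("security.protocol", "SSL");
            props.put("ssl.truststore.location", "/path/to/ca-bundle-truststore.jks");
            props.put("ssl.truststore.password", "changeit");
            // "" disables hostname verification; "https" turns it on (the follow-up mentioned at 16:38:20)
            props.put("ssl.endpoint.identification.algorithm", "");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("example-topic"));
                consumer.poll(Duration.ofSeconds(1));
            }
        }
    }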
[18:30:12] hare: the overall dump takes a few days to load, fwiw
[18:30:33] Version is "0.3.92-SNAPSHOT" that was shared with me a few days ago
[18:32:30] hare: hmm, sounds reasonable. It looks like i ran the prod import (but only a test, not actually used) with 0.3.90
[19:37:18] https://github.com/BigDataBoutique/elasticsearch-repository-swift is 131 commits ahead of our branch, so something has been happening there
[19:38:27] i suppose i have a local dev env with swift, i'll set up the plugin there and see how it goes. If it seems reasonable we should be able to install it to the clusters and try a snapshot/restore of commonswiki from codfw->swift->eqiad
[19:39:43] or maybe relforge->swift->relforge first
[19:51:06] * ebernhardson wonders why he installed the wmf-elasticsearch plugins with ar and tar instead of dpkg -i in the dev env....
[20:04:46] If someone uses an external search engine like google to reach wikipedia, do we know what their search query was? I'm guessing no..
[20:05:41] mpham: not these days, used to many years ago
[20:06:36] mpham: there is the google search console which tells us about the top N queries that resolve to us, but when i looked years ago it was all exact title matches
[20:14:45] ah, ok. thanks. Would be cool data to have, but I was skeptical about how much of it we have access to
[20:38:28] ebernhardson: working on a lightweight incident doc. current question: what actually is a cross-cluster search
[20:39:03] ryankemper: in this case cross-cluster isn't really a culprit, it's the cross-wiki search. On Special:Search most wikipedias query their sister wikis along with commons
[20:39:15] for ex what are the "clusters" it's referring to? I gather it's not literally across elasticsearch clusters... is it effectively across multiple indices? ie instead of just `enwiki` it could be `enwiki` and also `commonswiki_file` etc
[20:39:35] ryankemper: i was thinking cross-cluster because we had a related incident recently where a few eqiad psi instances had a cross-cluster issue
[20:39:42] Ah okay that was partially my confusion, so "cross-cluster" is just not the right terminology
[20:39:48] Right okay that makes much more sense
[20:40:54] ebernhardson: not directly related but if I make a search for a string on english wikipedia, that's going to hit one cirrus cluster, which is itself composed of 3 clusters (main, psi, omega). Does a given search only hit main + one of (psi XOR omega), or does it actually hit all 3?
[20:41:55] Or do psi/omega all have the same indices and it's rather just kind of a way to maintain resilience at the row level? e.g. the fact that we try to mostly evenly distribute psi/omega between rows
[20:42:10] ryankemper: depends on the wiki and where its sister wikis are located, the wikis outside the top 200 by size are randomly split between the other two clusters, so probably most wikis go cross-cluster
[20:42:41] Okay so there actually are different indices between psi/omega then
[20:42:50] i.e. there exist indices that are present in the main cluster and present in psi but not omega, and vice versa
[20:44:08] each of omega/psi/chi has a unique set of indices, the top 200 by size are on 9243 (i forget the names... we should have named the big one omega) and the rest are randomly split (crc(wikiid) % 2 or some such) between 9443 and 9643
[20:44:43] +1 to the fact that we should have named the big one omega
[20:45:00] the main one is just called like `production-search-eqiad` whereas the others are like `production-search-psi-eqiad` or something
[20:46:03] Yeah `elasticsearch_6@production-search-eqiad.service` vs `elasticsearch_6@production-search-psi-eqiad.service`
[20:46:30] yea, it already had a name and we didn't bother changing it. I think we have to fully take down the cluster to change the name and we don't even have to do that for major version upgrades anymore iirc
[21:23:37] Took a first swing at the incident doc: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-10_cirrussearch_commonsfile_outage
[21:24:13] I don't have a great handle on how many queries failed exactly since it looks like https://logstash.wikimedia.org/goto/73a9d7e35f409c0d122888d42df94761 has other types of failures in there as well IIUC
[21:32:59] ryankemper: index_not_found will also have the write side failures, maybe can filter something though, sec
[21:34:02] https://logstash.wikimedia.org/goto/3a54108415872e40c13aa77f6a3330c5 should be close enough
[21:34:25] i guess `index_not_found_exception AND channel:CirrusSearch AND queryType:full_text` would be nicer to the servers
[21:35:19] i wish grafana had ways to dice data on the fly... like choose a column and get a breakdown of it over the current result set
[23:09:07] swift snapshotting might be viable. I poked through the git history and found a version for 6.6.0, backported to 6.5.4 and tested locally. Using an import of alswiki_content it seems to backup and restore to swift properly
[23:10:41] can install this version to relforge, but i don't know if we have proper swift credentials, in our other use cases we have an analytics swift login, not sure if proper to reuse here. Can use it for testing on relforge, but maybe not for prod
[23:12:17] (backporting amounted to realizing 6.5 uses gradle 4.1, but 6.6 uses gradle 5. took longer than i care to admit :P)
[23:55:01] relforge had a 1tb commonswiki index, i've started up a snapshot to ms-fe.svc.eqiad.wmnet under analytics:admin. They are showing ~90MB/s with default limits, so maybe 3 to 4 hours. the swift cluster overall does ~1GB/s so pushing 10% shouldn't be too bad (i hope)
[23:57:36] i suppose swift replicates, the 90MB/s we are pushing probably turns into more inter-cluster traffic as well
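A sketch of the snapshot/restore flow being tested in the 23:09-23:55 messages, again via the low-level Java REST client. The `_snapshot` register/snapshot/restore endpoints are standard Elasticsearch APIs, but the repository type and `swift_*` setting names are assumptions taken from the elasticsearch-repository-swift plugin's documentation, and the host, URL, container, and credentials are placeholders; verify the setting names against the plugin version actually installed:

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.RestClient;

    public class SwiftSnapshotSketch {
        public static void main(String[] args) throws Exception {
            // placeholder host; the test above ran against relforge
            try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
                // 1. Register the repository. The "swift" type and swift_* settings are assumed
                //    from the repository-swift plugin docs, not confirmed by the chat.
                Request register = new Request("PUT", "/_snapshot/swift_backup");
                register.setJsonEntity("{"
                    + "\"type\": \"swift\","
                    + "\"settings\": {"
                    + "\"swift_url\": \"https://swift.example.org/auth/v1.0\","
                    + "\"swift_container\": \"example-elastic-snapshots\","
                    + "\"swift_username\": \"analytics:admin\","
                    + "\"swift_password\": \"REDACTED\""
                    + "}}");
                client.performRequest(register);

                // 2. Snapshot one index into the repository (standard Elasticsearch API).
                Request snap = new Request("PUT", "/_snapshot/swift_backup/commonswiki_test?wait_for_completion=false");
                snap.setJsonEntity("{\"indices\": \"commonswiki_file_1623767607\"}");
                client.performRequest(snap);

                // 3. On the destination cluster, restore the same snapshot.
                Request restore = new Request("POST", "/_snapshot/swift_backup/commonswiki_test/_restore");
                restore.setJsonEntity("{\"indices\": \"commonswiki_file_1623767607\"}");
                client.performRequest(restore);
            }
        }
    }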