[09:00:28] dcausse: I'll be 5' late
[09:00:33] group0 is on 1.39.0-wmf.28 (and thus serving search requests from es710@codfw)
[09:00:35] no worries
[09:05:38] I'm there
[09:46:44] dcausse: also if you are curious the JNL is 1.1T, but gzipped it is 342GB
[10:14:32] addshore: wow I would not have expected that
[10:25:39] lunch
[12:30:16] pfischer: We discussed T309097 with dcausse. This might be a good first task for you! You already have some context, but you'll likely need a bit more. Feel free to ping David or me as needed.
[12:30:17] T309097: We should have a top level maven project template based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097
[13:02:40] greetings! And welcome pfischer !
[13:04:36] o/
[13:13:50] o/
[13:24:05] dcausse: so playbooks as in "ansible playbooks"?
[14:44:09] addshore: cookbooks as in "wmf cookbooks": https://gerrit.wikimedia.org/g/operations/cookbooks :)
[14:44:42] That's why I couldn't find a wikitech page!
[14:45:09] https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks
[14:46:32] and dcausse would it be okay to use the `gsutil` binary for such a copy from a wdqs host? rather than the much longer process / set of code for using the google cloud upload api?
[14:47:40] addshore: you mean adding google cloud clients to a wdqs machine?
[14:48:16] \o
[14:48:39] o/
[14:49:06] I guess you would have to provide a buildable debian package for this
[14:49:07] dcausse: client in terms of binary that can auth and interact with their api, yes, rather than trying to cook that from scratch
[14:49:17] dcausse: I believe there is one *looks around*
[14:50:09] also if we are going to automate this, we should probably dedicate a machine for this
[14:50:31] wdqs1009 being a test machine might be used for other purposes too
[14:52:34] ack!
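(Editor's note: a quick sanity check on the compression figure quoted at 09:46. The unit assumption below, that both sizes are base-1024 as typically reported by `ls -h`-style tooling, is the editor's, not stated in the log.)

```python
# Rough sanity check on the Blazegraph journal compression figure from the
# log: a 1.1T raw JNL gzipping down to 342GB.
raw_gib = 1.1 * 1024       # 1.1 TiB expressed in GiB (unit assumption)
gzipped_gib = 342

ratio = raw_gib / gzipped_gib
print(f"compression ratio: {ratio:.2f}x")  # ~3.29x, i.e. gzip saves roughly 70%
```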
automating it could end up being useful, the copy takes 4-5 hours it seems at the current size
[14:52:58] but I guess let's see how useful this first JNL ends up being for folks to start
[14:53:19] the nice thing about it will be that if they pick it up in the next 30 days, they should be able to catch it up to present day relatively easily
[14:55:11] sure
[15:04:21] actually any client that can talk to an s3 compatible api might work, so perhaps there is something lighter i could find
[15:05:33] stat machines have s3cmd
[15:21:57] ooooh, let me experiment a bit
[15:24:11] * ebernhardson is surprised group0 has enough traffic to push the 95th percentile on completion suggester
[15:26:34] hm why is it eqiad tho?
[15:27:42] hmm, good question
[15:28:09] dcausse: also, comp suggest on mediawiki.org isn't returning anything
[15:28:18] meh
[15:28:35] I see things
[15:28:51] i'm sure that works in integration :S
[15:29:24] hmm, if i use an incognito i get things, but not when logged in
[15:29:48] https://www.mediawiki.org/w/api.php?action=opensearch&search=cirrrussearch
[15:29:55] active/active?
[15:30:20] oh right, it's a different api. logged in i get opensearch, logged out i get /w/rest.php/v1/search/title
[15:31:00] I get things logged-in
[15:31:39] hmm, the url i get is https://www.mediawiki.org/w/api.php?action=opensearch&format=json&formatversion=2&search=medi&namespace=0%7C12%7C100%7C102%7C104%7C106&limit=10. If i drop the namespace filters it then works
[15:32:10] but it has namespace 0 in there, so it should be finding things
[15:32:59] it does a multi-search in that case
[15:34:02] completion query is absolutely the same...
[15:34:55] indeed it looks same here too :S
[15:35:32] so it happens if you save non-default namespaces I guess
[15:36:01] * ebernhardson was never a fan of expanding that feature into completion search :P
[15:36:14] yes :/
[15:36:23] not sure how you save these namespaces now
[15:36:30] with Advanced Search
[15:36:30] from Special:Search
[15:36:32] oh
[15:36:34] hm
[15:36:37] might have to disable it
[15:37:19] yea probably have to disable advanced search, they don't have the flag to toggle saving
[15:38:30] x_content_parse_exception: [1:500] [terms_lookup] unknown field [1]
[15:39:07] namespace": {
[15:39:09] "1": 10
[15:39:11] }
[15:39:34] isn't having a single ListHashMap fun :P
[15:40:12] so we need an array_values($nsList) somewhere, will poke around and see where is appropriate
[15:40:30] :)
[15:40:36] looking too
[15:41:49] dcausse: that's probably SearchContext:552, $filters[] = new \Elastica\Query\Terms( 'namespace', $this->getNamespaces() )
[15:41:57] yes
[15:42:12] elastica does not call array_values here
[15:42:33] well it does...
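(Editor's note: the failure mode being debugged here, a `namespace` filter serializing as a JSON object instead of a JSON array once namespace 0 is unset, can be reproduced outside PHP. A minimal Python sketch, with made-up namespace IDs; Python dicts with integer keys behave like PHP arrays with non-contiguous keys under JSON encoding:)

```python
import json

# PHP arrays double as lists and maps: unset($ns[0]) on [0, 10] leaves
# [1 => 10], which json_encode() can only render as a JSON object.
namespaces = [0, 10]
print(json.dumps(namespaces))          # [0, 10] -- a proper JSON array

as_map = {i: ns for i, ns in enumerate(namespaces)}  # {0: 0, 1: 10}
del as_map[0]                          # mimic unsetting namespace 0
print(json.dumps(as_map))              # {"1": 10} -- an object, not an array

# Elasticsearch's terms-query parser sees the object-start and assumes a
# terms *lookup*, hence: x_content_parse_exception ... [terms_lookup]
# unknown field [1]. Re-indexing the values (PHP's array_values) fixes it:
print(json.dumps(list(as_map.values())))  # [10]
```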
[15:44:48] does seem like elastica should, the api there makes it clear it's querying a specific field and not taking fields from the second arg
[15:45:05] i'm not seeing it take an array_values though
[15:45:15] I see elastica calling setTerms
[15:46:07] oh
[15:46:35] indeed it's nothttps://github.com/ruflin/Elastica/blob/7.1.5/src/Query/Terms.php
[15:46:40] indeed it's not* https://github.com/ruflin/Elastica/blob/7.1.5/src/Query/Terms.php
[15:47:41] i suppose the blame really lies with elasticsearch itself, the parser simply looks for a json object-start and says that must be terms lookup
[15:47:52] i thought they got rid of all that stuff :P
[15:47:59] :)
[15:50:04] yes was looking at the old elastica version https://github.com/ruflin/Elastica/blob/6.2.1/lib/Elastica/Query/Terms.php#L53
[15:50:49] guess it only happens on this blended completion+prefix query because we unset the namespace 0
[15:52:07] yea i think so too
[15:52:41] dcausse: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/830192 papers over the issue, i suppose the array_values could alternatively go into getNamespaces method but thought that explaining why it was necessary seemed odd there
[15:53:05] makes sense
[15:55:43] as for the previous question of why the requestTimeMs for comp_suggest 95th percentile increased, i think you're right that must be the new active-active things and us sending codfw traffic to eqiad, rather than group0 sending eqiad traffic to codfw
[15:55:50] there just isn't enough group0 to affect 1k req/s
[15:56:04] yes
[15:56:18] so since we force eqiad we pay same latency I suppose
[15:56:41] yea
[15:56:50] s/same/some
[16:00:28] wonder what our acceptable amount of connection failure is, over the last 12 weeks we are mostly at <100 per day, but last few weeks have been higher.
Still < 500 per day, out of a few million connections probably fine, but not sure when to start worrying
[16:01:16] looking at this in logstash: `channel:CirrusSearch AND "Elasticsearch response does not have any data. upstream connect error or disconnect/reset before headers. reset reason: connection failure"`
[16:01:41] probably nothing to worry about
[16:02:07] I guess we need to define a number (probably a %) and possibly create an alert
[16:02:53] might be due to node restarts I hope?
[16:03:50] yesterday was one of the higher days in awhile, with 220 on sunday and 336 on monday, and we shouldn't have been restarting anything
[16:05:32] I see 1600+ on aug 24
[16:06:55] hm.. hard to tell what's a reasonable number here
[16:07:11] yea i have no clue, maybe we simply need to turn it into a counter and not log those into logstash
[16:07:35] the only reason i notice is because they are there (which perhaps is good, it means there are very few other errors happening)
[16:07:35] yes
[16:07:44] pfischer: there is a question for you on T316922. Ping me if you're not sure you understand it (or anyone else on the team)
[16:07:45] T316922: Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922
[16:08:47] pfischer: fyi, I pinged Tyler about shell access. Hopefully this is moving forward soon.
[16:29:33] inflatador / ryankemper: the shell access ticket for pfischer should have all approvals. Can you check if it is moving forward?
[16:32:54] gehel ACK
[16:35:07] looks like the last update was a few minutes ago
[16:51:23] inflatador: for the shell access, you could prepare the patches and merge them after review
[16:52:01] I'm more confused about LDAP / phab, but you might have the necessary rights to do it.
[16:55:34] s3cmd works a treat dcausse :)
[16:56:13] :)
[16:57:12] gehel I think I can do the LDAP stuff, but want to check if that's OK.
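(Editor's note: the "define a number (probably a %) and possibly create an alert" idea from 16:02 could be prototyped roughly as below. The request volume and the threshold are illustrative assumptions; only the daily failure counts, 220, 336 and 1600+, come from the log.)

```python
# Hypothetical sketch: express connection failures as a fraction of total
# requests instead of an absolute count, and alert above a tunable rate.
def failure_rate(failures: int, total_requests: int) -> float:
    return failures / total_requests

DAILY_REQUESTS = 3_000_000   # "a few million connections" per the log (assumed)
ALERT_THRESHOLD = 0.0005     # 0.05%: a made-up starting point, to be tuned

for failures in (220, 336, 1600):   # daily counts mentioned in the log
    rate = failure_rate(failures, DAILY_REQUESTS)
    flag = "ALERT" if rate > ALERT_THRESHOLD else "ok"
    print(f"{failures:>5} failures -> {rate:.4%} {flag}")
```

With these assumed numbers, only the aug 24-style spike of 1600+ would trip the alert; the 220/336 days stay comfortably under threshold.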
Working on shell puppet patch in the meantime
[17:07:56] inflatador: I haven't touched LDAP in forever, so I'm unsure how it works. Ask in -sre !
[17:20:17] * ebernhardson is indecisive on if circuit_breaking_exception is classified to `failed` and lumped in with other things, or gets its own `memory_issues` classification
[17:29:10] I think I prefer to have dedicated types for important classes of errors like that
[17:30:17] lunch, back in ~45
[17:31:12] dinner
[17:32:09] gehel: inflatador: I've done the LDAP stuff before FWIW
[17:33:36] we should probably change our onboarding docs to more explicitly point to https://wikitech.wikimedia.org/wiki/SRE/Production_access#Filing_the_request when creating a ticket as well. there's a bunch of boilerplate info like ssh key, etc that always needs to be included in the ticket
[17:34:03] (or alternatively just link to an old phab ticket of one of us requesting access as an example)
[17:34:07] inflatador: the shell access is T316090, which already has the SSH key. And the key has been verified via a meet session
[17:34:07] T316090: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090
[17:34:39] ryankemper: please update the template!
[18:11:49] back
[18:28:47] finishing up food, few mins late to pairing
[18:33:53] hmm, 1.38 release notes has a bit about a new thing with support for prometheus, it exists in core but not seeing anything use it yet, and it's not configured with a target in mediawiki-config :(
[18:37:43] * ebernhardson sighs and does the alerting in crazy graphite+puppet land instead of alertmanager
[18:59:43] pfischer: (for tomorrow) your shell access should be good. Please test it tomorrow and ping me if it does not work. Or if you don't know where to connect
[19:00:20] inflatador: is working on your NDA access and will update here when done.
[19:11:20] Should be done!
Ticket is updated
[19:31:44] inflatador: I moved to "needs review" and assigned to Peter
[20:38:47] inflatador: ebernhardson: on mobile but I came across this; tldr zen.minimum_master_nodes is not a setting in es7 anymore https://www.elastic.co/blog/a-new-era-for-cluster-coordination-in-elasticsearch
[20:55:58] perhaps there was nothing to worry about, it's not super clear from the logging. Those docs indeed seem to suggest elections should be expected and are now relatively quick and painless in es7
[21:08:18] Yeah, you're probably right. We can try it again tomorrow
[21:21:14] ^ agreed w all the above
[21:32:25] reviewing more elastic docs, it indeed seems like we can simply add the new nodes as master capable and it will be happy. When removing master capable nodes they should be removed one at a time with a delay between them to allow the voting configuration to update
[21:32:30] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/add-elasticsearch-nodes.html
[21:34:43] docs aren't super clear on how long to wait, but it seems to require a cluster-state update so we should probably expect to wait 30s or even 1m
[22:41:24] "This is delibnrerately a very simple class". Has 17 constructor arguments, 16 of which are required
[22:41:33] *deliberately
[22:56:29] xD
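(Editor's note: the "remove master-capable nodes one at a time, with a delay" procedure from 21:32 could be sketched as a plan like the one below. The node names, the 60s wait, and leaning on the es7 voting-config-exclusions API are the editor's assumptions, not the team's actual cookbook.)

```python
from typing import Iterable, List

def plan_master_removal(nodes: Iterable[str], wait_s: int = 60) -> List[str]:
    """Sketch: remove master-eligible nodes one at a time, pausing between
    removals so the voting configuration (which replaced
    zen.minimum_master_nodes in es7) can catch up via a cluster-state update."""
    steps = []
    for node in nodes:
        # es7 exposes POST /_cluster/voting_config_exclusions?node_names=<node>
        # to drop a node from the voting configuration before taking it out.
        steps.append(f"POST /_cluster/voting_config_exclusions?node_names={node}")
        steps.append(f"stop or demote {node}")
        steps.append(f"sleep {wait_s}s  # allow the cluster-state update")
    steps.append("DELETE /_cluster/voting_config_exclusions")  # clean up after
    return steps

# Hypothetical node names, for illustration only:
for step in plan_master_removal(["elastic1001", "elastic1002"]):
    print(step)
```

Adding new master-eligible nodes needs no such choreography per the linked docs; the one-at-a-time dance matters only on the removal side.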