[07:39:34] Not sure if retro was happening, but I'm driving my housemate to the airport in the morning so won't be in until 1h after retro starts
[07:39:45] Will be around for the puppet deploy window
[10:25:19] lunch
[12:20:17] We can cancel retro this week with people being out and it being inspiration week (I also have another meeting at the same time I'd like to make it to)
[13:05:29] greetings
[14:31:23] I second cancelling the retro this week
[14:37:47] o/
[14:39:51] howdy
[14:58:53] \o
[15:00:42] Trey314159: analytics merged the fixes for superset templating, this works again now: https://superset.wikimedia.org/superset/dashboard/154/?native_filters=%28%29
[15:01:07] apparently bots like peptid and platwright testing right now :P
[15:01:16] s/platwright/playwright/
[15:01:45] they really, really like peptid
[15:04:39] the hint about bots is how few variations there are. When real humans are searching for things they type it in 10 or even 40 different ways
[15:05:12] just look at how many ways there are to search for star trek strange new worlds, that looks like humans :)
[15:10:40] o/
[15:18:22] ebernhardson: cool! thanks for the ping. I love browsing through the lists. After thinking more about peptids and playwrights, I wonder if we should stop filtering the likely bots, and instead mark them in another column, or something. I can't recall how many there were, though. Maybe they flooded everything?
[15:21:14] Trey314159: that's done with one line in the final query; the source table keeps the data.
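The few-variations signal described above can be sketched as a toy heuristic. This is not the actual SearchSatisfaction query or the bot_like column from the dashboard; the grouping key (case/whitespace folding) and the variant threshold are assumptions for illustration only.

```python
from collections import defaultdict


def variant_counts(queries):
    """Group raw queries under a case/whitespace-insensitive key and count
    distinct surface forms per group. Humans tend to produce many variants
    ("Star Trek", "star trek ", "STAR TREK"); bots often repeat one exact
    string thousands of times."""
    groups = defaultdict(set)
    for q in queries:
        key = " ".join(q.lower().split())
        groups[key].add(q)
    return {key: len(forms) for key, forms in groups.items()}


def looks_bot_like(queries, min_variants=3):
    # Flag groups with very few distinct surface forms as bot-like.
    # min_variants=3 is an arbitrary illustrative threshold.
    return {key: n < min_variants for key, n in variant_counts(queries).items()}
```

Note a heuristic like this assumes scripts with case distinctions and multiple plausible spellings, which (as comes up later in the discussion) breaks down for wikis like arwiki or zhwiki.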
We can see what it would look like here: https://superset.wikimedia.org/superset/sqllab/?savedQueryId=508
[15:22:20] it's a bit easier to read outside the SQL lab, they need the thing where you can dynamically change column widths
[15:23:20] apparently bots *really* like that davi person
[15:24:12] and "squirrles"
[15:24:18] works out to an average query rate of 17 per hour over 90 days
[15:24:42] There's also an incategory search that is blowing out the column width...
[15:25:59] interesting, yea appears so. This all comes from the frontend logging (SearchSatisfaction), so we sadly don't have the query parser information that makes it easy to filter those kinds of things
[15:49:29] I think the incategory search is interesting; I just wish it didn't make the column so wide... which you pointed out isn't a problem outside the SQL lab. Not sure this is the best way, but I added a "bot_like" column: https://superset.wikimedia.org/superset/sqllab?savedQueryId=509 There are lots of "... us senate election" searches; I think we found that website before. Also some things that almost look like spamming of current events topics to try to make something more popular.
[15:49:52] I think maybe we should stop filtering and add the flag if that makes sense to you, ebernhardson
[15:50:14] ok.. off to lunch (remember we have one inspiration week lunch this week)
[15:50:31] yea, can probably add a toggle somewhere in the dashboard, just have to remember how :)
[15:52:46] randomly, i wonder if `methionylglutaminylarginyltyrosylglutamyl...serine` could be real humans. No-one would type that, but i could imagine people copy-pasting the shorthand for the longest word in English.
[15:55:54] Sounds like a dietary supplement!
[16:20:15] i wonder if it would be possible to add some sort of spark-line representing popularity over time in this data.
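For scale, the "17 per hour for 90 days" figure quoted above works out to a surprisingly large total for a single query string:

```python
# Simple arithmetic on the rate quoted in the log.
queries_per_hour = 17
hours_per_day = 24
days = 90

total_queries = queries_per_hour * hours_per_day * days
print(total_queries)  # 36720 occurrences of that one query in the 90-day window
```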
Would have to retain perhaps the day that each query was made, but it would be useful to see if something was briefly popular, or rising in popularity, or what, since this is a 90 day aggregation
[16:31:54] sigh, reindexing testwiki fails with: [{"index":"testwiki_content_1657816276","type":"page","id":"82624","cause":{"type":"illegal_argument_exception","reason":"Rejecting mapping update to [testwiki_content_1657816276] as the final mapping would have more than 1 type: [_doc, page]"}
[16:37:05] lost IRC for awhile, back
[16:45:18] meh, so it seems like the auto-magic _doc -> actual type conversion doesn't apply to reindexing either :( At least there it's not so painful to query the indexes and use whatever it actually has. Annoying that elastic doesn't choose the right thing since we normally don't even specify the type
[16:45:57] sigh..
[16:46:42] could we script this?
[16:47:05] not sure, but since this is a maint script we can query the index and put the appropriate type into the reindexing request
[16:49:29] explicitly specifying page in the source and _doc in the dest seems to work. I guess we should do that and then maybe reindex everything again so it's all _doc
[16:50:19] forget why we didn't already do that; i remember 6.5 wouldn't let us create _doc as a type, but 6.8 does
[17:12:38] lunch/physical therapy, back in ~90m
[17:40:32] ebernhardson: Hmm.. back from lunch and been playing with superset. Adding the bot-ish filter may have been good for enwiki and other large wikis with alphabets, but it filters out too much on arwiki or zhwiki (because there's no lowercase and short queries may have only one obvious way to write them) and for really small wikis (because no queries meet the requirements)
[17:40:34] I don't think we need anything as complicated as a toggle; a "bot?"
column will do (plus it identifies the potentially (un)interesting ones)
[18:34:08] dinner
[18:41:58] inflatador: grabbing some quick food, let's pair when you're back from PT if that works
[19:21:32] sorry, been back for a bit
[19:24:36] ebernhardson ryankemper up at https://meet.google.com/jwc-fahn-nex , feel free to join if you want. We can try the cookbook on DFW
[21:04:18] created this one to review/improve alerting around lost masters: https://phabricator.wikimedia.org/T313095
[21:39:10] inflatador: is elastic2047 being worked on? It just alerted for a settings check. It looks related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/813974/1/hieradata/role/codfw/elasticsearch/cirrus.yaml
[21:39:44] inflatador: yea it's being reimaged.
[21:39:48] err,
[21:39:51] RhinosF1: yea it's being reimaged
[21:40:11] ebernhardson: might be worth downtiming alerts when it is, or checking that the downtime applies
[21:41:05] inflatador: ryankemper: it looks like the hosts are maybe all going to start complaining that the master list doesn't match, might need to downtime some alerts ^^
[21:41:39] ebernhardson: oh good point...this is for the cross-cluster seed stuff right?
[21:41:40] I noticed 2027 didn't downtime on reimage either, looking at ops bot's output
[21:42:07] RhinosF1: thanks, yeah, the downtime cookbook returned 99 for whatever reason during that reimage
[21:43:31] ryankemper: i suspect that's what the alert is saying: [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700]
[21:44:52] ryankemper: does the actual log give any hint?
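The workaround discussed earlier for the testwiki reindex failure (naming `page` as the source mapping type and `_doc` as the destination type) would look roughly like this as an Elasticsearch 6.x `_reindex` request body. The destination index name here is a placeholder; only the source index name and the two type names come from the log.

```python
import json

# Pin the mapping type on both sides of the reindex so 6.8 writes the new
# index as _doc instead of carrying the legacy `page` type over.
# "testwiki_content_new" is an illustrative destination index name.
reindex_body = {
    "source": {"index": "testwiki_content_1657816276", "type": "page"},
    "dest": {"index": "testwiki_content_new", "type": "_doc"},
}

print(json.dumps(reindex_body, indent=2))
```

As suggested in the discussion, a maintenance script could first query the source index's mapping and fill in whichever type it actually has, rather than hard-coding `page`.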
If the host being reimaged (2047) also didn't downtime, then maybe it's a wider bug
[21:45:17] 22:23:11 PROBLEM - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration is the full alert
[21:46:09] Acked the settings check alert, will fix the cross-cluster seed in a couple hours when I get back from the gym
[21:46:22] RhinosF1: good question, checking logs on cumin real quick
[21:46:32] Np
[21:46:46] fwiw my hunch is it's transient because i've seen it occasionally fail to downtime before, but it could be more than that
[21:47:38] I'm only going on 2027 failing and you saying 2047 is also being reimaged
[21:47:53] And assuming 2 hosts seemingly without downtimes means an issue
[21:50:44] RhinosF1: Ah, where did the 2047 part come from? I don't think I said 2047 was being reimaged
[21:50:52] https://www.irccloud.com/pastebin/dWEI3fkU/2027_downtime_failure.log
[21:51:00] i mean, that error is correct. We changed the list of master-capable nodes but only restarted the node we wanted to become master. The rest of the hosts will have different settings
[21:51:17] oh, I didn't look far enough back in the backlog
[21:51:22] ryankemper: ebernhardson did, when I asked if 2047 (the one being worked on) was being reimaged
[21:51:46] i was incorrect, sorry; we lost 2049 due to hardware failure and that was what we removed from the master list
[21:51:49] RhinosF1: ah yeah, I see that now. so AFAIK we only reimaged 2027 today.
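The mismatch in that alert boils down to an order-insensitive comparison of two host lists. A minimal sketch of the check, using the seed lists from the alert text above (the function itself is illustrative, not the real check's code):

```python
# Seed lists as they appear in the alert: 2049 is the lost master,
# 2054 is its replacement.
configured = ["elastic2027.codfw.wmnet:9700",
              "elastic2029.codfw.wmnet:9700",
              "elastic2049.codfw.wmnet:9700"]
expected = ["elastic2027.codfw.wmnet:9700",
            "elastic2029.codfw.wmnet:9700",
            "elastic2054.codfw.wmnet:9700"]


def seeds_match(configured, expected):
    # Compare as sets: ordering doesn't matter, membership does.
    return set(configured) == set(expected)


stale = set(configured) - set(expected)    # entry for the lost master (2049)
missing = set(expected) - set(configured)  # entry for its replacement (2054)
print(seeds_match(configured, expected))   # False until the hosts pick up the new list
```

This mirrors why the alert fires on every host that hasn't been restarted: each one still carries the old `configured` list.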
Separately though, we replaced elastic2049 with elastic2054 as a master; that's the source of the settings check alert
[21:52:09] Yeah, basically we have some between-elasticsearch-cluster replication magic that needs an updated list of the eligible masters to do its thing
[21:52:24] So the settings check alert name is a bit vague, but that's actually what's complaining
[21:53:03] Cool
[22:09:42] hmm, unrelated to all the above, refinery-drop-hive-partitions doesn't seem to recognize `snapshot` as a date-containing partition key
[22:14:11] i suppose if we want to use snapshot= instead of date= we may need to throw together a custom version of refinery-dorp-mediawiki-snapshots
[22:14:29] s/dorp/drop/
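A custom drop script along the lines suggested above would mainly need to parse `snapshot=YYYY-MM` partition specs the way the stock tooling parses date partitions. A hypothetical sketch; the `snapshot=`/`date=` key names mirror the discussion, but the formats and helper are illustrative, not refinery's actual code:

```python
import re
from datetime import datetime

# Matches partition specs like "snapshot=2022-06" or "date=20220615".
PARTITION_RE = re.compile(r"(?P<key>\w+)=(?P<value>[\w-]+)")


def partition_month(spec):
    """Return (year, month) for a date= or snapshot= partition spec,
    so both styles can feed the same retention cutoff comparison."""
    m = PARTITION_RE.fullmatch(spec)
    if m is None:
        raise ValueError(f"unparseable partition: {spec}")
    key, value = m.group("key"), m.group("value")
    if key == "snapshot":                      # e.g. snapshot=2022-06
        dt = datetime.strptime(value, "%Y-%m")
    elif key == "date":                        # e.g. date=20220615
        dt = datetime.strptime(value, "%Y%m%d")
    else:
        raise ValueError(f"unknown partition key: {key}")
    return dt.year, dt.month
```

Once both spellings normalize to a (year, month) pair, the drop-older-than logic can stay identical for snapshot- and date-partitioned tables.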