[08:59:29] I've updated the meet link for all of our meetings to reduce the confusion (so that it isn't named "Wednesday Meeting (tm)"). The new link if you want to bookmark it: meet.google.com/eki-rafx-cxi
[09:03:15] thanks!
[10:07:48] dcausse: are you around
[10:07:59] ejoseph: hey, yes
[10:09:03] I am on the meet
[10:09:06] ejoseph: our meeting is at 11:30 on tuesdays but I can join now if you want?
[10:09:15] joining
[10:09:17] oh sorry
[10:09:23] np
[10:09:28] I thought it was 11
[10:10:08] I have another meeting at this time on tuesdays every two weeks, that's why I've set it up to 11:30
[10:55:37] lunch
[13:13:25] greetings
[13:14:26] o/
[13:15:10] inflatador: (cc r.yankemper) thanks for the deploy & wdqs switch yesterday!
[13:16:41] np, if you need anything else LMK, just going over the sprint stuff from yesterday
[13:19:18] inflatador: for T303256, I restarted blazegraph wdqs1006 this morning and it picked the right jvm option, so I think we'd need to do a rolling-restart of all the wdqs-blazegraph services
[13:19:18] T303256: WDQS servers should use skolem for wikibaseSomeValueMode - https://phabricator.wikimedia.org/T303256
[13:23:06] dcausse oh yes! I had that rolling restart in my notes but lost track of that ticket. Should be able to finish that shortly. Do you want me to do both DCs or just eqiad?
[13:23:31] inflatador: both DCs would be perfect
[14:22:35] !log T303256 bking@cumin1001 restarting wdqs services `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-blazegraph'`
[14:22:35] inflatador: Not expecting to hear !log here
[14:22:35] T303256: WDQS servers should use skolem for wikibaseSomeValueMode - https://phabricator.wikimedia.org/T303256
[14:22:55] oops!
[14:22:58] :)
[14:23:44] dcausse restart was successful, are there any other BG instances we need to restart?
[14:24:01] inflatador: I don't think so, thanks!
[14:41:57] Hi folks - I just sent out an email with an ask about SDAW Search Improvements with a tight turnaround. Please check your email and have a look, thanks!
[14:48:32] errand
[15:01:35] well that's not going to be confusing. Log in to email with the okta password, but not the okta 2-factor auth (use the google one)
[15:15:51] what was confusing for me is that the google pass immediately changed to the new okta pass
[15:18:00] yea, it logged me out before i even fully agreed to okta, no going back :)
[15:18:37] :)
[15:32:52] ebernhardson: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/770598 is the first patch making CirrusSearch actually require elastic 6.8, right?
[15:34:30] dcausse: oh, i should double check. I didn't actually see if include_type_name existed in 6.5
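The include_type_name flag discussed above is the query parameter added in the 6.8/7.x era to ease the move away from mapping types. Whether a given cluster accepts it can be checked directly; a minimal sketch, assuming a locally reachable cluster and a placeholder index name (a cluster that predates the parameter should reject the request as containing an unrecognized parameter rather than silently ignoring it):

    # Ask for a mapping with the newer flag; an older cluster is expected to
    # return an error about an unrecognized parameter instead of the mapping.
    curl -s 'http://localhost:9200/some_index/_mapping?include_type_name=true&pretty'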
[15:34:58] I think I briefly saw errors in cindy logs but did not dig in further
[15:36:08] i suppose somewhat relatedly, the metastore change for the type name is going to be annoying to deploy as well
[15:37:04] Some spots are easy to fix, like MetaNamespaceStore::find(), which can be adjusted to search against the index and not the type, but GET requests for a specific doc id have to include the type in the url, and the type has to change from mw_cirrus_metastore -> _doc
[15:37:39] pondering replacing the get by id requests with a search by id which seems stupid, but at least avoids breaking when the code expecting _doc is deployed and the live index still says mw_cirrus_metastore
[15:42:17] if only calling _doc always worked even though you did not set it with that name
[15:44:36] yea that would make things so much easier, but instead it just says 404 :(
[15:45:35] I wonder if we could just do a quick query to inspect the current type name instead of having that being hardcoded
[15:45:51] IIRC the metastore is not heavily queried
[15:46:16] hmm, yea the namespace query is the only one that i think is used with regularity, everything else could suffer an extra round trip to find the state. I suppose that would work
[15:47:07] the namespace query?
[15:47:51] dcausse: Searcher::findNamespace calls MetaNamespaceStore::find
[15:48:31] oh I think I removed that one, at least from wmf production setup
[15:48:49] oh right, that ukr30 (probably different letters :P) thing
[15:48:50] using php intl things
[15:48:59] quick workout, back in ~30
[15:49:53] i guess its utr30
[15:51:37] yes that's it, a unicode draft that never reached consensus but was kind of useful in search usecases
[15:53:44] hm.. can't remember if we use the metastore for the freeze index thing
[15:54:03] yea looks like we should mostly not be using the metastore for namespaces, if it's only used in maint scripts things aren't nearly as concerning. Will have to look a bit closer to verify
[15:54:37] i'm pretty sure freeze goes through metastore as well, although i keep wondering if we actually need that functionality still
[15:55:08] we needed freeze when elastic needed 40+ minutes to pick up the pieces after restarting a single instance, because freezing made it take 5 minutes instead (when it worked right)
[15:55:16] but these days it takes <5m regardless?
[15:55:41] no clue, it's been a while since I restarted the clusters
[15:56:12] don't even know if this "freeze" feature has been implemented in the cookbooks
[15:56:21] if it's not I'm all for removing it
[15:56:27] hmm, for some reason i think it was. can look
[15:56:38] also tempted to restart a random codfw instance and see how long before it's back :)
[15:56:44] :)
[15:57:42] yea freeze/thaw are in the cookbooks
[16:00:06] looks like we have a parallel implementation in spicerack for freeze/unfreeze (makes sense, it's a simple sigil doc-id)
[16:03:44] restarted instance with 112 shards, ~70 were picked up within 30s of restarting, all indices assigned within 2 minutes
[16:04:52] i suppose we would want to try full-cluster rolling restarts to actually know, but i suspect if we can roll a restart through codfw in the same time without freezing, then we should kill that unnecessary complication as well
[16:22:13] and back
[16:22:58] freeze is implemented in the cookbooks, but I think last time we tried it it didn't work. ryankemper might remember more
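For the restart-timing experiment described above, one simple way to measure how long the cluster takes to reassign everything after a node comes back; a sketch only, assuming a locally reachable node on port 9200:

    # Block until every shard is assigned (status green) or the timeout hits,
    # and report how long that took.
    time curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty'
    # List any shards that are not yet started (initializing/relocating/unassigned).
    curl -s 'http://localhost:9200/_cat/shards' | awk '$4 != "STARTED"'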
[16:27:43] inflatador: fyi wrt the wdqs stuff, restarting wdqs-blazegraph is going to impact service availability, so if done manually the procedure is depool w/ sleep -> restart -> repool w/ sleep
[16:28:05] Generally I’d just roll a deploy instead though, usually no reason not to
[16:28:48] Be sure to add your votes to the SDAW UI spreadsheet linked in @cbogen_'s email. (Her last IRC message may have been sent a bit early on the West Coast, so here's a reminder!)
[16:28:51] We will be working on / supporting the chosen project(s) for a few quarters, so adding your input and being familiar with them and ready to discuss them tomorrow is important!
[16:29:16] ryankemper oops, apologies for that
[16:31:28] ebernhardson: +1 to dropping code
[16:35:08] yea if it didn't even work correctly last time, i don't see it worth fixing. Will remove that too :)
[17:12:24] about to try and enroll in Okta...wish me luck ;)
[17:13:38] at least irc isn't going to log you out too :)
[17:20:53] true enough! Okta seems like it worked, so lunch break for the next ~45
[18:22:32] back
[18:24:36] In other news, I've snuck a late item into the SDAW UI spreadsheet (line 30)—come vote for it!
[18:28:39] :eyes
[18:47:20] heh, uploaded patch to drop freeze/thaw indexes and cindy is completely broken. Searching in all the wrong places to find out the vagrant LocalSettings.php references a class that moved, so all requests failed
[18:47:50] (the class reference moved in mediawiki core, this just happened to test with the updated mediawiki branch too)
[18:58:50] ebernhardson: is this https://gerrit.wikimedia.org/r/c/mediawiki/vagrant/+/769566 ?
[18:59:26] dcausse: yea probably that one, we don't really re-provision on cindy so it doesn't get these updates
[18:59:50] ok
[19:00:05] dinner
[19:26:28] I'm not sure how to look into it, but sonarcloud builds are failing against master: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/771007
[19:30:03] gehel: inflatador: will be ~3 mins late to pairing
[19:30:10] ack
[19:53:46] i know not everyone here uses slack. there is a channel for the SDAW planning that the SD engineers asked to be invited into as well to be more involved. I have been trying my best to represent yall so far and keep the chatter down, but if you are interested i can invite you in as well -- it might make more sense at this stage for having conversations in one place
[19:54:32] i'll invite everyone on the team into it, but feel free to mute/leave the channel if it's not helpful to you. i can always invite you back in later too
[20:05:38] lunch
[20:05:59] mpham: thanks, much appreciated!
[20:12:20] +1
[20:12:28] quick break, back in ~15
[20:34:00] back
[20:56:47] back
[21:13:59] ryankemper ebernhardson I'm in https://meet.google.com/adc-wzno-uvu whenever
[21:30:24] inflatador: ebernhardson: https://phabricator.wikimedia.org/P22359
[21:46:32] Here's where we already disable the `elasticsearch.service` for future reference: https://github.com/wikimedia/puppet/blob/996c8e487dde2f51a6f5149b1d16f6cacd3b1181/modules/elasticsearch/manifests/init.pp#L133-L139
[22:39:04] random thought, maybe relforge is exposing some other problem that we would only see in a 2-node cluster, when one node goes down the other doesn't know if it can be the master or not (needs two votes)
[22:39:14] (goes down == reboot)
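On the two-vote theory above: with zen discovery in elasticsearch 6.x, a two-node cluster whose quorum requirement is set to 2 leaves the surviving node unable to elect itself while its peer reboots. A hedged way to check what the cluster actually has configured (the host is a placeholder, and whether relforge is set up this way is an assumption):

    # Show the effective minimum_master_nodes value, defaults included.
    curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep minimum_master_nodes
    # Show which node currently holds the master role (* in the master column).
    curl -s 'http://localhost:9200/_cat/nodes?v&h=name,master,node.role'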