[11:47:26] lunch
[14:21:01] dcausse: would you be able to add a few bullet points about the findings from the current query analysis?
[14:21:58] gehel: where?
[14:24:09] o/
[14:28:28] dcausse: anywhere (etherpad? comment on T355040)
[14:28:29] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs - https://phabricator.wikimedia.org/T355040
[14:28:50] and I'll add it to the weekly status notes, especially in the notes for the steering committee
[14:29:16] ok
[14:51:58] dcausse: ping me when you have the bullet points, I'm waiting on that to publish the weekly update. Or let me know if I should add them next week instead.
[14:52:25] gehel: doing it, should be done in ~10min
[14:52:38] great! thanks! And sorry for being pushy :/
[15:04:49] gehel: T355040#9509621
[15:04:49] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs - https://phabricator.wikimedia.org/T355040
[15:07:35] dcausse: thanks!
[15:25:43] weekly status: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-02-02
[16:00:24] \o
[16:01:22] o/
[16:05:33] * ebernhardson notices cindy doesn't seem to be voting
[16:06:25] no :/
[16:06:46] it's been a while, will take a quick look, might just need a "reboot"?
[16:06:51] also cloudelastic fix rate is way up today :( i started looking at a few yesterday and have some bugs to investigate, but will probably be a bit of whack-a-mole
[16:07:00] dcausse: yea, possibly the instance was rebooted and it's just not running
[16:07:37] ebernhardson: if you have some page ids to investigate that's already a good start, but yes it's generally tedious to debug :/
[16:08:16] dcausse: what i've done is collected the raw json logs from mwlog that come from the LogOnlyRemediator, then i have a python script that re-verifies the errors exist, filters for when the error only exists in cloudelastic, and then reports it
[16:08:25] although i've only done the ghost page in index part so far, but can expand the rest
[16:09:18] so might be missed delete or perhaps a move between content and general namespaces
[16:09:47] so far for these, it looks like maybe a delete and a tag update got merged into a rev_based_update, preventing the delete. Then one where we ran a delete for the redirect and not the page it was pointing to
[16:10:11] might be a problem with source events there, needs more investigation
[16:10:24] sure
[16:12:04] hmm, it must have been a while
[16:12:11] the current output to cindy tmux is: Execution of 44 workers started at 2023-09-18T19:05:06.368Z
[16:12:36] restarted the script and will see what happens
[16:14:00] thanks!
[16:14:07] panic from go (mw-cli) :S
[16:14:34] :/
[16:15:25] will poke at it, but it's not super clear... it's panicking from config.LoadFromDisk()
[16:23:33] deleted ~/.config/mwcli, re-did the initial setup (from first-run.sh, but manually), and it looks to be going again. At least it's creating the env
[16:32:04] ryankemper or anyone else, CR up for migrating cloudelastic1009 ... the check experimental failure is expected, since we're changing its name in site.pp
[16:32:09] https://gerrit.wikimedia.org/r/c/operations/puppet/+/995223
[16:33:16] workout, back in ~40
[16:35:48] hmm, cindy is taking cpu with chromium, but the tests don't seem to be making progress :S
[16:36:31] * ebernhardson is tempted to spin up a new instance and hope for the best, in theory the first-run script should take care of most of it
[16:39:52] yes... or a reboot of the "instance"?
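
(To make the re-verification approach described at [16:08:16] concrete, here is a minimal sketch, not the actual script: it assumes the mwlog LogOnlyRemediator entries are JSON lines with "index" and "docId" fields, and the cluster endpoints shown are placeholders rather than the real URLs.)

    import json
    import sys

    import requests

    # Placeholder endpoints; the real cluster URLs/ports are not taken from the log.
    CLUSTERS = {
        "production": "https://production-elastic.example:9243",
        "cloudelastic": "https://cloudelastic.example:9243",
    }

    def doc_exists(base_url, index, doc_id):
        """Return True if the document is still present in the given cluster."""
        resp = requests.head(f"{base_url}/{index}/_doc/{doc_id}", timeout=10)
        return resp.status_code == 200

    def main(log_path):
        for line in open(log_path):
            event = json.loads(line)
            index, doc_id = event["index"], event["docId"]  # assumed field names
            # Re-verify the reported error, keeping only ghost pages that
            # exist solely in cloudelastic.
            present_in = {
                name for name, url in CLUSTERS.items()
                if doc_exists(url, index, doc_id)
            }
            if present_in == {"cloudelastic"}:
                print(f"cloudelastic-only ghost page: {index}/{doc_id}")

    if __name__ == "__main__":
        main(sys.argv[1])
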
[16:40:44] i guess a reboot first can't hurt
[17:25:35] back
[17:54:31] hello! did something bad happen to elasticsearch on the evening of the 30th? It looks like a lot of jobrunner jobs have been failing against it since then https://logstash.wikimedia.org/goto/69b47b1f679b305a70c6d8d165826678
[17:55:11] I see there was a rolling restart at 22:41 in SAL
[17:59:25] hnowlan Looking...that may have been when cloudelastic borked. Shouldn't have affected the prod cluster
[18:00:01] thanks! I'll create a task just given that it's end of day here
[18:00:43] hnowlan ACK, feel free to tag/assign to me
[18:03:05] hnowlan: it's a known problem, related to the train rollout. a fix will be in next week's train. Basically what happened is we started running a reconciliation process against all known clusters, and that ended up catching a default cluster that points at localhost
[18:03:31] hnowlan: so the reconciliation fails there every time, but it doesn't really matter since the cluster isn't real. The next train will correctly recognize those
[18:04:15] previously we always operated off specific lists of clusters, nothing actually looked at the full list of defined clusters so we hadn't noticed
[18:07:06] ebernhardson: ah I see... I assumed the error around localhost in this case was a failure to connect to ES on the service proxy
[18:08:24] The errors are saying it couldn't connect to localhost
[18:08:35] but I don't know much about the internals of the job
[18:31:42] lunch, back in ~40
[18:48:20] well, after poking cindy a whole bunch...the test just started running, and i didn't particularly change anything ...
[19:10:23] inflatador: haven't fully dug into it but at first glance the patch looks reasonable
[19:31:59] ryankemper ACK...I'm thinking merge, roll-restart to pick up new master config, ban cloudelastic1009 then do the reimage
[20:04:25] Heading to my medical appt...I saw we had a morelike alert that cleared. Will check it out when I get back
[20:43:10] Huge traffic jam, so I turned around
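
(A rough sketch of the reconciliation issue described at [18:03:05]–[18:04:15], purely for illustration: the cluster config shape, names, and the localhost "default" entry are assumptions, not the actual MediaWiki/CirrusSearch code. The point is that iterating over every defined cluster picks up a placeholder pointing at localhost, so every attempt against it fails; skipping such placeholder entries is roughly the shape of the fix, though the real implementation may differ.)

    # Assumed cluster definitions; only the shape matters for the example.
    CLUSTERS = {
        "eqiad": ["search.eqiad.example:9243"],
        "codfw": ["search.codfw.example:9243"],
        "cloudelastic": ["cloudelastic.example:9243"],
        # Placeholder default that points at localhost; nothing real listens there.
        "default": ["localhost:9200"],
    }

    def reconcilable_clusters(clusters):
        """Yield cluster names that should receive reconciliation work,
        skipping entries whose hosts all point at localhost."""
        for name, hosts in clusters.items():
            if all(h.startswith("localhost") for h in hosts):
                continue  # skip placeholder/default clusters
            yield name

    print(list(reconcilable_clusters(CLUSTERS)))  # ['eqiad', 'codfw', 'cloudelastic']
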