[08:57:09] dcausse: I would be like 10min late
[08:58:56] ejoseph: no worries, ping me when you're around (but our meeting is in 30mins not now :)
[09:49:26] dcausse: i'm around now
[09:55:10] ejoseph: sure joining
[10:43:05] lunch
[12:48:34] greetings
[12:49:50] o/
[13:00:28] welcome back!
[13:00:55] (or maybe you came back yesterday, but I was gone ;P )
[13:02:11] thanks! :)
[13:50:42] dropping off my son, back in ~20
[14:01:20] back
[14:03:56] errand
[14:59:34] \o
[15:00:36] o/
[15:21:11] i got annoyed with cindy so spent friday making the integration suite work in mwcli, it kinda/sorta works. It has insanities, like mwcli doesn't have a job runner so you have to run a bash loop hitting runJobs.php for all the wikis :P
[15:21:36] would need more work to actually pass anything
[15:25:57] nice!
[15:52:09] * ebernhardson realizes still have to make normal cindy work, something has to vote :P
[16:27:56] hmm, unexpected. https://cirrustest-cirrus-integ02.wmflabs.org/wiki/$US?action=cirrusdump
[16:28:12] the page's title is '$US', not a redirect, but cirrus got 'US'
[16:28:35] hm...
[16:29:33] i think core might have changed something, waiting for the edit to $wgNamespaceAliases failed too
[16:29:36] (page named $wgNamespaceAliases)
[16:29:40] some rewrite rules? I can't seem to understand where this happens
[16:30:07] i have no clue either :P was hoping it would ring some bells
[16:30:42] errand https://cirrustest-cirrus-integ02.wmflabs.org/wiki/$US exists
[16:30:48] ah
[16:31:20] sql agrees that page_id 112 is $US, we are just losing the $ somewhere. should be fun :P
[16:31:51] action=info seems ok
[16:31:59] some param sanitization?
[16:32:24] oh
[16:32:30] https://cirrustest-cirrus-integ02.wmflabs.org/wiki/US?action=cirrusdump is different id
[16:32:37] 112, 113
[16:33:36] huh, so https://cirrustest-cirrus-integ02.wmflabs.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirrusdoc&titles=%24US and https://cirrustest-cirrus-integ02.wmflabs.org/wiki/$US?action=cirrusdump have different values
[16:33:52] and cirrusdoc has different values than cirrusbuilddoc as well (and cirrusbuilddoc seems reasonable)
[16:34:09] stale data?
[16:34:13] oh, i guess cirrusdoc is probably throwing it out as not accurate
[16:35:51] it's failed twice in a row running the integration suite though, doesn't seem random :S hmm
[16:36:08] :/
[16:36:29] I don't get why cirrusdoc fails to print the indexed data with https://cirrustest-cirrus-integ02.wmflabs.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirrusdoc%7Ccirrusbuilddoc&titles=%24US
[16:36:38] but is getting something with https://cirrustest-cirrus-integ02.wmflabs.org/wiki/$US?action=cirrusdump
[16:36:59] i don't remember exactly, but it has a bunch of extra safeguards checking that sql and cirrus match, so we don't prematurely declare 'edit success' in the integration suite
[16:37:12] oh, probably revision id. let's see
[16:38:01] sigh, this has the wrong data.
[16:38:15] This says page_id 112, rev_id 113. rev_id 113 is not for page_id 112
[16:38:28] (the cirrusdump)
[16:38:52] sigh...
[16:41:31] could there be 2 cindys running at the same time?
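(A quick way to double-check this kind of mismatch from the shell. This is only a sketch: it assumes the cirrusdump output is a JSON array whose documents expose the indexed revision id as `_source.version`, and that `jq` is installed on the test host.)

    #!/usr/bin/env bash
    # Compare the revision id CirrusSearch indexed for a title against the
    # page's actual latest revision as reported by action=query&prop=info.
    WIKI='https://cirrustest-cirrus-integ02.wmflabs.org'
    TITLE='$US'   # literal page title; the $ is part of the name, not a shell variable

    indexed=$(curl -s "$WIKI/wiki/$TITLE?action=cirrusdump" \
      | jq -r '.[0]._source.version')                 # assumption: version field holds the rev id
    latest=$(curl -s "$WIKI/w/api.php?action=query&prop=info&format=json&titles=$TITLE" \
      | jq -r '.query.pages | to_entries | .[0].value.lastrevid')

    [ "$indexed" = "$latest" ] || echo "mismatch: index has rev $indexed, page_latest is $latest"

In the transcript above this would have flagged the page_id 112 / rev_id 113 pair immediately.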
[16:42:06] shouldn't be, i killed everything node/grunt in ps before starting the second round of testing, there's also a flock (maybe it even works) in run-cindy.sh
[16:42:31] * ebernhardson is dubious of posix flocks, probably for no good reason because we don't use NFS :P
[16:42:40] :)
[16:43:26] could there be stale data in the jobqueue, can't remember what implementation it uses
[16:43:36] but if it happened twice it's probably not that
[16:44:48] hmm, yea that's a possibility i suppose. Does seem odd it would happen the same way twice. There are also a small number of other failures, maybe a dry run of the saneitizer to see how far off the index is from reality
[16:45:28] fails waiting for edits a few times (if rev_id's are going in the wrong place that would do it)
[16:47:46] same problem, this has a title and rev_id for something else: https://cirrustest-cirrus-integ02.wmflabs.org/wiki/ILinkToNonExistentPages1654015572917?action=cirrusdump
[16:47:47] sigh
[16:48:02] should we worry about corrupting prod indices?
[16:48:14] I hope we don't...
[16:48:28] does the saneitizer capture these?
[16:48:47] i was waiting for the test to finish, but i guess there are plenty of fails to already find. Let's see
[16:49:38] at least I don't see anything crazy in grafana
[16:50:00] hmm, doesn't look like it's finding it. Finds a few pages in the wrong index though
[16:50:05] :/
[16:50:11] it really should though, i'll see if i can make the saneitizer find these
[16:50:33] make sure the revision matches the title?
[16:51:57] hmm, not the hardest fix but also not obvious. The problem is we do `if ( $version < $latest ) {`
[16:52:09] can change to equality, but then it adds other race conditions
[16:52:15] yes...
[16:52:46] i guess easiest would be a direct check, revision owned by page. Seems a bit wasteful to always check that but seems necessary atm
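(The "revision owned by page" check boils down to a single lookup against MediaWiki's core tables. A sketch, assuming direct SQL access to the test wiki; the database name `wikidb` and the use of a plain mysql client are assumptions.)

    #!/usr/bin/env bash
    # Hypothetical direct check: does the revision we indexed actually belong to
    # the page, and what does the page itself think its latest revision is?
    PAGE_ID=112
    REV_ID=113

    mysql -N wikidb -e "
      SELECT IF(COUNT(*) = 1, 'rev belongs to page', 'rev does NOT belong to page')
        FROM revision WHERE rev_id = ${REV_ID} AND rev_page = ${PAGE_ID};
      SELECT page_latest FROM page WHERE page_id = ${PAGE_ID};"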
[16:57:55] sigh the config for wikidata prefix does not seem correct
[16:58:19] also very odd, out of 429 pages, 19 have a revision id that doesn't match page_latest
[16:58:28] I wonder how this might have affected your a/b test
[16:58:38] dcausse: oh, hmm
[16:58:49] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/SearchSettingsForWikidata.php#6
[16:58:56] wWBCSPrefixSearchProfile should be wgWBCSPrefixSearchProfile
[16:59:16] huh, so what is it using?
[16:59:52] WikibaseCirrusSearch, lemme check what it's using when unset
[16:59:52] implies the tuned parameters are worse than the defaults we ship :P
[17:00:40] depends on where you switch, might perhaps also mean that nothing changed between a and b?
[17:02:07] config used was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/787069/1/wmf-config/InitialiseSettings.php, hmm
[17:02:55] i'm mostly guessing, but cirrusWBProfile probably overrides $wgWBCSPrefixSearchProfile? looking
[17:03:34] yes it should override the query builder (not the rescore tho)
[17:05:05] hm.. it might have compared it with code defaults indeed
[17:05:12] "code_default": "default",
[17:05:14] "actual_default": "default",
[17:05:20] from https://www.wikidata.org/w/api.php?action=cirrus-profiles-dump
[17:06:36] comparing some exact numbers in https://www.wikidata.org/w/api.php?action=wbsearchentities&format=json&search=example&language=en&context=item&cirrusDumpQuery=1&cirrusWBProfile=wikibase_config_prefix_query-202203-en&cirrusRescoreProfile=wikibase_config_entity_weight-202203-en it looks like it uses the weights
[17:06:59] and without the special params it's getting the cirrus defaults
[17:07:03] so the AB test was worse than those :S
[17:07:22] :/
[17:08:03] I don't think it was significantly worse so it might just be noise
[17:08:23] yea, it was barely different from control, many things indistinguishable
[17:08:41] sigh... so now I worry about fixing this config error
[17:09:19] oh wait you pushed the new settings? so that must be ok
[17:09:43] hmm, actually maybe this is an integration test problem and not cirrus. Running the saneitizer early in a new test suite run emits a bunch of `Deleted page in index 158 SON Nearmatchflattentest`
[17:09:52] basically, maybe the index doesn't actually clear between tests
[17:10:14] that could certainly cause all sorts of oddity...looking
[17:10:26] it does a curl -XDELETE /_all but nothing actually checks that it works
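(A minimal sketch of making that delete verify itself, assuming the test cluster is a plain Elasticsearch instance on localhost:9200 and that destructive wildcard deletes are still permitted there.)

    #!/usr/bin/env bash
    # Delete every index between test runs and confirm the cluster actually
    # came back empty, rather than fire-and-forget.
    ES='http://localhost:9200'   # assumption: local test cluster

    curl -s -XDELETE "$ES/_all" >/dev/null

    remaining=$(curl -s "$ES/_cat/indices?h=index" | wc -l)
    if [ "$remaining" -ne 0 ]; then
      echo "indices still present after delete:" >&2
      curl -s "$ES/_cat/indices?h=index" >&2
      exit 1
    fi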
[17:11:35] FYI cloudelastic-chi-eqiad went into red status during our reimage ( T309343 ), checking it out now
[17:11:35] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[17:11:44] doh!
[17:11:54] exciting morning :P
[17:13:35] inflatador: if you need help lemme know
[17:14:48] i wonder if that's caused by us running reduced replicas in cloudelastic, there is only 1 replica instead of 2 like a normal cluster
[17:15:27] also means cloudelastic always has to restart one node at a time, although i suppose my intuition would be if we restarted both at the same time it would temporarily go red, then come back when the nodes did. But never tested that scenario
[17:16:09] yeah, just doing basic health checks ATM
[17:16:17] says 2 unassigned shards
[17:18:48] yeah, it looks like we lost commonswiki_file_1647920262, primary and replica
[17:20:24] hmm, logs are a little hard to read because there are multiple clusters on the same hosts and it's not obvious which cluster each log is for :S
[17:20:30] (from logstash)
[17:20:37] shard 11 specifically
[17:21:17] sadly i think the only thing we can do is deploy the swift snapshot plugin and copy an index from prod.
[17:21:19] not sure how we lost it, since 1004 wasn't down at the same time as 1006 and has recovered
[17:21:47] but yeah, if that is the only way fwd that's OK, I need to learn how to do that anyway
[17:22:01] let me grab a drink and we can hop on a Meet if that works for you
[17:22:07] sure
[17:24:21] huh, /etc/ssh/ssh_known_hosts on the bastions doesn't have cloudelastic (tried to use my known hosts update script)
[17:25:00] or hmm, it does and my local ssh doesn't seem to like importing. oh well
[17:25:20] oh! it has 1,2,3 and 5. Missing 4 and 6. How does that happen?
[17:25:41] * ebernhardson has no clue where this even comes from, guess i have to look later :P
[17:26:20] I think it's due to the recent failed reimages? 4 probably got new host keys, although its reimages failed and it did not lose data
[17:27:29] pondering, we'll also need to pause the saneitizer for cloudelastic on commonswiki. Checking how, we might not have a specific flag for one cluster
[17:28:59] OK, up at https://meet.google.com/jwv-qtkd-fko
[18:23:26] dinner
[18:31:57] gehel we are in https://meet.google.com/jwv-qtkd-fko instead of the normal SRE room
[21:20:02] actually good timing, i was about to forget to get liam from school in 20 minutes :)
[22:34:36] regarding cindy, it seems as long as elasticsearch indices are properly nuked it passes. I'm not actually seeing where we ever nuke the indices in elasticsearch, easy enough to add but this had to have been somewhere before
[22:37:39] oh nm, it's supposed to happen from UpdateSearchIndexConfig.php --startOver. And the reason it doesn't happen is that when i fixed maintenance script arg passing i only half fixed it, and it was poorly tested because cindy wasn't even running the maintenance scripts at that point (changes to master broke the integration pipeline, required the maint patch to fix)
[22:37:49] re: I8dc4ef174a0c0
[23:40:23] other random oddity, eventgate stopped running on the cindy instance. It looks like the vagrant git-update routine ran `npm install` there and failed to put a node_modules directory into place. Manually running didn't help, but reinstalling the nodejs 10.x deb made it work again
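(Back to the cloudelastic incident above: a few read-only queries that help narrow down why a cluster is red and which shards are unassigned. A sketch only; the endpoint URL for the chi cluster is an assumption, and the index name and shard number are taken from the conversation.)

    #!/usr/bin/env bash
    # Read-only diagnostics for a red cluster: overall health, the unassigned
    # shards themselves, and the allocator's explanation for one of them.
    ES='https://cloudelastic.wikimedia.org:9243'   # assumption: chi cluster HTTP endpoint

    curl -s "$ES/_cluster/health?pretty"
    curl -s "$ES/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
    curl -s -XGET "$ES/_cluster/allocation/explain?pretty" \
      -H 'Content-Type: application/json' \
      -d '{ "index": "commonswiki_file_1647920262", "shard": 11, "primary": true }'

If the allocation explanation reports no valid shard copy left on any node, restoring the index from a snapshot (the swift plugin route discussed above) is indeed what remains.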