[08:20:13] o/
[08:43:05] o/
[08:43:55] have a quick SUP patch to start consuming "v1" streams: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124484
[08:44:51] followup patch is to start writing to these v1 streams
[08:46:26] gmodena: any objection to merge&deploy https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1105 ?
[08:48:11] dcausse let's do it
[08:48:37] i don't have a good way to test it without hitting the airflow scheduler
[08:49:13] i'll clean up the commit message and merge
[08:51:34] done
[08:54:19] thanks!
[09:11:37] np!
[09:21:46] running https://phabricator.wikimedia.org/P74203 to salvage historical query clicks
[09:28:35] done
[09:31:29] ack!
[09:43:52] seems like refinery-drop-older-than was designed with systemd-timer/cron-like systems in mind
[09:44:24] ideally it should drop a bunch of partitions based on the airflow execution date
[09:45:18] here if it fails for more than --allowed-interval days you have to manually do something
[09:49:24] yep
[09:49:47] that requires some manual execution, and care in dropping the right data
[09:50:36] IIRC it should suffice to recompute the validation checksum ?
[09:52:24] yes, but it's somewhat tedious to do the cleanup manually; if refinery-drop-older-than were designed to use an execution date instead of "now" it might be able to backfill failed runs without manual intervention
[09:53:42] here I update the checksums but I must deploy it within the allowed-interval (3 days), otherwise I might have to do another set of cleanups manually :)
[09:54:21] :|
[09:54:51] i think i had to do the same for webrequest. could not find a workaround
[10:40:54] gmodena: if/when you have a sec: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1160
[10:48:37] i'm indexing enwiki embeddings locally - it might very well be viable. Wish me luck :)
[10:48:40] dcausse checking now
[10:48:54] :)
[10:59:05] re-enabling the drop_old_data_daily dag...
[11:05:24] drop_old_data_daily has max_active_runs=2; not sure that's wise, it might try to delete the same set of data...
[11:06:10] dcausse ack... mmm. the checksum will guard against it, but yeah I think it's risky
[11:10:18] refinery-drop-older-than ran fine but now it's running 2 concurrent drop_snapshot_partitioned_partitions which both say: "Dropping 6 partitions from discovery.cirrus_index"
[11:13:33] skein logs don't show much
[11:14:16] and these scripts don't run on spark. Annoying
[11:16:00] they run pyarrow which I guess interacts directly with hdfs
[11:17:24] I guess they could both fail... we'll see
[11:20:06] dcausse possibly. I see they are running on overlapping partitions :|
[11:20:48] lunch
[11:26:27] actually they both succeeded :)
[11:46:07] nice :)
[11:56:30] lunch
[13:11:54] o/
[13:44:43] o/
[13:59:23] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129264 CR for adding the cirrus::opensearch role and hieradata for prod hosts
[14:09:18] \o
[14:09:23] o/
[14:16:17] * ebernhardson wonders why i ever set that to max_active_runs=2
[14:17:02] this repo doesn't say; the first commit where i brought it over from airflow v1 already has that
[14:20:54] ebernhardson maybe it was assuming a sequential pool?
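A minimal sketch of the DAG settings being discussed here, assuming Airflow 2.4+; the dag_id and task below are illustrative stand-ins, not the real drop_old_data_daily definition. With max_active_runs=1 two runs can never drop data concurrently, and a task keyed on the logical date rather than "now" is what would let a refinery-drop-older-than style cleanup backfill failed runs without manual intervention, per the 09:44 discussion:

```python
# Illustrative sketch only (assumes Airflow 2.4+); not the real DAG.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def drop_old_partitions(**context):
    # A real task would drop partitions older than the run's logical date,
    # not "now", so re-running or backfilling stays deterministic.
    cutoff = context["logical_date"] - timedelta(days=90)
    print(f"would drop partitions older than {cutoff}")


with DAG(
    dag_id="drop_old_data_daily_example",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,       # don't backfill missed intervals
    max_active_runs=1,   # never let two runs drop the same data concurrently
) as dag:
    PythonOperator(task_id="drop_old_partitions", python_callable=drop_old_partitions)
```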
[14:21:35] o/
[14:21:46] gmodena: nope, just catchup=False, max_active_runs=2
[14:21:56] ebernhardson: we're in https://meet.google.com/aod-fbxz-joy?authuser=0 discussing sudachi if you're around
[14:22:01] i suppose in theory, that means it could only run 2 if somehow the previous day's run hadn't finished
[14:38:25] errand
[15:02:11] confirmed, wikidata dumps are cursed :P
[15:07:53] hmm, getting an alert for cloudelastic SUP
[15:15:53] hmm
[15:16:40] actually, saneitizer
[15:17:20] saneitizer fix rate .... happening in all three clusters. It's certainly a lot of fixes
[15:17:31] almost entirely pageInWrongIndex
[15:17:52] which are usually very rare, that's unexpected
[15:18:44] spike at 9:16, then flat, then spiking since 13:10. Related to the DC switch somehow?
[15:18:59] hmm, no, that was an hour ago, so this started before
[15:19:47] maybe there were some prep steps that could trigger that? Doubtful though
[15:21:19] it's 40k docs per hour, or ~10/sec
[15:22:39] not obviously an api problem, spot-checked some random batches of 500 on enwiki through the api and nothing coming back
[15:23:33] might be nice to know what wikis/ids are being seen. I suppose I could guess the page id from the completion %
[15:24:57] unrelated, but looks like we'll need to dig up what renders `/etc/spicerack/elasticsearch/config.yaml` on cumin hosts (it has all our endpoints) and make one for an opensearch dir
[15:26:16] profile::spicerack::elasticsearch_config_data
[15:26:54] volans thanks! I'll probably have some more questions for ya before too long ;P
[15:27:10] hmm, avg loop completion is 93%, suggests we are at the end of the loop and visiting new-ish pages.
[15:28:17] * ebernhardson finds it would have been nice if sanity check events were emitted to kafka and then read back in, simply because then i could see where it is and run the same api requests ahead of it :P But that would lack backpressure and be annoying for other reasons
[15:28:34] i guess we could add a per-wiki gauge of the current page_id, maybe another day
[15:33:31] another option: some wiki changed their content namespaces? looking
[15:35:02] hmm, there was a change to wmgContentNamespaces at 10:05 today
[15:36:09] ah must be it
[15:37:20] it looks minor though, but maybe also mistaken: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1129204
[15:37:31] it changes kawiki to kaawiki
[15:37:44] oh wait, no, that's just a weird diff, it adds a new kaawiki
[15:38:09] kaawiki only has 23k pages
[15:39:00] volans probably a dumb question, but can spicerack/cookbooks gate on a host's role? Maybe somewhere in wmflib?
[15:39:19] define gate :D
[15:39:34] I'm trying to parametrize opensearch/elasticsearch for our `systemd` commands
[15:40:07] so like.. if role==cirrus, cmd=opensearch, elif role==elasticsearch, cmd=elasticsearch, that kinda thing
[15:40:53] in which context, the elasticsearch_cluster module or a specific cookbook?
[15:41:34] I was gonna do it in rolling-operation.py, but if it's easier to add in spicerack that's OK
[15:42:56] are the clusters homogeneous or do they have mixed hosts?
[15:43:00] I was also thinking of reading the role out of `/etc/wikimedia/contacts.yaml` on the hosts
[15:43:11] the clusters will be mixed during the migration process
[15:43:39] we'll change the role in puppet and reimage from elastic->opensearch
[15:45:05] does systemd support service "aliases"?
[15:45:29] it does! That's an interesting idea
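A minimal sketch of the per-role discrimination idea from 15:39-15:47, assuming a spicerack cookbook context; the Cumin aliases and service globs below are hypothetical placeholders, not real definitions. As volans explains just below, clustershell sends one command to every host in a query, so each role group gets its own run_sync():

```python
# Hypothetical sketch for a spicerack cookbook; the aliases (A:cirrus-opensearch,
# A:cirrus-elastic) and service globs are placeholders, not real definitions.
ROLE_TO_SERVICE_GLOB = {
    "A:cirrus-opensearch": "opensearch_*",
    "A:cirrus-elastic": "elasticsearch_*",
}


def restart_search_services(spicerack):
    remote = spicerack.remote()
    # clustershell can't send different commands to different hosts in one call,
    # so issue a separate run_sync() per group that needs a different unit name.
    for alias, service_glob in ROLE_TO_SERVICE_GLOB.items():
        hosts = remote.query(alias)  # assumed to match at least one host
        hosts.run_sync(f"systemctl restart '{service_glob}'")
```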
[15:46:38] the reason I'm asking is because clustershell (cumin's underlying library for parallel execution) doesn't support sending different commands to different hosts
[15:47:07] so however you do the discrimination, it means that you'll have to do a run_(a)sync() for each group of hosts that have a different service name
[15:47:20] unless you can find a pattern that matches them all
[15:47:46] Could we read the role from the host, set a variable, and use it to render the systemd command?
[15:50:10] I also think you can pass multiple patterns
[15:50:38] to systemctl $command
[15:50:52] so you could include them all and they will match only the ones that should
[15:51:18] reading an argument from a file on the host seems quite a hacky way to do that :)
[15:52:27] `sudo service restart $(cat somefile)`. What could go wrong? :P
[15:52:51] so I think you could pass 'opensearch_*' 'elasticsearch_*' 'cirrus_*'
[15:52:54] if that works
[15:52:56] Will systemd throw an error if I try to restart units that don't exist?
[15:52:58] to be tested
[15:53:27] oh yeah, that could work
[15:53:37] with the globbing and the multiple patterns it might allow it
[15:54:28] Worst case I can set the var within a shell cmd
[15:55:54] hopefully not needed. If you can't solve it any other way, at that point I'd suggest a bash wrapper injected by puppet that knows what to do
[15:56:28] talk about hacky ;)
[15:56:46] that would be used by humans too though :D
[15:59:39] taking a 5 min break
[15:59:56] coming to no great conclusions about the saneitizer :S
[16:01:00] it could be seeing a real problem, but i'm not finding representative pages to understand what's going on
[16:01:17] workout, back in ~40
[16:45:13] back
[17:09:38] hmm... so I think we'll have to actually touch spicerack. https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/rolling-operation.py#L209 reads from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/elasticsearch_cluster.py#111 . So a slight change to that might do the trick
[17:30:38] probably easier to just symlink `/etc/opensearch/instances` to `/etc/elasticsearch/instances`
[17:35:21] what's safer? symlinking `/etc/elasticsearch` to `/etc/opensearch`, or creating the same file resource twice? Does it matter?
[17:35:59] I just get a weird feeling that things might hook on to `/etc/elasticsearch` if it exists, but maybe that's just paranoia
[17:36:08] hmm, I guess it would happen either way, too
[17:36:28] hmm, both ways seem iffy :P
[17:38:08] true
[17:45:38] OK, this won't completely fix rolling-operation, but it gets us closer: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129334
[17:47:16] oh, hmm
[17:47:22] this should probably be in our cirrus.yaml
[17:47:27] or pp or whatever
[17:47:29] 1 sec
[17:53:32] ebernhardson: could be https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1127158/3/includes/Api/CheckSanity.php ? previously we did not report those problems and just failed this chunk of page ids?
[17:54:49] I put the symlink in server.pp. It's gross, but at least we're not adding stuff to logstash servers. We can always remove it once the migration is done
[17:54:55] lunch
[17:56:00] well perhaps not... looking at logstash there were a very low number of these errors... less than 50 in 2 weeks
[17:56:31] dcausse: ahh, yea i suppose in theory it could be, but the error rate is too low
[17:56:39] since the same batch had been failing consistently
[18:19:54] back
[18:23:15] * ebernhardson wonders if i really have to define a new event stream in the external configs for a remediations topic of UpdateEvents, or if i can simply reuse something reasonable with a new topic name
[18:23:24] one seems tedious, the other hacky :P
[18:30:31] well, the wdqs updater ran without a properly declared stream, so you could just write to kafka, but you won't get hdfs ingestion and all the other event stream utilities
[18:31:52] hmm, i guess we might as well do it properly, although not certain we will need the extra bits
[18:32:53] but probably easier if everything is "as expected"
[18:34:16] sure
[18:34:17] Unfortunately it looks like the rolling-operation cookbook doesn't repool hosts after it fails. I probably should at least add a warning for that
[18:36:03] would be nice to include the kind of problem the saneitizer detected, but not sure if it can be fitted into the existing schema
[18:37:30] dcausse: hmm, we would have to wedge it into the UpdateEvent somewhere, or i guess we could create a custom side-output
[18:37:45] but with a side output then we also have to define a new schema :)
[18:37:52] :)
[18:39:31] i suppose the dumber idea... log the remediations and skip kafka
[18:39:45] 10 logs per second isn't great, but might be acceptable-ish
[18:41:06] oldDoc remediation perhaps does not need logging, so perhaps that's even less?
[18:41:12] and there are ways to change log levels while the system is running, so i suppose they could be debug logs in the right place and only emitted when we want them
[18:41:42] could be nice indeed, esp. if we believe that's only useful for debugging
[18:42:27] (scarily, iirc changing log levels at runtime amounts to `kubectl edit cm flink-config-flink-app-consumer-cloudelastic` and editing the deployed log4j config directly)
[18:43:17] i suppose let's go with the easier answer, debug logs seem reasonable
[18:43:24] I think that's a nice capability actually, scary indeed but useful for a quick debugging session
[18:43:32] +1
[18:50:24] how much context do we need? I suppose something like this might be enough: log.debug("Problem {} on {} for page {} in index {}", errorType, wikiId, pageId, indexName);
[18:51:31] i was going to write messages per problem type, maybe there would be some extra context but only for redirectInIndex and pageInWrongIndex... but considering there are 2 indexes, we can guess which one is the wrong index
[19:15:48] this might be a dumb question, but I'm trying to figure out what, if anything, this does: https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/rolling-operation.py#L209 ?
[19:16:31] inflatador: the ExitStack
[19:16:33] ?
[19:16:59] oops, was on an old version of the code ;(
[19:17:07] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/rolling-operation.py#L217
[19:17:34] inflatador: it's similar to `with self.elasticsearch_clusters.stopped_replication():` instead of `with ExitStack() as stack:`, the main difference is you can conditionally add to the stack
[19:18:09] the `nodes.stop_elasticsearch()`. It looks like it runs regardless of the operation (reimage, restart etc.)
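A minimal sketch of the ExitStack pattern described at 19:17, with made-up context managers standing in for the cookbook's real ones; the point is that members can be entered conditionally and are all unwound together (in reverse order) when the block exits:

```python
# Stand-alone sketch of contextlib.ExitStack; stopped_replication() and
# frozen_writes() are made-up stand-ins for the cookbook's real context managers.
from contextlib import ExitStack, contextmanager


@contextmanager
def stopped_replication(cluster):
    print(f"stopping replication on {cluster}")
    try:
        yield
    finally:
        print(f"re-enabling replication on {cluster}")


@contextmanager
def frozen_writes(cluster):
    print(f"freezing writes on {cluster}")
    try:
        yield
    finally:
        print(f"unfreezing writes on {cluster}")


def rolling_operation(cluster, freeze_writes=True):
    with ExitStack() as stack:
        # Unlike a plain `with a(), b():`, members can be added conditionally;
        # everything entered here is exited in reverse order when the block ends.
        stack.enter_context(stopped_replication(cluster))
        if freeze_writes:
            stack.enter_context(frozen_writes(cluster))
        print("restart / reimage the nodes here")
```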
[19:18:36] inflatador: yes, that seems correct
[19:18:41] which I guess begs the question why we're doing it here: https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/rolling-operation.py#L260
[19:19:05] and if the answer is "we screwed up", that's OK... just wondering if it would be OK to replace line 260 with the same call
[19:19:38] inflatador: hmm, that does seem to suggest it stops it twice. I imagine looking at the logs of a run should make it clearer whether it's doing it twice?
[19:20:50] ACK, let me kick off another run and see what happens
[19:30:56] not sure why, but it's failed twice on connection errors to psi: elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='cloudelastic.wikimedia.org', port=9643): Read timed out. (read timeout=10))
[19:31:49] hmm, the ports ending in 43 should be nginx? Odd to get a connection timeout there, i would assume nginx would connect and error some other way
[19:32:39] I have a feeling we aren't waiting long enough after depooling the server
[19:32:47] and/or timing out too quickly
[19:33:14] b/c I seem to be getting cloudelastic1012 every time I manually curl. I guess I could depool and see how long it takes to get another host
[19:35:05] does seem plausible it needs more time to depool, i'm not sure how quickly lvs picks up changes
[19:35:44] in my test, it took only about a second to get a different host, so... hmmm...
[19:39:00] I do see some 1-second timeouts in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/elasticsearch_cluster.py#474, but I don't think we're actually hitting that code
[19:50:06] hmm, so we get debug logs now. They are essentially issuing pageInWrongIndex for everything. Example log: Problem pageInWrongIndex on frwikisource for page 4236999 expected in index frwikisource_general
[19:50:12] but it repeats for 98, 97, 96, etc.
[19:50:35] performing the same api call externally doesn't report any problems.
[19:51:11] should have noticed before... this translates into a bunch of document_missing_exceptions as well
[19:52:02] doesn't seem wiki specific, same for bnwikisource, eswikisource, svwikisource... curious they are wikisources
[19:55:42] curiously it's always _general, never _content (so far), and the only wiki that doesn't mention wikisource is sourceswiki
[19:56:30] was there a general change in wikisource content or searchable namespaces that wasn't obvious, perhaps as part of the train?
[19:57:41] turns out i can get the errors from the public api, just have to get far enough ahead of the saneitizer: https://pl.wikisource.org/wiki/Specjalna:%C5%9Arodowisko_testowe_API#action=cirrus-check-sanity&format=json&from=1171190&formatversion=2
[19:57:46] (works for now, probably not in a few :P)
[19:58:34] it feels like namespace 100 moved from _content to _general
[20:02:41] Picking my way thru the stack trace here https://phabricator.wikimedia.org/P74260
[20:08:15] this is so weird... MediaWiki\MediaWikiServices::getInstance()->getNamespaceInfo()->options->get( MediaWiki\MainConfigNames::ContentNamespaces ); is missing ns 100 and 102, but it's in wmf-config and $wgContentNamespaces
[20:12:28] so here is the mystery: https://phabricator.wikimedia.org/P74261
[20:13:29] sudo?
[20:14:25] ah no, sorry, it's getMainConfig vs getNamespaceInfo
[20:14:54] dcausse: yea, sudo is just to let the ->options work since it's not public
[20:15:22] must be new as part of the train
[20:15:53] dcausse: but even still, the service options should just be fetching from MainConfig?
[20:16:15] but MainConfig has the right data, and the ServiceOptions passed to NamespaceInfo doesn't have it
[20:16:49] and we are deciding based on NamespaceInfo::isContent
[20:22:10] Double-checked and indeed the timing all lines up with syncing the wikiversions files
[20:37:20] nothing jumps out poking through the mediawiki/core patch list for updates in wmf.21
[20:41:24] > 100 is extensions
[20:45:34] hmm, so for the example wiki in the paste (plwikisource), 100 is Page and 102 is Index.
[20:56:58] i guess if enough copying goes on, the problem could be order of operations? wmf-config for plwikisource only adds 104 and 124. 100 and 102 seem to plausibly be added by ProofreadPage\ProofreadPageInit::initNamespace via the SetupAfterCache hook
[21:02:35] sigh, yes that's what happened (added debug echoes to mwdebug1002). NamespaceInfo gets initialized before ProofreadPageInit runs
[21:02:42] but what am i supposed to do with that?
[21:03:51] :|
[21:05:17] ProofreadPageInit is wrong, it should be using onCanonicalNamespaces, no?
[21:05:44] well, not sure, I'm mixing things up
[21:06:14] hmm, maybe? I'm not sure ContentNamespaces has a direct hook
[21:06:38] i guess i could add a core hook in NamespaceInfo...
[21:07:03] or we could file a ticket and say "not our problem". but kinda meh :P
[21:07:41] on our side, it's doing what it's supposed to. It's not content anymore so it's getting moved... but that's just making a mess of searchability there
[21:08:11] well... if some subtle order of operations changed and it's affecting ProofreadPage, other things might be affected
[21:08:37] true
[21:09:34] with a quick grep, ProofreadPage looks to be the only extension directly changing wgContentNamespaces
[21:09:41] at least the only prod-deployed extension
[21:09:45] ack
[21:10:26] this probably breaks something else on wikisources that's related, but i wouldn't know what. no new tickets that look related filed today in phab
[21:10:31] the bug is that the Proofread ns are now non-content ns, I have no clue if that's a big deal or not
[21:10:56] seems close to a UBN, at least high priority imo
[21:11:22] realized one annoyance... before, when this happened on small wikis, we could manually run an in-process saneitizer via mwscript that got it all done quickly instead of slow-rolling over 2 weeks
[21:12:12] we can force-run a whole ns via cirrus-rerender but that does not clean up the other index
[21:12:31] i suppose i'm just wondering if lots of things are currently unsearchable
[21:12:36] because we look in the wrong index
[21:13:04] very likely, no clue how big these namespaces are
[21:13:06] well, not unsearchable, but they won't be found depending on whether your namespace selection gets only one index or both
[21:13:23] you have to search all ns to bring in the two indices
[21:13:31] the size must be reasonable to get 10/sec on the two-week loop
[21:14:07] problem is that if that bug is fixed in a couple of days it'll do the same dance again :/
[21:14:11] yes
[21:15:54] looking at a stack trace from the NamespaceInfo constructor, NamespaceInfo is being initialized during HookRunner::onSetupAfterCache, so inside the same hook (iiuc)
[21:16:20] when it's getting the handlers to be run
[21:17:19] stack trace: https://phabricator.wikimedia.org/P74262
[21:18:58] so some handler wants the SpecialPageFactory, that wants the ContentLanguage, which wants the LanguageFactory, which wants the NamespaceInfo
[21:21:28] changing globals is too brittle...
[21:22:14] indeed, the proper fix is probably to have an expected way to augment ContentNamespaces, probably via a new hook in NamespaceInfo
[21:22:32] i was hoping to find something to revert while that gets figured out though :P
[21:23:46] Hope I don't jinx it, but it looks like the rolling-operation cookbook is gonna work
[21:24:04] apparently we can blame someone in 2015 for this.... but there weren't better options a decade ago :P
[22:06:26] filed T389430 and the train was rolled back for now, so the pages that haven't been moved should become searchable, but a fix will be needed by someone
[22:06:27] T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430
[22:06:28] MediaWiki config change to shut off codfw is here: https://schedule-deployment.toolforge.org/window/1742500800
[22:07:09] looks good
[22:09:53] Stepping away for ~20m to pick up my son. Rolling operation is happening in cloudelastic. No problems so far, but if you see any alerts, that might be why
[22:10:01] cc ryankemper
[22:17:22] with the rollback the saneitizer stopped moving pages. I do wonder a bit what to do with the 300k that got moved; we don't have a simple log of what pages need to be moved back and it will take two weeks to get back to them
[22:25:46] back
[22:35:14] cookbook failed with `Error while waiting for yellow with no initializing or relocating shards`. Maybe a bug in our logic that looks for yellow specifically when the --allow-yellow flag is set
[22:44:58] I repooled all CE hosts and turned replica relocation back on. See ya tomorrow!
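For reference on the 22:35 failure, a minimal sketch of an "at least yellow, no moving shards" health wait, assuming the elasticsearch-py 8.x client; the endpoint and timeout values are illustrative and this is not the cookbook's actual logic. Note that wait_for_status='yellow' is also satisfied by green, so an --allow-yellow path shouldn't fail just because the cluster is already green:

```python
# Illustrative only, not the cookbook's check; assumes elasticsearch-py 8.x.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://cloudelastic.wikimedia.org:9643",  # example endpoint from the log
    request_timeout=60,  # the client-side read timeout that tripped at 19:30 was 10s
)

# Block server-side until the cluster is at least yellow with no shards
# initializing or relocating, or until the timeout expires.
health = es.cluster.health(
    wait_for_status="yellow",
    wait_for_no_initializing_shards=True,
    wait_for_no_relocating_shards=True,
    timeout="10m",
)
print(health["status"], health["initializing_shards"], health["relocating_shards"])
```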