[05:57:09] All the reboots are done except elastic eqiad (partially done, can resume w/ `sudo -E cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad cluster reboot" --reboot --nodes-per-run 3 --start-datetime 2023-05-04T04:35:07 --task-id T335835`) and wcqs codfw
[09:42:30] https://wiki.bitplan.com/index.php/Wikidata_Import_2023-04-26#stats_.2F_ETA_script shows a difference in performance between the two import machines. I fear I am doing the import on a rotating-disk RAID. Should I restart on an SSD or simply put up with the extra time this might take? For Jena, the effect of a rotating disk was horrible - like 180 days instead of 12 days.
[09:57:50] Lunch
[09:58:53] seppl2023: we're running WDQS on SSDs. In our context, the limiting factor is the CPU.
[10:04:31] I suspect the spinning disks will reduce the throughput very significantly, but that's not something we've tested.
[10:55:20] I'll let things run for comparison, then, and simply copy the journal file from the faster machine back to the slower one later if the load is successful.
[14:37:59] o/ Question about Search: is it fair to say that our current infrastructure operates at the page level? That is, all keyword extraction, weighting statistics, etc. happen at the page level, as opposed to, for example, by section or paragraph or some other unit of content?
[14:38:19] Context: for the hackathon, I'm putting together a demo on natural-language search with Wikitech to showcase a use of language models, and I wanted to be able to describe how it's similar to / different from our existing Search. My demo operates at the section level for comparison. You can see the initial static version if you're curious: https://public-paws.wmcloud.org/User:Isaac_(WMF)/hackathon-2023/wikitech-natural-language-search.ipynb
[14:38:57] isaacj: yes, that's a fair description!
[14:39:35] excellent, thanks for the confirmation!
[14:57:32] \o
[15:29:17] isaacj: maybe our highlighter is relevant in this context.
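(Editor's note: the page-level vs. section-level distinction discussed above can be illustrated with a toy sketch. This is hypothetical code, not the CirrusSearch or hackathon-demo implementation: it uses a simple bag-of-words cosine similarity to show how scoring each section separately can surface the one section that actually matches a query, whereas a single page-level vector dilutes the matching terms across the whole page.)

```python
# Toy sketch (hypothetical): page-level vs. section-level scoring
# with bag-of-words cosine similarity.
import math
from collections import Counter

def tokenize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented example page with three sections.
page_sections = {
    "Intro": "wikitech hosts documentation for wikimedia infrastructure",
    "Search": "elasticsearch powers full text search across wiki pages",
    "Storage": "mariadb and swift store wiki content and media files",
}

query = tokenize("how does full text search work")

# Page-level: one vector for the entire page.
page_vec = tokenize(" ".join(page_sections.values()))
page_score = cosine(query, page_vec)

# Section-level: score each section independently, keep the best.
section_scores = {name: cosine(query, tokenize(body))
                  for name, body in page_sections.items()}
best_section = max(section_scores, key=section_scores.get)

print(best_section)
print(section_scores[best_section] > page_score)
```

Running this, the "Search" section scores higher on its own than the whole page does, which is the intuition behind indexing (or highlighting) smaller units of content.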
While we search at the article level, we do try to highlight the most relevant section on the search result page.
[15:30:45] gehel: that makes sense. Essentially, it uses the page as the unit for pre-processing / candidate generation, but post-processing can still narrow down to smaller units.
[17:14:03] ryankemper, inflatador: I just replied to Willy about reducing the hardware budget for next year. Feel free to disagree with me directly on the email thread!
[18:32:23] gehel: I'll be at pairing in 2'
[18:32:29] ack
[23:00:32] trying to look at `(RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable`
[23:01:39] The pod in question appears to be this one:
[23:01:41] https://www.irccloud.com/pastebin/hhAu5E4w/
[23:02:34] Checking `kubectl logs -f flink-session-cluster-main-86d8f9978-4z29r -c flink-session-cluster-main`, I see a whole lot of:
[23:02:49] `{"@timestamp":"2023-05-04T23:01:58.120Z", "log.level": "INFO", "message":"The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.", "ecs.version": "1.2.0","service.name":"main","event.dataset":"main.log","process.thread.name":"flink-akka.actor.default-dispatcher-8962","log.logger":"org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor"}`
[23:03:09] Oops, message got split. TL;DR: `The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet`
[23:19:22] ...And this is where I'm a bit dead-ended, though. It's not super clear to me where the jobmanager lives.
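(Editor's note: when a pod spews thousands of JSON log lines like the one above, filtering them programmatically is often easier than eyeballing `kubectl logs` output. The helper below is a hypothetical sketch, not an existing Flink or Wikimedia tool: it parses the ECS-formatted JSON records and keeps only the "has not been started yet" messages that indicate the JobMaster RPC endpoint is not up. The sample line is the one quoted in the log above.)

```python
# Hypothetical helper: filter ECS-format JSON log lines from
# `kubectl logs` for records showing the JobMaster has not started.
import json

def jobmaster_not_started(lines):
    """Yield (timestamp, logger) for records where an RPC endpoint
    reports it has not been started yet."""
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output (stack traces, partial lines)
        if "has not been started yet" in rec.get("message", ""):
            yield rec["@timestamp"], rec["log.logger"]

# The log line quoted in the incident above:
sample = [
    '{"@timestamp":"2023-05-04T23:01:58.120Z", "log.level": "INFO", '
    '"message":"The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster '
    'has not been started yet. Discarding message '
    'org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until '
    'processing is started.", "ecs.version": "1.2.0","service.name":"main",'
    '"event.dataset":"main.log",'
    '"process.thread.name":"flink-akka.actor.default-dispatcher-8962",'
    '"log.logger":"org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor"}'
]

hits = list(jobmaster_not_started(sample))
print(hits[0][0])  # 2023-05-04T23:01:58.120Z
```

In practice you would pipe `kubectl logs <pod> -c <container>` into such a filter, or count the hits over time to see whether the jobmanager ever comes up.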