[09:45:50] sigh... rdf-streaming-updater in staging has been suspended since yesterday...
[10:02:43] Hm
[10:05:17] dcausse: I can’t see k8s staging in logstash at the moment, do you know the cause already?
[10:05:56] Ah, never mind, I see the logs now
[10:06:26] pfischer: no clue what the root cause is, the flink message is pretty uninformative: "Job 5bb76cd4d0165a21cc098c2dc6080557 reached terminal state SUSPENDED"
[10:06:48] at 15:17:15 UTC yesterday
[10:07:17] my understanding is that SUSPENDED is a terminal state and no retries will be applied...
[10:08:51] will dig more into it after lunch, was hoping to use a running job to test the transition to flink 1.17
[10:08:59] and the new updater
[10:09:04] lunch
[10:28:03] Hm, I see a bunch of failed connection attempts to ZK in the logs
[11:41:51] how weird, the commons job just resumed...
[11:43:58] which one are you looking at? On codfw I see commons in a crash loop
[11:49:35] pfischer: I'm looking at commons@staging which I think runs in eqiad
[11:51:03] Okay, should we be worried about the commons@staging in codfw (the one in a crash loop)?
[11:51:04] just ran helmfile -i -e staging --selector name=wikidata apply and wikidata is now restarting...
[11:51:22] staging in codfw is a bit special I think...
[11:51:32] how so?
[11:51:48] I think it's used by serviceops for testing k8s itself
[11:52:41] our jobs should not run there but not sure if we can deploy/undeploy there
[11:53:44] we could ask janis if it's causing issues to have that job running there in a crash loop and he might be able to stop it
[11:54:55] I think the problem is that "staging" points to staging-eqiad but IIRC we can't really force it to go to codfw by e.g. doing helmfile -e staging-codfw
[11:58:56] I’ll ask him. So we never explicitly deployed it on codfw?
[12:01:10] for staging no, my understanding is that it's serviceops who decides where staging runs, it ran for a short time in codfw and that's probably why the jobs got deployed there
[12:52:28] o/
[13:04:19] o/
[13:06:49] getting org.apache.kafka.common.errors.InvalidTxnStateException: The producer attempted a transactional operation in an invalid state.
[13:09:55] Hm. IIRC there was sth. related to transaction-per-subgraph in the code… Looking
[13:11:15] But that looks alright.
[13:11:50] hello! did y'all see the email from Giuseppe with subject “[Ops] Wikimedia production: turning off bare metal MediaWiki clusters”? I think I see some api-ro.discovery.wmnet usages in some rdf updater airflow-dags
[13:12:58] ottomata: yes, thanks for the ping, just pushed & deployed https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/756
[13:14:33] I also upgraded to flink 1.17.1 so that might perhaps be the reason...
[13:15:59] excellent
[13:23:27] reverting, it's looping quickly over kafka transactions
[13:35:25] I suppose I'll test again in yarn but with flink 1.17
[13:40:40] oh! nice, just saw your patch, sorry! ty
[14:06:21] back
[14:06:40] \o
[14:07:00] huh, left the move page test running in a loop overnight, hasn't failed yet. Still not convinced i want to ship a sleep
[14:07:03] sleep(2) as the fix
[14:08:38] ebernhardson: yes... a sleep like this one: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1053253/12/tests/integration/features/update_general_api.feature ?
[14:08:50] or in javascript in the hooks?
[14:10:52] so flink 1.17 works well in yarn... so now wondering what's causing the issue in staging :/
[14:15:46] dcausse: did that run against the same kafka brokers?
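For context on the InvalidTxnStateException being chased above: a minimal sketch of how an exactly-once Flink KafkaSink is typically wired with the stock flink-connector-kafka builder on Flink 1.17. Broker, topic and prefix values are placeholders, and the idea that a reused transactional id prefix or an oversized transaction timeout could explain the invalid-state loop is only a guess, not a confirmed diagnosis of the staging failure.

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.DeliveryGuarantee;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;

    public final class ExactlyOnceSinkSketch {
        // Builds a transactional (exactly-once) Kafka sink; every literal is a placeholder.
        static KafkaSink<String> exactlyOnceSink() {
            return KafkaSink.<String>builder()
                    .setBootstrapServers("kafka-main.example.org:9092")           // placeholder broker
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("rdf-streaming-updater.mutation")           // hypothetical topic
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    // commits records through Kafka transactions on each checkpoint
                    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                    // the prefix must not be reused by any other producer/sink instance
                    .setTransactionalIdPrefix("rdf-streaming-updater-staging")
                    // keep below the broker's transaction.max.timeout.ms (15 min by default)
                    .setProperty("transaction.timeout.ms", "600000")
                    .build();
        }
    }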
[14:16:11] pfischer: no, in yarn I use kafka-jumbo
[14:17:11] Hm, the exception message suggested having a look at the kafka (broker?) logs. Maybe it says in there why the transaction was aborted (if at all).
[14:20:38] dcausse: in the feature file itself, basically after cirrusdoc says that the moved document has shown up in the new index, wait 2 more seconds
[14:23:39] ebernhardson: yes, I don't know why it's failing tho, it should wait for the target of the move to be there, but wondering if the index is perhaps not refreshed fast enough and still finds the old page
[14:24:28] my understanding is that it's a move between indices and we rely on a delete to not show this page
[14:24:29] dcausse: i think what's happening is the page is supposed to be deleted from index A, and show up in index B. I suspect ApiTrait::loadDocument is rejecting doc A from the response, because it's wrong, and returning doc B
[14:24:34] hmm
[14:24:55] basically the doc is in both indexes
[14:25:47] after such failures I checked the index and I couldn't find it in the source index IIRC
[14:26:08] dcausse: sleep(2) working suggests it does get deleted, but the order-of-operations varies
[14:26:33] and because the job runner doesn't run all the time, instead the jobrunner in mwcli basically sleeps 1s, executes runJobs.php, repeats
[14:27:01] in this case we should trigger two jobs
[14:27:27] but we don't explicitly wait for the page to be removed from the old index
[14:30:32] so yes, my understanding is that unless we find something similar to waitForOperation with loadDocument but for page moves, I don't see another way than sleeping an arbitrary amount of time and hoping for the best
[14:32:28] pfischer: if you search for "Invoking InitProducerId for the first time in order to acquire a producer ID" in the logs you'll see that it creates a ton of producers...
[14:33:20] but good idea about the kafka broker logs, will check if they're available in logstash
[14:33:53] i guess it's been working overnight, running the test every 2-3 minutes, i suppose even if it still fails intermittently it's still less frequent
[14:34:05] agreed
[14:35:00] someone sent an email pointing to this and basically asking why we are so terrible: https://www.objective.inc/articles/how-we-built-search-over-all-of-wikipedia-in-30-minutes-with-34-percent-better-relevance
[14:35:15] note: they did not do all of wikipedia
[14:40:35] the "automatically generate queries" part seems suspicious
[14:50:40] indeed, but sadly we can't really release useful query logs.
[14:51:52] true, nice response btw
[14:54:11] can't find kafka broker logs in logstash :/
[15:00:13] * ebernhardson should have also mentioned the 80/20 session split for fulltext. Maybe next time :P
[15:07:27] seems like many transactions were created: https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=__transaction_state&from=now-3h&to=now
[15:47:34] dcausse: Andrew sent me broker logs, see https://wikimedia.slack.com/archives/CSV483812/p1720709716099639, but there’s nothing suspicious in them.
[15:47:44] thx!
[16:40:26] dinner
[16:44:31] sigh... still the same issue
[16:45:52] trying from an empty state...
[16:49:30] same...
[16:53:49] ebernhardson: where was the email about Objective? I missed it.
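Since the broker logs are hard to reach from logstash and the __transaction_state graph only shows volume, one way to see which transactional ids exist and what state the coordinator thinks they are in is the admin API from KIP-664. A rough sketch, assuming kafka-clients 3.x and brokers recent enough to serve the ListTransactions call; the bootstrap server is a placeholder (the kafka-transactions.sh CLI that ships with Kafka 3+ exposes the same information).

    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TransactionListing;

    public final class ListOpenTransactions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-main.example.org:9092"); // placeholder
            try (Admin admin = Admin.create(props)) {
                // every transactional id the cluster currently tracks, with its
                // coordinator-side state (ongoing, prepare/complete commit or abort, ...)
                for (TransactionListing t : admin.listTransactions().all().get()) {
                    System.out.printf("%s producerId=%d state=%s%n",
                            t.transactionalId(), t.producerId(), t.state());
                }
            }
        }
    }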
[17:01:00] works with flink 1.17 from an empty state but with a single KafkaSink (no subgraphs)
[17:11:21] Trey314159: it was just a private email to me from a member of the public, i suppose i only cc'd david and guillaume. I'll forward it to the list i suppose
[17:33:48] I'm a bit puzzled, wondering if it's because we use a single task manager that the kafka producers are running on the same jvm... will try with a single tm in yarn
[17:34:25] hmm, odd step-change in saneitizer metrics. around 8:40 yesterday it dropped from ~500/s to ~200/s, then at 8:20 today it went to 0
[17:34:51] oops
[17:34:54] maybe that's the switch to prometheus and some metrics have the wrong name (i guess i've seen a couple related patches)
[17:35:12] yes, probably my fault...
[17:36:35] giving up on perfecting the elastic envoy patch for now. Assuming PCC is happy I'm going to merge it as is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053041 . AFAICT, this means we'll have an extra envoy doing nothing on 443
[17:36:42] doc sizes are also missing from the graphs
[17:37:18] doc sizes are kind of expected, they were only computed when running in the jobqueue
[17:37:39] but we should still have the private wikis tho
[17:37:43] sure, but they were still reporting for the private wikis and archive, but now they are not
[17:37:51] :/
[17:37:57] also the latency percentiles on the elasticsearch-percentiles graphs
[17:38:02] sigh...
[17:38:04] i suppose you must be almost done for the day, i can poke at this
[17:38:15] sure, thanks
[17:38:41] or could try and build replacement prometheus graphs...
[17:39:23] but i think i would like prometheus to have some more history before we switch graphing over... i suppose i'll fix the metrics first
[17:39:26] could be the right time indeed, but it's probably populating somewhere new in graphite
[17:39:41] yes
[17:51:38] dinner
[18:00:02] welp, envoy is still broken... looks like it's omitting some required envoy config for the listeners we actually care about
[18:05:48] ebernhardson: I had a meeting so I just saw your email. Nice response! I also found their automated testing to be very suspect. For better or worse, it's not currently worth it to spend time digging into it just because someone called us out on their blog... so I'm making myself look away!
[18:16:05] +1, thanks for the diplomatic response
[18:29:31] OK, this envoy patch is looking a bit better: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053751
[18:41:23] lunch, back in ~40
[20:34:42] hmm, that's a new one. While using the mediawiki shell.php: psysh (loop): literal.c:136: literal_get: Assertion `l->l_idx > (size_t)idx' failed.
[20:35:06] that must come from the php code :S
[20:35:12] php c code
[21:00:36] * ebernhardson is not finding obvious reasons for wrong metrics in the code... i guess the next step is to figure out where they are ending up
[21:11:03] graphite search says: Lost the CirrusSearch prefix on doc_size, same on requestTime metrics. They are now `MediaWiki.eqiad.requestTimeMs.comp_suggest...` and similar. But the code clearly has CirrusSearch in the provided statsd namespaces, and it's in the wmf.13 branch
[21:16:37] * ebernhardson wishes phpdbg still worked...
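If the meaningful difference really is one KafkaSink versus one sink per subgraph, the shape being compared is roughly the sketch below: each per-subgraph sink carries its own transactional id prefix, and with a single task manager all of their transactional producers share one JVM. This is only an illustration of that hypothesis; the subgraph names, topics, broker and stand-in source are invented, not the updater's real topology.

    import java.util.List;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.DeliveryGuarantee;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class PerSubgraphSinksSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // stand-in source; the real job consumes the mutation stream from Kafka
            DataStream<String> mutations = env.fromElements("main:Q1", "scholarly:Q2");

            // hypothetical subgraph split: one exactly-once sink per subgraph,
            // each with a distinct transactional id prefix
            for (String subgraph : List.of("main", "scholarly")) {
                mutations
                        .filter(m -> m.startsWith(subgraph + ":"))
                        .sinkTo(KafkaSink.<String>builder()
                                .setBootstrapServers("kafka-main.example.org:9092") // placeholder
                                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                                        .setTopic("mutations." + subgraph)          // hypothetical topic
                                        .setValueSerializationSchema(new SimpleStringSchema())
                                        .build())
                                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                                .setTransactionalIdPrefix("updater-" + subgraph)
                                .build());
            }
            env.execute("per-subgraph-sinks-sketch");
        }
    }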