[08:03:45] <gehel>	 dcausse: is T365155 affecting the WDQS data reloads?
[08:03:48] <stashbot>	 T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155
[08:04:39] <dcausse>	 gehel: no I don't think so
[08:05:25] <gehel>	 good! we have enough challenges already!
[08:06:02] <dcausse>	 :)
[08:12:57] <gehel>	 weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-05-31
[08:13:36] <dcausse>	 I love clear error messages: "Cannot resolve column name "snapshot" among (subject, predicate, object, context, date, wiki, scope)" which actually means "date" is not a known column...
[08:14:36] * gehel is trying to parse that and failing
[08:17:13] <dcausse>	 spent most of the week dealing with missing fields, wrong field names, table/schema definitions... wondering if we do something bad or if that's simply something that has to be tedious 
[08:37:51] <pfischer>	 dcausse: Is that related to the SUP schema update that adds wikidata-specific fields to the update event schema?
[08:42:15] <dcausse>	 pfischer: it's one of them yes but had also other unrelated problems when I made a typo using a column named `date` while `snapshot` was expected
[08:42:38] <dcausse>	 for the graph split hive table
[08:43:02] <dcausse>	 it's generally tedious to manage these schemas and table definitions
[08:43:53] <dcausse>	 for the SUP we were wondering with Erik if it still makes sense to define all the fields in the update schema since we no longer have fat events
[08:51:31] <pfischer>	 Hm, the schema is used for serialising updates, for example, in when encoding the original event as part of a fetch_error event, and the UpdateEventTypeInfo uses this for flink state serialisation, but I’m sure we could get rid of the strict definition for the raw fields and replace it with something more generic.
[08:53:48] <dcausse>	 yes that's the annoying part, not sure I like it but java serialization of a Map (assuming we only have simple types on the leaves) should be stable
[08:54:43] <dcausse>	 or keep a json string, not really efficient but doable
[08:55:56] <pfischer>	 I’ll create a ticket, unless you already have?
[08:56:27] <dcausse>	 no it was just a brief discussion we had last wednesday
[09:21:24] <pfischer>	 I looked into the fetch failure alert regarding fetch failure rate this morning and roughly 75% are caused by timeouts; overall zh- and enwiki are responsible for 25% of all fetch errors each followed by wikidata with 12% - Do we know what causes them or should we investigate? Otherwise we would have to increase the threshold of the alert.
[09:24:33] <dcausse>	 pfischer: no clue what they could be, parsing the page is a costly operation and can take time, what's the timeout we currently allow? there are pages that might take 30+s to parse 
[09:26:01] <dcausse>	 wikidata is surprising, it should not require parsing the page on the main namespace
[09:27:01] <dcausse>	 if these are wikitext pages it could make sense, there are pages with long list of items used in a template that might be slow to execute
[09:27:07] <pfischer>	 But it’s responsible for 37% of all timeouts: https://logstash.wikimedia.org/app/discover#/view/e7c4f7a0-1f1e-11ef-9df7-259cde0053cb?_g=(filters:!(),query:(language:lucene,query:''),refreshInterval:(pause:!t,value:0),time:(from:now-4h,to:now))&_a=h@31c361e
[09:31:42] <dcausse>	 looking
[09:37:21] <dcausse>	 pfischer: kafkacat -b kafka-main2005.codfw.wmnet:9092 -t codfw.cirrussearch.update_pipeline.fetch_error.rc0 -o -10000 -e | grep Timeout | grep wikidatawiki | jq .namespace_id | sort -n | uniq -c
[09:37:37] <dcausse>	 most seem to be namespaces 2 and 4
[09:38:41] <dcausse>	 2 is User and 4 is Wikidata
[09:39:20] <dcausse>	 so I'd suspect pages with long list of items that are very slow to parse
[09:40:50] <dcausse>	 e.g. https://www.wikidata.org/wiki/Wikidata:Database_reports/Complex_constraint_violations/P453
[09:41:20] <dcausse>	 not sure what we can do about those... :(
[09:42:35] <pfischer>	 “Timeout” does not grep all timeouts, “Timed out” occurs way more often
[09:43:03] <pfischer>	 https://phabricator.wikimedia.org/T366340
[09:45:02] <dcausse>	 trying without any filter, the main namespace gets 350 failures and the Wikidata namespace gets 6382 counts
[09:51:14] <dcausse>	 https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=26761954&prop=cirrusbuilddoc is one instance on the main ns that timed out
[09:51:28] <dcausse>	 no reasons for it to timeout
[09:51:49] <dcausse>	 esp with few retries
[09:57:34] <pfischer>	 Hm, I guess I'll increase the timeout as part of the ticket mentioned above. Thanks for looking into this, dcausse
[09:57:42] <pfischer>	 lunch
[09:58:00] <dcausse>	 bon apétit!
[10:08:22] <dcausse>	 lunch 2
[10:17:59] <dr0ptp4kt>	 dcausse: pfischer good day - i rescheduled to monday. am facilitating a bit on T365155 and my morning schedule is a bit thorny. hopefully monday works better anyway
[10:18:04] <stashbot>	 T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155
[10:28:49] <godog>	 hi folks, I was wondering what's up with the helmfile apply messages for cirrusearch-updater
[10:28:52] <godog>	 10:23 -logmsgbot:#wikimedia-operations- !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:29:01] <godog>	 cirrus-streaming-updater that is
[10:30:35] <gehel>	 dcausse will be able to provide more context, but this is automated for resolution of discrepancies in search indexing.
[10:30:54] <gehel>	 Basically, we're running short jobs on k8s
[10:31:35] <godog>	 ack, thank you gehel 
[10:32:17] <gehel>	 I'm not sure we have a way to suppress those messages...
[10:33:27] <godog>	 IIRC we do, helmfile.yaml for cirrus-streaming-updater will have the settings
[10:33:47] <godog>	 happy to file a task for this too, IMHO unless it is temporary we should be indeed muting the SAL messages
[10:34:22] <gehel>	 Oh, if we do have a way to suppress, we definitely should!
[10:34:51] <godog>	 ack filing a task
[10:36:48] <godog>	 T366346 not sure about the tags
[10:36:49] <stashbot>	 T366346: Mute helmfile apply notifications from cirrus-streaming-updater deploys - https://phabricator.wikimedia.org/T366346
[12:47:39] <inflatador>	 <o/
[12:52:45] <dcausse>	 o/
[13:11:13] <pfischer>	 o/
[13:21:39] <inflatador>	 can someone send me a Slack message? I'm having trouble getting the email notifications (almost like Slack doesn't want you to use them)
[13:22:04] <dcausse>	 inflatador: done
[13:22:23] <inflatador>	 Thanks dcausse ! Let's see if it works
[13:23:33] <dcausse>	 inflatador: re wikidata max lag, I think we should use our own metrics based on: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=41
[13:26:27] <dcausse>	 a query like: https://w.wiki/AFFb in alert manager
[13:26:46] <inflatador>	 dcausse oops! I missed that panel. Thanks, will get a CR up for that
[14:39:28] <ebernhardson>	 \o
[14:40:37] <dcausse>	 o/
[14:42:33] <ebernhardson>	 realized while pondering reindexing...i'm still doing it wrong :P  The problem is cirrus can reindex commonswiki_content, fail _general, then we backfill and it reindexes content.  The fix in cirrus is plausible, but still incomplete. The reindexer should be using UpdateOneSearchIndexConfig and doing one index at a time directly
[14:43:12] <ebernhardson>	 the problem is there is always this possibility that you are backfilling, and cirrus is reindexing that index you are backfilling, in which case the scroll loses the updates. The solution is to know you wont reindex that index again
[14:43:59] <ebernhardson>	 the problem with the cirrus fix as well is we have to signal it...i'm thinking to have cirrus exit with a special exit code the orchestrator can recognize as "this index will not be replaced, its fine"
[14:46:29] <dcausse>	 how can you backfill an index that is still reindexing?
[14:47:02] <ebernhardson>	 UpdateSearchIndexConfig always does them in order,  so it the second fails we start a backfill because indices changes, and we start a reindex becaues the reindex failed
[14:47:26] <ebernhardson>	 so say _content finishes, _general fails, we start a backfill, we start the reindex again, while backfilling the _content reindex starts over
[14:47:34] <dcausse>	 we don't wait for the backfill to complete to attempt a retry?
[14:47:48] <ebernhardson>	 right now i solve that by always backfilling from the beginning, but so the second backfill starts at the same time as the firs
[14:47:58] <ebernhardson>	 but that means with the failure, commonswiki is going to issue a 3+ day backfill
[14:48:12] <claime>	 ebernhardson: sorry to hijack the conversation, but is the constant redeployment of cirrus-streaming-updater intended?
[14:48:33] <ebernhardson>	 claime: yes its an automated process, i'll have to poke how to remove the sal logs
[14:49:14] <ebernhardson>	 the process will reindex and backfill ~4k indices, they get batched but there will still be hundreds of deploys
[14:50:08] <claime>	 ebernhardson: set SUPPRESS_SAL environment variable to true
[14:50:40] <ebernhardson>	 claime: ahh, ok i can set that. Although i cant set it on the current run which will finish in a few days, unless there is some sneaky way to inject env vars into /proc/<pid>
[14:52:46] <ebernhardson>	 alright pushed that update to the repo, it will now set SUPPRESS_SAL=true on all helmfile invocations
[14:53:24] <claime>	 well you can do it with gdb and setenv buuuuut it's not really recommended x)
[14:53:32] <claime>	 also probably needs root
[14:54:12] <ebernhardson>	 dcausse: as for waiting now, no i didn't have it wait for the backfill to complete before retrying the reindex. That has to do with how retrys work, and how since we run heaviest to lightest if a retry was not immediately rerun then something else would take its spot, and there wouldn't be any space in the queue until everything is done
[14:54:56] <ebernhardson>	 i guess it's perhaps not super important to run in a particular order, but it seemed best to start the heavy things and fill in gaps with small things
[14:55:01] <dcausse>	 can it stay in its work() method while retrying?
[14:55:43] <ebernhardson>	 dcausse: not while waiting for the backfill, at least not easily. It's separate threads and they only communicate through the event-log state
[14:56:31] <ebernhardson>	 the idea would be to issue all reindexes as a single index, then we track that single index change, one index only ever needs 1 backfill, it either works or doesn't
[14:57:05] <ebernhardson>	 it does mean some oddity with issueing multiple backfills for the same wiki, but i suspect thats ok
[14:57:14] <ebernhardson>	 would be odd if manual, but automated i suspect its ok
[14:57:54] <dcausse>	 from my little understanding a per index process seems easier to reason about than a per wiki I think
[14:58:22] <dcausse>	 that means extra backfill I suppose
[14:59:28] <dcausse>	 the pipeline does not know what index suffix to backfill
[15:03:10] <inflatador>	 heading to cowork office, back in ~30
[15:03:42] <ebernhardson>	 yea, it will backfill everything.  I suppose one other issue with the current way, is that _content finishes reindexing, then _file reindexes for the next day. _content is missing its backfill that hwole time
[15:04:26] <ebernhardson>	 i think i will switch it over, seems like less edge cases and strong guarantees. Shouldn't be too complicated
[15:04:34] <ebernhardson>	 s/strong/stronger/
[15:07:39] <dcausse>	 I think the consumer events might have the index_suffix in them so should be possible to filter those as well in theory if that helps
[15:12:24] <ebernhardson>	 dr0ptp4kt: for job queue, deploys were may 8 for codfw, and may 14 for partial eqiad, then may 28 for full. Looking for representative graphs, the thing is we are still sending the jobs so the # of jobs didn't decline, but they are all no-ops now (they will be removed next train).
[15:12:49] <ebernhardson>	 trying to find a good graph that shows actual work performed, not a number of requests but cpu wall time or some such. probably by the jobrunner cluster.
[15:13:32] <ebernhardson>	 dcausse: hmm, i suppose i'm not too woried about it. the backfills are quick compared to the reindexes, i suppose it's mostly extra load on the mw-api-int servers
[15:14:30] <ebernhardson>	 the mw on k8s dashboard really didn't like being set to 30 days :P
[15:21:31] <ebernhardson>	 from individual graphs, it claims we used to run ~100 cirrusSearchLinksUpdate jobs in parallel at any given time, right now its down to 6
[15:22:37] <claime>	 ebernhardson: I assume requests like ?action=query&format=json&cbbuilders=content%7Clinks&prop=cirrusbuilddoc&formatversion=2&format=json&pageids=142900721 with UA Apache-HttpAsyncClient/5.3 (Java/11.0.20) are from the backfill ?
[15:22:52] <ebernhardson>	 but if we claim that means freeing up ~100 cores in mw-jobrunner, the cluster has 2.5k idle apache workers so it's hard to see in the normal variance
[15:23:21] <ebernhardson>	 claime: they could be, but thats also part of the standard search update pipeline (backfill invokes same code)
[15:24:07] <ebernhardson>	 claime: i think we notified a few people, but part of the switch we are making moved a few hundred concurrent requests from the mw job queue to the mw-api-int cluster
[15:24:55] <claime>	 Would it be possible to change the UA to something different for each so we can identify the requests correctly in logs?
[15:25:34] <claime>	 Like WMF/CirrusBackfill and WMF/SearchUpdatePipeline, or something like that?
[15:25:37] <ebernhardson>	 claime: hmm, i thought we did but might have missed it. I'll add a ticket to re-review
[15:26:47] <claime>	 thanks a bunch :)
[15:27:49] <inflatador>	 back
[15:30:05] <ebernhardson>	 dr0ptp4kt: in terms of being able to see the change, perhaps the saturation section of mw-on-k8s dashboad for mw-jobrunner shows best.  Clear drop in cpu and network usage around 2000 on 5/28
[15:30:38] <ebernhardson>	 the removal of saneitizer from job queue also removed some spikiness from top-level graphs
[15:34:05] <ebernhardson>	 hmm, don't know if it's intentional but most of product and tech is no longer on https://wikimediafoundation.org/role/staff-contractors
[15:34:53] * ebernhardson should use the contact list anyways
[15:39:03] <dcausse>	 yes that page stopped being updated properly
[15:40:21] <dr0ptp4kt>	 ty ebernhardson . hopping in vehicle to meet locals and work from a cafe. ttyl and ty again!
[15:44:05] <dcausse>	 seems like renaming spark partitions to something human readable should happen in hadoop (and not on the fly while doing hdfs-rsync), adding that to the ntriple generator spark job
[16:24:26] <dcausse>	 going offline, have a nice week-end
[17:55:19] * ebernhardson has avoided naming this property because i don't know what to call it...but now i refer to the `something` property in 3 files :S
[17:55:33] <ebernhardson>	 usually using things gives me better naming ideas :P