[00:19:11] ahh.. finally found why commonswiki descriptions are an array, they swap the labels into descriptions
[11:06:02] lunch
[11:10:41] Lunch 2
[11:53:46] hi all. i've another issue with my own wikidata instance, hope you can give me some hint :-)
[11:55:03] the issue: many - but not all - non-ascii characters in strings are missing, replaced by a utf-8 "replacement character"
[11:56:09] for example: fetching the japanese labels for italian cities, i get (about) half of them right, and the other half as ���������������������
[11:57:34] or even running the "cats" example query: wd:Q100451536 becomes Sanj��r��
[11:59:34] madbob: Hi! Currently all experts are out for lunch or are outside their working hours. They will get back to you shortly.
[12:00:01] i've not yet explored the dump i used for the initial import (as it is a very large file, it is a bit complex to analyze...) to verify if the broken strings were there in the first place, nor have i yet explored each step of the import process, but perhaps i missed some step which is obvious to you...
[12:01:20] thank you pfischer! no urgency, indeed it is lunch time for me also ;-)
[13:05:12] madbob: hey, so the question is to try to figure out whether the problem happens while importing the dump or when querying your wdqs replica
[13:05:44] does the simple query: select * { wd:Q100451536 rdfs:label ?o } also print replacement chars?
[13:07:04] yes, that query produces mostly � results
[13:07:41] madbob: what process did you use to import the dumps?
[13:08:11] this: https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md
[13:09:11] i don't really know where the issue started: in the source data, at import time, in some blazegraph configuration... the strange part is that only a part of the strings are broken, not all of them
[13:10:40] it's likely strings requiring multiple bytes in utf-8 (roughly anything that is not ascii)
[13:11:05] if you have a string that has non-ascii chars and prints properly that is definitely weird
[13:14:24] the turtle RDF format states that it should be UTF-8 so I doubt that the RDF parser would use the system encoding to parse the dumps... but I'd be curious to know the output of "java -XshowSettings:properties -version" when run from the same user you used to import the dumps
[13:14:37] madbob: ^
[13:15:14] (only interested in the "file.encoding" value)
[13:32:29] oops... probably the liberachat web client blew away my latest messages...
[13:32:57] openjdk version "11.0.16" 2022-07-19
[13:33:18] asking for the japanese labels for italian cities, i obtain many legit strings (i do not read japanese, but they look right ;-) )
[13:33:24] e.g. wd:Q8621 is テルニ
[13:33:28] but wd:Q279 is ���������
[13:33:33] select ?o (lang(?o) as ?lang) { wd:Q8621 rdfs:label ?o } - returns all legit strings, in any possible charset
[13:33:37] select ?o (lang(?o) as ?lang) { wd:Q279 rdfs:label ?o } - all non-ascii chars are broken
[13:33:56] an interesting thing i just found opening random items: the broken strings belong to entities which have not been updated since the dump i used to perform the initial import, the right strings belong to entities which have been updated later
[13:34:02] does runUpdate sync the whole entity that has been modified (so, it may have fixed the strings which were broken at import time) or just the modified properties (so, this is not a useful hint)?
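The one-replacement-character-per-byte pattern above (e.g. Sanj��r�� for a Japanese name) is what you get when raw UTF-8 bytes are decoded with an ASCII default charset. Here is a small standalone check, independent of the WDQS tooling (the class name and the テルニ sample are only for illustration), showing how the JVM's file.encoding property drives this; the ANSI_X3.4-1968 value reported a bit further down is the POSIX name for US-ASCII.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EncodingCheck {
        public static void main(String[] args) {
            // UTF-8 bytes of the label テルニ (Terni, wd:Q8621), written as
            // unicode escapes so the source file encoding doesn't matter.
            byte[] utf8Bytes = "\u30C6\u30EB\u30CB".getBytes(StandardCharsets.UTF_8);

            // Decoding with an explicit charset always round-trips cleanly.
            String explicit = new String(utf8Bytes, StandardCharsets.UTF_8);

            // Decoding with no charset argument uses the JVM default
            // (file.encoding on JDK 11); with US-ASCII every multi-byte
            // sequence becomes a run of U+FFFD replacement characters,
            // one per raw byte.
            String viaDefault = new String(utf8Bytes);

            System.out.println("default charset: " + Charset.defaultCharset());
            System.out.println("explicit UTF-8 broken?  " + explicit.contains("\uFFFD"));
            System.out.println("default charset broken? " + viaDefault.contains("\uFFFD"));
        }
    }

Running it once with java -Dfile.encoding=US-ASCII EncodingCheck and once with java -Dfile.encoding=UTF-8 EncodingCheck reproduces the difference described in the conversation.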
[13:34:43] so definitely something happened while importing the dump
[13:35:09] runUpdate should sync the whole entity
[13:35:45] what's the output of "java -XshowSettings:properties -version | grep file.encoding" ?
[13:36:20] err: java -XshowSettings:properties -version 2>&1 | grep file.encoding
[13:37:03] file.encoding = ANSI_X3.4-1968
[13:37:04] ah!
[13:38:05] I'm surprised that the dump import process relies on this but it might be the cause
[13:38:49] perhaps a way to confirm would be to generate a small dump file with some non-ASCII chars and import it
[13:39:01] using the same technique you used initially
[13:39:40] if that's confirmed we need to fix the import script to force file.encoding to UTF-8
[13:39:58] actually, perhaps you still have the munged files?
[13:41:33] yes, i still have the munged files
[13:41:42] going to run some tests
[13:45:13] sure, so the problem might happen either during munge.sh (in which case you can confirm by opening a munged chunk and observing the � replacement char) or during loadRestAPI.sh, in which case I think you need to write a custom ttl file
[13:46:58] the munged files are broken!
[13:47:22] so most probably the RDF parsers are relying on system encodings :(
[13:47:27] running a `zgrep � *` on them, i find many results
[13:47:52] e.g. wikidump-000000001.ttl.gz: pq:P1810 "Fi� allo Scilliar" .
[13:50:06] madbob_: (if you already have an account) would you mind filing a bug in https://phabricator.wikimedia.org/ and tagging it Wikidata-Query-Service?
[13:50:50] an easy workaround is to change the munge script to pass -Dfile.encoding=UTF-8 every time java is called, or simply change your system locale
[13:52:46] just registered an account (i was already logged in with my wikipedia account ;-) ), going to create a report
[13:52:58] thanks!
[14:02:42] https://phabricator.wikimedia.org/T323575
[14:03:55] ok, probably i will start over. i have to figure out how to change the openjdk configuration (just to avoid missing some -Dfile.encoding=UTF-8 within the scripts)
[14:04:28] thank you! please let us know if you run into trouble with this workaround
[14:30:03] o/
[15:30:42] inflatador: has WDQS fully recovered?
[15:31:59] gehel: I still have to restart a few more hosts but load is dropping on all the hosts that have been restarted
[15:32:20] sounds good! keep an eye on it and scream if it goes back up
[15:32:41] Will do. All hosts appear to be passing their health checks as well
[15:37:47] going back a week, it looks like load avg-15 for wdqs in eqiad peaks at around 25, usually closer to 15
[15:39:01] load15 > 40 is generally bad for wdqs
[15:39:30] here the threadcount exploded in no time
[15:40:45] ryankemper: this is a good check of our SLO work. This seems to clearly be a case of things going wrong and should be detected as such by our SLI.
[15:41:16] dcausse: thanks, it looks like load15 was over that from ~1520-1530 UTC
[15:41:41] inflatador: yes
[15:46:06] I'll get started on an incident report
[15:51:22] \o
[15:51:34] What dashboard has HTTP response codes from WDQS? I noticed Alex mentioned 429s
[15:51:41] o/
[15:52:21] inflatador: might be https://grafana-rw.wikimedia.org/d/000000522/wikidata-query-service-frontend?orgId=1 ?
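Coming back to the munged-file finding above (the � characters already present in the .ttl.gz chunks, and the suspicion that the RDF tooling relies on the system encoding): this is a sketch of that bug class, not the actual munger code, and the method names are made up for illustration. Any reader built without an explicit charset decodes with file.encoding, which is exactly what forcing -Dfile.encoding=UTF-8 (or fixing the system locale) works around.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    public class TurtleReaderSketch {

        // Locale-dependent: FileReader with no charset argument uses the
        // platform default on JDK 11, so ANSI_X3.4-1968 turns every
        // multi-byte UTF-8 sequence into U+FFFD.
        static BufferedReader platformDefault(String ttlPath) throws IOException {
            return new BufferedReader(new FileReader(ttlPath));
        }

        // Locale-independent: decode the gzipped turtle chunk as UTF-8
        // explicitly, as the turtle spec requires.
        static BufferedReader explicitUtf8(String ttlGzPath) throws IOException {
            return new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(ttlGzPath)),
                    StandardCharsets.UTF_8));
        }
    }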
[15:53:10] also the blip is pretty visible on the new SLO dashboard: https://grafana-rw.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1&from=now-6h&to=now
[15:53:10] oh yeah, I found that one... but didn't look carefully at the Varnish error rates :(
[16:02:55] sometimes I wonder if making pre-defined elastic filters from the extra plugin might help a bit instead of repeating all these asciifolding, asciifolding_preserve, truncate_keyword, remove_empty... for every language
[16:04:06] s/language/index/
[16:15:10] hmm, maybe? I suppose after all our unpacking of things, i wonder if packing things up together really helps much. I guess it reduces some duplication in the definitions, but those are auto-generated from code anyways?
[16:16:41] this would reduce the size of the analysis config and save an instance of these token filter factories
[16:19:10] they'll be like "remove_duplicates", which only appears when you reference it when defining the analyzer, not like "remove_empty" that maps to the "length" filter that's always configured with min: 1
[16:20:12] not sure that's worth it tbh but it's always frustrating to see all these configs being duplicated :)
[16:21:05] hmm, i suppose that could help a little bit. Indeed there is a ton of duplication in the analysis config
[16:58:32] Had a mild family emergency, so I'm even worse at keeping up with IRC than usual. I'd be up for trying to refactor the analysis configs to reduce duplication and also remove unneeded config (like having word_break_helper configured where it doesn't actually do anything). I'll open a ticket and we can think about it for next quarter
[17:00:39] Trey314159: sorry to hear this, hope it's nothing too bad. no worries about the duplication in the analysis config, don't bother too much with it unless you find easy wins :)
[17:05:02] dcausse: thanks.. it was nothing *too* serious.. a busted ankle (not mine) that probably needs x-rays, coordinating transport, finding doctor appointments, etc.
[17:05:15] I wouldn't mind doing some general clean up in the analysis config. I see some things from time to time but avoid dealing with them because it is a bit messy and could generate a lot of changes, and I don't want to mix it in with whatever I'm working on.
[17:14:20] errand
[17:23:19] Trey314159: good luck with that ankle! Feel free to disappear as needed
[17:23:27] thanks!
[17:31:01] workout/lunch, back in 1hr. Incident report draft is here: https://wikitech.wikimedia.org/wiki/Incidents/2022-11-22_wdqs
[18:47:50] Hi! I have a kafka-related question: we have an abstraction called stream-config that maps streams to kafka topics, most of the time one in eqiad and one in codfw. I don't know how the flink-kafka-connector is instructed to pick one of those topics to subscribe to. Does it make a difference which topic is picked, or is one the mirror of the other? How do we decide which kafka cluster to write to?
[18:49:14] pfischer: i think you need to read from both. The eqiad/codfw prefix refers to where the events were produced from; they are then replicated from the source cluster to the other dcs
[18:50:21] in normal operations mediawiki only produces from one cluster, so mostly the other should be empty. But reading from both should give a clean switchover if mediawiki writes move between datacenters
[18:54:48] sorry, just got back
[18:57:40] ebernhardson: thanks! So, assuming the search update pipeline runs in every dc, does it write to its dc-local kafka cluster?
[18:59:20] pfischer: for the part that consumes and then writes to elasticsearch, yeah, that should be the dc-local topic. For the part that reads mediawiki event streams it would need to read all of them
[19:21:42] inflatador: interestingly, looking at https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1&from=1669129833010&to=1669130793940 we do have a lot more 429 responses, but our total number of (200+403+429) didn't seem as high as I might expect
[19:21:50] so that might be some evidence for it being just a bad query rather than overall load
[19:26:43] Hmm.. Looking at the analysis config and thinking about deduping stuff.. it's kind of hard because we modify a lot of the analyzers if, for example, the ICU plugin or extra plugins are available. Preconfiguring all possible options sounds impossible. Preconfiguring even our favorites might cause interesting problems with dependencies, and we might end up duplicating specific versions of jars and/or landing in jar hell.
[19:26:54] ah, doh, it's mostly 5xx errors that spiked
[19:28:22] OTOH, word_break_helper, as configured for the default "text" analyzer, is *completely* useless.. it is only configured when it cannot be applied. I'm sure it's a historical remnant of evolving code.
[19:29:47] ryankemper: valid point. I haven't thought too deeply about it yet. I imagine if the total number of responses drops, that implies some connection timeouts maybe?
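For the stream-config question answered above: a minimal sketch of subscribing to both datacenter-prefixed topics with Flink's KafkaSource, assuming the standard flink-connector-kafka API. The broker address, topic names, and group id are placeholders for illustration, not the pipeline's real configuration.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ReadBothDatacenters {
        public static void main(String[] args) throws Exception {
            // Subscribe to the same logical stream under both dc prefixes;
            // whichever dc mediawiki is not producing from stays mostly empty,
            // so reading both gives a clean switchover.
            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka-main.example.org:9092")  // placeholder
                    .setTopics("eqiad.mediawiki.page-change",            // placeholder topic names
                               "codfw.mediawiki.page-change")
                    .setGroupId("search-update-pipeline")                // placeholder
                    .setStartingOffsets(OffsetsInitializer.latest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "mediawiki-events")
               .print();
            env.execute("read both dc topics");
        }
    }

The write side, per the answer above, would then target only the dc-local prefix.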
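To make the filter-duplication point from the analysis-config discussion concrete: a toy model using plain Java maps (not real Elasticsearch or CirrusSearch code) of how each index currently repeats definitions like remove_empty (a length filter with min: 1), versus analyzers simply referencing plugin-provided, pre-configured filters by name the way the built-in remove_duplicates works.

    import java.util.List;
    import java.util.Map;

    public class AnalysisDuplicationSketch {
        public static void main(String[] args) {
            // Today: every index's analysis settings carry their own copy of
            // the same custom filter definitions.
            Map<String, Object> currentAnalysis = Map.of(
                "filter", Map.of(
                    "remove_empty", Map.of("type", "length", "min", 1),
                    "asciifolding_preserve", Map.of(
                        "type", "asciifolding", "preserve_original", true)),
                "analyzer", Map.of("text", Map.of(
                    "tokenizer", "standard",
                    "filter", List.of("asciifolding_preserve", "remove_empty"))));

            // If the extra plugin pre-registered those filters, the per-index
            // "filter" section could disappear and only the name references
            // would remain in each analyzer.
            Map<String, Object> withPrebuiltFilters = Map.of(
                "analyzer", Map.of("text", Map.of(
                    "tokenizer", "standard",
                    "filter", List.of("asciifolding_preserve", "remove_empty"))));

            System.out.println("duplicated per index:   " + currentAnalysis);
            System.out.println("with pre-built filters: " + withPrebuiltFilters);
        }
    }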