[11:34:13] tholzheim (Tim) is working with me on the Wikidata import. I'd like to try out the log configuration and assume we could try loading a single zip file to check the effect.
[11:37:21] I made the files available via a symlink at http://wikidata.dbis.rwth-aachen.de/downloads/split/
[11:38:26] the first file had already been renamed to wikidump-000000001.ttl.gz.fail - I am moving it back to the original state now
[13:02:10] seppl2023 sounds like you're making progress. I'm not sure if I'd be much help, but if you need anything feel free to reach out
[14:08:08] dcausse did you want to do pairing today?
[14:08:38] oops nm, I see you are OOO
[14:34:52] \o
[14:49:39] hmm, mjolnir daemon failing to load on 2001 :( I replaced the environment but thought I restarted all the daemons. guess not...
[15:57:28] workout, back in ~40
[16:09:10] * ebernhardson finally realizes everything is wonky because ipv4 is up, but ipv6 is unreachable
[16:09:18] my laptop has been annoying this morning :P
[16:14:31] nope, still being wonky :( after disabling ipv6 `git review` works, but the gerrit web interface is having timeouts
[16:35:04] back
[16:44:29] I am working on https://wiki.bitplan.com/index.php/Wikidata_Import_2023-04-26#Logfile_issue again. Since I do not have the chat history, I don't have the hints any more about how to change the logging settings.
[16:45:40] What would be the ETA for splitting a single -rw-rw-r-- 1 wf wf 375626798 Apr 29 14:28 wikidump-000000001.ttl.gz file? The first few split files are comparatively large ...
[16:46:42] The log file is already at 30 GB size after 15 mins of operation on a single file ...
[16:46:46] seppl20236: for the logging settings, you could refer to how we do it.
We use this script to run blazegraph: https://github.com/wikimedia/wikidata-query-rdf/blob/master/dist/src/script/runBlazegraph.sh
[16:47:16] around line 108 it's doing `LOG_OPTIONS="-Dlogback.configurationFile=${LOG_CONFIG}"` which is setting a jvm option to tell it where to load the logging config from
[16:47:39] 30G sounds surprising for a log file
[16:49:45] I don't have those specific files in the script and asked for them to make progress. The latest reply was to look in the documentation of the log framework or use a ruby erb template file version. I was hoping there is a simpler approach to get things right, e.g. by having the original files referenced in the script - the one I have is the same?
[16:49:46] #!/usr/bin/env bash
[16:49:46] set -e
[16:49:47] BLAZEGRAPH_CONFIG=${BLAZEGRAPH_CONFIG:-"/etc/default/wdqs-blazegraph"}
[16:49:47] if [ -r $BLAZEGRAPH_CONFIG ]; then
[16:49:48]   . $BLAZEGRAPH_CONFIG
[16:49:48] fi
[16:49:49] HOST=${HOST:-"localhost"}
[16:49:49] CONTEXT=bigdata
[16:49:50] PORT=${PORT:-"9999"}
[16:49:50] DIR=${DIR:-`dirname $0`}
[16:49:51] PREFIXES_FILE=$DIR/${PREFIXES_FILE:-"prefixes.conf"}
[16:49:51] BLAZEGRAPH_MAIN_NS=${BLAZEGRAPH_MAIN_NS:-"wdq"}
[16:49:52] WIKIBASE_CONCEPT_URI_PARAM=${WIKIBASE_CONCEPT_URI_PARAM:-""}
[16:49:52] COMMONS_CONCEPT_URI_PARAM=${COMMONS_CONCEPT_URI_PARAM:-""}
[16:49:53] OAUTH_RUN=${OAUTH_RUN:-""}
[16:49:53] OAUTH_CONSUMER_KEY_PARAM=${OAUTH_CONSUMER_KEY_PARAM:-""}
[16:49:54] OAUTH_CONSUMER_SECRET_PARAM=${OAUTH_CONSUMER_SECRET_PARAM:-""}
[16:50:05]          -XX:+ParallelRefProcEnabled"}
[16:50:05] else
[16:50:06]     GC_LOGS=${GC_LOGS:-"-Xloggc:${LOG_DIR}/${GC_LOG_FILE} \
[16:50:06]          -XX:+PrintGCDetails \
[16:50:07]          -XX:+PrintGCDateStamps \
[16:50:07]          -XX:+PrintGCTimeStamps \
[16:50:08]          -XX:+PrintAdaptiveSizePolicy \
[16:50:08]          -XX:+PrintReferenceGC \
[16:50:09]          -XX:+PrintGCCause \
[16:50:09]          -XX:+PrintGCApplicationStoppedTime \
[16:50:10]          -XX:+PrintTenuringDistribution \
[16:50:10]          -XX:+UnlockExperimentalVMOptions \
[16:50:11]          -XX:G1NewSizePercent=20 \
[16:50:11]          -XX:+ParallelRefProcEnabled \
[16:50:12]          -XX:+UseGCLogFileRotation \
[16:50:12]          -XX:NumberOfGCLogFiles=10 \
[16:50:13]          -XX:GCLogFileSize=20M"}
[16:50:13] fi
[16:50:25]     f) CONFIG_FILE=${OPTARG};;
[16:50:25]     n) PREFIXES_FILE=${OPTARG};;
[16:50:26]     w) WIKIBASE_CONCEPT_URI_PARAM="-DwikibaseConceptUri=${OPTARG} ";;
[16:50:26]     m) COMMONS_CONCEPT_URI_PARAM="-DcommonsConceptUri=${OPTARG} ";;
[16:50:27]     r) OAUTH_RUN=" mw-oauth-proxy-*.war";;
[16:50:27]     k) OAUTH_CONSUMER_KEY_PARAM="-D${OAUTH_PROPS_PREFIX}.consumerKey=${OPTARG}";;
[16:50:28]     s) OAUTH_CONSUMER_SECRET_PARAM="-D${OAUTH_PROPS_PREFIX}.consumerSecret=${OPTARG}";;
[16:50:28]     b) OAUTH_NICE_URL_PARAM="-D${OAUTH_PROPS_PREFIX}.niceUrlBase=${OPTARG}";;
[16:50:29]     i) OAUTH_INDEX_URL_PARAM="-D${OAUTH_PROPS_PREFIX}.indexUrl=${OPTARG}";;
[16:50:29]     g) OAUTH_WIKI_LOGOUT_LINK_PARAM="-D${OAUTH_PROPS_PREFIX}.wikiLogoutLink=${OPTARG}";;
[16:50:30]     ?) usage;;
[16:50:30]   esac
[16:50:31] done
[16:50:31] pushd $DIR
[16:50:32] JETTY_RUNNER=${JETTY_RUNNER:-$(echo jetty-runner*.jar)}
[16:50:32] # Q-id of the default globe
[16:50:33] DEFAULT_GLOBE=${DEFAULT_GLOBE:-"2"}
[16:50:33] # Blazegraph HTTP User Agent for federation
[16:50:45]      -Dhttp.userAgent="${USER_AGENT}" \
[16:50:45]      $WIKIBASE_CONCEPT_URI_PARAM \
[16:50:46]      $COMMONS_CONCEPT_URI_PARAM \
[16:50:47]      $OAUTH_CONSUMER_KEY_PARAM \
[16:50:47]      $OAUTH_CONSUMER_SECRET_PARAM \
[16:50:48]      $OAUTH_INDEX_URL_PARAM \
[16:50:48]      $OAUTH_NICE_URL_PARAM \
[16:50:49]      $OAUTH_WIKI_LOGOUT_LINK_PARAM \
[16:50:49]      $OAUTH_SESSION_STORE_HOSTNAME \
[16:50:50]      $OAUTH_SESSION_STORE_PORT \
[16:50:50]      $OAUTH_SESSION_STORE_KEY_PREFIX \
[16:50:51]      $OAUTH_ACCESS_TOKEN_SECRET \
[16:50:51]      $OAUTH_BANNED_USERNAMES_PATH_PARAM \
[16:50:52]      -Dorg.eclipse.jetty.annotations.AnnotationParser.LEVEL=OFF \
[16:50:52]      ${BLAZEGRAPH_OPTS} \
[16:50:53]      -cp "$JETTY_RUNNER:lib/logging/*" \
[16:50:53]      org.eclipse.jetty.runner.Runner \
[16:51:57] I'll try to use the irccloud client again - it seems to handle large messages better
[16:53:06] better off not pasting long messages into chat, use a paste service
[16:53:21] The level of detail in the log is ridiculous
[16:53:37] https://www.irccloud.com/pastebin/zVkQPfRf/logsnippet
[16:53:52] irccloud does that automatically ...
[16:54:05] for the issue at hand, there isn't really going to be a super easy solution.
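[editor's note] The LOG_OPTIONS mechanism mentioned at [16:47:16] can be sketched in a few lines of shell. The default config path and the echoed java command tail below are illustrative assumptions, not the exact runBlazegraph.sh contents:

```shell
#!/usr/bin/env bash
# Sketch of how a logback config gets wired into the JVM via LOG_OPTIONS,
# following the pattern in runBlazegraph.sh.
# The default path below is an assumption -- point LOG_CONFIG at your file.
LOG_CONFIG=${LOG_CONFIG:-"/etc/wdqs/logback.xml"}
LOG_OPTIONS=""
if [ -n "${LOG_CONFIG}" ]; then
    # This JVM system property tells logback where to load its config from.
    LOG_OPTIONS="-Dlogback.configurationFile=${LOG_CONFIG}"
fi
# In the real script these options end up on the java invocation that
# launches jetty-runner; here we just echo the resulting command shape.
echo "java ${LOG_OPTIONS} -cp jetty-runner.jar org.eclipse.jetty.runner.Runner ..."
```

If the file named by `-Dlogback.configurationFile` is missing or malformed, logback silently falls back to its default configuration, which is one way to end up with the very verbose output described above.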
You'll need to create a logback configuration (likely based on the erb template we use: https://github.com/wikimedia/operations-puppet/blob/production/modules/query_service/templates/logback.xml.erb) and then you'll need to ensure the commandline that starts blazegraph includes
[16:54:07] -Dlogback.configurationFile=path/to/logback.xml
[16:54:49] I was figuring I simply need a "proper" logback.xml that is set to reasonable settings.
[16:55:38] I don't know that one exists
[16:56:48] https://github.com/ruby/erb looks a bit outlandish to me ... I have never used anything like that for this purpose. It's quite a hurdle ...
[16:58:46] erb is the templating syntax, the `<%- if @evaluators -%>` and such. hmm
[17:01:05] https://www.irccloud.com/pastebin/ES6UpjfG/logback.xml%20
[17:02:32] seppl2023: you could try https://phabricator.wikimedia.org/P47293 which I've just stripped down from the erb; you'll have to set the log path. No clue if it will work, but plausibly
[17:04:30] the reason you need the fancy one and not the one you pasted is that you are trying to reduce logspam. The one you pasted is about what it loads by default; what we use includes rate limiting and focused throttling of logging
[17:06:59] so I have terminated service/loadall.sh -s 1 -e 1, killed the blazegraph service, and removed the jnl file; at this time the log file was 87 GB
[17:10:06] After restarting, the logfile is 30 MB and the blazegraph service is reachable; bigdata/#status looks ok via a browser.
[17:25:38] https://wiki.bitplan.com/index.php/Wikidata_Import_2023-04-26#with_pasted is running and only garbage collector messages show up in the logfile
[17:27:27] So that's an improvement that makes importing feasible. I'd love to see the messages of the importer as was the case when I tried things in 2018, so that the progress can be measured ...
[17:32:53] lunch, back in ~45
[18:46:37] seppl2023: not sure what logs you'd like to have.
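[editor's note] For readers without access to the pastes linked above, a minimal logback.xml in the same spirit can be sketched like this. It uses logback's stock DuplicateMessageFilter as a crude way to throttle repeated messages (the Wikimedia erb template uses more targeted rate limiting); the log directory, rotation counts, and size limits are all assumptions:

```shell
#!/usr/bin/env bash
# Write a minimal logback.xml: WARN-level root logger, size-capped rolling
# file appender, and suppression of messages that repeat verbatim.
# LOG_DIR is a placeholder assumption -- set it to your actual log directory.
LOG_DIR=${LOG_DIR:-/var/log/wdqs}
cat > logback.xml <<EOF
<configuration>
  <!-- Drop messages that repeat verbatim more than 3 times (cuts logspam) -->
  <turboFilter class="ch.qos.logback.classic.turbo.DuplicateMessageFilter">
    <allowedRepetitions>3</allowedRepetitions>
  </turboFilter>
  <appender name="file" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${LOG_DIR}/wdqs-blazegraph.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
      <fileNamePattern>${LOG_DIR}/wdqs-blazegraph.%i.log.gz</fileNamePattern>
      <minIndex>1</minIndex>
      <maxIndex>5</maxIndex>
    </rollingPolicy>
    <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
      <maxFileSize>100MB</maxFileSize>
    </triggeringPolicy>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="WARN">
    <appender-ref ref="file"/>
  </root>
</configuration>
EOF
```

With rotation capped at 5 files of 100MB each, the log can never grow past roughly 500MB, which directly addresses the 87 GB runaway described above.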
The loadData.sh script should log each chunk that is going to be processed: https://github.com/wikimedia/wikidata-query-rdf/blob/master/dist/src/script/loadData.sh#L38
[18:46:50] feel free to send a CR with improvements!
[18:57:49] https://wikitech.wikimedia.org/wiki/Conftool#The_tools
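[editor's note] Those per-chunk lines are enough to track import progress with standard tools. A rough sketch, assuming loadData.sh's stdout was redirected to a file named load.log and that each chunk produces a line containing "Processing" (the exact wording, the log filename, and the chunk total are assumptions):

```shell
#!/usr/bin/env bash
# Rough import-progress check: count per-chunk "Processing" lines in the
# captured loadData.sh output. LOAD_LOG and TOTAL are illustrative defaults.
LOAD_LOG=${LOAD_LOG:-load.log}
TOTAL=${TOTAL:-100}
# grep -c exits non-zero when nothing matches; "|| true" keeps set -e scripts
# alive, and ${done_count:-0} covers a missing log file.
done_count=$(grep -c 'Processing' "${LOAD_LOG}" 2>/dev/null || true)
echo "chunks started: ${done_count:-0} / ${TOTAL}"
```

Run periodically (e.g. under `watch`), this gives the measurable progress asked for at [17:27:27] without changing the importer itself.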