[07:52:49] https://www.irccloud.com/pastebin/GdrjfZzZ/
[07:55:59] I used to create an ETA estimator from the logfiles with a small stats script that created a spreadsheet file. Is there something similar available from the Wikimedia search team, or shall I modify my old approach accordingly?
[08:17:00] seppl2023: nothing is available on our side. We monitor the number of triples in Blazegraph, which gives us an estimate of the time remaining, and the number of chunks processed.
[09:00:58] @gehel - thx - how do you get the number of triples in Blazegraph? Do you run queries simultaneously with the loading?
[09:08:00] we monitor the number of triples as part of our standard monitoring: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=7
[09:10:18] This is coming from this script, which you should be able to replicate: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/query_service/files/monitor/prometheus-blazegraph-exporter.py#214
[09:10:45] Specifically this query: `SELECT ( COUNT( * ) AS ?count ) { ?s ?p ?o }`
[09:13:12] seppl2023: did I read that you are running multiple loaders in parallel? Did you see an improvement in throughput? I remember the data loading being bound by a single CPU in Blazegraph (so I'm not expecting a performance improvement with multiple loading threads, but that was a long time ago).
[09:33:41] I am running the import on two different machines with two different target Blazegraph installations.
[09:34:40] see https://wiki.bitplan.com/index.php/Wikidata_Import_2023-04-26#Progress - where there is now a new statistics script. I might amend the script to take file size and loaded triples into account in the upcoming days. At this time it looks like the ETA is some 5 days
[09:59:50] throughput tends to decrease over time, so your ETA is probably > 5 days. We'll see...
[09:59:52] lunch time!
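For anyone wanting to replicate the exporter's triple-count check locally, here is a minimal sketch in Python using the SPARQL query quoted at 09:10:45. The endpoint URL is an assumption for a local WDQS-style Blazegraph install; the production exporter linked above may differ in its details.

```python
# Minimal sketch (not the production exporter) of the triple-count check,
# using the query quoted in the chat above. The endpoint URL is assumed
# for a local WDQS-style Blazegraph install; adjust to your setup.
# Note that COUNT(*) over the full store can take a while on a large graph.
import requests

SPARQL_ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # assumed
COUNT_QUERY = "SELECT ( COUNT( * ) AS ?count ) { ?s ?p ?o }"

def triple_count() -> int:
    """Return the current number of triples in the store."""
    resp = requests.post(
        SPARQL_ENDPOINT,
        data={"query": COUNT_QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=300,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return int(bindings[0]["count"]["value"])

if __name__ == "__main__":
    print(triple_count())
```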
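And a hedged sketch of how an ETA could be derived from two triple-count samples, reusing the `triple_count()` helper above. The target triple count and sampling interval are placeholders, not official figures, and since throughput tends to fall over time (as noted at 09:59:50) this estimate will be optimistic; the statistics script mentioned at 09:34:40 may work quite differently.

```python
# Hedged sketch: estimate time remaining from two triple-count samples,
# reusing triple_count() from the sketch above. TARGET_TRIPLES is a
# placeholder, and because load throughput tends to fall over time this
# estimate will be optimistic.
import time
from datetime import timedelta

TARGET_TRIPLES = 14_000_000_000   # placeholder for the expected final size
SAMPLE_INTERVAL = 15 * 60         # seconds between the two samples

def eta_estimate() -> timedelta:
    first = triple_count()
    time.sleep(SAMPLE_INTERVAL)
    second = triple_count()
    rate = (second - first) / SAMPLE_INTERVAL   # triples per second
    if rate <= 0:
        raise RuntimeError("no progress between samples")
    return timedelta(seconds=(TARGET_TRIPLES - second) / rate)
```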
[12:56:27] inflatador, ryankemper: one more round of server restarts: T335835. Can you have a look and see when this can be done?
[15:00:12] 5-10m late to office hrs
[15:04:14] office hours: https://meet.google.com/vgj-bbeb-uyi
[15:47:18] feeling sick all of a sudden... going to rest and hopefully get back soon
[17:02:24] OK, I took some Dramamine, hopefully will be better
[17:34:37] does anyone know how to link one's GitHub account to Wikimedia? Would like my contributions to appear on the Wikimedia repos, such as https://github.com/wikimedia/operations-cookbooks/graphs/contributors
[17:39:07] hmm, at some time I knew but have forgotten :(
[17:42:13] inflatador: are you in the GH organization?
[17:42:36] volans I must not be?
[17:42:51] definitely not
[17:43:01] give me the username and I'll add you
[17:43:10] volans thanks! Username is inflatador
[17:45:16] invitation sent
[17:52:34] Hi gehel - I'm seeing some errors about CirrusSearch not being able to connect to labtestwiki re. SaneitizeJobs.php - not sure if they're relevant though
[17:53:07] Access denied for user 'wikiadmin2023'@'mwmaint1002.eqiad.wmnet' (using password: YES)
[17:53:43] just joined, thanks again
[17:53:56] Lunch, back in ~45
[17:55:24] hauskater: known issue, I think that's T328289. I'm not really sure who is responsible for that
[17:55:25] T328289: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289
[17:56:06] ebernhardson: ah, thanks - I'd say the Cloud Services team then
[17:59:12] ebernhardson: I assume the "Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY" is related as well?
[17:59:25] that one is for labswiki, not labtest
[18:01:28] hmm, not sure, but possibly. There is a general problem in that labswiki and labtest are different. Really, the maintenance scripts for those should probably run somewhere other than mwmaint1002, but the current automation runs all wiki maintenance from the one maint host
[18:02:24] ack, just ran into these while looking for something else and thought I should pass the info along :)
[18:55:17] back
[19:07:29] hmm, odd failure case in Airflow. We set max_active_runs=1 and depends_on_past=True. When a task fails it starts the next run, but that run can't continue because it depends on the past. But clearing the old run doesn't start it, because max_active_runs=1 and there is already an active run
[19:07:37] (not everywhere, but a specific task)
[19:08:11] and mostly you can set task states, not DAG run states, so not sure yet how to tell it to make the older DAG run the active one...
[19:20:37] turns out you need to clear the newer dag, that will reset it to a queued state and then it will run the oldest queued dag run
[19:20:44] s/newer dag/newer dag run/
[19:22:30] ryankemper FYI I started the rolling reboot of CODFW for T335835
[19:37:09] Getting a few alerts, but they're clearing. I don't see any shards relocating though, just unassigned. I'm going to stop the cookbook and check what's holding things up
[19:44:27] OK, everything's back to green. Resuming...
[19:50:41] hmm, the same thing is happening again. I wonder if this is a function of us telling ES to wait longer before trying to relocate shards. Unassigned counts are dropping now
[19:50:59] So everything looks OK
[21:35:42] inflatador: that unassigned behavior is pretty normal with restarts. First, Elastic is going to notice the shards are unassigned once the respective nodes have been restarted, and then it's going to figure out where to move them
[21:36:03] by contrast, when we ban a node (disable allocation), shards are going to go straight to relocating for a clean cutover
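To make the Airflow scheduling interaction from 19:07 concrete, here is a minimal, hypothetical DAG configuration showing the max_active_runs=1 / depends_on_past=True combination. The dag_id, schedule, and operator are placeholders, not the actual search-team DAG.

```python
# Minimal, hypothetical DAG illustrating the combination discussed above:
# max_active_runs=1 allows only one active run at a time, while
# depends_on_past=True makes each task wait for its previous run to
# succeed. After a failure the next run starts, blocks on the failed past
# instance, and clearing only the old run doesn't unstick it because the
# newer run already occupies the single active slot. Clearing the newer
# run (resetting it to queued) lets the oldest queued run go first.
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on older Airflow

with DAG(
    dag_id="example_depends_on_past",     # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,
    default_args={"depends_on_past": True},
    catchup=True,
) as dag:
    EmptyOperator(task_id="do_work")
```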
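And, as a hedged illustration of the "ban a node" approach mentioned in the last message, a sketch using the standard Elasticsearch cluster-settings allocation exclusion. The host, port, and node name are placeholders, and the actual cookbooks may handle this differently.

```python
# Hedged sketch of "banning" a node before maintenance so its shards move
# straight to relocating instead of showing up as unassigned after a
# restart. Uses the standard cluster-settings allocation exclusion; the
# endpoint and node name are placeholders.
import requests

ES_ENDPOINT = "http://localhost:9200"   # placeholder endpoint
NODE_NAME = "elastic2042-example"       # placeholder node name

def set_allocation_ban(node_name):
    """Exclude a node from shard allocation; pass None to clear the ban."""
    resp = requests.put(
        f"{ES_ENDPOINT}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._name": node_name}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Ban before restarting the node, then clear the ban once it is back:
# set_allocation_ban(NODE_NAME)
# set_allocation_ban(None)
```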