[01:31:45] ebernhardson: new `wmf-elasticsearch-search-plugins` should be ready to rock: https://apt.wikimedia.org/wikimedia/pool/component/elastic65/w/wmf-elasticsearch-search-plugins/
[07:59:10] ejoseph: want to discuss that Game of Life task?
[08:34:29] Yeah sure
[08:34:35] Good morning
[08:35:59] https://meet.google.com/eir-wwuv-usx
[09:05:20] ejoseph: if you want to make your generations animate, here's an example of a crude ascii art animation - https://github.com/thatcherclough/AsciiAnimator/blob/master/src/main/java/dev/thatcherclough/asciianimator/Animator.java
[09:31:37] ok
[10:02:33] errand
[10:10:26] dcausse: is there any good way to determine if the bootstrap job works properly?
[10:11:48] zpapierski: you could perhaps run the opposite job (state-extract-job) and make sure the revision maps are identical?
[10:12:01] ah, makes sense
[10:12:02] the state-extract-job is super slow
[10:12:08] that's the issue
[10:12:15] how slow is slow?
[10:12:33] might take 6+ hours
[10:12:56] ah, that's ok, I'll check it out later
[10:13:28] just the size of the resulting flink savepoint should be a good indication
[10:14:14] I'd expect between [100M, 200M] partition sizes when listing the content of the savepoint
[10:57:13] errand, after that lunch
[11:06:32] lunch
[11:08:15] dcausse: -rw-rw-r-- 1 zpapierski wikidev 4.4G Nov 18 11:07 6ed10c3a-de41-4ad1-a7d0-6e1e6a00317e
[12:42:04] 366M per partition (assuming parallelism 12), this is more than wikidata apparently (wikidata is around 240M * 12)
[12:43:05] since we assumed commons to be smaller I guess we should do a quick check to understand why that's the case
[13:56:41] ejoseph: let me know when you want to get back to plugins
[13:59:32] dcausse: any pointers to docs or a CLI command for state extraction?
[14:01:32] zpapierski: see ./flink-1.12.1-wdqs/state-extraction-job.sh as an example run on stat1004:~dcausse
[14:01:58] will do, thx!
[14:22:26] hmm, weird, it times out while waiting for a heartbeat from a taskmanager
[14:33:34] zpapierski: is it something related to spark?
[14:34:08] no, it's a Flink thing - the state extraction job puts too much strain on the task manager, at least in my configuration
[14:34:19] the solution was to limit the number of slots on each
[14:34:35] (memory strain, to be exact)
[14:35:03] oh ok, I am having similar timeout issues running spark. anyways...
[14:35:22] give us some more details
[14:37:24] zpapierski: `Executor heartbeat timed out` and then `Container container_xxx exited from explicit termination request`. I don't think it is due to memory issues or a complex query, because I have run even heavier queries with no issues. NB I am using pyspark from JupyterLab
[14:37:53] a specific query times out?
[14:37:57] I get a lot of lost executors as well.
[14:38:34] or are queries generally problematic now?
[14:39:13] not really. a few queries time out, a few run fine, but I cannot figure out why. Also it is not consistent; for example, a few queries that ran fine earlier are showing errors now.
[14:39:43] huh, this sounds like a cluster issue - ottomata, can you help or point to somebody who can?
[14:40:49] but Joseph mentioned some work last week that probably reset the clusters; not sure if that is still going on, or if more work is happening that may be causing these inconsistencies.
[14:47:12] zpapierski: humm... so that query now ran successfully.
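A minimal sketch of the savepoint size check dcausse describes above (listing the contents of a Flink savepoint and comparing per-partition state size against the expected ~100M-200M range). The path, parallelism, and thresholds come from the conversation or are assumptions; this is not part of the actual tooling.

```python
#!/usr/bin/env python3
"""Rough sanity check of a Flink savepoint's state size: sum the state files
and compare against the expected ~100M-200M per partition mentioned above.
The savepoint path and parallelism are placeholders."""
import os
import sys

def state_file_sizes(savepoint_dir):
    """Return {path: size_in_bytes} for every file under the savepoint directory."""
    sizes = {}
    for root, _dirs, files in os.walk(savepoint_dir):
        for name in files:
            path = os.path.join(root, name)
            sizes[path] = os.path.getsize(path)
    return sizes

if __name__ == "__main__":
    savepoint_dir = sys.argv[1]                              # e.g. /path/to/savepoint-xxxx
    parallelism = int(sys.argv[2]) if len(sys.argv) > 2 else 12
    sizes = state_file_sizes(savepoint_dir)
    total = sum(sizes.values())
    per_partition = total / parallelism / 1024**2
    print(f"{len(sizes)} files, {total / 1024**3:.2f} GiB total, "
          f"~{per_partition:.0f} MiB per partition at parallelism {parallelism}")
    # Rule of thumb from the chat: roughly 100-200M per partition is expected here.
    if not 100 <= per_partition <= 200:
        print("per-partition size is outside the expected [100M, 200M] range")
```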
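The chat leans toward a cluster-side cause for the `Executor heartbeat timed out` errors, but if it turned out to be executor memory or GC pressure instead, the usual knobs look roughly like this. Values are illustrative assumptions, not a recommendation for this cluster.

```python
from pyspark.sql import SparkSession

# Illustrative settings only; the right values depend on the cluster and the query.
spark = (
    SparkSession.builder
    .appName("timeout-debugging")
    .config("spark.executor.memory", "8g")               # more heap per executor
    .config("spark.executor.memoryOverhead", "2g")       # headroom so YARN doesn't kill the container
    .config("spark.network.timeout", "600s")             # default 120s; covers long GC pauses
    .config("spark.executor.heartbeatInterval", "60s")   # must stay well below spark.network.timeout
    .getOrCreate()
)
```

If executors are being killed by YARN rather than stalling on GC, the container logs (`yarn logs -applicationId ...`) usually say so.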
[15:12:23] tanny411: i think joseph can help better than i can atm, but yeah come over to #wikimedia-analytics and ask
[15:13:00] ottomata: Thanks, will do :)
[16:02:14] retrospective: https://meet.google.com/ssh-zegc-cyw cc: dcausse, ejoseph
[18:18:26] dinner
[18:39:57] ebernhardson: so what needs to be done to get the swift plugin operational? I imagine just upgrading the deb package on the hosts and doing the rolling restart?
[18:40:32] ryankemper: yup, after that it's a couple of web requests and waiting, hopefully :)
[18:40:50] (and then once the index is on the cluster, more work to finish up)
[18:56:01] During today's attempt at importing WCQ into Blazegraph, the database screamed at me because I hadn't defined wcq as a namespace. Is there a straightforward way to do this?
[18:56:18] I tried looking it up and it didn't look straightforward, but perhaps I didn't look hard enough
[18:57:05] hare: i'm not super familiar, but there is a createNamespace.sh script in the wikidata/query/deploy repo that we probably use
[18:57:26] hare: the wcqs-data-reload.sh script in there is probably relevant to you as well
[18:57:51] Would you look at that, there is a createNamespace script. Thank you
[18:58:11] :)
[19:00:48] I take it there isn't an officially supported streaming updater for WCQ at this time?
[19:01:56] hare: i have a test instance running in not-prod, i think right now it will mostly generate the right updates, but it will retrying 404's for entity data on pages with no entities
[19:02:13] so it basically issues lots more queries to mediawiki than it should
[19:04:09] s/will retrying/keeps retrying/ i dunno why english is still hard :P
[19:14:05] There's something about the way we wait for the write queues to drain during cirrussearch rolling operations that is just... wrong...
[19:14:23] In theory once we freeze writes the write queue should only be monotonically decreasing, but that's not really what I see in
[19:14:25] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%22now-1h%22,%22now%22,%22eqiad%20prometheus%2Fops%22,%7B%22exemplar%22:true,%22expr%22:%22kafka_burrow_partition_lag%7B%20%20%20%20group%3D%5C%22cpjobqueue-cirrusSearchElasticaWrite%5C%22,%20%20%20%20topic%3D~%5C%22%5B%5B:alpha:%5D%5D*.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite%5C%22%7D%22,%22requestId%22:%22Q-7c06ead2-0811-4e45-ba1f-3ec41fffcf53-0A%22%7D%5D
[19:24:10] i never trust the way burrow reports partition lag, it includes guesswork
[19:24:31] i think it's trying to avoid complaining about consumers that don't exist anymore or something, only reporting ones that are currently consuming but behind
[19:24:43] so if we pause and stop consuming, the metrics get wonky
[19:28:34] But after we kick off a rolling operation and freeze writes, we should still be consuming from the queue, just not pushing more stuff onto it from the mediawiki side, right?
[19:28:55] actually, i'm totally misremembering there. When we pause i don't think the queue itself really gets paused; rather, the jobs re-enqueue themselves with a delay
[19:29:07] there was an intention to, but i don't think we ever actually did
[19:29:24] Ah yes, I forgot that impl. detail, you're right that the jobs re-enqueue
[19:29:56] mediawiki should still be adding new things to the queue as pages are edited and whatnot as well
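A hedged sketch of the "is the write queue actually draining?" check ryankemper describes, using the same kafka_burrow_partition_lag metric as the Grafana link above via the Prometheus HTTP API. The Prometheus URL is a placeholder and the topic matcher is simplified; as noted above, burrow's lag reporting involves guesswork once consumers pause, so treat this as a heuristic.

```python
"""Check whether the cirrusSearchElasticaWrite backlog is monotonically
decreasing over a recent window, via Prometheus' query_range API.
PROMETHEUS is a placeholder endpoint, not a real one."""
import time
import requests

PROMETHEUS = "http://prometheus.example.org"  # placeholder
QUERY = (
    'sum(kafka_burrow_partition_lag{'
    'group="cpjobqueue-cirrusSearchElasticaWrite",'
    'topic=~".*cirrusSearchElasticaWrite"})'  # simplified vs. the expr in the Grafana link
)

def lag_series(minutes=60, step="30s"):
    """Summed consumer lag samples over the last `minutes` minutes."""
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query_range",
        params={"query": QUERY, "start": end - minutes * 60, "end": end, "step": step},
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [float(v) for _, v in result[0]["values"]] if result else []

def is_draining(values):
    """True if the backlog never grows between consecutive samples."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))

if __name__ == "__main__":
    samples = lag_series()
    print("draining" if is_draining(samples) else "backlog grew at least once in the window")
```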
[19:48:44] porting a couple of metrics to alertmanager... it seems like we need not only the metric that alerts, but a second metric that says the first metric is valid. For example, mjolnir seems to be alerting against an old prometheus metric name we don't use anymore
[19:49:40] or maybe the metric has to be more carefully constructed to fail if no data is provided
[20:03:58] Since the only thing blocking the restarts is the wait-for-write-queue part, we're disabling that part of the cookbook temporarily and just opting to sleep 10 minutes instead: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/739924
[20:04:22] next week we'll meet with someone who understands the jobqueue stuff well and figure out how to fix the actual problem
[20:18:59] looking at the graph over a longer timespan, it seems we are regularly backlogging there and delaying writes. It might be capacity
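On the alertmanager point above, about needing "a second metric that says the first metric is valid": one common pattern is to export a last-success timestamp alongside the data metric, so an alert can distinguish a bad value from stale or missing data. A minimal sketch with prometheus_client; the metric names are made up, and this is not how mjolnir is actually instrumented.

```python
"""Sketch of a data metric plus a companion 'freshness' metric. An alert on the
data metric can then also require the freshness metric to be recent, e.g.
time() - example_job_last_success_timestamp < 900, instead of silently firing
(or silently not firing) against a metric name that no longer exists."""
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names, for illustration only.
LAG_SECONDS = Gauge("example_job_lag_seconds", "How far behind the job is")
LAST_SUCCESS = Gauge("example_job_last_success_timestamp", "Unix time of the last successful measurement")

def record(lag_seconds: float) -> None:
    """Publish the data metric and stamp the companion metric at the same time."""
    LAG_SECONDS.set(lag_seconds)
    LAST_SUCCESS.set_to_current_time()

if __name__ == "__main__":
    start_http_server(9100)       # expose /metrics for Prometheus to scrape
    while True:
        record(lag_seconds=42.0)  # placeholder measurement
        time.sleep(60)
```

The other option mentioned above, constructing the alert expression itself to fail on missing data, is typically done in PromQL with absent() or an `unless` clause.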