[00:43:15] inflatador: I think we banned & depooled the hosts a week too early: https://phabricator.wikimedia.org/T365998
[00:44:07] It's a bit confusing tho since it says July 18 on one spreadsheet page but July 23 on another
[09:17:57] just deployed the new wdqs updater to production, seems to work smoothly so far :)
[09:18:00] pfischer: ^
[09:18:24] dcausse: awesome, thanks!
[09:19:25] dcausse: what are you monitoring? just the dashboard?
[09:19:54] pfischer: I look at kafka metrics at
[09:19:55] https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-kafka_broker=All&var-topic=codfw.rdf-streaming-updater.mutation-main&var-topic=codfw.rdf-streaming-updater.mutation-scholarly&var-topic=eqiad.rdf-streaming-updater.mutation-main&var-topic=eqiad.rdf-streaming-updater.mutation-scholarly&from
[09:19:57] =now-1h&to=now
[09:20:20] but also checking a few events with kafkacat -b kafka-main1005.eqiad.wmnet:9092 -t eqiad.rdf-streaming-updater.mutation-main -o end
[09:21:01] surprisingly this time the scholarly graph has more activity than the main graph...
[09:46:00] Hm, I thought there was a disproportionately large number of scholarly articles (hence the split)? What's the expected ratio?
[09:49:04] pfischer: it's roughly a 50/50 split but the stream reflects the edit activity on wikidata, and that activity is, I think, not entirely proportional to the size of the subgraph
[09:49:47] I suspect that this will vary a lot depending on what bots are running
[09:53:39] I see a lot of edits from https://www.wikidata.org/wiki/User:Cewbot (e.g. around 20 edits in a few seconds on a single item: https://www.wikidata.org/w/index.php?title=Q33700126&action=history)
[09:56:01] and https://www.wikidata.org/wiki/Special:RecentChanges?hidebots=1&hidecategorization=1&limit=50&days=7&urlversion=2 appears to be full of scientific publications
[09:58:14] lunch
[10:46:38] I am switching Java 11 Jenkins jobs from Buster to Bullseye with https://gerrit.wikimedia.org/r/c/integration/config/+/1055171/
[10:46:59] so maybe something will break, but most probably nothing will happen ;)
[12:02:58] hashar: thanks!
[13:34:27] \o
[13:37:19] o/
[15:01:47] dcausse (and maybe pfischer): https://meet.google.com/eki-rafx-cxi (retrospective)
[16:02:05] dinner
[16:38:15] * ebernhardson is mildly surprised to see cindy vote +1 3 times in a row. I guess those fixes worked
[17:00:58] Yay, Cindy!
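A minimal sketch of the kafkacat spot check dcausse mentions above, reading the most recent events from the mutation topic. The -C / -o -5 / -e flags (consume the last five messages per partition, then exit) are standard kafkacat options added here as an assumption; only the broker and topic come from the log.

    # peek at the five most recent mutation events and exit
    kafkacat -C -b kafka-main1005.eqiad.wmnet:9092 \
      -t eqiad.rdf-streaming-updater.mutation-main \
      -o -5 -e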
[18:14:30] inflatador: off to lunch rn, won't be around for pairing
[19:07:21] * ebernhardson realizes something is missing in the config...but not 100% sure what yet :P
[20:03:02] getting some shard allocation failures on eqiad chi...checking it out now
[21:52:04] hmm, cookbook code mentions the cirrus write queue, wonder if that is obsolete since we have the streaming updater now: https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/force-shard-allocation.py#L18
[21:52:21] looking
[21:52:38] the function itself appears to be in spicerack/elasticsearch_cluster.py
[21:52:43] in the spicerack repo that is
[21:53:01] ahh, most likely the answer is yes but i was going to read it to be sure :)
[21:53:43] ES says the shards won't move 'cause: `shard has exceeded the maximum number of retries [5]`
[21:53:50] inflatador: yea, looking through it seems to be unused, it only ever gets passed around in constructors or set as a property, but i don't see anything using it
[21:54:03] maybe in a cookbook, checking
[21:54:23] ebernhardson y, it's used in the cookbook I linked above
[21:55:00] inflatador: that's still just passing it into a constructor though, it's not actually doing anything with it
[21:55:51] ebernhardson ah, sorry I missed that. I guess that means I can run the cookbook without it complaining?
[21:56:10] inflatador: yea, all instances in the cookbooks are just passing it into the constructor. I'm 99% certain they can all be simplified away
[21:58:09] ACK, I'll create a task for it. Meanwhile, I ran the cookbook and it allocated the shards properly
[22:15:26] * inflatador is not sure why the elastic1100 "node not indexing" alerts are still firing
[22:15:36] hmm
[22:16:15] shards are definitely moving to the host...checking it out
[22:16:16] inflatador: it's because it doesn't have any active shards yet, it decided to start by shifting 3 commonswiki_file shards and 1 wikidatawiki_content shard
[22:16:29] i'm guessing the other servers also grabbed some tiny shard that was able to start quickly
[22:16:36] ebernhardson well that was a fast solve ;)
[22:19:04] mostly i looked at `curl http://localhost:9200/_cat/shards | grep elastic1100` and tried to guess at what it means :)
[22:19:26] but the main reason a server could be running and not indexing is that there are no shards to write to, or i guess it could have fallen out of the cluster
[22:19:34] but i guess if it fell out, it wouldn't give stats? uncertain
[22:21:11] ebernhardson it was banned for a maintenance that took place this morning. I unbanned it ~90m ago and when I went back to check, that's when I noticed the shard allocation problem
[22:23:30] usually the alerts go away pretty quickly, especially when shards are being added. But I guess if it's all large shards that explains it
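For reference, a hedged sketch of the manual equivalent of the shard-allocation recovery discussed above, using standard Elasticsearch cluster APIs. The localhost:9200 endpoint is an assumption, and this is not a claim about what the force-shard-allocation cookbook does internally.

    # ask Elasticsearch why a shard is unassigned
    curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'

    # clear the "exceeded the maximum number of retries [5]" state and retry allocation
    curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

    # confirm shards are landing on the unbanned node
    curl -s 'http://localhost:9200/_cat/shards' | grep elastic1100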