[08:32:32] hare: it will generally work, but the journal file will get bigger - at least if you use the script
[08:34:01] for some reason, deleting a namespace doesn't purge the data, so each time you reload, it gets another wcqs dataset's worth of dead data
[08:34:19] you can work around this - we also considered rotating blazegraph instances - feeding a standby one with a fresh journal.
[08:36:38] that's probably a better solution, at least until the streaming updater is runnable outside of the WMF (I still think that EventStreams is the best way to go - simply pushing events from there to kafka would actually require no change to the Streaming Updater)
[08:49:46] dcausse: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/739105/2 - I see no reviewers, want me to review this, or is it WIP?
[08:51:39] zpapierski: sigh... jenkins forgot to add reviewers, it should be ready for review, so yes, please take a look at it when you have time
[08:51:51] sure thing
[08:52:37] weird that it didn't add reviewers, that's never happened to me
[08:59:18] it happens sometimes, not sure under which conditions though...
[09:31:56] hmm, apparently IRCCloud had a hiccup
[09:32:52] dcausse: I don't understand this part: "but subsequent operations on the same entity are not allowed and causes the patch to be concretized and applied before reaching the size limit or poll timeout" - could you clear it up a bit for me?
[09:34:03] sure
[09:34:42] you can accumulate multiple edits on the same entity in the same patch
[09:35:13] this will compact all these diffs to retain only what will be modified in the end
[09:36:02] but when an entity is deleted/reconciled we don't want to accumulate further edits on this same entity
[09:37:13] this is because when we accumulate events we're losing the ordering of events
[09:39:36] e.g. four events on Q1: E1 {import Q1: add:stmt1}, E2 {diff Q1: add:stmt2, del:stmt1}, E3 {delete Q1}, E4 {import Q1: add:stmt2}
[09:41:19] we have to stop at E3 and perform the delete of Q1 on blazegraph before re-importing Q1 via E4
[09:41:44] reconciliation is similar to deletes
[09:42:05] that I get, so it's the same for reconciliation?
[09:42:40] yes, because for reconciliation we want to delete "blindly" using WHERE statements
[09:43:16] so we can't really optimize and accumulate post-reconciliation
[09:43:32] ah I see, we might lose updates that come afterwards because of that WHERE
[09:43:43] ok, this makes sense to me now, thanks!
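(A minimal sketch, not the actual Streaming Updater code, of the accumulation rule described above: diffs on the same entity can be compacted together, but once a delete or reconcile has been folded into the patch, any later event for that entity forces the patch to be concretized and applied first. Class, method and event names are illustrative.)

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: the "delete/reconcile acts as a barrier" rule. */
class PatchAccumulator {
    enum Kind { IMPORT, DIFF, DELETE, RECONCILE }
    record Event(String entityId, Kind kind) {}

    // Latest accumulated operation per entity; real code merges triple-level diffs.
    private final Map<String, Event> accumulated = new HashMap<>();

    /**
     * True when the pending patch must be concretized and applied before this
     * event can be taken in, e.g. E4 {import Q1} arriving after E3 {delete Q1}:
     * the delete must hit Blazegraph before Q1 is re-imported.
     */
    boolean mustFlushBefore(Event event) {
        Event pending = accumulated.get(event.entityId());
        return pending != null
                && (pending.kind() == Kind.DELETE || pending.kind() == Kind.RECONCILE);
    }

    void accumulate(Event event) {
        // Keep only the latest operation per entity; compaction of add/del statements elided.
        accumulated.put(event.entityId(), event);
    }

    void clear() {
        accumulated.clear();
    }
}
```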
[11:18:43] lunch
[12:16:45] dcausse: our presentation for the Streaming Updater is on Dec 1 - hope that's ok? :)
[12:51:22] ejoseph: do you have everything you need to publish version 6.8.2 of the extra-analysis plugin? Do you need help walking through the procedure?
[12:52:55] dcausse, zpapierski: reminder to add your thoughts on https://easyretro.io/publicboard/jPibkLBsemdD7MdOXxf9gI9Lf1D3/c53e4702-e44f-492e-ad09-4cf093dcd182 before the k8s / flink retro
[12:54:39] ah, forgot about it
[13:04:23] added, lunch
[13:07:45] zpapierski: I'll prep some slides soon
[13:07:45] gehel: added things this morning
[13:07:59] dcausse: thanks!
[13:17:05] gehel: i need help
[13:21:37] ejoseph: just finishing something, we can have a call in 15-20', I'll ping you
[13:21:48] Ok cool
[13:28:36] ejoseph: you'll need a GPG key. Create one matching your @wm.o email. GitHub has a reasonably good guide on creating GPG keys: https://docs.github.com/en/authentication/managing-commit-signature-verification/generating-a-new-gpg-key
[13:31:39] you'll need to edit your local Maven settings (`~/.m2/settings.xml`) to add an entry for Maven Central. It should look like:
[13:31:44] https://www.irccloud.com/pastebin/v8EoYmj3/
[13:33:44] Since we don't want to store clear text passwords in a config file, there is a way to set a main password in Maven: https://maven.apache.org/guides/mini/guide-encryption.html
[13:35:00] All of that is a one-time setup. Once it is all ready, you can create a new release and push it to Maven Central by following the instructions in https://github.com/wikimedia/wikimedia-discovery-discovery-parent-pom#release
[13:35:12] Specifically:
[13:35:15] https://www.irccloud.com/pastebin/DoEnG1yK/
[13:36:07] ejoseph: can you give it a try and ping me when you're stuck?
[13:37:28] On it
[13:37:30] Thanks
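(The pastebins above are not reproduced here. As a rough sketch of the setup being described: the `~/.m2/settings.xml` entry is a `<server>` block whose id must match the repository id the parent pom uses for Maven Central - the id below is an assumption - and the password is encrypted per the linked guide, with `mvn --encrypt-master-password` producing the master password to store in `~/.m2/settings-security.xml` and `mvn --encrypt-password` producing the value to paste into `settings.xml`.)

```xml
<!-- Sketch only: the server id is an assumption; use whatever id the
     parent pom's distributionManagement section declares for Maven Central. -->
<settings>
  <servers>
    <server>
      <id>ossrh</id>
      <username>your-sonatype-username</username>
      <!-- output of `mvn --encrypt-password`, kept in curly braces -->
      <password>{encrypted-password-here}</password>
    </server>
  </servers>
</settings>
```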
[15:48:41] Failed to execute goal org.apache.maven.plugins:maven-release-plugin:3.0.0-M1:perform (default-cli) on project extra-analysis: No SCM URL was provided to perform the release from
[15:48:41] M1: MediaWiki Userpage - https://phabricator.wikimedia.org/M1
[15:49:49] gehel: not sure what to do
[15:50:20] ejoseph: strange... I'll have a look
[15:50:46] One thing I forgot (but it seems unrelated) is that you'll need to upload your GPG key to a keyserver
[15:52:45] ejoseph: are you running the release from a clean checkout of the master branch? Can you send the full log?
[15:55:33] on which step is this failing?
[15:55:52] we can try doing this release together tomorrow during our 1:1
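(For context: "No SCM URL was provided to perform the release from" from `release:perform` generally means the plugin could not find its SCM information - either the `release.properties` file written by `release:prepare` is missing, for example when `release:perform` is run on its own or from a fresh checkout, or the pom lacks an `<scm>` section. A sketch of the `<scm>` shape, with placeholder URLs rather than the actual extra-analysis values:)

```xml
<!-- Placeholder URLs only; the real values belong in the project's pom. -->
<scm>
  <connection>scm:git:https://gerrit.wikimedia.org/r/PROJECT</connection>
  <developerConnection>scm:git:ssh://gerrit.wikimedia.org:29418/PROJECT</developerConnection>
  <tag>HEAD</tag>
</scm>
```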
[16:02:00] ryankemper: triage meeting: https://meet.google.com/qho-jyqp-qos
[16:39:12] ok cool
[16:46:00] dinner, back later
[17:11:10] hmm, relforge/dpkg might complain that i installed the plugin previously outside the deb. I guess if the .deb installs, all should be fine
[17:14:52] ebernhardson: seems to have taken fine
[17:15:20] `wmf-elasticsearch-search-plugins/stretch-wikimedia,now 6.5.4-7~stretch all [installed]`
[17:16:36] excellent
[17:19:08] trying to figure out now what's up with 2044 and the old pool flatline alerts that it fired. It does look wedged, but not in an expected way
[17:19:14] {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
[17:19:29] so, i was annoyed that the new alert didn't work and was just annoying. Maybe it's finding problems :P
[17:29:30] i don't get what's happening here though. It starts with `no route to host` briefly to all the masters, then transitions to `connection refused`, where it's been for a few days. Telnet can open connections though; best guess is some circuit breaker in elastic is broken and it thinks it's retrying, but it's not really.
[17:33:09] restarted both elastic instances, looks like they are picking up the pieces
[17:47:02] elastic2043 is struggling
[17:48:47] weird :S
[17:50:49] hmm, i guess we can ban it? Not clear what's up since it just restarted
[17:51:01] perhaps bad luck and it got a bad set of shards because of the 2044 restart
[17:51:05] oh, that was 2044
[17:51:16] ya, hmm
[17:51:38] 2043 shows its old pool dropping to 1.5GB about 20 minutes ago; if not a restart, then that's surprising
[17:51:52] Rolling restarts were ongoing
[17:52:03] I just halted them ~60s ago though, because I'm seeing a lot of poolcounter rejections
[17:52:41] As an aside, this is a further indication that we need to increase our cluster capacity in the medium term
[17:53:09] i think we asked for 20 servers * 2 dcs, scheduled for next Q?
[17:55:11] pretty much... https://docs.google.com/spreadsheets/d/1xwdhKSjp0h8WItg_hobmjV6L4l83aiMeMR5S4_P_bE4/edit#gid=132951534 - 14 new hosts, scheduled for Q3
[17:55:25] 14 per dc, that is
[17:55:39] Okay good, that will take us from 36 to 50, so I'd expect we'll be in good shape with that
[17:57:14] best guess on 2043 is the pending i/o operations being closer to the root
[17:57:21] but not sure what yet
[17:59:11] looks to be better now
[17:59:50] hmm indeed, the graphs just drop right back to "normal"
[18:06:34] ebernhardson: what did you mean by `pending i/o operations being closer to the root`?
[18:06:49] well, specifically, what is the root in that context
[18:08:09] ryankemper: trying to guess at what is actually causing the instance to report massive latency. During that time period read iops jumped from ~2k iops to a peak of 45k iops (following that chain further). Whatever spiked the iops probably triggered the latency
[18:08:32] pending i/o would be because it was asking for more iops than we can get
[18:09:10] Ah I see, I was confused about what 'root' referred to, but now I realize it's root cause, doh :P
[18:09:24] sadly the node comparison graph doesn't make this clear; it's looking at `md1` but some instances use md2
[18:12:13] pulled 1 GB/s from disk, surprisingly not far from the 10Gbit this instance has, but it should all be rate limited in the cluster config
[18:14:48] hmm, no similar network activity from 2043; instances that are transferring data look to be respecting the rate limit
[18:14:59] dinner
[18:20:10] well, going to stop investigating. 2043 started rising usage at 17:30, the same time i restarted the instances on 2044. Too close to be unrelated, but i can't really tell where all the bytes read from disk went or for what purpose. Maybe pulling the cluster tasks when it was still struggling would have said something, but a bit late now
[20:02:41] ebernhardson: meeting? or cancel for today?
[20:02:48] gehel: oh, sec, was distracted
[20:02:58] he, he
[20:15:34] ebernhardson: (for when you're next available) we should take a look at the poolcounter concurrency limit for cirrussearch and see if it needs tuning. We're seeing downstream starvation of mediawiki php workers even when the poolcounter is rejecting half of incoming requests; presumably because they're blocking on synchronous calls to the elasticsearch api
[20:32:34] huh, indeed the pool counters graph looks pretty sad
[20:36:55] checking the per-node percentiles, it looks like one or two instances are having trouble, and that causes the whole cluster to drop its peak qps
[20:37:39] (it also means adaptive replica selection is pretty meh for us :P)
[20:40:03] ouch, the cluster cpu heatmap clearly shows the whole cluster going to half workload any time a couple instances are struggling. Clustering is never as resilient as we would hope :(
[20:43:16] it looks like whatever 2043 did at 17:30-18:00, 2045 just did from 20:00-20:30
[20:45:11] probably pointless but randomly interesting: 2043 pushed both disks to 100% utilization, 2045 pushed one to 100% and the other to only ~60%
[20:47:56] I wonder if the new pool counter limits we just put in place will help the cluster itself
[20:48:13] hmm, only a guess, but i think our root cause is going to be retry spam from a missing commonswiki_file index
[20:48:30] oh, hmm
[20:48:38] regular updates combined with the saneitizer are probably spamming out the job queue? But i don't know how that can be related to the instances having crazy IO :P
[20:48:44] maybe two problems ...
[20:49:41] ebernhardson: Oh, I didn't think about the saneitizer. Would it make sense for us to disable that so we can get the cluster restarted with less pain?
[20:50:13] yes, we should have some time ago. I think it's pausable from a command line job, checking
[20:51:59] meh, i have to figure out how this works :P Apparently i'm also one of those horrible developers with not enough context in error messages: `eqiad is not in the set of writable clusters`
[20:53:18] oh, this is using the "wrong" cluster selector, it's choosing between elasticsearch clusters and not cirrus clusters
[20:54:34] ok, this will be more work than expected :P It seems like the script validates cluster names like `eqiad-chi`, but then of course when asking cirrus for that it doesn't have a cluster named eqiad-chi, it only has eqiad
[20:55:43] Classic
[20:56:10] FWIW we want to disable the saneitizer on codfw as well
[20:56:21] Since it's the codfw restarts that we're blocked on right now
[20:56:38] the more nuclear option is to stop the cron jobs, then it won't enqueue anything. It will just pick up where it left off whenever we turn it back on
[20:56:46] (they are probably systemd timers though :P)
[20:57:03] Nuclear option sounds great to me actually
[20:57:39] Where do those live, mwmaint or something?
[20:57:42] puppet
[20:57:44] * ryankemper will glance at puppet code
[20:58:13] just look for Saneitize in puppet, no one else spells it like us :)
[20:58:59] The name has its perks :P
[20:59:09] Ah, it's a `profile::mediawiki::periodic_job`
[20:59:47] for timing, the latency issues were 17:30-18:00 and 20:00-20:20. The saneitizer fix rate spikes aren't overlapping, so probably not directly related to the latency stuff... still worth stopping for the moment
[21:00:17] gotcha
[21:00:40] ebernhardson: is codfw considered the `active` cluster right now?
[21:00:53] the periodic jobs file mentions that it runs in the active dc
[21:02:25] yes, the job queries all clusters at the same time
[21:02:34] ah i see
[21:02:40] so regardless of where it spawns from, it runs everywhere
[21:03:00] yea, this script enqueues a bunch of jobs; when those jobs run, they fetch from sql and compare all known clusters
[21:03:24] BTW, not sure if you saw it, but https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/740674/1/wmf-config/PoolCounterSettings.php#66 / https://phabricator.wikimedia.org/T296224 are the new values for pool counter
[21:04:21] We should see if that's where we feel comfortable having the poolcounter limits be set going forward... I think the approach of having it be slightly more than the number of workers is probably sane, but we'll ofc want to look at a few days' worth of data to see if we end up w/ rejections during normal peak load
[21:04:56] My gut is telling me it was probably tuned too high previously, so these new values might actually be perfect if we end up wasting less work thrashing
[21:05:13] well, thrashing and/or just having a lot of workers sitting on long-running synchronous requests
[21:06:25] in general, as long as the pool counter isn't dropping legitimate requests it can probably be lower. I don't know that we have good stats to see what the regular steady states are, though
[21:06:43] maybe some have been added, it's been a few years since i looked
[21:10:21] if we need to cut load more over the longer term, we can re-tune the saneitizer profiles as well to do less work each round and spread it out over more weeks
[21:10:47] it's already at 8+ weeks to rerender everything, so it's already a somewhat long time period though :)
[21:14:44] lunch
[22:15:13] as for the topic of re-tuning the saneitizer to spread the work out even further, I think the current 8+ weeks is fine. If we find ourselves needing to do that kind of tuning even after we add the new hosts next quarter, that will be a big indication that we need to scale the cluster up even more :) I suspect the extra 38% capacity (36->50) will be sufficient for us not to need to fiddle with it
[23:00:25] I don't think we have any way of examining/monitoring the queue size of poolcounter, short of capturing all traffic and parsing the keys to get an estimate
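(For readers following the PoolCounter thread above: the knobs being tuned live in `wmf-config/PoolCounterSettings.php` and have roughly the shape sketched below. The numbers shown are illustrative only; the real values are in the linked Gerrit change and are not reproduced here.)

```php
// Illustrative values only - see the linked change for the actual settings.
$wgPoolCounterConf['CirrusSearch-Search'] = [
    'class'    => 'PoolCounter_Client',
    'timeout'  => 15,   // seconds a request may wait for a worker slot
    'workers'  => 600,  // concurrent requests allowed for this pool
    'maxqueue' => 600,  // beyond this many queued requests, reject immediately
];
```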