[08:14:56] o/
[08:45:48] o/
[10:38:37] hello everyone
[10:41:30] My build seems to fail on selenium
[10:41:32] https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/128759/console
[10:42:10] I can't seem to pinpoint what's wrong
[10:46:49] ejoseph: hey, looking
[10:47:11] I'm seeing: 2022-01-03T13:11:31.898Z ERROR @wdio/runner: Error: connect ECONNREFUSED 127.0.0.1:45287
[10:47:26] and it probably means that the failure is unrelated to your patch
[10:48:17] I'd look at the other errors and make sure they all seem unrelated, then ask CI to recheck by replying to your own CR with the simple comment: "recheck"
[10:50:16] yes, this is the sole failure I can identify, I think it's unrelated
[10:50:56] if this problem occurs too frequently we should file a task in phab (but since it's core it's likely that someone already filed one)
[10:52:48] ejoseph: looks like I already replied "recheck" yesterday
[10:53:00] and then your patch went green
[10:54:07] Oh I see
[10:54:14] Thanks
[10:58:57] dcausse: I scheduled another meeting for 1pm GMT+1
[10:59:12] ejoseph: sure
[12:44:14] lunch
[12:44:18] break
[14:06:58] Greetings, Searchers! Happy New Year
[14:07:06] o/
[14:08:50] Happy New Year, everyone!
[14:09:55] happy new year! :)
[14:10:03] happy new year!
[14:12:00] Wearing my favorite gift today, maybe you'll get to see it if you're lucky! https://www.pennersinc.com/collections/guayaberas-short-sleeve/products/original-guayabera-short-sleeve?variant=26459440087140
[14:14:23] nice!
[14:18:25] fighting with the guava dependency mess with hadoop is no fun... (java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkNotNull(Ljava/lang/Object;Ljava/lang/String;Ljava/lang/Object;)Ljava/lang/Object)...
[14:20:15] It's Ljava/langs all the way down!
[14:22:09] dcausse: is that during the build? or at runtime?
[14:22:22] during tests
[14:22:23] in the "refinery" project?
[14:22:26] so I get the above when forcing v22
[14:22:32] and com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
[14:22:36] when forcing 16.0.1
[14:22:54] he, he, he...
[14:22:56] which project?
[14:23:11] gehel: it's since I bumped the eventutilities project, which uses this new Preconditions
[14:23:17] in the rdf repo
[14:23:44] the refinery project must be having the same problem though...
[14:23:53] but they force 16.0.1
[14:26:56] let's upgrade hadoop!
[14:27:31] inflatador: did you see a bunch of BlazegraphFreeAllocator alerts popping up in your emails?
[14:27:45] I won't find a guava that's compatible with both...
[14:27:56] yes
[14:28:19] "Blazegraph is misbehaving and will rapidly corrupt its journal."
[14:28:37] I can check on these
[14:28:46] this is a somewhat new alert, I assume that we need to tune it. Could you open a phab task and have a look at the graphs, see if you understand what's going on?
[14:29:36] no emergency on that one, but we should reduce the noise
[14:29:51] I have meetings for the next 1.5h, but I can give you an overview after that
[14:30:06] Sure, will get started
[14:30:34] dcausse: should we downgrade the version used in eventutilities? Or at least make sure it only uses methods compatible with 16.0?
[14:31:02] gehel: it's a solution but I wonder how it's working in refinery
[14:31:28] unless there's no test and guava is somehow forced to a newer version by spark or the like
[14:41:06] https://phabricator.wikimedia.org/T298525 created to check out the Blazegraph alerts. Hit me up if anyone has any further advice on this one
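For context on the "forcing" in the guava exchange above: pinning a single guava version in a Maven build is conventionally done through dependencyManagement. A minimal sketch, assuming a standard pom layout (16.0.1 is the version refinery reportedly forces; everything else here is illustrative):

    <dependencyManagement>
      <dependencies>
        <!-- force one guava version for every transitive dependency;
             the bind here is that eventutilities needs a newer guava
             while hadoop's FileInputFormat needs an older one, so no
             single pinned version satisfies both -->
        <dependency>
          <groupId>com.google.guava</groupId>
          <artifactId>guava</artifactId>
          <version>16.0.1</version>
        </dependency>
      </dependencies>
    </dependencyManagement>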
[14:44:42] inflatador: the alert is managed by alertmanager in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-search-platform/blazegraph.yaml
[14:45:36] the "-0.02" has been handpicked and might be prone to false positives
[14:48:09] what it must do is detect actual problems like the ones which occurred for wdqs1006 & wdqs1004 on 2021-10-22
[14:48:21] c.f. https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=32&orgId=1&from=1634432957833&to=1635863164943
[14:49:03] Thanks dcausse, knowing what an "actual problem" looks like is crucial. I also noticed the "Source" link in the alert email points to prometheus.svc.codfw.wmnet, which doesn't seem to exist
[14:51:15] hm indeed... never noticed this, not sure we can control this from the alert definition though
[14:53:59] inflatador: there's an excellent doc about all this at https://wikitech.wikimedia.org/wiki/Alertmanager
[14:54:36] Cool, will read it. I know a little Prometheus, but obviously not our particular setup
[15:20:12] dcausse: are you available
[15:20:20] ejoseph: yes
[15:38:06] ^^ Wow, a coherent doc for alerting. Wish my old job had stuff like that... ;P
[16:00:23] \o
[16:06:02] o/
[16:23:00] meh, paging through a few weeks of emails. Guess I should take a few days and clean up the regular failures we've been ignoring
[16:30:24] puppet hadn't run on wcqs-beta since late October, rather than fixing it I've turned on `disable-puppet` and expect that instance will be deleted around the end of the month
[16:31:21] sounds like a good plan!
[16:31:31] inflatador: want some more context on those alerts?
[16:31:38] IRC or meet?
[16:31:55] gehel: sure, let's do a meet
[16:32:14] meet.google.com/zgi-yuzu-tna
[16:32:34] * ebernhardson isn't clear if you needed me or not, seems unrelated but maybe not :P
[16:32:41] ebernhardson: nope, unrelated
[16:32:57] but feel free to join if you are feeling lonely!
[17:07:14] redeployed mjolnir with the fat refinery jar, query clustering has been failing because the default jar is no longer a fat jar (and we take the "latest" instead of hard-coding a version number)
[17:07:24] looks to be running (at least, it doesn't fail in 30s)
[17:08:59] dinner time
[17:10:23] ended up upgrading a few things... (spark 2 -> 3, scala 2.11 -> 2.12) not a bad thing but I hope that'll work well on the cluster
[17:13:40] well, maybe a bad thing... how will I run a spark3 app from an-airflow1001?
[17:19:49] hmm, well when we tested spark 2 while spark 1 was deployed, all it required was an uncompressed spark 2 archive
[17:19:56] maybe spark 3 will work the same? I can test
[17:20:30] basically spark is "smart" enough to ship everything into hdfs on every run if it's not provided an hdfs deployment to pull from
[17:22:08] for airflow maybe it would work by creating a second spark connection and setting the SPARK_HOME in that, as long as the CLI args are still compatible (probably)
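A sketch of that "second spark connection" idea, assuming an Airflow 1.x install whose SparkSubmitHook reads spark-home and spark-binary from the connection extras; the connection id and paths below are made up:

    # register a second spark connection whose extras point at a spark 3 install
    airflow connections --add \
        --conn_id spark3_submit \
        --conn_type spark \
        --conn_host yarn \
        --conn_extra '{"spark-home": "/srv/deployment/spark3", "spark-binary": "spark-submit"}'

Jobs that should run on spark 3 would then set their operator's conn_id to spark3_submit while everything else keeps using the default connection.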
[17:22:24] I never took calculus... need a little help translating https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-search-platform/blazegraph.yaml#5 into English. I interpret it as "Alert if the average number of free allocators over 10 minutes continues to drop more than 2 percent for longer than 2 minutes", does that sound right?
[17:23:00] ebernhardson: so I'd need a spark3 deb to install on the airflow machine?
[17:23:31] dcausse: yea I think so
[17:23:39] ok
[17:23:45] inflatador: hmm, usually I derive the units through things and make sure it matches. Just a sec :)
[17:26:54] inflatador: have to play with the numbers, but I'm suspicious of what exactly the [10m] does. I'm also not sure that it's a percentage being given. Going to take a moment longer :P
[17:27:07] hm.. 0.02 is the derivative so not really a percentage
[17:27:29] I dunno if you're familiar, but I play with the queries in https://grafana-rw.wikimedia.org/explore to figure out what they mean
[17:27:40] deriv(v range-vector) calculates the per-second derivative of the time series in a range vector v, using simple linear regression.
[17:28:28] dcausse: what I'm not clear about is if that means every point is (X(t-10m) - X(t)) / t or if it's doing something else
[17:28:50] * ebernhardson probably has the wrong equation, but I mean: is it just taking the endpoints or is it doing something fancy?
[17:29:00] I hope it takes the oldest and newest point and draws a line, but I might be wrong
[17:30:14] I hope so too :) looking
[17:33:27] OK, so the derivative basically means the rate of change, and is basically self-contained within this equation? I can certainly play around with the .02 value, thanks for the clarification
[17:34:35] so for an example, 1003 dropped from 233026 to 233025 at 16:56. That shows up in the derivative as 0.0025 at 17:00 (4 minutes after the change), going back to 0 at 17:05. The awkwardness here is going to be its desire for per-second numbers
[17:35:48] using [5m] instead changes the peak to .0050. Not sure what equation would give that, 1/(5*60) would give .0033
[17:38:24] inflatador: right, it's the rate of change, I guess what I was trying to figure out is what the exact rate being provided is. The number output seems fairly arbitrary, but I don't have better ideas currently :P
[17:41:21] delta(blazegraph_free_allocators{instance="wdqs1003:9193"}[10m]) seems to provide a number more aligned with what we wanted: `calculates the difference between the first and last value of each time series element in a range vector v`
[17:41:41] (probably has other equally awkward edge cases :P)
[17:42:36] agreed, can't make sense of the deriv output when varying the time range
[17:42:51] We don't really need to understand the exact meaning of those numbers for this particular alert (even if yes, it's always nicer to understand what we do)
[17:43:28] Playing with a limit that matches when we were having trouble but does not alert when we were not is sufficient
[17:46:12] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%221634874303575%22,%221635118517708%22,%22eqiad%20prometheus%2Fops%22,%7B%22exemplar%22:true,%22expr%22:%22delta(blazegraph_free_allocators%7Binstance%3D%5C%22wdqs1004:9193%5C%22%7D%5B10m%5D)%22,%22requestId%22:%22Q-97269a55-c8c4-4b8b-9752-82d070c3bf28-0A%22%7D%5D
[17:46:17] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%221634883748390%22,%221635104820285%22,%22eqiad%20prometheus%2Fops%22,%7B%22exemplar%22:true,%22expr%22:%22blazegraph_free_allocators%7Binstance%3D%5C%22wdqs1004:9193%5C%22%7D%22,%22requestId%22:%22Q-f270164d-b148-46c3-8aca-a7152cd4811b-0A%22%7D%5D
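Taken together, the discussion above points at a delta()-based rule. A sketch of what that could look like in a Prometheus-style rule file; the threshold, matchers, and annotation here are placeholders that would still need tuning against the 2021-10-22 incident data, not the values actually committed:

    - alert: BlazegraphFreeAllocatorsDecreasingRapidly
      # delta() reports allocators lost over the window directly, which is
      # easier to reason about than deriv()'s per-second regression slope
      expr: delta(blazegraph_free_allocators[10m]) < -50
      for: 10m
      labels:
        team: search-platform
        severity: warning
      annotations:
        summary: 'Blazegraph instance {{ $labels.instance }} is burning free allocators at a high rate'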
[17:49:21] For the 10-22 to 10-24 issue, did we take any steps to remediate?
[17:50:26] inflatador: we most likely restarted the machines
[17:50:39] checking sal
[17:52:02] yes: "16:40 dcausse: restarting blazegraph on wdqs1004 and wdqs1006 (free allocators alert)"
[17:52:15] c.f.: https://wikitech.wikimedia.org/wiki/Server_Admin_Log/Archive_46
[17:53:43] we restarted the blazegraph service "wdqs-blazegraph", not the machines, sorry
[17:54:22] No worries, I see it in the log
[18:03:44] Correct me if I'm wrong, but it looks like restarting the service didn't free up any allocators?
[18:05:01] inflatador: correct, it just stops the bleeding, it does not restore the allocators that were burnt too quickly
[18:05:59] * ebernhardson notes, if you haven't noticed, blazegraph isn't the most fun software to operate :P
[18:06:17] :)
[18:07:10] sigh... I realize that upgrading the pom to spark3 was actually the easiest part
[18:07:25] not sure where I'll get a viable spark3 deb
[18:07:53] hmm, can we stuff the archive in archiva, git-fat it into the deployment?
[18:08:09] (assuming analytics will solve the how-to-deploy-spark3 problem for us some day)
[18:08:12] and looking at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/spark2/ I'm not sure I'm up to replicating that for spark3
[18:09:01] is it ok to untar something in, say, "/tmp" just before running spark-submit?
[18:09:15] we could have the scap scripts untar it into the deployment
[18:09:47] I dunno if that will do anything funky with disk space, spark is big
[18:10:05] hmm, we could untar when running, I suppose. It's just some waiting
[18:10:20] ah right, an-airflow does not have much disk...
[18:10:42] 42G size, 31G used
[18:10:43] I can look at how to untar something during scap
[18:12:17] what should I take from https://archive.apache.org/dist/spark/spark-3.2.0/ ?
[18:12:34] with or without hadoop?
[18:13:04] I think in the past I used the with-hadoop versions. Not sure if it was necessary
[18:13:13] * ebernhardson is probably encouraging jar hell
[18:13:18] :)
[18:13:38] well, I suppose I can easily try from a stat machine
[18:14:16] the main question will be how compatible the config is, i.e. if you can point spark3 at our deployed spark2 configs and it works
[18:14:58] dcausse got it, thanks for that context
[18:16:47] I was hoping that it would detect the hadoop config automagically like flink, but I'm probably being optimistic :)
[18:39:55] it can reuse the hadoop conf but not the spark2 conf
[18:41:39] it has things like spark.yarn.archive=hdfs:///user/spark/share/lib/spark-2.4.4-assembly.zip which is probably not good for spark3
[18:41:39] lunch
[18:42:22] dcausse: hmm, yea probably not going to work with the spark2 conf. Maybe the hadoop conf is enough though. It's been so long I forget how exactly spark2 was done, I'm thinking maybe analytics had already made a spark 2 conf or something
[18:43:42] I can probably copy/paste most of the conf if the defaults are not great
[18:44:58] also I wonder if the jar built against spark3 is not simply going to just work with spark2
[18:45:06] I changed nothing in the code base
[18:45:40] well, almost nothing
[18:46:47] hmm, couldn't hurt to try
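The "try from a stat machine" experiment might look roughly like this; a sketch assuming the conventional /etc/hadoop/conf location and the with-hadoop tarball from the archive listing above (the file names match the 3.2.0 dist; the rest is illustrative and environment details like kerberos are glossed over):

    # unpack the standalone spark 3 distribution somewhere disposable
    tar xzf spark-3.2.0-bin-hadoop3.2.tgz -C /tmp
    export SPARK_HOME=/tmp/spark-3.2.0-bin-hadoop3.2
    # reuse the cluster's hadoop conf, but deliberately NOT the spark2 conf
    # (it pins spark.yarn.archive to the spark 2 assembly zip)
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    # smoke test: run the bundled SparkPi example on yarn
    "$SPARK_HOME"/bin/spark-submit --master yarn \
        --class org.apache.spark.examples.SparkPi \
        "$SPARK_HOME"/examples/jars/spark-examples_2.12-3.2.0.jar 10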
[18:51:40] meh, should have checked earlier. I restarted the mjolnir dbn job but it doesn't run because we have split traffic between eqiad and codfw
[18:51:55] also, I forgot we have split traffic :P Will check into why and see if we can put it back to normal today
[18:52:25] (well, it runs, but the msearch daemons never respond so it's waiting around)
[19:23:37] I'm changing the discovery-alerts mailing list to discard emails from unknown sources. I think it means two things: a) no more manually rejecting spam email (it's low volume, a few times a month, but annoying nonetheless) for list admins, but also b) failure emails that we haven't whitelisted get discarded and it's easier to not notice
[19:27:03] and back
[19:41:43] Having trouble finding the alerts from 30 Dec in https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly , is my query wrong or is there somewhere else (logstash?) where I can find older alerts?
[19:44:06] something suspicious there, I only see team=sre|perf|performance
[19:45:30] good point, I manually put in "team=search-platform"
[19:45:45] Let me check if that actually returns anything
[19:47:30] @state=!active doesn't seem to return anything, I guess old alerts are somewhere else?
[19:51:08] It seems like they should be there, the website is an app called karma which is supposed to be dashboarding for alertmanager. I wonder if we can see the direct alertmanager ui somewhere
[19:51:53] been digging around https://logstash.wikimedia.org/ , let me know if that's a bad idea
[19:52:21] per the karma readme, "Alertmanager doesn't currently provide any long term storage of alert events or a way to query for historical alerts, but each Prometheus server sending alerts stores metrics related to triggered alerts."
[19:52:33] so, they are somewhere
[19:53:23] logstash is probably fine, I dig around in there regularly :)
[19:59:08] best guess is alerts.wikimedia.org can only show current state, the `alerts overview` dashboard in logstash seems to be the place to find historical alerting information
[19:59:43] Cool, I'm bumbling my way through logstash. If you have any example queries let me know. Trying to figure out prometheus labels vs logstash fields
[20:00:13] inflatador: https://logstash.wikimedia.org/goto/0f894d3902bb5cfe0cd36ee235102068
[20:01:23] yea, the impedance mismatch between our various systems makes translating a pain :(
[20:07:59] Thanks ebernhardson, exactly what I needed
[20:35:26] lunch
[21:23:20] back
[22:49:03] https://phabricator.wikimedia.org/T298525 PR for your perusal! (Hope I did this right)
[22:49:39] Gerrit link: https://gerrit.wikimedia.org/r/c/operations/alerts/+/751513/
[22:50:46] inflatador: the commit message needs a bit more, the overview is at https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines
[22:51:32] Who's got 2 thumbs, just read that article, and STILL got it wrong?
[22:51:43] inflatador: at a basic level, it needs a first line that describes something, and no empty line between Bug and Change-Id. I'm not entirely sure, but I suspect the parsing for those pieces looks for the last block of `key: value` pairs without a separating newline
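For illustration, a message of the shape being described would look something like this (the title, body, and Change-Id are hypothetical placeholders; only the Bug number is from the log):

    Tune BlazegraphFreeAllocatorsDecreasingRapidly alert

    Adjust the expression so the alert matches the 2021-10-22
    incident without firing on normal allocator churn.

    Bug: T298525
    Change-Id: I0000000000000000000000000000000000000000

Note the footer: Bug and Change-Id form a single `key: value` block at the end, with no blank line between them.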
[22:51:52] does Blazegraph have a query execution timeout independent of the nginx proxy?
[22:52:16] hare: sadly no, which prevents lots of very typical approaches to solving related problems :(
[22:53:16] hare: or at least, not one that is particularly reliable. You'd have to check back in during the EU afternoon to get details, but when I've suggested approaches to resolving problems that required leaning on blazegraph to kill long queries, I was told it can't reliably do that
[22:53:57] Actually that's interesting because I am having a query seemingly... give up, or at least that's what it looks like from the error my script is giving me. The error resembles the one given when the proxy just stops responding and it cuts off in the middle
[22:54:21] https://www.irccloud.com/pastebin/0YyXuLq1/
[22:54:52] inflatador: it's all good, gives practice submitting multiple PRs :) I usually edit commit messages directly from the gerrit UI, but you can also update the local commit and re-send for review
[22:55:14] ebernhardson: just updated the commit message via the UI, lmk if it looks OK
[22:55:31] What's happening there is, the script is attempting to take the text of the response and turn it into JSON, and that error appears to imply that the "JSON object" randomly cut off
[22:56:06] So the question is, is there a failure scenario where Blazegraph's response is just too big?
[22:56:30] hare: hmm, indeed that does look like it's simply cutting the response off in the middle. Are you sure it's blazegraph and not nginx cutting it off?
[22:56:52] hare: mostly, can you query blazegraph directly without the nginx proxy (probably from the host itself) and get a full response?
[22:56:55] I'm accessing the query service from localhost:9999 as opposed to the proxy at https://experimental.orb.rest
[22:57:04] ok, yea that's direct. hmm
[22:57:36] hare: sadly I'm not sure, your best bet would be to catch david tomorrow
[22:59:15] I tried again with `time` and the script fails after <7 minutes. So it's not running into the 10 minute timeout, which isn't relevant anyway since we're not going through my Caddy proxy
[22:59:32] inflatador: better, but I think you still have to drop the newline between Bug and Change-Id.
[23:00:34] inflatador: I suppose it might not matter in practice, it looks like gerritbot still found the patch and mentioned it on the ticket, but generally it's a single block at the end
[23:01:16] no worries, it's better for me to follow everything strictly, at least to start with
[23:01:34] Just fixed it, let me know if that one's OK
[23:04:13] inflatador: seems reasonable to me. I perhaps have a tendency to be more verbose in the body, but there isn't necessarily a bunch to say here. I'll also note it's not required to have a body, if it's a one-line patch and the title says what it does that's fine.
[23:21:59] I'm out, thanks ebernhardson and friends for the help today!
[23:30:35] I manually ran the query outside the context of the script and downloaded the output with curl. Toward the top of the stack trace: com.bigdata.rwstore.sector.MemoryManagerOutOfMemory
[23:31:01] Now it's not like the system is out of memory. Probably just out of memory in that specific context
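The "query blazegraph directly" check discussed above looks roughly like this; a sketch assuming the standard WDQS namespace path on the local port (the query body is a throwaway placeholder):

    # bypass nginx and hit Blazegraph's SPARQL endpoint on the host itself;
    # a complete, parseable JSON body here points the finger away from the proxy
    curl -s -G 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
         -H 'Accept: application/sparql-results+json' \
         --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 5'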
[23:32:02] For reference, I configured the stack size as 128 GB
[23:40:59] hare: the code for that exception references http://jira.blazegraph.com/browse/BLZG-42 for `per-query memory limit for analytic query mode`, but sadly that jira doesn't seem to exist anymore
[23:42:05] *facepalm*
[23:43:42] hare: guessing from reading things, the default limit is 0, which means no limit, use everything the system has available, so it might not be relevant
[23:44:28] not sure really, haven't seen this before :(
[23:46:01] Amazing, thank you for your help
[23:46:08] It appears I have discovered a *new* limitation
[23:46:21] hare: as a total guess, adding a `hint:Query hint:analytic "true"` to the `where { ... }` section might turn on analytic query mode, which is probably intended for expensive queries (which I suspect you are running :)
[23:46:39] http://jira.blazegraph.com/browse/BLZG-42
[23:46:50] err, https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata-rdf/src/java/com/bigdata/rdf/sparql/ast/QueryHints.java#L198
[23:47:11] suggests there are lots of options to tune, which you may not want to bother with though :P
[23:49:24] I'm going to try that analytic mode hint. (Do you think this could help speed up other expensive queries?)
[23:50:57] hare: not clear, I don't really know what it does. It seems to allow giving the query more control over how it's executed, so it can try to avoid limits of the system
[23:51:17] I suspect it might also reduce some limits it normally applies, but not sure
[23:57:35] Unfortunately that didn't appear to be effective
[23:59:54] :(
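For reference, the hint placement suggested above would sit inside the WHERE clause like this; a sketch using Blazegraph's documented queryHints prefix, with a placeholder query body:

    PREFIX hint: <http://www.bigdata.com/queryHints#>
    SELECT ?s ?p ?o WHERE {
      # switch this query to Blazegraph's analytic mode, which runs
      # operators against native memory instead of the JVM heap
      hint:Query hint:analytic "true" .
      ?s ?p ?o .
    } LIMIT 10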