[07:56:49] * gehel is in a talk from Gradle
[07:57:04] * gehel is not suggesting we switch to Gradle !
[07:57:40] https://scans.gradle.com/#maven seems interesting. I might have a look at how it works for our projects.
[08:13:51] Example of a build for Spring: https://ge.spring.io/s/3ohjbp7npidi2/timeline
[08:14:31] Note that not all features are in the free version. And I don't think any of that is open source
[08:26:13] More links: https://gradle.com/enterprise-customers/oss-projects/
[08:26:29] Yes, this is just marketing
[09:18:49] ebernhardson: about ssh, don't you have keys configured per host in your .ssh_config?
[09:55:55] And an example report for WDQS: https://scans.gradle.com/s/j25xqqppqs7xk
[09:56:28] Nothing super surprising in this report. Still:
[09:57:07] 6'30'' of build time is not insane, but it is clearly way too much (this was done from my laptop with a crappy 4G internet connection)
[09:57:51] there are 2 tests on RdfClientIntegrationTest that together take a minute. Might make sense to optimize (or maybe even disable)
[10:02:01] surprisingly (to me) the most expensive modules to build are rdf-spark-tools (3'00'') and Wikidata Query Service Streaming Updater - Producer (2'58''). Those are not shown in the report (or I did not find them), but are available in the logs.
[10:02:22] The report does not seem to show time per module or aggregated time per plugin, which is disappointing.
[10:05:25] I'm wondering if it would make sense to split this into 2 (or maybe 3) projects, so that it is easier to not build everything each time
[10:07:15] build logs are way too verbose, it's borderline impossible to find anything, and I'm wondering if my console buffer isn't the bottleneck on part of the build
[10:19:25] spark & streaming updater are scala projects, maybe that's why?
[10:19:33] lunch
[13:03:14] greetings
[14:02:05] o/
[14:13:45] which tech team would we be working with for update pipeline/jobqueue?
[14:21:44] mpham: hard to tell, but I'd guess the data platform, core platform, dse and some service ops would be involved
[14:25:49] cool thanks
[15:00:37] hello everyone, https://meet.google.com/eki-rafx-cxi
[15:01:23] dcausse: ^
[15:01:31] inflatador: ^^
[16:31:42] not finding any good answers :S in python land, before every test starts, the logger is replaced with a buffer, and if the test passes we throw the buffer away without printing it. So test output only includes failures. But is this possible in junit?
[16:32:27] the underlying problem is that the wikidata.query.rdf repo now generates ~85MB of stdout per CI run
[16:42:12] 85mb even when all the deps are cached in .m2?
[16:42:38] dcausse: yes, giving jetty a proper logger instead of StdErrLog increased output by ~60MB
[16:42:47] maybe 70 actually
[16:43:20] basically that's the effect of moving the logging from the war to jetty itself
[16:43:28] sigh...
[16:43:48] we could perhaps configure that logger to be more silent for the build
[16:44:03] i guess that would work too, although in principle i like full logs for test failures.
[16:44:36] it's currently at debug log level for the root i think, so it can certainly be made quieter
[16:46:23] perhaps i'm just more surprised that something i thought was pretty standard is actually a foreign concept in java. Reading stack overflow threads it seems like at least some subsection of programmers want all those stack traces for passing tests to print
[16:48:52] in general you always see test output indeed; if the test fails you generally have an exception. But yes, with java the nice ASCII art progress bars are replaced with a wall of text :)
[16:51:07] but successful tests should rarely dump exception stack traces
[16:51:20] unless they test actual failure modes
[16:53:45] dcausse: even before this, my output is filled with things like 'o.w.q.r.u.c.KafkaStreamConsumer - Failed to commit offsets (retrying)\njava.lang.Exception: simulated failure' followed by 100+ lines of stack trace from 4 nested exceptions
[16:54:37] i guess in php or python if the tests are running and i see an exception, something went wrong. In java that seems more likely to mean something went right (the exception was expected)
[16:56:06] * ebernhardson should submit to the java way and never look at real command output, instead making an ide intermediate everything :P
[17:01:34] :)
[17:06:50] haha
[17:07:56] also wow ~85MB of stdout per CI run is insane
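On the question above about only printing log output for failing tests: JUnit has no built-in equivalent of pytest's log capture, but a test rule can approximate it. Below is a minimal sketch assuming JUnit 4 and Logback as the SLF4J backend; the rdf repo may use a different logging stack, and the class name here is illustrative.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.junit.rules.TestWatcher;
import org.junit.runner.Description;
import org.slf4j.LoggerFactory;

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.Appender;
import ch.qos.logback.core.read.ListAppender;

/**
 * JUnit 4 rule that buffers log output per test and only replays it when the
 * test fails. Use with: @Rule public LogOnFailure logs = new LogOnFailure();
 */
public class LogOnFailure extends TestWatcher {
    private final ListAppender<ILoggingEvent> buffer = new ListAppender<>();
    private final List<Appender<ILoggingEvent>> detached = new ArrayList<>();
    private Logger root;

    @Override
    protected void starting(Description description) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        root = context.getLogger(Logger.ROOT_LOGGER_NAME);
        // Swap the console appender(s) out for an in-memory buffer.
        for (Iterator<Appender<ILoggingEvent>> it = root.iteratorForAppenders(); it.hasNext();) {
            detached.add(it.next());
        }
        detached.forEach(root::detachAppender);
        buffer.list.clear();
        buffer.start();
        root.addAppender(buffer);
    }

    @Override
    protected void failed(Throwable e, Description description) {
        // Only a failing test gets its captured log lines printed.
        for (ILoggingEvent event : buffer.list) {
            System.err.println(event.getLevel() + " " + event.getLoggerName()
                + " - " + event.getFormattedMessage());
        }
    }

    @Override
    protected void finished(Description description) {
        // Restore the original appenders whether the test passed or failed.
        root.detachAppender(buffer);
        buffer.stop();
        detached.forEach(root::addAppender);
        detached.clear();
    }
}
```

Separately, maven-surefire's redirectTestOutputToFile option can at least keep per-test output out of the console log, though it doesn't discard output for passing tests.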
[17:16:14] sigh... all these elastic responses we captured in unit tests have to be regenerated...
[17:19:38] I might look at a couple and if only "total hits" has changed then I'll hack them probably
[17:19:51] dinner
[17:23:41] we should ponder how to replace tests that are hard to regenerate; there are some specific ones i've manually hacked (or `sed -i ...`) instead of regenerating because it's painful
[17:24:30] i wonder about some abstraction layer like how elasticsearch does their integration testing with yaml files. but maybe for another time :)
[17:38:03] lunch, back in time for SRE pairing
[17:53:14] * ebernhardson learns that cirrus used to have a safeifier, which i apparently helped remove in 2016, so not the first time I'm learning about it :)
[18:32:01] mpham: i can't seem to comment on the doc, but for 'Sanitizer fixed: Checked ratio based on 30 day rolling average is <0.0001', a better metric is probably the total number of documents fixed over 2 weeks. There are actually two nested cycles in there: every 2 weeks it visits all the pages. If a page is wrong we always fix it; otherwise 1:N pages (currently 8, previously 4) are reindexed
[18:32:03] anyways, to ensure changes to how mediawiki/cirrus represents things are reflected within the indices
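The loop described above is roughly the following; this is an illustrative sketch, not the actual CirrusCheck/saneitizer code, and the names are made up.

```java
/**
 * Every two-week cycle visits every page. A page whose indexed document
 * doesn't match the wiki is always fixed; of the pages that already match,
 * 1 in REINDEX_FRACTION is reindexed anyway so that changes in how
 * mediawiki/cirrus represents documents eventually reach the whole index.
 */
class SaneitizerSketch {
    private static final int REINDEX_FRACTION = 8; // currently 8, previously 4

    void visit(long pageId, int cycleNumber) {
        if (!indexMatchesWiki(pageId)) {
            reindex(pageId); // wrong document: always fixed (counts toward the "fixed" metric)
        } else if (pageId % REINDEX_FRACTION == cycleNumber % REINDEX_FRACTION) {
            reindex(pageId); // correct document: refreshed once every N cycles anyway
        }
    }

    // Placeholders standing in for the real comparison / update jobs.
    boolean indexMatchesWiki(long pageId) { return true; }
    void reindex(long pageId) { }
}
```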
[18:32:41] ebernhardson: oops, i changed the edit permissions
[18:33:16] mpham: thanks! works now
[18:33:39] if we use a metric of total number of docs fixed over 2 weeks, do we have a goal? i imagine lower is better?
[18:37:44] mpham: 0 :P
[18:38:06] lemme check historically, i think when i looked before we were at multiples of where we were a couple years ago
[18:38:48] gehel: pairing?
[18:50:25] mpham: not as bad recently as it was before everything got shut off, surprisingly: https://grafana-rw.wikimedia.org/d/2DIjJ6_nk/cirrussearch-saneitizer-historical-fix-rate?orgId=1
[18:50:29] (had to make a new dashboard)
[18:51:10] when did we switch it off again?
[18:51:19] dec 2021?
[18:51:56] mpham: that flat spot late 2021 to beginning of 2022
[18:53:17] it suggests to me, without strong proof, that job queue reliability, which was somewhat increased as part of getting it turned back on, relates to how incorrect our indices are
[18:56:51] Superficially it looks like things are currently in better shape than before we turned it off. But I think I remember hearing that we think things are actually worse?
[18:58:41] mpham: you're reading it right, actually the current cycle looks to be among the best it's done since 2018 (hard to say if 2018 data of ~500/2wk means the same thing without more checking).
[18:59:44] mpham: I can't say entirely why; as mentioned above, i suspect recent changes to increase job queue resources have improved things for the better, for now
[19:01:52] mpham: i'm not expecting those to last though: the fixes to job queue were band-aids and only improved things marginally, and the architecture of jobqueue can't really be scaled much more without significant development effort (re: https://phabricator.wikimedia.org/T300914#7704502)
[19:02:18] mpham: i suppose what's been worrying me is the long term trend
[19:02:48] i've re-saved the dashboard with a different y-max that perhaps makes it clearer: 2019 through jan 2021 it's fairly low, and things have been building ever since
[19:03:21] thanks. makes sense. You said you think the long term trend is cumulative in nature? as opposed to something drastic crashing suddenly
[19:03:53] It seems like this is a lagging indicator, and I wonder if there's a better leading indicator that we can use as a KR
[19:06:10] mpham: this is certainly a lagging indicator; a direct indicator might be the cirrus update error rate (CirrusSearchChangeFailed log topic), but because those are logs instead of metrics we only have 90 days
[19:06:18] we could add something to make a metric there i suppose
[19:06:43] but there is also the suspicion that not everything that fails fails there; we don't have strong proof, but it doesn't seem like job queue reliably runs every job we send to it
[19:07:01] (but 1 lost in 1 million sent is hard to notice)
[19:08:24] mpham: in terms of cumulative, mostly i mean that the trend line on the historical fix rate graph was flat for a few years up to 2021, then over 2021 it trended up, becoming more pronounced in 2021-11, until we finally turned the whole thing off
[19:09:56] i guess that's a bit confusing above, but turning it off wasn't related to the fix rate. Just why the graph flatlines
[19:10:15] why did we turn it off again? i forget
[19:11:21] mpham: it was backlogging the job queue; job queue wasn't keeping up with the rate of jobs it wanted to run iirc
[19:13:42] https://phabricator.wikimedia.org/T266911 is the direct ticket, although it has little info. It links https://phabricator.wikimedia.org/T266762 as further info
[19:14:51] i suppose if we wanted to still use the total number of documents fixed within the 2 week period as a metric, could we use the average of that 2019-jan 2021 period as a reasonable goal?
[19:15:28] mpham: yea, that makes sense to me and aligns with how i think my worry about the saneitizer came about
[19:26:46] mpham: actually i misspoke; on that occasion of turning off the saneitizer we turned it back on after a few weeks. It was turned off again when the commonswiki index was deleted, since it was trying to fix a missing index; attempts to turn it back on failed due to job queue backlogging and it stayed off for some time
[19:27:20] was turned off here: https://phabricator.wikimedia.org/T295705#7522547
[19:28:09] this is all kinda coming back to me. Is that big spike afterwards when we had to rebuild the index from scratch again?
[19:30:51] mpham: i'm not entirely sure why it would have spiked so quickly there when we turned it on in late jan; the rate there is quite high
[20:10:22] quick workout, back in ~30
[20:14:28] lunch
[20:57:06] back
[21:23:25] sorry, been back
[22:12:28] ryankemper just a heads-up, banned deployment-elastic05 from the ES cluster. I'm out for the day, will continue decom stuff tomorrow
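For context on the ban above: banning a node amounts to setting an allocation exclusion on the cluster so shards drain off the named node and nothing new is allocated to it. The actual ban was presumably done with the usual tooling; this is just a sketch of the underlying cluster-settings call using the Elasticsearch low-level Java REST client, with a placeholder host.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class BanNode {
    public static void main(String[] args) throws Exception {
        // Connect to any node of the cluster; host/port are placeholders.
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Exclude the node by name: shards move off it and none are allocated to it.
            Request request = new Request("PUT", "/_cluster/settings");
            request.setJsonEntity(
                "{\"transient\": {\"cluster.routing.allocation.exclude._name\": \"deployment-elastic05\"}}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```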