[07:42:48] o/
[07:43:05] started enwiki import in codfw
[09:23:14] errand
[09:45:00] o/
[09:49:06] o/
[09:50:00] importing to eqiad now
[11:19:42] lunch
[13:14:56] enwiki is loaded, running some load tests to warm it up but not getting great perf yet
[13:16:55] p50 at ~1.5s
[13:35:28] Thanks. At what rate are you firing?
[13:35:55] \o
[13:36:00] o/
[13:36:23] not sure if we ever applied the readaheads there after starting pods
[13:36:49] might not matter though :P
[13:40:36] o/
[13:41:02] 5 concurrent requests, was mainly interested in warming up
[13:41:55] double-checked, indeed it's still all on 8MB readaheads, and the ceph IO graph shows capping at 4GB; i suspect that's a limit on the ceph side since it's the same with more pods
[13:42:18] oh ok, good to know
[13:42:46] with 1 user I'm around 0.5s
[13:44:23] and i was looking at the wrong ceph dashboard, pretty obvious when I open the right one :)
[13:44:52] yea, that'll do it
[13:49:49] random question, do we know why it's set at 8MB? it seems pretty massive, esp. for the SSD tier
[13:50:05] afaict that's just the ceph default.
[13:50:10] oh wow
[13:51:27] although looking around now, i'm finding some things that suggest the default is 512KB, but it's a ceph blog post from 2015, so who knows
[13:52:39] yes... i remember seeing values in that range as well when looking at the ceph docs
[13:53:58] https://docs.ceph.com/en/octopus/rbd/rbd-config-ref/#read-ahead-settings
[14:07:29] looks like query_clicks_hourly finished backfilling (re-ran 3 months), clearing out query_clicks_daily now to regen the click data
[14:07:50] nice!
[14:22:49] ebernhardson: why did we have to backfill? Was data missing?
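A quick back-of-envelope on the readahead discussion above, as a sketch using the numbers from the chat (the ~4 GB/s figure is the observed ceph cap, not a configured limit, and the 512 KiB value is the one from the old blog post):

```shell
# If every random read triggers a full readahead, the observed ~4 GB/s
# ceiling caps the achievable read rate. Compare the 8 MiB setting with
# the 512 KiB value mentioned in the discussion (16x more reads fit).
awk 'BEGIN {
  ceiling_mib_s = 4096                         # ~4 GB/s observed IO cap
  for (ra_kib = 8192; ra_kib >= 512; ra_kib /= 16)
    printf "readahead %4d KiB -> max %4d reads/s\n",
           ra_kib, ceiling_mib_s / (ra_kib / 1024)
}'
```

This is only an upper-bound sketch for fully random reads; sequential workloads benefit from the large readahead, which is presumably why the default leans big.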
[14:38:48] pfischer: for T414103, found the reason we had no data is that in query_clicks_hourly the timestamps were coming out null
[14:39:14] * ebernhardson was expecting a bot to expand that: https://phabricator.wikimedia.org/T414103
[14:39:17] Mjolnir feature collection failing in mjolnir_weekly Airflow DAG
[15:03:48] somehow the ltr plugin isn't released for 2.19.5 :S There is a branch, but not having luck building it yet...
[15:05:45] oh no, i just failed reading... the readme says you have to run gradle with -Dtests.security.manager=false
[15:23:48] oh sheesh, the reason it's not separately released is because it's shipped by default :P i should pay more attention
[15:26:36] in theory that's all the plugins in 2.19.5. sudachi fails a dictionary load though, probably more issues to work out: https://phabricator.wikimedia.org/P89865
[15:27:37] (and a bunch of default plugins we might consider removing)
[15:32:18] daily backfill completed as well, mjolnir seems to be working. it collected ~40M rows in query_clicks_ltr this time
[15:42:45] So... we have a tornado heading in our general direction over the next 10 minutes, possibly with more later today. Not expecting anything to happen, but we are taking precautions. I may be late to triage or miss it.
[15:43:41] Trey314159: ouch, take care!
[15:52:08] Trey314159: Sure, take care!
[15:53:14] it's probably past us, but we're going to keep watching the radar until the next update in ~7 minutes.
[16:11:29] CategoriesQueryServiceUpdateLagTooHigh does not look good: https://grafana.wikimedia.org/goto/dfg7esn01jb40e?orgId=1 Does someone have an idea of what we should do?
[16:52:30] gehel: had a quick look earlier today but found no logs... will try to run the thing manually to see
[17:19:22] gehel: blazegraph seems to restart every 10 mins on the categories endpoint; this seems to cause the loading of some wikis to fail and thus leaves an old timestamp in the categories graph
[17:22:14] seems related to wdqs-deadlock-remediation: sudo journalctl -u wdqs-categories-deadlock-check.service -> "RESTART: wdqs-categories restart issued successfully"
[17:22:21] ryankemper: ^
[17:23:03] dcausse: thx, checking
[17:23:15] prob a bug with the new refactor
[17:33:37] yeah, i see the problem. the way `undef` gets passed through the chain isn't working properly, so the bit of logic that disables lag-checks on categories hasn't been working. working on a fix
[17:40:20] thanks!
[17:42:20] alright, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1253583 is up. will look at pcc to be sure, but this should fix it
[18:03:34] shipping it
[18:08:30] thanks for the quick fix, lag alerts should resolve themselves tomorrow morning
[18:08:33] dinner
[18:12:10] ran puppet on all affected hosts, and i manually ran the check script on wdqs2007 and verified it didn't restart. this is all fixed now
[18:12:46] DOG_WHINING check is critical right now, so stepping out to address the canine production incident, now that the wdqs incident is all resolved
[19:06:44] unsurprisingly... opensearch doesn't like it when sudachi does this: final Class unsafeClass = Class.forName("sun.misc.Unsafe");
[19:23:34] thankfully, it seems that's only invoked when unloading a dictionary. If it managed to load the dictionary, it (probably) won't do that
[19:57:39] Sudachi is better at language, but seems a little less good at software engineering.
[20:45:05] got sudachi working, some PEBKAC there as well :) integration suite mostly passes on 2.19.5. Couple of failures, will have to look into them. Looking promising enough, i suppose i'll work up gerrit patches for the various plugin changes first
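The `undef`-propagation bug described at 17:33 can be illustrated with a shell analogy (this is a sketch, not the actual puppet code; function and value names here are invented): forwarding a parameter verbatim turns "no value given" into an empty string, so a downstream default never applies and the disable branch becomes unreachable.

```shell
# Illustrative only: the undef-propagation failure mode, translated to shell.
check_lag() {
  # Default applies only when the argument is truly absent, not when empty.
  enabled="${1-enabled_by_default}"
  if [ "$enabled" = "false" ]; then
    echo "lag check disabled"
  else
    echo "lag check enabled (${enabled})"
  fi
}
wrapper() {
  # Bug: an absent argument becomes "" here, swallowing the default below.
  check_lag "$1"
}
check_lag   # -> lag check enabled (enabled_by_default)
wrapper     # -> lag check enabled ()
```

In puppet the same shape appears when an `undef` parameter is passed down a chain of classes and arrives as an empty string instead of `undef`, so the comparison that should disable the check never matches.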