[08:42:02] pfischer: o/ would you be able to look into T378983 today?
[08:42:03] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983
[09:07:43] I wonder if we put too many jobs in the sequential pool, DAGs are having a hard time recovering from the mjolnir task that got stuck for several days
[10:29:04] gehel: sorry to bother you with this again, but could you do another sonarcloud rename for https://gerrit.wikimedia.org/r/c/search/highlighter/+/1080384?
[10:29:19] errand+lunch
[10:55:47] dcausse: already on it, discussed it with urbanecm
[11:35:07] dcausse: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1087432
[13:19:04] pfischer: thanks! looking
[13:27:01] dcausse: thanks. Shall I back-port it? According to urbanecm, Growth could live with the reverted config change that enabled stream processing
[13:31:37] pfischer: I'm fine waiting too, the branch cut already happened, which means it'll only go live Thursday next week
[13:43:34] pfischer: is it a good idea to regenerate the broken link suggestions at this point? or would it be more helpful to wait for the patch above to go live?
[13:43:57] i admit i'm not sure what the difference is between the two versions
[13:58:36] the more i'm thinking about it, the more backporting seems appealing, as it would allow us to test that the eventgate system is working, without risking a similar issue recurring at a later point. if we wait for the train and then switch the flag back at the beginning of a week (so we have more time to test),
[13:58:41] that also seems ok to me.
[13:58:49] ultimately, your call, just adding my 2c.
[14:15:54] urbanecm: sure, I'll hurry to create a backport
[14:44:05] o/
[14:44:39] \o
[14:45:40] o/
[14:46:05] dcausse: hmm, we could skip a week of mjolnir if it fixes the sequential queue, but we might need to ponder it in the future
[14:47:32] ebernhardson: I haven't looked too much to see what's pending yet
[14:48:00] not sure I'll get a precise view tho, I suspect depends-on-past DAGs won't be scheduled yet...
[14:49:33] yea we don't have great visibility in there
[14:50:44] hm... can't filter the tasks by queue...
[14:51:16] dcausse: http://localhost:8600/taskinstance/list/?_flt_3_pool=sequential
[14:51:53] ebernhardson: thanks!
[14:52:28] it's only subgraph_query_*_daily so far (5 of them)
[14:54:08] ebernhardson: relatedly I wonder if mjolnir feature selection is leaving some state behind that might cause subsequent runs to fail
[14:54:26] recent runs all failed quite quickly with T379045
[14:54:26] T379045: mjolnir fails with: Partition not found in table 'labeled_query_page' database 'mjolnir' - https://phabricator.wikimedia.org/T379045
[14:55:02] hmm, feature selection shouldn't have changed labeled_query_page contents, it should only read from there
[15:54:27] getting an alert for morelike latencies in CODFW, do y'all think this is cebwiki again?
[15:56:24] inflatador: most likely yes, filed T379002 hoping that it might help a bit...
[15:56:25] T379002: Consider resharding cebwiki_content - https://phabricator.wikimedia.org/T379002
[16:57:25] workout, back in ~40
[17:19:52] dinner
[17:53:26] abck
[17:53:28] or back
[18:00:58] ebernhardson ryankemper sorry for the short notice, but I've been asked to put together a rough hardware forecast for the next 5 yrs for DC Ops/finance (https://phabricator.wikimedia.org/T379079). Anything in particular I should be aware of? Was planning on just looking at dashboard trends
[18:11:33] inflatador: dashboard trends are probably the best we can do, plus some estimate for future work that varies from historical patterns, like vector queries. iiuc querying the vectors isn't too expensive, but updating them can be. So we don't know what it needs :P
[18:28:24] ebernhardson ACK, thanks... I know it's not the easiest request to fulfill ;)
[18:50:08] lunch, back in ~40
[19:30:34] back
[19:54:24] curious, `scap pull` on snapshot1012 asked me for an mwdeploy password, although `sudo -l` shows NOPASSWD: ALL for mwdeploy
[19:57:51] I'm wondering what splitting the graph means in terms of HW forecasting... you've got 2 graphs, one will probably get used more than the other, one's bigger than the other...
[19:59:31] dr appt, back in ~90m
[21:25:48] back
[21:26:27] back
[21:57:18] hmm, using this promQL query (https://paste.opendev.org/show/826017/) it looks like the eqiad hosts average ~30GB more disk usage than the codfw hosts. Not that it really matters, but I thought it was interesting
[22:00:14] also possible my query is goofed
[22:43:45] ryankemper not sure how far back you're looking, but I can't find anything in Thanos older than 1 year. According to this (which I wrote myself, but was told by godog) there should be 5 years of retention: https://wikitech.wikimedia.org/wiki/Thanos#Metrics_Retention
[22:56:59] inflatador: I was wondering if maybe we just hadn't been collecting the metric for longer than that. but it does seem suspiciously close to 1 year. I can get results 370 days out but not much more than that
[22:57:50] yeah, the Thanos page says 54 weeks
[22:58:21] Even if we didn't have 5 years, you'd think we could go back a bit farther
[23:02:24] ryankemper: still around? Shipping a mw fix for dumps in a few minutes and will need to turn it back on in puppet
[23:03:59] in theory, we revert this one and it starts working: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1083136
[23:05:16] ebernhardson: yeah
[23:05:32] ebernhardson: i'm taking the dog out real quick, any issue if i revert that in 10 mins or so?
[23:06:43] ryankemper: it'll be longer than that to get the mw patch out anyways
[23:06:49] perfect
[23:13:58] ryankemper I neglected to mention, the due date for this forecast is EoD tomorrow, so don't feel like you have to finish right away
[23:40:34] ebernhardson: back in action
[23:43:49] ebernhardson: I got up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087598, lmk when I should merge
[23:54:38] ryankemper: hopefully not too long, scap is still deploying, but at least it's up to building k8s images now