[07:25:29] I started to download the dumps on wdqs1009 & wdqs2008
[07:26:14] I'll be bootstrapping flink with this new state
[07:28:30] regarding the data-reload cookbook I think we'll be setting offsets manually for these two machines this time
[07:42:38] sure
[07:47:24] dcausse: for offset transfer, I know that I need to add that to the data-transfer cookbook
[07:47:41] for manual offset set based on timestamp, that's a new cookbook?
[07:48:01] or data_reload?
[07:48:21] by manual I meant running existing code we have
[07:48:38] ideally it would have been part of the data-reload cookbook
[07:49:26] I think I'll copy/paste some of the spicerack module you wrote into a python script and do that manually from e.g. stat1004
[07:50:12] I was about to suggest that :)
[07:50:54] but yes, the next cookbook it'll be useful for is data-transfer
[07:51:20] actually without your new module it'll be a huge pain
[07:51:53] I hacked some version of it
[07:53:58] it's /home/zpapierski/kafka_cluster.py and requires /home/zpapierski/config.yaml
[07:54:21] it doesn't take arguments, I just added a function call at the end
[07:55:33] cool, thanks!
[07:55:50] you're going to use a manual timestamp setup - right?
[07:56:06] I have tests for it, but I haven't tried that one manually
[07:56:42] should be fine, it's just a part of the process that I did try manually
[07:58:25] yes I'll need this so this is a good test
[07:58:54] both for the flink consumers and the blazegraph consumers
[08:00:20] hmm, I wish I had known more about those cookbooks earlier, I'd have cannibalized the data-reload one for the wcqs data reload
[08:01:58] yes but cookbooks have constraints, I doubt they can be used in wmcs, and they still require root
[08:02:56] but with wcqs moving to prod we'll definitely need new ones (or to adapt existing ones)
[08:03:38] by cannibalizing I meant copying all the logic there, extracting it from spicerack/cookbooks, but good point on the update
[08:04:35] btw - is there an existing way in cookbooks to tell which updater is running? To know if we need to do the offset transfer during data-transfer?
[08:05:16] hm.. maybe?
[08:05:56] well for the "transition" the target machine will still be running the old one
[08:06:04] so we might want to force that
[08:06:54] target, yeah - but the source is what I'm interested in here
[08:06:59] but we could assume that if the source is running the new one it means we want to copy offsets
[08:08:00] if the cookbook can inspect the puppetdb then it should be able to know
[08:09:50] zpapierski: if you have time and are interested in testing your module you could set offsets for the flink consumers?
[08:10:27] sure thing, I'd need timestamps for topics, though
[08:10:48] yes
[08:10:51] I'll also hack the newest version (this one is before the refactor)
[08:13:38] P17389
[08:13:54] https://phabricator.wikimedia.org/P17389
[08:14:53] ok, I'll take care of it
[08:15:54] thanks!
[08:37:42] ok, dry run seems ok
[08:37:43] I hope
[08:39:02] huh, actually it works for eqiad and fails for codfw
[08:39:11] weirdly, I might add
[08:42:01] hmm, offsets_for_times doesn't return None on failure to find a message, it will return {topic_partition: None}
[08:42:58] dcausse: is it possible that there are no messages on codfw topics?
[08:50:13] no, there are messages
[08:50:30] I'm not sure when offsets_for_times will return None
[08:55:47] ah, greater or equal - codfw probably hasn't received any new messages since long before that
[08:55:55] I wonder what we should do in that situation
[08:56:14] anyway - I can set the offset for eqiad
[08:59:19] and it's done
[08:59:25] I think :)
[09:11:48] break
[09:12:01] zpapierski: sorry, got distracted
[09:45:35] zpapierski: something's weird with the offsets
[09:46:33] let me guess - off by a factor of 1000?
[09:47:15] committed offset for wdqs_streaming_updater is 2349952093 for kafka-main@eqiad in eqiad.mediawiki.revision-create and this is a message from 2021-10-01T07:53:17Z
[09:47:20] ah perhaps, checking
[09:47:52] I used millisecond timestamps, perhaps those have a resolution of seconds?
[09:48:30] offsets_for_times returns 2341301388 for me
[09:48:52] for timestamp 2021-09-24T23:00:01
[09:49:11] https://www.irccloud.com/pastebin/8YoYYGga/
[09:49:17] 2021-10-01T07:53:17Z seems like today, around the time I stopped flink
[09:49:39] seems ok?
[09:49:53] ah, you say they weren't committed
[09:50:49] zpapierski: this paste is for kafka-main@eqiad and eqiad.mediawiki.revision-create ?
[09:51:19] yes on the cluster and site, but it's a listing for all topics
[09:51:50] 2341299974 seems correct, it's a bit earlier apparently (2021-09-24T22:58:00Z) but probably on purpose
[09:52:03] yeah
[09:52:08] it's because of DELTA
[09:52:27] ah yes I remember
[09:52:29] so apparently it didn't commit
[09:52:39] yes, seems like it did not
[09:52:49] I don't understand, auto commit is on
[09:53:07] auto commit is perhaps only when consuming?
[09:53:23] note to read up on it
[09:53:55] in previous python scripts I explicitly called commit after calling seek
[09:54:13] you might be right, from what I read
[09:54:24] let me hack it real quick
[09:57:33] Lunch
[09:57:57] dcausse: how about now?
[09:58:00] looking
[10:00:44] zpapierski: sounds good now, thanks!
[10:00:55] zpapierski: did you run it on codfw brokers?
[10:00:57] awesome, it turns out we don't need seek at all
[10:01:00] I did, it failed
[10:01:13] no offsets were found
[10:02:18] zpapierski: I mean eqiad.* topics on the codfw kafka brokers
[10:02:35] oh, I can't do that
[10:02:49] I mean I can, I can hack the script
[10:03:02] ah ok I see
[10:03:17] but will this be a part of normal operations? The current spicerack module assumes a correlation between site and prefix
[10:03:56] not needed for the streaming-updater-consumer offsets indeed
[10:04:26] but for flink running in codfw it needs to read (eqiad|codfw).* topics
[10:04:38] but we did not plan to have a cookbook for that yet
[10:04:47] it's only needed for bootstrapping
[10:05:03] and there might be better ways to do that within flink actually
[10:05:06] in any case, I can hardcode and run it
[10:05:20] zpapierski: if you can that'd be awesome
[10:06:59] done, I think - can you verify?
[10:09:07] sure
[10:10:39] zpapierski: sounds perfect, thanks!
[10:11:00] yw :)
[10:11:06] going to start the pipeline and the backfill
[10:11:26] it's on :)
[10:11:57] 1.5 years of our work finally going to production, albeit a bit slowly :)
[10:14:03] :)
[10:26:34] done
[10:26:37] lunch
[10:39:49] backfill is done? that was fast
[10:58:03] ok, now really a break
[11:13:29] 5
[12:05:11] dcausse: hi, I don't know if you're aware but your jobs seem to be failing at a large rate https://phabricator.wikimedia.org/T292048#7394439
[12:14:17] Amir1: no? looking
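A minimal sketch of the timestamp-based offset setting discussed above, using kafka-python. This is not the actual /home/zpapierski/kafka_cluster.py or the spicerack module; the broker, group, topic and timestamp below are placeholders. It illustrates the two gotchas from the log: offsets_for_times returns None values per partition when no message exists at or after the timestamp, and the offsets must be committed explicitly because auto-commit only applies while consuming.

```python
# Sketch only: set consumer-group offsets from a timestamp with kafka-python.
from kafka import KafkaConsumer, TopicPartition
from kafka.structs import OffsetAndMetadata

BROKERS = "kafka-main1001.eqiad.wmnet:9092"  # placeholder broker
GROUP = "wdqs_streaming_updater"             # placeholder consumer group
TOPIC = "eqiad.mediawiki.revision-create"    # example topic from the log
TS_MS = 1632524401000                        # 2021-09-24T23:00:01Z in milliseconds

consumer = KafkaConsumer(bootstrap_servers=BROKERS, group_id=GROUP,
                         enable_auto_commit=False)
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]

# offsets_for_times does not return None itself on a miss: it returns a dict
# whose *values* are None for partitions with no message at/after the timestamp.
found = consumer.offsets_for_times({tp: TS_MS for tp in partitions})

to_commit = {}
for tp, offset_ts in found.items():
    if offset_ts is None:
        print(f"{tp}: no message at or after {TS_MS}, skipping")
        continue
    to_commit[tp] = OffsetAndMetadata(offset_ts.offset, "")

# Auto-commit only happens while consuming, so seek() alone changes nothing
# for the group: commit the offsets explicitly instead.
if to_commit:
    consumer.commit(to_commit)
consumer.close()
```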
[12:15:01] https://phabricator.wikimedia.org/T292048#7394443 it's creating half a million failed jobs per hour
[12:15:38] ouch
[12:16:47] thanks for the ping
[12:16:52] somehow it's at the same time we deployed something for wikidata but the timing doesn't match
[12:17:13] I mean the timing matches but it doesn't make sense
[12:17:42] let me know if I can help with anything
[12:18:20] sure thanks!
[12:20:11] seems to be cloudelastic not prod, probably why nothing else screamed louder
[12:21:52] it increased the rate of logstash intake by 50%, that triggered an alarm :D
[12:22:27] ah ok :)
[13:02:35] gehel: would you have some time to launch a cookbook?
[13:02:58] sure
[13:03:01] what do you need
[13:03:36] it's running the data-reload cookbook with some special options on wdqs2008 and wdqs1009
[13:04:13] dcausse: sure, do you have the options?
[13:05:23] gehel: I don't know the usual options but the special ones I'd need are: --reuse-downloaded-dump --skolemize
[13:05:47] and perhaps --reload-data wikidata as we don't need to reload categories
[13:07:12] do we have a phab task?
[13:07:23] yes: T288231
[13:07:24] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231
[13:07:40] sudo cookbook sre.wdqs.data-reload --reason "data reload for new streaming updater" --reuse-downloaded-dump --skolemize --reload-data wikidata --task-id T288231 wdqs1009.eqiad.wmnet
[13:08:15] no depool ?
[13:08:28] not for wdqs1009, it's a test server, not behind LVS
[13:08:37] I'll add --depool for 2008
[13:09:07] ok
[13:13:26] dcausse: started for both
[13:13:33] gehel: thanks!
[13:14:07] do you have an estimate of how long this will take?
[13:14:31] the usual time, ~10 days if we are lucky
[13:15:14] if you still have some time this one has to go in as well https://gerrit.wikimedia.org/r/c/operations/puppet/+/721281
[13:16:11] better to do it right now than forget to do it before the reload is completed!
[13:17:21] done
[13:17:22] thanks!
[14:50:43] time to go learn some signs!
[14:50:48] Enjoy the weekend!
[14:51:52] enjoy!
[15:15:41] \o
[15:19:37] o/
[15:21:43] zpapierski: i was curious after looking over the mw-oauth-proxy code, what do we need to do to tie the session stores and caches together? We should expect that the first request to get an oauth token will hit one server, and the next request with the validation token will hit some other server
[15:21:56] o/
[15:22:49] hmm
[15:23:05] this sounds like a need for sticky sessions or a distributed cache
[15:23:22] well, yes :) But that should be super generic web stuff, we just plug something in?
[15:23:41] I'd think so
[15:23:54] i guess, i was hoping you would know what :P I know nothing of java webdev
[15:24:16] not really, it isn't specific to Java, at least to my knowledge
[15:24:46] session store should be super specific to java?
[15:24:56] unless we are talking about writing a new session store with a KV api
[15:24:56] I'm not sure I understand
[15:25:19] oh, ok, now I do
[15:26:01] if we do sticky sessions with LVS, this will still be problematic because users will be tied to a single server and if they hit another, they will reauthenticate
[15:26:10] i mean i can google and choose something random, but it won't be an intelligent decision :P
[15:26:36] I'm thinking JWT
[15:27:02] we can solve the authentication part with a sticky session for that part (I hope) and then JWT doesn't require a session store
[15:27:02] yea i don't think LVS sticky sessions are enough, ideally no hiccups when a server restarts or whatever
[15:27:48] hmm, ok i can see what can be done about swapping the session store to JWT. For the cache indeed anything generic, i'll have to check if generic apps are allowed to use the cluster memcache/redis instances
[15:28:11] you could, potentially, but I'd rather simplify the code
[15:28:16] we have https://www.mediawiki.org/wiki/Kask
[15:28:41] zpapierski: don't you still have to hold the requestToken in a cache regardless?
[15:29:03] zpapierski: i suppose i'd have to check the oauth spec, but you generate a token, give it to the user, then the user returns. I think we still need to know at that point we created the initial token and what it was?
[15:29:27] you mean service <-> mediawiki auth?
[15:29:54] the service should never talk to mw directly? the app generates a token, the user visits mediawiki, mediawiki redirects back to the app with a verification token
[15:30:08] in the current code at least you look up that initial token in the cache
[15:30:14] so i'm assuming we have to know that token exists and we created it
[15:31:05] oh, does this actually talk to mw in the background? Haven't looked into what ServiceBuilder is doing. I don't think when i did oauth1 years ago there was any pingback
[15:32:06] I need to confirm it, but I think it doesn't need to outside of the initial authentication
[15:33:25] we could replace checkLogin (that checks if the access token is present) with JWT verification
[15:33:48] I'm not sure about requestTokens yet though
[17:54:42] ebernhardson: anything I can help with regarding that oauth stuff?
[17:58:49] also, there's a ticket for that JWT stuff - T290299
[17:58:50] T290299: Replace token store in MW OAuth WCQS proxy with JWT - https://phabricator.wikimedia.org/T290299
[18:02:50] zpapierski: hmm, not sure how much. Checking the oauth protocol we do need to keep the initial requestToken somewhere. In discernatron that's using php's session mechanism
[18:03:23] do we need to replicate it, or is it fine that each service has its own?
[18:03:44] it has to be replicated, i don't think we can expect two requests from a user to always go to the same host
[18:04:03] we can set it up to sometimes do that, but it would still fail sometimes
[18:04:07] what if we make the auth calls sticky?
[18:04:42] sticky is a best case, it still fails on server restart, pool/depool, etc.
[18:04:49] the cost of infrequent failure doesn't seem terrible - you'll just restart the authentication session
[18:05:11] and this is mostly invisible to the user, at least the web one
[18:05:35] and on restart you lose the session we have now anyway, so it's not worse than it is now
[18:06:24] hmm, will it start another round of redirects or just give them a forbidden?
[18:06:50] I think the former
[18:07:08] it's the same as when the user session isn't present
[18:07:24] (at least that's how it is right now)
[18:07:41] hmm, reading it, /oauth_verify returns forbidden if the token isn't in the cache
[18:07:49] yep
[18:09:16] this forces checkLogin, which starts the whole redirect dance
[18:09:45] since you're already logged in and have a cookie, it doesn't even show you a login page
[18:10:22] hmm, ok then if it just sends them around in redirects i suppose that's not so bad
[18:11:05] so then all we need is a jwt token signed with an expiration date for the session. And then i guess something to refresh that occasionally?
[18:11:36] it should refresh itself automatically, if we implement this the same way - just another redirect loop when expired
[18:12:01] it means we have no way to expire sessions forcefully, do we care?
[18:12:26] probably not?
[18:13:02] not really, I guess
[18:13:14] I mean, if need be, we can block users
[18:13:34] but apart from that, it's all or nothing
[18:15:31] so the remaining question is, how do we make auth sessions sticky?
[18:16:18] or do we have to make all incoming request routing sticky?
[18:16:29] this I don't know :(
[18:17:09] SREs probably should, though
[18:17:43] making all requests sticky seems problematic, we'd easily get unbalanced loading. But reading the nginx config, the /check_login subrequest can be called from basically any incoming request
[18:19:07] does it need to be sticky though?
[18:19:28] if you check_login with JWT, any instance can authenticate
[18:19:33] check_login creates the token, the returning oauth_verify req has to go to whatever called check_login
[18:19:57] ah, sorry I was thinking about check_auth
[18:21:00] probably easier to write a kask wrapper and avoid changing much of this, just replace the two Caches with shared remote caches
[18:21:22] which will be called on each call? how performant is it?
[18:21:37] it's a small api sitting in front of a dedicated cassandra cluster
[18:21:48] i dunno what the expected latency would be
[18:22:04] i suppose envoy is proxying it for mw, there ought to be some stats in prometheus somewhere
[18:22:19] are you sure about check_login, though?
[18:22:28] check_app is actually the one being called each time
[18:22:50] check_login is only called on 403
[18:23:15] anything that fails auth will 403, nginx is configured to subrequest that, so any incoming request that fails auth (all first-time requests) goes to check_login
[18:23:32] any call can trigger it, but maybe we can set it up so that only 403s trigger the sticky session?
[18:24:52] also, we might just go for very short sticky sessions - auth calls are very close together
[18:24:52] hmm, i guess we need to ask sre. From looking around i think lvs calls it persistence
[18:25:08] true
[18:25:21] and I think it can be configured for a time period
[18:25:28] 10s would literally be enough
[18:26:12] sorry to push for this so much, I was really hoping to make this proxy simpler and less prone to failure, and I'm afraid of all the I/O
[18:27:12] hmm, also you can configure sticky sessions via headers apparently, so maybe we can leverage nginx proxy_set_header for that
[18:27:56] via http headers? I don't think lvs sees those, it's all https
[18:28:35] sorry then, random googling
[18:29:17] in any case, a short sticky session would probably be enough?
[18:31:31] need to take care of the younger generation and probably go to sleep afterwards, I hope there is a solution within the LVS realm for this
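A minimal sketch of the JWT session idea from T290299, written in Python for brevity (the actual mw-oauth-proxy is Java, and none of these names, claims or values come from it; secret, issuer and lifetime are placeholder assumptions). It captures the approach discussed above: after a successful OAuth verification the proxy issues a signed, expiring token, any instance can validate it with the shared secret, and an expired or missing token simply restarts the redirect dance, so no replicated session store is needed for already-authenticated users.

```python
# Sketch only, using PyJWT; secret, issuer and lifetime are placeholders.
import datetime
import jwt

SECRET = "shared-secret-from-config"       # placeholder, same on every instance
ISSUER = "wcqs-oauth-proxy"                # placeholder issuer name
SESSION_TTL = datetime.timedelta(hours=2)  # placeholder session lifetime

def issue_session_token(username: str) -> str:
    # Called once the OAuth dance has completed successfully.
    now = datetime.datetime.now(tz=datetime.timezone.utc)
    claims = {"sub": username, "iss": ISSUER, "iat": now, "exp": now + SESSION_TTL}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def check_auth(token: str) -> bool:
    # Replaces the session-store lookup: a tampered or expired token fails
    # verification and the proxy just sends the user back through the redirects.
    try:
        jwt.decode(token, SECRET, algorithms=["HS256"], issuer=ISSUER)
        return True
    except jwt.InvalidTokenError:
        return False
```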