[00:11:19] I put a banner in place on wcqs-beta.wmflabs.org, I imagine someone will ask to rewrite that copy though so let me know :)
[00:11:45] it says "Please try the new WCQS Beta deployment, now with more capacity and live updates. The old beta instance in wmflabs will be decommissioned March 1, 2022." with WCQS Beta as a link
[00:12:34] (it's also a total hack job, and won't get i18n translations like a normal banner would if we went through the full process)
[00:27:36] * ebernhardson realizes while answering someone's question that the feature sets used in mjolnir in prod don't exist anywhere except the prod instances.
[00:53:58] i guess they mostly exist in featuresets.py, but for some reason i suspect there are custom hax
[08:27:09] ebernhardson: the banner looks good to me. Thanks!
[09:03:04] ejoseph: around? https://meet.google.com/ukb-kgxq-gvq
[10:02:52] adding the "Patches for Review" section to standup notes is a great idea
[10:04:17] volans: mind looking at https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/753426 when convenient?
[10:05:02] zpapierski: around? available for a chat?
[10:05:09] https://meet.google.com/sxf-uyhu-cyp
[10:05:13] yep, just let me get my headphones
[10:05:23] zpapierski: sure, no prob
[10:05:32] volans: thanks!
[10:07:48] zpapierski: just to clarify, you don't "need" my review for cookbooks, it's totally ok to self-review/merge them within your team. I'm always happy to review if you're looking for suggestions or have questions on how to do things or if there are other ways to do them
[10:08:39] (by self-review I meant review by members of the same team without me/j.ohn being involved)
[10:15:41] zpapierski: {done}
[10:18:24] volans: thx! I'm not able to +2 there, actually...
[10:18:37] I can do the +2, let me read it first
[10:38:53] lunch + errand, I might be back slightly later than usual (covid vaccine for Oscar)
[10:39:55] zpapierski: I +2'd the cookbook
[10:40:29] The comment from volans about the class API should be addressed separately and might be a good exercise for inflatador to get more familiar with our cookbooks
[10:41:16] gehel: thx
[11:32:27] lunch
[11:38:55] Hi, we got an email for a DegradedArray event on /dev/md/0:elastic2035
[11:39:06] let me know if you want the email
[11:46:19] Amir1: that's T298853 AFAICT
[11:46:19] T298853: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853
[11:50:57] awesome
[11:51:09] one less root@ email to worry about
[11:51:39] lunch
[13:33:21] dcausse: i'm not sure why, but eventgate-main didn't have stream config for rdf-streaming-updater.reconcile
[13:33:31] i'm roll restarting it, that seems to be fixing it
[13:33:38] :|
[13:33:44] i'm not sure why though, it is in stream config correctly afaict and has been since you deployed it
[13:33:53] eventgate-main requests stream config on startup
[13:34:05] ah
[13:34:15] ...also, btw, do you think we should move the rdf-streaming-updater stuff into schemas/event/primary
[13:34:21] so we've made a mistake
[13:34:23] it is a 'production' service
[13:34:25] dcausse: oh?
[13:34:56] dcausse: i'm not sure what mistake you've made, from what I can tell you did everything right.
[13:35:02] I thought it was: (1) schema -> (2) eventgate (restart) and (3) mw-config
[13:35:15] oh, but hasn't mw config for this been there?
[13:35:20] thought i looked at git blame
[13:35:20] OH
[13:35:21] deploy
[13:36:20] I thought that somehow, after deploying a mw-config patch, some stuff would start producing data to it (canary events) and then eventgate would be in bad shape without the schema reloaded
[13:36:55] did not occur to me that eventgate would pull data out of meta
[13:37:32] hm... thinking about this, it seems they need to happen almost at the same time
[13:37:43] what triggers canary events to be sent?
[13:38:12] just updated eventgate docs to indicate what happens with stream config for each eventgate
[13:38:13] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#EventGate_clusters
[13:38:26] eventgate-analytics-external is the only one that requests stream config dynamically
[13:38:34] ok
[13:38:47] dcausse: declaring a stream will cause canary events to be sent to it
[13:38:58] the canary producer just reads all stream configs and produces
[13:39:41] ok so deploys to mw-config (stream config) + eventgate restart must occur at the "same" time
[13:40:32] yeah, doesn't have to be exactly the same time
[13:40:37] but shortly after
[13:40:49] yes I get it now
[13:41:05] it isn't ideal, it's just done that way to avoid runtime coupling
[13:41:14] sure
[13:41:29] and yes it's a prod-like service, is this for primary vs secondary?
[13:41:46] maybe we could make it request streams it doesn't have from the mw api, buuuut then if someone asks for an invalid stream name, it will just ask every time
[13:41:52] we do have some caching in eventgate-analytics-external
[13:42:14] ottomata: I don't think it's a big deal once you know what to do
[13:42:15] erg. i dunno, i think the way it is is good, it is not expected that streams change often for anything but eventgate-analytics-external
[13:42:17] okay
[13:59:30] inflatador, ryankemper: could you take care of the decommissioning step for elastic2035 (T298853) to ensure that we don't keep alerting people for nothing? See Amir1_'s message earlier today.
[13:59:30] T298853: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853
[14:03:38] gehel ACK, will take a look
[14:05:14] Also, Happy Groundhog Day! https://www.farmersalmanac.com/when-groundhog-day-winter-forecast
[14:58:14] dcausse: what MW state does the streaming updater ask for/get from the MW API currently?
[14:58:23] wikidata revision content? commons revision content?
[14:58:24] is that it?
[14:58:38] do you ask for a specific format from the API?
[14:58:45] (or, point me to code that is doing this, I will read)
[15:01:27] oh I think I found it, WikibaseRepository?
[15:02:06] ottomata: yes its main MW dependency is /wiki/Special:EntityData
[15:02:45] it's not under "api" but it really falls under the MW (wikibase) APIs
[15:03:50] example call is: https://commons.wikimedia.org/wiki/Special:EntityData/M114759078.ttl?flavor=dump&revision=625508650
[15:12:43] perfect
[15:12:55] okay that's a wikibase specific thing then
[15:12:57] cool
[15:42:26] inflatador: there is no one for my refactoring workshop tonight, let's use that time to continue our earlier conversation (invite sent)
[15:46:54] gehel ACK, accepted
[16:02:55] ebernhardson: Thanks for putting the banner in place. If you don't mind changing the copy to: "WCQS beta 2 is now available at https://commons-query.wikimedia.org/ and has live updates and better service reliability (release notes). This WCQS beta 1 instance in wmflabs will be decommissioned March 1, 2022, and all traffic will be automatically directed to WCQS beta 2 following that date."
[16:03:02] release notes link: https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/WCQSbeta2-release-notes
[16:39:28] andreaw remind me to ask you about your bass sometime! My son just started playing
[17:00:26] inflatador: we are holding office hours now, wanna join (we forgot to add you to this one, sorry!) - https://meet.google.com/vgj-bbeb-uyi?authuser=0?
[17:01:30] zpapierski gotta run errands but should be back in time for "unmeeting"
[17:02:01] no problem! Trey314159 added you to the next one
[17:32:50] unmeeting!
[17:57:49] and I missed that it was moved :(
[18:22:11] answering a random question from earlier, Webb has a 2kW solar array
[19:01:24] lunch, back in ~30-45
[19:43:54] back
[19:58:50] * ebernhardson would love to see an auto-formatter (as an open source tool that can be used in the CI pipeline to verify) for Java instead of errors like 'Wrong import order' (with no hint of the correct order). I just push the magic button in IDEA and hope it passes next time :S
[20:13:50] * ebernhardson notes that almost every change to cpjobqueue config since April was someone increasing concurrency somewhere
[20:14:28] what i'm not finding is a good metric about per-queue concurrency, or even how data gets from cpjobqueue to prometheus :S has to be somewhere...
[20:21:29] * ebernhardson apparently asked for per-job concurrency graphs in 2020 and the answer is T255684, still pending
[20:50:33] mpham: banner updated
[20:51:18] lunch
[20:58:07] out for the day, see y'all tomorrow
[21:30:00] back
[21:30:42] so these job queue graphs, i guess we can derive concurrency from the job completion rate and 50th percentile latency? Something doesn't feel right about it but maybe...
[21:41:57] ebernhardson: I imagine that should get us to a reasonable approximation
[21:42:36] I should catch back up on reading your (excellent) writeups on that phab ticket tho, sec
[21:44:00] ebernhardson: btw what are the graphs you're referring to here?
[21:44:07] > We suspect though that ElasticaWrite jobs are already consuming 300 concurrent runners and the graphs don't really show any different activity before/after saneitizer deployment.
[21:44:24] I imagine that's a jobrunner specific graph or something along those lines?
[21:46:08] Ah must be https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1
[21:46:46] ryankemper: there is a graph, but i asked the same thing a few years ago and this was the response: https://phabricator.wikimedia.org/T266762#6625756
[21:47:25] oof, I see
[21:47:26] oh i totally missed the first question :) Yea, it's the JobQueue Job graph
[21:48:57] ebernhardson: thanks!
[21:50:30] I'm dubious about what's really going on here, i feel like we aren't using 300 concurrent runners but can't find proof yet
[21:53:58] just echoing what you were saying in your phab comment, there's no noticeable spike in job insertions during the period in which saneitizer was activated / subsequently disabled
[21:54:10] here's the cirrus side of things for eqiad: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=85&orgId=1&from=1643182091936&to=1643848004131&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:54:21] and the corresponding time range for the `JobQueue Job` dashboard: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&from=1643182091936&to=1643848004131
[21:56:04] if we use a simple derivation, p50 is pretty constant at 250-300ms, job retirement rate peaks around 200. That would imply significantly less than 300 concurrent requests
[21:56:13] 300 concurrent requests at 300ms should retirn 900/s?
[21:56:25] *retire
[21:57:27] if we suspect that cpjobqueue isn't giving the desired concurrency, i suppose i'd be less worried about bumping it up
[21:57:27] Math checks out (minus a sig fig)
[21:57:56] I guess technically that napkin math could lie to us if the p99 or p01 massively shifted after the intervention... but there's no way
[21:58:30] yea i'm not super thrilled with deriving instead of logging concurrency directly. There is a Set() object per-job that they could report the length of and have real direct concurrency. Not clear why they prefer to derive it
[22:03:45] ebernhardson: btw where do we actually set the elastica jobrunner concurrency parameter (or whatever it's properly called)? my first guess would be `mediawiki-config` but my (likely insufficient) greps aren't finding it
[22:07:10] ryankemper: the operations/deployment-charts repository, in helmfile.d/services/changeprop-jobqueue/values.yaml
[22:08:49] I don't think of that one naturally either, but searching for ElasticaWrite turns it up since i had it cloned already
[23:18:04] * ebernhardson wonders what the rate limiting in cpjobqueue does, never saw that before
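
A minimal sketch of the napkin math from the 21:56 exchange above, assuming Little's Law (busy runners ≈ completion rate × latency) and using p50 as a stand-in for the mean latency; the figures are the approximate numbers quoted in the chat, not fresh measurements, and the variable names are illustrative only:

def implied_concurrency(retire_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: average number of runners actually busy for a job type."""
    return retire_rate_per_s * latency_s

# Approximate figures quoted in the discussion for ElasticaWrite (illustrative):
configured_limit = 300   # concurrency configured in changeprop-jobqueue values.yaml
peak_retire_rate = 200   # jobs/s, peak retirement rate on the JobQueue Job dashboard
p50_latency = 0.3        # seconds, roughly constant at 250-300ms

busy = implied_concurrency(peak_retire_rate, p50_latency)
print(f"implied concurrency ~{busy:.0f} of {configured_limit} configured")
# Prints ~60 busy runners; conversely, 300 runners at 300ms each would retire
# roughly 1000 jobs/s, well above what the graphs show.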