[00:11:11] mjolnir is only running through ~8.5k files per hour but has 53k more files to go through (~6 hours) in the hourly updates before it gets to the updates that caused writes to get stuck the other day. Not expecting it to get stuck again, but if someone could check that cirrus error counts don't go up in EU morning would be great :)
[05:52:40] (the high error counts earlier were all the cross-cluster from chi->omega again, unrelated to writes)
[08:16:57] error count seems sane (both cirrus and the mjolnir bulk update daemon)
[08:17:59] mjolnir-bulk-update@codfw does not seem to be running tho (perhaps on purpose?)
[08:24:24] I see a mixture of extra in version 6.8.23-wmf2 and 6.8.23-wmf1 in codfw but everything is wmf2 in eqiad so that probably explains it
[10:44:38] lunch
[12:46:44] greetings
[12:48:02] dcausse that's right, we haven't finished the plugin deploy in codfw but I'll start on that shortly
[12:48:14] o/
[12:48:16] inflatador: thanks!
[12:48:22] hold that thought, the PDU maintenance is still ongoing, so we have to take care of that first (https://phabricator.wikimedia.org/T310070). About to shut down B7 hosts
[13:01:45] ok
[13:23:17] dcausse, inflatador: do you have a sense of when T314078 will be fixed and you'll be able to work on T314473, which is blocking the latest round of image suggestions notifications that were supposed to go out today?
[13:23:19] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078
[13:23:20] T314473: Ingest new image suggestions index diffs - https://phabricator.wikimedia.org/T314473
[13:24:39] cbogen_ the fix for T314078 should be done by EoD
[13:25:23] ok thanks. so is T314473 something that can be completed by EoD tomorrow?
[13:26:24] and gehel is it okay if I put it at the top of ready for development?
[13:26:44] cbogen_: yes, best case is late today, worst might be tomorrow
[13:26:54] okay, thank you!
[13:26:56] cbogen_: sure! (cc: ebernhardson, dcausse)
[15:03:03] Search Office Hours are open: https://meet.google.com/vgj-bbeb-uyi
[15:32:43] can someone see the message from jayme in -sre about maint in codfw today as a wcqs server hasn't been shut down
[15:36:21] inflata.dor replied in there
[15:52:09] RhinosF1 I responded to pa-paul in #wikimedia-dcops already, LMK if I need to do anything else
[15:53:06] inflatador: believe that's all.
[15:55:33] inflatador: i would make sure you are ready for thursday and friday though
[15:55:41] I see a few elastic nodes in the list
[15:55:55] https://phabricator.wikimedia.org/T309956
[15:57:15] Thanks RhinosF1! AFAIK our team was never notified of this maintenance. I requested that we get added to the SRE Google Group yesterday, hopefully we'll start to get these. If there's a calendar we need to subscribe to or something let us know
[15:58:33] inflatador: i'm nosey so i watch far too many tags. you can see the plans for this set on that task though and its subtasks.
[16:00:03] RhinosF1 ACK, subscribed myself and ryan-kemper to the master task
[16:00:11] kostajh: We've discussed T312198 quite a bit during our office hours. I've added a few notes on that ticket. I think the main point is that this needs an owner who has the focus to make it move forward and coordinate all the different work / teams that need to collaborate on this
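A quick way to double-check the wmf1/wmf2 plugin mix mentioned at 08:24 is the Elasticsearch _cat/plugins API. A minimal sketch, assuming Python with requests and a placeholder cluster URL; only the plugin name "extra" comes from the discussion above, everything else is illustrative:

```python
import requests

# Placeholder endpoint; point this at one of the cluster's HTTP ports.
ES = "http://localhost:9200"

# One row per (node, plugin) pair; shows which nodes still run the older build.
rows = requests.get(
    f"{ES}/_cat/plugins",
    params={"h": "name,component,version", "format": "json"},
).json()

for row in rows:
    if row["component"] == "extra":
        print(row["name"], row["version"])
```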
[16:00:11] T312198: Developer productivity: Shared ElasticSearch instance for local development environments, Quibble CI, and Patch Demo - https://phabricator.wikimedia.org/T312198
[16:00:46] inflatador: it was tagged with #sre on phab so you could maybe watch that workboard for the future, although 90% of mail from #sre you probably wouldn't work on so you don't want to drown either.
[16:01:57] C8 August 9th 09:30 am CT/2:30pm UTC is the only one left after this week
[16:02:24] ACK, I'll take a look at the workboard. Too much info is always better than too little ;)
[16:19:31] working out, back in ~40
[16:53:29] going offline
[16:55:06] back
[17:23:32] I wanted to let folks know that I reopened https://phabricator.wikimedia.org/T306899
[17:24:33] Following Ryan's advice in the ticket to reopen and report timestamps if 500 errors in WCQS continue to occur, and mention it in here. I had experienced errors yesterday and today (just a few minutes ago).
[17:27:24] (^ping ebernhardson)
[17:56:01] DominicBM: thanks! It sounds like the missing piece was that it doesn't fail until after you've been logged in for some time, so perhaps it's about token refreshes?
[17:56:43] DominicBM: i've been doing all my testing by opening the page and running a few example queries, does it perhaps require leaving the same page open for some hours and running more queries from that same session?
[17:56:59] (my testing hasn't been able to reproduce the issue, which has made coming up with a fix difficult)
[17:58:20] That would align with what I remember of past times. It feels like it is only after I've used it for a while, which is why I wondered if it was about load—but it's not necessarily about intense usage, so that makes sense too.
[17:59:10] DominicBM: one thing that would push back against that idea (token refreshes) though is that i would suspect you try to refresh the page when re-running, and for refreshing to give a page load it would have to have refreshed the token too. Any memory of whether refreshing the ui causes things to start working / doesn't help at all?
[18:00:18] i can certainly try leaving a page open and running a few example queries hours apart, see if i can reproduce something.
[18:02:09] i suppose just thinking about how token refreshes interact with the CORS work i did recently, plausibly the problem could be that when redirected to the mw Special:OAuth page to re-auth the token the browser rejects it as a cross-site request. I'm not entirely sure how we would fix that though, as we can't change the site UI or Special:OAuth
[18:02:13] but can ponder
[18:04:13] your timestamp from the ticket also aligns exactly with this error message from one of our servers: /oauth/check_auth java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
[18:05:19] For what it's worth, I'm taking this issue to the WMF Platform team, as they are building out a framework to more generally handle authentication across the board. But this will probably not resolve issues in the short term
[18:06:08] hmm, sadly the network logging i set up on the wcqs servers on jun 1 isn't running anymore, so we don't have the raw network logs to correlate with.
[18:07:03] Oops, I dropped off. Hm, I don't know how to explain this, but it's working for me in the UI, but still giving me a 500 using the endpoint, even though I'm using the same session token for both...
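A rough sketch of the kind of long-running probe that could confirm the "fails only after the session has been open for a while" theory: reuse one session token and re-run a trivial query every few minutes, logging when the 500s start so the timestamps can be matched against the /oauth/check_auth errors above. This is Python/requests rather than anything the service itself uses, and the cookie name is an assumption (copy whatever the browser actually sends):

```python
import datetime
import time

import requests

ENDPOINT = "https://commons-query.wikimedia.org/sparql"
SESSION_TOKEN = "..."        # value copied from a logged-in browser session
COOKIE_NAME = "wcqsSession"  # assumed name; check the browser's cookies first

QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"

while True:
    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        cookies={COOKIE_NAME: SESSION_TOKEN},
        headers={"User-Agent": "wcqs-500-repro (search platform debugging)"},
    )
    print(datetime.datetime.utcnow().isoformat(), resp.status_code)
    if resp.status_code >= 500:
        # Dump the start of the body so it can be correlated with server logs.
        print(resp.text[:500])
        break
    time.sleep(300)  # repeat every 5 minutes until the token ages out
```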
[18:09:28] the idle timeout from /oauth/check_auth is quite weird, i wonder if there is some way we can force a jstack dump or something else that gives more information on what was happening when that timeout occurred
[18:09:54] I'm just doing some experimenting to see if it's not a different issue. But I'm just getting the same general 500 error, so it's indistinguishable from the error message.
[18:10:52] lunch, back in ~1h
[18:11:09] the really weird thing about a timeout from /oauth/check_auth is it doesn't do any network activity, all that does is validate the token provided with a secret and give a yes/no answer.
[18:12:31] because it doesn't do any network activity and is run on every possible request i suppose i had been assuming previously that it was something to do with network between the servers and whatever was making requests and it was possibly random crawlers that didn't bother to read the response in time. But with a direct correlation of timestamps to your error that means something is happening
[18:12:33] there
[18:14:16] maybe some edge case with nginx<->jetty, probably worth looking into that proxying layer (nginx terminates ssl internally and then forwards to jetty, running blazegraph and oauth, over localhost)
[18:14:55] ebernhardson, related to the general issue of authentication via API, I am also just wondering how long these session tokens last, and, since the app I am building can run under its single login (i.e. it's not needing to authenticate individual users to perform write operations), maybe I can just hardcode and manually refresh the token every X
[18:14:55] days, for now.
[18:15:49] DominicBM: the previous lead on this project wanted to do stateless session tokens via JWT, that necessitates short lifetimes in the hours category, might be only one hour. Sec, i can find that bit of config
[18:16:39] Ah, that would be a bit much. :D
[18:17:56] meh, our codesearch finds the code that reads the config but not the config :P more secs to find the actual config...
[18:18:14] the default value if not set is 2 hours
[18:19:56] DominicBM: poking the live servers, it doesn't look like we configure this lifetime, so i'm reasonably convinced it's using the default value of 2 hours. I could probably change that to 12 hours without it being a big deal
[18:20:26] This is a read-only service so it probably isn't that big a deal that a session token once issued is valid to anyone anywhere that uses it
[18:20:32] How about 30 days? '=D
[18:23:01] DominicBM: hmm, i would have to talk to our security team and see what they think i suspect. The high-level guideline for JWT token expiration is "hours, not days" because there is no way to invalidate a JWT token once it's been issued. But on the other hand as a read-only service that auth's against a site where anyone can create an account in < 5 minutes, we don't have the same level
[18:23:03] of strictness required
[18:24:47] And probably not a ton of unique users (compared to other services), too.
[18:25:53] indeed
[18:28:41] There was also a suggestion at the last meeting (which is why T307596 was created) that there is actually a way to generate a session token over API, it's just not obvious. I forget who on the team was saying that, but since I don't really know anything about how OAuth works, I still wasn't sure if there is a way to do it in the intended way, if
[18:28:42] T307596: User documentation for authentication on WCQS - https://phabricator.wikimedia.org/T307596
[18:28:42] you have a registered app that can actually authenticate to Wikimedia first.
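For reference, the "registered app doing the handshake itself" idea is possible against MediaWiki's OAuth 1.0a endpoints, for example with the mwoauth Python library. A sketch of that generic handshake, not the WCQS flow itself (which still issues its own JWT session token afterwards); the consumer credentials and wiki URL are placeholders:

```python
from mwoauth import ConsumerToken, Handshaker

# Credentials for an OAuth 1.0a consumer registered on meta (placeholders).
consumer_token = ConsumerToken("consumer key", "consumer secret")
handshaker = Handshaker("https://meta.wikimedia.org/w/index.php", consumer_token)

# 1) Get a request token and the URL the user must visit to authorize it.
redirect_url, request_token = handshaker.initiate()
print("Authorize at:", redirect_url)

# 2) After authorizing, the user is redirected back; paste that query string here.
response_qs = input("Redirect query string: ")

# 3) Exchange it for the access token (MediaWiki OAuth 1 access tokens don't expire).
access_token = handshaker.complete(request_token, response_qs)

# Optionally confirm which user the token belongs to.
print(handshaker.identify(access_token)["username"])
```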
[18:30:30] DominicBM: assuming you read the redirect from the wcqs-beta.wmflabs.org request to Special:OAuth and have appropriate mediawiki cookies it should "work". The problem is you can't get those mediawiki cookies from the mediawiki action api, it has to go through the typical browser-based login
[18:32:07] Right, I've been able to successfully do that. I just thought someone on the team was saying there should already be a way to get the initial token via API, but it might have just been off the cuff.
[18:34:20] Btw, using the SPARQL endpoint, since the CORS issue is resolved, I get this error message back as plain HTML even when the request was for JSON format. It would be nice if all API responses were structured, not just successful queries. (Does this happen for WDQS errors as well?)
[18:37:08] DominicBM: was plausibly me, but probably i was unclear or directly incorrect. I don't think there is any way to interact with oauth 1 over api. It looks like oauth 2 does provide rest api endpoints, but sadly wcqs integration is over oauth1
[18:37:57] i'm also not familiar enough with mediawiki's rest apis to know what auth looks like there
[18:39:22] DominicBM: i'm not super familiar with the wdqs side of things, but in general i would expect error messages that come from the nginx proxy or from jetty totally failing the request to come back as html or plain text
[18:40:13] You can use action=clientlogin for MediaWiki auth, there is no equivalent REST API. But you'd need a password to authenticate, and not needing that is the point of OAuth.
[18:41:16] In theory the way OAuth 1 works is, you redirect the browser to Special:OAuth (so no CORS involved) and if the user has already given permission in the past it gets redirected back to you without anything user-visible.
[18:41:25] tgr: will that return appropriate cookies that allow interacting with Special:OAuth (which when working correctly only issues a redirect back to wcqs-beta)
[18:41:34] Also, our OAuth 1 access tokens never expire.
[18:42:11] I have never done any app with authorization before. Usually I'm just doing Pywikibot stuff. So I'm muddling through. I thought Special:BotPasswords was going to help me, but I'm not so sure
[18:42:16] tgr: sadly this doesn't directly use the OAuth 1 access token in the application, rather after the oauth redirect loop comes back with success a JWT session token is issued
[18:43:23] It won't return the cookies - since this is a normal request from the browser's POV, it will send the cookies to Special:OAuth, which will verify that the user belonging to those cookies has authorized the application (and show an authorization dialog if not), and redirect the user back with an extra token in the URL that can be exchanged for the access token.
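On the JWT session token that gets issued after that redirect loop: the "2 hours by default, can't be revoked once issued" behaviour discussed earlier is simply how an exp claim works. A minimal PyJWT sketch for illustration only; the real service is Java/Jetty, and the claim names and secret here are made up:

```python
import datetime

import jwt  # PyJWT

SECRET = "not-the-real-signing-secret"
LIFETIME = datetime.timedelta(hours=2)  # the default lifetime discussed above


def issue_session_token(username: str) -> str:
    now = datetime.datetime.now(datetime.timezone.utc)
    return jwt.encode(
        {"sub": username, "iat": now, "exp": now + LIFETIME},
        SECRET,
        algorithm="HS256",
    )


def check_session_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError once exp has passed. There is no
    # server-side state, so a token that leaks stays valid until that moment.
    return jwt.decode(token, SECRET, algorithms=["HS256"])
```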
[18:43:54] tgr: a related problem we have with CORS that maybe you are familiar with, the browser sends an XHR request to /sparql, the JWT token has expired so it redirects to Special:OAuth, and then the browser rejects the redirect returned by Special:OAuth because it's not cors-enabled
[18:44:23] In general the application is not supposed to know about MediaWiki cookies or any other auth details, that's a concern between MediaWiki and the browser.
[18:45:09] in theory we actually only want to protect the /sparql endpoint and not the full UI, but we had to put the full UI behind the auth because we couldn't figure out how to get the XHR requests to auth through Special:OAuth
[18:45:37] Yeah, if you use an XHR request that won't work.
[18:46:22] OAuth 2 is somewhat better in that regard, although it will still redirect to Special:OAuth if the user hasn't given authorization previously.
[18:47:55] i suppose the longer term hope here is that the /sparql endpoint eventually goes behind an "api gateway" and we don't manage any auth at all, the user would only need the gateway to approve their request and forward it onto the backend. But i think that's still some time away :)
[18:48:26] I don't think there is a great solution to that (other than adding OAuth 1 REST endpoints to MediaWiki, or maybe making Special:OAuth output CORS headers). You can use a popup instead of an XHR request; I think that's conceptually more correct anyway since it's up to the OAuth server whether it wants to require explicit user authorization. But more complicated.
[18:50:50] sadly we also don't control the UI, it's all wmde and while we didn't push particularly hard they didn't seem to be interested in creating special flows for wcqs
[18:51:57] Is this a problem you only have for Commons, not Wikidata?
[18:53:14] yes, the general theme is that because wikidata is already deployed it's not possible to add an auth layer. Because commons was a new deployment it could start with something. The issue being solved is that wdqs ends up under heavy load from arbitrary queries from the internet and the only hope we have is that they are using an identifiable user-agent that we can block (see history with
[18:53:16] dailymotion, probably others)
[18:54:21] by putting an auth layer in there a username exists attached to all queries, giving a place to both block queries and notify the user that their requests are overloading our systems
[18:55:36] unfortunately queries that lead to overloads are a bit of an unsolvable issue when the query language is as expressive as sparql
[19:02:29] back
[19:06:33] Not sure how much control you have over the application; what I would maybe try is 1) put a login button somewhere, have it do an OAuth handshake and store the access token in a long-lived browser cookie 2) when /sparql needs authentication use the access token to sign a server-side API request to action=query&meta=userinfo (not 100% safe for identification, mind you, but you don't really care who the user is) 3) if that request fails tell the user to log in
[19:07:12] OAuth 1 access tokens don't expire so that would work as long as the user's browser keeps the cookie.
[19:09:56] hmm, maybe there is some way this could be exposing the oauth 1 access token directly instead of having the result of the oauth 1 access flow result in a separate jwt token being issued. I wasn't part of that decision making process so not sure what was considered at the time
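A sketch of step 2 of that suggestion: sign a server-side action=query&meta=userinfo request with the stored OAuth 1 access token and only issue (or refresh) the service's own session token if it succeeds. Assumes requests_oauthlib; the key values are placeholders and the wiki API endpoint is just an example:

```python
import requests
from requests_oauthlib import OAuth1

# Consumer credentials plus the user's stored OAuth 1 access token (placeholders).
auth = OAuth1(
    "consumer key", "consumer secret",
    "access token key", "access token secret",
)

resp = requests.get(
    "https://commons.wikimedia.org/w/api.php",
    params={"action": "query", "meta": "userinfo", "format": "json"},
    auth=auth,
)
userinfo = resp.json().get("query", {}).get("userinfo", {})

if not userinfo or "anon" in userinfo:
    # Token no longer valid (or never was): ask the user to log in again.
    raise PermissionError("OAuth access token rejected")
print("Authenticated as", userinfo["name"])
```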
[19:11:08] You will need your own JWT token (or whatever auth method) anyway because you don't want to hit the MediaWiki API every time someone makes a WCQS query service request.
[19:11:59] MediaWiki OAuth 2 access tokens are self-signed, in theory you just need the public key and you can verify who the user is. OAuth 1 access tokens are essentially just keys to DB records.
[19:12:59] (But OAuth 2 tokens expire so if you used OAuth 2 you'd have to deal with refresh tokens. On net, probably worse.)
[19:13:36] ahh, ok yeah that makes sense. the oauth 1 access tokens last "forever" because they can be invalidated in the db if needed
[19:25:30] Are we planning on doing retro tomorrow? We've got the monthly staff meeting so I'd think not but just checking
[19:27:45] ryankemper: thanks for the reminder. I've canceled.
[20:13:38] ahha, so it turns out leaving the same commons-query.wikimedia.org session open in the browser long enough does reproduce the 500 error in wcqs. Doing a couple other things today so might have time to debug, but having a reproduction should make it possible to come up with a solution
[20:13:50] s/might have/might not have/
[20:14:10] ( ^_^)o自自o(^_^ ) CHEERS!
[21:18:27] ebernhardson: I can ship that, are you around for any post-deploy checks
[21:18:49] or alternatively, is there something I should look at to make sure all went well
[21:19:00] maybe just looking at the throughput of ElasticaWrite events on grafana?
[21:20:41] ryankemper: yea i'm around, lemme pull up the right graph that should show them
[21:22:44] ryankemper: this graph should split into three lines with roughly equal sizes: https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite&viewPanel=34&forceLogin=true
[21:23:29] and hopefully this one should increase: https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&var-cluster=cloudelastic&var-exported_cluster=cloudelastic-chi&viewPanel=12
[21:24:58] hmm, actually the by-topic one might not split. It's by-topic and then by-server, so each partition might end up on a different server, or might not
[21:25:00] hmm
[21:28:15] wow, found another graph but this makes it look significantly worse than we were thinking. The rate of committed offset increment should be how fast the job queue is processing the jobs through mediawiki:
[21:28:17] https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite&var-consumer_group=All&from=now-15m&to=now&viewPanel=2
[21:29:22] the kafka-mirror-* jobs run at full speed, and the cpjobqueue-* ones run the jobs through mediawiki. Suggests it doesn't even run a job every minute
[21:38:32] oh i'm totally misreading that merged graph, https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite&var-consumer_group=cpjobqueue-cirrusSearchElasticaWrite&from=now-15m&to=now&viewPanel=2 is better
[21:38:56] ebernhardson: merged the patch and applied to staging. Interestingly, when I run the diff command listed here https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Seeing_the_current_status `helmfile -e $CLUSTER diff` I don't see any diff for `eqiad` or `codfw`
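For cross-checking what those consumer-lag panels show, the same numbers can be pulled straight from the brokers. A sketch assuming kafka-python; the broker address is a placeholder, while the topic and consumer group names are the ones from the dashboards linked above:

```python
from kafka import KafkaAdminClient, KafkaConsumer

BROKERS = ["localhost:9092"]  # placeholder; use the main-eqiad broker list
GROUP = "cpjobqueue-cirrusSearchElasticaWrite"
TOPIC = "eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite"

admin = KafkaAdminClient(bootstrap_servers=BROKERS)
consumer = KafkaConsumer(bootstrap_servers=BROKERS)

# Offsets the job runner's consumer group has committed so far...
committed = {
    tp: meta.offset
    for tp, meta in admin.list_consumer_group_offsets(GROUP).items()
    if tp.topic == TOPIC
}
# ...versus the latest offsets on the brokers; the difference is the backlog.
latest = consumer.end_offsets(list(committed))

for tp in sorted(committed, key=lambda tp: tp.partition):
    print(f"partition {tp.partition}: lag = {latest[tp] - committed[tp]}")
```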
[21:39:54] ryankemper: hmm
[21:40:27] ryankemper: maybe staging doesn't have partitioning turned on? I'm not familiar with how staging varies
[21:40:49] this config is inside a block that gates on: if .Values.main_app.jobqueue.partitioners
[21:41:22] well staging *did* see a diff, whereas `eqiad`/`codfw` don't when I try to do `helmfile -e eqiad diff`
[21:41:46] https://www.irccloud.com/pastebin/mhheRsHw/
[21:44:46] hmm, sadly i don't know enough about this to say. I expect that these templates get rendered into some config.yaml file that gets deployed and then read in by the runners
[21:45:16] i suppose when i looked at things before that config.yaml ends up inside a special config volume that is shared between things and mounted into the containers
[21:48:15] Yeah probably appears in this volume
[21:48:18] https://www.irccloud.com/pastebin/y4p5CB7R/
[21:49:05] i see that separately the codfw restart finished and everything reports -wmf2 for the extra plugin in all 3 codfw clusters, re-enabling search-loader2001
[21:51:05] ebernhardson good catch...it did just finish, you can enable mjolnir in codfw now
[21:54:27] ebernhardson: maybe it's the other way around, the `.Values.main_app.jobqueue.partitioners` is only set for staging? that seems odd but might explain it
[21:57:48] ryankemper: hmm, seems unlikely
[21:58:18] ryankemper: that partitioning isn't only us, it also partitions the jobs that handle mediawiki core edit related tasks into per-database partitions. I can't imagine that would be broken without someone noticing
[21:58:23] (but i've been surprised before :P)
[21:58:33] yeah, that makes sense
[21:59:23] okay, we'll need some help from service-ops w/ understanding why helm doesn't see a change for `eqiad` or `codfw`
[21:59:32] kk