[07:38:00] jhathaway: your puppet change has been waiting to be merged for around 8h now :)
[07:38:30] I think this is just fine to merge https://gerrit.wikimedia.org/r/c/labs/private/+/895880 but I would appreciate if someone else checks it
[07:38:31] I can merge
[07:39:59] <_joe_> marostegui: go on
[07:40:05] <_joe_> it's mergeable :)
[07:40:28] done
[07:40:28] <_joe_> and yes this is the typical "sre forgets labs/private needs puppet-merge too for $reasons"
[07:40:34] <_joe_> I do that all the time :)
[07:40:36] jhathaway: merged
[07:40:48] thanks _joe_
[08:44:02] Folks I'll be doing some work on the CR routers in codfw this morning (T331601)
[08:44:16] Unfortunately need to reset a line card to enable the next port on it
[08:44:36] No services should be affected, but certain links will go down; they will be drained before/after
[08:47:52] do you expect this to page (which is fine, just wondering what to expect)?
[08:48:34] moritzm: I'm confident it won't, I'll downtime the routers in question
[08:48:52] but that said there is an outside chance I may have overlooked something in the alerting path
[08:49:02] ack!
[09:01:55] <_joe_> topranks: it seems like a weird time to act on codfw's core routers. Is this non-delayable?
[09:02:21] <_joe_> right now I don't think we can switch to eqiad if something goes awry in codfw
[09:02:28] <_joe_> marostegui: ^^ is my understanding correct?
[09:02:33] _joe_: there is no great urgency
[09:02:58] <_joe_> topranks: sorry I don't want to be a PITA, but if we have no backup plan
[09:03:01] it's not going to cause any problem, really just a port bounce
[09:03:28] like even if the card doesn't come back up, which is extremely unlikely, things will be ok in terms of traffic
[09:05:23] I guess if we had to do this any time of the year in eqiad we'd just proceed the same way
[09:05:35] it's not the kind of change I'd drain the site for
[09:07:18] <_joe_> topranks: ack :)
[09:07:47] <_joe_> from your previous announcement it seemed like you had to do a more radical change
[09:09:02] No it's fairly BAU.
[09:09:06] I'm more worried something will page and cause noise that I hadn't anticipated; in terms of the actual change it's straightforward
[09:09:12] hence the announcement
[09:10:03] <_joe_> ack ack understood :)
[09:16:44] _joe_: you mean WR?
[09:17:07] <_joe_> marostegui: I assume you're running heavy maintenance in eqiad
[09:17:16] <_joe_> so we can't just switch back if codfw has issues
[09:17:30] _joe_: You mean writing in eqiad?
[09:17:36] <_joe_> yes
[09:17:43] Yeah, I would prefer if we don't have to
[09:17:53] I'd need to enable replication eqiad -> codfw now just in case
[09:18:01] <_joe_> no need, really
[09:18:01] and then disable it if all goes well with the router
[09:21:54] marostegui: no need thanks, the links are disabled now and everything looks fine
[09:21:55] thanks!
[09:22:02] thanks :)
[09:50:24] To update: my work on cr1-codfw completed without issue, will move on to cr2-codfw shortly
[10:17:00] hi, I have a hopefully quick question for someone from SRE https://phabricator.wikimedia.org/T328288#8679125 about job queue + jobReleaseTimestamp
[10:27:08] <_joe_> kostajh: I *think* it should just work, but i have to verify
[10:27:27] <_joe_> there are definitely other jobs that do the same
[10:34:41] _joe_: OK. When I added a job using jobReleaseTimestamp, we needed a patch to operations/deployment-charts https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/636078
[10:35:40] But I am not sure about the need for a re-enqueue delay for the job in T328288
[10:35:41] T328288: Leveling up: "Keep going" notification - https://phabricator.wikimedia.org/T328288
[10:35:55] *when I added a job in the past
[10:36:10] <_joe_> I would assume you would
[10:36:33] <_joe_> but I'll check
[10:37:57] ty
[10:44:20] <_joe_> kostajh: ok so
[10:45:12] <_joe_> the answer is interestingly it would "just work", because the parameter is apparently supported, but it blocks execution
[10:45:27] <_joe_> so it will also need to be put in a separate topic
[10:46:10] <_joe_> or - we can unset the parameter and add it as in your previous patch
[10:46:23] <_joe_> what will be the volume of these jobs?
[10:46:40] <_joe_> (also sorry, I need to go afk in a couple of minutes - I will reply later)
[10:48:30] _joe_: the volume would be 1 for every account creation (local account creation, not autocreated). For now, we are deploying to {ar,bn,cs,es} wikis.
[10:48:46] <_joe_> so very low
[10:49:00] I don't know how to place it in a separate topic; is that something done on the MW side?
[10:49:10] <_joe_> no, on the changeprop side
[10:49:16] <_joe_> but let me look a bit deeper
[10:50:10] ack
[10:50:48] <_joe_> gotta go, but I think I have an idea of how to do this.
[10:52:17] ok, could you add your comment in T328288 later, please?
[10:52:18] T328288: Leveling up: "Keep going" notification - https://phabricator.wikimedia.org/T328288
[11:00:40] <_joe_> kostajh: sure
[11:13:46] Ok all my work is complete, no drama thankfully :)
[13:57:33] hi folks - there seem to be some complaints about HTTP 412 on officewiki when editing with VE (which prevents editing) - would anyone know anything about that? (also: is this the right place for raising this?)
[14:11:01] ihurbain: I don't know anything about this specific issue, but I will say 412 is an unusual response code.
[14:12:01] it's supposed to be for an ETag mismatch on a POST. It's basically a mechanism to prevent overlapping edits of the same resource, at the HTTP level
[14:13:49] at the HTTP level anyways, the basic idea is that you're sent the existing content (to edit) and a special header with a hash of the current contents.
[14:14:18] Then later when you submit your edit, it double-checks the hash still matches at the server, and rejects with 412 if they don't match, which implies someone else edited it while you were busy editing it.
[14:14:37] at least, that's the HTTP explanation of it. I have no idea to what degree or how MediaWiki supports this stuff
[14:17:15] best I could find on a quick google search is a random reference to someone else with a similar problem on probably a different private wiki:
[14:17:17] that's consistent with my vague understanding, and also consistent with "hrm, it's definitely not systematic"
[14:17:18] https://leo.leung.xyz/wiki/MediaWiki#Error_contacting_the_Parsoid/RESTBase_server_(HTTP_412)
[14:18:25] the hints there are basically: something auth-related, and/or something rest.php-related?
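A minimal sketch of the conditional-request mechanism described above. The page store, the "Sandbox" title and the helper names are invented for illustration; this is not MediaWiki or RESTBase code, it only shows the generic HTTP ETag/precondition flow that produces a 412.

    import hashlib

    # In-memory stand-in for a wiki page store: title -> current wikitext.
    pages = {"Sandbox": "Hello, office wiki!"}

    def etag_for(text: str) -> str:
        # A strong ETag derived from the current content.
        return '"%s"' % hashlib.sha256(text.encode("utf-8")).hexdigest()

    def handle_get(title):
        # The client receives the content to edit plus an ETag for it.
        text = pages[title]
        return 200, {"ETag": etag_for(text)}, text

    def handle_put(title, new_text, if_match):
        # On the write, the server recomputes the hash of what it currently
        # holds; if the client's precondition no longer matches (the page
        # changed, or the client's token is simply wrong), it refuses with 412.
        if if_match != etag_for(pages[title]):
            return 412, {}, "Precondition Failed"
        pages[title] = new_text
        return 200, {"ETag": etag_for(new_text)}, "Saved"

    # Normal flow: fetch, then submit the edit with the ETag we were given.
    status, headers, text = handle_get("Sandbox")
    assert handle_put("Sandbox", "edited", headers["ETag"])[0] == 200
    # Stale flow: reusing the old ETag after the content changed yields 412.
    assert handle_put("Sandbox", "edited again", headers["ETag"])[0] == 412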
[14:22:18] hmmm: https://phabricator.wikimedia.org/T236837
[14:23:11] not much hint in there either, other than more hints towards rest.php maybe
[14:23:23] (although that specific ticket is about older restbase)
[14:24:13] the related commits in that old ticket though, were about parsoid "mirroring"
[14:24:25] makes me wonder if this is somehow related to the DC switchover
[14:25:15] (as in, maybe parsoid/rest has different data at some level in the two DCs now, and this causes the 412 mismatch due to ro-vs-rw or whatever)
[14:33:06] hey, ihurbain: do we have any more information about this in a task or something?
[14:33:47] akosiaris: no, there's been a thread on #content-transformers (on slack) started this morning, but afaik we haven't philed anything yet
[14:34:15] marostegui: thanks for merging!
[14:34:33] :)
[14:35:10] i have personally seen the 412 this morning on my first edit on officewiki, and then when i retested it worked - so i assumed it was fixed, but maybe it was "only" sporadic, or editing the same page slapped the cache enough, or or or.
[14:37:12] duesen seemed to have opinions, but i don't know if he's around
[14:38:34] I see no 412 being emitted by mediawiki in the last 24H
[14:38:49] so, it's something else along the path and this smells like RESTBase
[14:39:50] considering daniel just sent https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/896104 to review: legit.
[14:42:10] yeah, it's RESTBase alright, https://logstash.wikimedia.org/goto/1074f7ce5c0fff6395231ef155ec9b9e
[14:43:01] although... those are all for changepropagation
[14:43:08] not end user browsers...
[14:44:55] and I just ended up receiving that 412
[14:46:50] I just filed https://phabricator.wikimedia.org/T331629
[14:48:40] <_joe_> nemo-yiannis: ^^ did we release restbase today already?
[14:48:41] thank you!!
[14:48:42] just looped in the restbase devs
[14:48:58] <_joe_> heh ok hnowlan preceded me :)
[14:49:44] yeah like 15 mins ago
[14:50:03] * nemo-yiannis checks
[14:51:14] this has been happening for longer (starting around ~11 UTC)
[14:51:21] <_joe_> ok
[14:51:46] requests are to /w/api.php btw, not directly to RESTBase
[14:52:14] <_joe_> these seem all to be in eqiad
[14:52:16] I get a 200 with that error body embedded
[14:52:49] mw2358.codfw.wmnet served me the one I am looking at right now
[14:53:35] <_joe_> so it's not an HTTP 412 from restbase
[14:53:55] Not to me directly. It's probably a 412 that is received by mediawiki
[14:54:57] (in case it's relevant: first complaint I've seen was a bit before 7AM UTC)
[14:55:12] From the url attribute on the logs it looks like a good amount of the errors come from the `File:` namespace
[14:55:19] <_joe_> akosiaris: what was the url where you got that?
[14:55:51] * _joe_ mutters something about tracing
[14:57:02] <_joe_> ah ok, so a VE edit goes via api.php on private wikis
[14:57:18] yes, not RESTBase
[14:57:24] the error is a red herring
[14:57:32] but still... who emits that 412
[14:57:53] <_joe_> so what is the URL where you see that error?
[14:58:04] <_joe_> I can't seem to repro :/
[14:58:10] https://office.wikimedia.org/wiki/Sandbox
[14:58:23] I just tried to do a simple edit
[14:58:38] I even clicked "Try Again" a couple of times
[14:59:34] <_joe_> got it now
[15:02:53] <_joe_> it looks like a problem with some token because if I get it once on an edit page, it stays consistent
[15:03:19] <_joe_> every other request will fail, but if I re-open VE, it works
[15:03:47] <_joe_> to me this smells of a bug somewhere in mediawiki
[15:05:12] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=parsoid-php&viewPanel=4&from=now-7d&to=now
[15:05:28] <_joe_> started around midnight today
[15:05:35] <_joe_> did we roll the train forward?
[15:06:03] <_joe_> (on private wikis we call parsoid directly, because restbase can't deal with authn/authz)
[15:06:27] _joe_: there was a patch posted earlier, about switching officewiki *to* direct, it apparently wasn't as of shortly ago
[15:06:29] worth checking if https://gerrit.wikimedia.org/r/c/mediawiki/core/+/814861 is the reason. i've asked daniel on slack already about this.
[15:06:35] yeah that one
[15:06:57] oh not that one, sorry, I was thinking of: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/896104
[15:07:00] <_joe_> 23:50 zabe@deploy2002: Finished scap: T308932 (duration: 07m 15s) is the only thing in SAL near to that time
[15:07:00] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[15:07:12] daniel proposed that direct access config patch to make it more stable because of the bug you all are investigating.
[15:07:40] ok
[15:08:23] <_joe_> subbu: did you take a look at parsoid's errors?
[15:08:34] <_joe_> the 412s are coming from parsoid
[15:10:49] <_joe_> I see a lot of Iterator page I/O error.
[15:10:54] <_joe_> swift again?
[15:10:56] not yet .. i just woke up and saw the slack thread. let me look.
[15:11:11] T318941
[15:11:11] T318941: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T318941
[15:11:25] (an older ticket, I added a comment today when I noticed an uptick..)
[15:11:59] <_joe_> Emperor: how is swift doing? :)
[15:12:55] <_joe_> ihurbain: are these the errors you were referring to earlier?
[15:13:20] <_joe_> I *think* they're unrelated to what we're seeing, which isn't a fatal error
[15:13:30] the filebackenderror has been around for ages .. and is probably not related.
[15:13:49] i don't see anything logged for officewiki in parsoid's logstash.
[15:14:31] is there a way to consistently reproduce this 412 on officewiki? Or is it not easily reproducible?
[15:15:01] <_joe_> subbu: happens episodically to me, but once it happens, I cannot submit the edit
[15:15:11] _joe_ https://phabricator.wikimedia.org/T331629 is the one i was referring to when i first flagged this.
[15:15:53] I strongly suspect it is the ETag patch I referenced above ... because if a bad ETag in VE is causing the 412 .. then no matter how many times you try to resubmit, that won't work because the etag will continue to be bad.
[15:16:24] <_joe_> yes, I concur
[15:16:30] we need to find daniel.
[15:17:24] duesen: ^
[15:17:54] I always forget daniel is not daniel on irc. ;-)
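A toy sketch of the failure mode subbu describes: once VisualEditor is holding a bad or stale ETag, clicking "Try Again" can never succeed, because every retry resends the same token while the server recomputes the hash of the current content each time. The submit_edit() stub and both ETag values below are invented for illustration; this is not VisualEditor or MediaWiki code.

    CURRENT_ETAG = '"etag-of-current-content"'  # what the server would compute right now

    def submit_edit(wikitext: str, if_match: str) -> int:
        # Stand-in for the server side: 412 unless the precondition matches.
        return 200 if if_match == CURRENT_ETAG else 412

    stale_etag = '"etag-cached-when-ve-was-opened"'
    for attempt in range(1, 4):
        # Every retry reuses the same stale token, so every retry is a 412;
        # a stale precondition cannot be fixed by resubmitting.
        print(f"attempt {attempt}: HTTP {submit_edit('new text', stale_etag)}")

    # Re-opening the editor fetches a fresh ETag, which is why edits work
    # again after reloading VE, as observed above.
    assert submit_edit("new text", CURRENT_ETAG) == 200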
[15:20:42] <_joe_> I pinged him on slack :)
[15:20:57] <_joe_> subbu: we're all the 20-yrs-old-cool-kid version of ourselves here
[15:21:27] :-) he is also on a slack thread in the #content-transformers channel there as well.
[15:22:02] <_joe_> ok, I think there's a high probability the bug is there; I don't think it's worth rolling back right now, if you all think you can find a way out of it
[15:22:29] <_joe_> but we can also take the "rollback first, ask questions later" approach I usually am in favour of :)
[15:22:51] if we cannot debug this easily we can +2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/896028 and backport it.
[15:23:00] (the revert of the etag patch)
[15:24:37] <_joe_> ack, are you and duesen on top of it?
[15:24:53] yes.
[15:31:25] andrewbogott: You may know this already but cloudservices[2004-2005]-dev.wikimedia.org still has puppet disabled
[15:39:03] In 20 minutes I am switching over the m5 db master, which will affect toolhub, mailman and some other WMCS related databases. Impact: RO for around 1 minute, reads unaffected https://phabricator.wikimedia.org/T330847
[15:54:51] mailman3 has had a large queue since March 7th 14:12, which is apparently when some network switch got upgraded. The 7-day queue graph: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&viewPanel=2&from=now-7d&to=now
[15:55:22] wow
[15:55:23] that got filed as https://phabricator.wikimedia.org/T331626 because some bot is no longer receiving emails through `mediawiki-commits@lists.wikimedia.org`
[15:55:59] I am wondering if I should not proceed with the DB switchover, since that section also hosts the mailman database
[15:56:19] I have looked at the switch upgrade task and nothing mentions Mailman, but some bot reported on the task that it has muted alerting for a host `lists1001.wikimedia.org`, which sounds like it could be Mailman
[15:57:49] I am going to go ahead with the switch, as the database will be on read-only so there's no split brain possible anyways. I will restart the mailman service too just in case
[15:58:12] I know nothing about mailman unfortunately :]
[15:58:36] bd808: ready for the switch in 2 minutes?
[15:59:20] hashar: probably worth asking Amir1 or legoktm about mailman
[15:59:41] Amir is on vacation at the moment
[15:59:43] marostegui: o/ yes
[15:59:54] bd808: I will coordinate in -operations!
[16:00:02] I can look in an hourish
[16:00:08] unless it's an emergency?
[16:00:49] switchover done
[16:00:53] RO was around 15 seconds
[16:01:15] legoktm: it's been broken 2 days so i think it can wait
[16:01:24] See https://phabricator.wikimedia.org/T331626
[16:01:58] legoktm: I can restart the service if you want me to
[16:02:13] I can't imagine that would make it worse
[16:02:19] so please :)
[16:02:33] ok!
[16:03:33] done
[16:04:44] funnily icinga on lists1001 states `OK: mailman3 queues are below the limits`
[16:04:47] thanks brett, I will enable now
[16:07:26] marostegui: I think that might have fixed it :)
[16:07:54] as for the root cause who knows, maybe losing the network connection caused the mailman runner to get lost
[16:08:53] It looks like the trend is now going down
[16:08:58] https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&viewPanel=2&from=now-12h&to=now
[16:09:04] oh fantastic :)
[16:10:15] there is certainly a follow-up needed to fix up the icinga alert
[16:10:38] Maybe for future network maint, restarting mailman should be done after
[16:11:51] afaik icinga is checking the exim queue, and that grafana panel is counting the size of a directory on the filesystem, yeah seems like a good idea to notify about that
[16:14:58] marostegui: I have resolved the task. Thank you for the mailman restart!
[16:15:54] ok cool
[16:30:11] T331633 is UBN! - report of someone not getting emails...
[16:30:13] T331633: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633
[16:30:58] Emperor: should be resolved now
[16:31:06] Emperor: Just commented there
[16:32:19] Thanks, it came across my clinic duty desk
[16:41:10] that can be marked as a dupe of T331626 which led to the mailman restart
[16:41:10] T331626: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626
[16:41:24] the queue is draining albeit very slowly
[16:46:11] _joe_, The revert merged ... should I schedule a backport in a regular backport window or is there someone who can help with an out-of-band backport of the reverted patch?
[16:50:38] <_joe_> I'm in an interview, sorry
[16:52:50] looks like the next backport window is in 4 hours ... anyone able to do an out-of-window backport of the cherrypick ( https://gerrit.wikimedia.org/r/c/mediawiki/core/+/896030 )?
[16:54:39] I can take a look
[16:59:15] ty
[17:06:43] Can someone from the collab team help debug a phabricator issue?
[17:33:34] heads-up: Traffic is moving the authdns servers to dns1001 and dns1002 from authdns1001 and 1002. you might see DNS-related alerts
[17:33:54] we are around and probably aware but if you see something wrong, feel free to shout from the rooftops :)
[18:32:53] I updated the mailman ticket a bit, it'll probably take ~5 hours to recover
[19:09:35] who looks after https://phab.wmflabs.org ? (down, `Unable to connect to MySQL!`)
[19:22:02] TheresNoTime: it is part of the https://openstack-browser.toolforge.org/project/devtools project, but I don't know that anyone expects the instance to actually work most of the time.
[19:22:33] the one time I'm testing phab API things... :p
[19:37:21] I think mutante looked at it once
[19:39:35] Or maybe I’m thinking of the phorge tests
[22:08:47] Mailman down to less than 1k emails
[23:06:54] Traffic has completed T330670. if you see any issues running authdns-update, please let us know here, thanks
[23:06:55] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670
[23:11:03] sukhe: congratulations on completing that non-trivial migration
[23:12:00] few things have more potential to break everything than touching DNS servers, kudos