[00:47:38] * bd808 off [11:11:44] XioNoX: topranks: just to confirm, amd I good to merge/deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/970850? [11:18:01] * dhinus paged: #wikimedia-cloud-admin ToolsToolsDBWritableState tools [11:18:42] hmm, OOM again? [11:18:52] hmm I can't SSH at the moment [11:18:59] I'm in [11:19:16] some DNS issue on my side apparently [11:19:43] please check the logs and if it's OOM give it a restart [11:20:40] no, something different this time [11:20:43] Nov 08 08:21:49 tools-db-1 mysqld[119707]: 2023-11-08 8:21:49 19727620 [Note] Detected table cache mutex contention at instance 1: 52% waits. Additional table cache instance activated. Number of instances after activation: 2. [11:20:53] mariadb automatically restarted, but came back read-only as expected [11:21:25] oh, no, that was a few hours ago [11:21:47] yeah it was an OOM kill [11:21:51] I'll set it as RW [11:22:01] done [11:24:56] thanks [11:37:00] something is wrong with my router's DNS relay... bypassing it I can ssh again [11:37:34] * dhinus out for lunch, back in a bit [12:31:29] hi all could some one review the following spicerack change and confirm WMCS cookbook sone use spicerack.puppet.get_puppet_ca_hostname() [12:31:35] https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/971957 [12:35:23] jbond: sure, replied [12:36:10] thanks taa [12:36:18] *thanks taavi [12:52:36] taavi: sure, go for it, let me know if it doesn't get accepted on any device (or the diff includes other bits) [12:53:24] taavi: for your comment, let me know if you would prefer read-only, but I'm find for super-user [13:00:01] XioNoX: thanks, I think the current super-user is fine [14:05:34] dhinus: in case you haven't already rebuilt cloudservices1006, I'm around until our meeting [14:06:33] I haven't, shall I start it now? I'm not sure I understand the implications of the master records problem [14:07:51] I updated the docs (which were right before but unclear) starting with 'You will also need to update pdns on all nodes to set the 'master' field. ' [14:07:57] I am pretty sure I know how to do it right this time [14:08:04] So yeah, let's do it [14:09:05] I'll start the reimage then [14:09:22] Also, the thing with the dump/import wiping the table means that we actually /can/ use the packaged schema after all. [14:09:23] I guess my question is: can we modify our setup so that we always use the same IP/interface? [14:09:39] dhinus: that's a question for topranks, I think he said no last time I asked [14:09:50] would sure be better though! [14:10:13] and possibly I didn't ask the question correctly last time :) [14:11:30] can you try explaining it in detail? this is about traffic from designate to pdns, right? 
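For reference, the docs step quoted above ("You will also need to update pdns on all nodes to set the 'master' field") comes down to editing the domains table in each node's pdns database. A minimal sketch of that step, assuming pymysql is available; the credentials and the designate source address below are placeholders, not the real cloudservices values.

```python
# Sketch: point every pdns zone's 'master' field at the address designate
# will actually send NOTIFY/AXFR traffic from. Placeholder values only.
import pymysql

DESIGNATE_MASTER = "172.16.0.1"  # hypothetical; use the source IP designate really connects from

conn = pymysql.connect(host="localhost", user="pdns", password="...", database="pdns")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT name, master FROM domains")
        for name, master in cur.fetchall():
            print(f"{name}: master={master!r}")      # what each zone currently expects
        cur.execute("UPDATE domains SET master = %s", (DESIGNATE_MASTER,))  # the manual step from the docs
    conn.commit()
finally:
    conn.close()
```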
[14:12:31] and designate needs to connect to pdns both on the same node, and on a different nod [14:12:41] that's right [14:12:58] and currently it's using the private IP for the same node, and the public IP for the other node [14:12:59] and when traffic happens between nodes it uses the (newish) .private ip [14:13:07] nope, other way 'round [14:13:10] right [14:13:14] public IP for same node, private IP for other node [14:13:50] this is internal openstack traffic, so ideally it should use the new private subnet (I think) [14:14:03] definitely agree [14:14:03] which it does across nodes, but it doesn't work on the same node [14:14:08] But that's where my knowledge ends and 'cathal said so' begins [14:14:19] yep, that's correct [14:15:07] so the upshot being that the master records are different between the two hosts, since 'myself' has a different address depending on if local or remote [14:15:35] which is somewhat terrible partly because 1) it's inconsistent and 2) designate isn't really set up to make that distinction [14:15:41] random thought: could it use 127.0.0.1 for "myself" if the .private IP is not working? [14:15:43] although somehow 2) turns out to not be a big problem [14:16:22] though ideally we would set exactly the same 2 private IPs on both hosts [14:16:23] dhinus: so right now we're talking about the master record in pdns, which means "expect traffic from this IP" [14:16:38] It's in designate where we set "send updates to these IPs" [14:16:46] gotcha [14:17:06] (in the meantime, cloudservices1006 is being reimaged) [14:17:09] Where we can only have one set of destination IPs. [14:17:41] dhinus: I think this is the issue you were talking about? [14:17:42] https://phabricator.wikimedia.org/T346385 [14:18:19] I think it is [14:18:19] topranks: similar but not the same, I believe that one is resolved [14:18:23] the issue was the IP address used for local comms within the same host [14:18:24] oh ok [14:18:53] like you said above "it's using the private IP for the same node, and the public IP for the other node" [14:19:08] yeah it seems to me it's describing the same problem [14:19:16] the question as to whether we can control what IP it uses to talk to itself I don't know the answer to [14:19:27] So in pools.yaml we have things set to [14:19:31] https://www.irccloud.com/pastebin/KXgX4nLc/ [14:19:39] it's a question of it the software making the requests has a configuration option we could use [14:19:44] nothing 10. there (which it was, previously, which is what that bug was about as I understand it) [14:19:57] I think they are the nodes it connects to though [14:20:01] i.e. the remote IPs [14:20:07] this is when it sends a packet to 127.0.0.1 [14:20:17] and what source IP it decides to put in the source field of that packet [14:20:31] My recollection is if I'm literally ON 172.20.1.5 and I telnet to 172.20.1.5 the originating IP is the 185 IP instead [14:20:55] or sorry - maybe the dest isn't 127.0.0.1 (re-reading the tticket ) [14:21:14] but yeah when it sends it to its own IP what controls the IP? when it's to a remote the outbound interface controls the source IP by default [14:21:55] hm, ok sorry, I guess that ticket maybe used to refer to a different issue (the 10. misconfiguration) but now it describes today's problem :) [14:23:07] rereading that... [14:23:43] telnet it used the private one when I tried [14:23:46] https://www.irccloud.com/pastebin/b4aBCBe2/ [14:24:51] oh! 
Then I know even less about what's happening :/ [14:26:08] surely this is going to end with me reading designate code... but maybe we can experiment a bit more in codfw1dev [14:26:33] dhinus: the designate config for all this is in /etc/designate/pools.yaml [14:26:45] topranks: what was the exact telnet command that resulted in that tcpdump? [14:27:11] "telnet 172.20.2.4" - which I think is wrong [14:27:18] It's confusing because there's "contact X host for updates" and "contact Y host to make sure things are updated" and "include host Z in record as SOA" so there's lots of duplication and the fields aren't named very well [14:27:42] "telnet 185.15.56.162" produced this: [14:27:58] https://www.irccloud.com/pastebin/dw7i2gdA/ [14:27:59] /and/ the pathway to create domains is different from the pathway to create records [14:28:15] So basically when the system connects to one of its own IPs, it uses the same IP as the source [14:28:33] When it connects out to the other one it uses the 172.x IP as that is the outgoing interface [14:28:51] wait, is ^ true even if you specify the public IP? [14:30:46] yep, for instance a ping [14:30:52] ok [14:30:53] https://www.irccloud.com/pastebin/1DIb2ANO/ [14:31:11] bear in mind that is for a new connection the host makes, and thus needs to choose which IP to use as source [14:31:28] If I connect to a host that's running a deamon on one of its IPs it will use that IP to reply always [14:32:51] andrewbogott: are the SQL values linked to pools.yaml? (i.e. does updating the yaml update the db?) or are they independent? [14:33:35] My understanding is that they are not updated for record updates but are inserted from pools.yaml when a new domain is created [14:33:48] ok [14:33:53] which means new domains may be a bit broken in our current setup [14:34:03] but mercifully new domains are pretty rare [14:34:41] doesn't each new project get a new domain/zone? [14:35:09] yes, but... [14:35:19] well, they get a new public zone which is seldom used [14:35:50] the private .wmcloud zone is kind of pseudo-zone iirc where the project name is part of the record [14:35:54] but I'd need to doublecheck that [14:36:39] I am very close to falling back on empiricism, 'the status quo seems to work ok' [14:36:42] even though I hate it [14:37:22] so dhinus topranks has convinced me that the problematic bit in pools.yaml is [14:37:24] https://www.irccloud.com/pastebin/tZ8O3yYQ/ [14:37:48] agreed, what happens if you put the private ips there? [14:38:03] good question, let's try it on codfw1dev [14:38:20] ns_records is pointing (correctly) to the public hostnames, maybe everything else can use the private ones [14:38:37] which kinda maps what I see in the example here https://github.com/openstack/designate/blob/master/etc/designate/pools.yaml.sample-multiple-pools [14:38:47] but our config is more complex so who knows :) [14:38:52] I assume it will try to connect to the private IPs, which are running the dns recursor and not the auth server [14:38:52] dhinus: want to learn how to update pools.yaml or should I just do it? [14:38:55] https://www.irccloud.com/pastebin/LjsW7WMU/ [14:39:09] oh right, do we even have private IPs for the auth server? 
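A quick way to repeat the source-address checks above (the telnet/ping plus tcpdump experiments) without capturing packets is to ask the kernel directly which local address it would pick for a given destination. A minimal sketch; the addresses are the example public/private IPs from the pastes and should be swapped for the host's real ones.

```python
# Report which source address the kernel selects for each destination.
# connect() on a UDP socket only chooses a route/source, it sends nothing,
# so this works even where nothing is listening on the port.
import socket

def source_ip_for(dest_ip, port=53):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest_ip, port))
        return s.getsockname()[0]
    finally:
        s.close()

for dest in ("185.15.56.162", "172.20.1.5", "172.20.2.4"):  # example IPs from the pastes above
    print(dest, "->", source_ip_for(dest))
```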
[14:39:28] yeah, topranks, you're right [14:39:42] no, because both servers listen on UDP 53 they can't both listen on all IPs (0.0.0.0), which would be ideal [14:39:50] so recursor is bound to private, auth bound to public [14:40:01] * andrewbogott nods [14:40:08] and auth needs to be public of course [14:40:15] the recursor is bound to the private VIP, not the host-specific address, iirc [14:40:27] sorry taavi very true [14:40:41] so does that mean... [14:40:43] * andrewbogott tries it [14:41:03] so I guess you could put the private IPs in pools.yaml. The connection will fail, but it might add the right allowed list to the DB [14:42:03] the comment above "nameservers:" in the example says "We use these to verify changes have propagated to all nameservers." so maybe they're used just for that [14:42:17] topranks: why will the connection fail? [14:42:35] (I mean, I can confirm that it will) [14:43:07] dhinus: that's why I figured it was OK to use the public IPs there... [14:43:25] but it seems like it must be using them for xfr as well [14:43:51] right, makes sense. let's see if it's true [14:46:07] andrewbogott: pdns is only listening on the public IPs [14:46:26] oh yeah, /that/ should be easy to fix [14:46:49] or maybe not [14:46:51] https://www.irccloud.com/pastebin/K0zZhNFC/ [14:46:57] yeah you can't change it there [14:47:07] you could have it listen on all IPs, but that would conflict with the recursor [14:47:14] grrrrrr [14:47:40] Ok, so this reminds me of a long-neglected task, "move the recursors to VMs" [14:47:54] which would solve /that/ problem which might ultimately solve the problem we're talking about now [14:48:09] Although I'm quite worried about chicken/egg issues with recursors on VMs. [14:49:26] dhinus: have we convinced ourselves that the current dumb config is the only possible config yet? :D [14:49:46] I'm definitely convinced it's complicated :D [14:51:02] for your reference, if you want to try changing pools.yaml in codfw1dev be warned that pools.yaml isn't live config, you have to do a 'designate-manage' command to have it read the changed config [14:51:24] ok [14:51:49] I remember a time years ago when I complained to the designate devs that they kept adding more and more services and "why are you making this so complicated?" [14:52:02] I guess that was before I knew that DNS has uncountably many edge cases [14:52:44] in any case I'm pretty sure I know how to make this work as well post-reimage as it worked pre-reimage [14:52:57] what if we always put both public and both private in the allowlist? [14:53:09] i.e. https://phabricator.wikimedia.org/T346385#9182344 [14:54:44] iirc that makes pdns throw a bunch of "I'm upset because I'm not getting updates from xxx.xxx.xxx.xxxx" warnings [14:54:52] I see [14:54:54] but maybe that's harmless? I don't remember for sure if it actually broke things [14:55:27] I guess that would still require a manual SQL query every time you add a new domain? [14:55:57] likely [14:59:18] taavi: (unrelated) do you have a second to help me understand what's happening with the web proxy API? I'm getting failures when creating things in horizon and iirc you're the current expert on that code [14:59:34] sure [15:00:01] it says 'Unrecognized schema in response body. (HTTP 403) ' [15:00:13] which I guess I should try reading the logfiles before bugging you [15:00:15] in eqiad or codfw? [15:00:57] eqiad [15:01:16] trying to create proxies in tf-infra-test [15:02:43] dammit is there a way to make journalctl wrap lines? [15:03:11] Oh! 
it's upset because the designate server is down :) [15:03:21] taavi: I guess nevermind for the moment [15:03:22] oh, might be! [15:03:38] although why is the api down? is the other cloudservices node not serving it? [15:03:47] it ought to be [15:04:34] 'openstack zone list' sure works [15:05:10] does the proxy API use keystone catalog discovery? [15:05:17] yes [15:05:46] maybe it doesn't like that 1006 is "half up"? (I'm initializing the db after the reimage) [15:06:21] maybe but I should see cli failures too in that case [15:07:29] did you get failures when listing things or when editing things? [15:08:16] editing [15:08:55] works for me, I just created tf-infra-test-taavi.wmcloud.org [15:10:31] me tries the script again [15:11:46] hmmm [15:11:57] previously I couldn't create anything [15:12:17] now I just can't create the one thing I want to create, tf-infra-test.wmcloud.org which the terraform test suite seems to expect [15:12:55] which is understandably an edge case [15:17:44] the db in cloudservices1006 should be ok [15:18:07] I imported the schema from /usr/share/pdns, then imported a dump from 1005 [15:18:44] and then changed the master records? [15:18:54] I haven't no [15:19:00] andrewbogott: does that name exist in the designate database? [15:19:31] / in DNs [15:20:34] seems yes [15:20:39] let's see if I can remove it [15:21:09] yeah that would be it [15:21:50] andrewbogott: I've updated the master records with "update domains set master..." [15:22:03] dhinus: great! [15:22:06] so possibly you're done [15:22:12] we'll see what works and what doesn't [15:22:38] sigh, the logs very helpfully indicated the problem, it's just that the log format is horrible to use [15:22:39] Rejecting can_use_hostname (tf-infra-test tf-infra-test.wmcloud.org.), found existing records: tf-infra-test.wmcloud.org. [15:24:18] oh yeah, there it is :/ [15:24:28] taavi: working now and tf tests seem happier [15:24:29] thanks [15:24:39] great [15:28:08] hi all anyone around who is comfortable with wikireplicas? [15:28:32] lol no one is /actually/ comfortable. I think milimetric is the closest [15:28:45] (who is probably in a data-engineering channel) [15:29:09] he might send you back to us depending on the question [15:29:29] well is something sdimple i want to deploy a change restart mysql and ensure things are not broken [15:29:50] the change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961829/ [15:31:55] you have a +1 from ben and he knows as much as anyone. If you want to test things after the fact https://superset.wmcloud.org/superset/ would be a good tool for that [15:32:02] jbond: what else can I tell you? [15:32:14] andrewbogott: ack sgtm [15:32:23] i have pinged ben [15:32:42] andrewbogott: while i have you there is also https://gerrit.wikimedia.org/r/c/operations/puppet/+/968665/5 [15:35:31] jbond: looks ok other than the apparent typo taavi flagged. I can merge and apply when it's ready [15:40:51] btw, jbond, thanks for all the work you've done with us over the years. You've been an enormous help and I'm very sorry to see you go! [15:41:24] thanks andrewbogott :) [15:47:21] jbond: going to wait to merge that until post-meeting just in case [15:50:33] andrewbogott: yes sgtm there is also the wikireplicas one which needs merging first [16:12:25] hm, I will see if I can restart mariadb on clouddb1021, seems to have crashed [16:14:09] seems like it's running but the unit shows a failed state [16:17:10] jbond: seems mariadb won't start on clouddb1021 now. 
Not sure if it's related to your change, I'm still investigating, but the timing is suspicious [16:18:39] andrewbogott: the main mariadb is not ment to start [16:18:46] you need to start the @s{1..8} versions [16:18:53] yeah... I wonder why it's alerting then :) [16:18:58] but I just figured that out [16:19:09] ann that would be me i tried to start it [16:19:14] just need to reset [16:19:31] oh great [16:19:44] i have done it it should clear [16:19:47] I think I made things worse in the meantime though [16:19:57] lol [16:20:46] seems like ps -ef | grep maria should be showing more than nothing [16:21:13] and yet puppet seems happy [16:21:28] jbond: if you have those systemctl commands handy can you start what needs starting? [16:21:28] andrewbogott: the binary is still mysqls [16:21:37] sudo systemctl status mariadb@s{1..8} [16:22:39] systemctl is happy but 'PROBLEM alert - clouddb1021/mysqld processes is CRITICAL ' [16:23:16] andrewbogott: i just started them all agaiun they had died about 5 mins ago [16:23:27] thanks [16:23:47] I'm going to disconnect before I make things even worse [16:24:06] :) [16:26:14] I have a bunch of work to do on my DD application so I'm going to call it a day for today [16:33:24] * andrewbogott waves [17:36:37] Is horizon down for others, or mostly me? [17:40:23] Rook: seems to be working for me, you mean in eqiad? [17:40:32] What are you seeing? [17:40:48] Yes, I'm getting a 504 when I try to login [17:41:20] I'll log out and in again [17:43:53] yeah, seems to be failing for me too. moritzm, any chance that's related idm changes? I'll also see what I can see, just covering all options [18:18:37] seems so, at least it's the 2fa that's failing [18:22:33] 2fa in Horizon should be only about talking to wikitech's action api [18:22:55] unless you've changed that recently? [18:26:31] Hm... I'm used to thinking that the 2fa on wikitech is another random wmcs-only feature. Is that wrong and it's normal now? [18:26:42] but yeah, it's not in osm [18:27:35] It's just https://www.mediawiki.org/wiki/Extension:OATHAuth [18:28:07] ok, then I guess I have to look elsewhere for this breakage. [18:28:30] I have 2FA on all of my staff wiki accounts (SUL, officewiki, foundationwiki) [18:29:38] heh, I'm so used to wikitech being weird that I don't notice when it's normal. [18:31:04] I'm about to start reviewing functionality for the IdM to see what if anything is left before we can rip out OpenStackManager (would still need LDAPAuth). [18:32:26] hm, striker logins are also broken but in a slightly different way... [18:33:08] oh, well, it's probably not actually different [18:37:48] for me striker is hanging at login. it could be related to oathauth as there is an api check before prompting for 2fa [18:38:05] * bd808 tries to see if there are error logs [18:40:25] lots of "Bad OAuth request from {ip}" and "{key} exists, so nonce has been used by this consumer+token" [18:40:36] that could be memcache problems [18:41:44] we would have just gotten the 1.42.0-wmf.4 train today too, right? [18:42:09] yeah. wikitech is in group1 [18:43:15] you're seeing that in the wiki logs or the striker logs? [18:43:32] think I should just bounce memcached on the cloudwebs? 
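Before bouncing memcached, a quick round-trip from one of the cloudwebs can at least separate "memcached is unreachable" from "the cached data is bad". A sketch assuming the pymemcache library and the default memcached port 11211; wikitech actually reaches memcached through nutcracker, so the host and port here are assumptions.

```python
# Minimal memcached health check: set a key, read it back, print uptime.
# Host/port are assumptions (11211 is the memcached default).
from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211), connect_timeout=2, timeout=2)
mc.set("healthcheck:test", b"ok", expire=30)
print("get:", mc.get("healthcheck:test"))    # expect b'ok'
print("uptime:", mc.stats().get(b"uptime"))  # seconds the server has been up
```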
[18:44:01] wiki logs via https://logstash.wikimedia.org/app/dashboards#/view/AWw-hnwr0jm7pOHZUFzK [18:44:22] I think its likely that something in the train busted wikitech oathauth [18:45:48] "Retrying due to nonce error, see https://phabricator.wikimedia.org/T106066" [18:45:55] T106066 [18:46:36] I looked at that but didn't clue into it being a new warning. The ticket hasn't been touched in > a year [18:46:45] there might have been mcrouter config changes causing issues per T106066 [18:48:30] Stashbot's error logs are full of "2023-11-08T17:43:10Z mwclient.client WARNING : Retrying due to nonce error, seehttps://phabricator.wikimedia.org/T106066" now too. [18:49:08] that may have been why the bot looked dead before [18:50:47] * andrewbogott tries to summon timo [18:50:55] o/ [18:51:02] hi! [18:51:06] https://github.com/wikimedia/operations-mediawiki-config/commits/master doesn't seem to have anything relevant other than the group1 promotion [18:51:33] what's the observable problem we're talking about? [18:51:34] Krinkle, as of https://logstash.wikimedia.org/app/dashboards#/view/AWw-hnwr0jm7pOHZUFzK we can't use 2fa on wikitech anymore [18:51:40] lots of memcached complaints [18:51:41] https://logstash.wikimedia.org/app/dashboards#/view/AWw-hnwr0jm7pOHZUFzK [18:51:55] whoops, sorry, same paste twice [18:52:05] what I mean is, 'as of today' presumably the train roll-out [18:52:46] It's not obvious what broke it but I'm guessing you're the expert on debugging mcrounter/memcached things [18:52:50] "PHP Notice: EtcdConfig using stale data: lost lock in /srv/mediawiki/php-1.42.0-wmf.4/includes/config/EtcdConfig.php on line 218" is emdemic too [18:53:26] hm, why would etcd be involved at all? [18:53:56] database routing and I think some service things are in etcd [18:54:17] ok [18:54:23] EtcdConfig is fine, that's new logging for prexisting and expected behaviour [18:55:09] great, that's one less dead end for us to follow :) [18:55:24] we cache etcd data for 10s, then we acquire a graceful lock in apcu from the next web request to refresh the data, which takes about 1s during which concurrent requests re-use the previous data until the new one is ready. [18:55:38] long version: https://phabricator.wikimedia.org/T349376#9303579 [18:55:51] short version: it's expected, but we should not log unless it's been more than N seconds since the last refresh [18:56:14] what error am I looking for in relation to 2FA failure? [18:56:24] > Retrying due to nonce error, see https://phabricator.wikimedia.org/T106066 [18:56:25] this one? [18:56:58] * Krinkle clicks (-) on level:INFO and level:NOTICE [18:57:13] Krinkle: the primary problem is inability to log in on wikitech and striker. That seems to be caused by a timeout when validating the 2fa token against wikitech's api [18:57:35] the nonce error is suspected to be related [18:57:59] bd808: that's about what we know so far, right? [18:58:07] https://logstash.wikimedia.org/goto/9515974b7d17d0e4666ab57a88dcd006 -- started at 09:00 UTC today. that's when the train ran [18:58:15] is the wikitech database primary eqiad or codfw? [18:58:21] i.e. did it follow the switchover? [18:59:08] it is group1 today, and labswiki is in group1 [18:59:25] the nonce error started about 6 hours ago [18:59:35] ~7h now [18:59:51] Krinkle: as far as I know there is no wikitech db in codfw [19:00:32] it should be a "normal" s6 db these days I thought? [19:00:32] mcrouter-primary-dc did switch to codfw, but that shoudl be fine to access from eqiad, and we do for various use cases (i.e. 
rate limiter, CA tokens) [19:00:44] does wikitech still use nutcracker? [19:00:51] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/db-production.php#L132 [19:00:56] sorry, I'm frantically trying to eat lunch and saying lots of wrong things [19:01:25] switchign that to mcrouter, even in keeping with its own custom memcache host, might make for a small incremental improvement towards less novelty [19:01:31] Krinkle: we might actually still use nutcracker, yes. Let's figure out how to check that. [19:01:56] it's been like 5y since we used that in prod so not a lot of knowledge around that [19:02:13] there was some long ago reason why we didn't migrate, but I think it sort of boiled down to "nobody who knew what to do helped" [19:02:36] > Retrying due to nonce error [19:02:43] This is not logged by MediaWIki, but by striker. [19:04:22] https://logstash.wikimedia.org/goto/b61bb63672a331b31c1e533b175b252d [19:04:37] ok, so we need to correlate it with MW debug messages prior to the striker message [19:04:50] "include ::profile::openstack::eqiad1::nutcracker" -- we are using nutcracker [19:05:10] we have mcrouter and nutcracker both running, I'm not 100% sure that nutcracker is used by wikitech and not just horizon [19:05:22] I'm still trying to track that down [19:05:37] comment in the role says "# Currently we run both nutcracker and mcrouter. Nutcracker is for [19:05:37] # the soon-to-be-moved wikitech/mediawiki install; [19:05:37] # Mcrouter is used by Horizon. They share the same [19:05:37] # memcached backends." [19:05:58] ok then [19:06:09] * andrewbogott sad about 'soon-to-be-moved' [19:06:51] Are both of those transparent-ish memcached proxies? Like, could we just change the ports to point at mcrouter and be done? [19:07:00] Or more likely I guess move mcrouter to where mw is looking [19:08:12] andrewbogott: mw is currently configured to "if wikitech" connect to the port we use for nutcracker instead of the port where appservers run mcrouter. [19:08:26] quite possibly you already run mcrouter on the same port as appservers [19:08:36] ok. [19:08:44] the difference if I remember is that nutcracker also routes for redis and wikitech still uses redis to cache sessions and some other stuff [19:08:49] I'm not actually proposing we change this unless it will fix today's breakage [19:08:57] yeah, not likely the cause. [19:09:14] so the messages we have in logstash before each error from striker is: [19:09:19] > Bad OAuth request from {ip} [19:09:32] > OAUTH:labswiki:nonce:*:oauth_token=*&oauth_token_secret=[REDACTED]:* exists, so nonce has been used by this consumer+token [19:09:37] > Persisting session for unknown reason [19:09:42] those three basically [19:10:25] The "Bad OAuth request from {ip}" one, if expanded has "Nonce already used: xxxxx" as reason [19:10:32] fwiw I have not tried clearing memcached yet, so lmk if/when you start to think this is just a poisoned cache issue [19:10:39] given it can find the data in memcached, suggests that maybe it's not a memcached problem [19:10:56] if the error was that it seemed that tokens are invalid / not stored in memcached, that woudl seem more likely [19:11:15] * andrewbogott nods [19:11:32] I'm guessing nothing changed in striker in the last 12 hours? 
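The reports above (striker, stashbot, Horizon) all reach wikitech's action API through mwclient, signing each request with OAuth, so a small standalone probe can reproduce the failure outside those services. A sketch of such a probe, assuming owner-only consumer/access tokens; the four credential values are placeholders.

```python
# Sketch of an owner-only OAuth check against wikitech's action API, the same
# style of authenticated request striker and stashbot make through mwclient.
import mwclient

site = mwclient.Site(
    "wikitech.wikimedia.org",
    scheme="https",
    consumer_token="...",      # owner-only consumer key (placeholder)
    consumer_secret="...",
    access_token="...",
    access_secret="...",
)
# The Site() constructor already fires the siteinfo|userinfo query that was
# failing; if OAuth is broken it logs "Retrying due to nonce error".
ui = site.api("query", meta="userinfo", uiprop="groups|rights")
print(ui["query"]["userinfo"])
```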
[19:11:54] nope [19:11:57] nope, and horizon also stopped working [19:12:06] and not explicit changes to OATHAuth in https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.4 [19:12:27] and Stashbot is also failing to auth to wikitech with no codebase changes [19:14:09] i see, so that's why stashbot went MIA [19:17:13] * Krinkle invites MatmaRex [19:17:33] MatmaRex: https://logstash.wikimedia.org/goto/b61bb63672a331b31c1e533b175b252d [19:17:50] this shows log messages from mediawiki and from striker (toolsadmin) from cloudweb hosts (mixed). [19:18:07] the observable problem is "Retrying due to nonce error" as logged by striker. [19:18:16] um, hi [19:18:18] note that wikitech/cloudweb is in eqiad (no codfw equiv) [19:18:19] * MatmaRex reads channel logs [19:18:31] bryan and andrew suspect a memcached issue [19:18:49] > andrewbogott: lots of memcached complaints [19:18:53] what did this refer to btw? [19:19:06] Krinkle: just the nonce log messages [19:19:24] since when are things broken? [19:19:30] ~7h ago [19:19:40] andrewbogott: striker.py saying "Retrying due to nonce error" is a memcached complaint? [19:20:01] MatmaRex: pretty much exactly aligned with the 1.42.0-wmf.4 group1 deploy window [19:20:05] best to assume I don't know anything at this point [19:20:48] I don't know striker's code base but I'm assuming it logs "Retrying due to nonce error" in response to its HTTP oauth request to Wikitech api.php being denied [19:21:11] Krinkle: yeah, its a message from the mwclient python library. Let me see if I can track down more. [19:21:13] so it's 99% a wmf.4 bug? i can go read the changes [19:21:23] https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.4 [19:22:07] wmf.4 + mwclient seems likely at the moment, although I think horizon uses raw requests calls to talk to the action api [19:22:22] No changes: LdapAuthentication, OATHAuth, OAuth, OpenStackManager, WebAuthn [19:23:35] here's where the none error log messages come from -- https://github.com/mwclient/mwclient/blob/98af129e2af5475d75caf464ac4fae332e8f35fc/mwclient/client.py#L389-L397 [19:24:09] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/zed/keystone/wmfkeystoneauth/wikitechclient.py <- what Horizon does [19:24:36] ah, so it does use mwclient too. [19:24:47] i know that there have been recent changes in CentralAuth memcached use https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/960643 and in mcrouter config https://phabricator.wikimedia.org/T342201#9308357 . this is probably not relevant, right? [19:25:07] CentralAuth prolly not, unless it's already enalbed on wiktiech? [19:25:30] Krinkle: CA is not installes on wikitech [19:25:33] (i honestly don't know how OAuth interacts with anything, and how wikitech interacts with anything :) ) [19:25:33] *installed [19:25:42] mcrouter isn't used yet either, wikitech is still on nutcracker with its own memcache host (not mcrouter). [19:25:57] the timing with _joe_ 's change to mcrouter-primary-dc is interesting thoguh [19:26:09] it added cross-dc support for mw-on-k8s to connect to mcrouter-primary-dc [19:26:23] but wikitech isn't served from k8s right? or does it have a small % already? [19:26:48] that was rolled out like 7h ago [19:26:48] not served fvrom k8s. 
only from cloudweb100* hosts [19:27:07] * bd808 wishes that were not the case [19:27:41] <_joe_> Krinkle: 0% for now, something might change in a near future [19:29:39] The action api query that is failing per the logs is https://wikitech.wikimedia.org/w/api.php?action=query&continue=&format=json&meta=siteinfo%7Cuserinfo%7Cuserinfo&siprop=general%7Cnamespaces&uiprop=groups%7Crights%7Cblockinfo%7Chasmsg -- and that works fine without OAuth being involved [19:30:20] right, it preusmably wants the 'userinfo' part there for the currently loggedin user [19:30:57] that's part of the mwclient.Site() constructor actions [19:31:23] first error: Nov 8, 2023 @ 12:35:01.406 [19:31:49] that's 35min after group1 https://sal.toolforge.org/production?p=1&q=&d=2023-11-08 [19:31:52] so... [19:31:55] all of the things that are failing are using mwclient to authenticate to wikitech via OAuth owner-only tokens [19:32:07] well, can we rollback wikitech only to the group2/previous version? [19:32:12] at least to rule it out? [19:32:25] i.e. same as enwiki is running still [19:32:57] it's technically possible. I don't know how thcipriani and co feel about partial group changes these days [19:33:29] MatmaRex: any other hunches? [19:33:29] we could hack it locally on cloudweb* too [19:33:45] sure, let's do that first. in wikiversions.json [19:34:05] afaik rolling back some wikis is not uncommon, esp if you're not aware of a specific dependency on this weeks's train roll-out in the past 12 hours [19:34:15] Krinkle: none [19:35:12] bd808: I think most of the striker errors are cascading errors, not original ones. I don't know why it retries, but the error most of them are reporting is that the nonce token was already consumed, and from a quick glance it seems striker/mwclient is responding to that by waiting 2s and then trying it again, thus consuming the same nonce a 3rd and 4th time etc [19:35:18] I stand by ready to hack on cloudwebs but will need more info [19:35:39] I'm not seeing any errors in MW other than "nonce is invalid, already used" [19:35:49] so I can explain all errors but don't know how it started [19:35:57] has the process been restarted? [19:36:17] stashbot has twice. not sure about anything else [19:36:25] ah right, stashbot affected too [19:36:47] alright, can one of you pin wikitech to prev mw branch? [19:36:58] * Krinkle back in 10min [19:38:01] I think I can. It should just be editing /srv/mediawiki/wikiversions.php on both hosts [19:41:22] ok, wikitech is 1.42.0-wmf.3 again now. [19:42:02] and I can login to https://toolsadmin.wikimedia.org/ again [19:42:20] so it is something in 1.42.0-wmf.4 causing the issue [19:43:21] * bd808 steps away to find a sandwich [19:43:43] o/ can I help? [19:45:31] taavi: find out what is breaking OAuth owner-only auth for wikitech on 1.42.0-wmf.4 I guess. Figuring out if it is wikitech specific of OAuth generally would be a good step at this point. [19:45:39] * bd808 really goes to find food [19:46:33] I guess we're back to reviewing commit log [19:47:03] I'm guessing that there's no good way to bisect? [19:49:28] i'm not sure how annoying it is to bisect in production, i've done it on the beta cluster before [19:51:21] could you sync to mwdebug servers after every bisect step to test? [19:51:35] you can't test wikitech stuff on mwdebug unfortunately [19:54:33] There's a couple of login hook changes. Given no CentralAuth, that's pretty much the only changes I can see that vaguely relate to oauth [19:55:11] want to link me to the diffs? 
I can hand-apply them and see what breaks [19:55:28] oauth doesn't seem to use those hooks [19:55:38] taavi: hmm good point. I figured if it affected other wikis we'd have heard but maybe not [19:56:05] (i looked at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/969096) [19:56:45] (and at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/970771 . i don't think it's either of those) [20:02:05] Right, they seem specific to the GUI login page [20:04:31] https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.4 [20:04:59] Any backports not covered here? [20:05:35] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+log/refs/heads/wmf/1.42.0-wmf.4 [20:05:42] one language backport, and a couple of extension ones [20:05:56] nothing that looks immediately suspicious [20:07:02] https://gerrit.wikimedia.org/r/q/branch:wmf/1.42.0-wmf.4+is:merged yeah [20:09:27] Oauth logic is mostly in a library [20:09:43] Vendor updates aren't on that wiki page it seems [20:10:11] https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.4#Vendor lists that a parsoid bump was the only vendor change [20:11:05] Ah I see. Missed that [20:11:45] > Remove check for irrelevant setting: wgDisableCookieCheck (task T349791) [20:11:50] T349791: Remove UserLoginComplete hook calls with $direct === false - https://phabricator.wikimedia.org/T349791 [20:11:51] One more login related change [20:12:31] Using mwclient with owner-only oauth credentials seems to work fine against mediawiki.org which is on 1.42.0-wmf.4. [20:13:02] It's a stretch but given andrewbogott asked, maybe we can apply a couple of these and see what happens. (Or revert them out of wmf.4 on cloud web hosts in the other direction). [20:13:18] sure, which one should I start with? [20:15:01] are we sure that it actually has a problem with nonces already being used? the code throwing that error is somewhat sloppy [20:15:08] if ( !$this->nonceCache->add( $key, 1, $timestamp + 300 ) ) { [20:15:18] add() can probably fail for a dozen different reasons [20:15:58] MatmaRex: not sure at all. and anything that implicates memcache/mcrouter should be looked at [20:16:26] Ah, it's not actually asserting existence [20:16:31] Good spot [20:16:58] What backend does nonceCache use for labswiki? [20:17:29] i think that's wgMWOAuthNonceCacheType which is memcached-pecl [20:17:41] (well, that's what the config says. who knows what really happens) [20:18:21] If it failed, I believe memcbag would log something [20:18:46] Although DI around there isn't perfect for Logger so could be lost [20:19:00] is wikitech using mcrouter or nutcracker? [20:19:12] nutcracker we think [20:19:19] Nutcracker last i checked wmf config mc.php [20:19:54] It working after rollback and working in prod mediawiki.org is making it hard to come up with ideas [20:21:02] What are some extensions notably present or absent from wikitech that an innocent change might have an effect through on memc or auth? [20:21:12] It is possible that what fixed it was not the rollback itself but the restart that followed the rollback. Unlikely since there was /already/ a restart with the upgrade earlier today... [20:21:24] andrewbogott: there was no restart [20:21:36] ok then [20:21:38] how did the rollback apply then? [20:21:43] php reads config on each request [20:21:46] Restart of? [20:21:48] I guess because of how php works... 
[20:21:50] yeah [20:22:13] it's all front router magic :) [20:22:19] In prod fpm is configured to no longer check mtime on disk to recompile [20:22:36] Requires restart for changes to apply on most app servers [20:23:01] Special:Version on wikitech changed, that was what was desired [20:23:05] But cloud web probably doesn't reuse enough puppet for that to apply so it's probably closer to mwmaint [20:23:45] bd808: stash bot is fine too now? [20:24:14] T12345 [20:24:15] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345 [20:24:17] yup [20:25:28] I have a tiny script to test the mwclient authn. It is working against wmf.4 on mediawiki.org and wmf.3 on wikitech. [20:26:02] any objections to me putting wikitech back to wmf.4 to verify that the test script fails? [20:26:24] go for it [20:26:38] please do [20:27:28] wikitech is back to wmf.4 [20:27:49] "WARNING:mwclient.client:Retrying due to nonce error, seehttps://phabricator.wikimedia.org/T106066" [20:27:58] so it's back [20:28:04] Okay, so no anti-climax [20:28:20] * andrewbogott cheers + also boos reproducibility [20:29:11] do we still have a quick config switch for verbose as hell logging on wikitech? [20:29:13] * bd808 looks [20:29:57] eh, it's for the ldap channel [20:30:00] bd808: can you trace which requests Mw client makes? Seeing the one(s) it makes and it's responses before the failure (and comparing to a working version) maybe be telling [20:30:13] Ie from the Mw client POV [20:30:34] Headers and body or so [20:31:10] Krinkle: the request that fails is ` https://wikitech.wikimedia.org/w/api.php?meta=siteinfo%7Cuserinfo%7Cuserinfo&siprop=general%7Cnamespaces&uiprop=groups%7Crights%7Cblockinfo%7Chasmsg&continue=&action=query&format=json` with OAuth signing for authn [20:31:30] bd808: yeah, you can probably force "log" attribute in wmf config src WikimediaDebug to act as if it's always there, or rig the consumer in wmf config logging [20:32:53] bd808: right, but before that it's logging in somehow, and getting a response and then it prepares this request. [20:33:16] Maybe something during the login happens differently in a non fatal way already [20:33:20] no, that's the "login" this is OAuth so each request is signed to add authn [20:34:07] Ah so it's using a long term credential that requires no login per session [20:34:18] correct. there isn't an initial handshake to create an authenticated session. you add the authn header data to every request instead [20:34:21] And the same credential is valid on wmf.3 [20:34:29] correct [20:34:29] Interesting [20:34:52] What about the nonce though. That's allocated just before right? [20:36:30] hmmm.. mabye? where's that code that MatmaRex was point out as sloppy? [20:36:53] https://gerrit.wikimedia.org/g/mediawiki/extensions/OAuth/+/22c8cdbf35b8275c71ea59cba77b812cb7f61fcb/src/Backend/MWOAuthDataStore.php#148 [20:36:56] the nonce is generated by the client, it's there to prevent replay attacks [20:37:55] taavi: ah, right. [20:38:18] i'm still thinking that the add() may be failing for some random stupid reason, not because the key already exists [20:38:24] ok, so let's hack some state inspection into that failing method I guess [20:38:33] i've been trying to read that code to see what it does, and it does a million things [20:39:30] I guess we should be able to use shell.php to poke at $this->nonceCache->add() directly? 
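For what it's worth, the add() being poked at here is the standard memcached "store only if the key does not exist yet" operation, which is what makes it usable as a replay guard. A quick illustration of that semantics from Python, assuming pymemcache and a reachable memcached (host/port are placeholders):

```python
# First add of a nonce key succeeds; a second add of the same key is refused
# (NOT_STORED), which OAuth reports as "nonce has been used".
from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211))
key = "OAUTH:labswiki:nonce:example"

print(mc.add(key, b"1", expire=300, noreply=False))  # True  -> nonce accepted
print(mc.add(key, b"1", expire=300, noreply=False))  # False -> replay detected
```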
[20:39:50] https://gerrit.wikimedia.org/g/mediawiki/core/+/f869de7aad176ba8ad083cbcd46538e82bf3b4e0/includes/libs/objectcache/MemcachedPeclBagOStuff.php#302 whatever these error codes are, any of them will cause the add() to fail, and that's just one of the things [20:40:16] https://wikitech.wikimedia.org/w/index.php?title=Debugging_in_production&mobileaction=toggle_view_desktop#Ad-hoc_log_messages [20:40:41] let's hack the memcached logs to be more verbose on a cloudweb server? [20:41:06] I think bd808 was enabling verbose log so it does that for all channels [20:41:23] Then we can check Logstash for the next Mw client call [20:41:41] It indeed already has some logging in code [20:41:44] core* [20:41:51] I didn't turn up logging yet, but yeah that's maybe a good next step [20:42:33] if anyone else wants to drive, my brain is feeling out of practice at this stuff today [20:45:22] ok, wikitech should be acting like XWikimediaDebug logging is turned on for all requests [20:45:41] I jsut set `} elseif ( /* XWikimediaDebug::getInstance()->hasOption( 'log' ) */ true ) {` in logging.php [20:46:27] * bd808 fires off the failing test script [20:50:46] I've got a meeting with cdanis at the hour [20:50:49] it logged "MemcachedPeclBagOStuff debug: result: NOT STORED" [20:51:11] https://logstash.wikimedia.org/goto/ee4248306b5d04268c27b6d11abb065a [20:51:59] https://www.php.net/manual/en/memcached.add.php "will return Memcached::RES_NOTSTORED if the key already exists" [20:52:10] so i guess it really is seeing the nonce twice? [20:53:27] cdanis meeting cancelled from his side so I can keep "helping" here [20:54:02] becauseā€¦ it is trying to add the same key twice, in the same request [20:54:19] (search for "MemcachedPeclBagOStuff debug: add" in that link) [20:54:20] Right, so not a false false but actually being triggered twice [20:54:42] During the same reqId? [20:55:23] yeah -- https://logstash.wikimedia.org/goto/a6bc5a26a42fd2c6e99686dc5e9d6d56 [20:56:58] in between the two add() calls it logs the noce has been used warning [20:57:03] *nonce [20:59:38] want to make it log wfBacktrace() in there somewhere? i don't see why it's being called twice [21:00:47] MatmaRex: in lookup_nonce? [21:01:12] yeah [21:01:23] unless that's expected, and i'm misunderstanding this [21:07:31] ok, let's run the test again to see if that logging add worked... [21:07:45] heh. probably not [21:08:08] requests.exceptions.HTTPError: 500 Server Error: Internal Server Erro [21:10:31] swat window is clobbering things... [21:11:06] swat should be done soon I think [21:17:28] MatmaRex: https://logstash.wikimedia.org/goto/a6e63fafa376b188c29c78186dbb6e04 [21:19:13] ugh, i forgot that function returns html [21:19:23] hmm, just one backtrace? [21:19:53] apparently? Is the other message still doubled? [21:20:33] yeah [21:20:39] oh, did you add it inside the if() ? [21:20:54] i know wikitech 2fa was not working there for a bit, but are there any other issues with vps auth atm? i'm having trouble getting into an instance i just spun up. ssh is running on the host but it just hangs up upon connection [21:21:18] but um, anyway, the backtrace looks nasty [21:21:33] it calls MediaWiki\Session\SessionManager->getSessionForRequest(), then a couple other functions, and then MediaWiki\Session\SessionManager->getSessionForRequest() again [21:21:36] that can't be right [21:21:51] MatmaRex: oh, yeah I did put it just after the existing log for the bad state [21:22:01] dduvall: no, there are no other known issues. 
I can look at that, but let's move to -cloud to keep this channel clutter-free? [21:22:16] taavi: sure, thanks [21:24:23] * bd808 tries again with the backtrace in a different spot [21:25:42] MatmaRex: https://logstash.wikimedia.org/goto/1a7587292be549c908da95e8b3f8fd91 [21:26:08] I put the backtrace at the top of lookup_nonce this time [21:28:49] right [21:29:02] so it looks like it's trying to get the login session while getting the login session [21:29:47] probably caused by this change? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/966657/10/includes/user/User.php [21:29:51] that isAllowed() call is in the backtrace [21:30:17] i'm not sure what is going on here, though (or whether we can revert it) [21:32:43] MatmaRex: let me hack that check out (I'll make it always `$isExempt = false;`) and check again. If that fixes then we can think harder. [21:33:32] If that's the cause, I'd say obtain a similar trace in wmf.3 for reference and then dump on a train block task for Tim to look at in a few hours [21:34:43] maybe this check needs to be done in a different way: https://gerrit.wikimedia.org/g/mediawiki/extensions/OAuth/+/22c8cdbf35b8275c71ea59cba77b812cb7f61fcb/src/SessionProvider.php#230 [21:35:13] tim changed some core code accessing BlockDisablesLogin in that patch [21:35:52] hm, could be, wikitech is one of the few prod wikis with $wgBlockDisablesLogin as true [21:36:02] so in theory it should also be broken on all private wikis? [21:36:05] As is Office wiki [21:36:07] https://logstash.wikimedia.org/goto/4c9288bf8780a7983d44b69aaba793e6 -- hardwiring `$isExempt = false;` fixes it [21:40:34] maybe we actually should just change that `$localUser->getBlock() ` to `$localUser->getBlock( Authority::READ_NORMAL, /*$disableIpBlockExemptChecking*/ true ) ` [21:40:44] becauseā€¦ "@param bool $disableIpBlockExemptChecking This is used internally to prevent a infinite recursion with autopromote. See T270145." [21:41:05] and we also have an infinite recursion here. it just crashes in the 2nd loop, because of the nonce being already used [21:41:58] MatmaRex: you are talking about this line, correct? -- https://gerrit.wikimedia.org/g/mediawiki/extensions/OAuth/+/22c8cdbf35b8275c71ea59cba77b812cb7f61fcb/src/SessionProvider.php#230 [21:42:08] yes [21:42:19] ok. I'll give it a shot :) [21:43:33] that's basically the same as setting `$isExempt = false;`, like you did earlier [21:43:43] *nod* [21:45:45] MatmaRex: what `use` statement do I need for that `Authority::READ_NORMAL`? [21:46:30] use MediaWiki\Permissions\Authority; [21:46:34] ack [21:47:21] yup. that fixes it too [21:47:48] MatmaRex: you should get the honors of the patch and backport. and thank you! [21:48:36] i'm not *that* confident that this is the correct fix :D [21:49:06] i mean, it's not too wrong. in the worst case, block-exempt IP-blocked users won't be able to login in [21:49:52] but there are several things here i don't understand. 
why does OAuth code have to check this, why does checking this try to access the current session, and so on [21:51:07] unstubbing the User while unstubbing the User is a classic authmanager blunder, so I think you are at least on the right trail [21:51:40] and oauth is a place that finds it exactly because of the replay attack protection that was tripping here [21:56:38] MatmaRex: I bet Gergo and Tim could figure out on review if it is correct to hack that way or not [21:59:58] probably [22:49:37] bd808: without blocking train, Thu evening will break wikitech again as we can't stay on the old version [22:50:08] But yeah I suppose we could roll forward to enwiki without issue. [22:51:09] Krinkle: things will break on wikitech as soon as a scap touches extensions/OAuth/src/SessionProvider.php, but I don't know if that's a train holding condition or not. I'll defer to experts. [22:51:55] bd808: ah you mean because the wikiversions change isn't committed [22:51:59] wikitech is back on wmf.4 with a local hack at the moment [22:52:25] and yes, none of it is committed work [22:52:27] That's worth committing properly I think. (Incl json file) [22:52:52] By tomorrow Gergo/Tim will hopefully review fix and/or deem the cause safe to revert [22:53:21] I was hoping MatmaRex would write the patch and have them review it. I guess I could sa he seems to have vanished form irc. [22:53:23] Via Gerrit that is [22:53:43] It's getting late for him/me/Gergo. [22:54:14] I'll post a patch in gerrit then and add folks to review [23:29:46] sorry missed a lot of scrollback. Our latest abstractions don't handle partial group rollbacks well at the moment. But "scap train" what we run now is wrapping the same underlying mechanisms, so it's still possible. We wouldn't want it to linger beyond the end of the week (and in general it's nice to avoid when we can). And if we need to do that, we should link a task to the train [23:29:48] blocker https://phabricator.wikimedia.org/T350080 and leave a comment there for jnuche so he's aware in the morning (train's running in the European timezone this week since jnuche is conductor) [23:37:14] thcipriani: https://phabricator.wikimedia.org/T350080#9318024 :) [23:38:01] bd808: perfect, thanks! [23:52:16] bd808: clarification, wikitech says it's running wmf.4 at the moment, is that right? I saw something about local hacks. [23:52:53] thcipriani: it is running wmf.4 and it also has local hacks. Wikitech is weird and has 2 dedicated servers still. [23:53:09] it runs on cloudweb1003 and cloudweb1004 [23:53:36] ah, ok, the local hack isn't rolling back the wikiversion. But a rollout would clobber any local hacks. [23:54:14] correct. [23:55:03] we started with rolling back the wiki version to confirm it was a wmf.4 problem. Then we went back to wmf.4 and started working on a hack fix to figure out what really broke it. [23:55:46] I didn't really want to commit a wmf.3 partial rolloback because eww [23:56:39] so options are (afaict): (1) continue rollout, but exclude labswiki (clobbering local hacks, but obviating the need for them) oor (2) rollback to group0 and pause there. Since any subsequent deploy would clobber them. [23:57:31] or (3) hope no one touches files that contain the local hacks :P [23:57:38] (1) would need to first roll labswiki back to wmf.3, but otherwise yeah probably correct [23:58:11] I can put hacks back too if they get nuked. 
I kind of doubt they will though honestly [23:58:50] right re:(1) needs to ensure labswiki stays on wmf.3 [23:59:00] ok, thanks, just making sure my mental model is right. [23:59:36] ...one final dumb question. If we rollback to wmf.3, will those local hacks be ok?