[00:47:38] * bd808 off [11:11:44] XioNoX: topranks: just to confirm, amd I good to merge/deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/970850? [11:18:01] * dhinus paged: #wikimedia-cloud-admin ToolsToolsDBWritableState tools [11:18:42] hmm, OOM again? [11:18:52] hmm I can't SSH at the moment [11:18:59] I'm in [11:19:16] some DNS issue on my side apparently [11:19:43] please check the logs and if it's OOM give it a restart [11:20:40] no, something different this time [11:20:43] Nov 08 08:21:49 tools-db-1 mysqld[119707]: 2023-11-08 8:21:49 19727620 [Note] Detected table cache mutex contention at instance 1: 52% waits. Additional table cache instance activated. Number of instances after activation: 2. [11:20:53] mariadb automatically restarted, but came back read-only as expected [11:21:25] oh, no, that was a few hours ago [11:21:47] yeah it was an OOM kill [11:21:51] I'll set it as RW [11:22:01] done [11:24:56] thanks [11:37:00] something is wrong with my router's DNS relay... bypassing it I can ssh again [11:37:34] * dhinus out for lunch, back in a bit [12:31:29] hi all could some one review the following spicerack change and confirm WMCS cookbook sone use spicerack.puppet.get_puppet_ca_hostname() [12:31:35] https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/971957 [12:35:23] jbond: sure, replied [12:36:10] thanks taa [12:36:18] *thanks taavi [12:52:36] taavi: sure, go for it, let me know if it doesn't get accepted on any device (or the diff includes other bits) [12:53:24] taavi: for your comment, let me know if you would prefer read-only, but I'm find for super-user [13:00:01] XioNoX: thanks, I think the current super-user is fine [14:05:34] dhinus: in case you haven't already rebuilt cloudservices1006, I'm around until our meeting [14:06:33] I haven't, shall I start it now? I'm not sure I understand the implications of the master records problem [14:07:51] I updated the docs (which were right before but unclear) starting with 'You will also need to update pdns on all nodes to set the 'master' field. ' [14:07:57] I am pretty sure I know how to do it right this time [14:08:04] So yeah, let's do it [14:09:05] I'll start the reimage then [14:09:22] Also, the thing with the dump/import wiping the table means that we actually /can/ use the packaged schema after all. [14:09:23] I guess my question is: can we modify our setup so that we always use the same IP/interface? [14:09:39] dhinus: that's a question for topranks, I think he said no last time I asked [14:09:50] would sure be better though! [14:10:13] and possibly I didn't ask the question correctly last time :) [14:11:30] can you try explaining it in detail? this is about traffic from designate to pdns, right? 
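For reference, the docs step quoted above ("You will also need to update pdns on all nodes to set the 'master' field") comes down to editing the domains table in each node's pdns database. A minimal sketch of that step, assuming pymysql is available; the credentials and the designate source address below are placeholders, not the real cloudservices values.

```python
# Sketch: point every pdns zone's 'master' field at the address designate
# will actually send NOTIFY/AXFR traffic from. Placeholder values only.
import pymysql

DESIGNATE_MASTER = "172.16.0.1"  # hypothetical; use the source IP designate really connects from

conn = pymysql.connect(host="localhost", user="pdns", password="...", database="pdns")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT name, master FROM domains")
        for name, master in cur.fetchall():
            print(f"{name}: master={master!r}")      # what each zone currently expects
        cur.execute("UPDATE domains SET master = %s", (DESIGNATE_MASTER,))  # the manual step from the docs
    conn.commit()
finally:
    conn.close()
```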
[14:12:31] and designate needs to connect to pdns both on the same node, and on a different nod [14:12:41] that's right [14:12:58] and currently it's using the private IP for the same node, and the public IP for the other node [14:12:59] and when traffic happens between nodes it uses the (newish) .private ip [14:13:07] nope, other way 'round [14:13:10] right [14:13:14] public IP for same node, private IP for other node [14:13:50] this is internal openstack traffic, so ideally it should use the new private subnet (I think) [14:14:03] definitely agree [14:14:03] which it does across nodes, but it doesn't work on the same node [14:14:08] But that's where my knowledge ends and 'cathal said so' begins [14:14:19] yep, that's correct [14:15:07] so the upshot being that the master records are different between the two hosts, since 'myself' has a different address depending on if local or remote [14:15:35] which is somewhat terrible partly because 1) it's inconsistent and 2) designate isn't really set up to make that distinction [14:15:41] random thought: could it use 127.0.0.1 for "myself" if the .private IP is not working? [14:15:43] although somehow 2) turns out to not be a big problem [14:16:22] though ideally we would set exactly the same 2 private IPs on both hosts [14:16:23] dhinus: so right now we're talking about the master record in pdns, which means "expect traffic from this IP" [14:16:38] It's in designate where we set "send updates to these IPs" [14:16:46] gotcha [14:17:06] (in the meantime, cloudservices1006 is being reimaged) [14:17:09] Where we can only have one set of destination IPs. [14:17:41] dhinus: I think this is the issue you were talking about? [14:17:42] https://phabricator.wikimedia.org/T346385 [14:18:19] I think it is [14:18:19] topranks: similar but not the same, I believe that one is resolved [14:18:23] the issue was the IP address used for local comms within the same host [14:18:24] oh ok [14:18:53] like you said above "it's using the private IP for the same node, and the public IP for the other node" [14:19:08] yeah it seems to me it's describing the same problem [14:19:16] the question as to whether we can control what IP it uses to talk to itself I don't know the answer to [14:19:27] So in pools.yaml we have things set to [14:19:31] https://www.irccloud.com/pastebin/KXgX4nLc/ [14:19:39] it's a question of it the software making the requests has a configuration option we could use [14:19:44] nothing 10. there (which it was, previously, which is what that bug was about as I understand it) [14:19:57] I think they are the nodes it connects to though [14:20:01] i.e. the remote IPs [14:20:07] this is when it sends a packet to 127.0.0.1 [14:20:17] and what source IP it decides to put in the source field of that packet [14:20:31] My recollection is if I'm literally ON 172.20.1.5 and I telnet to 172.20.1.5 the originating IP is the 185 IP instead [14:20:55] or sorry - maybe the dest isn't 127.0.0.1 (re-reading the tticket ) [14:21:14] but yeah when it sends it to its own IP what controls the IP? when it's to a remote the outbound interface controls the source IP by default [14:21:55] hm, ok sorry, I guess that ticket maybe used to refer to a different issue (the 10. misconfiguration) but now it describes today's problem :) [14:23:07] rereading that... [14:23:43] telnet it used the private one when I tried [14:23:46] https://www.irccloud.com/pastebin/b4aBCBe2/ [14:24:51] oh! 
Then I know even less about what's happening :/ [14:26:08] surely this is going to end with me reading designate code... but maybe we can experiment a bit more in codfw1dev [14:26:33] dhinus: the designate config for all this is in /etc/designate/pools.yaml [14:26:45] topranks: what was the exact telnet command that resulted in that tcpdump? [14:27:11] "telnet 172.20.2.4" - which I think is wrong [14:27:18] It's confusing because there's "contact X host for updates" and "contact Y host to make sure things are updated" and "include host Z in record as SOA" so there's lots of duplication and the fields aren't named very well [14:27:42] "telnet 185.15.56.162" produced this: [14:27:58] https://www.irccloud.com/pastebin/dw7i2gdA/ [14:27:59] /and/ the pathway to create domains is different from the pathway to create records [14:28:15] So basically when the system connects to one of its own IPs, it uses the same IP as the source [14:28:33] When it connects out to the other one it uses the 172.x IP as that is the outgoing interface [14:28:51] wait, is ^ true even if you specify the public IP? [14:30:46] yep, for instance a ping [14:30:52] ok [14:30:53] https://www.irccloud.com/pastebin/1DIb2ANO/ [14:31:11] bear in mind that is for a new connection the host makes, and thus needs to choose which IP to use as source [14:31:28] If I connect to a host that's running a deamon on one of its IPs it will use that IP to reply always [14:32:51] andrewbogott: are the SQL values linked to pools.yaml? (i.e. does updating the yaml update the db?) or are they independent? [14:33:35] My understanding is that they are not updated for record updates but are inserted from pools.yaml when a new domain is created [14:33:48] ok [14:33:53] which means new domains may be a bit broken in our current setup [14:34:03] but mercifully new domains are pretty rare [14:34:41] doesn't each new project get a new domain/zone? [14:35:09] yes, but... [14:35:19] well, they get a new public zone which is seldom used [14:35:50] the private .wmcloud zone is kind of pseudo-zone iirc where the project name is part of the record [14:35:54] but I'd need to doublecheck that [14:36:39] I am very close to falling back on empiricism, 'the status quo seems to work ok' [14:36:42] even though I hate it [14:37:22] so dhinus topranks has convinced me that the problematic bit in pools.yaml is [14:37:24] https://www.irccloud.com/pastebin/tZ8O3yYQ/ [14:37:48] agreed, what happens if you put the private ips there? [14:38:03] good question, let's try it on codfw1dev [14:38:20] ns_records is pointing (correctly) to the public hostnames, maybe everything else can use the private ones [14:38:37] which kinda maps what I see in the example here https://github.com/openstack/designate/blob/master/etc/designate/pools.yaml.sample-multiple-pools [14:38:47] but our config is more complex so who knows :) [14:38:52] I assume it will try to connect to the private IPs, which are running the dns recursor and not the auth server [14:38:52] dhinus: want to learn how to update pools.yaml or should I just do it? [14:38:55] https://www.irccloud.com/pastebin/LjsW7WMU/ [14:39:09] oh right, do we even have private IPs for the auth server? 
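A quick way to repeat the source-address checks above (the telnet/ping plus tcpdump experiments) without capturing packets is to ask the kernel directly which local address it would pick for a given destination. A minimal sketch; the addresses are the example public/private IPs from the pastes and should be swapped for the host's real ones.

```python
# Report which source address the kernel selects for each destination.
# connect() on a UDP socket only chooses a route/source, it sends nothing,
# so this works even where nothing is listening on the port.
import socket

def source_ip_for(dest_ip, port=53):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest_ip, port))
        return s.getsockname()[0]
    finally:
        s.close()

for dest in ("185.15.56.162", "172.20.1.5", "172.20.2.4"):  # example IPs from the pastes above
    print(dest, "->", source_ip_for(dest))
```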
[14:39:28] yeah, topranks, you're right [14:39:42] no, because both servers listen on UDP 53 they can't both listen on all IPs (0.0.0.0), which would be ideal [14:39:50] so recursor is bound to private, auth bound to public [14:40:01] * andrewbogott nods [14:40:08] and auth needs to be public of course [14:40:15] the recursor is bound to the private VIP, not the host-specific address, iirc [14:40:27] sorry taavi very true [14:40:41] so does that mean... [14:40:43] * andrewbogott tries it [14:41:03] so I guess you could put the private IPs in pools.yaml. The connection will fail, but it might add the right allowed list to the DB [14:42:03] the comment above "nameservers:" in the example says "We use these to verify changes have propagated to all nameservers." so maybe they're used just for that [14:42:17] topranks: why will the connection fail? [14:42:35] (I mean, I can confirm that it will) [14:43:07] dhinus: that's why I figured it was OK to use the public IPs there... [14:43:25] but it seems like it must be using them for xfr as well [14:43:51] right, makes sense. let's see if it's true [14:46:07] andrewbogott: pdns is only listening on the public IPs [14:46:26] oh yeah, /that/ should be easy to fix [14:46:49] or maybe not [14:46:51] https://www.irccloud.com/pastebin/K0zZhNFC/ [14:46:57] yeah you can't change it there [14:47:07] you could have it listen on all IPs, but that would conflict with the recursor [14:47:14] grrrrrr [14:47:40] Ok, so this reminds me of a long-neglected task, "move the recursors to VMs" [14:47:54] which would solve /that/ problem which might ultimately solve the problem we're talking about now [14:48:09] Although I'm quite worried about chicken/egg issues with recursors on VMs. [14:49:26] dhinus: have we convinced ourselves that the current dumb config is the only possible config yet? :D [14:49:46] I'm definitely convinced it's complicated :D [14:51:02] for your reference, if you want to try changing pools.yaml in codfw1dev be warned that pools.yaml isn't live config, you have to do a 'designate-manage' command to have it read the changed config [14:51:24] ok [14:51:49] I remember a time years ago when I complained to the designate devs that they kept adding more and more services and "why are you making this so complicated?" [14:52:02] I guess that was before I knew that DNS has uncountably many edge cases [14:52:44] in any case I'm pretty sure I know how to make this work as well post-reimage as it worked pre-reimage [14:52:57] what if we always put both public and both private in the allowlist? [14:53:09] i.e. https://phabricator.wikimedia.org/T346385#9182344 [14:54:44] iirc that makes pdns throw a bunch of "I'm upset because I'm not getting updates from xxx.xxx.xxx.xxxx" warnings [14:54:52] I see [14:54:54] but maybe that's harmless? I don't remember for sure if it actually broke things [14:55:27] I guess that would still require a manual SQL query every time you add a new domain? [14:55:57] likely [14:59:18] taavi: (unrelated) do you have a second to help me understand what's happening with the web proxy API? I'm getting failures when creating things in horizon and iirc you're the current expert on that code [14:59:34] sure [15:00:01] it says 'Unrecognized schema in response body. (HTTP 403) ' [15:00:13] which I guess I should try reading the logfiles before bugging you [15:00:15] in eqiad or codfw? [15:00:57] eqiad [15:01:16] trying to create proxies in tf-infra-test [15:02:43] dammit is there a way to make journalctl wrap lines? [15:03:11] Oh! 
it's upset because the designate server is down :) [15:03:21] taavi: I guess nevermind for the moment [15:03:22] oh, might be! [15:03:38] although why is the api down? is the other cloudservices node not serving it? [15:03:47] it ought to be [15:04:34] 'openstack zone list' sure works [15:05:10] does the proxy API use keystone catalog discovery? [15:05:17] yes [15:05:46] maybe it doesn't like that 1006 is "half up"? (I'm initializing the db after the reimage) [15:06:21] maybe but I should see cli failures too in that case [15:07:29] did you get failures when listing things or when editing things? [15:08:16] editing [15:08:55] works for me, I just created tf-infra-test-taavi.wmcloud.org [15:10:31] me tries the script again [15:11:46] hmmm [15:11:57] previously I couldn't create anything [15:12:17] now I just can't create the one thing I want to create, tf-infra-test.wmcloud.org which the terraform test suite seems to expect [15:12:55] which is understandably an edge case [15:17:44] the db in cloudservices1006 should be ok [15:18:07] I imported the schema from /usr/share/pdns, then imported a dump from 1005 [15:18:44] and then changed the master records? [15:18:54] I haven't no [15:19:00] andrewbogott: does that name exist in the designate database? [15:19:31] / in DNs [15:20:34] seems yes [15:20:39] let's see if I can remove it [15:21:09] yeah that would be it [15:21:50] andrewbogott: I've updated the master records with "update domains set master..." [15:22:03] dhinus: great! [15:22:06] so possibly you're done [15:22:12] we'll see what works and what doesn't [15:22:38] sigh, the logs very helpfully indicated the problem, it's just that the log format is horrible to use [15:22:39] Rejecting can_use_hostname (tf-infra-test tf-infra-test.wmcloud.org.), found existing records: tf-infra-test.wmcloud.org. [15:24:18] oh yeah, there it is :/ [15:24:28] taavi: working now and tf tests seem happier [15:24:29] thanks [15:24:39] great [15:28:08] hi all anyone around who is comfortable with wikireplicas? [15:28:32] lol no one is /actually/ comfortable. I think milimetric is the closest [15:28:45] (who is probably in a data-engineering channel) [15:29:09] he might send you back to us depending on the question [15:29:29] well is something sdimple i want to deploy a change restart mysql and ensure things are not broken [15:29:50] the change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961829/ [15:31:55] you have a +1 from ben and he knows as much as anyone. If you want to test things after the fact https://superset.wmcloud.org/superset/ would be a good tool for that [15:32:02] jbond: what else can I tell you? [15:32:14] andrewbogott: ack sgtm [15:32:23] i have pinged ben [15:32:42] andrewbogott: while i have you there is also https://gerrit.wikimedia.org/r/c/operations/puppet/+/968665/5 [15:35:31] jbond: looks ok other than the apparent typo taavi flagged. I can merge and apply when it's ready [15:40:51] btw, jbond, thanks for all the work you've done with us over the years. You've been an enormous help and I'm very sorry to see you go! [15:41:24] thanks andrewbogott :) [15:47:21] jbond: going to wait to merge that until post-meeting just in case [15:50:33] andrewbogott: yes sgtm there is also the wikireplicas one which needs merging first [16:12:25] hm, I will see if I can restart mariadb on clouddb1021, seems to have crashed [16:14:09] seems like it's running but the unit shows a failed state [16:17:10] jbond: seems mariadb won't start on clouddb1021 now. 
Not sure if it's related to your change, I'm still investigating, but the timing is suspicious [16:18:39] andrewbogott: the main mariadb is not ment to start [16:18:46] you need to start the @s{1..8} versions [16:18:53] yeah... I wonder why it's alerting then :) [16:18:58] but I just figured that out [16:19:09] ann that would be me i tried to start it [16:19:14] just need to reset [16:19:31] oh great [16:19:44] i have done it it should clear [16:19:47] I think I made things worse in the meantime though [16:19:57] lol [16:20:46] seems like ps -ef | grep maria should be showing more than nothing [16:21:13] and yet puppet seems happy [16:21:28] jbond: if you have those systemctl commands handy can you start what needs starting? [16:21:28] andrewbogott: the binary is still mysqls [16:21:37] sudo systemctl status mariadb@s{1..8} [16:22:39] systemctl is happy but 'PROBLEM alert - clouddb1021/mysqld processes is CRITICAL ' [16:23:16] andrewbogott: i just started them all agaiun they had died about 5 mins ago [16:23:27] thanks [16:23:47] I'm going to disconnect before I make things even worse [16:24:06] :) [16:26:14] I have a bunch of work to do on my DD application so I'm going to call it a day for today [16:33:24] * andrewbogott waves [17:36:37] Is horizon down for others, or mostly me? [17:40:23] Rook: seems to be working for me, you mean in eqiad? [17:40:32] What are you seeing? [17:40:48] Yes, I'm getting a 504 when I try to login [17:41:20] I'll log out and in again [17:43:53] yeah, seems to be failing for me too. moritzm, any chance that's related idm changes? I'll also see what I can see, just covering all options [18:18:37] seems so, at least it's the 2fa that's failing [18:22:33] 2fa in Horizon should be only about talking to wikitech's action api [18:22:55] unless you've changed that recently? [18:26:31] Hm... I'm used to thinking that the 2fa on wikitech is another random wmcs-only feature. Is that wrong and it's normal now? [18:26:42] but yeah, it's not in osm [18:27:35] It's just https://www.mediawiki.org/wiki/Extension:OATHAuth [18:28:07] ok, then I guess I have to look elsewhere for this breakage. [18:28:30] I have 2FA on all of my staff wiki accounts (SUL, officewiki, foundationwiki) [18:29:38] heh, I'm so used to wikitech being weird that I don't notice when it's normal. [18:31:04] I'm about to start reviewing functionality for the IdM to see what if anything is left before we can rip out OpenStackManager (would still need LDAPAuth). [18:32:26] hm, striker logins are also broken but in a slightly different way... [18:33:08] oh, well, it's probably not actually different [18:37:48] for me striker is hanging at login. it could be related to oathauth as there is an api check before prompting for 2fa [18:38:05] * bd808 tries to see if there are error logs [18:40:25] lots of "Bad OAuth request from {ip}" and "{key} exists, so nonce has been used by this consumer+token" [18:40:36] that could be memcache problems [18:41:44] we would have just gotten the 1.42.0-wmf.4 train today too, right? [18:42:09] yeah. wikitech is in group1 [18:43:15] you're seeing that in the wiki logs or the striker logs? [18:43:32] think I should just bounce memcached on the cloudwebs? 
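Before bouncing memcached, a quick round-trip from one of the cloudwebs can at least separate "memcached is unreachable" from "the cached data is bad". A sketch assuming the pymemcache library and the default memcached port 11211; wikitech actually reaches memcached through nutcracker, so the host and port here are assumptions.

```python
# Minimal memcached health check: set a key, read it back, print uptime.
# Host/port are assumptions (11211 is the memcached default).
from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211), connect_timeout=2, timeout=2)
mc.set("healthcheck:test", b"ok", expire=30)
print("get:", mc.get("healthcheck:test"))    # expect b'ok'
print("uptime:", mc.stats().get(b"uptime"))  # seconds the server has been up
```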
[18:44:01] wiki logs via https://logstash.wikimedia.org/app/dashboards#/view/AWw-hnwr0jm7pOHZUFzK [18:44:22] I think its likely that something in the train busted wikitech oathauth [18:45:48] "Retrying due to nonce error, see https://phabricator.wikimedia.org/T106066" [18:45:55] T106066 [18:46:36] I looked at that but didn't clue into it being a new warning. The ticket hasn't been touched in > a year [18:46:45] there might have been mcrouter config changes causing issues per T106066 [18:48:30] Stashbot's error logs are full of "2023-11-08T17:43:10Z mwclient.client WARNING : Retrying due to nonce error, seehttps://phabricator.wikimedia.org/T106066" now too. [18:49:08] that may have been why the bot looked dead before [18:50:47] * andrewbogott tries to summon timo [18:50:55] o/ [18:51:02] hi! [18:51:06] https://github.com/wikimedia/operations-mediawiki-config/commits/master doesn't seem to have anything relevant other than the group1 promotion [18:51:33] what's the observable problem we're talking about? [18:51:34] Krinkle, as of https://logstash.wikimedia.org/app/dashboards#/view/AWw-hnwr0jm7pOHZUFzK we can't use 2fa on wikitech anymore [18:51:40] lots of memcached complaints [18:51:41] https://logstash.wikimedia.org/app/dashboards#/view/AWw-hnwr0jm7pOHZUFzK [18:51:55] whoops, sorry, same paste twice [18:52:05] what I mean is, 'as of today' presumably the train roll-out [18:52:46] It's not obvious what broke it but I'm guessing you're the expert on debugging mcrounter/memcached things [18:52:50] "PHP Notice: EtcdConfig using stale data: lost lock in /srv/mediawiki/php-1.42.0-wmf.4/includes/config/EtcdConfig.php on line 218" is emdemic too [18:53:26] hm, why would etcd be involved at all? [18:53:56] database routing and I think some service things are in etcd [18:54:17] ok [18:54:23] EtcdConfig is fine, that's new logging for prexisting and expected behaviour [18:55:09] great, that's one less dead end for us to follow :) [18:55:24] we cache etcd data for 10s, then we acquire a graceful lock in apcu from the next web request to refresh the data, which takes about 1s during which concurrent requests re-use the previous data until the new one is ready. [18:55:38] long version: https://phabricator.wikimedia.org/T349376#9303579 [18:55:51] short version: it's expected, but we should not log unless it's been more than N seconds since the last refresh [18:56:14] what error am I looking for in relation to 2FA failure? [18:56:24] > Retrying due to nonce error, see https://phabricator.wikimedia.org/T106066 [18:56:25] this one? [18:56:58] * Krinkle clicks (-) on level:INFO and level:NOTICE [18:57:13] Krinkle: the primary problem is inability to log in on wikitech and striker. That seems to be caused by a timeout when validating the 2fa token against wikitech's api [18:57:35] the nonce error is suspected to be related [18:57:59] bd808: that's about what we know so far, right? [18:58:07] https://logstash.wikimedia.org/goto/9515974b7d17d0e4666ab57a88dcd006 -- started at 09:00 UTC today. that's when the train ran [18:58:15] is the wikitech database primary eqiad or codfw? [18:58:21] i.e. did it follow the switchover? [18:59:08] it is group1 today, and labswiki is in group1 [18:59:25] the nonce error started about 6 hours ago [18:59:35] ~7h now [18:59:51] Krinkle: as far as I know there is no wikitech db in codfw [19:00:32] it should be a "normal" s6 db these days I thought? [19:00:32] mcrouter-primary-dc did switch to codfw, but that shoudl be fine to access from eqiad, and we do for various use cases (i.e. 
rate limiter, CA tokens) [19:00:44] does wikitech still use nutcracker? [19:00:51] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/db-production.php#L132 [19:00:56] sorry, I'm frantically trying to eat lunch and saying lots of wrong things [19:01:25] switchign that to mcrouter, even in keeping with its own custom memcache host, might make for a small incremental improvement towards less novelty [19:01:31] Krinkle: we might actually still use nutcracker, yes. Let's figure out how to check that. [19:01:56] it's been like 5y since we used that in prod so not a lot of knowledge around that [19:02:13] there was some long ago reason why we didn't migrate, but I think it sort of boiled down to "nobody who knew what to do helped" [19:02:36] > Retrying due to nonce error [19:02:43] This is not logged by MediaWIki, but by striker. [19:04:22] https://logstash.wikimedia.org/goto/b61bb63672a331b31c1e533b175b252d [19:04:37] ok, so we need to correlate it with MW debug messages prior to the striker message [19:04:50] "include ::profile::openstack::eqiad1::nutcracker" -- we are using nutcracker [19:05:10] we have mcrouter and nutcracker both running, I'm not 100% sure that nutcracker is used by wikitech and not just horizon [19:05:22] I'm still trying to track that down [19:05:37] comment in the role says "# Currently we run both nutcracker and mcrouter. Nutcracker is for [19:05:37] # the soon-to-be-moved wikitech/mediawiki install; [19:05:37] # Mcrouter is used by Horizon. They share the same [19:05:37] # memcached backends." [19:05:58] ok then [19:06:09] * andrewbogott sad about 'soon-to-be-moved' [19:06:51] Are both of those transparent-ish memcached proxies? Like, could we just change the ports to point at mcrouter and be done? [19:07:00] Or more likely I guess move mcrouter to where mw is looking [19:08:12] andrewbogott: mw is currently configured to "if wikitech" connect to the port we use for nutcracker instead of the port where appservers run mcrouter. [19:08:26] quite possibly you already run mcrouter on the same port as appservers [19:08:36] ok. [19:08:44] the difference if I remember is that nutcracker also routes for redis and wikitech still uses redis to cache sessions and some other stuff [19:08:49] I'm not actually proposing we change this unless it will fix today's breakage [19:08:57] yeah, not likely the cause. [19:09:14] so the messages we have in logstash before each error from striker is: [19:09:19] > Bad OAuth request from {ip} [19:09:32] > OAUTH:labswiki:nonce:*:oauth_token=*&oauth_token_secret=[REDACTED]:* exists, so nonce has been used by this consumer+token [19:09:37] > Persisting session for unknown reason [19:09:42] those three basically [19:10:25] The "Bad OAuth request from {ip}" one, if expanded has "Nonce already used: xxxxx" as reason [19:10:32] fwiw I have not tried clearing memcached yet, so lmk if/when you start to think this is just a poisoned cache issue [19:10:39] given it can find the data in memcached, suggests that maybe it's not a memcached problem [19:10:56] if the error was that it seemed that tokens are invalid / not stored in memcached, that woudl seem more likely [19:11:15] * andrewbogott nods [19:11:32] I'm guessing nothing changed in striker in the last 12 hours? 
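The reports above (striker, stashbot, Horizon) all reach wikitech's action API through mwclient, signing each request with OAuth, so a small standalone probe can reproduce the failure outside those services. A sketch of such a probe, assuming owner-only consumer/access tokens; the four credential values are placeholders.

```python
# Sketch of an owner-only OAuth check against wikitech's action API, the same
# style of authenticated request striker and stashbot make through mwclient.
import mwclient

site = mwclient.Site(
    "wikitech.wikimedia.org",
    scheme="https",
    consumer_token="...",      # owner-only consumer key (placeholder)
    consumer_secret="...",
    access_token="...",
    access_secret="...",
)
# The Site() constructor already fires the siteinfo|userinfo query that was
# failing; if OAuth is broken it logs "Retrying due to nonce error".
ui = site.api("query", meta="userinfo", uiprop="groups|rights")
print(ui["query"]["userinfo"])
```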
[19:11:54] nope [19:11:57] nope, and horizon also stopped working [19:12:06] and not explicit changes to OATHAuth in https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.4 [19:12:27] and Stashbot is also failing to auth to wikitech with no codebase changes [19:14:09] i see, so that's why stashbot went MIA [19:17:13] * Krinkle invites MatmaRex [19:17:33] MatmaRex: https://logstash.wikimedia.org/goto/b61bb63672a331b31c1e533b175b252d [19:17:50] this shows log messages from mediawiki and from striker (toolsadmin) from cloudweb hosts (mixed). [19:18:07] the observable problem is "Retrying due to nonce error" as logged by striker. [19:18:16] um, hi [19:18:18] note that wikitech/cloudweb is in eqiad (no codfw equiv) [19:18:19] * MatmaRex reads channel logs [19:18:31] bryan and andrew suspect a memcached issue [19:18:49] > andrewbogott: lots of memcached complaints [19:18:53] what did this refer to btw? [19:19:06] Krinkle: just the nonce log messages [19:19:24] since when are things broken? [19:19:30] ~7h ago [19:19:40] andrewbogott: striker.py saying "Retrying due to nonce error" is a memcached complaint? [19:20:01] MatmaRex: pretty much exactly aligned with the 1.42.0-wmf.4 group1 deploy window [19:20:05] best to assume I don't know anything at this point [19:20:48] I don't know striker's code base but I'm assuming it logs "Retrying due to nonce error" in response to its HTTP oauth request to Wikitech api.php being denied [19:21:11] Krinkle: yeah, its a message from the mwclient python library. Let me see if I can track down more. [19:21:13] so it's 99% a wmf.4 bug? i can go read the changes [19:21:23] https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.4 [19:22:07] wmf.4 + mwclient seems likely at the moment, although I think horizon uses raw requests calls to talk to the action api [19:22:22] No changes: LdapAuthentication, OATHAuth, OAuth, OpenStackManager, WebAuthn [19:23:35] here's where the none error log messages come from -- https://github.com/mwclient/mwclient/blob/98af129e2af5475d75caf464ac4fae332e8f35fc/mwclient/client.py#L389-L397 [19:24:09] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/zed/keystone/wmfkeystoneauth/wikitechclient.py <- what Horizon does [19:24:36] ah, so it does use mwclient too. [19:24:47] i know that there have been recent changes in CentralAuth memcached use https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/960643 and in mcrouter config https://phabricator.wikimedia.org/T342201#9308357 . this is probably not relevant, right? [19:25:07] CentralAuth prolly not, unless it's already enalbed on wiktiech? [19:25:30] Krinkle: CA is not installes on wikitech [19:25:33] (i honestly don't know how OAuth interacts with anything, and how wikitech interacts with anything :) ) [19:25:33] *installed [19:25:42] mcrouter isn't used yet either, wikitech is still on nutcracker with its own memcache host (not mcrouter). [19:25:57] the timing with _joe_ 's change to mcrouter-primary-dc is interesting thoguh [19:26:09] it added cross-dc support for mw-on-k8s to connect to mcrouter-primary-dc [19:26:23] but wikitech isn't served from k8s right? or does it have a small % already? [19:26:48] that was rolled out like 7h ago [19:26:48] not served fvrom k8s. 
only from cloudweb100* hosts [19:27:07] * bd808 wishes that were not the case [19:27:41] <_joe_> Krinkle: 0% for now, something might change in a near future [19:29:39] The action api query that is failing per the logs is https://wikitech.wikimedia.org/w/api.php?action=query&continue=&format=json&meta=siteinfo%7Cuserinfo%7Cuserinfo&siprop=general%7Cnamespaces&uiprop=groups%7Crights%7Cblockinfo%7Chasmsg -- and that works fine without OAuth being involved [19:30:20] right, it preusmably wants the 'userinfo' part there for the currently loggedin user [19:30:57] that's part of the mwclient.Site() constructor actions [19:31:23] first error: Nov 8, 2023 @ 12:35:01.406 [19:31:49] that's 35min after group1 https://sal.toolforge.org/production?p=1&q=&d=2023-11-08 [19:31:52] so... [19:31:55] all of the things that are failing are using mwclient to authenticate to wikitech via OAuth owner-only tokens [19:32:07] well, can we rollback wikitech only to the group2/previous version? [19:32:12] at least to rule it out? [19:32:25] i.e. same as enwiki is running still [19:32:57] it's technically possible. I don't know how thcipriani and co feel about partial group changes these days [19:33:29] MatmaRex: any other hunches? [19:33:29] we could hack it locally on cloudweb* too [19:33:45] sure, let's do that first. in wikiversions.json [19:34:05] afaik rolling back some wikis is not uncommon, esp if you're not aware of a specific dependency on this weeks's train roll-out in the past 12 hours [19:34:15] Krinkle: none [19:35:12] bd808: I think most of the striker errors are cascading errors, not original ones. I don't know why it retries, but the error most of them are reporting is that the nonce token was already consumed, and from a quick glance it seems striker/mwclient is responding to that by waiting 2s and then trying it again, thus consuming the same nonce a 3rd and 4th time etc [19:35:18] I stand by ready to hack on cloudwebs but will need more info [19:35:39] I'm not seeing any errors in MW other than "nonce is invalid, already used" [19:35:49] so I can explain all errors but don't know how it started [19:35:57] has the process been restarted? [19:36:17] stashbot has twice. not sure about anything else [19:36:25] ah right, stashbot affected too [19:36:47] alright, can one of you pin wikitech to prev mw branch? [19:36:58] * Krinkle back in 10min [19:38:01] I think I can. It should just be editing /srv/mediawiki/wikiversions.php on both hosts [19:41:22] ok, wikitech is 1.42.0-wmf.3 again now. [19:42:02] and I can login to https://toolsadmin.wikimedia.org/ again [19:42:20] so it is something in 1.42.0-wmf.4 causing the issue [19:43:21] * bd808 steps away to find a sandwich [19:43:43] o/ can I help? [19:45:31] taavi: find out what is breaking OAuth owner-only auth for wikitech on 1.42.0-wmf.4 I guess. Figuring out if it is wikitech specific of OAuth generally would be a good step at this point. [19:45:39] * bd808 really goes to find food [19:46:33] I guess we're back to reviewing commit log [19:47:03] I'm guessing that there's no good way to bisect? [19:49:28] i'm not sure how annoying it is to bisect in production, i've done it on the beta cluster before [19:51:21] could you sync to mwdebug servers after every bisect step to test? [19:51:35] you can't test wikitech stuff on mwdebug unfortunately [19:54:33] There's a couple of login hook changes. Given no CentralAuth, that's pretty much the only changes I can see that vaguely relate to oauth [19:55:11] want to link me to the diffs? 
I can hand-apply them and see what breaks [19:55:28] oauth doesn't seem to use those hooks [19:55:38] taavi: hmm good point. I figured if it affected other wikis we'd have heard but maybe not [19:56:05] (i looked at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/969096) [19:56:45] (and at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/970771 . i don't think it's either of those) [20:02:05] Right, they seem specific to the GUI login page [20:04:31] https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.4 [20:04:59] Any backports not covered here? [20:05:35] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+log/refs/heads/wmf/1.42.0-wmf.4 [20:05:42] one language backport, and a couple of extension ones [20:05:56] nothing that looks immediately suspicious [20:07:02] https://gerrit.wikimedia.org/r/q/branch:wmf/1.42.0-wmf.4+is:merged yeah [20:09:27] Oauth logic is mostly in a library [20:09:43] Vendor updates aren't on that wiki page it seems [20:10:11] https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.4#Vendor lists that a parsoid bump was the only vendor change [20:11:05] Ah I see. Missed that [20:11:45] > Remove check for irrelevant setting: wgDisableCookieCheck (task T349791) [20:11:50] T349791: Remove UserLoginComplete hook calls with $direct === false - https://phabricator.wikimedia.org/T349791 [20:11:51] One more login related change [20:12:31] Using mwclient with owner-only oauth credentials seems to work fine against mediawiki.org which is on 1.42.0-wmf.4. [20:13:02] It's a stretch but given andrewbogott asked, maybe we can apply a couple of these and see what happens. (Or revert them out of wmf.4 on cloud web hosts in the other direction). [20:13:18] sure, which one should I start with? [20:15:01] are we sure that it actually has a problem with nonces already being used? the code throwing that error is somewhat sloppy [20:15:08] if ( !$this->nonceCache->add( $key, 1, $timestamp + 300 ) ) { [20:15:18] add() can probably fail for a dozen different reasons [20:15:58] MatmaRex: not sure at all. and anything that implicates memcache/mcrouter should be looked at [20:16:26] Ah, it's not actually asserting existence [20:16:31] Good spot [20:16:58] What backend does nonceCache use for labswiki? [20:17:29] i think that's wgMWOAuthNonceCacheType which is memcached-pecl [20:17:41] (well, that's what the config says. who knows what really happens) [20:18:21] If it failed, I believe memcbag would log something [20:18:46] Although DI around there isn't perfect for Logger so could be lost [20:19:00] is wikitech using mcrouter or nutcracker? [20:19:12] nutcracker we think [20:19:19] Nutcracker last i checked wmf config mc.php [20:19:54] It working after rollback and working in prod mediawiki.org is making it hard to come up with ideas [20:21:02] What are some extensions notably present or absent from wikitech that an innocent change might have an effect through on memc or auth? [20:21:12] It is possible that what fixed it was not the rollback itself but the restart that followed the rollback. Unlikely since there was /already/ a restart with the upgrade earlier today... [20:21:24] andrewbogott: there was no restart [20:21:36] ok then [20:21:38] how did the rollback apply then? [20:21:43] php reads config on each request [20:21:46] Restart of? [20:21:48] I guess because of how php works... 
[20:21:50] yeah [20:22:13] it's all front router magic :) [20:22:19] In prod fpm is configured to no longer check mtime on disk to recompile [20:22:36] Requires restart for changes to apply on most app servers [20:23:01] Special:Version on wikitech changed, that was what was desired [20:23:05] But cloud web probably doesn't reuse enough puppet for that to apply so it's probably closer to mwmaint [20:23:45] bd808: stash bot is fine too now? [20:24:14] T12345 [20:24:15] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345 [20:24:17] yup [20:25:28] I have a tiny script to test the mwclient authn. It is working against wmf.4 on mediawiki.org and wmf.3 on wikitech. [20:26:02] any objections to me putting wikitech back to wmf.4 to verify that the test script fails? [20:26:24] go for it [20:26:38] please do [20:27:28] wikitech is back to wmf.4 [20:27:49] "WARNING:mwclient.client:Retrying due to nonce error, seehttps://phabricator.wikimedia.org/T106066" [20:27:58] so it's back [20:28:04] Okay, so no anti-climax [20:28:20] * andrewbogott cheers + also boos reproducibility [20:29:11] do we still have a quick config switch for verbose as hell logging on wikitech? [20:29:13] * bd808 looks [20:29:57] eh, it's for the ldap channel [20:30:00] bd808: can you trace which requests Mw client makes? Seeing the one(s) it makes and it's responses before the failure (and comparing to a working version) maybe be telling [20:30:13] Ie from the Mw client POV [20:30:34] Headers and body or so [20:31:10] Krinkle: the request that fails is ` https://wikitech.wikimedia.org/w/api.php?meta=siteinfo%7Cuserinfo%7Cuserinfo&siprop=general%7Cnamespaces&uiprop=groups%7Crights%7Cblockinfo%7Chasmsg&continue=&action=query&format=json` with OAuth signing for authn [20:31:30] bd808: yeah, you can probably force "log" attribute in wmf config src WikimediaDebug to act as if it's always there, or rig the consumer in wmf config logging [20:32:53] bd808: right, but before that it's logging in somehow, and getting a response and then it prepares this request. [20:33:16] Maybe something during the login happens differently in a non fatal way already [20:33:20] no, that's the "login" this is OAuth so each request is signed to add authn [20:34:07] Ah so it's using a long term credential that requires no login per session [20:34:18] correct. there isn't an initial handshake to create an authenticated session. you add the authn header data to every request instead [20:34:21] And the same credential is valid on wmf.3 [20:34:29] correct [20:34:29] Interesting [20:34:52] What about the nonce though. That's allocated just before right? [20:36:30] hmmm.. mabye? where's that code that MatmaRex was point out as sloppy? [20:36:53] https://gerrit.wikimedia.org/g/mediawiki/extensions/OAuth/+/22c8cdbf35b8275c71ea59cba77b812cb7f61fcb/src/Backend/MWOAuthDataStore.php#148 [20:36:56] the nonce is generated by the client, it's there to prevent replay attacks [20:37:55] taavi: ah, right. [20:38:18] i'm still thinking that the add() may be failing for some random stupid reason, not because the key already exists [20:38:24] ok, so let's hack some state inspection into that failing method I guess [20:38:33] i've been trying to read that code to see what it does, and it does a million things [20:39:30] I guess we should be able to use shell.php to poke at $this->nonceCache->add() directly? 
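For what it's worth, the add() being poked at here is the standard memcached "store only if the key does not exist yet" operation, which is what makes it usable as a replay guard. A quick illustration of that semantics from Python, assuming pymemcache and a reachable memcached (host/port are placeholders):

```python
# First add of a nonce key succeeds; a second add of the same key is refused
# (NOT_STORED), which OAuth reports as "nonce has been used".
from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211))
key = "OAUTH:labswiki:nonce:example"

print(mc.add(key, b"1", expire=300, noreply=False))  # True  -> nonce accepted
print(mc.add(key, b"1", expire=300, noreply=False))  # False -> replay detected
```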
[20:39:50] https://gerrit.wikimedia.org/g/mediawiki/core/+/f869de7aad176ba8ad083cbcd46538e82bf3b4e0/includes/libs/objectcache/MemcachedPeclBagOStuff.php#302 whatever these error codes are, any of them will cause the add() to fail, and that's just one of the things [20:40:16] https://wikitech.wikimedia.org/w/index.php?title=Debugging_in_production&mobileaction=toggle_view_desktop#Ad-hoc_log_messages [20:40:41] let's hack the memcached logs to be more verbose on a cloudweb server? [20:41:06] I think bd808 was enabling verbose log so it does that for all channels [20:41:23] Then we can check Logstash for the next Mw client call [20:41:41] It indeed already has some logging in code [20:41:44] core* [20:41:51] I didn't turn up logging yet, but yeah that's maybe a good next step [20:42:33] if anyone else wants to drive, my brain is feeling out of practice at this stuff today [20:45:22] ok, wikitech should be acting like XWikimediaDebug logging is turned on for all requests [20:45:41] I jsut set `} elseif ( /* XWikimediaDebug::getInstance()->hasOption( 'log' ) */ true ) {` in logging.php [20:46:27] * bd808 fires off the failing test script [20:50:46] I've got a meeting with cdanis at the hour [20:50:49] it logged "MemcachedPeclBagOStuff debug: result: NOT STORED" [20:51:11] https://logstash.wikimedia.org/goto/ee4248306b5d04268c27b6d11abb065a [20:51:59] https://www.php.net/manual/en/memcached.add.php "will return Memcached::RES_NOTSTORED if the key already exists" [20:52:10] so i guess it really is seeing the nonce twice? [20:53:27] cdanis meeting cancelled from his side so I can keep "helping" here [20:54:02] becauseā€¦ it is trying to add the same key twice, in the same request [20:54:19] (search for "MemcachedPeclBagOStuff debug: add" in that link) [20:54:20] Right, so not a false false but actually being triggered twice [20:54:42] During the same reqId? [20:55:23] yeah -- https://logstash.wikimedia.org/goto/a6bc5a26a42fd2c6e99686dc5e9d6d56 [20:56:58] in between the two add() calls it logs the noce has been used warning [20:57:03] *nonce [20:59:38] want to make it log wfBacktrace() in there somewhere? i don't see why it's being called twice [21:00:47] MatmaRex: in lookup_nonce? [21:01:12] yeah [21:01:23] unless that's expected, and i'm misunderstanding this [21:07:31] ok, let's run the test again to see if that logging add worked... [21:07:45] heh. probably not [21:08:08] requests.exceptions.HTTPError: 500 Server Error: Internal Server Erro [21:10:31] swat window is clobbering things... [21:11:06] swat should be done soon I think [21:17:28] MatmaRex: https://logstash.wikimedia.org/goto/a6e63fafa376b188c29c78186dbb6e04 [21:19:13] ugh, i forgot that function returns html [21:19:23] hmm, just one backtrace? [21:19:53] apparently? Is the other message still doubled? [21:20:33] yeah [21:20:39] oh, did you add it inside the if() ? [21:20:54] i know wikitech 2fa was not working there for a bit, but are there any other issues with vps auth atm? i'm having trouble getting into an instance i just spun up. ssh is running on the host but it just hangs up upon connection [21:21:18] but um, anyway, the backtrace looks nasty [21:21:33] it calls MediaWiki\Session\SessionManager->getSessionForRequest(), then a couple other functions, and then MediaWiki\Session\SessionManager->getSessionForRequest() again [21:21:36] that can't be right [21:21:51] MatmaRex: oh, yeah I did put it just after the existing log for the bad state [21:22:01] dduvall: no, there are no other known issues. 
I can look at that, but let's move to -cloud to keep this channel clutter-free? [21:22:16] taavi: sure, thanks [21:24:23] * bd808 tries again with the backtrace in a different spot [21:25:42] MatmaRex: https://logstash.wikimedia.org/goto/1a7587292be549c908da95e8b3f8fd91 [21:26:08] I put the backtrace at the top of lookup_nonce this time [21:28:49] right [21:29:02] so it looks like it's trying to get the login session while getting the login session [21:29:47] probably caused by this change? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/966657/10/includes/user/User.php [21:29:51] that isAllowed() call is in the backtrace [21:30:17] i'm not sure what is going on here, though (or whether we can revert it) [21:32:43] MatmaRex: let me hack that check out (I'll make it always `$isExempt = false;`) and check again. If that fixes then we can think harder. [21:33:32] If that's the cause, I'd say obtain a similar trace in wmf.3 for reference and then dump on a train block task for Tim to look at in a few hours [21:34:43] maybe this check needs to be done in a different way: https://gerrit.wikimedia.org/g/mediawiki/extensions/OAuth/+/22c8cdbf35b8275c71ea59cba77b812cb7f61fcb/src/SessionProvider.php#230 [21:35:13] tim changed some core code accessing BlockDisablesLogin in that patch [21:35:52] hm, could be, wikitech is one of the few prod wikis with $wgBlockDisablesLogin as true [21:36:02] so in theory it should also be broken on all private wikis? [21:36:05] As is Office wiki [21:36:07] https://logstash.wikimedia.org/goto/4c9288bf8780a7983d44b69aaba793e6 -- hardwiring `$isExempt = false;` fixes it [21:40:34] maybe we actually should just change that `$localUser->getBlock() ` to `$localUser->getBlock( Authority::READ_NORMAL, /*$disableIpBlockExemptChecking*/ true ) ` [21:40:44] becauseā€¦ "@param bool $disableIpBlockExemptChecking This is used internally to prevent a infinite recursion with autopromote. See T270145." [21:41:05] and we also have an infinite recursion here. it just crashes in the 2nd loop, because of the nonce being already used [21:41:58] MatmaRex: you are talking about this line, correct? -- https://gerrit.wikimedia.org/g/mediawiki/extensions/OAuth/+/22c8cdbf35b8275c71ea59cba77b812cb7f61fcb/src/SessionProvider.php#230 [21:42:08] yes [21:42:19] ok. I'll give it a shot :) [21:43:33] that's basically the same as setting `$isExempt = false;`, like you did earlier [21:43:43] *nod* [21:45:45] MatmaRex: what `use` statement do I need for that `Authority::READ_NORMAL`? [21:46:30] use MediaWiki\Permissions\Authority; [21:46:34] ack [21:47:21] yup. that fixes it too [21:47:48] MatmaRex: you should get the honors of the patch and backport. and thank you! [21:48:36] i'm not *that* confident that this is the correct fix :D [21:49:06] i mean, it's not too wrong. in the worst case, block-exempt IP-blocked users won't be able to login in [21:49:52] but there are several things here i don't understand. 
why does OAuth code have to check this, why does checking this try to access the current session, and so on [21:51:07] unstubbing the User while unstubbing the User is a classic authmanager blunder, so I think you are at least on the right trail [21:51:40] and oauth is a place that finds it exactly because of the replay attack protection that was tripping here [21:56:38] MatmaRex: I bet Gergo and Tim could figure out on review if it is correct to hack that way or not [21:59:58] probably [22:49:37] bd808: without blocking train, Thu evening will break wikitech again as we can't stay on the old version [22:50:08] But yeah I suppose we could roll forward to enwiki without issue. [22:51:09] Krinkle: things will break on wikitech as soon as a scap touches extensions/OAuth/src/SessionProvider.php, but I don't know if that's a train holding condition or not. I'll defer to experts. [22:51:55] bd808: ah you mean because the wikiversions change isn't committed [22:51:59] wikitech is back on wmf.4 with a local hack at the moment [22:52:25] and yes, none of it is committed work [22:52:27] That's worth committing properly I think. (Incl json file) [22:52:52] By tomorrow Gergo/Tim will hopefully review fix and/or deem the cause safe to revert [22:53:21] I was hoping MatmaRex would write the patch and have them review it. I guess I could sa he seems to have vanished form irc. [22:53:23] Via Gerrit that is [22:53:43] It's getting late for him/me/Gergo. [22:54:14] I'll post a patch in gerrit then and add folks to review [23:29:46] sorry missed a lot of scrollback. Our latest abstractions don't handle partial group rollbacks well at the moment. But "scap train" what we run now is wrapping the same underlying mechanisms, so it's still possible. We wouldn't want it to linger beyond the end of the week (and in general it's nice to avoid when we can). And if we need to do that, we should link a task to the train [23:29:48] blocker https://phabricator.wikimedia.org/T350080 and leave a comment there for jnuche so he's aware in the morning (train's running in the European timezone this week since jnuche is conductor) [23:37:14] thcipriani: https://phabricator.wikimedia.org/T350080#9318024 :) [23:38:01] bd808: perfect, thanks! [23:52:16] bd808: clarification, wikitech says it's running wmf.4 at the moment, is that right? I saw something about local hacks. [23:52:53] thcipriani: it is running wmf.4 and it also has local hacks. Wikitech is weird and has 2 dedicated servers still. [23:53:09] it runs on cloudweb1003 and cloudweb1004 [23:53:36] ah, ok, the local hack isn't rolling back the wikiversion. But a rollout would clobber any local hacks. [23:54:14] correct. [23:55:03] we started with rolling back the wiki version to confirm it was a wmf.4 problem. Then we went back to wmf.4 and started working on a hack fix to figure out what really broke it. [23:55:46] I didn't really want to commit a wmf.3 partial rolloback because eww [23:56:39] so options are (afaict): (1) continue rollout, but exclude labswiki (clobbering local hacks, but obviating the need for them) oor (2) rollback to group0 and pause there. Since any subsequent deploy would clobber them. [23:57:31] or (3) hope no one touches files that contain the local hacks :P [23:57:38] (1) would need to first roll labswiki back to wmf.3, but otherwise yeah probably correct [23:58:11] I can put hacks back too if they get nuked. 
I kind of doubt they will though honestly [23:58:50] right re:(1) needs to ensure labswiki stays on wmf.3 [23:59:00] ok, thanks, just making sure my mental model is right. [23:59:36] ...one final dumb question. If we rollback to wmf.3, will those local hacks be ok?