[00:11:40] Just be glad you're not working for reddit today :-(
[00:22:56] roy649_: seems like the alternative place reddit users picked to chat is ... https://downdetector.com/status/reddit/map/ :p
[01:41:39] in https://phabricator.wikimedia.org/T332101 I am wondering whether https://sitemaps.wikimedia.org serves a purpose nowadays. It seems like the sitemap files were last updated in 2018, which makes me wonder what would happen if we just deleted that virtual host, or whether something still uses the sitemaps and would complain even though we also don't publish new versions of those files.
[01:42:10] if you know anything about them, a comment on the ticket would be nice. thanks
[05:44:06] duesen: I +1ed your change
[07:26:57] <_joe_> so we're going 100% for parsoid in parsercache? amazing :)
[07:27:11] <_joe_> we can drop restbase's caching then :)
[08:05:21] _joe_: when you get time, could you look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/893483/ ? It is to populate some scap dsh targets from a Puppet DB query instead of relying on a hardcoded list of hosts. That will slightly simplify a few things :)
[08:05:48] I am merely looking at whether it is ok to use a puppetdb query, I can then get the series merged with others from serviceops-collab
[08:47:03] marostegui: ty!
[09:13:07] <_joe_> hashar: done
[09:13:31] _joe_: awesome :) and thank you for that puppet db query feature!
[09:13:54] <_joe_> it's ok as long as you don't need a server to be up immediately when adding it
[09:15:18] I think we need Puppet to run on the scap targets before being able to deploy. That then applies `scap::target` which will be collected
[09:15:49] at least for those hosts, I think it will be fine :]
[11:51:32] someone from the Community Tech team reached out to me to ask about deploying an updated version of "wikidiff2" ?
[11:51:41] tbh not something I really know anything about
[11:51:49] I pointed them at https://www.mediawiki.org/wiki/Wikidiff2/Release_process
[11:52:06] but they have more questions, is anyone familiar with this who could help them out?
[12:03:19] <_joe_> topranks: interesting choice :) Point them to serviceops or moritz I would say
[12:03:31] * TheresNoTime hides
[12:03:55] <_joe_> TheresNoTime: hihi
[12:04:40] _joe_: thanks yeah indeed :)
[12:04:46] okay tbf [[Wikidiff2/Release process]] doesn't look too bad
[12:05:15] yeah it's not a bad explainer
[12:06:25] is getting added to `releasers-wikidiff2` via an LDAP-Access-Request?
[12:09:21] not one that came in to me, I suspect I perhaps dealt with some LDAP request for someone on that team on one of my previous clinic duty weeks and they got my name
[12:09:55] TheresNoTime: that's good info though, they are currently not part of that group that I can see
[12:10:27] (I worded that very poorly, I meant to ask: Does one make an `LDAP-Access-Request` to get added to `releasers-wikidiff2`?)
[12:10:57] <_joe_> the release won't get wikidiff2 on our servers though, if that's what you want
[12:12:45] TheresNoTime: yes getting added to the group would come through as an LDAP-Access-Request I believe
[12:13:44] good to know, thank you
[12:17:44] topranks: best to ask them to create a Phab task initially, I can update our build
[12:18:40] moritzm: yep that is indeed the sensible way to go thanks :)
[15:31:55] Emperor: would you mind if we did another thumbor-k8s test after your proxy restarts?
[15:32:49] Can you give us a little time to see if the rate of tempauth_token_denied drops off after the restart first, please?
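(Editor's aside on the PuppetDB-driven scap dsh targets discussed at 08:05 and 09:15 above: a minimal sketch of the kind of resource query involved, assuming the standard PuppetDB `/pdb/query/v4` endpoint. The PuppetDB host and the resource title below are illustrative placeholders, not what the actual Gerrit change uses.)

```python
# Hypothetical sketch only: list hosts that declare a Scap::Target resource
# via a PuppetDB PQL query. The endpoint path is standard PuppetDB; the host
# name and the resource title are placeholders.
import requests

PUPPETDB = "https://puppetdb.example.wmnet:8443/pdb/query/v4"

def scap_targets(title):
    """Return certnames of hosts carrying scap::target with the given title."""
    pql = f'resources[certname] {{ type = "Scap::Target" and title = "{title}" }}'
    resp = requests.get(PUPPETDB, params={"query": pql}, timeout=10)
    resp.raise_for_status()
    return sorted({row["certname"] for row in resp.json()})

if __name__ == "__main__":
    for host in scap_targets("mediawiki/core"):  # placeholder title
        print(host)
```

As noted at 09:13-09:15, a query like this only sees a new host once Puppet has run on it and its `scap::target` resource has landed in PuppetDB, which is why "a server being up immediately when adding it" is the caveat.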
[15:33:50] Also, while you're here, in the light of last night's discussions about the 404s we're seeing with thumbs, how hard would it be to reduce the TTL on thumb-404s (and/or, for bonus credit, use a different code for "AUTH fail"?)
[15:34:39] yeah no rush
[15:39:52] ah, I missed the discussion entirely. It would be quite easy to reduce the TTL - but on k8s. Doing it on metal would be work, but would be doable
[15:41:19] there is some session usage in Thumbor fwiw as we saw during the credentials rollover. I'm sure we can extend that though
[15:43:20] There is something weird happening with tempauth in eqiad since it was pooled, though - https://thanos.wikimedia.org/graph?g0.expr=swift_proxy_server_tempauth_token_denied_total&g0.tab=0&g0.stacked=0&g0.range_input=6d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[15:44:05] or the rate view - https://thanos.wikimedia.org/graph?g0.expr=irate(swift_proxy_server_tempauth_token_denied_total%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[15:44:41] I'm not surprised by a bit of an uptick when we pool eqiad, but that's a really high number of tempauth tokens being denied in eqiad. Which might well be contributing to the problems
[15:46:25] hmm, they're definitely coming from thumbor?
[15:46:49] if auth errors are manifesting as 404s, it would explain why the 404 level is higher than 200s on eqiad compared with codfw pre-switchover
[15:47:00] I am seeing 403s in the logs too
[15:47:41] hnowlan: from last night it looked quite a lot like thumbor says 404 when it can't retrieve the original to make a thumbnail of, including if that's because of an auth failure
[15:47:53] (but thumbor's logs are not hugely enlightening)
[15:48:03] hm
[15:48:14] *hugops* for y'all, still trying to figure out T331820?
[15:48:15] T331820: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820
[15:48:16] it's nearly exclusively for officewiki
[15:48:34] which would imply it could be an issue with the private credentials
[15:48:36] looking
[15:48:58] [have to step away for ~30m, will be back]
[15:49:22] Emperor: in the interim can we try doing that pooling? will back off if there are adverse effects ofc
[15:49:58] xover reported some issues on commons with thumbnail generation earlier in -operations, didn't seem private related
[15:50:56] yeah I think there could easily be multiple issues overlapping
[15:51:10] https://phabricator.wikimedia.org/P45876
[15:51:43] of 1888 403s received by thumbor from swift, 1880 were officewiki URLs
[15:51:49] *thumbor1006
[15:54:26] I don't know how/why there would be a credential mismatch though, we've rolled the proxies in eqiad, and the thumbor instances, and they take their keys from the same place in puppet
[15:56:31] could it simply be the case that there are different keys in each DC and this is cross-DC in a way we didn't understand while we're in a switchover? or something like that.
[15:56:35] I have no idea how swift auth works
[15:57:10] could be, there are different keys per DC
[15:58:04] maybe for private wikis, it always uses a specific DC because it's only really deployed on one side?
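(Editor's aside on the P45876 tally above, i.e. 1880 of 1888 403s on thumbor1006 being officewiki URLs: a hedged sketch of how that kind of per-wiki count could be produced. The log path and the line pattern are assumptions for illustration, not a description of the real thumbor log format.)

```python
# Rough illustration only: count upstream 403s per swift container/wiki in a
# thumbor log. Both the log path and the regex are assumptions about the log
# layout on the thumbor hosts.
import re
from collections import Counter

LOG_PATH = "/var/log/thumbor/thumbor.log"                    # assumed location
PATTERN = re.compile(r"403\b.*?/v1/AUTH_[\w-]+/([\w.-]+)/")  # assumed URL shape

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

for container, total in counts.most_common(10):
    print(f"{total:6d}  {container}")
```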
[15:58:05] I can try a roll restart of thumbor again just in case but that'd involve hella session caching to be over a day
[15:58:26] (but is now using the keys from the other side in its auth attempts)
[15:58:35] aiui it doesn't see a difference between wikis with the exception of which keys to use
[15:59:08] I'll try depooling and restarting a single instance to see what happens
[16:02:49] done, watching
[16:05:01] yeah still happening
[16:11:48] What has changed since the last switchover? or did we see similar issues last time? Thumbor itself won't have changed at all
[16:20:08] <_joe_> hnowlan: I doubt anything has changed in thumbor - given it's called locally by the swift 404 handler - unless we did some swift credentials rollover recently
[16:21:02] we did actually rollover swift credentials
[16:21:38] for https://phabricator.wikimedia.org/T328901#8641543
[16:21:53] (recently as in Feb 23)
[16:24:45] https://thanos.wikimedia.org/graph?g0.expr=irate(swift_proxy_server_tempauth_token_denied_total%7Bsite%3D%22eqiad%22%7D%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=3m&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D looks like the rates were similar pre-rollover
[16:24:57] so I dunno if I'm convinced that graph corresponds to the reported issues
[16:26:24] given that credentials are being read from the same place in puppet, and we've restarted everything since I'm not sure where the failure would be
[16:29:09] just to note because I think it's unrelated - officewiki thumbor errors have been seen on thumbor since feb 26th also
[16:30:45] it seems strange to me that the eqiad rate is so high (cf codfw)
[16:31:49] I still wonder about the x-dc angle on the creds though
[16:32:10] https://github.com/openstack/swift/blob/master/swift/common/middleware/tempauth.py <-- a number of things cause the token_denied counter to rise,
[16:32:57] maybe we're not realizing that we /think/ component A in DC X is contacting component B in the same DC X, but in fact it's misconfigured or hardcoded in this scenario (which I can imagine for a special case like private officewiki), and is in fact communicating with component B in the "wrong" DC and thus using the wrong DC-specific key.
[16:33:38] or the inverse of that, or some related scenario
[16:34:29] there's no logs to give us insight into the actual failures that are happening on swift's end no?
[16:39:15] indeed not as far as I can see
[16:40:14] we could turn proxy-server logging up to debug, which I think then logs self.logger.debug('User: %s uses token %s (trans_id %s)' %
[16:40:14] (user, 's3' if s3 else token, trans_id))
[16:40:38] but that'd leak tokens everywhere which would be bad (and I'm not sure how easy it would be to tie a token back to the original credential)
[16:47:28] but I do think https://thanos.wikimedia.org/graph?g0.expr=irate(swift_proxy_server_tempauth_token_denied_total%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=30d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D shows that something is sending a lot of duff auth tokens to eqiad swift
[16:47:29] do we have a relatively recent failure case?
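(Editor's aside on the token-denied graphs linked above: the same numbers can be pulled programmatically from the Prometheus-compatible HTTP API that the Thanos frontend exposes. A minimal sketch; the Thanos URL is a placeholder, and the metric and `site` label are taken from the queries in this discussion.)

```python
# Minimal sketch: fetch the per-site tempauth token-denied rate through the
# Prometheus-compatible /api/v1/query endpoint (the Thanos URL is a placeholder).
import requests

THANOS_QUERY = "https://thanos.example.org/api/v1/query"
PROMQL = "sum by (site) (irate(swift_proxy_server_tempauth_token_denied_total[5m]))"

resp = requests.get(THANOS_QUERY, params={"query": PROMQL}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    site = series["metric"].get("site", "unknown")
    rate = float(series["value"][1])
    print(f"{site}: {rate:.3f} tokens denied/s")
```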
[16:48:38] based on reports seems like over time the broken thumbs fix themselves ~somehow, it'd be handy to get verbose logging from thumbor on something currently broken
[16:48:40] I think every eqiad frontend is denying 2-3 tokens a minute
[16:48:51] oh, thumbs-wise
[16:49:22] there are still some appearing at https://phabricator.wikimedia.org/T331820 (but perhaps inevitably a bunch of them get fixed eventually)
[16:49:27] I bet those duff auth tokens are valid in codfw :)
[16:49:28] given the historical rates I'm not sure the rate of token issues aren't related
[16:49:37] *are related
[16:50:03] they're not massively higher now than they were pre-switchover
[16:52:09] hnowlan: what makes you say that? Eqiad looks vastly higher now than at any point in the recentish past
[16:53:25] I'm not sure how far back thanos storage goes, but e.g. 90 days shows this recent eqiad-only hike is exceptional (and unmatched by anything in codfw)
[16:54:27] Emperor: my bad, wrong graph
[16:55:19] what's a moss-fe vs an ms-fe?
[16:57:31] https://commons.wikimedia.org/wiki/Special:NewFiles has all the reproduction cases you want :)
[16:58:02] hnowlan: a temporary repurposed frontend that was earmarked for MOSS (and will have to go back there in due course); same hardware
[16:58:03] AntiComposite: ah nice, thanks
[16:58:27] what's MOSS?
[16:59:32] the typo thing?
[16:59:55] media storage something, but https://wikitech.wikimedia.org/w/index.php?search=MOSS&title=Special%3ASearch&ns0=1&ns12=1&ns116=1&ns498=1 needs docs :)
[17:00:09] bblack: the Ceph-based object storage system that I will never ever have time to work on because swift is too broken
[17:00:18] https://en.wikipedia.org/wiki/Wikipedia:Typo_Team/moss was the closest I found digging around a few minutes heh
[17:00:35] ok got it
[17:00:48] cf T279621
[17:00:49] T279621: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621
[17:03:04] ty, I created https://wikitech.wikimedia.org/wiki/MOSS as a pointer to that task
[17:12:18] I can't reproduce this issue on staging-k8s thumbor, but I can do it with relative frequency on the metal instances. Dunno if there's anything useful to be read from that
[17:24:10] I'm kinda at a loss. I might try restarting some more of the metal instances to see if there's some kind of inconsistency that's been picked up
[17:46:03] lmao https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=irate(swift_proxy_server_tempauth_token_denied_total%5B5m%5D)&g0.max_source_resolution=0s&g0.partial_response=0&g0.range_input=1h&g0.stacked=0&g0.store_matches=%5B%5D&g0.tab=0
[17:47:51] files on https://commons.wikimedia.org/wiki/Special:NewFiles looking okay now
[17:49:04] restarting codfw instances had no effect, restarting eqiad ones did
[17:50:14] any idea why the restart had a positive impact?
[17:50:33] absolutely none :(
[17:51:22] maybe they'd still got an old credential in memory?
[17:51:26] candidates I can think of: Cached credentials/sessions, cached DNS queries
[17:51:29] yeah
[17:51:37] BUT! they were restarted since the credentials were rolled
[17:51:42] and they didn't fail in *all* cases
[17:53:19] hnowlan: do we have a cookbook for restarting thumbor?
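(Editor's aside on the 90-day comparison mentioned at 16:53: the same API also supports range queries, which is one way to pin down when the eqiad-only hike started and whether it dropped after the eqiad restarts. Again a sketch with a placeholder Thanos URL; retention on the Thanos side limits how far back this can look.)

```python
# Sketch: pull a 90-day history of the per-site token-denied rate through the
# Prometheus-compatible /api/v1/query_range endpoint (placeholder Thanos URL).
import time
import requests

THANOS_RANGE = "https://thanos.example.org/api/v1/query_range"
PROMQL = "sum by (site) (rate(swift_proxy_server_tempauth_token_denied_total[5m]))"

now = int(time.time())
resp = requests.get(
    THANOS_RANGE,
    params={
        "query": PROMQL,
        "start": now - 90 * 86400,  # 90 days back
        "end": now,
        "step": "1h",               # hourly samples keep the payload small
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    site = series["metric"].get("site", "unknown")
    peak = max(float(v) for _, v in series["values"])
    print(f"{site}: peak {peak:.3f} tokens denied/s over the window")
```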
[17:53:50] $ cookbook -lv | grep thumbor | `-- sre.misc-clusters.thumbor: An thumbor reboot class
[17:53:57] :)
[17:54:03] but I never used it
[17:54:11] aye, although I did this one by hand just to see if there was a single problem instance
[17:54:20] so check with the $owners if it's safe to use
[17:54:33] hnowlan: and shall I update T331820 to say we think we've addressed this?
[17:54:33] T331820: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820
[17:54:37] volans: lolsob
[17:54:46] * volans pun intended
[17:54:58] careful to not get a NullPointerException ;)
[17:55:08] when resolving owners
[17:55:24] * Emperor was going to git blame on that cookbook and then apply sysadmin jenga rules
[17:55:53] Emperor: oh I just did heh
[17:56:08] that was one of the first examples of the shared class for batch restart/reboots
[17:56:13] hnowlan: thanks :)
[17:56:40] I wonder if it was the case that only *some* thumbor workers were consistently bad
[17:56:48] I'll see if we have the info to verify that
[18:03:12] I should add some panels based on the tempauth metrics to our swift graphs
[18:04:37] seeing very inconsistent levels of "Auth GET failed" between instances - thumbor1001 and thumbor1005 are way ahead of the others (700k, 1M+ respectively)
[18:06:54] just asked ChatGPT via Telegram bot if and when I should use quotes in Puppet Hieradata.. and it ..knew
[18:13:03] or it very confidently lied to you
[18:14:44] <_joe_> TheresNoTime: it's a parrot, it can't have "confidence". It mashes things up without intelligence :)
[18:15:20] <_joe_> hnowlan: I was having a tingling feeling we should've looked at the nutcracker running on the thumbor hosts
[18:15:32] hmm. Credentials were rolled on the 23rd, thumbor1005 was restarted on the 23rd. Wonder if it somehow started thumbor without the right credentials? Still no good explanation for thumbor1001
[18:16:05] _joe_: hmm, what makes you say that?
[18:16:11] It didn't *look* like rate limiting
[18:16:32] <_joe_> yeah I don't think it was the case indeed
[18:16:53] <_joe_> but I did have that feeling because I've seen nutcracker acting funny after network partitions
[18:17:09] <_joe_> and we've had some network shakeups in eqiad in the last few weeks
[18:21:57] sigh, looking at this would have made this a lot shorter https://thanos.wikimedia.org/graph?g0.expr=sum%20by%20(instance)%20(rate(haproxy_http_request_duration_count%7Bbackend%3D%22thumbor%22%2C%20status_code%3D%22404%22%7D%5B5m%5D))&g0.tab=0&g0.stacked=0&g0.range_input=3d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[18:25:22] the sooner we get thumbor on k8s the better, as ever
[18:33:49] filed https://phabricator.wikimedia.org/T332210, will look at it in the coming days
[18:34:43] hnowlan: sorry we didn't get to try thumbor-on-k8s again today; happy to make time for it tomorrow (I've only a couple of meetings)
[18:35:18] Emperor: ah yeah, no way it'd have made sense during this. tomorrow would be good
[18:37:36] I have meetings 11:30-12:00 and 16:00-16:30 UTC, pick a time :)
[18:37:50] * Emperor off to choir rehearsal now, biab
[20:43:37] (beta cluster) okay, I'm stuck — I'm looking at T332211, I see `ENOTFOUND cloudmetrics1002.eqiad.wmnet`, that's "expected" and was corrected in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/ff669ca837caf2420cb1fd84f33ac9447cc8fc0d. deployment-docker-cpjobqueue01 and deployment-docker-changeprop01 need that new config.
[20:43:37] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments doesn't seem to apply here (even though `/srv/deployment-charts` etc is on deployment-deploy03). Help?
[20:43:38] T332211: deployment-docker-changeprop01: `worker died, restarting` - https://phabricator.wikimedia.org/T332211
[20:53:21] TheresNoTime: operations/deployment-charts is not (directly) used on beta at all - iirc there was some magic for changeprop to apply those configs directly
[20:54:06] urgh, starting to get very out of my depth here
[20:54:17] also heads up that the entire stats service on wmcs is going to get removed soon-ish. I still need to come up with some dates but https://wikitech.wikimedia.org/wiki/News/2023_Cloud_VPS_metrics_changes
[20:54:29] statsd service*
[20:56:15] TheresNoTime: https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep has some copy-paste snippets you could try
[20:56:32] thanks taavi >.<
[21:06:43] that worked
[21:06:46] ffffffff
[21:08:16] TheresNoTime: do you want me to send a reminder about the age of the beta code stewardship review task?
[21:10:37] :D
[21:23:54] think we need https://www.isbetabroken.com but `doesbetahaveacodestewardyet.com`
[21:29:30] at least you can implement that as a simple static site since it's not going to change
[21:30:59] * legoktm cries while quipping
[22:06:44] phabricator maintenance window ended. changes: people can now use "other assignee" custom field (aka 'train conductor'), PHP security fix, fixed links in footer, better error message when users try to upload large files
[22:15:54] mutante: might need a rollback
[22:16:09] You broke stuff https://phabricator.wikimedia.org/T332234
[22:16:13] Cc brennen
[22:17:04] brennen: unfortunately that seems very related to the phab deploy. undefined variable: other_assignees
[22:18:12] RhinosF1: thank you, debugging started
[22:33:01] RhinosF1: that workboard is back
[22:33:25] full error and ticket that was related linked