[06:04:11] hello! I'm back!
[06:10:30] looks like mr1-eqiad is down?
[06:16:50] opened https://phabricator.wikimedia.org/T331839
[08:05:57] !oncall
[08:06:09] !oncall-now sre
[08:06:09] Oncall now for team sre, rotation business_hours:
[08:06:09] e.ffie, x.ionox
[08:06:36] that's me
[08:43:13] o/
[08:43:31] Filippo and I are looking into the benthos issue, not really sure what's happening
[08:46:40] <3, let me know if I can help in any way (knowing nothing but using that data :D )
[08:50:27] I just tried one more thing but still benthos is processing around one third of the events (judging from traffic graphs)
[08:50:38] the other test that I'd like to do is https://gerrit.wikimedia.org/r/c/operations/puppet/+/897063
[08:50:52] so we change the kafka consumer group name to start completely fresh
[08:51:18] elukey: one thing we could try is to see if for any reason the logic to pick half the topics and double the messages is borked
[08:51:33] and restore temporarily to get the data from all topics
[08:52:15] feel free to discard it if there is no way it could be it ;)
[08:52:38] nono it could be anything, let's wait for godog's input
[08:53:31] it could be the kafka client used by benthos that is somehow borked after the centrallog1001 firewall change
[08:53:59] but it seems more on the kafka side
[08:54:08] or a combination of both
[08:54:13] if it's the firewall, why only partial data?
[08:54:20] I would expect all or nothing
[08:56:33] so the change that went out on friday firewalled benthos on centrallog1001 while it was still active, and 1002/2002 were in service.. I saw some weird behavior of kafka in the past when clients had their kafka connections blackholed (rather than gracefully terminated)
[08:56:50] so I tried to revert the firewall change, allow for 1001 to gracefully stop, etc..
[08:57:17] usually already established connections should not be affected by a new firewall rule
[08:58:05] then newer ones trying to connect to brokers got in trouble
[08:58:34] anyway, when I checked yesterday all clients showed some errors, so I thought it was simply an inconsistent kafka consumer group state
[08:58:55] when you restart consumers the kafka broker leader for the group rebalances the partition assignments etc..
[08:59:27] at the moment kafka reports that the consumer group is reading from 5 text partitions and 2 upload ones
[08:59:33] that makes zero sense
[08:59:58] (I tried to stop all consumers, delete the consumer group on kafka, restart etc.. same thing)
[09:00:53] this is why I'd like to try with a new consumer group name, I fear that some state is not cleaned up on our version of kafka in certain corner cases
[09:01:33] if https://gerrit.wikimedia.org/r/c/operations/puppet/+/897063 doesn't work we can try to expand the input traffic etc..
[09:01:52] but so far it seems benthos is not pulling from enough partitions of the webrequest topics
[09:03:28] elukey: ack
[09:12:13] thanks for the review, trying the new consumer group name
[09:20:36] ok nothing really changed, it is probably something else
[09:28:18] ack (sorry had to step afk)
[09:29:05] np :)
[09:29:12] so the change in consumer group name didn't really help
[09:29:34] I see now that the consumer group is reading only from upload partitions, and this is reflected in turnilo/druid of course
[09:29:37] :(
[09:42:15] pcc-worker1001.puppet-diffs.eqiad1.wikimedia.cloud seems to be struggling with HDD space issues
[09:43:19] isn't there a timer to clean up old runs?
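The partition assignment that kafka reports for the consumer group (the "5 text partitions and 2 upload ones" above) is the sort of thing the stock Kafka CLI will show; a minimal sketch, where the broker and group names are illustrative stand-ins rather than the real production values:

    # which partitions of the webrequest topics is the group assigned, and what is its lag?
    kafka-consumer-groups.sh --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092 \
        --describe --group benthos-webrequest-sampled

    # once every consumer is stopped, the old group can be deleted so the next
    # start registers a completely fresh one (the alternative to renaming it)
    kafka-consumer-groups.sh --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092 \
        --delete --group benthos-webrequest-sampled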
[10:19:27] I'm seeking a reviewer for https://gerrit.wikimedia.org/r/c/operations/puppet/+/897833 to restore puppet on prometheus hosts
[10:20:39] godog: done
[10:21:29] the decom cookbook should warn about those references, I wonder how they were missed
[10:21:43] thank you volans
[10:23:43] btw godog do you think the cookbook should check also the alerts repo? are there hostname references?
[10:27:05] volans: good question, in alerts there shouldn't be references to explicit hosts (there are in *_test.yaml)
[10:29:35] IMHO at this stage it is fine not to check
[10:34:10] ack
[10:34:20] * volans tempted to use codesearch's APIs maybe...
[10:34:42] although might cause false matches if the time to index changes is slow
[10:40:07] indeed
[10:43:34] <_joe_> volans: so you want a potentially critical workflow to depend on a service running in cloud VPS
[10:45:24] nah it would be best effort ofc, it's just a grep, I'm more worried about the workflow: merge hiera, puppet-merge, run decom
[10:45:48] that requires grepping the actual data at that time
[11:21:38] Can anyone help me understand where the 502s ATS is reporting against swift actually come from? https://grafana.wikimedia.org/goto/TsPdW0aVz?orgId=1 (codfw) and https://grafana.wikimedia.org/goto/2I0FZ0a4z?orgId=1 (eqiad) both show a rise in 502s starting 9th March around 13:00. If I query the proxy-access logs in codfw, I get ~0.24 500s/second (all of which are thumbs, so I think thumbor errors getting passed on), 0.09 503s/second, and
[11:21:38] only 1 502 logged since midnight. I don't know if any of this relates to chunked upload problems reported in T328872
[11:21:39] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872
[11:22:02] [eqiad frontends report almost no errors at all today]
[11:22:54] That marked uptick in both clusters at the same time (when I was making no changes to swift) is concerning
[11:28:14] I'm sure you're more interested in the internal paths of the requests, but maybe the superset dashboard can help you to narrow it a bit
[11:28:17] https://superset.wikimedia.org/superset/dashboard/p/qeGOw46vDXk/
[11:28:43] see the Request>>Tables (right-most) tab for some sampled URLs for example
[11:29:53] well that tells me none of the errors are PUT (which I guess eliminates it from the "upload issues" question)
[11:30:45] yep almost all GET, some HEAD
[11:33:24] and TTFB is quite slow (if you zoom the graph to a few hours and use second granularity to avoid averaging effects)
[11:34:06] there is a flat base around 7s and then peaks in the 20~30s range
[11:34:50] OK, so taking one of the requested objects and grepping for Mer_Hayrenik_Anthem_of_Armenia.ogg/Mer_Hayrenik_Anthem_of_Armenia.ogg.mp3 I find 4 hits today, all 206
[11:35:25] ...which I guess is consistent, because wherever the 502s are coming from it's not the swift proxies themselves, something else in the stack is producing a 502, surely, since otherwise I could find them in the swift logs?
[11:35:52] in fact looks like all 4 requests were the same client,
[11:36:31] do you have response times on the backend side? could those be timeouts?
[11:41:17] if I'm reading the log lines correctly, ~0.06s
[11:42:01] who's sleeping for 7s then :D
[11:42:51] IHNI :(
[11:45:13] Emperor: exactly at that time there was a deploy of thumbor
[11:45:37] Oh.
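The per-second figures above (~0.24 500s/second, 0.09 503s/second) come from counting status codes in proxy-access.log; a minimal sketch of such a count, assuming the syslog-prefixed line layout pasted later in this log, where the status code immediately follows the HTTP/1.x token (divide by the number of seconds covered to get a rate):

    # tally 5xx responses by status code in today's proxy-access log
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^HTTP\/1/) { st = $(i+1); if (st ~ /^5/) c[st]++ } }
         END { for (s in c) print s, c[s] }' /var/log/swift/proxy-access.log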
[11:45:39] 15:10 hnowlan@deploy2002: helmfile [staging] START helmfile.d/services/thumbor: apply
[11:45:53] not sure if related but surely might be
[11:46:04] is that not thumbor-on-k8s only? which I think is currently not pooled. But yes, that does smell fishy
[11:46:17] hnowlan: any thoughts on this, please?
[11:49:23] * volans has quickly checked and all k8s backends are pooled=inactive in confctl (although they have different weights: 10, 2, 0)
[11:51:06] so that shouldn't have changed the thumbs setup
[11:53:54] _shouldn't_ but it is awfully coincidental
[11:54:21] (and I still don't have much of a handle on where in our stack is saying 502 to ATS)
[12:09:52] Emperor: apologies, was in a meeting - catching up
[12:11:21] thumbor k8s definitely *shouldn't* be doing anything, especially staging
[12:11:23] but let me see
[12:12:50] staging isn't getting any requests and so *shouldn't* be making requests
[12:14:29] However, 502s are definitely a symptom of thumbor pooling :/
[12:14:47] *thumbor-k8s pooling
[12:26:20] Hmm, still a mystery then :(
[12:27:13] no requests on any of the thumbor k8s nodes that I can see recently. I don't like that timing coincidence
[12:36:12] Emperor: oh actually, I'm wrong - the usual pattern of thumbor-k8s badness is 503s rather than 502s
[12:37:28] The version of thumbor that was deployed on the 9th is the same as the version deployed days beforehand, there were stale credentials hanging around in staging
[12:37:43] *deployed in prod days beforehand
[12:39:42] I might roll-restart thumbor-k8s everywhere just in case there is some residual badness, but graphs and logs make it look like all is quiet
[12:40:59] ohhh wait
[12:41:11] thumbor2004 is spamming auth failures
[12:41:57] most codfw thumbor hosts are
[12:42:45] not a huge amount but consistently
[12:43:16] "ClientException: Auth GET failed: https://ms-fe.svc.codfw.wmnet/auth/v1.0 502 Bad Gateway"
[12:44:27] looks like a credentials issue, restarting the codfw thumbor services in case they didn't pick up the rotated credentials for some reason
[12:46:27] if it helps, from superset, eqiad and drmrs have the most of them but here are 502s in all DCs
[12:55:18] no joy :(
[12:55:34] *there
[12:59:29] those 502s on the swift auth seem like a culprit to me but I would never discount thumbor oddness
[13:07:17] that's only on codfw btw it seems
[13:11:01] there's a good bit of "ERROR Insufficient Storage 10.192.16.160:6033/sdz1" and similar in server.log on codfw ms-fe hosts
[13:11:53] Emperor: not sure if that'd be a cause
[13:17:52] that's because we have 4 failed drives in codfw servers now; shouldn't be causing issues (but it is a bit weird - swift-dispersion-report knows they're unmounted)
[13:21:28] ah okay
[13:21:36] any ideas about the 502s on auth?
[13:22:37] coincidentally there's an alert about ms-fe2012 returning 502s
[13:23:21] I don't think swift tempauth logs anything ever :(
[13:26:11] hnowlan: I don't suppose thumbor records the text that came with the 502?
[13:43:26] jbond: ok to merge your change?
[13:45:31] This is the change https://gerrit.wikimedia.org/r/c/operations/puppet/+/897853
[13:49:27] jbond: I am going to revert it, it's been there for 2 hours, and I need to merge mine, sorry about it
[13:50:04] marostegui: sorry just saw this
[13:50:13] jbond: ok, can I merge?
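The failing call in that ClientException is a plain swift v1 auth request, so it can be reproduced from a thumbor host to see what status the frontends actually hand back; a minimal sketch, where the account/user and key are placeholders rather than the real credentials:

    # ask tempauth for a token the same way the swift client inside thumbor does
    curl -si https://ms-fe.svc.codfw.wmnet/auth/v1.0 \
        -H 'X-Auth-User: mw:thumbor' -H 'X-Auth-Key: REDACTED' | head -n1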
[13:50:16] I can abandon the revert
[13:50:21] yes can be merged
[13:50:24] ok
[13:50:29] thanks
[14:01:05] hnowlan: also, I don't see why/how/what said 502s would suddenly increase around the time there was a deploy of unpooled thumbor-on-k8s?
[14:29:29] hnowlan: also, also: shouldn't it be caching auth responses, not making loads of requests?
[14:30:16] healthchecks maybe?
[14:31:30] (if you've added a bunch of thumbor-on-k8s backends to pybal config indirectly, then even if they're depooled, pybal would be doing some kind of persistent healthchecks on them)
[14:31:45] they are inactive though in confctl
[14:32:21] I noticed that some have weight=0 and I remember there were issues with weight=0, could it be a recurrence of this in some way?
[14:32:31] inactive means different things in different contexts, it's pretty confusing even to me
[14:32:49] (but I don't recall if inactive should wipe everything from pybal side and hence can't affect this)
[14:33:11] I think "inactive" for pybal still results in healthchecks, just not ipvs routing
[14:35:39] healthchecks on thumbor will not incur connections to swift
[14:36:10] Emperor: the errors are intermittent it seems, otherwise there'd be hundreds per second, so I don't think it's failing every request
[14:36:51] ok
[14:36:56] Emperor: I don't understand why the uptick happened alongside the deploy either tbh, but I also struggle to see how it'd cause this, particularly in staging which is never going to be pooled
[14:37:44] it's particularly weird in eqiad, where the load is really low
[14:38:07] if not pybal healthchecks, it probably has to be something else "internal"
[14:38:26] (icinga? or somehow triggered by swift replication?)
[14:38:47] bblack: what do you mean? the 502s are for real user traffic
[14:38:55] *user-generated
[14:39:14] Emperor: unfortunately I only get the first 60 chars in logging which is the "502 Bad Gateway"
[14:39:26] oh I wasn't aware
[14:39:41] see https://superset.wikimedia.org/superset/dashboard/p/qeGOw46vDXk/
[14:39:55] just in case there is some sort of residual issue with k8s-thumbor that is escaping logging and metrics I'm going to roll-restart thumbor-k8s wherever I can
[14:40:19] however I dunno if there's a point, the issues are manifesting on metal thumbor nodes
[14:40:59] hnowlan: do you see a change on the metal thumbor nodes around the time of the uptick noted on the ATS graphs?
[14:41:08] <-- meeting right now, sorry
[14:44:15] Emperor: there are some spikes in errors in the same hour as the uptick for ATS, but nothing consistent https://grafana-rw.wikimedia.org/d/Pukjw6cWk/thumbor?forceLogin&from=now-7d&orgId=1&to=now&viewPanel=38
[14:44:40] those ^ are errors returned by thumbor's haproxy btw
[14:45:52] hnowlan: that's generally quite a low number of 502s, compared to the levels reported by ATS
[14:48:02] yeah :/
[14:49:11] there isn't an uptick in 429 requests either, which is a side effect of a bug that has been there forever that sometimes hides exceptions (fixed in k8s of course but)
[14:51:02] 503s are higher than 502s?
[14:51:38] different thing entirely of course, but puts it in perspective
[14:52:42] that's kinda to be expected with thumbor, it routinely barfs on bad images/malformed formats etc
[14:53:27] (and gives 503s)
[14:53:47] <_joe_> did anyone restart/depool the swift frontend emitting the 502s?
[14:54:05] <_joe_> because we know that will fix the user-visible issue
[14:56:25] <_joe_> I doubt the issue is due to thumbor on k8s this time.
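Whether the k8s backends really are out of rotation is recorded in conftool and can be double-checked directly; a minimal sketch, where the tag values are guesses for illustration rather than the actual object names:

    # show pooled state and weight for the thumbor backends as conftool sees them
    sudo confctl select 'dc=codfw,cluster=thumbor' get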
It happened before we started pooling it, repeatedly
[14:56:26] looks like it's happening on all frontends
[14:57:06] <_joe_> ok, then keep one depooled if you want more investigation, roll restart the others?
[14:57:10] assuming we can trust the 502s in /var/log/swift/proxy-access.log
[14:57:13] I tried a rolling-restart of eqiad last week (IIRC), to no avail
[14:57:25] yes, on the 10th
[14:57:54] eqiad's basically depooled though, the errors are in codfw
[14:57:55] the rate on eqiad is a lot lower but I guess that's to be expected
[14:57:59] hnowlan: part of the problem is the 502s largely aren't in proxy-access.log
[14:58:14] bblack: same uptick in 502s in eqiad
[14:58:19] are they being synthesized by ATS perhaps, for some kind of connectfail/timeout?
[14:58:20] which is particularly confusing
[14:58:27] (the excess 502s)
[14:58:36] Emperor: ah, so they're just not logged at all?
[14:58:49] part of the frustration is that I don't know where these 502s are actually coming from
[14:59:05] <_joe_> bblack: I'm pretty sure that's the case
[14:59:21] it would make sense
[15:05:20] bblack: do we have a timeout of 7s somewhere in the traffic stack?
[15:05:24] the codfw swift frontends seem quite unevenly loaded, which is strange (based on wc -l of proxy-access.log)
[15:05:34] the TTFB as I was mentioning before has a base around 7s and then spikes up to 20~30s
[15:05:54] a git grep failed me earlier today
[15:06:46] e.g. ms-fe2012 6,155,301 entries, cf ms-fe2010 has 33,913,803
[15:07:51] yeah I'm seeing ~7s examples at the ATS layer
[15:08:12] if I watch ATS traffic on an upload backend in codfw for 502-to-the-user, I see entries like:
[15:08:44] Date:2023-03-13 Time:15:05:04 ConnAttempts:0 ConnReuse:5 TTFetchHeaders:7282 ClientTTFB:7284 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:7284 TotalPluginTime:1 ActivePluginTime:1 TotalTime:7284 OriginServer:swift.discovery.wmnet OriginServerTime:7283 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:502 OriginStatus:502
[15:08:49] ReqURL:http://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Gm_uzbekistan_company_logo.png/220px-Gm_uzbekistan_company_logo.png [...]
[15:08:55] all those time values are in ms, so just a little over 7s
[15:09:37] it claims "OriginStatus" was 502 in all such cases where RespStatus is 502, but I'm not sure exactly whether ATS would call a synthetic 502 an OriginStatus of 502
[15:11:32] Mar 13 15:05:24 ms-fe2011 proxy-server: REDACTED 10.192.32.36 13/Mar/2023/15/05/24 GET /v1/AUTH_mw/wikipedia-commons-local-thumb.f3/f/f3/Gm_uzbekistan_company_logo.png/220px-Gm_uzbekistan_company_logo.png HTTP/1.0 200 https://en.wikipedia.org/ Mozilla/5.0%20%28X11%3B%20CrOS%20x86_64%2014541.0.0%29%20AppleWebKit/537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome/110.0.0.0%20Safari/537.36 - - 7205 - tx901c2bde9678452a82da6-00640f3bb4 -
[15:11:32] 0.3440 - - 1678719924.264201164 1678719924.608155489 0
[15:11:41] ^-- so swift thinks it returned that in 0.3440 seconds
[15:12:03] https://docs.openstack.org/swift/queens/logs.html log format
[15:12:40] ATS is definitely capable of creating synthetic 502s, it even documents some custom reason strings for them
[15:13:14] I don't think we log the "reason" though
[15:13:23] (the text after the status code)
[15:16:01] the fact that the swift log is @15:05:24 and the ATS one @15:05:04 is "ok"? as in, are we sure it's the same request?
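On the timestamp question: per the log format linked above, the two trailing float fields in the proxy-access line are the request start and end epoch times, so they can be converted directly and compared against the ATS entry; a quick check with GNU date:

    date -u -d @1678719924.264201164 +'%H:%M:%S.%N'   # request start -> 15:05:24.264201164 UTC
    date -u -d @1678719924.608155489 +'%H:%M:%S.%N'   # request end   -> 15:05:24.608155489 UTC

i.e. the swift request both started and finished at 15:05:24, some 20 seconds after the ATS entry stamped 15:05:04, which fits the discussion below that the 502 was synthesized at the ATS layer rather than coming from this swift response.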
[15:16:24] volans: I found that with sudo cumin -x "P{O:swift::proxy}" "grep -F Gm_uzbekistan_company_logo.png/220px-Gm_uzbekistan_company_logo.png /var/log/swift/proxy-access.log | grep ' 15:05'"
[15:16:57] hello! the task for the eqiad row C upgrade is out! https://phabricator.wikimedia.org/T331882
[15:17:12] even if ATS logs the time of the incoming request and not the response, 20s seems a lot of difference in our setup
[15:17:35] (no entries at 15:04)
[15:22:38] I don't think those were the same request, those two log entries
[15:23:33] hmmm no, I thought the UAs were different, but actually they're the same
[15:23:39] still, could be a followup req trying to fix the 502
[15:26:16] seems likely those were the same req. There weren't other examples in swift logs from 15:04 or 15:06
[15:26:46] client IP matches too
[15:28:25] yeah it's clearly synthetic. 20s is more than 7, too
[15:36:38] does leave the question of what was going on in those seconds
[15:37:55] yes, and aside from ATS and swift-proxy, we also have nginx in the way
[15:38:07] do nginx error logs show the 502s (and at which of the two rates?)
[15:40:13] in general the nginx logs show a ton of errors in unified.error.log, even back through long history. lots of "broken pipe" and others
[15:40:43] but maybe the rate differs
[15:41:17] e.g.
[15:41:18] 2023/03/13 14:48:36 [error] 1831274#1831274: *751793257 writev() failed (32: Broken pipe) while sending request to upstream, client: 10.192.16.58, server: ms-fe.svc.codfw.wmnet, request: "PUT /v1/AUTH_mw/wikipedia-commons-local-thumb.0d/0/0d/PadmajaWiki.jpg/2560px-PadmajaWiki.jpg HTTP/1.1", upstream:
[15:41:23] "http://10.192.32.36:80/v1/AUTH_mw/wikipedia-commons-local-thumb.0d/0/0d/PadmajaWiki.jpg/2560px-PadmajaWiki.jpg", host: "ms-fe.svc.codfw.wmnet"
[15:42:02] that's a PUT though, but there are GETs as well
[15:51:33] most of the non-thumb errors are PUTs not GETs, AFAICT
[15:55:16] sorry, I'm kinda too distracted to keep at this persistently
[15:55:54] but maybe looking at the history of relevant 502 rates at swift-proxy vs nginx vs ATS/user-facing might provide more insight into which layer this is commonly happening at
[15:58:47] I don't think it can be nginx, since in eqiad there's basically nothing in the unified.error.log now
[15:58:58] and yet ATS is still reporting errors against eqiad swift.
[15:59:26] well, at least it's not in the logs then
[15:59:54] but still, it is the proxy layer that sits between ATS<->swift, and could have some involvement
[16:00:51] <_joe_> eqiad swift just gets the writes from mediawiki
[16:02:32] it's also getting some non-PUT requests (container listings and the like), based on the proxy-access log
[16:03:49] e.g. Mar 13 00:00:02 ms-fe1012 proxy-server: 10.192.48.18 10.64.130.2 13/Mar/2023/00/00/02 HEAD /v1/AUTH_mw/wikipedia-en-local-public.f6/f/f6/Mandeville_Place_Philadelphia.jpg HTTP/1.0 200 - wikimedia/multi-http-client%20v1.0 AUTH_tk11c74e60d... - - - tx6933926aa3f7466a95d53-00640e6782 - 0.1480 - - 1678665602.305792332 1678665602.453765154 0
[16:03:58] (those IPs are internal)
[16:11:10] ms-fe1012 has served 16,294 PUTs, 244,535 HEADs, and 1,227,529 GETs today (which surprises me a bit)
[16:57:52] <_joe_> Emperor: are those all for originals?
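The per-method totals quoted for ms-fe1012 can be pulled straight out of proxy-access.log; a minimal sketch, keying off the datetime field of the syslog-prefixed lines pasted above (the method is the field that follows it):

    # count requests by HTTP method in today's proxy-access log
    awk '{ for (i = 1; i <= NF; i++)
             if ($i ~ /^[0-9]+\/[A-Z][a-z][a-z]\/[0-9][0-9][0-9][0-9]\//) { c[$(i+1)]++; break } }
         END { for (m in c) print m, c[m] }' /var/log/swift/proxy-access.log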
[17:01:33] _joe_: if you mean "not thumbs", then there are 233 matches for 'thumb' for GET, 13,372 for HEAD, 55 for PUT
[17:07:44] <_joe_> anyways, sorry: I might have missed something: the errors all come from trafficserver in eqiad
[17:07:54] <_joe_> and it should be still contacting swift in codfw
[17:08:16] <_joe_> while the other sites seem ok?
[17:09:48] <_joe_> actually no, they all report errors
[17:10:09] <_joe_> I frankly think a roll restart of swift-proxy in codfw is what I'd do
[17:10:44] 's easy to try.
[17:11:32] was I misreading https://grafana.wikimedia.org/goto/IFt1s0a4k?orgId=1 as referring to eqiad swift then?
[17:12:00] <_joe_> it refers to ATS in eqiad
[17:12:08] <_joe_> which points to swift.discovery.wmnet
[17:12:13] <_joe_> which now points to codfw
[17:15:26] what controls swift->thumbor?
[17:15:55] (does it use the local DC only?)
[17:17:33] thumborhost in proxy-server.conf which is thumbor.svc.[dc].wmnet
[17:17:44] we no longer throw new thumbnails to the other DC
[17:18:03] I guess swift replication eventually does it indirectly?
[17:18:38] no, we don't replicate thumbnails
[17:19:08] well, that could be a driver of some differences somehow, with one DC depooled
[17:19:51] there's likely to be some regional variance between the two sides of our infra in normal times
[17:20:01] (I've done the rolling-restart)
[17:20:34] some images/thumbs are only referenced by wikis that are popular on the eqsin+ulsfo+codfw side of the world, others on the eqiad+esams+drmrs side of the world. So the sets would differentiate a bit over time.
[17:20:58] maybe that results in more misses of what would've otherwise been an existing thumb, when depooling one of the core sites
[17:21:27] (misses in swift, I mean, needing more traffic to thumbor to make them up)
[17:22:07] _joe_: points for you, the 502 rate looks to have dropped down again
[17:24:39] (and minus points for me misunderstanding what the dashboards meant)
[17:25:17] FYI the data in the webrequest_sampled_live dataset can't be trusted 100% right now for the benthos issues, it should be restored, but might still have issues, to be verified
[17:26:03] in ~1~2h you'll get the data in the sampled_128 one
[17:45:49] <_joe_> bblack: the thumbnail removal / replacement is badly broken atm, we have a lot of inconsistencies between datacenters for sure
[17:45:57] <_joe_> see the task I opened some time ago
[17:46:26] <_joe_> basically: FileMultiWrite does read from the "master" datacenter only
[17:46:37] <_joe_> with all the hilarity that ensues as a consequence
[17:46:59] ok :)
[17:47:16] <_joe_> bblack: https://phabricator.wikimedia.org/T331138
[17:48:01] * bd808 mumbles something about MediaWiki needing more active investment in media support
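On the "which swift is ATS actually talking to" confusion above: the discovery record is just DNS, so which core DC it currently points at can be checked directly; a quick check (the eqiad service name is assumed by symmetry with the codfw one quoted earlier in this log):

    # where does the discovery record resolve right now?
    dig +short swift.discovery.wmnet
    # per-DC frontend service records, for comparison
    dig +short ms-fe.svc.codfw.wmnet
    dig +short ms-fe.svc.eqiad.wmnet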
[18:18:44] even when you take the first option there, or the small ones (didn't try the large ones), they all fail [18:19:08] is this just a known-longstanding thing in the commons world, or is something more-recenty broken? [18:19:21] <_joe_> I am pretty sure it is [18:19:34] <_joe_> clearly we weren't able to extract a thumbnail from that video [18:19:43] <_joe_> hnowlan: ^^ [18:19:45] yeah but I haven't found any mpeg with a thumbnail that works [18:19:55] <_joe_> oh ok that is indeed new [18:20:07] if you search for mpegs, the search result page doesn't have thumbnails for any of them anyways, and I haven't found working ones poking around [18:21:11] <_joe_> I'd open a bug [18:21:17] ok [18:21:47] <_joe_> although I will say - I don't think we have anyone dedicated to work on this, I can maybe dig into how thumbor responds [18:23:24] <_joe_> I suspect this might have to do with recent videos and thumbor still running on jessie [18:27:34] "I don't think we have anyone dedicated to work on this" seems to be a recurrent theme about a number of things around here! :) [18:28:28] <_joe_> right? [18:30:01] https://phabricator.wikimedia.org/T244570 is already reported :) [18:39:58] try "missing thumbnails" + open + task in Phabricator search [18:40:06] wasn't aware of that bug, ty brett [18:40:16] that fix will be rolled out on thumbor-k8s at least [18:40:44] good/bad to know it's been around for that long :| [19:17:54] I suspect that thumbor is a bit of an unloved stepchild :) [19:19:21] worse that unloved it is prod software that never had a real owner. [19:20:02] Gilles did lots and lots of work to get it into prod, but never really with any tie to a team with plans for long term maintenance [19:21:04] kind of like when I rolled out the ELK stack I suppose, except without the teams that showed up to rescue me from owing it forever