[06:04:11] hello! I'm back!
[06:10:30] looks like mr1-eqiad is down?
[06:16:50] opened https://phabricator.wikimedia.org/T331839
[08:05:57] !oncall
[08:06:09] !oncall-now sre
[08:06:09] Oncall now for team sre, rotation business_hours:
[08:06:09] e.ffie, x.ionox
[08:06:36] that's me
[08:43:13] o/
[08:43:31] Filippo and I are looking into the benthos issue, not really sure what's happening
[08:46:40] <3, let me know if I can help in any way (knowing nothing but using that data :D )
[08:50:27] I just tried one more thing but still benthos is processing around one third of the events (judging from traffic graphs)
[08:50:38] the other test that I'd like to do is https://gerrit.wikimedia.org/r/c/operations/puppet/+/897063
[08:50:52] so we change the kafka consumer group name to start completely fresh
[08:51:18] elukey: one thing we could try is to see if for any reason the logic to pick half the topics and double the messages is borked
[08:51:33] and restore temporarily to get the data from all topics
[08:52:15] feel free to discard it if there is no way it could be it ;)
[08:52:38] nono it could be anything, let's wait for godog's input
[08:53:31] it could be the kafka client used by benthos that is somehow borked after the centrallog1001 firewall change
[08:53:59] but it seems more on the kafka side
[08:54:08] or a combination of both
[08:54:13] if it's the firewall, why only partial data?
[08:54:20] I would expect all or nothing
[08:56:33] so the change that went out on friday firewalled benthos on centrallog1001 while it was still active, and 1002/2002 were in service.. I saw some weird behavior of kafka in the past when clients had their kafka connections blackholed (rather than gracefully terminated)
[08:56:50] so I tried to revert the firewall change, allow for 1001 to gracefully stop, etc..
[08:57:17] usually already established connections should not be affected by a new firewall rule
[08:58:05] then newer ones trying to connect to brokers got in trouble
[08:58:34] anyway, when I checked yesterday all clients showed some errors, so I thought it was simply an inconsistent kafka consumer group state
[08:58:55] when you restart consumers the kafka broker leader for the group rebalances the partition assignments etc..
[08:59:27] at the moment kafka reports that the consumer group is reading from 5 text partitions and 2 upload ones
[08:59:33] that makes zero sense
[08:59:58] (I tried to stop all consumers, delete the consumer group on kafka, restart etc.. same thing)
[09:00:53] this is why I'd like to try with a new consumer group name, I fear that some state is not cleaned up on our version of kafka in certain corner cases
[09:01:33] if https://gerrit.wikimedia.org/r/c/operations/puppet/+/897063 doesn't work we can try to expand the input traffic etc..
[09:01:52] but so far it seems benthos is not pulling from enough partitions of the webrequest topics
[09:03:28] elukey: ack
[09:12:13] thanks for the review, trying the new consumer group name
[09:20:36] ok nothing really changed, it is probably something else
[09:28:18] ack (sorry had to step afk)
[09:29:05] np :)
[09:29:12] so the change in consumer group name didn't really help
[09:29:34] I see now that the consumer group is reading only from upload partitions, and this is reflected in turnilo/druid of course
[09:29:37] :(
[09:42:15] pcc-worker1001.puppet-diffs.eqiad1.wikimedia.cloud seems to be struggling with HDD space issues
[09:43:19] isn't there a timer to clean up old runs?
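The partition assignment that kafka reports for the consumer group (the "5 text partitions and 2 upload ones" above) is the sort of thing the stock Kafka CLI will show; a minimal sketch, where the broker and group names are illustrative stand-ins rather than the real production values:

    # which partitions of the webrequest topics is the group assigned, and what is its lag?
    kafka-consumer-groups.sh --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092 \
        --describe --group benthos-webrequest-sampled

    # once every consumer is stopped, the old group can be deleted so the next
    # start registers a completely fresh one (the alternative to renaming it)
    kafka-consumer-groups.sh --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092 \
        --delete --group benthos-webrequest-sampled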
[10:19:27] I'm seeking a reviewer for https://gerrit.wikimedia.org/r/c/operations/puppet/+/897833 to restore puppet on prometheus hosts
[10:20:39] godog: done
[10:21:29] the decom cookbook should warn about those references, I wonder how they were missed
[10:21:43] thank you volans
[10:23:43] btw godog do you think the cookbook should check also the alerts repo? are there hostname references?
[10:27:05] volans: good question, in alerts there shouldn't be references to explicit hosts (there are in *_test.yaml)
[10:29:35] IMHO at this stage it is fine not to check
[10:34:10] ack
[10:34:20] * volans tempted to use codesearch's APIs maybe...
[10:34:42] although might cause false matches if the time to index changes is slow
[10:40:07] indeed
[10:43:34] <_joe_> volans: so you want a potentially critical workflow to depend on a service running in cloud VPS
[10:45:24] nah it would be best effort ofc, it's just a grep, I'm more worried about the workflow: merge hiera, puppet-merge, run decom
[10:45:48] that requires grepping the actual data at that time
[11:21:38] Can anyone help me understand where the 502s ATS is reporting against swift actually come from? https://grafana.wikimedia.org/goto/TsPdW0aVz?orgId=1 (codfw) and https://grafana.wikimedia.org/goto/2I0FZ0a4z?orgId=1 (eqiad) both show a rise in 502s starting 9th March around 13:00. If I query the proxy-access logs in codfw, I get ~0.24 500s/second (all of which are thumbs, so I think thumbor errors getting passed on), 0.09 503s/second, and
[11:21:38] only 1 502 logged since midnight. I don't know if any of this relates to chunked upload problems reported in T328872
[11:21:39] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872
[11:22:02] [eqiad frontends report almost no errors at all today]
[11:22:54] That marked uptick in both clusters at the same time (when I was making no changes to swift) is concerning
[11:28:14] I'm sure you're more interested in the internal paths of the requests, but maybe the superset dashboard can help you to narrow it a bit
[11:28:17] https://superset.wikimedia.org/superset/dashboard/p/qeGOw46vDXk/
[11:28:43] see the Request>>Tables (right-most) tab for some sampled URLs for example
[11:29:53] well that tells me none of the errors are PUT (which I guess eliminates it from the "upload issues" question)
[11:30:45] yep almost all GET, some HEAD
[11:33:24] and TTFB is quite slow (if you zoom the graph to a few hours and use second granularity to avoid averaging effects)
[11:34:06] there is a flat base around 7s and then peaks in the 20~30s range
[11:34:50] OK, so taking one of the requested objects and grepping for Mer_Hayrenik_Anthem_of_Armenia.ogg/Mer_Hayrenik_Anthem_of_Armenia.ogg.mp3 I find 4 hits today, all 206
[11:35:25] ...which I guess is consistent, because wherever the 502s are coming from it's not the swift proxies themselves, something else in the stack is producing a 502, surely, since otherwise I could find them in the swift logs?
[11:35:52] in fact looks like all 4 requests were the same client,
[11:36:31] do you have response times on the backend side? could those be timeouts?
[11:41:17] if I'm reading the log lines correctly, ~0.06s
[11:42:01] who's sleeping for 7s then :D
[11:42:51] IHNI :(
[11:45:13] Emperor: exactly at that time there was a deploy of thumbor
[11:45:37] Oh.
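The per-second figures above (~0.24 500s/second, 0.09 503s/second) come from counting status codes in proxy-access.log; a minimal sketch of such a count, assuming the syslog-prefixed line layout pasted later in this log, where the status code immediately follows the HTTP/1.x token (divide by the number of seconds covered to get a rate):

    # tally 5xx responses by status code in today's proxy-access log
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^HTTP\/1/) { st = $(i+1); if (st ~ /^5/) c[st]++ } }
         END { for (s in c) print s, c[s] }' /var/log/swift/proxy-access.log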
[11:45:39] 15:10 hnowlan@deploy2002: helmfile [staging] START helmfile.d/services/thumbor: apply
[11:45:53] not sure if related but surely might be
[11:46:04] is that not thumbor-on-k8s only? which I think is currently not pooled. But yes, that does smell fishy
[11:46:17] hnowlan: any thoughts on this, please?
[11:49:23] * volans has quickly checked and all k8s backends are pooled=inactive in confctl (although they have different weights: 10, 2, 0)
[11:51:06] so that shouldn't have changed the thumbs setup
[11:53:54] _shouldn't_ but it is awfully coincidental
[11:54:21] (and I still don't have much of a handle on where in our stack is saying 502 to ATS)
[12:09:52] Emperor: apologies, was in a meeting - catching up
[12:11:21] thumbor k8s definitely *shouldn't* be doing anything, especially staging
[12:11:23] but let me see
[12:12:50] staging isn't getting any requests and so *shouldn't* be making requests
[12:14:29] However, 502s are definitely a symptom of thumbor pooling :/
[12:14:47] *thumbor-k8s pooling
[12:26:20] Hmm, still a mystery then :(
[12:27:13] no requests on any of the thumbor k8s nodes that I can see recently. I don't like that timing coincidence
[12:36:12] Emperor: oh actually, I'm wrong - the usual pattern of thumbor-k8s badness is 503s rather than 502s
[12:37:28] The version of thumbor that was deployed on the 9th is the same as the version deployed days beforehand, there were stale credentials hanging around in staging
[12:37:43] *deployed in prod days beforehand
[12:39:42] I might roll-restart thumbor-k8s everywhere just in case there is some residual badness, but graphs and logs make it look like all is quiet
[12:40:59] ohhh wait
[12:41:11] thumbor2004 is spamming auth failures
[12:41:57] most codfw thumbor hosts are
[12:42:45] not a huge amount but consistently
[12:43:16] "ClientException: Auth GET failed: https://ms-fe.svc.codfw.wmnet/auth/v1.0 502 Bad Gateway"
[12:44:27] looks like a credentials issue, restarting the codfw thumbor services in case they didn't pick up the rotated credentials for some reason
[12:46:27] if it helps, from superset, eqiad and drmrs have the most of them but here are 502s in all DCs
[12:55:18] no joy :(
[12:55:34] *there
[12:59:29] those 502s on the swift auth seem like a culprit to me but I would never discount thumbor oddness
[13:07:17] that's only on codfw btw it seems
[13:11:01] there's a good bit of "ERROR Insufficient Storage 10.192.16.160:6033/sdz1" and similar in server.log on codfw ms-fe hosts
[13:11:53] Emperor: not sure if that'd be a cause
[13:17:52] that's because we have 4 failed drives in codfw servers now; shouldn't be causing issues (but it is a bit weird - swift-dispersion-report knows they're unmounted)
[13:21:28] ah okay
[13:21:36] any ideas about the 502s on auth?
[13:22:37] coincidentally there's an alert about ms-fe2012 returning 502s
[13:23:21] I don't think swift tempauth logs anything ever :(
[13:26:11] hnowlan: I don't suppose thumbor records the text that came with the 502?
[13:43:26] jbond: ok to merge your change?
[13:45:31] This is the change https://gerrit.wikimedia.org/r/c/operations/puppet/+/897853
[13:49:27] jbond: I am going to revert it, it's been there for 2 hours, and I need to merge mine, sorry about it
[13:50:04] marostegui: sorry just saw this
[13:50:13] jbond: ok, can I merge?
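The failing call in that ClientException is a plain swift v1 auth request, so it can be reproduced from a thumbor host to see what status the frontends actually hand back; a minimal sketch, where the account/user and key are placeholders rather than the real credentials:

    # ask tempauth for a token the same way the swift client inside thumbor does
    curl -si https://ms-fe.svc.codfw.wmnet/auth/v1.0 \
        -H 'X-Auth-User: mw:thumbor' -H 'X-Auth-Key: REDACTED' | head -n1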
[13:50:16] I can abandon the revert
[13:50:21] yes can be merged
[13:50:24] ok
[13:50:29] thanks
[14:01:05] hnowlan: also, I don't see why/how/what said 502s would suddenly increase around the time there was a deploy of unpooled thumbor-on-k8s?
[14:29:29] hnowlan: also, also: shouldn't it be caching auth responses, not making loads of requests?
[14:30:16] healthchecks maybe?
[14:31:30] (if you've added a bunch of thumbor-on-k8s backends to pybal config indirectly, then even if they're depooled, pybal would be doing some kind of persistent healthchecks on them)
[14:31:45] they are inactive though in confctl
[14:32:21] I noticed that some have weight=0 and I remember there were issues with weight=0, could it be a recurrence of this in some way?
[14:32:31] inactive means different things in different contexts, it's pretty confusing even to me
[14:32:49] (but I don't recall if inactive should wipe everything from pybal side and hence can't affect this)
[14:33:11] I think "inactive" for pybal still results in healthchecks, just not ipvs routing
[14:35:39] healthchecks on thumbor will not incur connections to swift
[14:36:10] Emperor: the errors are intermittent it seems, otherwise there'd be hundreds per second, so I don't think it's failing every request
[14:36:51] ok
[14:36:56] Emperor: I don't understand why the uptick happened alongside the deploy either tbh, but I also struggle to see how it'd cause this, particularly in staging which is never going to be pooled
[14:37:44] it's particularly weird in eqiad, where the load is really low
[14:38:07] if not pybal healthchecks, it probably has to be something else "internal"
[14:38:26] (icinga? or somehow triggered by swift replication?)
[14:38:47] bblack: what do you mean? the 502s are for real user traffic
[14:38:55] *user-generated
[14:39:14] Emperor: unfortunately I only get the first 60 chars in logging which is the "502 Bad Gateway"
[14:39:26] oh I wasn't aware
[14:39:41] see https://superset.wikimedia.org/superset/dashboard/p/qeGOw46vDXk/
[14:39:55] just in case there is some sort of residual issue with k8s-thumbor that is escaping logging and metrics I'm going to roll-restart thumbor-k8s wherever I can
[14:40:19] however I dunno if there's a point, the issues are manifesting on metal thumbor nodes
[14:40:59] hnowlan: do you see a change on the metal thumbor nodes around the time of the uptick noted on the ATS graphs?
[14:41:08] <-- meeting right now, sorry
[14:44:15] Emperor: there are some spikes in errors in the same hour as the uptick for ATS, but nothing consistent https://grafana-rw.wikimedia.org/d/Pukjw6cWk/thumbor?forceLogin&from=now-7d&orgId=1&to=now&viewPanel=38
[14:44:40] those ^ are errors returned by thumbor's haproxy btw
[14:45:52] hnowlan: that's generally quite a low number of 502s, compared to the levels reported by ATS
[14:48:02] yeah :/
[14:49:11] there isn't an uptick in 429 requests either, which is a side effect of a bug that has been there forever that sometimes hides exceptions (fixed in k8s of course but)
[14:51:02] 503s are higher than 502s?
[14:51:38] different thing entirely of course, but puts it in perspective
[14:52:42] that's kinda to be expected with thumbor, it routinely barfs on bad images/malformed formats etc
[14:53:27] (and gives 503s)
[14:53:47] <_joe_> did anyone restart/depool the swift frontend emitting the 502s?
[14:54:05] <_joe_> because we know that will fix the user-visible issue
[14:56:25] <_joe_> I doubt the issue is due to thumbor on k8s this time.
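Whether the k8s backends really are out of rotation is recorded in conftool and can be double-checked directly; a minimal sketch, where the tag values are guesses for illustration rather than the actual object names:

    # show pooled state and weight for the thumbor backends as conftool sees them
    sudo confctl select 'dc=codfw,cluster=thumbor' get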
It happened before we started pooling it, repeatedly
[14:56:26] looks like it's happening on all frontends
[14:57:06] <_joe_> ok, then keep one depooled if you want more investigation, roll restart the others?
[14:57:10] assuming we can trust the 502s in /var/log/swift/proxy-access.log
[14:57:13] I tried a rolling-restart of eqiad last week (IIRC), to no avail
[14:57:25] yes, on the 10th
[14:57:54] eqiad's basically depooled though, the errors are in codfw
[14:57:55] the rate on eqiad is a lot lower but I guess that's to be expected
[14:57:59] hnowlan: part of the problem is the 502s largely aren't in proxy-access.log
[14:58:14] bblack: same uptick in 502s in eqiad
[14:58:19] are they being synthesized by ATS perhaps, for some kind of connectfail/timeout?
[14:58:20] which is particularly confusing
[14:58:27] (the excess 502s)
[14:58:36] Emperor: ah, so they're just not logged at all?
[14:58:49] part of the frustration is that I don't know where these 502s are actually coming from
[14:59:05] <_joe_> bblack: I'm pretty sure that's the case
[14:59:21] it would make sense
[15:05:20] bblack: do we have a timeout of 7s somewhere in the traffic stack?
[15:05:24] the codfw swift frontends seem quite unevenly loaded, which is strange (based on wc -l of proxy-access.log)
[15:05:34] the TTFB as I was mentioning before has a base around 7s and then spikes up to 20~30s
[15:05:54] a git grep failed me earlier today
[15:06:46] e.g. ms-fe2012 6,155,301 entries, cf ms-fe2010 has 33,913,803
[15:07:51] yeah I'm seeing ~7s examples at the ATS layer
[15:08:12] if I watch ATS traffic on an upload backend in codfw for 502-to-the-user, I see entries like:
[15:08:44] Date:2023-03-13 Time:15:05:04 ConnAttempts:0 ConnReuse:5 TTFetchHeaders:7282 ClientTTFB:7284 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:7284 TotalPluginTime:1 ActivePluginTime:1 TotalTime:7284 OriginServer:swift.discovery.wmnet OriginServerTime:7283 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:502 OriginStatus:502
[15:08:49] ReqURL:http://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Gm_uzbekistan_company_logo.png/220px-Gm_uzbekistan_company_logo.png [...]
[15:08:55] all those time values are in ms, so just a little over 7s
[15:09:37] it claims "OriginStatus" was 502 in all such cases where RespStatus is 502, but I'm not sure exactly whether ATS would call a synthetic 502 an OriginStatus of 502
[15:11:32] Mar 13 15:05:24 ms-fe2011 proxy-server: REDACTED 10.192.32.36 13/Mar/2023/15/05/24 GET /v1/AUTH_mw/wikipedia-commons-local-thumb.f3/f/f3/Gm_uzbekistan_company_logo.png/220px-Gm_uzbekistan_company_logo.png HTTP/1.0 200 https://en.wikipedia.org/ Mozilla/5.0%20%28X11%3B%20CrOS%20x86_64%2014541.0.0%29%20AppleWebKit/537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome/110.0.0.0%20Safari/537.36 - - 7205 - tx901c2bde9678452a82da6-00640f3bb4 -
[15:11:32] 0.3440 - - 1678719924.264201164 1678719924.608155489 0
[15:11:41] ^-- so swift thinks it returned that in 0.3440 seconds
[15:12:03] https://docs.openstack.org/swift/queens/logs.html log format
[15:12:40] ATS is definitely capable of creating synthetic 502s, it even documents some custom reason strings for them
[15:13:14] I don't think we log the "reason" though
[15:13:23] (the text after the status code)
[15:16:01] the fact that the swift log is @15:05:24 and the ATS one @15:05:04 is "ok"? as in, are we sure it's the same request?
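On the timestamp question: per the log format linked above, the two trailing float fields in the proxy-access line are the request start and end epoch times, so they can be converted directly and compared against the ATS entry; a quick check with GNU date:

    date -u -d @1678719924.264201164 +'%H:%M:%S.%N'   # request start -> 15:05:24.264201164 UTC
    date -u -d @1678719924.608155489 +'%H:%M:%S.%N'   # request end   -> 15:05:24.608155489 UTC

i.e. the swift request both started and finished at 15:05:24, some 20 seconds after the ATS entry stamped 15:05:04, which fits the discussion below that the 502 was synthesized at the ATS layer rather than coming from this swift response.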
[15:16:24] volans: I found that with sudo cumin -x "P{O:swift::proxy}" "grep -F Gm_uzbekistan_company_logo.png/220px-Gm_uzbekistan_company_logo.png /var/log/swift/proxy-access.log | grep ' 15:05'"
[15:16:57] hello! the task for the eqiad row C upgrade is out! https://phabricator.wikimedia.org/T331882
[15:17:12] even if ATS logs the time of the incoming request and not the response, 20s seems a lot of difference in our setup
[15:17:35] (no entries at 15:04)
[15:22:38] I don't think those were the same request, those two log entries
[15:23:33] hmmm no, I thought the UAs were different, but actually they're the same
[15:23:39] still, could be a followup req trying to fix the 502
[15:26:16] seems likely those were the same req. There weren't other examples in swift logs from 15:04 or 15:06
[15:26:46] client IP matches too
[15:28:25] yeah it's clearly synthetic. 20s is more than 7, too
[15:36:38] does leave the question of what was going on in those seconds
[15:37:55] yes, and aside from ATS and swift-proxy, we also have nginx in the way
[15:38:07] do nginx error logs show the 502s (and at which of the two rates?)
[15:40:13] in general the nginx logs show a ton of errors in unified.error.log, even back through long history. lots of "broken pipe" and others
[15:40:43] but maybe the rate differs
[15:41:17] e.g.
[15:41:18] 2023/03/13 14:48:36 [error] 1831274#1831274: *751793257 writev() failed (32: Broken pipe) while sending request to upstream, client: 10.192.16.58, server: ms-fe.svc.codfw.wmnet, request: "PUT /v1/AUTH_mw/wikipedia-commons-local-thumb.0d/0/0d/PadmajaWiki.jpg/2560px-PadmajaWiki.jpg HTTP/1.1", upstream:
[15:41:23] "http://10.192.32.36:80/v1/AUTH_mw/wikipedia-commons-local-thumb.0d/0/0d/PadmajaWiki.jpg/2560px-PadmajaWiki.jpg", host: "ms-fe.svc.codfw.wmnet"
[15:42:02] that's a PUT though, but there are GETs as well
[15:51:33] most of the non-thumb errors are PUTs not GETs, AFAICT
[15:55:16] sorry, I'm kinda too distracted to keep at this persistently
[15:55:54] but maybe looking at the history of relevant 502 rates at swift-proxy vs nginx vs ATS/user-facing might provide more insight into which layer this is commonly happening at
[15:58:47] I don't think it can be nginx, since in eqiad there's basically nothing in the unified.error.log now
[15:58:58] and yet ATS is still reporting errors against eqiad swift.
[15:59:26] well, at least it's not in the logs then
[15:59:54] but still, it is the proxy layer that sits between ATS<->swift, and could have some involvement
[16:00:51] <_joe_> eqiad swift just gets the writes from mediawiki
[16:02:32] it's also getting some non-PUT requests (container listings and the like), based on the proxy-access log
[16:03:49] e.g. Mar 13 00:00:02 ms-fe1012 proxy-server: 10.192.48.18 10.64.130.2 13/Mar/2023/00/00/02 HEAD /v1/AUTH_mw/wikipedia-en-local-public.f6/f/f6/Mandeville_Place_Philadelphia.jpg HTTP/1.0 200 - wikimedia/multi-http-client%20v1.0 AUTH_tk11c74e60d... - - - tx6933926aa3f7466a95d53-00640e6782 - 0.1480 - - 1678665602.305792332 1678665602.453765154 0
[16:03:58] (those IPs are internal)
[16:11:10] ms-fe1012 has served 16,294 PUTs, 244,535 HEADs, and 1,227,529 GETs today (which surprises me a bit)
[16:57:52] <_joe_> Emperor: are those all for originals?
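The per-method totals quoted for ms-fe1012 can be pulled straight out of proxy-access.log; a minimal sketch, keying off the datetime field of the syslog-prefixed lines pasted above (the method is the field that follows it):

    # count requests by HTTP method in today's proxy-access log
    awk '{ for (i = 1; i <= NF; i++)
             if ($i ~ /^[0-9]+\/[A-Z][a-z][a-z]\/[0-9][0-9][0-9][0-9]\//) { c[$(i+1)]++; break } }
         END { for (m in c) print m, c[m] }' /var/log/swift/proxy-access.log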
[17:01:33] _joe_: if you mean "not thumbs", then there are 233 matches for 'thumb' for GET, 13,372 for HEAD, 55 for PUT
[17:07:44] <_joe_> anyways, sorry: I might have missed something: the errors all come from trafficserver in eqiad
[17:07:54] <_joe_> and it should be still contacting swift in codfw
[17:08:16] <_joe_> while the other sites seem ok?
[17:09:48] <_joe_> actually no, they all report errors
[17:10:09] <_joe_> I frankly think a roll restart of swift-proxy in codfw is what I'd do
[17:10:44] 's easy to try.
[17:11:32] was I misreading https://grafana.wikimedia.org/goto/IFt1s0a4k?orgId=1 as referring to eqiad swift then?
[17:12:00] <_joe_> it refers to ATS in eqiad
[17:12:08] <_joe_> which points to swift.discovery.wmnet
[17:12:13] <_joe_> which now points to codfw
[17:15:26] what controls swift->thumbor?
[17:15:55] (does it use the local DC only?)
[17:17:33] thumborhost in proxy-server.conf which is thumbor.svc.[dc].wmnet
[17:17:44] we no longer throw new thumbnails to the other DC
[17:18:03] I guess swift replication eventually does it indirectly?
[17:18:38] no, we don't replicate thumbnails
[17:19:08] well, that could be a driver of some differences somehow, with one DC depooled
[17:19:51] there's likely to be some regional variance between the two sides of our infra in normal times
[17:20:01] (I've done the rolling-restart)
[17:20:34] some images/thumbs are only referenced by wikis that are popular on the eqsin+ulsfo+codfw side of the world, others on the eqiad+esams+drmrs side of the world. So the sets would differentiate a bit over time.
[17:20:58] maybe that results in more misses of what would've otherwise been an existing thumb, when depooling one of the core sites
[17:21:27] (misses in swift, I mean, needing more traffic to thumbor to make them up)
[17:22:07] _joe_: points for you, the 502 rate looks to have dropped down again
[17:24:39] (and minus points for me misunderstanding what the dashboards meant)
[17:25:17] FYI the data in the webrequest_sampled_live dataset can't be trusted 100% right now for the benthos issues, it should be restored, but might still have issues, to be verified
[17:26:03] in ~1~2h you'll get the data in the sampled_128 one
[17:45:49] <_joe_> bblack: the thumbnail removal / replacement is badly broken atm, we have a lot of inconsistencies between datacenters for sure
[17:45:57] <_joe_> see the task I opened some time ago
[17:46:26] <_joe_> basically: FileMultiWrite does read from the "master" datacenter only
[17:46:37] <_joe_> with all the hilarity that ensues as a consequence
[17:46:59] ok :)
[17:47:16] <_joe_> bblack: https://phabricator.wikimedia.org/T331138
[17:48:01] * bd808 mumbles something about MediaWiki needing more active investment in media support
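On the "which swift is ATS actually talking to" confusion above: the discovery record is just DNS, so which core DC it currently points at can be checked directly; a quick check (the eqiad service name is assumed by symmetry with the codfw one quoted earlier in this log):

    # where does the discovery record resolve right now?
    dig +short swift.discovery.wmnet
    # per-DC frontend service records, for comparison
    dig +short ms-fe.svc.codfw.wmnet
    dig +short ms-fe.svc.eqiad.wmnet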
[18:18:44] even when you take the first option there, or the small ones (didn't try the large ones), they all fail [18:19:08] is this just a known-longstanding thing in the commons world, or is something more-recenty broken? [18:19:21] <_joe_> I am pretty sure it is [18:19:34] <_joe_> clearly we weren't able to extract a thumbnail from that video [18:19:43] <_joe_> hnowlan: ^^ [18:19:45] yeah but I haven't found any mpeg with a thumbnail that works [18:19:55] <_joe_> oh ok that is indeed new [18:20:07] if you search for mpegs, the search result page doesn't have thumbnails for any of them anyways, and I haven't found working ones poking around [18:21:11] <_joe_> I'd open a bug [18:21:17] ok [18:21:47] <_joe_> although I will say - I don't think we have anyone dedicated to work on this, I can maybe dig into how thumbor responds [18:23:24] <_joe_> I suspect this might have to do with recent videos and thumbor still running on jessie [18:27:34] "I don't think we have anyone dedicated to work on this" seems to be a recurrent theme about a number of things around here! :) [18:28:28] <_joe_> right? [18:30:01] https://phabricator.wikimedia.org/T244570 is already reported :) [18:39:58] try "missing thumbnails" + open + task in Phabricator search [18:40:06] wasn't aware of that bug, ty brett [18:40:16] that fix will be rolled out on thumbor-k8s at least [18:40:44] good/bad to know it's been around for that long :| [19:17:54] I suspect that thumbor is a bit of an unloved stepchild :) [19:19:21] worse that unloved it is prod software that never had a real owner. [19:20:02] Gilles did lots and lots of work to get it into prod, but never really with any tie to a team with plans for long term maintenance [19:21:04] kind of like when I rolled out the ELK stack I suppose, except without the teams that showed up to rescue me from owing it forever