[11:06:01] I see some local changes regarding jaeger in `build2001:/srv/images/production-images` - am I safe to overwrite them with a `git pull`?
[11:36:51] stupid question time (not urgent, go help with the incident first): what does the Kafka MirrorMaker lag alert actually mean?
[11:40:02] btullis: these are probably by Jesse or Chris, but I'd say maybe copy the git diff just in case and then revert to what's in the repo
[11:41:27] moritzm: Ack, thanks.
[11:44:29] kamila_: I think that the alert is triggered by the values in this panel. https://grafana-rw.wikimedia.org/d/000000521/kafka-mirrormaker?forceLogin&orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-lag_datasource=codfw%20prometheus%2Fops&var-mirror_name=main-codfw_to_main-eqiad&viewPanel=5
[11:45:15] We have kafka-mirror-maker configured to keep topics in sync on the two corresponding kafka-main-[codfw|eqiad] clusters.
[11:45:21] ah, that makes sense I suppose, thanks btullis
[11:45:37] From the values here, we can see that two topics in particular are having a hard time keeping in sync.
[11:46:25] Both related to the job queue and htmlCacheUpdate - at this point I get out of my depth quickly.
[11:47:05] Whether that is related to the current incident, I couldn't really say.
[11:47:54] yeah
[11:47:58] it sure is interesting
[11:47:59] ...but they are still trending upwards, which is a concern
[11:48:01] thank you for the context
[11:48:09] yvw
[12:21:57] btullis: who can dig more into this? if it's related to the incident we need to know
[12:24:22] I am happy to have a look at it myself. I had assumed that, as the incident resolved, this consumer lag would drop, but it doesn't look like it is.
[12:25:41] yes, please. it's still climbing hard
[13:00:42] o/
[13:07:47] FYI: I just edited the grafana dashboard to use thanos as the datasource for all panels, rather than having to select them specifically (since the mirrormaker dashboard needs cross-DC metrics... and it was created before we had thanos)
[13:21:23] ottomata: Ack, many thanks.
[13:44:18] btullis: if you look at 30 days of data you might see similar spikes in consumer lag from mirror maker on this same topic
[13:46:35] october 10 it went behind by 330k messages
[13:47:34] dcausse: OK, will do. Thanks.
[13:52:02] dcausse: We discovered that the lag isn't so much of an issue. It's the fact that these two topics are being filled rapidly in codfw. The rate at which they are emptying looks promising, though. https://grafana.wikimedia.org/goto/3yDjMiZHg?orgId=1
[13:53:52] FYI: The incident is now in Monitoring state; reads and writes are operational in codfw
[13:55:25] btullis: sure, htmlCacheUpdate topic size (all partitions) went up to something like 832Gb around oct 9...
[13:56:07] o_o
[13:56:50] *2 because of changeprop partitioning schemes
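The panel linked at [11:44:29] reads MirrorMaker consumer lag out of Prometheus (and, after the [13:07:47] change, out of Thanos, which aggregates metrics from both DCs). Below is a minimal sketch of pulling the same numbers directly from a Prometheus-compatible query API; the endpoint URL, the Burrow-style metric name and the consumer-group label value are assumptions for illustration, not confirmed WMF names.

```python
import requests

# Hypothetical Thanos/Prometheus query endpoint and Burrow-style lag metric;
# the real metric and label names in WMF's setup may differ.
PROM_URL = "https://thanos-query.example.org/api/v1/query"
QUERY = (
    "sum by (topic) ("
    'kafka_burrow_partition_lag{group="kafka-mirror-main-codfw_to_main-eqiad"}'
    ")"
)


def mirror_maker_lag() -> dict[str, float]:
    """Return per-topic consumer lag for the MirrorMaker consumer group."""
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # Each instant-vector sample looks like {"metric": {...labels...}, "value": [ts, "val"]}.
    return {r["metric"]["topic"]: float(r["value"][1]) for r in results}


if __name__ == "__main__":
    for topic, lag in sorted(mirror_maker_lag().items(), key=lambda kv: -kv[1]):
        print(f"{topic}: {lag:,.0f} messages behind")
```

Summing by `topic` makes it easy to spot the two job-queue/htmlCacheUpdate topics mentioned at [11:46:25]; the same PromQL expression could be pasted into the Grafana panel editor.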
[20:57:51] Hi folks. Just wondering if there may be a problem with largeish (80mb) uploads to Commons. I'm running DPLA Bot and we're seeing uploads of that size take ~12 minutes or so.
[21:00:46] https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue is partially relevant
[21:00:59] But where is the bot running? How is the upload being done etc?
[21:01:12] The bot is running in AWS using Pywikibot
[21:02:17] I recently cleaned up the code, but as far as I can tell, I haven't messed with the actual calls to Pywikibot
[21:06:41] standard uploads or chunked?
[21:06:56] chunked, 3mb, async
[21:07:50] anecdotally, it feels like the upload is actually really fast, and then not acknowledged for a long time
[21:10:40] I thought that too for a file I uploaded 9 October
[23:57:00] I've been getting lots of "Failed to block internal_api_error_DBQueryError" errors lately.
[23:57:05] Is this a known problem?
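For reference on the upload thread above ([20:57:51] onwards), here is a minimal sketch of the kind of chunked, asynchronous Pywikibot upload described at [21:06:56]. Parameter names follow recent Pywikibot releases (check your version); the file title, description text and local path are placeholders, and a working user-config.py with bot credentials is assumed.

```python
import pywikibot

# Assumes a configured user-config.py with bot credentials for Commons.
site = pywikibot.Site("commons", "commons")
page = pywikibot.FilePage(site, "File:Example_large_upload.tif")  # placeholder title

site.upload(
    page,
    source_filename="/path/to/local/file.tif",          # placeholder path
    comment="Bot upload (chunked, async)",
    text="== {{int:filedesc}} ==\n{{Information|...}}",  # placeholder description
    chunk_size=3 * 1024 * 1024,  # 3 MB chunks, as mentioned at [21:06:56]
    asynchronous=True,           # final assembly/publish is deferred server-side
    ignore_warnings=False,
)
```

With asynchronous uploads the chunks themselves are stashed quickly and the final assemble/publish step is queued on the server, which would be consistent with the "upload feels fast, then not acknowledged for a long time" observation at [21:07:50].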