[01:08:46] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 16 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:08:50] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 11.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:10:26] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:12:08] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[07:46:07] PROBLEM - MariaDB sustained replica lag on s4 on db1121 is CRITICAL: 14 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[07:46:34] db1121?
[07:48:12] weird pattern since 5am: https://grafana.wikimedia.org/goto/-0W-74s4z?orgId=1
[07:49:13] MediaWiki\Specials\SpecialMostImages::reallyDoQuery
[07:49:20] RECOVERY - MariaDB sustained replica lag on s4 on db1121 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[07:51:54] it's the special pages refresh
[07:52:41] ^ doesn't need immediate action, could be related to certain tables being "cold" after the switchover, but something to take into account for pooling weights if it reoccurs
[08:37:45] The MW reported errors panel from logstash is such a neat idea, but I wonder if it wouldn't be more useful if it looked like this: https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?orgId=1&forceLogin&viewPanel=14
[08:44:01] Example outage: https://grafana.wikimedia.org/goto/p0aUN4s4z?orgId=1 for https://wikitech.wikimedia.org/wiki/Incidents/2023-02-23_PHP_worker_threads_exhaustion
[08:44:35] we could include that on the mysql-aggregated dashboard indeed
[08:44:50] the thing is- it is already there, just with a weird expression!
[08:45:09] that is why I am surprised
[08:46:08] this is what is there: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&viewPanel=14 which is an average over 14 days????
[08:46:35] "expr": "avg_over_time(log_mediawiki_mysql_hits[7d]) * 1440",
[08:46:54] "metric": "mysql_global_status_access_denied_errors",
[08:47:04] yeah, if everything is removed and only log_mediawiki_mysql_hits is left, you get my graph
[08:47:17] Yeah, I am checking yours: "expr": "log_mediawiki_mysql_hits",
[08:47:27] I think that's probably more useful for a first check
[08:48:01] just leaving it there for the DBAs to decide, but the expression looks weird to me
[08:48:26] I am fine with changing it and using the other metric
[08:48:27] Amir1: ^
[08:48:42] maybe there is a privacy reason or something why it was done like that?
[08:50:30] or maybe only big trends were wanted- but that also seems like the wrong approach to achieve that?
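To make the difference between the two expressions concrete, here is a minimal sketch of pulling both views from the standard Prometheus HTTP API in Python. The endpoint host and the time range are placeholders, not the real WMF Prometheus/Thanos setup; only the metric name log_mediawiki_mysql_hits and the two expressions come from the panel discussed above.

    # Sketch only: compare the panel's 7-day rolling average against the raw
    # series using the Prometheus HTTP range-query API. The host below is a
    # made-up placeholder, not an actual production endpoint.
    import requests

    PROM = "https://prometheus.example.org/api/v1/query_range"  # hypothetical

    def query_range(expr, start, end, step="5m"):
        """Run a range query and return the (timestamp, value) samples."""
        r = requests.get(PROM, params={"query": expr, "start": start,
                                       "end": end, "step": step})
        r.raise_for_status()
        result = r.json()["data"]["result"]
        return result[0]["values"] if result else []

    # What the dashboard panel currently plots: a 7-day rolling average scaled
    # to a per-day figure, which smooths short outages almost completely away.
    smoothed = query_range("avg_over_time(log_mediawiki_mysql_hits[7d]) * 1440",
                           "2023-02-23T00:00:00Z", "2023-02-24T00:00:00Z")

    # The plain series instead shows spikes such as the 2023-02-23 PHP worker
    # threads exhaustion outage linked above.
    raw = query_range("log_mediawiki_mysql_hits",
                      "2023-02-23T00:00:00Z", "2023-02-24T00:00:00Z")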
[08:59:12] Let me see
[09:01:36] I don't know who added it, I supposed it was one of you
[09:01:43] I did
[09:02:00] Amir1: was it at 4 am in the morning :-P?
[09:02:12] I think it's useful to have a graph for errors
[09:02:23] I mean the graph of my errors
[09:02:28] I couldn't agree more
[09:02:36] the first sentence
[09:04:06] But errors can be anything, not mysql hits. Maybe I'm missing some context. Give me a bit to get some coffee
[09:05:18] so I don't have context for what hits are, but whatever it is measuring, an instant rate seems more useful than an average over 1 week!
[09:06:31] for example, the instant rate seems to show correctly the replication lag issue I mentioned above: https://grafana.wikimedia.org/goto/YUvmdVsVz?orgId=1
[09:09:10] Ah that part. I'll write something about it
[09:09:40] not needed really, but do you see what I am proposing?
[09:11:22] Yeah yeah. I think having two would be useful
[09:11:35] 2?
[09:11:54] I need the smooth one for mw deploys
[09:12:15] I see, so that's the context I lacked
[09:12:26] Measuring impacts of mw deploys basically
[09:12:37] you want a way to see if error rates have changed compared to last week, right?
[09:12:51] Because errors tend to have a lot of spikes
[09:13:06] I can give you that with a better expression than a rolling average
[09:13:15] Yeah, but not without smoothing, because they get all sorts of random spikes
[09:13:33] Try and see
[09:14:12] something like this is a better option: https://grafana.wikimedia.org/goto/AdpAOVsVz?orgId=1
[09:15:34] let me try something on a copy and show you
[09:16:16] you definitely want to avoid average as a central value
[09:24:27] I mostly want to see the trends in the baseline, basically, or if we have so many spikes that they move the average
[09:24:54] but weekly seems excessive, maybe daily or something like that *shrugs*
[10:14:05] Amir1: this is a first approximation (daily granularity, but can be tuned further): https://grafana.wikimedia.org/goto/RAmCtVyVz?orgId=1
[10:15:14] thanks!
[10:15:35] I think it is better than the rolling average
[10:15:46] and has the last week in the same graph
[10:16:27] while spikes can be seen in case there is something to discard manually
[10:21:33] or the opposite, we can make it a 7-day total but with daily steps
[10:32:58] how hard is it to delete an individual file from swift? Specifically a thumbnail that I'd like to regenerate in this case
[10:34:54] assuming you can find it- relatively easy- it is documented at:
[10:34:57] hnowlan: you may be able to achieve that by purging the original image (at least if it's one of the standard sizes)
[10:35:12] ^what Emperor says
[10:35:14] otherwise, easy (but should be done with care), destructions on wikitech:
[10:35:30] https://wikitech.wikimedia.org/wiki/Swift/How_To#Delete_a_container_or_object
[10:35:51] [or ask me nicely ;-) ]
[10:37:03] given how easy it is to delete containers, maybe a wrapper script could be done for individual objects with some checks
[10:39:06] oooh scary :D
[10:39:19] Emperor: as in purging using purgeList.php?
[10:40:04] just add ?action=purge to the image page on commons
[10:40:21] https://en.wikipedia.org/wiki/Wikipedia:Purge#Images
[10:41:37] ohhhh, doh. Thanks!
[10:44:16] (that does still often leave the thumbs in the CDN cache, which can be confusing)
[10:49:44] good to know. Nothing I need to fix is critical enough that the CDN will interfere
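For reference, the ?action=purge approach mentioned above can also be driven from a script through the MediaWiki API. A minimal sketch, assuming the public Commons api.php endpoint and a made-up file name, and leaving out the rate-limit, maxlag and User-Agent handling a real script should have:

    # Sketch only: POST action=purge for one file page on Commons. For file
    # pages this should invalidate the standard-size thumbnails so they get
    # regenerated on the next request (CDN copies may still linger, as noted).
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def purge_file(title):
        """Purge the cache of a single page, e.g. a File: page."""
        r = requests.post(API, data={"action": "purge", "titles": title,
                                     "format": "json"})
        r.raise_for_status()
        return r.json()

    # Hypothetical example title:
    # purge_file("File:Example_photo.jpg")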
[10:51:26] I'm half seriously thinking of running action=purge with a bot on all images of commons to achieve T211661
[10:51:27] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[10:51:40] very slowly obviously
[10:56:51] I think one of u.random's KRs for this quarter is making the swift object expirer run; and/or we could just delete old thumbs :)
[11:43:17] A rather simple solution is to do rolling purges of all thumbs; if they are needed, they will get regenerated. Similar to cache TTLs basically
[11:47:59] (obviously very slowly so as not to break thumbor, e.g. only one shard/container at a time)
[12:02:36] yeah, that might work
[15:22:43] db2184 will complain about lag for a second
[15:32:57] PROBLEM - MariaDB sustained replica lag on backup1-codfw on db2184 is CRITICAL: 65.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2184&var-port=9104
[15:36:05] RECOVERY - MariaDB sustained replica lag on backup1-codfw on db2184 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2184&var-port=9104
[15:36:12] that's all
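Coming back to the rolling-purge idea from 11:43: a rough sketch of what such a bot could look like, again against the public Commons API. The batch size, the delay, and the omission of maxlag/User-Agent handling are illustrative simplifications, and in practice this would need sign-off on rate before running at any scale.

    # Sketch only: walk every file page via list=allimages and purge them one
    # by one, sleeping between requests so Thumbor regeneration load stays low.
    import time
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def iter_file_titles(batch=50):
        """Yield File: titles from list=allimages, following API continuation."""
        session = requests.Session()
        params = {"action": "query", "list": "allimages",
                  "ailimit": batch, "format": "json"}
        while True:
            data = session.get(API, params=params).json()
            for img in data["query"]["allimages"]:
                yield "File:" + img["name"]
            if "continue" not in data:
                break
            params.update(data["continue"])

    def rolling_purge(delay_seconds=5):
        """Purge one file at a time, 'very slowly' as discussed above."""
        session = requests.Session()
        for title in iter_file_titles():
            session.post(API, data={"action": "purge", "titles": title,
                                    "format": "json"})
            time.sleep(delay_seconds)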