[01:08:46] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 16 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:08:50] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 11.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:10:26] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:12:08] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[07:46:07] PROBLEM - MariaDB sustained replica lag on s4 on db1121 is CRITICAL: 14 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[07:46:34] db1121?
[07:48:12] weird pattern since 5am: https://grafana.wikimedia.org/goto/-0W-74s4z?orgId=1
[07:49:13] MediaWiki\Specials\SpecialMostImages::reallyDoQuery
[07:49:20] RECOVERY - MariaDB sustained replica lag on s4 on db1121 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[07:51:54] it's the special pages refresh
[07:52:41] ^ doesn't need immediate action, could be related to certain tables being "cold" after the switchover, but something to take into account for pooling weights if it reoccurs
[08:37:45] The MW reported errors panel from logstash is such a neat idea, but I wonder if it wouldn't be more useful if it looked like this: https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?orgId=1&forceLogin&viewPanel=14
[08:44:01] Example outage: https://grafana.wikimedia.org/goto/p0aUN4s4z?orgId=1 for https://wikitech.wikimedia.org/wiki/Incidents/2023-02-23_PHP_worker_threads_exhaustion
[08:44:35] we could include that on the mysql-aggregated dashboard indeed
[08:44:50] the thing is- it is already there, just with a weird expression!
[08:45:09] that is why I am surprised
[08:46:08] this is what is there: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&viewPanel=14 which is an average over 14 days????
[08:46:35] "expr": "avg_over_time(log_mediawiki_mysql_hits[7d]) * 1440",
[08:46:54] "metric": "mysql_global_status_access_denied_errors",
[08:47:04] yeah, if everything is removed and only log_mediawiki_mysql_hits is left, you get my graph
[08:47:17] Yeah, I am checking yours: "expr": "log_mediawiki_mysql_hits",
[08:47:27] I think that's probably more useful for a first check
[08:48:01] just leaving it there for the DBAs to decide, but the expression looks weird to me
[08:48:26] I am fine with changing it and using the other metric
[08:48:27] Amir1: ^
[08:48:42] maybe there is a privacy reason or something why it was done like that?
[08:50:30] or maybe only big trends were wanted- but that also seems like the wrong approach to achieve that?
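To make the difference between the two expressions concrete, here is a minimal sketch of pulling both views from the standard Prometheus HTTP API in Python. The endpoint host and the time range are placeholders, not the real WMF Prometheus/Thanos setup; only the metric name log_mediawiki_mysql_hits and the two expressions come from the panel discussed above.

    # Sketch only: compare the panel's 7-day rolling average against the raw
    # series using the Prometheus HTTP range-query API. The host below is a
    # made-up placeholder, not an actual production endpoint.
    import requests

    PROM = "https://prometheus.example.org/api/v1/query_range"  # hypothetical

    def query_range(expr, start, end, step="5m"):
        """Run a range query and return the (timestamp, value) samples."""
        r = requests.get(PROM, params={"query": expr, "start": start,
                                       "end": end, "step": step})
        r.raise_for_status()
        result = r.json()["data"]["result"]
        return result[0]["values"] if result else []

    # What the dashboard panel currently plots: a 7-day rolling average scaled
    # to a per-day figure, which smooths short outages almost completely away.
    smoothed = query_range("avg_over_time(log_mediawiki_mysql_hits[7d]) * 1440",
                           "2023-02-23T00:00:00Z", "2023-02-24T00:00:00Z")

    # The plain series instead shows spikes such as the 2023-02-23 PHP worker
    # threads exhaustion outage linked above.
    raw = query_range("log_mediawiki_mysql_hits",
                      "2023-02-23T00:00:00Z", "2023-02-24T00:00:00Z")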
[08:59:12] Let me see
[09:01:36] I don't know who added it, I supposed it was one of you
[09:01:43] I did
[09:02:00] Amir1: was it at 4 am in the morning :-P?
[09:02:12] I think it's useful to have a graph for errors
[09:02:23] I mean the graph of my errors
[09:02:28] I couldn't agree more
[09:02:36] the first sentence
[09:04:06] But errors can be anything, not mysql hits. Maybe I'm missing some context. Give me a bit to get some coffee
[09:05:18] so I don't have context for what hits are, but whatever it is measuring, an instant rate seems more useful than an average over 1 week!
[09:06:31] for example, the instant rate seems to show correctly the replication lag issue I mentioned above: https://grafana.wikimedia.org/goto/YUvmdVsVz?orgId=1
[09:09:10] Ah that part. I'll write something about it
[09:09:40] not needed really, but do you see what I am proposing?
[09:11:22] Yeah yeah. I think having two would be useful
[09:11:35] 2?
[09:11:54] I need the smooth one for mw deploys
[09:12:15] I see, so that's the context I lacked
[09:12:26] Measuring impacts of mw deploys basically
[09:12:37] you want a way to see if error rates have changed compared to last week, right?
[09:12:51] Because errors tend to have a lot of spikes
[09:13:06] I can give you that with a better expression than a rolling average
[09:13:15] Yeah, but not without smoothing, because they get all sorts of random spikes
[09:13:33] Try and see
[09:14:12] something like this is a better option: https://grafana.wikimedia.org/goto/AdpAOVsVz?orgId=1
[09:15:34] let me try something on a copy and show you
[09:16:16] you definitely want to avoid average as a central value
[09:24:27] I mostly want to see the trends in the baseline, basically, or if we have so many spikes that they move the average
[09:24:54] but weekly seems excessive, maybe daily or something like that *shrugs*
[10:14:05] Amir1: this is a first approximation (daily granularity, but can be tuned further): https://grafana.wikimedia.org/goto/RAmCtVyVz?orgId=1
[10:15:14] thanks!
[10:15:35] I think it is better than the rolling average
[10:15:46] and has the last week in the same graph
[10:16:27] while spikes can be seen in case there is something to discard manually
[10:21:33] or the opposite, we can make it a 7-day total but with daily steps
[10:32:58] how hard is it to delete an individual file from swift? Specifically a thumbnail that I'd like to regenerate in this case
[10:34:54] assuming you can find it- relatively easy- it is documented at:
[10:34:57] hnowlan: you may be able to achieve that by purging the original image (at least if it's one of the standard sizes)
[10:35:12] ^what Emperor says
[10:35:14] otherwise, easy (but should be done with care), destructions on wikitech:
[10:35:30] https://wikitech.wikimedia.org/wiki/Swift/How_To#Delete_a_container_or_object
[10:35:51] [or ask me nicely ;-) ]
[10:37:03] given how easy it is to delete containers, maybe a wrapper script could be done for individual objects with some checks
[10:39:06] oooh scary :D
[10:39:19] Emperor: as in purging using purgeList.php?
[10:40:04] just add ?action=purge to the image page on commons
[10:40:21] https://en.wikipedia.org/wiki/Wikipedia:Purge#Images
[10:41:37] ohhhh, doh. Thanks!
[10:44:16] (that does still often leave the thumbs in the CDN cache, which can be confusing)
[10:49:44] good to know. Nothing I need to fix is critical enough that the CDN will interfere
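For reference, the ?action=purge approach mentioned above can also be driven from a script through the MediaWiki API. A minimal sketch, assuming the public Commons api.php endpoint and a made-up file name, and leaving out the rate-limit, maxlag and User-Agent handling a real script should have:

    # Sketch only: POST action=purge for one file page on Commons. For file
    # pages this should invalidate the standard-size thumbnails so they get
    # regenerated on the next request (CDN copies may still linger, as noted).
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def purge_file(title):
        """Purge the cache of a single page, e.g. a File: page."""
        r = requests.post(API, data={"action": "purge", "titles": title,
                                     "format": "json"})
        r.raise_for_status()
        return r.json()

    # Hypothetical example title:
    # purge_file("File:Example_photo.jpg")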
[10:51:26] I'm half seriously thinking of running action=purge with a bot on all images of commons to achieve T211661
[10:51:27] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[10:51:40] very slowly obviously
[10:56:51] I think one of u.random's KRs for this quarter is making the swift object expirer run; and/or we could just delete old thumbs :)
[11:43:17] A rather simple solution is to do rolling purges of all thumbs; if they are needed, they will get regenerated. Similar to cache TTLs basically
[11:47:59] (obviously very slowly so as not to break thumbor, e.g. only one shard/container at a time)
[12:02:36] yeah, that might work
[15:22:43] db2184 will complain about lag for a second
[15:32:57] PROBLEM - MariaDB sustained replica lag on backup1-codfw on db2184 is CRITICAL: 65.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2184&var-port=9104
[15:36:05] RECOVERY - MariaDB sustained replica lag on backup1-codfw on db2184 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2184&var-port=9104
[15:36:12] that's all
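Coming back to the rolling-purge idea from 11:43: a rough sketch of what such a bot could look like, again against the public Commons API. The batch size, the delay, and the omission of maxlag/User-Agent handling are illustrative simplifications, and in practice this would need sign-off on rate before running at any scale.

    # Sketch only: walk every file page via list=allimages and purge them one
    # by one, sleeping between requests so Thumbor regeneration load stays low.
    import time
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def iter_file_titles(batch=50):
        """Yield File: titles from list=allimages, following API continuation."""
        session = requests.Session()
        params = {"action": "query", "list": "allimages",
                  "ailimit": batch, "format": "json"}
        while True:
            data = session.get(API, params=params).json()
            for img in data["query"]["allimages"]:
                yield "File:" + img["name"]
            if "continue" not in data:
                break
            params.update(data["continue"])

    def rolling_purge(delay_seconds=5):
        """Purge one file at a time, 'very slowly' as discussed above."""
        session = requests.Session()
        for title in iter_file_titles():
            session.post(API, data={"action": "purge", "titles": title,
                                    "format": "json"})
            time.sleep(delay_seconds)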