[09:05:25] FIRING: [4x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:35:25] RESOLVED: [4x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:29:31] I'm looking at a graphite alert for wikidata and wondering if there is an equivalent prom metric? The metric name is `wikidata.maxlag` and I found it in this dashboard: https://grafana-rw.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?viewPanel=12&orgId=1&forceLogin=&from=1711411200000&to=1711497599000
[20:31:47] inflatador: Good question! Looks like the source of that metric is maxlag.php (https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/wmde/scripts/+/refs/heads/master/src/wikidata/maxlag.php)
[20:32:59] It appears that script runs somewhere in the analytics infrastructure. We have several metrics emitted by analytics that need input from them before we can convert them to Prometheus. Unfortunately, that means I can't offer a way to make it into a Prometheus metric at the moment :/
[20:34:18] cwhite thanks for getting back to me! I've been reading old tickets and I think (hope?) https://phabricator.wikimedia.org/T331405#8806909 is roughly the same. Still need to test a bit
[20:34:52] not as tidy as `wikidata.maxlag`, unfortunately ;(
[20:36:17] Indeed. Someday we'll have a path forward for the analytics metrics and hopefully make them a bit easier to reference :)
[20:36:36] cwhite: maxlag is actually inside mediawiki too
[20:36:43] the script from analytics just queries wikidata specifically
[20:38:23] hmm, maybe a new metric we can emit from MW directly?
[20:38:32] or rather, wow: it causes an API error to happen and then parses the error
[20:38:52] to get the lag measurement that I'm pretty sure the RDBMS loadbalancer inside MW already has
[20:39:39] (you can provide `maxlag` on the API to indicate the maximum number of seconds of replication lag, relative to the master DB, that you are willing to accept for your read; the script sets it to -1 seconds, which guarantees the error)
[20:41:16] like this for instance: https://www.wikidata.org/w/api.php?action=query&formatversion=2&meta=siteinfo&siprop=dbrepllag&sishowalldb=true
[20:42:42] oh, and that repo is named `analytics/wmde/scripts`? so maybe it is really for their environment, not ours?
[20:42:51] interesting... I assumed the maxlag.php script was querying blazegraph
[20:43:11] if it's just hitting api.php though, probably not
[20:43:51] inflatador: yeah, that is just mediawiki asking its mysql replicas how far behind they are
[20:44:14] `maxlag` has a specific meaning in the mediawiki API
[20:46:25] Interesting, thanks cdanis!
[20:46:36] +1 to that
[20:47:09] We recently migrated the `mediawiki.loadbalancer.lag.$group.$host` metric - perhaps we can write a query based on that?
[20:47:28] well, I think inflatador is concerned with essentially WDQS indexing delay, a different kind of lag
[20:49:15] Blazegraph lag is a more accurate measurement for stuff we own. We already monitor/alert on that, but we also want to know when wikidata itself is lagged, as that is a better measure of user impact
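A minimal Python sketch of reading replica lag from the dbrepllag endpoint linked at 20:41:16: it only illustrates that API call (not the error-parsing trick maxlag.php uses, per 20:38:32) and assumes the standard formatversion=2 response shape.

```python
# Illustrative only: poll the siteinfo/dbrepllag endpoint from 20:41:16 and
# report the worst replica lag. The response shape is assumed to be the
# standard formatversion=2 layout:
#   {"query": {"dbrepllag": [{"host": ..., "lag": <seconds>}, ...]}}
import requests

API = "https://www.wikidata.org/w/api.php"

def max_replica_lag() -> float:
    resp = requests.get(
        API,
        params={
            "action": "query",
            "format": "json",
            "formatversion": "2",
            "meta": "siteinfo",
            "siprop": "dbrepllag",
            "sishowalldb": "true",
        },
        timeout=10,
    )
    resp.raise_for_status()
    lags = resp.json()["query"]["dbrepllag"]
    # One entry per replica host; take the worst lag, in seconds.
    return max(entry["lag"] for entry in lags)

if __name__ == "__main__":
    print(f"max replica lag: {max_replica_lag():.1f}s")
```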
[20:49:45] oh
[20:50:13] it's a "nice to have"
[20:50:41] sure, as long as the kind of lag you're concerned about is the mariadb replication delay, that metric cwhite mentioned will probably help
[20:50:56] b/c we have accidentally caused throttling to wikidata, see T360993
[20:50:58] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993
[20:51:55] cool, I think we'll just make a graphite alert then. Long-term, the solution is probably a better prometheus query
[20:52:06] In the context of Wikidata, the usual mw maxlag is abused to also report on the wdqs lag. Something like max(mariadb_lag, 60*wdqs_lag).
[20:52:32] ! TIL
[20:53:24] I'm pretty sure that the wdqs lag is taken from graphite, via a cron job or systemd timer (yes, it's convoluted)
[20:54:51] for the wikidata alerts, sounds like it
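A literal Python rendering of the formula quoted at 20:52:06; the variable names and the 60x factor are taken at face value from that message ("something like"), so this is a sketch of the idea rather than the actual Wikibase logic.

```python
# Sketch of the combination quoted at 20:52:06: Wikidata's reported maxlag is
# the worse of plain MariaDB replication lag and a scaled WDQS updater lag.
# The 60x factor and the units are assumed as written in that message.
def wikidata_maxlag(mariadb_lag: float, wdqs_lag: float) -> float:
    return max(mariadb_lag, 60 * wdqs_lag)

# e.g. 2s of DB lag but a wdqs_lag of 10 -> 600 reported, which is what ends
# up throttling maxlag-respecting API clients (cf. T360993 above).
print(wikidata_maxlag(2, 10))  # 600
```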