[09:05:25] FIRING: [4x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:35:25] RESOLVED: [4x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:29:31] I'm looking at a graphite alert for wikidata and wondering if there is an equivalent prom metric? The metric name is `wikidata.maxlag` and I found it in this dashboard: https://grafana-rw.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?viewPanel=12&orgId=1&forceLogin=&from=1711411200000&to=1711497599000
[20:31:47] inflatador: Good question! Looks like the source of that metric is maxlag.php (https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/wmde/scripts/+/refs/heads/master/src/wikidata/maxlag.php)
[20:32:59] It appears that script runs somewhere in the analytics infrastructure. We have several metrics emitted by analytics that need input from them before we can convert them to Prometheus. Unfortunately, that means I can't offer a way to make it into a Prometheus metric at the moment :/
[20:34:18] cwhite thanks for getting back to me! I've been reading old tickets and I think (hope?) https://phabricator.wikimedia.org/T331405#8806909 is roughly the same. Still need to test a bit
[20:34:52] not as tidy as `wikidata.maxlag`, unfortunately ;(
[20:36:17] Indeed. Someday we'll have a path forward for the analytics metrics and hopefully make them a bit easier to reference :)
[20:36:36] cwhite: maxlag is actually inside mediawiki too
[20:36:43] the script from analytics just queries wikidata specifically
[20:38:23] hmm, maybe a new metric we can emit from MW directly?
[20:38:32] or rather, wow: it causes an API error to happen and then parses the error
[20:38:52] to get the lag measurement that I'm pretty sure the RDBMS loadbalancer inside MW already has
[20:39:39] (you can provide `maxlag` on the API to indicate the maximum number of seconds of replication lag, relative to the master DB, that you are willing to accept for your read; the script sets it to -1 seconds, which guarantees the error)
[20:41:16] like this for instance: https://www.wikidata.org/w/api.php?action=query&formatversion=2&meta=siteinfo&siprop=dbrepllag&sishowalldb=true
[20:42:42] oh, and that repo is named `analytics/wmde/scripts`? so maybe it is really for their environment, not ours?
[20:42:51] interesting... I assumed the maxlag.php script was querying blazegraph
[20:43:11] if it's just hitting api.php though, probably not
[20:43:51] inflatador: yeah, that is just mediawiki asking its mysql replicas how far behind they are
[20:44:14] `maxlag` has a specific meaning in the mediawiki API
[20:46:25] Interesting, thanks cdanis!
[20:46:36] +1 to that
[20:47:09] We recently migrated the `mediawiki.loadbalancer.lag.$group.$host` metric - perhaps we can write a query based on that?
[20:47:28] well, I think inflatador is concerned with essentially WDQS indexing delay, a different kind of lag
[20:49:15] Blazegraph lag is a more accurate measurement for stuff we own. We already monitor/alert on that, but we also want to know when wikidata itself is lagged, as that is a better measure of user impact
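A minimal Python sketch of reading replica lag from the dbrepllag endpoint linked at 20:41:16: it only illustrates that API call (not the error-parsing trick maxlag.php uses, per 20:38:32) and assumes the standard formatversion=2 response shape.

```python
# Illustrative only: poll the siteinfo/dbrepllag endpoint from 20:41:16 and
# report the worst replica lag. The response shape is assumed to be the
# standard formatversion=2 layout:
#   {"query": {"dbrepllag": [{"host": ..., "lag": <seconds>}, ...]}}
import requests

API = "https://www.wikidata.org/w/api.php"

def max_replica_lag() -> float:
    resp = requests.get(
        API,
        params={
            "action": "query",
            "format": "json",
            "formatversion": "2",
            "meta": "siteinfo",
            "siprop": "dbrepllag",
            "sishowalldb": "true",
        },
        timeout=10,
    )
    resp.raise_for_status()
    lags = resp.json()["query"]["dbrepllag"]
    # One entry per replica host; take the worst lag, in seconds.
    return max(entry["lag"] for entry in lags)

if __name__ == "__main__":
    print(f"max replica lag: {max_replica_lag():.1f}s")
```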
[20:49:45] oh
[20:50:13] it's a "nice to have"
[20:50:41] sure, as long as the kind of lag you're concerned about is the mariadb replication delay, that metric cwhite mentioned will probably help
[20:50:56] b/c we have accidentally caused throttling to wikidata, see T360993
[20:50:58] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993
[20:51:55] cool, I think we'll just make a graphite alert then. Long-term, the solution is probably a better prometheus query
[20:52:06] In the context of Wikidata, the usual mw maxlag is abused to also report on the wdqs lag. Something like max(mariadb_lag, 60*wdqs_lag).
[20:52:32] ! TIL
[20:53:24] I'm pretty sure that the wdqs lag is taken from graphite, via a cron job or systemd timer (yes, it's convoluted)
[20:54:51] for the wikidata alerts, sounds like it
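A literal Python rendering of the formula quoted at 20:52:06; the variable names and the 60x factor are taken at face value from that message ("something like"), so this is a sketch of the idea rather than the actual Wikibase logic.

```python
# Sketch of the combination quoted at 20:52:06: Wikidata's reported maxlag is
# the worse of plain MariaDB replication lag and a scaled WDQS updater lag.
# The 60x factor and the units are assumed as written in that message.
def wikidata_maxlag(mariadb_lag: float, wdqs_lag: float) -> float:
    return max(mariadb_lag, 60 * wdqs_lag)

# e.g. 2s of DB lag but a wdqs_lag of 10 -> 600 reported, which is what ends
# up throttling maxlag-respecting API clients (cf. T360993 above).
print(wikidata_maxlag(2, 10))  # 600
```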