[10:47:50] lunch
[13:21:44] \o
[13:46:43] I'm still updating the ticket, but apparently they updated CODFW wikikube to kubernetes 1.31 on Monday
[13:49:05] interesting, did that break anything?
[13:49:12] (for us)
[13:55:49] There was the matter of that rdf-streaming-updater ;)
[13:56:06] ahh
[13:56:27] I'm still piecing things together, but I'm curious as to why the SUP wasn't affected, or at least I haven't seen any evidence that it was yet. I didn't look too closely yesterday
[14:24:22] hmm, it looks like Flink provides the latest checkpoint in its metrics: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#checkpointing
[14:25:16] We could probably add that to our dashboard
[14:37:53] yeah, that might be useful
[14:38:04] https://grafana.wikimedia.org/goto/dUr-0EENR?orgId=1 working on it
[14:49:26] sadly, `flink_jobmanager_job_lastCheckpointExternalPath{kubernetes_namespace="rdf-streaming-updater", release="commons", job_name="WCQS_Streaming_Updater"}` returns 0, that would've been nice too
[16:32:37] hmmm, this might be our problem with that metric: https://issues.apache.org/jira/browse/FLINK-20530
[18:19:07] lunch, back in ~40
[19:21:37] sorry, been back a while
[20:32:27] meh... turns out with mwscript-k8s sometimes you run the command... and then wait several minutes for a container to start
[20:33:23] That's not ideal
[20:33:54] it's not necessarily normal, i've been running mwscript-k8s over and over all day, but in the last ~10 minutes it's taken a few minutes to start each time
[21:13:30] it turns out... if you call `$pageStore->incrementLinkCacheHitOrMiss( 'hit', 'good' )` a couple million times, mediawiki OOMs
[21:14:18] still not entirely sure why, but getting closer :P
[21:26:56] well... i don't have great answers on how to fix :S The problem is, probably, that prometheus metric collection in mediawiki maintains an in-memory data structure containing all metrics collected this execution. But it's kind of expecting a webrequest that ends in about 100ms, not a maint script that visits every valid article in the wiki
[21:27:15] so we generate a bunch of metrics, fill memory, and then OOM
[21:36:31] it's only a partial explanation though... still not entirely sure why it's only hewikisource
[21:56:17] Ouch
[21:59:37] wdqs2009 (the single legacy full graph host) is alerting constantly. Looks like it's CPU-bound: https://grafana.wikimedia.org/goto/zVmjvsPHg?orgId=1
[22:00:29] I think we should probably detune the alert
[22:04:20] ryankemper any opinion on ^^? I also notice wdqs2014 is alerting for ProbeDown, will check that one out as well
[22:05:24] inflatador: yeah, we don't have the same guarantees we do for the proper services, so i'm fine detuning that one
[22:05:37] really i only care that it's passing healthchecks and responding to requests
[22:06:08] unfortunately it's a probedown health check, so that means it's not ;(
[22:09:57] I think 2014 is in a weird state? Its motd says it's a wdqs:main host, but I don't see it associated with a role in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/manifests/site.pp#2618 ?
[22:10:51] inflatador: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/manifests/site.pp#2594
[22:11:36] ryankemper ah, thanks! I might not be getting enough O to my brain ;)
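A minimal sketch of the kind of query the dashboard work above (14:24–14:49) could build on, assuming the standard Flink Prometheus reporter naming (same `flink_jobmanager_job_*` prefix as the selector pasted at 14:49, with `numberOfCompletedCheckpoints` taken from the linked checkpointing-metrics docs); the Prometheus endpoint below is a placeholder, not a real host:

```
# Eyeball a Flink checkpoint counter via the Prometheus HTTP API.
# PROM_URL is a placeholder; point it at whatever Prometheus/Thanos instance
# backs the Grafana dashboard linked above.
PROM_URL="http://prometheus.example:9090"
QUERY='flink_jobmanager_job_numberOfCompletedCheckpoints{kubernetes_namespace="rdf-streaming-updater", release="commons"}'
curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}"
# A value that keeps increasing means checkpoints are completing; a flat value
# (or no series at all) is the condition worth putting on a panel or alert.
```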
[22:12:16] I might have missed the 2014 probedown, but the last one i see is from 2009
[22:12:24] I restarted blazegraph on 2009 for the time being
[22:12:50] the probedown alerts are generic alerts that apply to other services, right? or are they our own?
[22:13:59] they're blackbox checks that we set up and they do a sparql query, ref https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051369 where we detuned them a year or so ago
[22:18:30] yeah, 2014 is throttling ATM: `WDQS=wdqs2014.codfw.wmnet; curl "http://${WDQS}/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201"`
[22:18:30] `Service load too high, please come back later`
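For reference, the URL-encoded query in the curl above decodes to `SELECT * WHERE { wikibase:Dump schema:dateModified ?y } LIMIT 1`. A sketch of re-running that same check against an arbitrary host, letting curl handle the encoding (the hostname is just an example):

```
# Probe-style SPARQL check against a single wdqs host; -G sends the
# --data-urlencode'd query as a GET parameter, same as the one-liner above.
WDQS="wdqs2014.codfw.wmnet"   # substitute the host being checked
curl -sG "http://${WDQS}/bigdata/namespace/wdq/sparql" \
  --data-urlencode 'query=SELECT * WHERE { wikibase:Dump schema:dateModified ?y } LIMIT 1'
```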