[07:07:17] we're not doing too well with the lag right now :(
[07:07:36] wdqs1013 is super laggy, but codfw got up to 10min before going down
[07:08:20] gehel: mind depooling wdqs1013?
[07:08:31] flink restarted in codfw
[07:08:44] zpapierski: I think you can do that on this machine
[07:09:04] but I think simply restarting blazegraph will be enough
[07:09:11] I'll restart then
[07:09:26] the lag should be absorbed quickly
[07:09:54] right, 35m is manageable, especially with the new updater
[07:10:11] what happened in codfw?
[07:10:55] haven't looked yet, I suppose oomkilled but not sure, mind taking a look?
[07:11:37] I don't mind, but I'm not sure where to look in the case of the streaming updater on k8s
[07:12:01] blazegraph restarted on wdqs1013
[07:13:08] I usually check containerStatuses with "kubectl get pod -o yaml" (see https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater)
[07:13:14] thanks for the restart!
[07:13:42] ah, right - I forgot all about the times I had to play around with that, traumatic event, I guess :)
[07:13:51] looking
[07:15:29] or not, deployment servers won't accept my login
[07:15:39] ah, nvm
[07:16:06] nope, not working
[07:18:02] deployment.eqiad.wmnet ?
[07:18:24] yep, but that was me
[07:20:19] pods weren't restarted recently
[07:20:53] let's see the logs...
[07:26:29] first look suggests, and I haven't looked into the task manager logs, that the job manager lost a task manager and restarted the job
[07:26:47] after the restart it went back to business as usual
[07:27:06] it would be nice, that restarts like these didn't go up to our SLO
[07:27:15] s/. that/if
[07:28:02] https://grafana-rw.wikimedia.org/d/gCFgfpG7k/flink-session-cluster?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=rdf-streaming-updater suggests a container restart
[07:28:56] zpapierski: sorry I'm late to the party
[07:29:07] do you still need wdqs1013 depooled?
[07:29:15] when the series are not continuous it's generally a container restart
[07:29:19] you should be able to do it yourself: `sudo -i depool`
[07:29:31] yep, David told me
[07:29:44] but it turned out that a 4h lag is nothing for the new updater
[07:29:51] so depooling isn't necessary
[07:29:51] \o/
[07:30:44] dcausse: containers were restarted in the past, need to see when the last time was
[07:33:38] the -xkp2v pod seems to have restarted its container according to grafana
[07:35:32] yep
[07:35:43] https://www.irccloud.com/pastebin/3i4WcS3E/
[07:38:30] in general, it all happened quite fast - kubernetes restarted the task manager, the job manager noticed and restarted the job from a checkpoint, and it caught up afterwards
[07:40:08] it happens sometimes, one container got restarted in eqiad yesterday but the impact is generally not visible
[07:40:50] here it seems cpu usage for -xkp2v climbed before being killed so perhaps GC is to blame
[07:41:30] hmm, can't see logs on kubectl before the restart, do you know of a way?
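(A minimal sketch of the kubectl checks discussed in this exchange. The rdf-streaming-updater namespace comes from the Grafana link above; the pod name is a placeholder, and the "--previous" flag is the one mentioned in the next message.)

    # list the pods and check restart counts
    kubectl -n rdf-streaming-updater get pods
    # restartCount and lastState.terminated.reason (e.g. OOMKilled) show whether and why a container restarted
    kubectl -n rdf-streaming-updater get pod <taskmanager-pod> -o yaml | grep -A 15 containerStatuses
    # logs from before the restart come from the previous container instance
    kubectl -n rdf-streaming-updater logs <taskmanager-pod> --previous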
[07:42:08] yes, --previous or something like that, it's in the wikitech doc
[07:42:31] ah, will look there first next time
[07:42:38] but logs are in logstash too
[07:43:34] nothing interesting there, even the error wasn't logged
[07:47:53] anyway - wdqs1013 - I can't make out anything interesting about that host, but according to this - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1&from=1635217596626&to=1635219411752&var-site=eqiad&var-k8sds=eqiad%20prometheus%2Fk8s&var-opsds=eqiad%20prometheus%2Fops
[07:48:06] lag started to rise around 3:15
[07:48:28] but I see nothing particular about it on the general graphs - https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1635217090000&to=1635222547000&var-cluster_name=wdqs
[07:48:53] (at least for wdqs1013, codfw had a GC party, apparently)
[07:49:17] (other than wdqs1013 no longer reporting triples)
[07:49:56] yes it's generally an indication of either a gc death spiral or the number of threads going to insane values
[07:50:20] there are gc logs we can import into some visualization tool if you want to confirm that's the case
[07:50:41] should it be visible on GC Time/Count, though?
[07:50:57] it stops reporting metrics so not really :/
[07:51:04] ah, ok
[07:51:19] then it's probably the case and jvmquake should help
[07:51:33] yes I hope so
[07:51:53] it can survive way too long in that state, this is not good
[07:52:37] 4:30h and probably still going on
[07:53:22] what I find funny is that it actually manages to keep a steady pace at some point
[07:53:50] hmm, unless it's a metric failure
[07:54:06] I think it is
[07:54:07] what metric?
[07:54:15] https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?viewPanel=4&orgId=1&from=now-6h&to=now&var-site=eqiad&var-k8sds=eqiad%20prometheus%2Fk8s&var-opsds=eqiad%20prometheus%2Fops&refresh=30s
[07:54:28] it keeps at ~35m
[07:54:41] this one is reported by the updater
[07:55:30] it's wdqs_streaming_updater_kafka_stream_consumer_lag_Value so it's not pulled out of blazegraph
[07:55:31] ok, the blazegraph one shows almost 3h, which makes sense
[07:55:44] right after the restart the updater-reported one shows the same
[07:56:04] but when blazegraph stops responding the metric is stuck because the updater is stuck on the blazegraph connection
[07:56:17] I see
[07:56:34] good that I took the blazegraph metric for the SLO then
[07:56:35] for more precision it should report a "time" that can be subtracted from now() on the prometheus query side
[07:56:57] :)
[07:59:03] I guess it's a bit off from a pure lag POV, but closer to reality since this node can't respond to queries anyway
[07:59:59] but imo the root of the problem is having a zombie node like that for several hours
[08:00:06] yeah, agreed
[08:00:28] and the blazegraph lag gives enough visibility
[08:00:58] sure
[09:06:09] ryankemper: (for when you're back) did you have time to create an access request task for ejoseph?
[09:58:21] lunch
[10:09:52] hello folks, as an FYI I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/732611 (tls proxy change)
[10:15:50] Lunch
[10:16:18] elukey: thanks for the ping! cc ryankemper
[10:56:15] after months of complaining I need to give credit where it's due - current Gnome (40), working on Wayland, with PipeWire in place of pulseaudio, is actually all kinds of awesome. I dare say it's almost as usable as macos.
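(A sketch of the lag-metric discussion above: the first expression is the metric named in the log; the second shows the proposed alternative of exporting a last-processed-event timestamp and deriving the lag at query time, so it keeps growing even when the updater is blocked on a dead Blazegraph connection. The second metric name is hypothetical.)

    # current: lag value exported directly by the updater; it freezes when the updater is stuck
    wdqs_streaming_updater_kafka_stream_consumer_lag_Value
    # proposed: export the event timestamp instead and subtract it from now() in PromQL
    time() - wdqs_streaming_updater_last_processed_event_timestamp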
[10:57:46] lunch
[12:25:00] @team: if you have a minute to share this on LinkedIn: https://www.linkedin.com/feed/update/urn:li:activity:6857337258173956096/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A6857337258173956096%29 (the graph consultant position)
[12:25:10] or any other channels you might have access to
[12:28:32] direct link to the job posting: https://boards.greenhouse.io/wikimedia/jobs/3546920
[14:37:18] here's the message on the streaming updater to address some of the concerns, reviews quite welcome: https://docs.google.com/document/d/1PWRMGGVIZJLm95PtMletLRzEfxJAYE-oiSA_mmvyz-8/edit?usp=sharing
[14:56:18] \o
[14:59:09] o/
[15:03:44] */
[15:03:46] o/
[15:06:03] addshore: does Wikidata have some graph that shows the maxlag stats, like what's affecting it?
[15:06:50] zpapierski: https://grafana-rw.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m
[15:07:22] not high urgency, but if somebody has a chance to answer a question here: T287950
[15:07:22] T287950: Remove $wgJobQueueAggregator from CirrusSearch README - https://phabricator.wikimedia.org/T287950
[15:07:34] thanks dcausse!
[15:36:48] that's the one!
[15:37:06] how is the new updater going? nicely? :P
[15:37:28] Looks like no wdqs-induced maxlag spikes in the last 7 days
[15:38:01] or even in the last 30 days actually
[15:38:51] The new updater does seem to have a slightly different lag profile though? Slightly more lag, all of the time (only 2s instead of 0.5s), but someone on twitter noticed
[15:39:25] I think zpapierski is going to send a follow-up email about that. I was also confused initially, and someone had filed a ticket about it
[15:39:38] oooh a ticket *looks*
[15:40:01] Without knowing how it works, it kind of makes sense, as there is more logic going on as part of the update stream, and also 1 updater for all the query services
[15:40:03] https://phabricator.wikimedia.org/T294025
[15:40:16] addshore: yes it's basically a trade-off between throughput, latency and consistency
[15:40:49] I don't think anyone will / should be complaining about even 5 seconds of lag really, but it's fun to see folks noticed
[15:41:06] it's substantially more than 5 seconds :)
[15:41:41] but in general, it's about having more lag for consistency's sake, rather than losing data because we're going fast
[15:41:48] aah yes, I'm forgetting that maxlag does not mean actual lag in seconds for the query service
[15:42:22] yes that's something we might want to re-tune, I mean the scaling factor
[15:43:12] It looks like servers still sometimes drop off the map?
[15:43:13] ?
[15:43:23] if most bots use maxlag=5 we should probably change that to be in line with what we want to guarantee (lag < 10mins)
[15:43:27] like wdqs1013 today?
[15:43:27] the updater didn't fix our blazegraph instances
[15:43:40] aaah right, that's still the other blazegraph issues
[15:43:44] yes this is blazegraph misbehaving on killer queries
[15:43:44] just recovery is much faster now?
[15:43:55] yes
[15:43:57] yep, coming right at us, whenever we're happy with our work
[15:45:14] We could certainly tweak the factor a bit
[15:45:37] It might be an idea to tweak how the maxlag is affected by wdqs updates in general though
[15:46:39] Is the updater aware of how up to date / lagged the services are that it is updating?
[15:47:29] it's not updating them per se
[15:47:48] It's only feeding the kafka topic and each updater has its own consumer group
[15:47:48] ahh, it just creates more streams that are still consumed by some other process?
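(A back-of-the-envelope note on the maxlag scaling factor discussed above, assuming the reported maxlag is simply the query-service lag in seconds divided by a constant factor; the exact mechanism isn't spelled out in this log.)

    reported_maxlag = wdqs_lag_seconds / factor
    # most bots pause at reported_maxlag = 5; the stated target is wdqs lag < 10 min = 600 s
    # => a factor of about 600 / 5 = 120 would line the default bot threshold up with that target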
[15:47:51] * addshore loads the diagram
[15:47:58] one more stream
[15:48:03] mutation topic
[15:48:21] gotcha, so each query service still runs some other process, which is taking in that stream and writing it to itself?
[15:48:25] and other processes, on instances, consume that topic and mutate the graph
[15:48:31] gotcha, cool!
[15:48:38] basically, yes
[15:49:03] the nice thing here is that they all use the same, single data stream, with all the guarantees that Kafka can provide
[15:49:08] At a super super high level on our WIP architecture docs site we have this: https://wmde.github.io/wikidata-wikibase-architecture/systems/Query/03-Context_and_Scope.html
[15:49:31] and the warning on that page right now is that it doesn't reflect wikidata reality, I might go and update that and send it back to y'all for a glance once I am done
[15:49:48] great, happy to look at it
[15:49:48] I think I'm going to try and tidy this site up a bit for wikidatacon this weekend and hide some of the more WIP areas
[15:51:05] thanks for helping me update my mental map :)
[15:51:35] ah, right - it's this weekend
[15:51:56] can't wait for our discussion session on Saturday
[15:55:22] yeah, should be fun :D
[17:01:37] dinner
[17:01:59] ryankemper: I was kinda surprised to see your deploy of 0.3.90 to wcqs still in the process list on deploy1002. Is it still running? Have to re-sync with the updated runBlazegraph.sh still
[17:02:41] ebernhardson: oh weird, let me check
[17:03:41] ebernhardson: I couldn't get to your reviews today, somehow the day just escaped me, but I'll do it tomorrow. OTOH, can we move forward with this one - https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/716076 ?
[17:04:25] ryankemper: are you still available?
[17:04:28] zpapierski: I really don't know what we need to do in the other patch, I can try and look at it again
[17:04:30] ejoseph: yes one sec
[17:04:58] Oops yeah, went back into my tmux and the deploy is blocked on user input, looks like it failed to deploy to the first node so there's still something else to unbreak presumably
[17:05:11] https://www.irccloud.com/pastebin/P55Qc6HC/
[17:05:23] ebernhardson: closing it is a valid option as well, though I like the simple configuration - for mediainfo (commons), for wikidata
[17:05:27] up to you
[17:05:33] (rolled back(
[17:05:36] (rolled back)*
[17:05:51] ryankemper: which node? Likely it failed to start because runBlazegraph.sh didn't have the stuff to pass the secret, and not having secrets fails init
[17:06:01] ebernhardson: wcqs1001.eqiad.wmnet
[17:06:29] ryankemper: yup, I had checked that one, it didn't have the updated runBlazegraph.sh. A new sync should do it
[17:06:38] ebernhardson: ack
[17:09:03] hmm, still failed to start :S looking
[17:14:26] it's odd, almost like it's holding open an old version of runBlazegraph.sh somewhere, because I can see the config in the file but it doesn't make it into the command line
[17:15:30] shouldn't be possible though, the systemd unit invokes bash... hmm
[17:16:07] Maybe it doesn't take effect until we run through a deploy?
[17:16:14] Since we haven't actually applied the new version yet
[17:16:49] ryankemper: hmm, /srv/deployment/wdqs/wdqs/runBlazegraph.sh has the new script on wdqs1001, and systemd is constantly restarting the instance, so it should be reading the new runBlazegraph.sh
[17:17:41] I'm sure it's something easy that I'm just not seeing :) Will find something...
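(A rough, hypothetical sketch of the per-host consumer described earlier in this exchange: one shared mutation topic, one Kafka consumer group per wdqs/wcqs instance, each host reading the full stream and applying it to its local Blazegraph. Broker, topic and group names below are placeholders, not taken from the real deployment.)

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MutationConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka.example.org:9092");  // placeholder broker list
            props.put("group.id", "wdqs1013");                         // one consumer group per instance (naming is an assumption)
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("rdf-streaming-updater.mutation"));  // illustrative topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        applyToLocalBlazegraph(record.value());  // hypothetical helper: apply the mutation to the local graph
                    }
                }
            }
        }

        private static void applyToLocalBlazegraph(String mutation) {
            // placeholder: the real updater turns mutations into updates against the local Blazegraph instance
        }
    }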
[17:26:11] * ebernhardson facepalms: Unrecognized option: -dorg.wikidata.query.rdf.mwoauth.OAuthProxyConfig.sessionStoreKeyPrefix=mw-oauth-proxy:wcqs:
[17:26:17] it's -D
[17:31:15] haha
[17:32:57] ryankemper: https://gerrit.wikimedia.org/r/c/operations/puppet/+/734702
[17:55:13] cool, auth is now working
[21:44:40] Trey314159: you might find this fun: https://www.thisworddoesnotexist.com/
[21:44:47] or scary :)
[23:10:26] ebernhardson: thanks, that's very amusing. Many years ago, I wrote something that generates definitions, but not the words—for cheating at Scrabble!
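(On the "Unrecognized option" error above: the JVM only accepts system properties via an uppercase -D, so the lowercase -d spelling in runBlazegraph.sh was rejected at startup. The corrected flag would look like this, with the other arguments elided.)

    java ... -Dorg.wikidata.query.rdf.mwoauth.OAuthProxyConfig.sessionStoreKeyPrefix=mw-oauth-proxy:wcqs ...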