[00:22:31] fyi, i'll be working tomorrow (Tuesday)
[06:50:01] seeing `Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate` quite a bit, runbook says to let y'all know, I have zero idea what I'm looking at with wd, but it doesn't look like a data-reload is in progress *shrug*
[07:06:31] tn: thanks for the heads up, it's likely we need to tune that alert to be less noisy
[07:07:16] gehel: We got tagged on https://gerrit.wikimedia.org/r/c/operations/puppet/+/734988 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/735012 but it looks maps-related so we should probably tag someone from the appropriate team (come to think of it I don't know who owns maps now)
[07:07:19] No worries, thought I'd be safer saying something *just in case*
[07:07:55] tn: yes, I/we definitely always appreciate it :)
[07:08:44] gehel: and for the elastic stuff, here's the relation chain on that: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736116/ I'm out Tuesday so will proceed on Wednesday if everything looks good
[08:00:12] ryankemper: I tagged hnowlan on those CRs
[08:31:12] tn / ryankemper: that alert is for wcqs, which isn't in production yet, we should probably silence that cluster until we're ready
[08:32:49] dcausse, zpapierski, ejoseph: should we talk about the next project for Emmanuel at 11:30?
[08:33:05] Invite sent, looks like all of our calendars are free
[08:33:16] ah, our timezones match now :)
[08:34:12] yes! and if I understand correctly, Nigeria does not observe daylight saving time, so we'll keep the same time until next spring.
[08:34:29] yep, it's WAT all year long
[08:34:38] I'm envious
[08:49:26] zpapierski: there is interest in your TDD workshop, but the dates on https://www.mediawiki.org/wiki/Code_Health_Group/projects/DevEd/Workshops#Test_Driven_Development_Bowling_Kata_Workshop are in the past...
[08:49:30] Could you update?
[08:53:08] sure thing, just did
[08:53:55] but it'll take some time, I can't do it this week and it's a bank holiday in Poland (and some other WW2-participating countries as well) the next week
[08:57:04] hello folks, wcqs1001 seems to be returning alerts for burning free allocators, and blazegraph shows some errors (mostly connection resets) in the logs
[08:57:31] good morning all
[08:59:11] elukey: thanks for the ping! WCQS isn't in production yet, so that should not be an issue. We still need to improve this alerting.
[09:00:12] zpapierski: there is still a date in the past for the TDD workshop
[09:00:35] elukey: let me see if I can find how to silence that alert in alertmanager
[09:03:53] elukey: should be silenced for now
[09:06:32] ack thanks :)
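For reference, silences like the one gehel creates here can also be set through Alertmanager's v2 HTTP API. A minimal sketch follows; the Alertmanager host, the matcher label, and the comment are assumptions for illustration, not the actual production values:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class SilenceWcqsAlerts {
    public static void main(String[] args) throws Exception {
        // Hypothetical matcher: silence everything from wcqs* hosts for a week.
        // Check the real alert's labels before copying this.
        String silence = String.format(
            "{\"matchers\":[{\"name\":\"instance\",\"value\":\"wcqs.*\",\"isRegex\":true}],"
                + "\"startsAt\":\"%s\",\"endsAt\":\"%s\","
                + "\"createdBy\":\"search-platform\",\"comment\":\"WCQS not in production yet\"}",
            Instant.now(), Instant.now().plus(7, ChronoUnit.DAYS));
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://alertmanager.example.org:9093/api/v2/silences"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(silence))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // Alertmanager answers with a JSON body containing the new silenceID.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```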
[09:25:07] gehel: forgot to change a 10 to 11
[09:25:10] it's done
[09:30:21] zpapierski: are you available?
[09:30:45] ejoseph: in a meeting, I'll let you know once I'm done
[09:39:43] Ok cool
[09:57:37] ejoseph: I'll be 2' late for our meeting, I really need a quick break!
[10:05:07] ejoseph: I'm there
[10:41:19] some of the plugins that need to be upgraded:
[10:41:20] https://gerrit.wikimedia.org/r/admin/repos/search/extra
[10:41:24] https://gerrit.wikimedia.org/r/admin/repos/search/extra-analysis
[10:41:28] https://gerrit.wikimedia.org/r/admin/repos/search/highlighter
[10:42:29] plugins: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/plugin_urls.lst
[10:43:32] https://gerrit.wikimedia.org/r/c/search/extra/+/711226
[11:23:33] zpapierski: I fixed both and it works as expected, you can join me on the same meeting link
[11:23:59] I'm there
[12:24:38] break
[13:59:28] Errand, back in 20'
[14:10:36] Running runUpdate.sh on my own machine, it was working, then it crashed due to a null pointer exception, and re-starting the process didn't help. https://hackerpaste.hns.siasky.net/#CADQ-0rx0fASCq5PROzHyqHcRpp3QSHF1T-gEZor0LRlAQgP0ayKgFD-eG82znpIMV
[14:12:59] hare: seems like the recent change entry has a "null" namespace
[14:13:51] oh, version is the .40 docker build
[14:14:25] Was this a bug in an earlier version of the query service?
[14:15:08] hare: it's been a while since we used the recent change poller
[14:15:30] I suppose this is a good time for me to look into the Kafka streamer
[14:15:34] at a glance it seems like a recent change entry without a namespace
[14:15:59] I think we should add some guard in the updater code
[14:16:14] but not sure why it suddenly stopped working on your side
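A guard of the kind dcausse proposes at 14:15:59 might look roughly like the sketch below. The class and accessor names are stand-ins, not the actual wikidata-query-rdf types; the real fix ships in the jar he builds later in this log:

```java
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RecentChangeGuard {
    private static final Logger log = LoggerFactory.getLogger(RecentChangeGuard.class);

    /** Drop malformed API entries instead of letting them NPE deeper in the updater. */
    static void processBatch(List<RecentChange> batch) {
        for (RecentChange change : batch) {
            if (change.getNamespace() == null) {
                log.warn("Skipping change without a namespace: {}", change);
                continue;
            }
            // ... hand the change to the munger/updater as before ...
        }
    }

    /** Stand-in for the real recent-change record type. */
    interface RecentChange {
        Long getNamespace();
    }
}
```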
[14:20:10] hare: is it easy for you to deploy a custom version of the updater? if yes I can provide a custom build with a fix
[14:20:37] I just have to swap out the runUpdate.sh file, right?
[14:22:30] ejoseph: did you manage to talk about java types with zpapierski? Otherwise I have some time now if you want to jump in a meet.
[14:22:49] we did talk about them, yes
[14:23:06] hare: I believe you just have to swap the jar file "wikidata-query-tools-VERSION-jar-with-dependencies.jar" with a new one
[14:23:13] zpapierski: so you've seen ejoseph's code, pretty good!
[14:23:54] yep, we also had a chance to push even further into stream territory
[14:24:16] gehel: did you have a chance to provide ejoseph with some background on maven?
[14:24:27] nope, not yet, but I can
[14:24:46] dcausse: if you come up with a replacement file I'd like to try it. I can probably do some Docker-Compose magic to shoehorn it in there, or hotswap it while the container is still running to test it
[14:24:47] that would be great, every time I try to explain something about maven, it does the opposite
[14:25:34] hare: docker cp would probably be enough
[14:25:57] you can also replace the reference in runUpdate.sh to another jar file
[14:26:09] hare: ok, building, will send you a link soonish
[14:26:30] I'd recommend using the streaming updater, but I don't think it's been dockerized yet
[14:26:43] addshore: ^^ do you plan to?
[14:27:13] I'd say it's currently being considered at least
[14:27:27] hence my question easterday around different backends for flink other than kafka
[14:27:32] *yesterday
[14:27:52] right - my proposal for the cold walkthrough stands
[14:28:03] zpapierski: for the moment, we don't expose the kafka streams publicly, so that's a no-go for third party WDQS.
[14:28:38] In the case of third party wikibase installations, adding Flink + Kafka increases the complexity quite a lot! Not sure it is worth it.
[14:28:44] ah, I didn't know that recent changes streams aren't exposed
[14:29:03] I suppose it wouldn't be possible for me to set up my own Kafka based on a publicly available Wikimedia source? (This could include me SSH tunneling into Cloud Services)
[14:29:04] dockerized streaming updater? https://docker-registry.wikimedia.org/wikimedia/wikidata-query-flink-rdf-streaming-updater/tags/
[14:29:23] ottomata: it's only flink
[14:29:40] hare, there should be a kafka docker hub thing you could use to try it out
[14:29:42] we should rename this image
[14:29:42] ?
[14:29:44] recent change as a mediawiki API is exposed, but not the kafka stream. dcausse and ottomata have some ideas on how to expose all that.
[14:30:05] gehel: are you talking about exposing kafka or exposing rdf updates in eventstreams?
[14:30:10] ejoseph, zpapierski: want to jump into a call to talk about Maven? Or only ejoseph (not sure if zpapierski needs to be there or not).
[14:30:12] eventstreams http*
[14:30:14] Yeah, you need some more mw extensions, eventbus etc, then kafka, then flink, and then to run a different script, not runUpdate.sh
[14:30:25] hare: so the streaming updater for now is a no-go, but we are considering a public stream of changes
[14:30:29] meet.google.com/nqv-hsaw-pqa
[14:30:30] and also the service that sits between eventbus and kafka?
[14:30:57] hare: I'm currently trying to work on some diagrams covering all of this if you fancy a look!
[14:31:20] https://wikitech.wikimedia.org/wiki/Event_Platform#Platform_Architecture_Diagram
[14:31:21] ?
[14:31:22] :)
[14:31:23] ottomata: any and all
[14:31:40] I'm wondering, is there a way of using the streaming updater with minimal modification?
[14:32:01] if we do expose the resulting RDF stream through eventgate / eventstreams (not sure of the terminology) then it becomes super easy to consume it in 3rd party wdqs deployments (with minor changes to the streaming updater consumer)
[14:32:28] I think we're discussing 2 different things here :)
[14:32:30] public kafka idea: https://phabricator.wikimedia.org/T280628#7089008
[14:32:35] that is hard
[14:32:37] but
[14:32:49] it is very very easy to expose existing streams via the public eventstreams http API
[14:32:51] ottomata: thanks for the link, I'll be sure to include that picture in the "deployment" view for the Wikimedia case
[14:32:57] yes, we're discussing at least 2 different things (if not 3)
[14:33:09] that's the fun of IRC ;)
[14:33:12] soooo, whatchyall think about dune?
[14:33:14] ...hehehe jk
[14:33:17] :P
[14:33:23] ottomata: it was great! gutted the next one will take years
[14:33:32] two things: 1) exposing any wikibase stream of changes for a streaming updater producer to produce an rdf patch stream
[14:33:57] 2) exposing the wikidata rdf patch stream from an already deployed streaming updater
[14:34:45] 3) fixing hare's problem right now by fixing the current recent change based updater
[14:34:46] and getting me to talk about Dune is a very bad idea :)
[14:34:47] "1" is hard and not sure worthwhile for small installations
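To make ottomata's point concrete: anything already exposed through the public EventStreams HTTP API is plain Server-Sent Events, so an ordinary HTTP client can tail it. A sketch using the existing public recentchange stream; the RDF patch stream discussed above is not public, so for that case the URL is only illustrative of the mechanism:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StreamTail {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://stream.wikimedia.org/v2/stream/recentchange"))
            .header("Accept", "text/event-stream")
            .build();
        client.send(request, HttpResponse.BodyHandlers.ofLines())
            .body()
            .filter(line -> line.startsWith("data: "))     // skip SSE ids/keep-alives
            .map(line -> line.substring("data: ".length()))
            .forEach(System.out::println);                 // one JSON event per line
    }
}
```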
[14:35:43] zpapierski: we are in meet.google.com/nqv-hsaw-pqa with ejoseph. Do you want to join? Or should we just start?
[14:36:16] gehel: go ahead, I'll get some other stuff out of my way
[14:37:29] dcausse: I'm missing some pieces here - what is needed for someone to launch a custom streaming updater for a custom wikibase instance
[14:37:29] ?
[14:38:22] 2) requires exposing rdf patches and requires no tinkering with flink, but some changes in the streaming updater consumer
[14:39:10] zpapierski: hadoop + spark for the initial state and soon reconciliation, kafka, event-gate
[14:39:32] 2 is planned
[14:39:55] well "planned" is a bit pretentious, "considered" is better :)
[14:41:35] in theory, if you start from scratch, no initial state is needed
[14:42:10] event-gate?
[14:42:26] needed for mw to send events to kafka
[14:43:15] ah, thx - is it difficult to set up for 3rd party setups?
[14:44:04] no clue but I doubt many third-party installations use EventBus + kafka (but I could be wrong)
[14:44:52] I'd assume so as well
[14:45:04] honestly yeah, feels like a lot of steps
[14:45:45] otoh, if this RCPoller bug is any indication, we will probably get many more bugs related to the code we don't deploy :(
[14:47:36] we have the same problem for the MW mysql search based subsystem
[14:48:07] in general, software that you do not touch remains stable unless external APIs/deps change
[14:48:39] hopefully the recentchange api response is something relatively stable
[14:51:05] I wonder how a null namespace could come about, and/or how it would evade detection until now
[14:51:49] we might be able (and by we I mean addshore, obviously) to set the docker image up with those preconfigured so that the streaming updater would just work - but it does sound like a substantial amount of work
[14:52:35] \o
[14:52:38] if we want to be able to work off the recent changes API, the changes are minimal and will be good to go for most Wikidata based installations I guess
[14:52:40] o/
[14:52:50] Right now I imagine close to 0 third parties use kafka etc
[14:53:07] zpapierski: that sounds exciting
[14:53:08] zpapierski: so i really don't know, what am i supposed to do next for wcqs? I don't really see how to do anything else
[14:53:31] api -> flink -> something that isn't kafka -> wdqs
[14:53:49] ebernhardson: meet?
[14:54:34] zpapierski: i have another meeting at the top of the hour
[14:54:40] o/
[14:54:59] no worries, after that?
[14:55:06] zpapierski: ya, at :30 i can
[14:55:35] dcausse: can you join us as well? I imagine the streaming updater will be a strong focus here, and the stuff around MCR
[14:55:57] sure
[14:56:08] thx!
[14:56:19] ok, invite sent
[14:57:28] hare: https://people.wikimedia.org/~dcausse/wikidata-query-tools-0.3.92-SNAPSHOT-jar-with-dependencies.jar
[14:57:38] I added a log.warn for this suspicious entry
[14:57:51] I have no clue how it could happen either
[15:06:32] ejoseph: our parent pom: https://github.com/wikimedia/wikimedia-discovery-discovery-parent-pom
[15:06:54] and the related gerrit repo: https://gerrit.wikimedia.org/r/admin/repos/wikimedia/discovery/discovery-parent-pom
[15:18:53] dcausse: that appears to have fixed it, thank you!
[15:19:41] hare: glad to hear! if by chance you could check the logs, that would tell us what this recentchange entry was about
[15:24:58] I don't know what to look for.
[15:27:14] addshore: it would require some thoughts around fault tolerance - https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/guarantees/
[15:27:34] cool thing about Kafka is it plays very well with that
[15:27:50] hare: the log line should be "Skipping change without a namespace: ..." but where is the log file in your setup?... this, I don't know :(
[15:38:59] I'm in /var/log/wdqs and I can't find that string in any of the files
[15:39:17] are you running it in docker? is it sent to stdout?
[15:39:22] you might not find it in files
[15:39:57] I don't recall it logging anything of that nature to stdout and if it did it's no longer in my buffer
[15:43:42] how are you running it?
[15:45:54] I am running runUpdate.sh from a command line inside the Docker container
[16:06:54] hmm, okay, I would have thought that would have had some output to stdout, but i could be wrong
[16:27:46] ebernhardson: I did that:
[16:28:01] sudo -u analytics-search kerberos-run-command analytics-search sh -c 'export HADOOP_CLASSPATH=`hadoop classpath`; ./bin/flink run -p 12 -s swift://updater.thanos-swift/wdqs_streaming_updater/checkpoints/b85e02696673c5d09d41918872c98898/chk-19553 -c org.wikidata.query.rdf.updater.UpdaterJob ~dcausse/streaming-updater-producer-0.3.64-SNAPSHOT-jar-with-dependencies.jar ../updater-job-partial-reordering.properties'
[16:28:03] Kids are way too tired. Early dinner for me. I'll be back later.
[16:29:21] sudo -u analytics-search kerberos-run-command analytics-search sh -c 'export HADOOP_CLASSPATH=`hadoop classpath`; ./bin/flink run -p 12 -c org.wikidata.query.rdf.updater.UpdaterJob ~dcausse/streaming-updater-producer-0.3.64-SNAPSHOT-jar-with-dependencies.jar ../updater-job-partial-reordering.properties'
[17:27:26] ejoseph: sorry for not getting back in time for elasticsearch plugins, we needed to work out the streaming updater stuff, ping me in the morning tomorrow once you're available
[17:41:00] Alright thanks
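A note on the two `flink run` commands above: the only difference is the `-s <path>` flag, which tells Flink to restore the job from the given checkpoint instead of starting fresh. This ties back to the fault-tolerance point raised earlier: Flink's exactly-once guarantee rests on periodic checkpoints plus a replayable source, which is why Kafka "plays very well with that". A minimal sketch of the relevant setting follows; the real UpdaterJob configures far more than this:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Snapshot state every 30s; combined with a replayable source (e.g. Kafka),
        // this is what yields exactly-once processing semantics.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);
        // ... define sources, operators and sinks here, then env.execute(...) ...
    }
}
```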
[20:18:21] ryankemper: I've downtimed wcqs1002 and 2001, they are noisy in icinga. No investigation yet. They seem to have disappeared from prometheus, which probably means high overload or network issues.
[20:22:02] created: T294865
[20:22:02] T294865: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865
[20:23:30] hmm, i mentioned to d and z, but i started up the import process just to see what it does. Apparently it overloads machines :)
[20:23:37] (in wcqs)
[20:24:06] That might explain it. Did you start it on 1002 and 2001?
[20:24:16] all of them, with about an hour between machines
[20:24:31] they should all munge and import locally
[20:24:50] only 1002 and 2001 seem affected.
[20:24:51] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=wcqs&var-instance=All&var-datasource=thanos
[20:25:52] 2001 disappears ~18:00 UTC. None of the servers show a particularly high load
[20:26:41] hmm, that should be the only thing running. The others all report processing wikidump somewhere between 750 and 850
[20:27:00] wonder what else could be different...
[20:27:24] no emergency here, and it's late in this part of the world. Keep me updated if you find something
[20:27:32] sure
[20:27:33] * gehel reminds ebernhardson that today is a US holiday!
[20:27:51] is it?
[20:28:26] Election day, I've been told!
[20:30:13] Are all the query completion candidates historical English queries? or do we also have non-English queries? I forget, but do remember talking about how we were only focusing on English to start
[20:30:41] mpham: it's everything, but it requires some repetition of inputs from multiple users. it all depends on frequency of input
[20:33:30] oops, sorry, i meant if I'm a user being presented different options to complete what I started inputting, will all those options be in English? Or will I potentially also see, say, Spanish completions as well?
[20:33:55] mpham: i mean it doesn't even take language into account right now, if enough users repeat a query it will be returned
[20:34:38] mpham: but the inputs are exceptionally biased towards english, those are the users we mostly have
[20:35:33] oh i see. thanks. I didn't realize exactly how it worked before
[20:36:11] so before we were worried about non-English, because of that bias
[20:36:56] I suppose one issue is the bias: that spanish, for example, will be drowned out by english in the frequencies. But also we have to trim low-frequency queries, and with fewer users more of the non-english queries get trimmed there too
[20:38:37] commons makes the language problem particularly annoying; deploying the same kind of solution to, for example, es.wikipedia avoids the frequency bias.
[20:39:05] we had pondered trying to split it up, but it seemed for a first run we'd be better off avoiding all that complexity (thus preferencing english)
[20:40:33] do i understand it right that commons makes it annoying because we're getting users from multiple languages in the same interface? whereas on es.wikipedia it's more reliably just spanish speakers?
[20:40:45] right
[20:41:01] in which case, it actually seems less complicated to deploy/test it on the language-specific wikis rather than commons?
[20:43:29] maybe? We suspected that commons has a much bigger problem with autocomplete because the titles aren't nearly as useful
[20:43:54] I suppose i'm thinking it's easier to measure something working on commons, because the current state is so mediocre
[20:44:16] although maybe it's better with the work they've done in mediasearch, this was the initial evaluation
[20:58:04] gotcha. thanks for the context
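A toy version of the completion mechanism ebernhardson describes above: candidates are queries repeated by enough distinct users, ranked by popularity, with rare queries trimmed, and language is never consulted. The cutoff and the record type are illustrative, not the production values:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class CompletionCandidates {
    private static final int MIN_USERS = 3; // illustrative trimming threshold

    /** One logged (user, query) pair per search. */
    record Logged(String user, String query) {}

    /** Queries typed by at least MIN_USERS distinct users, most popular first. */
    static List<String> candidates(List<Logged> searchLog) {
        Map<String, Set<String>> usersByQuery = searchLog.stream()
            .collect(Collectors.groupingBy(Logged::query,
                Collectors.mapping(Logged::user, Collectors.toSet())));
        return usersByQuery.entrySet().stream()
            .filter(e -> e.getValue().size() >= MIN_USERS)
            .sorted(Comparator.comparingInt(
                (Map.Entry<String, Set<String>> e) -> e.getValue().size()).reversed())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```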