[07:58:45] hmm, I'm not sure I understand how coverage is reported for cirrussearch - is the "@covers" annotation important?
[07:59:20] zpapierski: covers is important for another coverage report, not sonar I think
[08:00:06] no, I don't mean sonar (I think)
[08:00:14] https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-patch-docker/46404/console
[08:00:27] specifically:
[08:00:30] https://www.irccloud.com/pastebin/EuROfxLI/
[08:00:31] yes that's the one that needs the @covers tag
[08:01:00] ah, that explains things - the class structure changed during refactoring and I didn't update those tags, thx
[08:11:42] finally flink *seems* up and running, if only I knew how to access a NodePort from the outside I could test the UI or the REST api...
[08:33:36] FYI https://phabricator.wikimedia.org/T287563 Is something around new entities on wikidata currently going slower in terms of indexing?
[08:44:39] addshore: do you know when this problem started (roughly)?
[08:44:46] nope :/
[08:44:48] looking at the jobqueue
[08:47:36] hm.. we should track the latency of the updates instead of guesstimating from the consumer lag
[08:52:27] ah it's tracked in https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=now-30d&to=now
[08:53:27] so yeah I see timings reaching 15min to 30min but nothing particularly new :/
[08:54:53] since now all the jobs (even unprioritized ones) are going to the ElasticWrite topic it might cause contention there
[08:55:19] oh, looking at the timings before the DC switch it was a lot better
[08:56:05] p99 was < 1min
[08:56:43] now it's more > 2min
[09:11:00] ooooooo
[09:13:20] adding comments to the ticket
[09:13:26] ty!
[09:13:29] still can't figure out why
[09:13:39] QueryCompletionSuggesterTest doesn't like being a pure test, it fails with "Error: Call to a member function loadFullData() on null" on jenkins :(
[09:13:46] that's okay, at least we know this person isn't crazy :)
[09:13:54] oh... lydia wrote it xD
[09:15:53] relocating
[09:21:54] @addshore thanks for confirming my not-craziness :P
[09:22:02] Though tbf I just wrote the report for Nikki
[09:22:05] =]
[09:23:36] :)
[09:24:51] :D
[09:26:05] zpapierski: from the jenkins log I see that SearchContext is being used, this one must be constructed with all its constructor dependencies provided
[09:26:32] there should be helper methods in the parent CirrusTestCase to build those
[09:27:19] thx, will try that
[09:55:02] lunch
[10:21:46] lunch
[10:56:51] dcausse: I don't see any mentions of SearchContext in CirrusTestCaseTrait, do you know of some example usage?
[11:37:28] ah, ok, you meant mocking dependencies, not the SearchContext
[12:08:56] yes
[12:43:31] do I have to mock them all?
[12:52:01] mock the dependencies?
[12:52:09] zpapierski: ^
[12:52:18] I copied from a SearchContextTest (mostly nulls, not counting the last position, CirrusSearchHookRunner)
[12:52:25] now I get
[12:52:26] 13:32:47 Premature access to service container [Called from CirrusSearch\SearchConfig::getProfileService in /workspace/src/extensions/CirrusSearch/includes/SearchConfig.php at line 302]
[12:52:52] this is in jenkins?
[12:52:56] yep
[12:52:58] looking
[12:53:11] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/97719/console
[12:53:26] maybe it's complaining about this?
[12:53:29] https://www.irccloud.com/pastebin/YxKx1Kq5/
[12:56:26] zpapierski: something like: $context = new SearchContext( $this->newHashSearchConfig( [] ), [], null, null, $this->createCirrusSearchHookRunner() );
[12:56:41] ok, will try
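
A rough sketch of what that suggestion might look like inside a pure unit test. The helper calls (newHashSearchConfig, createCirrusSearchHookRunner) are taken from dcausse's line above; the class name here is hypothetical, the namespaces are assumed, and the exact SearchContext constructor signature may differ (see the FetchPhaseConfigBuilder question below):

    <?php
    namespace CirrusSearch;

    use CirrusSearch\Search\SearchContext;

    // Hypothetical test class; CirrusTestCase is the parent mentioned above.
    class SearchContextSketchTest extends CirrusTestCase {
        public function testBuildContextWithoutServiceContainer() {
            // Build every constructor dependency explicitly so no code path
            // falls back to MediaWikiServices::getInstance(), which is what
            // triggers "Premature access to service container" in pure tests.
            $context = new SearchContext(
                $this->newHashSearchConfig( [] ),
                [],
                null,
                null,
                $this->createCirrusSearchHookRunner()
            );
            $this->assertNotNull( $context );
        }
    }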
[12:58:36] unit tests are runnable locally
[12:58:57] outside vagrant
[12:59:13] I tried on Vagrant, they run green
[12:59:33] I'll configure it so that I can run them in the IDE
[13:00:15] that invocation is missing a FetchPhaseConfigBuilder parameter, can it be null?
[13:01:25] it should default to the noopRunner (c.f. SearchContext constructor)
[13:01:59] which does not seem to depend on MW services
[13:04:08] intellij complains about a missing php home, but it's super not clear what I need to set...
[13:04:28] ah, I think I know
[13:06:12] weirdly I only had php 8 on my dev laptop
[13:06:36] dcausse: did you figure out how to contact your flink in k8s to test?
[13:06:40] i might be able to help
[13:06:56] ottomata: just got flink up and running
[13:07:07] now I'm trying to access the NodePort
[13:07:25] trying https://staging.svc.eqiad.wmnet:4007/ without luch
[13:07:30] *luck
[13:08:00] dcausse@deploy1002:~$ kubectl get svc
[13:08:02] NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
[13:08:04] flink-session-cluster-main-tls-service NodePort 10.64.76.161 4007:4007/TCP 12d
[13:08:34] i sometimes try a pod itself
[13:08:46] trying with port 6005 or 6003 (eventgate or mathoid) no luck as well
[13:08:54] sometimes it's easier, especially with tls and sans, sometimes you need to set some fancy --resolve things with curl
[13:08:59] ottomata: how do you do that?
[13:09:00] cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
[13:09:05] kube_env rdf-streaming-updater staging; kubectl get pods -o wide
[13:09:09] shows IPs
[13:09:21] you should be able to access the pod IPs directly on their ports
[13:09:23] not the k8s ports
[13:09:28] or...hm maybe that's a chart setting
[13:09:31] with eventgate i can do that
[13:09:40] let's see
[13:09:42] port 8081?
[13:09:43] oh did not know that those IPs were visible
[13:09:44] main_app.port?
[13:09:48] yes
[13:10:06] hmmm maybe not
[13:10:34] sorry, 4007 - 8081 is the container port
[13:10:42] seems to work
[13:10:46] curl -k https://10.64.75.198:4007
[13:11:17] nice
[13:11:44] in prod it is easier to go through the k8s service
[13:11:50] ah ok
[13:12:01] since the certs should be set up for the discovery or .svc. urls (if you have those)
[13:12:19] but this method works in prod too
[13:12:38] how do you set these .svc urls?
[13:12:52] dcausse: I made the tests work from the IDE, but they don't see the Elastica dependencies, am I missing some configuration for the project?
[13:13:34] relocating
[13:13:58] zpapierski: add "extensions/*/composer.json" to your composer.local.json in the mediawiki folder (inside the merge-plugin include)
[13:14:42] * dcausse can't find the doc that explains how to set up all that
[13:23:16] zpapierski: https://phabricator.wikimedia.org/phame/post/view/169/changes_and_improvements_to_phpunit_testing_in_mediawiki/ & https://www.youtube.com/watch?v=HOWKHUA-wAI
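
The composer.local.json change dcausse describes at 13:13 would look roughly like this - it mirrors MediaWiki's composer.local.json-sample, so merge it into whatever merge-plugin include block already exists in that file:

    {
        "extra": {
            "merge-plugin": {
                "include": [
                    "extensions/*/composer.json"
                ]
            }
        }
    }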
[13:48:17] oh dcausse hm
[13:49:21] https://wikitech.wikimedia.org/wiki/LVS#DNS_changes_(svc_zone_only)
[13:49:34] but service-ops should help you with this
[13:50:18] zpapierski: dcausse btw........i am really hoping and looking forward to a tech talk on this flink updater. yall gonna do one one day? :)
[13:50:54] hopefully yes, we're prepping a blogpost already
[13:57:14] Yeah, once we're ready we'll probably get to it
[13:58:14] Blog post will go out probably the same week (at least the first of three)
[14:02:38] \o
[14:03:16] o/
[14:03:38] o/
[14:03:41] dcausse: thx for the materials on testing
[14:04:12] I see I'm doing a repeat of the history in high speed :)
[14:23:00] yay, tests run
[14:23:07] and fail, but that was the point
[14:23:11] thanks again, dcausse
[14:23:24] yw!
[14:23:41] unfortunately, it's still "Premature access"
[14:24:03] apparently, when creating this SearchContext
[14:24:44] but I think I know why
[14:24:44] zpapierski: in short you should never hit a line where MediaWikiServices is called - build & pass this dependency
[14:25:49] I shouldn't, but this happens -
[14:25:49] $this->cirrusSearchHookRunner = $cirrusSearchHookRunner ?? new CirrusSearchHookRunner(
[14:25:50] MediaWikiServices::getInstance()->getHookContainer() );
[14:26:07] I should've provided that field, though...
[14:28:38] ah, I'm missing a debugger
[14:38:33] \o
[14:38:56] increased latency on wdqs updates after switchover sounds just like the RTT problems cirrus had trying to set up TLS with multiple round trips on every connection
[14:39:29] is it behind envoy?
[14:40:06] ebernhardson: where did you see that wdqs has had issues after the switch?
[14:40:31] dcausse: a few hours ago from addshore here? addshore | FYI https://phabricator.wikimedia.org/T287563 Is something around new entities on wikidata currently going slower in terms of indexing?
[14:40:49] ah that is cirrus search indexing
[14:40:50] * ebernhardson could also be entirely mistaken :P
[14:40:57] oh, odd
[14:41:19] cirrusElasticaWrite according to the job queue dashboard is behind in terms of latency
[14:41:52] I'm looking at "backlog time"
[14:42:01] haven't dug deeply into what that means
[14:42:08] i never know what to do there... i feel like the observability of what's happening in the job queue is missing
[14:42:10] but the diff is obvious
[14:43:49] meh, can't find the dashboard I was looking into this morning...
[14:44:22] possibly this one? https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s
[14:44:33] it doesn't really tell me anything though. It's clearly designed for someone who doesn't care about individual topics :P
[14:44:34] https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=now-90d&to=now vs https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&from=now-90d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite
[14:45:11] not obvious but there are 2 urls in this ^ :)
[14:45:37] it's out of capacity
[14:45:58] the job processing line is almost straight, no variation. My best guess at least would be it needs more runners, again. But no way to explain why
[14:47:27] it's processing a lot more
[14:47:41] jul 1-2 something changed
[14:48:27] ~280 is more than what eqiad was doing
[14:48:41] more retries?
[14:48:42] hmm, indeed those numbers are higher
[14:48:57] not sure if we track retry counts, i guess could add something easy enough
[14:49:04] (but no history to compare :(
[14:49:08] no :/
[14:49:28] logstash might tell, but elastic won't be happy to extract historical data :/
[14:50:01] hmm, yea that might work. I dunno if it's appropriate but we can scroll elastic directly from a python script or something i guess
[14:50:34] or we can aggregate and pretend the search query is accurate enough :) might find some good filters
[14:50:56] yes, hoping we set a specific type, can't remember
[14:51:01] or channel, dunno
[14:51:12] i guess i'll poke that later today, see if anything comes up
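
A quick sketch of the "scroll elastic from a python script" idea above, assuming the elasticsearch-py client pointed at the logstash cluster. The host, index pattern, and field names here are guesses and would need to match whatever the job runner actually logs:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    # Hypothetical logstash endpoint and index pattern.
    es = Elasticsearch(["https://logstash-host.example:9200"])

    query = {
        "query": {
            "bool": {
                "filter": [
                    # Field names are assumptions; adjust to the real mapping.
                    {"match_phrase": {"message": "cirrusSearchElasticaWrite"}},
                    {"range": {"@timestamp": {"gte": "now-30d"}}},
                ]
            }
        }
    }

    # Scroll through matching log entries and count ones that mention a retry.
    retries = 0
    for hit in scan(es, index="logstash-*", query=query):
        if "retry" in hit["_source"].get("message", "").lower():
            retries += 1
    print(retries)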
[14:54:51] eating a bit of food between meetings, might be a few mins late to the weds meeting
[21:15:53] gehel: (for tomorrow) I talked with jclark and we reviewed the availability of 10G switches in the various rows. Summary here: https://phabricator.wikimedia.org/T281989#7244671 TL;DR: our best bet is to wait for the decom ticket (linked in my comment), at which point we can get everything on 10G and only take a minor hit to our perfect row distribution (we'd have one extra host in row C and one less host in row B)