[10:55:58] lunch
[12:48:29] pfischer: would you have a couple minutes to review https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/9 ?
[13:30:00] o/
[14:43:24] \o
[14:48:53] o/
[15:26:52] inflatador: looks like wdqs1003 did not get the new release, it has blazegraph-service-0.3.122.war
[15:27:53] dcausse: interesting, have you checked the other hosts?
[15:28:34] most of them seem to have the new jar but did not check all the servers
[15:28:46] wdqs2009 looks like it has it
[15:30:57] wdqs1008 does not have it
[15:31:18] sounds like we'll have to do another scap deploy
[15:31:52] do you still have the logs from scap?
[15:32:41] Y let me paste them
[15:34:16] https://phabricator.wikimedia.org/P47280
[15:34:48] no errors I can see
[15:40:02] cumin finished its restart run
[15:40:05] there's only the finalize stage, but I think logs are stored somewhere, looking
[15:43:52] guessing something weird happened because we deployed during the switchover?
[15:45:14] yes, looking at scap logs it switched from deploy2002 to deploy1002 during the deploy, using different revs
[15:45:34] curious, I thought the first thing scap did was sync the deploy hosts
[15:45:46] maybe that's only in the mediawiki deploys
[15:47:36] unsure how it works, but the revision reported by scap was "Deploying Rev: HEAD = 0e051d8dc81928f47b8ab555f51619e740f9611b"
[15:48:01] and then while doing wdqs1003 it says: Registering scripts in directory '/srv/deployment/wdqs/wdqs-cache/revs/61ef43572b9b93f9d8e81416d2df15c95be09c3d/scap/scripts'
[15:48:38] FWIW, I'm not seeing any alerts and query.wikidata.org looks OK
[15:49:47] I think I'm going to try running another deploy, unless y'all think we should fix hosts manually
[15:50:22] wdqs1003.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1008.eqiad.wmnet, wdqs2007.codfw.wmnet are the hosts that pulled 61ef43572b9b93f9d8e81416d2df15c95be09c3d instead of 0e051d8dc81928f47b8ab555f51619e740f9611b
[15:50:57] this from /srv/deployment/wdqs/wdqs/scap/log/scap-sync-2023-03-14-0001-1-g0e051d8.log
[15:51:58] inflatador: I think you can sync only these 4 hosts, but feel free to redeploy the whole fleet
[15:57:39] dcausse: I'll check the scap docs and target only those if there's an option
[16:29:49] dcausse: reviewed https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/9 +2
[16:30:02] pfischer: thanks!
[16:30:19] Do we have a merge protocol for GitLab yet?
[16:30:45] Should the reviewer just approve (my preference) or merge, too?
[16:30:56] pfischer: I think we still need to discuss all this more formally :)
[16:31:07] Sure
[16:31:42] I think we said that we're trying to mimic the gerrit workflow, but I'm not sure if we really agreed to do this
[16:32:32] and I personally self-merged a few patches in gitlab already (against gerrit guidelines...)
[16:35:27] I'm at the Kubernetes mtg but will redeploy by EoD
[16:36:29] thanks!
[16:37:59] wondering what's best: I have a job that just transforms a row into another and then writes to a hive partition; this hive partition should have the same fields as the source
[16:38:57] should the job accept just the table name as output and transparently use the source fields as partition fields
[16:39:40] or accept a full partition spec but then drop these fields from the source
[16:53:59] workout/lunch, back in ~1h
[17:56:05] back
[18:17:06] dinner
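A minimal PySpark sketch of the two output-API options raised at [16:37:59]-[16:39:40] above, assuming the job is a Spark job writing to Hive; all table, column, and transform names here are hypothetical placeholders, not the actual job's interface.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("row-transform-to-hive")
    # let Hive pick partition values from the data rather than a static spec
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)

source = spark.table("discovery.source_events")                   # hypothetical source table
transformed = source.withColumn("payload", F.upper("payload"))    # placeholder row transform

# Option 1: the job takes only an output table name; the partition fields are
# ordinary columns of the source and flow through unchanged, so the output
# partition layout transparently mirrors the source.
transformed.write.mode("overwrite").insertInto("discovery.transformed_events")  # hypothetical output

# Option 2: the job takes an explicit partition spec, drops those fields from
# the transformed rows, and re-adds them as literals before writing.
# (insertInto is positional, so column order must match the output table.)
partition_spec = {"year": 2023, "month": 3, "day": 14}            # hypothetical spec
static = transformed.drop(*partition_spec)
for name, value in partition_spec.items():
    static = static.withColumn(name, F.lit(value))
static.write.mode("overwrite").insertInto("discovery.transformed_events")
```

Option 1 keeps the job's interface to a single table name at the cost of trusting the source schema; option 2 is more explicit but duplicates schema knowledge in the caller.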
[18:27:11] doh. While reviewing to understand T327199 I found that we basically didn't build a new enwiki completion index between dec 9 and jan 20
[18:27:12] T327199: on-wiki search is failing to find relatively newer titles on enwiki - https://phabricator.wikimedia.org/T327199
[18:27:24] and nothing noticed :(
[18:33:34] wonder if anyone has come up with something better than foreachwiki yet...
[18:55:24] hmm, so it's reasonable to use aggregations to get the titlesuggest indices with the lowest max(batch_id) value, giving a reasonable guess for index age. I suppose we would need to record the oldest value to prometheus and alert on it? Seems a little silly to poll that value into prometheus every minute though
[18:57:35] also turns out labtestwiki hasn't built a titlesuggest index since jan 26
[19:22:37] huh, not every minute. It looks like we record those stats 4 times per minute
[19:26:36] might only update them once a minute, but querying returns data points 4 times per minute.
[19:33:13] OK, wdqs deploy is finished. Running transfer.py (as opposed to data-transfer.py) to see if it works any better
[19:34:50] inflatador: I just realized I have a doctor's appointment with Oscar tomorrow. I should be back on time for our 1:1, but with the hospital, I know when I have to be there, but never when I get out.
[19:37:54] gehel: np if you need to cancel/change time. I will be out most of the day Thurs though
[19:38:16] they are closing down my former employer's office and I have to go get my arcade cabinet ;(
[19:41:55] hmm, still getting gig speeds from the xfer based on https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-30m&to=now&viewPanel=19
[19:42:44] I'm going to kill it and try from wdqs2006, as it's in the same row as 2022
[19:48:29] scratch that, wdqs2006 is 1 Gig...wdqs2012 it is!
[19:54:55] quick break, back in ~20
[20:07:43] still only getting gig speeds...wondering if there's another bottleneck somewhere
[20:11:19] I'm gonna let it go with the assumption that it'll be done by tomorrow, if not we might have to check some other stuff
[20:47:19] with only a brief look, it seems plausible that's all pigz can do. It seems to be keeping 2012 pretty busy. Maybe could reduce the compression level in pigz, but hard to say
[21:04:50] yeah, seems quite plausible that the bottleneck is on the compression and not the network throughput
[21:21:39] it's almost done. I guess gig speeds are actually faster than my assumptions
[21:23:27] looks good, except the script made an extra wdqs subdir...not sure why since it's supposed to mimic rsync syntax
[21:24:38] BG still not starting on 2022...going to run puppet and see if that changes anything
[21:30:53] https://phabricator.wikimedia.org/P47282
[21:33:58] heading out soon, but will take a look with d-causse tomorrow.
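A rough sketch of the monitoring idea from [18:55:24] above: aggregate the per-index max(batch_id) across the titlesuggest indices, take the lowest value as a proxy for the oldest index, and export it as a Prometheus gauge that an alert could fire on. The cluster URL, index pattern, field name, metric name, port, and polling interval are all assumptions, not the production setup.

```python
import time

from elasticsearch import Elasticsearch
from prometheus_client import Gauge, start_http_server

es = Elasticsearch("http://localhost:9200")        # hypothetical cluster endpoint

OLDEST_BATCH = Gauge(
    "titlesuggest_oldest_max_batch_id",
    "Lowest per-index max(batch_id) across *_titlesuggest indices",
)


def oldest_max_batch_id() -> float:
    """Return the smallest per-index max(batch_id) across titlesuggest indices."""
    resp = es.search(
        index="*_titlesuggest",                    # hypothetical index pattern
        size=0,
        aggs={
            "per_index": {
                # one bucket per index, each holding that index's newest batch_id
                "terms": {"field": "_index", "size": 10000},
                "aggs": {"newest_batch": {"max": {"field": "batch_id"}}},
            }
        },
    )
    buckets = resp["aggregations"]["per_index"]["buckets"]
    return min(b["newest_batch"]["value"] for b in buckets)


if __name__ == "__main__":
    start_http_server(8000)                        # hypothetical scrape port
    while True:
        OLDEST_BATCH.set(oldest_max_batch_id())
        time.sleep(300)  # no need to poll every minute, per the discussion above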