[10:52:08] lunch
[11:47:30] hi all i would like to start converting search hosts to puppet 7. are you able to let me know some canary hosts for the roles listed on https://phabricator.wikimedia.org/T349619 under search platform
[11:47:35] also is there anything i should avoid
[13:49:59] jbond: I added a comment on the task
[13:50:10] inflatador, ryankemper: ^
[13:57:42] great thanks gehel that looks great
[14:20:29] dcausse, pfischer: I just saw the email from Andrew about enabling canary events for all streams. Do we have an impact for WDQS streaming updater or SUP?
[14:22:21] gehel: already told Andrew that the wdqs updater is not affected (all events are filtered by meta.domain)
[14:23:14] the SUP is already consuming streams that have canary events
[14:36:28] fyi: Antoine is trying to enable Java 11 CI builds for some of our projects: https://gerrit.wikimedia.org/r/c/integration/config/+/973083
[14:46:12] errand
[15:01:26] gehel: SUP is fine, it has filters in place for both parts
[15:03:27] ebernhardson: regarding the SUP logging issue: flink only supports slf4j 1.x, but the structured logging API was introduced in slf4j 2.x. However, flink supports (and comes with) log4j 2.x, which already provides an API for structured logging. So I'd adapt our code to use log4j directly.
[15:16:43] Octopus intelligence came up in the unmeeting yesterday, and then, by coincidence, a podcast released last night on that very topic! I haven't listened yet, but you can play it online here: https://www.bbc.co.uk/sounds/play/m001scq6 (I generally recommend The Infinite Monkey Cage if you like science and British humour; it's a bit snappier at 1.5x, which the online player will do for you, BTW!)
[15:28:56] hm.. over 7 days of events in *.mediawiki.cirrussearch.page_rerender.v1 I don't see a single testwiki event, only canaries...
[15:55:31] will be slightly late to retro
[16:03:19] retrospective in https://meet.google.com/eki-rafx-cxi (cc: ebernhardson, dr0ptp4kt)
[16:03:48] thx
[16:55:14] workout, back in ~40
[17:26:19] !log elasticsearch::relforge migrated to puppet7
[17:26:20] jbond: Not expecting to hear !log here
[17:43:03] jbond interesting! Anything we need to know/do WRT the Puppet 7 migration?
[17:43:45] inflatador: hopefully it should be a non-event for you
[17:44:34] everything should work in the same manner for now. there are some technical details behind the scenes, but we have already migrated about 15% of hosts so have some good experience at this point, and things seem good
[17:44:52] and if you see an issue then ping me
[17:45:44] jbond ACK, thanks for your help w/this
[17:46:03] will miss you ;)
[17:46:03] no probs :)
[17:59:23] dinner
[18:03:57] ebernhardson: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/59 - that's the logging fix (fingers crossed, haven't deployed/tested it in k8s)
[18:04:00] dinner
[18:04:16] pfischer: thanks! I'll test that out today
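For context on the fix above: Flink ships with log4j 2.x, whose MapMessage API gives structured key/value logging without needing slf4j 2.x. A minimal sketch of what logging through the log4j API directly could look like — the class and field names here are hypothetical, not taken from the actual MR:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.message.StringMapMessage;

// Hypothetical example; the real change lives in cirrus-streaming-updater MR 59 above.
public class UpdateEventLogger {
    private static final Logger LOG = LogManager.getLogger(UpdateEventLogger.class);

    public void logFetchFailure(String wiki, long pageId, String reason) {
        // StringMapMessage carries key/value pairs, so a JSON layout can emit
        // them as separate structured fields instead of one flat message string.
        LOG.warn(new StringMapMessage()
                .with("message", "failed to fetch page")
                .with("wiki", wiki)
                .with("page_id", Long.toString(pageId))
                .with("reason", reason));
    }
}
```

Since log4j 2.x is already on Flink's classpath, this avoids pulling in a second logging backend just to get slf4j 2.x's fluent API.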
[18:27:00] dcausse: i've had the spark3-submit job running for the split code this morning. i foolishly hadn't removed the .show()s from the compiled WIP patch, so i'm just going to cross my fingers and hope it finishes in the not too distant future (premature .show()s of 10K rows are not handled well by the planner). if it doesn't, i'll remove those and re-run, and let you know once the partitions are ready for your querying pleasure.
[18:28:00] will of course remove the .show()s for the refactor. talked with Joseph yesterday and the spark params are fine. will go over some more with him tomorrow on whether there are spark planner nudges we can do in the code, too.
[18:28:22] uh, meant to mention you ^ dcausse
[18:28:55] lunch, back in ~40
[18:39:15] dr0ptp4kt: in spark world, .show() usually recomputes the full thing from disk up to that point. Sometimes spark can reuse shuffle outputs, but it's not as common as it should be
[19:19:13] * ebernhardson realizes he has no clue how to find the UI that actually uses our image suggestions search
[19:29:30] leaving for my appointment a little early today, should be back in ~90 or so
[19:29:39] ryankemper, inflatador: I'll skip pairing for today
[19:29:57] gehel: inflatador: ack
[19:48:14] updater is running again, we'll see for how long :)
[19:48:47] i made an edit on testwiki and it made it through to relforge, so at least some parts are working
[19:56:15] ended up needing to reset the application again, even after manually moving the consumer id forward. Best guess is that flink had stored the bad events in its state. I don't know if we will need to think of a way to deal with that other than resetting the state in the future
[21:22:02] back
[21:39:24] thx ebernhardson yeah i had to shake my head at myself on that one.
[21:39:52] dcausse: looks like the data loaded in.
[21:40:11] select count(1) from wikibase_rdf_scholarly_split where dt = '20231030' and wiki = 'wikidata' and scope = 'scholarly_articles';
[21:40:12] 7650853205
[21:40:35] select count(1) from wikibase_rdf_scholarly_split where dt = '20231030' and wiki = 'wikidata' and scope = 'wikidata_main';
[21:40:48] 7701405178
[21:40:58] nice!
[21:41:29] (dr0ptp4kt namespaced)
[21:41:34] commons rdf-streaming-updater is stable, still having issues with wikidata
[21:41:59] I doubt it's the issue, but going to bump up quotas in staging just to rule out memory issues
[21:46:34] errors are the same as yesterday https://logstash.wikimedia.org/goto/aa12e30af7dbc9c38b9ac8f9fa5dbf54
[21:48:09] Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:6500 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused (Connection refused)
[21:49:02] port 6500 should be mwapi-async
[21:50:31] let me check the routes
[21:51:45] is having 'mw-api-int-async-ro' in the list of listeners enough, or do we need to add them in the http_routes?
[21:52:38] inflatador: having that in the listeners should open the port up. Hmm, should be able to verify by looking at the proxy configuration, but i'm not sure which bit yet
[21:53:11] I did have a typo in the routes yesterday that d-causse caught
[21:53:16] and indeed mw-api-int-async-ro is also 6500, I hadn't noticed we re-use ports for three different versions of mw-api, but it makes sense
[22:02:24] dr0ptp4kt: woo! congrats!
[22:02:54] inflatador: I think it's missing mesh.enabled: true perhaps?
[22:04:22] dcausse that sounds right. Could that explain why we couldn't reach the kubemaster?
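For context, a rough sketch of the kind of helmfile values being discussed — illustrative only; the exact key layout depends on the chart in operations/deployment-charts:

```yaml
# Hypothetical excerpt of the service's helmfile values.
# mesh.enabled turns on the Envoy sidecar that actually exposes the
# service-proxy ports (e.g. 6500 for mw-api-int-async-ro) on localhost,
# which is why listing the listener alone isn't enough.
mesh:
  enabled: true

discovery:
  listeners:
    - mw-api-int-async-ro   # local port 6500, per the connection errors above
```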
[22:08:52] OK, applying w/mesh enabled and 3GB for pods
[22:11:47] well, now wikidata is stable and commons won't start... the opposite of yesterday
[22:11:49] ┐('~`)┌
[22:12:43] :)
[22:12:55] forbidden, exceeded quota looks like
[22:14:56] inflatador: those can be false flags, i think what happens is flink keeps trying to replace the taskmanager with a new one and the old one hasn't fully died yet
[22:15:19] but indeed i'm not seeing much else :S
[22:15:46] can't hurt to put the quota up to something well beyond, just to rule it out
[22:17:49] working on that now
[22:21:45] I suppose i'm not fully understanding which things are being summed up either, the error is: requested: limits.memory=3572Mi, used: limits.memory=7772Mi, limited: limits.memory=10Gi
[22:21:49] what is added up to get "used"?
[22:23:31] oh, i bet that's the total namespace limit? Which would basically mean there is enough quota to start wikidata or commons, but not both
[22:24:55] i gotta do a school run, back in ~40min
[22:30:14] Y, that is my interpretation but I haven't thought about it much. Coming from public cloud, I tend to be pretty free with quotas ;)
[22:48:09] I don't think I'll have time to address this today, but here's the quota increase MR. Janis +1'd the previous patch so we should be good: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/973242
[22:49:44] anyway, I'm off... have a great weekend all
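For reference on the quota confusion above: a Kubernetes ResourceQuota caps the sum of limits.memory across all pods in the namespace, so "used" is everything already scheduled, and both updaters have to fit under the same ceiling. A minimal sketch, with values mirroring the error message rather than the real staging config:

```yaml
# Hypothetical ResourceQuota matching the error above.
# "used" = sum of limits.memory over all pods already in the namespace;
# a new pod is rejected when used + requested exceeds the hard limit
# (7772Mi + 3572Mi = 11344Mi > 10Gi = 10240Mi -> forbidden).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: rdf-streaming-updater   # hypothetical namespace name
spec:
  hard:
    limits.memory: 10Gi
```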