[09:17:32] pfischer: are you ready to do our ITC during our 1:1 in 45' ?
[09:49:17] ryankemper (for when around): it looks like the prometheus patch for WDQS SLO was reverted: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883223/ I'll let you check with Observability what went wrong and how to test it for next time.
[10:00:55] gehel: yes
[10:23:44] yes I reverted the patch due to puppet failures, not to muddy the waters with a swift upgrade ongoing today, I don't have the time currently though to look into the failure further
[10:24:06] but I've filed T327876 to have puppet CI validate those files, so at least we get early warnings
[10:24:06] T327876: Validate Prometheus/Thanos rules in puppet CI - https://phabricator.wikimedia.org/T327876
[10:47:11] godog: thanks! And sorry for the failure
[10:48:19] gehel: sure no worries! it happens
[11:32:45] lunch
[12:28:29] lunch + errands
[13:31:32] dcausse: Shall I rebase my spark upgrade on https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/879822 or can we merge this?
[13:33:15] pfischer: I'm testing these patches in yarn at the moment but why would you need those for the spark upgrade?
[13:36:26] rdf-spark-tools and streaming-updater-producer should be two independent projects
[13:44:15] ah I guess that's because of scala, I'd suggest overriding scala.compat.version and other related properties in rdf-spark-tools rather than the parent pom while we settle all this
[13:44:47] deploying flink 1.16 might take some time
[13:58:29] o/
[14:44:33] inflatador: last minute question: would you be ready to do our ITC today? (in 15')
[14:44:52] gehel yes
[14:45:01] inflatador: great!
[14:45:26] I might be a few minutes late. Quarter in review and I'll definitely need a few minute break after that one
[14:52:22] dcausse: yes, it’s a somewhat butterfly-effect-ish cascade of version bumping doom
[14:53:05] this "rdf" project encompasses way too many projects... :(
[14:55:10] inflatador: This meeting is going to run a bit late and I want to give proper focus to our ITC. Could we move it to tomorrow, after the SRE pairing session?
[14:56:40] gehel OK
[14:56:46] thanks!
[15:02:58] sonarcloud.io shows an error, saying that the last analysis has failed (that also results in a failed validation inside gerrit). However, the analysis results are visible (and up to date): https://sonarcloud.io/summary/new_code?id=org.wikimedia.discovery.cirrus.updater%3Acirrus-streaming-updater-parent&branch=879109-5
[15:04:15] pfischer: there might be more info in the jenkins logs (unless that's part of the async analysis on the sonar side)
[15:05:07] Oh, there is something on the sonar side:
[15:05:08] o Date of analysis cannot be older than the date of the last known analysis on this project. Value: "2023-01-25T14:56:50+0000". Latest analysis: "2023-01-25T14:56:51+0000". It's only possible to rebuild the past in a chronological order.
[15:06:24] I suspect there is a race condition somewhere in the dance we do to get around the Zuul / Sonar integration issues. Probably transient
[15:23:13] gehel: alright, thanks!
[15:24:29] dcausse: Can I assist you in any way with the flink tests or is there another way to untangle dependencies?
[15:26:54] pfischer: getting flink 1.16 deployed will require some operational work with the help of Brian so can possibly take some time
[15:27:47] for the dependencies I could update flink's scala version without touching rdf-spark-tools so I guess this is possible?
[15:28:09] might require splitting your patch
[15:30:40] like overriding scala.version & scala.compat.version in the rdf-spark-tools sub module pom?
[15:36:47] pfischer: does this ^ sound feasible to you or do you prefer to keep your patch as-is?
[15:36:55] If I can help LMK
[15:37:39] inflatador: I'll need your help for T304914
[15:37:40] T304914: Remove the presto client for swift from the flink image - https://phabricator.wikimedia.org/T304914
[15:38:17] this is a preliminary step required before upgrading to flink 1.16
[15:41:02] dcausse Oh yeah, sorry for not jumping on this earlier. let me see if I can try this on the staging cluster as you suggested
[15:42:13] inflatador: we did codfw last time together, we need to do it in eqiad, can't remember the status of staging tho
[15:43:25] inflatador: If you have time we should probably schedule some time together for this
[15:45:09] dcausse agreed, are you OK with tomorrow? Happy to do next wk if not
[15:45:25] sure tomorrow sounds good!
[15:48:59] Cool, I sent the invite. Friday is OK too if that's better
[15:49:30] thanks, tomorrow is perfect! :)
[15:58:26] inflatador, ryankemper: your input will be required on T327925 for the switch upgrade on Feb 7 (so fairly time sensitive)
[15:58:27] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[16:02:03] :eyes
[16:02:57] dcausse: I’ll give it a try and split the patch.
[16:13:57] dcausse: Would it be fine to pin the flink/scala versions in the streaming-[…] projects for the time both patches coexist?
[16:16:32] pfischer: you mean instead of pinning a newer scala version in rdf-spark-tools? I think that'd work too
[17:10:21] workout, back in ~40
[17:46:41] back
[17:47:03] have to run an errand so no unmtg for me today
[18:08:34] lunch/errand, back in ~1h
[18:36:08] dinner
[18:50:33] Random unmeeting topic update: check the news! The earth's solid inner core may have stopped spinning and may change direction! We were wondering what effect this might have on the earth's protective magnetic field, but not to worry, the magnetic field is generated by the motion of the liquid *outer* core.. so we probably aren't going to have a pole reversal and we aren't going to be solar flare victims just yet.
[18:59:22] I took a look at https://phabricator.wikimedia.org/T327925. Added corresponding notes to the standup notes for tomorrow, but tldr is the 13 elastic hosts we'll likely want to ban the thursday of the prior week so we're not scrambling to get them all banned on monday before the tuesday morning (from us time perspective) upgrade
[18:59:40] By contrast the 3 wdqs hosts we can just depool on monday (one day prior)
[19:05:16] ryankemper Thanks for taking a look at that
[19:05:18] also, back
[19:13:44] ryankemper: banning 13 nodes is unlikely to entirely work. We don't have enough nodes to relocate all shards, so some shards will stay on those nodes.
[19:14:14] I'm tempted to say that we should do nothing, except maybe disable the alerts on unallocated shards.
[19:14:38] We had a good experiment with the switch failure, everything just worked.
[19:16:29] Actually, depooling them, so that they don't get any direct traffic, might be a good idea.
[19:17:06] But pybal should detect that those hosts are down pretty fast. Not sure what our policy on that is.
[19:17:31] it would be less noisy if we depooled, plus we could get into one of those "won't depool more than x%" situations
[19:17:40] gehel: depooling but not banning does seem sensible. I was thinking the rows E/F could take up the slack but glancing at netbox we only have rows E/F for eqiad not codfw
[19:18:10] inflatador: the depool % will actually be the same either way, from pybal's perspective it doesn't care if we depooled them manually or if pybal did due to the backends being down
[19:19:09] really? I thought depooling removed them from the pool as opposed to being in the pool with a failed healthcheck
[19:19:46] If we do ban them, we'll have 37/50 hosts online, so it's not out of the realm of possibility that elastic could find a way to shuffle them sufficiently. But I def wouldn't be surprised if it failed to be able to shuffle them all
[19:20:16] inflatador: if pybal detects a backend is failing health checks it will depool them
[19:20:41] so effectively it's the same as us depooling them directly
[19:20:58] They need to be marked as inactive to be considered out of the pool as far as the math is concerned
[19:22:03] ah, so 'depool' sets enabled=False in pybal, same as failing a healthcheck?
[19:28:26] inflatador: exactly that
[19:28:53] cool, thanks for confirming
[19:28:57] and so for example when we get the backend pybal error, that's it saying it wanted to set `Pooled=false` but couldn't because doing so drops below threshold
[19:42:05] hmm, we have enough dags... I wonder if we should use a tree-like directory structure to organize them like analytics does, or keep it all flat like we have now
[20:01:11] dcausse: inflatador, any worries about this? https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/883653
[20:01:21] ^ installing flink in the image via pip install apache-flink
[20:02:50] ottomata seems reasonable to me, although I'd want d-causse to review just to make sure we're not missing something
[20:07:12] also we should touch base on the flink app stuff. I got it to build last night
[20:07:19] the operator image, that is
[20:22:33] inflatador: great! got some time now if you wanna
[20:24:45] ottomata cool, will start huddle shortly
[20:53:20] ottomata my slack just crashed
[20:53:57] Trey314159 moved back our chat 15m, in a huddle w/ o-ttomata
[20:54:19] inflatador: no problem
[21:09:36] Trey314159 might be a few minutes still, will hit you up when ready
[21:10:00] ok
[21:21:02] Trey314159 and if you have to go that's totally fine too
[21:21:51] inflatador: no worries, I'm hanging out in the meeting while I work. Show up if you can, if not, we'll catch up later
[21:51:43] Trey314159 sorry I missed ya! We can catch up tomorrow or Friday
[21:53:14] Actually, I'm around now if you want—otherwise, tomorrow after the retro?
[21:54:00] Let's do tomorrow if that's cool, I have 1x1 with Ryan in ~5
[21:54:42] works for me!
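
A side note on the ban-vs-depool thread above (19:13 to 19:29): a "ban" is a shard-allocation exclusion applied through the Elasticsearch cluster settings API, which asks the master to relocate shards off the named nodes, while a depool only removes a host from pybal/LVS traffic and moves no shards. The sketch below shows the exclusion mechanism in generic form; the URL, host names, and the use of raw HTTP calls are illustrative assumptions, not the team's actual tooling or cookbooks.

    # Minimal sketch, assuming direct HTTP access to an Elasticsearch cluster.
    # Host names and CLUSTER_URL are hypothetical placeholders, not the real codfw nodes.
    import requests

    CLUSTER_URL = "http://localhost:9200"

    # Stand-ins for the hosts to be taken out before the switch maintenance.
    banned = ["elastic2001", "elastic2002"]

    # "Ban": exclude the named nodes from shard allocation; the master starts
    # moving shards off them, capacity permitting.
    resp = requests.put(
        f"{CLUSTER_URL}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._name": ",".join(banned)}},
        timeout=30,
    )
    resp.raise_for_status()

    # After the maintenance, clear the exclusion (null resets the setting) so
    # shards can be allocated back onto those nodes.
    resp = requests.put(
        f"{CLUSTER_URL}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._name": None}},
        timeout=30,
    )
    resp.raise_for_status()

As noted in the discussion, an exclusion only requests relocation: with 37/50 hosts remaining the cluster may lack the capacity to move everything, so some shards can stay on the excluded nodes regardless, which is why depooling alone was considered sufficient.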