[06:53:00] Is there anything wrong with CI? https://gerrit.wikimedia.org/r/c/operations/puppet/+/933373
[06:54:13] There are some spikes at https://grafana.wikimedia.org/d/000000321/zuul?from=now-3d&to=now&orgId=1 but I am not sure if it is normal or not
[07:01:22] hashar: ^
[07:04:33] I did a new patchset and this time it went through, no idea
[07:05:42] But it seems to me that it happens with more repos: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/933125/
[07:07:09] kart_: it is happening on more repos
[07:07:24] This is a new change I just sent: https://gerrit.wikimedia.org/r/933382
[07:07:26] I am going to create a task
[07:07:42] Thanks!
[07:09:31] I have created it with UBN https://phabricator.wikimedia.org/T340518
[07:18:48] same for my changes :(
[07:19:00] Yeah, just added a couple more examples to the task
[07:43:19] There might be old tasks that explain how it got fixed but hashar is likely the expert. I've seen him doing stuff today but no response to ping.
[07:44:27] marostegui: RhinosF1: checking (sorry I had an appointment this morning)
[07:44:41] No worries
[07:44:56] that smells like a stuck connection again
[07:46:50] Max connection count for user jenkins-bot exceeded, rejecting new connection. currentSessionCount = 4, maxSessionCount = 4
[07:48:38] !log Restart Zuul due to stuck connection | T340518 | T309376
[07:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:44] T309376: Zuul jenkins-bot user holding open SSH sessions - https://phabricator.wikimedia.org/T309376
[07:48:45] T340518: CI marking all changes as -1 - https://phabricator.wikimedia.org/T340518
[07:50:02] marostegui: RhinosF1: should be fixed
[07:50:26] I have tried a workaround fix which turned out to cause some other trouble that I haven't been able to narrow down :/
[07:50:28] thanks!
[07:56:25] I'll go ahead with deployment as CI is OK ^^
[09:24:12] moritzm: per yesterday's discussion, I've made an override for my oncall week commencing 2023-08-28 and put you down for that week. I think you're going to do the reverse for w/c 2023-07-24?
[09:25:50] yeah, I'll do that later
[09:27:32] 👍
[10:40:23] btullis: FYI the cumin alias hadoop-worker-canary is not matching any host currently (see email to root@ 'cumin-check-aliases')
[10:40:50] the alias is pointing to analytics1058 that is being decomm'ed
[10:41:07] volans: Ack, thanks for the heads-up. I will update it.
[10:41:16] thx :)
[10:41:39] it should have shown up during the decom run in the puppet repo git grep that is shown at the start
[10:42:11] if not that's a bug I'll have to fix
[10:44:32] It looks like there was a warning, but stevemunene ran the cookbook and I don't have the output: https://phabricator.wikimedia.org/T338227#8946061
[10:51:11] the warning is unrelated. At the start the cookbook shows the user any reference it can find in a bunch of repos for hostnames and IPs, to warn if there are still references in puppet that might cause harm
[10:51:23] https://www.irccloud.com/pastebin/yPRbla5i/
[10:52:27] that's the one
[11:00:01] effie, claime: would you be up for trying the PC config change for dewiki now? I have a conflicting meeting during the deployment window
[11:00:21] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/933184
[11:00:34] Does dewiki make sense? Or should we go for enwiki directly?
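A minimal sketch of how stuck jenkins-bot SSH sessions like the ones above can be inspected and cleared on the Gerrit side, assuming an account with the admin/ViewConnections capability; the session ID is a placeholder and the lowered per-user limit is inferred from the rejection message, not from the actual patch or restart:

    # list open SSH sessions per user (wide output keeps usernames untruncated)
    ssh -p 29418 gerrit.wikimedia.org gerrit show-connections -w
    # close a specific stuck session by the ID shown in the listing above
    ssh -p 29418 gerrit.wikimedia.org gerrit close-connection <SESSION_ID>
    # the "Max connection count ... exceeded" message corresponds to Gerrit's
    # sshd.maxConnectionsPerUser setting (here apparently set to 4; stock default is 64)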
[11:00:41] duesen: Let me do a quick check of the jobrunners/jobqueue, then we can go ahead and proceed with dewiki
[11:01:02] Leave it for an hour or two so we see what happens, and then try en?
[11:01:10] Sure
[11:05:02] So looking at backlog we never went above 8 minutes after the concurrency update which is pretty good imo
[11:05:26] What do you think?
[11:05:59] jobrunners look ok, max load was around 50%
[11:07:26] Yes, it's looking very calm. Which is good, because this way we can notice small changes.
[11:07:33] Can I deploy?
[11:07:53] claime: can you +1 the change?
[11:08:00] yeah sure, sorry
[11:08:12] I was checking if it didn't impact other jobs but doesn't look like it
[11:11:14] duesen: I was at a meeting sorry!
[11:11:21] I am here if needed too
[11:11:30] effie: all good :)
[11:11:35] merging now.
[11:11:41] cool!
[11:12:22] we should see the changes (if any) quite quickly since backlog is around 500ms
[11:15:07] I am wondering whether I should touch a template to trigger load...
[11:19:03] fpm restart running...
[11:27:57] deployment finished 5 minutes ago.
[11:28:08] i see nothing at all in the graphs.
[11:30:47] yeah, no change
[11:32:48] should we go for en, or do you want to trigger some load? effie do you have an opinion?
[11:34:07] I suggest going for en
[11:34:50] let's do that then
[11:34:54] since we added conc and we have seen improvements, we have at least 1 thing to try out before rolling back, if we need to roll back
[11:35:31] right now?
[11:35:34] ok
[11:39:42] diff looks weird
[11:40:40] yea, i screwed up the rebase
[11:40:54] claime: fixed
[11:41:23] if you +
[11:41:28] if you +2, i'll deploy
[11:41:30] +1
[11:41:34] yea :)
[11:41:38] Done :)
[11:42:34] merging
[11:53:05] fpm restart running
[11:53:35] ack
[11:55:48] ok, it's live
[11:56:09] * claime watches
[12:00:15] * duesen doesn't see anything
[12:00:18] It's like watching paint dry, nothing's moving lmao
[12:00:33] i was just about to say the same thing
[12:00:36] are we sure we are doing this correctly?
[12:00:58] pretty sure, yes.
[12:01:18] my interpretation is: the way the race condition has been playing out, the jobrunners were already doing most of the parsing
[12:01:32] so forcing them to do all the parsing doesn't change much
[12:07:05] ok, I think we can call this a success. I need to run out for a bit.
[12:07:12] Can we do all wikis tomorrow?
[12:19:05] σθρε
[12:19:08] sure
[14:23:56] TIL about ':(exclude)' in git pathspecs
[15:23:19] !log hi all fyi i have temporarily broken puppet-merge, fix is being done
[15:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:37] [I infer everyone else already knew about that feature :) ]
[15:23:57] oh no that wasn't meant to be a log
[15:25:30] Good log tho
[15:25:32] Very polite
[15:36:09] ok all fixed again
[15:37:09] it would have been a nice tweet too :-P
[15:37:19] but apparently the integration has been broken since June 13th
[15:45:17] volans: did you happen to see where the alternative brand of sfp-t actually did fix the pxe boot issue?
[15:45:56] and what's more, the same issue occurred on another host (after I moved on), and it solved it there as well
[15:46:09] hardware is the worst.
[15:46:10] urandom: what do you mean by "where it fixed it"?
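A quick illustration of the ':(exclude)' pathspec mentioned above; the paths are made up for the example:

    # diff everything except a vendored directory
    git diff -- . ':(exclude)vendor/'
    # same idea using the short form of the exclude magic
    git log --oneline -- . ':!docs/'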
[15:46:24] last I checked we had no idea why it failed
[15:46:43] more specifically why the port on the switch would turn down and stay down
[15:46:46] on reboot
[15:47:38] right, J.ennH replaced the SFP-T with another of the same brand, to no effect, and then replaced it with one of a different brand (wave2wave optics), and it worked
[15:47:58] i.e. it no longer turned down and stayed down
[15:48:39] yeah I saw in dc-ops but I hve no idea why that would be the case
[15:48:40] then, that same thing happened with 2003, and the same solution (replacing the adapter with a wave2wave optics) also solved the issue
[15:48:51] neither do i
[15:48:59] which is why I mentioned it, it's wild!
[15:49:09] maybe netops might know some more, but they are both out
[15:51:09] X.ioNoX seemed to support trying it, which seemed to suggest he at least thought it might make a difference
[15:51:54] but yes, I have questions, even if I'm afraid I won't like the answers
[15:52:51] yeah I bet
[19:17:11] Is using redis discouraged? We have been phasing out many usages but I don't know if they were just done to phase them out specifically or the plan is to completely get rid of redis in our production infra
[19:28:22] volans: ouch sorry for abandoning then
[19:28:32] no prob
[20:10:26] Amir1: it's a discussion we want to have tbh. At least for the misc redis cluster that serviceops maintains. I know gitlab uses redis internally too, I guess that use case isn't going away
[20:10:48] thanks
[20:11:52] I'm writing a basic guide on what storage developers should use and was wondering if I should suggest redis or not
[20:12:57] https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc
[20:13:19] That's a pretty good question
[20:14:05] Let me ask you a similar question. Will you add elasticsearch?
[20:14:33] I added there
[20:14:50] I personally think there are usecases we could move to elastic, e.g. RC
[20:16:17] Agreed, but imho That's not enough to offer it to people as a suggested data store service
[20:18:06] IPoid's data scheme feels a pretty natural match for elasticsearch too
[20:18:08] yeah, mostly I want to have an informative page with all the details to consider
[20:18:50] like a basic part of the page is this
[20:18:52] https://usercontent.irccloud-cdn.com/file/LwSoVRRz/grafik.png
[20:18:55] The moment you see what a single entry is (structured json) elasticsearch just rings a 🔔 in your 🗣️
[20:19:35] akosiaris: why not mongodb?
[20:19:45] haha, I was actually thinking of redis for ipoid
[20:19:56] my ears are burning
[20:20:15] Reedy: "but it doesn't store the data" said an esteemed SRE a couple years ago
[20:20:32] s/elasticsearch/opensearch/
[20:20:37] Amir1: yup, redis is the other clear candidate tech
[20:20:38] Reedy: everyone's favourite topic, aka software licensing
[20:21:02] taavi: that's only a problem if you want to use a modern version ;)
[20:21:44] inflatador: no need to be worried, I don't think it would be wise to push tons of arbitrary data down the current elasticsearch clusters
[20:22:23] As fun as would be ;)
[20:22:29] ...err..."it would be"
[20:22:41] inflatador: actually, would you mind adding some pieces on limitations of ES/OS to the document I'm writing?
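To make the "structured json" point above concrete, a minimal sketch of what storing and querying one such entry in Elasticsearch/OpenSearch looks like; the index name and fields are hypothetical, not IPoid's actual schema:

    # index one JSON document (field names are made up for illustration)
    curl -XPUT 'http://localhost:9200/ipoid-test/_doc/192.0.2.10' \
      -H 'Content-Type: application/json' \
      -d '{"ip": "192.0.2.10", "risks": ["TUNNEL"], "proxies": ["VPN"], "last_seen": "2023-06-28"}'
    # query documents back by attribute
    curl 'http://localhost:9200/ipoid-test/_search?q=risks:TUNNEL'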
[20:23:00] * bd808 has stuffed more than his fair share of "extra" data into the cirrussearch cluster over the years
[20:23:00] once done, I'll bother you
[20:23:05] Amir1 sure, hit me up with a task
[20:23:10] They were built for a specific purpose, repurposing them without serious discussion and changes wouldn't be prudent
[20:23:30] I forgot Blazegraph!
[20:23:37] No
[20:23:42] we've had a task for creating a new cluster forever (sorry bd808 ) and I think that would be a good place to start
[20:23:56] we should definitely recommend to pour terabytes and terabytes of data into blazegraph
[20:24:01] Don't suggest blazegraph to anyone for anything please
[20:24:06] Reedy: ouch. I have a 3.4 instance running somewhere since that's what the unifi controller needs
[20:24:15] :D
[20:25:15] Ubiquiti user, so on brand for you taavi
[20:25:16] Less of a "blaze" and more of a "dumpster fire" graph
[20:25:26] I got one of their APs before the pandemic, and I think it's the same version that it was running back then :/
[20:25:26] Ahahaha
[20:25:38] almost guaranteed it's got security issues in it then
[20:26:05] It auto updates, doesn't it? At least the APs
[20:26:19] Depends on your config
[20:26:20] the swishy UI makes up for any security issues, surely? /s
[20:26:23] And your controller version
[20:26:30] Even the controller, depending on how it is installed
[20:26:56] I just got some Mikrotik stuff to replace the ubiquiti. No comment on the quality of either brand, just looking at something new
[20:26:57] TheresNoTime: you don't want to know the details of my network setup at home
[20:27:28] I assume "homelab-er" doesn't really cut it? :p
[20:27:48] taavi: I doubt we haven't seen similar things already
[20:28:08] I used to run a Cisco catalyst at home
[20:28:15] you don't need to really worry about security of blazegraph, there are more ways I can bring down WDQS that I can count
[20:28:26] s/that/than
[20:28:29] I stopped cause the noise was too much at some point
[20:28:35] inflatador: CLJ
[20:28:40] yeah, I'm sure I'm not the only one running such a ridiculous setup at home
[20:28:48] akosiaris: I remember when someone asked if we can have WDQS at Miraheze. I immediately said not a chance.
[20:29:09] Good for you
[20:31:22] Not that I'm advocating this, but I'm always curious how WDQS would perform in Amazon Neptune (AMZ bought Blazegraph and closed-sourced it into Neptune). I wonder if anyone even approaches that scale there
[20:31:49] I have a couple of the Mikrotik hEXs about, going from the ubiquiti's UI to RouterOS is pretty jarring though :')
[20:31:56] honestly, we will have way less problems if wikidata gets rid of research papers :D
[20:32:03] in everywhere
[20:32:10] We're working on splitting that out, dunno how long it will take though
[20:50:53] https://www.mediawiki.org/wiki/User:ASarabadani_(WMF)/Database_for_devs_toolkit/Concepts/Choosing_storage_technology very very basic and needs a lot of changes
[21:06:27] 👍
[21:09:26] made myself a note, should be up by Friday
[21:13:25] thanks!
[21:24:16] does it matter that we see the Debian default page here? https://prometheus-codfw.wikimedia.org/
[21:25:15] I think that's ok, urls like https://prometheus-codfw.wikimedia.org/ops/ work which is the main part
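On the Debian default page question at the end: a minimal sketch of why the per-instance path works while the bare root does not, assuming the usual multi-instance setup where each Prometheus runs under its own route prefix behind a web server; the flags and file paths are illustrative, not the actual production configuration:

    # each instance serves under its own prefix, so /ops/ resolves while / falls
    # through to the web server's default document root (the Debian placeholder page)
    /usr/bin/prometheus \
      --web.external-url=https://prometheus-codfw.wikimedia.org/ops/ \
      --web.route-prefix=/ops/ \
      --config.file=/etc/prometheus/ops/prometheus.yml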