[08:15:52] GitLab needs a short maintenance break at around 10:00 UTC. Should not take more than 15 minutes.
[08:22:31] <_joe_> jelto: do we have any plan to make gitlab more HA so that we don't need maintenance breaks?
[08:23:07] <_joe_> if we have to transfer the SRE core repos (I'm thinking of at least puppet, alerts, deployment-charts) to gitlab, I wouldn't be comfortable with regular downtimes of this duration
[08:23:42] <_joe_> when gerrit is in maintenance I can expect 2-5 minutes of downtime tops, and it's less frequent by at least an order of magnitude
[08:30:30] _joe_: we have https://phabricator.wikimedia.org/T323201 to track this. And I agree, having at least one long downtime per month and often multiple ones is quite disruptive.
[08:30:30] Last time we discussed that topic we came to the conclusion that the complexity of going from a single instance (omnibus) to HA is also orders of magnitude higher and needs a lot more infrastructure and also more engineers. So we prioritized other features more. But we can re-evaluate that and I can add this feedback to the existing task.
[08:30:59] <_joe_> will do :)
[08:31:05] <_joe_> I can take care of it
[08:33:17] thank you :)
[10:05:44] GitLab is back, maintenance finished.
[11:34:30] Anyone aware of issues with the docker registry? I have several hosts that are stuck in the middle of image pulls
[11:36:23] (codfw, for that matter)
[12:07:58] klausman: I raised the nginx timeouts yesterday, but I don't think that would result in stuck images
[12:08:11] Yeah, me neither
[12:08:45] I see a bunch of error curves go up in the Swift dashboard, but I can't tell what the base issue is. I have 0 knowledge about Swift and how it works at WMF
[12:09:05] https://grafana.wikimedia.org/goto/9PpVmIhIk?orgId=1
[12:09:20] Note "Server errors" and ATS->Swift 500s
[12:10:03] If it's a swift issue, idk either
[12:10:15] Emperor you around?
[12:10:43] (side note: by now the image fetches that were slow have completed, so I don't have an easy repro atm)
[12:11:16] But the graphs would still give me pause
[12:11:38] We need to do some cleanup on the registry hosts, they're at 90+% disk usage, but that wouldn't explain the behaviour you're seeing either
[12:13:24] I also don't see anything obvious firing on AM or Icinga
[12:26:34] FYI, I'll be stopping Puppet fleet-wide for about 20m starting in five minutes. If that's a bad time for anyone, let me know
[12:42:06] and it's back on
[13:19:49] claime: looking (sorry, was fighting DB's stupid website)
[13:23:57] There's a bit of an uptick in p99 connection establishment in codfw, I'll give it a kick
[13:29:22] has anyone been tweaking alerts/icinga recently? ms-fe2010 is in state "unknown" because "NRPE: Command 'check_check_systemd_state' not defined" ... and that's blocking my rolling-restart
[13:29:53] ah, no, that check is gone and it's all green again
[13:31:15] Emperor: https://gerrit.wikimedia.org/r/c/operations/puppet/+/998822
[13:35:41] claime: p99 connection time & swift/ATS 5xx rate look better now
[13:36:14] volans: ah, I guess I was just losing a race with that getting fully deployed
[13:52:13] Emperor: what was the root cause for the 500s from ATS to Swift?
[13:53:04] (also, the "server errors" graph on https://grafana.wikimedia.org/goto/0EfMNIhSz?orgId=1 is still quite elevated)
[13:56:14] klausman: those errors are from the nodes I pulled from swift::storagehosts this morning; I need to stop swift/puppet on them. They're not service-impacting
[13:59:09] (now done, so those will decline again)
[14:13:50] Emperor: thank you!
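For anyone who would rather watch the ATS->Swift error rate from a terminal than keep reloading the Grafana panel linked above, a minimal sketch against the Prometheus HTTP API could look like the following. The Prometheus endpoint and the metric name are illustrative assumptions, not the actual WMF names:

```python
# Minimal sketch: query a Prometheus instance for the current swift-proxy 5xx
# request rate, broken down by status code. The endpoint and metric name are
# hypothetical placeholders, not real WMF names.
import requests

PROMETHEUS = "https://prometheus.example.org/ops"  # hypothetical endpoint
QUERY = 'sum(rate(swift_proxy_server_requests_total{status=~"5.."}[5m])) by (status)'

def swift_5xx_rate():
    """Return the current 5xx request rate per status code, or {} if no data."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # Each result carries a label set and an (epoch, value) pair.
    return {r["metric"].get("status", "?"): float(r["value"][1]) for r in results}

if __name__ == "__main__":
    for status, rate in sorted(swift_5xx_rate().items()):
        print(f"{status}: {rate:.2f} req/s")
```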
[16:02:58] btullis: if you get my changes too feel free to puppet-merge
[16:03:06] Ack, was about to say the same.
[16:03:36] thx
[16:03:37] :)
[16:43:53] _joe_: claime: for conf2004 itself and for the codfw maintenance, Traffic will take care of switching it to conf2006 and restarting pybal
[16:44:02] is there anything else we need to do for conf2004 other than to downtime it?
[16:44:37] <_joe_> sukhe: thanks
[16:45:35] _joe_: so nothing special for conf2004 itself?
[16:45:45] <_joe_> sukhe: no
[16:45:59] thanks!
[16:46:04] fabfur: ^
[16:53:16] ok
[16:56:33] <_joe_> i mean, besides checking that it rejoins the cluster, but if it doesn't, alerts will fire loudly
[16:57:54] anyway, will pybal still use 2006 after the migration, or do we need to roll it back?
[16:58:37] <_joe_> it would be better to roll it back, yes
[16:59:09] ok
[17:43:27] taavi: congrats on nabbing gerrit patch number 999000
[17:43:36] there should totally be a prize for whoever gets 1 million
[17:44:59] topranks: It'll probably be a bot.
[17:45:11] we can get volan.s to volint the puppet repo for us; that's the quickest path to 1 million :)
[17:53:50] <_joe_> James_F: 500k was Daimona though :)
[17:53:58] True.
[17:54:58] <_joe_> the first one libraryupgrader got was 600k
[17:55:35] Mark got 123, Ryan L. got 12 and 1234, Trevor got 12345, Aaron got 123456, and m.utante got 654321.
[17:55:48] <_joe_> and from there on, it's all libraryupgrader and l10bot
[17:55:50] <_joe_> :/
[17:55:55] <_joe_> so yeah, you're probably right
[17:56:16] The alternative is humans making those patches, more slowly and less consistently, TBF.
[17:56:59] <_joe_> here I was thinking "hah, I'll time my scap patch properly and..." then I realized that's on gitlab now
[17:57:11] Ack.
[17:57:32] Clearly we need to replace every space in operations/dns.git with a tab, and each line's replacement needs to be a different patch, right? ;-)
[17:58:01] <_joe_> James_F: right :D
[18:01:01] James_F: filing a task
[18:01:21] Oh gods, what have I wrought? Etc.
[18:06:41] <_joe_> James_F: you have provided some SRE with a good idea for some childish trolling. You should know better.
[18:07:00] Fair.
[18:23:36] !incidents
[18:23:37] 4432 (UNACKED) [2x] NELHigh sre (tcp.timed_out)
[18:23:37] 4429 (RESOLVED) [26x] ProbeDown sre (probes/service)
[18:23:37] 4431 (RESOLVED) [2x] HaproxyUnavailable cache_text global sre ()
[18:23:37] 4430 (RESOLVED) [2x] PHPFPMTooBusy appserver sre (php7.4-fpm.service)
[18:23:45] !ack 4432
[18:23:45] 4432 (ACKED) [2x] NELHigh sre (tcp.timed_out)
[18:25:43] based on logstash, this looks to be connectivity trouble specifically with Spectrum Business (AS20115)
[18:26:24] and with upload-lb.ulsfo
[18:31:40] it's apparently gone now, and traffic from that asn looks pretty normal as far as i can tell
[18:32:49] oh boy
[18:34:15] brennen: there's a sizable decrease in the rate of saved edits that correlates very well with the rollout of wmf.17 to group1 https://sal.toolforge.org/production?p=0&q=1.42.0-wmf.17&d= https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=1707400738993&to=1707417237042
[18:34:21] is that expected for some reason?
[18:35:08] cdanis: not to my knowledge
[18:35:33] hmm
[18:40:38] cdanis: The edit rate seems to decrease before the deployment; does "deployment" mean the beginning or the end of the rollout?
[18:41:14] the end
[18:44:43] <_joe_> so, such a decrease can only be wikidata, which is in group1. I do see changes in https://www.wikidata.org/wiki/Special:RecentChanges?hidecategorization=1&limit=50&days=7&urlversion=2, including bot edits
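A quick way to sanity-check the bot vs. non-bot activity being eyeballed in Special:RecentChanges above is to count recent edits via the MediaWiki action API. This is only a rough sketch for cross-checking; it is not how the Grafana edit-count dashboard computes its metric:

```python
# Rough cross-check of wikidata edit activity: derive an approximate
# edits/minute figure from the most recent changes, split into bot and
# non-bot edits. Sketch only, not the dashboard's actual data pipeline.
from datetime import datetime
import requests

API = "https://www.wikidata.org/w/api.php"

def recent_edit_rate(show):
    """Edits per minute over the window covered by the last 500 changes."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit",
        "rcshow": show,          # "bot" or "!bot"
        "rcprop": "timestamp",
        "rclimit": 500,
        "format": "json",
    }
    changes = requests.get(API, params=params, timeout=10).json()["query"]["recentchanges"]
    times = [datetime.fromisoformat(c["timestamp"].replace("Z", "+00:00")) for c in changes]
    span_min = (max(times) - min(times)).total_seconds() / 60 or 1
    return len(times) / span_min

if __name__ == "__main__":
    print("non-bot edits/min:", round(recent_edit_rate("!bot"), 1))
    print("bot edits/min:    ", round(recent_edit_rate("bot"), 1))
```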
[18:45:37] <_joe_> so either a) we move bot edits to a separate counter, or b) we've moved them to prometheus, or c) we've broken metrics reporting, but I don't think there is an underlying real issue with edits in group1
[18:45:37] _joe_: yeah we were also discussing it on -operations heh
[18:45:58] I still think it's worth rolling back group1 and seeing what happens to the metric, tbh
[18:46:35] <_joe_> +1
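If group1 does get rolled back, one way to confirm what a given wiki is actually serving is the siteinfo API, whose "generator" field reports the deployed branch (e.g. "MediaWiki 1.42.0-wmf.17"). A small sketch follows; the wiki hostnames are just illustrative group1-style examples:

```python
# Minimal sketch: ask a wiki which MediaWiki version it is serving, useful
# for confirming a train rollback took effect. Hostnames are illustrative.
import requests

def deployed_version(host):
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    }
    data = requests.get(f"https://{host}/w/api.php", params=params, timeout=10).json()
    return data["query"]["general"]["generator"]

if __name__ == "__main__":
    for host in ("www.wikidata.org", "commons.wikimedia.org"):
        print(host, "->", deployed_version(host))
```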