[08:15:52] <jelto>	 GitLab needs a short maintenance break at around 10:00 UTC. Should not take more than 15 minutes.
[08:22:31] <_joe_>	 jelto: do we have any plan to make gitlab more HA so that we don't need maintenance breaks?
[08:23:07] <_joe_>	 if we have to transfer the SRE core repos (I'm thinking of at least puppet, alerts, deployment-charts) to gitlab, I wouldn't be confortable with regular downtimes of this duration
[08:23:42] <_joe_>	 when gerrit is in maintenance I can expect a 2-5 minutes of downtime tops, and it's less frequent by at least an order of magnitude
[08:30:30] <jelto>	 _joe_: we have https://phabricator.wikimedia.org/T323201 to track this. And I agree, having at least one long downtime per month and often multiple ones is quite disruptive.
[08:30:30] <jelto>	 Last time we discussed that topic we came to the conclusion that the complexity from single instance (omnibus) to HA is also magnitudes higher and needs a lot more infrastructure and also more engineers. So we prioritized other features more. But we can re-evaluate that and I can add this feedback to the existing task.
[08:30:59] <_joe_>	 will do :)
[08:31:05] <_joe_>	 I can take care of it 
[08:33:17] <jelto>	 thank you :)
[10:05:44] <jelto>	 GitLab is back, maintenance finished.
[11:34:30] <klausman>	 Anyone aware of issues with the docker registry? I have several hosts that are stcuk in the middle of image pulls
[11:36:23] <klausman>	 (codfw, for that matter)
[12:07:58] <claime>	 klausman: I raised the nginx timeouts yesterday, but I don't think that would result in stuck images
[12:08:11] <klausman>	 Yeah, me neither
[12:08:45] <klausman>	 I see a bunch of error curvs go up in the Swift dashboard, but I can't tell what the base issue is. I have 0 knowledge about Swift and how it works at WMF
[12:09:05] <klausman>	 https://grafana.wikimedia.org/goto/9PpVmIhIk?orgId=1
[12:09:20] <klausman>	 Note "Server errors" and ATS->Swift 500s
[12:10:03] <claime>	 If it's a swift issue, idk either
[12:10:15] <claime>	 Emperor you around?
[12:10:43] <klausman>	 (side note: by now the image fetches that were slow have completed, so I don't have an easy repro atm)
[12:11:16] <klausman>	 But the graphs would still give me pause
[12:11:38] <claime>	 We need to do some cleanup on the registry hosts, they're at 90+% disk usage, but that wouldn't explain the behaviour you're seeing either
[12:13:24] <klausman>	 I also don't see anything obvious firing on AM or Icinga
[12:26:34] <moritzm>	 FYI, I'll be stopping Puppet fleet-wide for about 20m starting in five minutes. if that's a bad time for anyone, let me know
[12:42:06] <moritzm>	 and it's back on
[13:19:49] <Emperor>	 claime: looking (sorry, was fighting DB's stupid website)
[13:23:57] <Emperor>	 is a bit of an uptick in p99 connection establishment in codfw, I'll give it a kick
[13:29:22] <Emperor>	 has anyone been tweaking alerts/icinga recently? ms-fe2010 is in state "unknown" because "NRPE: Command 'check_check_systemd_state' not defined" ... and that's blocking my rolling-restart
[13:29:53] <Emperor>	 ah, no that check is gone and it's all green again
[13:31:15] <volans>	 Emperor: https://gerrit.wikimedia.org/r/c/operations/puppet/+/998822
[13:35:41] <Emperor>	 claime: p99 connection time & swift/ATS 5xx rate look better now
[13:36:14] <Emperor>	 volans: ah, I guess I was just losing a race with that getting fully deployed
[13:52:13] <klausman>	 Emperor: what was the root cause for the 500s from ATS to Swift?
[13:53:04] <klausman>	 (also, the "server errors" graph on https://grafana.wikimedia.org/goto/0EfMNIhSz?orgId=1 still is quite elevated)
[13:56:14] <Emperor>	 klausman: those errors are from the nodes I pulled from swift::storagehosts this morning; I need to stop swift/puppet on them. They're not service-impacting
[13:59:09] <Emperor>	 (now done, so those will decline again)
[14:13:50] <klausman>	 Emperor: thank you!
[16:02:58] <volans>	 btullis: if you get my changes too feel free to puppet-merge
[16:03:06] <btullis>	 Ack, was about to say the same.
[16:03:36] <volans>	 thx
[16:03:37] <volans>	 :)
[16:43:53] <sukhe>	 _joe_: claime: for conf2004 itself and for the codfw maintenance, Traffic will take care of switching it to conf2006 and restarting pybal
[16:44:02] <sukhe>	 is there anything else we need to do for conf2004 other than to downtime it?
[16:44:37] <_joe_>	 sukhe: thanks 
[16:45:35] <sukhe>	 _joe_: so nothing special for conf2004 itself? 
[16:45:45] <_joe_>	 sukhe: no
[16:45:59] <sukhe>	 thanks!
[16:46:04] <sukhe>	 fabfur: ^
[16:53:16] <fabfur>	 ok
[16:56:33] <_joe_>	 i mean, besides that it rejoined the cluster, but if it doesn't, alerts will fire loudly 
[16:57:54] <fabfur>	 anyway, pybal will still use 2006 after the migration or we need to rollback it? 
[16:58:37] <_joe_>	 it would be better to roll it back, yes
[16:59:09] <fabfur>	 ok
[17:43:27] <topranks>	 taavi: congrats on nabbing gerrit patch number 999000
[17:43:36] <topranks>	 there should totally be a prize for whoever gets 1 million 
[17:44:59] <James_F>	 topranks: It'll probably be a bot.
[17:45:11] <sukhe>	 we can get volan.s to volint the puppet repo for us; that's the quickest path to 1 million :)
[17:53:50] <_joe_>	 James_F: 500k was Daimona though :)
[17:53:58] <James_F>	 True.
[17:54:58] <_joe_>	 the first time libraryupgrader has it is 600k 
[17:55:35] <James_F>	 Mark got 123, Ryan L. got 12 and 1234, Trevor got 12345, Aaron got 123456, and m.utante got 654321.
[17:55:48] <_joe_>	 and from there on, it's all libraryupgrader and l10bot
[17:55:50] <_joe_>	 :/
[17:55:55] <_joe_>	 so yeah, you're probably right 
[17:56:16] <James_F>	 The alternative is humans making those patches, more slowly and less consistently, TBF.
[17:56:59] <_joe_>	 here I was thinking "hah, I'll time my scap patch properly and..." I realized that's on gitlab now
[17:57:11] <James_F>	 Ack.
[17:57:32] <James_F>	 Clearly we need to replace every space in operations/dns.git with a tab, and each line's replacement needs to be a different patch, right? ;_)
[17:58:01] <_joe_>	 James_F: right :D
[18:01:01] <sukhe>	 James_F: filing a task
[18:01:21] <James_F>	 Oh gods, what have I wrought? Etc.
[18:06:41] <_joe_>	 James_F: you have provided some SRE a good idea for some childish trolling. You should know better.
[18:07:00] <James_F>	 Fair.
[18:23:36] <herron>	 !incidents
[18:23:37] <sirenbot>	 4432 (UNACKED)  [2x] NELHigh sre (tcp.timed_out)
[18:23:37] <sirenbot>	 4429 (RESOLVED)  [26x] ProbeDown sre (probes/service)
[18:23:37] <sirenbot>	 4431 (RESOLVED)  [2x] HaproxyUnavailable cache_text global sre ()
[18:23:37] <sirenbot>	 4430 (RESOLVED)  [2x] PHPFPMTooBusy appserver sre (php7.4-fpm.service)
[18:23:45] <herron>	 !ack 4432
[18:23:45] <sirenbot>	 4432 (ACKED)  [2x] NELHigh sre (tcp.timed_out)
[18:25:43] <cdanis>	 based on logstash, this looks to be connectivity trouble specifically with Spectrum Business (AS20115)
[18:26:24] <cdanis>	 and with upload-lb.ulsfo
[18:31:40] <cdanis>	 it's apparently gone now, and traffic from that asn looks pretty normal as far as i can tell
[18:32:49] <cdanis>	 oh boy
[18:34:15] <cdanis>	 brennen: there's a sizable decrease in the rate of saved edits that correlates very well with the rollout of wmf.17 to group1 https://sal.toolforge.org/production?p=0&q=1.42.0-wmf.17&d=   https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=1707400738993&to=1707417237042
[18:34:21] <cdanis>	 is that expected for some reason?
[18:35:08] <brennen>	 cdanis: not to my knowledge
[18:35:33] <cdanis>	 hmm
[18:40:38] <brett>	 cdanis: The edit rate seems to decrease before the deployment; Is "deployment" meaning the beginning or end of the rollout?
[18:41:14] <cdanis>	 the end
[18:44:43] <_joe_>	 so, such a decrease can only be wikidata, which is in group1. I do see changes in https://www.wikidata.org/wiki/Special:RecentChanges?hidecategorization=1&limit=50&days=7&urlversion=2, including bot edits
[18:45:37] <_joe_>	 so either a) we move bot edits to a separate counter or b) we've moved them to prometheus or c) we've broken metrics reporting, but I don't think there is an underlying real issue with edits in group 1
[18:45:37] <cdanis>	 _joe_: yeah we were also discussing on -operations heh
[18:45:58] <cdanis>	 I still think it's worth rolling back group1 and seeing what happens to the metric, tbh
[18:46:35] <_joe_>	 +1