[07:40:29] hello folks
[07:41:07] I created https://phabricator.wikimedia.org/T337825 to move varnishkafka to pki, and have a way to automatically restart those systemd units when a new cert is issued
[07:41:26] lemme know if you like the overall idea or if you prefer another road (I'll also seek DE's approval of course)
[08:57:33] <_joe_> vgutierrez / fabfur can I get a seal of approval on https://gerrit.wikimedia.org/r/c/operations/puppet/+/924080 ?
[09:35:37] * vgutierrez checking
[09:40:20] <_joe_> vgutierrez: thanks
[09:40:35] <_joe_> with "proceed with caution" you mean run puppet on a host first, see if it works
[09:40:43] <_joe_> then deploy everywhere, correct?
[12:12:57] 10Traffic, 10Wikibase Product Platform, 10Patch-For-Review: Beta wikidata rejects PATCH requests - https://phabricator.wikimedia.org/T336659 (10Ottomata)
[12:15:36] !log lvs*: disabling puppet to roll out new LVS IPs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/924593 - T334703
[12:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:39] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703
[12:15:52] wrong window heh
[12:35:53] Hi folks; T338083 is on the clinic duty triage pile - it's an access request to allow an external user of maps per https://wikitech.wikimedia.org/wiki/Maps/External_usage ; I think it's not within the terms of use, but previously such requests were (I think) handled by traffic. Do you know who should be saying yes/no to such requests?
[12:35:53] T338083: Allow Wikimedia Maps usage on Mobile Application written with Qt - https://phabricator.wikimedia.org/T338083
[12:36:49] I don't think we've traditionally been the arbiter
[12:37:30] I'm sort of tempted to say "this doesn't seem to be within the terms of use", but I feel it's not really my domain (and probably above my pay grade)
[12:38:36] no, there was a process/team for evaluating these, I think. trying to track that down
[12:38:41] I just know it's not SRE
[12:39:19] 10Wikimedia-Apache-configuration, 10Infrastructure-Foundations, 10SRE, 10Security-Team, and 3 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10MatthewVernon)
[12:39:47] thanks!
[12:43:38] well, nobody's really added new sites under such a regime in a long time apparently (if ever, since we went to that policy)
[12:44:07] it looks like the last addition was https://gerrit.wikimedia.org/r/c/operations/puppet/+/670229 in 2021, but it was from a staffer adding access to a known consultant for some internal project or something.
[12:44:34] so, I'm gonna guess that the process was never really well-established anywhere
[12:45:04] "yay"
[12:45:42] https://phabricator.wikimedia.org/T261424#6419685 was when I pointed out this would be a problem a few years ago lol
[12:46:44] best answer on the ticket as to our standards was chris's:
[12:46:53] - One bright line that we could draw is "affiliates and other community projects that are listed on metawiki". I think that would be both reasonable and workable. (Although I think any additions to the list of allowed sites should be reactive to requests, not proactive based on everything already on meta.
[12:47:57] I don't think this request is even trying to claim it's related to Wikimedia projects, so to me it seems clear it's not allowed
[12:48:42] yeah I'd tend to agree, I just really wish tech staff weren't the ones responsible for the policy. But given the state of affairs, we should probably reject it ourselves and point at the policy.
[12:48:48] Yeah, I think this is a reasonably clear "no", unless there's some affiliation link which the requestor hasn't provided. Happy to say as much
[12:49:48] thanks for at least confirming there isn't some procedure I should have followed ;-)
[12:50:05] I thought there was, but I think that was my backwards-looking rose-colored glasses
[14:01:00] Emperor: not really, it was mostly me spitballing on a task at the time, as bblack posted :)
[14:47:13] 10Traffic, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2009.codfw.wmnet` - lvs2009.codfw.wmnet (**WARN**) - Downtimed ho...
[15:16:53] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul)
[15:49:06] 10netops, 10Infrastructure-Foundations, 10SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[15:55:42] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye
[16:01:14] 10Traffic, 10SRE: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651 (10Krinkle) >>! In T86651#973435, @mark wrote: > FWIW: An alternative sh implementation that I've written for an old kernel and fixes some of these issues (a looong time ago), lives [[ http://svn.wikimedia.org/viewvc/mediawi...
[16:17:12] 10Wikimedia-Apache-configuration, 10Infrastructure-Foundations, 10SRE, 10Security-Team, and 4 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10sbassett)
[16:37:21] 10netops, 10Infrastructure-Foundations, 10SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[16:37:37] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye completed...
[16:59:33] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul)
[17:02:18] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul)
[17:15:48] sanity/typo review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/926605 would be appreciated. it's an ATS config change - disabling https://sitemap.wikimedia.org
[17:16:07] also it's at the same time "fyi, killing sitemaps" :p
[17:18:44] mutante: seems like we have sitemaps.wm.org in a bunch of other places too
[17:18:47] do we care about those?
[17:20:31] sukhe: if they are in operations/puppet and dns, I already got them on my list and just about the order of decom'ing. if they are in OTHER repos though.. please show me
[17:21:22] eventually I would merge all on https://gerrit.wikimedia.org/r/q/topic:sitemaps so far
[17:21:39] sukhe: oh, YES, thanks for reminding me! there is also https://gerrit.wikimedia.org/r/c/operations/puppet/+/926611
[17:21:46] I already made this and I am not sure about it :)
[17:21:55] but that is a question for traffic basically
[17:22:21] because it appears in VTL and VCT, that is what you mean I bet :)
[17:22:28] VCL
[17:22:34] VTC.. even
[17:24:17] yea, so I change my request to first https://gerrit.wikimedia.org/r/c/operations/puppet/+/926611 :)
[17:24:25] no rush or need for realtime
[17:24:38] all good, looking :)
[18:04:14] 10Traffic, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) @DAlangi_WMF we talked about this the other day, can you share...
[18:04:29] 10Traffic, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) a:03DAlangi_WMF
[19:27:48] Does anyone know what the pybal alert on lvs1018 is about? I've got a UBN ticket about the wikireplicas that I think I caused and I'm looking for clues about how to resolve: T338172
[19:27:49] T338172: Can't connect to analytics replicas from Toolforge - https://phabricator.wikimedia.org/T338172
[19:36:30] btullis: some pybal maintenance was ongoing
[19:36:48] should clear up soonish otherwise I will check
[19:37:18] brett: can you see if lvs1018 cleared up please? getting to a computer shortly
[19:37:51] sukhe: OK, thanks.
[19:38:40] btullis: still an issue? Theoretically there shouldn't have been any disruption
[19:40:41] brett: Many thanks. The issue seems to have been resolved, but I don't know what caused the problem with wikireplicas so I'm still on-edge.
[19:41:23] The report was opened two hours ago and I only started one hour ago
[19:41:25] This is what I did to switch the wikireplicas-a service from dbproxy1018 to dbproxy1019 - about 30 seconds apart: https://phabricator.wikimedia.org/T315426#8903257
[19:43:11] This has worked previously, although it's not an ideal solution. We had one report of the wikireplicas being down from much earlier in the day, 11:55 UTC - https://phabricator.wikimedia.org/T338172#8903899
[19:46:44] 10Traffic, 10Patch-For-Review: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 (10BCornwall)
[19:46:51] So I don't know whether or not I caused this incident, or whether I was unlucky to have started it during a period of LVS maintenance, or something else.
[19:50:26] btullis: I don't think it was you
[19:50:34] it just happened because lvs1018 had pybal disabled at that time
[19:50:38] (back)
[19:51:35] so just a case of bad timing that lvs1018 was undergoing maintenance at that time, definitely not you :)
[19:52:02] so this issue is resolved now I guess?
[19:52:16] Yeah, they mentioned it was
[19:52:37] ah so the question was what caused it
[19:52:43] yeah that's my theory
[19:57:21] yeah but there's two different pooledness states: there's the etcd-controlled one and health-controlled one
[19:57:47] if there's only two backends (as is the case here I think), and one's auto-depooled for bad healthchecks and you manual-depool the other, you end up in this scenario.
[19:57:56] (or if you etcd-depool both, of course)
[19:58:51] Thanks all so much.
[19:59:20] <3 Makes sense and I feel better. Will try to write it up when I get a moment.
[20:00:01] oh right, we disabled health on these though, so it should be all etcd
[20:01:54] looking at the pybal logs, I can see all the changes, but lvs1018 wasn't being worked on at that time (~16:19->21-ish UTC)
[20:02:15] my guess would be the one you flipped over to wasn't actually working right
[20:02:43] there was another bout of activity ~17:58, which seemed to go back to both-pooled
[20:03:08] and then disabled one of them
[20:03:09] hmmm
[20:03:28] and then at 19:06, we had a pybal restart
[20:04:21] (well, a stop, which does a failover to lvs1020. then later lvs1018 started back up)
[20:05:10] I'm trying to reconstruct a timeline
[20:05:16] but it looks like, from pybal logs:
[20:06:01] : db1018:pooled, db1019:depooled
[20:06:24] <16:19>: db1018:pooled, db1019:pooled
[20:06:42] <16:21>: db1018:depooled, db1019:pooled
[20:07:02] <17:58>: db1018:pooled, db1019:pooled
[20:07:31] <17:58>: db1018:pooled, db1019:depooled (~18 seconds after the earlier change, which takes us back to the initial steady state)
[20:08:02] I would guess that 1019 was never actually able to service users at some other level
[20:09:30] (all times UTC)
[20:10:32] as for lvs1018 operations:
[20:11:27] <13:33>: lvs1018 restarted (quick restart, by me)
[20:12:06] <19:06>: lvs1018 pybal stopped (by brett)
[20:12:26] <19:24>: lvs1018 pybal started (by brett)
[20:12:31] so I don't think it aligns
[20:12:58] I think those were brett anyways, but that's the stop -> start the logs show
[20:14:25] yes brett's
[20:14:36] during which pybal was stopped and puppet disabled
[20:15:02] yeah but that just means lvs1020 was doing the same things for the same service
[20:15:20] right
[20:21:14] I wouldn't rule out pybal state-bugs of course, especially with that diff error showing in icinga
[20:56:52] 10Traffic, 10Patch-For-Review: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 (10BCornwall)
[21:30:14] 10Traffic: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 (10BCornwall) 05In progress→03Resolved
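
(Editor's aside on the "two pooledness states" described at 19:57 above: a backend only receives traffic when the etcd-controlled state and the healthcheck-controlled state both say yes, so with only two backends, manually depooling one while the other is auto-depooled for failing checks leaves the service with nothing to serve from. A minimal toy model of that interaction, with hypothetical names and not pybal's actual code:)

    from dataclasses import dataclass

    @dataclass
    class Backend:
        name: str
        etcd_pooled: bool   # operator-controlled state (confctl/etcd)
        health_up: bool     # healthcheck-controlled state (monitors)

    def live_backends(backends, health_enabled=True):
        """Backends that can actually receive traffic.

        A backend serves only if it is etcd-pooled AND, when health
        checking is enabled, its monitors report it as up.
        """
        return [
            b.name for b in backends
            if b.etcd_pooled and (b.health_up or not health_enabled)
        ]

    # Two-backend service, e.g. a wikireplicas dbproxy pair (illustrative only).
    pool = [
        Backend("dbproxy1018", etcd_pooled=True, health_up=True),
        Backend("dbproxy1019", etcd_pooled=True, health_up=False),  # failing checks
    ]

    # Manually depool the healthy one (the switchover step) and nothing is left:
    pool[0].etcd_pooled = False
    print(live_backends(pool))                        # -> []

    # With health checks disabled (as noted at 20:00:01), only etcd state matters:
    print(live_backends(pool, health_enabled=False))  # -> ['dbproxy1019']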