[07:40:29] hello folks
[07:41:07] I created https://phabricator.wikimedia.org/T337825 to move varnishkafka to pki, and have a way to automatically restart those systemd units when a new cert is issued
[07:41:26] lemme know if you like the overall idea or if you prefer another road (I'll also seek DE's approval of course)
[08:57:33] <_joe_> vgutierrez / fabfur can I get a seal of approval on https://gerrit.wikimedia.org/r/c/operations/puppet/+/924080 ?
[09:35:37] * vgutierrez checking
[09:40:20] <_joe_> vgutierrez: thanks
[09:40:35] <_joe_> with "proceed with caution" you mean run puppet on a host first, see if it works
[09:40:43] <_joe_> then deploy everywhere, correct?
[12:12:57] 10Traffic, 10Wikibase Product Platform, 10Patch-For-Review: Beta wikidata rejects PATCH requests - https://phabricator.wikimedia.org/T336659 (10Ottomata)
[12:15:36] !log lvs*: disabling puppet to roll out new LVS IPs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/924593 - T334703
[12:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:39] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703
[12:15:52] wrong window heh
[12:35:53] Hi folks; T338083 is on the clinic duty triage pile - it's an access request to allow an external user of maps per https://wikitech.wikimedia.org/wiki/Maps/External_usage ; I think it's not within the terms of use, but previously such requests were (I think) handled by traffic. Do you know who should be saying yes/no to such requests?
[12:35:53] T338083: Allow Wikimedia Maps usage on Mobile Application written with Qt - https://phabricator.wikimedia.org/T338083
[12:36:49] I don't think we've traditionally been the arbiter
[12:37:30] I'm sort of tempted to say "this doesn't seem to be within the terms of use", but I feel it's not really my domain (and probably above my pay grade)
[12:38:36] no, there was a process/team for evaluating these, I think. trying to track that down
[12:38:41] I just know it's not SRE
[12:39:19] 10Wikimedia-Apache-configuration, 10Infrastructure-Foundations, 10SRE, 10Security-Team, and 3 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10MatthewVernon)
[12:39:47] thanks!
[12:43:38] well, nobody's really added new sites under such a regime in a long time apparently (if ever, since we went to that policy)
[12:44:07] it looks like the last addition was https://gerrit.wikimedia.org/r/c/operations/puppet/+/670229 in 2021, but it was from a staffer adding access to a known consultant for some internal project or something.
[12:44:34] so, I'm gonna guess that the process was never really well-established anywhere
[12:45:04] "yay"
[12:45:42] https://phabricator.wikimedia.org/T261424#6419685 was when I pointed out this would be a problem a few years ago lol
[12:46:44] best answer on the ticket as to our standards was chris's:
[12:46:53] - One bright line that we could draw is "affiliates and other community projects that are listed on metawiki". I think that would be both reasonable and workable. (Although I think any additions to the list of allowed sites should be reactive to requests, not proactive based on everything already on meta.
[12:47:57] I don't think this request is even trying to claim it's related to Wikimedia projects, so to me it seems clear it's not allowed
[12:48:42] yeah I'd tend to agree, I just really wish tech staff weren't the ones responsible for the policy. But given the state of affairs, we should probably reject it ourselves and point at the policy.
[12:48:48] Yeah, I think this is a reasonably clear "no", unless there's some affiliation link which the requestor hasn't provided. Happy to say as much
[12:49:48] thanks for at least confirming there isn't some procedure I should have followed ;-)
[12:50:05] I thought there was, but I think that was my backwards-looking rose-colored glasses
[14:01:00] Emperor: not really, it was mostly me spitballing on a task at the time, as bblack posted :)
[14:47:13] 10Traffic, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2009.codfw.wmnet` - lvs2009.codfw.wmnet (**WARN**) - Downtimed ho...
[15:16:53] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul)
[15:49:06] 10netops, 10Infrastructure-Foundations, 10SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[15:55:42] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye
[16:01:14] 10Traffic, 10SRE: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651 (10Krinkle) >>! In T86651#973435, @mark wrote: > FWIW: An alternative sh implementation that I've written for an old kernel and fixes some of these issues (a looong time ago), lives [[ http://svn.wikimedia.org/viewvc/mediawi...
[16:17:12] 10Wikimedia-Apache-configuration, 10Infrastructure-Foundations, 10SRE, 10Security-Team, and 4 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10sbassett)
[16:37:21] 10netops, 10Infrastructure-Foundations, 10SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[16:37:37] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye completed...
[16:59:33] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul)
[17:02:18] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul)
[17:15:48] sanity/typo review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/926605 would be appreciated. it's an ATS config change - disabling https://sitemap.wikimedia.org
[17:16:07] also it's at the same time "fyi, killing sitemaps" :p
[17:18:44] mutante: seems like we have sitemaps.wm.org in a bunch of other places too
[17:18:47] do we care about those?
[17:20:31] sukhe: if they are in operations/puppet and dns, I already got them on my list and just about the order of decom'ing. if they are in OTHER repos though.. please show me
[17:21:22] eventually I would merge all on https://gerrit.wikimedia.org/r/q/topic:sitemaps so far
[17:21:39] sukhe: oh, YES, thanks for reminding me! there is also https://gerrit.wikimedia.org/r/c/operations/puppet/+/926611
[17:21:46] I already made this and I am not sure about it :)
[17:21:55] but that is a question for traffic basically
[17:22:21] because it appears in VTL and VCT, that is what you mean I bet :)
[17:22:28] VCL
[17:22:34] VTC.. even
[17:24:17] yea, so I change my request to first https://gerrit.wikimedia.org/r/c/operations/puppet/+/926611 :)
[17:24:25] no rush or need for realtime
[17:24:38] all good, looking :)
[18:04:14] 10Traffic, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) @DAlangi_WMF we talked about this the other day, can you share...
[18:04:29] 10Traffic, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) a:03DAlangi_WMF
[19:27:48] Does anyone know what the pybal alert on lvs1018 is about? I've got a UBN ticket about the wikireplicas that I think I caused and I'm looking for clues about how to resolve: T338172
[19:27:49] T338172: Can't connect to analytics replicas from Toolforge - https://phabricator.wikimedia.org/T338172
[19:36:30] btullis: some pybal maintenance was ongoing
[19:36:48] should clear up soonish otherwise I will check
[19:37:18] brett: can you see if lvs1018 cleared up please? getting to a computer shortly
[19:37:51] sukhe: OK, thanks.
[19:38:40] btullis: still an issue? Theoretically there shouldn't have been any disruption
[19:40:41] brett: Many thanks. The issue seems to have been resolved, but I don't know what caused the problem with wikireplicas so I'm still on-edge.
[19:41:23] The report was opened two hours ago and I only started one hour ago
[19:41:25] This is what I did to switch the wikireplicas-a service from dbproxy1018 to dbproxy1019 - about 30 seconds apart: https://phabricator.wikimedia.org/T315426#8903257
[19:43:11] This has worked previously, although it's not an ideal solution. We had one report of the wikireplicas being down from much earlier in the day, 11:55 UTC - https://phabricator.wikimedia.org/T338172#8903899
[19:46:44] 10Traffic, 10Patch-For-Review: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 (10BCornwall)
[19:46:51] So I don't know whether or not I caused this incident, or whether I was unlucky to have started it during a period of LVS maintenance, or something else.
[19:50:26] btullis: I don't think it was you
[19:50:34] it just happened because lvs1018 had pybal disabled at that time
[19:50:38] (back)
[19:51:35] so just a case of bad timing that lvs1018 was undergoing maintenance at that time, definitely not you :)
[19:52:02] so this issue is resolved now I guess?
[19:52:16] Yeah, they mentioned it was
[19:52:37] ah so the question was what caused it
[19:52:43] yeah that's my theory
[19:57:21] yeah but there's two different pooledness states: there's the etcd-controlled one and health-controlled one
[19:57:47] if there's only two backends (as is the case here I think), and one's auto-depooled for bad healthchecks and you manual-depool the other, you end up in this scenario.
[19:57:56] (or if you etcd-depool both, of course)
[19:58:51] Thanks all so much.
[19:59:20] <3 Makes sense and I feel better. Will try to write it up when I get a moment.
[20:00:01] oh right, we disabled health on these though, so it should be all etcd
[20:01:54] looking at the pybal logs, I can see all the changes, but lvs1018 wasn't being worked on at that time (~16:19->21-ish UTC)
[20:02:15] my guess would be the one you flipped over to wasn't actually working right
[20:02:43] there was another bout of activity ~17:58, which seemed to go back to both-pooled
[20:03:08] and then disabled one of them
[20:03:09] hmmm
[20:03:28] and then at 19:06, we had a pybal restart
[20:04:21] (well, a stop, which does a failover to lvs1020. then later lvs1018 started back up)
[20:05:10] I'm trying to reconstruct a timeline
[20:05:16] but it looks like, from pybal logs:
[20:06:01] : db1018:pooled, db1019:depooled
[20:06:24] <16:19>: db1018:pooled, db1019:pooled
[20:06:42] <16:21>: db1018:depooled, db1019:pooled
[20:07:02] <17:58>: db1018:pooled, db1019:pooled
[20:07:31] <17:58>: db1018:pooled, db1019:depooled (~18 seconds after the earlier change, which takes us back to the initial steady state)
[20:08:02] I would guess that 1019 was never actually able to service users at some other level
[20:09:30] (all times UTC)
[20:10:32] as for lvs1018 operations:
[20:11:27] <13:33>: lvs1018 restarted (quick restart, by me)
[20:12:06] <19:06>: lvs1018 pybal stopped (by brett)
[20:12:26] <19:24>: lvs1018 pybal started (by brett)
[20:12:31] so I don't think it aligns
[20:12:58] I think those were brett anyways, but that's the stop -> start the logs show
[20:14:25] yes brett's
[20:14:36] during which pybal was stopped and puppet disabled
[20:15:02] yeah but that just means lvs1020 was doing the same things for the same service
[20:15:20] right
[20:21:14] I wouldn't rule out pybal state-bugs of course, especially with that diff error showing in icinga
[20:56:52] 10Traffic, 10Patch-For-Review: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 (10BCornwall)
[21:30:14] 10Traffic: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 (10BCornwall) 05In progress→03Resolved
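
(Editor's aside on the "two pooledness states" described at 19:57 above: a backend only receives traffic when the etcd-controlled state and the healthcheck-controlled state both say yes, so with only two backends, manually depooling one while the other is auto-depooled for failing checks leaves the service with nothing to serve from. A minimal toy model of that interaction, with hypothetical names and not pybal's actual code:)

    from dataclasses import dataclass

    @dataclass
    class Backend:
        name: str
        etcd_pooled: bool   # operator-controlled state (confctl/etcd)
        health_up: bool     # healthcheck-controlled state (monitors)

    def live_backends(backends, health_enabled=True):
        """Backends that can actually receive traffic.

        A backend serves only if it is etcd-pooled AND, when health
        checking is enabled, its monitors report it as up.
        """
        return [
            b.name for b in backends
            if b.etcd_pooled and (b.health_up or not health_enabled)
        ]

    # Two-backend service, e.g. a wikireplicas dbproxy pair (illustrative only).
    pool = [
        Backend("dbproxy1018", etcd_pooled=True, health_up=True),
        Backend("dbproxy1019", etcd_pooled=True, health_up=False),  # failing checks
    ]

    # Manually depool the healthy one (the switchover step) and nothing is left:
    pool[0].etcd_pooled = False
    print(live_backends(pool))                        # -> []

    # With health checks disabled (as noted at 20:00:01), only etcd state matters:
    print(live_backends(pool, health_enabled=False))  # -> ['dbproxy1019']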