[05:15:37] serviceops, Platform Team Initiatives (Session Management Service (CDP2)), User-Clarakosi: Package table_properties utility for Debian - https://phabricator.wikimedia.org/T226551 (Aklapper) a:holger.knust→None Removing task assignee due to inactivity, as this open task has been assigned for m...
[05:18:28] serviceops, MediaWiki-Docker, Release-Engineering-Team (Seen), User-brennen: Clarify and document our docker image building process and policies. - https://phabricator.wikimedia.org/T216234 (Aklapper) a:fsero→None Removing task assignee due to inactivity, as this open task has been assign...
[05:19:10] serviceops, SRE, Scap, Goal, User-jijiki: SRE FY2019 Q3:TEC6: First steps towards Canary Deployments - https://phabricator.wikimedia.org/T213156 (Aklapper) a:jijiki→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails s...
[07:34:22] <_joe_> jayme: do you know if we have any reference to the kubernetes pod IPs CIDRs in puppet?
[07:34:48] _joe_: I think we don't
[07:35:55] applies to service IP CIDRs as well
[07:36:23] <_joe_> the context is: effie needs to open postgres on the maps hosts to the tegola pods
[07:36:40] <_joe_> so if we had those CIDR defs somewhere in puppet, that could be done
[07:37:02] my thought is that we could add a network constant for those
[07:38:34] pulling that data from netbox is not possible I guess, right?
[07:39:39] <_joe_> it /should/ be possible, but more importantly
[07:39:49] <_joe_> do we have those CIDRs in deployment-charts, right?
[07:39:58] yeah
[07:40:12] as well as in modules/network/data/data.yaml
[07:40:14] <_joe_> we might just make those parameters we recover from a yaml file we generate from puppet
[07:40:21] <_joe_> oh so it is there
[07:40:30] <_joe_> so it is in puppet already
[07:40:38] yeah, but in a static yaml
[07:40:54] not in hiera I mean
[07:40:54] <_joe_> it's exactly where I needed it to be :P
[07:41:02] jayme: one step at a time, but I see your point
[07:41:58] _joe_: ok network/data/data.yaml helps, I will work something with it after I am back, and use modules/base/templates/firewall/defs.erb
[07:42:26] cool thank you both, that was very helpful
[07:42:30] <_joe_> ack :)
[07:43:01] is that network/data/data.yaml something that is available to every host by default?
[07:43:01] <_joe_> reminder: I'll replay how to figure out what was wrong with apcu at 09:00Z
[07:43:17] <_joe_> jayme: it is to base::firewall IIRC
[07:43:44] ah, sweet
[07:43:50] <_joe_> as it includes the class network::constants
[07:44:10] <_joe_> which starts with
[07:44:12] <_joe_> $network_data = loadyaml("${module_path}/data/data.yaml")
[07:44:55] <_joe_> so you can get the eqiad pod IPs with $network::constants::network_data['private']['private1-kubepods-eqiad']
[07:45:16] <_joe_> err no it's a bit more complex than that sorry
[07:45:41] <_joe_> you use the slice_network_constants function
[07:45:46] no need for a perfect example :) I was just curious
[08:52:42] serviceops, MediaWiki-Docker, Release-Engineering-Team (Seen), User-brennen: Clarify and document our docker image building process and policies. - https://phabricator.wikimedia.org/T216234 (Joe) a:Joe We have basic documentation here https://wikitech.wikimedia.org/wiki/Kubernetes/Images now,...
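Note on the 07:34-07:45 exchange: the goal there is to decide whether a connecting address belongs to the Kubernetes pod networks recorded in modules/network/data/data.yaml. The sketch below is only an illustration in Python, not the Puppet code from the repository. The nesting of NETWORK_DATA follows the lookup _joe_ typed at 07:44:55 (which he then said is more complex in practice), and the "ipv4"/"ipv6" keys, the CIDR values (documentation ranges), and the helper names pod_networks and is_pod_address are all assumptions made up for this example.

```python
import ipaddress

# Hypothetical stand-in for the parsed contents of data.yaml.
# The nesting and the CIDRs are placeholders (RFC 5737 / RFC 3849 ranges),
# not the real values from the puppet repository.
NETWORK_DATA = {
    "private": {
        "private1-kubepods-eqiad": {
            "ipv4": "192.0.2.0/24",
            "ipv6": "2001:db8::/64",
        },
    },
}


def pod_networks(data, realm, name):
    """Return the parsed networks stored under data[realm][name]."""
    entry = data[realm][name]
    return [ipaddress.ip_network(cidr) for cidr in entry.values()]


def is_pod_address(address, networks):
    """Return True if address is inside any network of the same IP version."""
    ip = ipaddress.ip_address(address)
    return any(ip.version == net.version and ip in net for net in networks)
```

Usage would look like `is_pod_address("192.0.2.17", pod_networks(NETWORK_DATA, "private", "private1-kubepods-eqiad"))`; a firewall rule generator could use the same list of networks as its allowed source ranges.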
[08:57:05] _joe_, jayme - there was a question in #sre about an emergency fix for the Growth team, it looks reasonable but if you could check/review and +1/-1 it would be great
[08:57:51] <_joe_> elukey: yes one sec
[08:57:56] <3
[09:01:47] serviceops, MediaWiki-Docker, Release-Engineering-Team (Seen), User-brennen: Clarify and document our docker image building process and policies. - https://phabricator.wikimedia.org/T216234 (Joe) Open→Resolved
[09:01:51] serviceops, Documentation: Missing Documentation for Service Operations - https://phabricator.wikimedia.org/T227306 (Joe)
[09:06:21] _joe_: you already set up a meet for the replay that I've missed?
[09:08:55] <_joe_> jayme: no, I just lost the time because I was reading a gem
[09:09:06] <_joe_> from 2008, in the mediawiki-config repo
[09:09:08] we (jelto & myself) are working together on decom and install new appservers, sharing how to generate mcrouter certs right now etc
[09:09:08] <_joe_> sorry
[09:09:40] <_joe_> mutante: ok, I'd say please stop and we'll meet in 5 minutes
[09:11:45] <_joe_> jayme: if you're curious, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/fonts/+/refs/heads/master/README this is what made me lose sense of time
[09:12:13] <_joe_> I'll let you find out if we're still using ploticus or not
[09:12:17] _joe_: mutante: jelto: effie: https://meet.google.com/kbw-qhuv-kyw
[09:12:20] _joe_: oh? you meant a debug session with Jelto and you? should I see that and join?
[09:12:33] <_joe_> mutante: I pinged you too multiple times :P
[09:12:39] <_joe_> in the past few days
[09:12:43] <_joe_> so yes, including you
[09:14:59] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1267.eqiad.wmnet` - m...
[09:36:05] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Ladsgroup) The main person working on this is Kunal and he was busy with deploying shellbox for Score...
[09:48:42] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1268.eqiad.wmnet` - m...
[09:51:29] hey folks I have a question
[09:51:52] /etc/kubernetes has different perms on apiserver vs kubelet nodes
[09:52:04] - apiserver: 700 kube:kube
[09:52:13] - kubelet node: 755 root:root
[09:52:55] and of course /etc/kubernetes is defined two times so I think we'd need some "standardization" of perms to deploy a kubelet on apiservers
[09:53:32] subdirs of /etc/kubernetes look to have stricter perms on apiserver, so in theory the root:root 755 option might be good
[09:53:46] in case I can file a code review only for it
[10:11:06] idk exactly why that is but I would assume it's not on exact purpose and root:root 755 should be fine. kube should not write to /etc/kubernetes anyways
[10:11:39] I have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/702898/
[10:13:26] there is still a !defined, but I am not sure about a better compromise
[10:14:50] creating it in the debian packages would probably be a clean approach :)
[10:16:21] <_joe_> elukey: let me take a look
[10:17:17] jayme: sure but I would like not to rebuild kubernetes packages for this change :D
[10:17:33] elukey: I know.
I was half joking
[10:18:00] yes yes :)
[10:19:50] I mean, you could add a CR for the debian packaging anyways and we just include it in the next build and then get rid of the puppet code for it
[10:20:24] <_joe_> elukey: review done
[10:20:45] _joe_ thanks, amending
[10:27:03] _joe_ I get a jenkins -1 if I try to require a class into a profile
[10:27:22] <_joe_> oh sorry those are /profiles/ sigh
[10:27:35] one of them
[10:27:37] <_joe_> ok let me look at that whole class hierarchy
[10:29:13] we could probably have it in k8s::kubelet instead
[10:29:33] <_joe_> yeah that's what I was thinking
[10:29:58] but there really are a lot of things depending on it
[10:30:05] like k8s::kubeconfig
[10:30:25] <_joe_> also depends on /etc/kubernetes being present?
[10:30:46] <_joe_> elukey: can I go on and modify your patch a bit?
[10:30:59] yeah, as most of/all(?) configs get written to /etc/kubernetes as well
[10:33:19] _joe_ of course yes
[10:40:24] (running errand will read later)
[10:48:06] <_joe_> elukey: uhhh I ended up modifying a lot of stuff, I will probably limit myself to the basics in your patch :P
[12:54:47] _joe_: so about the kubepod networks, I was discussing with moritz, we could temporarily allow this
[12:55:14] <_joe_> this what? :P
[12:55:32] adding it globally
[12:55:39] and we can reverse it later
[12:56:57] and discuss with alex
[13:07:09] <_joe_> ok, I don't think we need to wait for alex, but I would at least add a note about not ab-using it
[13:07:47] <_joe_> also, do you think we should be adding the ml-serve networks there?
[13:08:09] <_joe_> I would err on the side of "no" as those are separate in functionality
[13:11:48] serviceops, Performance-Team, SRE, Thumbor, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (AntiCompositeNumber)
[13:15:31] _joe_ I ran pcc and it failed, there are some duplicate declarations :(
[13:15:48] on contint and release nodes
[13:15:52] the rest looks good
[13:15:55] <_joe_> elukey: heh maybe I should've gone to fix contint too
[13:16:03] <_joe_> I did so in the followup patch actually
[13:17:13] _joe_ I saw https://gerrit.wikimedia.org/r/c/operations/puppet/+/702912 but I thought it was WIP, jenkins complains about style violations
[13:18:03] <_joe_> yes I think we can just get away with allowing that, but I'll look at the pcc violations first
[13:19:01] ack thanks, sorry for the time waste :(
[13:19:47] the other step that I'll do is avoid base::expose_puppet_certs duplicate declaration (but with some hiera config it should be easier) and then in theory profile::kubernetes::node should be deployable on masters
[13:22:31] serviceops, Wikimedia-Site-requests, Performance Issue: Re-assess which "expensive" query pages are run on cron for Wikimedia sites - https://phabricator.wikimedia.org/T175088 (Umherirrender) >>! In T175088#7123240, @Umherirrender wrote: > * Created T283975 to get a cron for `OrphanedTimedText` > * C...
[13:23:40] serviceops, TimedMediaHandler-TimedText, MW-1.37-notes (1.37.0-wmf.9; 2021-06-07): Create cron job to update query page Special:OrphanedTimedText on wikis with $wgEnableLocalTimedText = true - https://phabricator.wikimedia.org/T283975 (Umherirrender) Stalled→Resolved a:Umherirrender https...
[13:31:39] <_joe_> elukey: can you paste me the pcc link?
[13:32:18] _joe_ https://puppet-compiler.wmflabs.org/compiler1001/30102/
[13:32:27] <_joe_> ty
[13:45:23] <_joe_> elukey: ok now it should be ok
[13:48:21] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Run stress tests on docker images infrastructure - https://phabricator.wikimedia.org/T264209 (JMeybohm)
[13:48:46] running pcc :)
[13:50:41] <_joe_> I already did
[13:50:53] ah! https://puppet-compiler.wmflabs.org/compiler1001/30105/, looks good
[13:50:59] I think we can merge if you are ok!
[13:55:12] jayme: ok for you the changes to /etc/kubernetes?
[13:56:12] elukey: go ahead. Let me know when you've merged and I'll run puppet on a staging master to double check
[14:01:51] I am disabling puppet on all nodes with /etc/kubernetes, then I'll roll it out to see all changes
[14:02:00] okay
[14:04:06] ok all ready if you want to check the first nodes (enable + run puppet)
[14:05:03] ack, give me 5min
[14:17:40] elukey: looks good to me on staging-codfw master and node
[14:17:58] (restarted all kube services to be sure)
[14:18:13] ack will proceed in a bit with the rest
[14:22:31] serviceops, MW-on-K8s, Release Pipeline: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (JMeybohm)
[14:22:59] serviceops, MW-on-K8s, Release Pipeline: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (JMeybohm) p:Triage→High
[14:28:34] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Run stress tests on docker images infrastructure - https://phabricator.wikimedia.org/T264209 (JMeybohm)
[14:31:49] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) @jijiki it would not justify such a huge performance shift, by any measure. I am even veering towards disabling onhost memcached, for the latest discoveries of bad interactions with...
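Note on the /etc/kubernetes rollout (13:55-14:18): checking a node after the puppet run comes down to comparing the directory's mode, owner and group with the expected values. The sketch below is a hypothetical Python helper, not part of the puppet patch; it assumes the target is root:root 755, which is what was suggested at 09:53 and 10:11, and the function names describe_dir and matches are invented for this example.

```python
import grp
import os
import pwd
import stat


def describe_dir(path):
    """Return (octal mode string, owner name, group name) for path.

    Falls back to the numeric id when no passwd/group entry exists.
    """
    st = os.stat(path)
    mode = format(stat.S_IMODE(st.st_mode), "03o")
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return mode, owner, group


def matches(path, mode="755", owner="root", group="root"):
    """Return True if path has exactly the expected mode, owner and group."""
    return describe_dir(path) == (mode, owner, group)
```

On a node, `matches("/etc/kubernetes")` would report whether the directory is at the assumed target, and `describe_dir("/etc/kubernetes")` shows the current values when it is not.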
[15:40:34] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) I created a first basic dashboard for the mwdebug deployment and I noticed what the major issue was immediately: I dedicated just 2k maximum opcache scripts, which bottomed out even...
[16:30:27] serviceops, MW-on-K8s: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 (dancy) Open→Resolved a:dancy
[16:52:42] serviceops, Machine-Learning-Team, SRE, Kubernetes, Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey) Next steps: * refactor how `base::expose_puppet_certs` is used in kubernetes profiles, since if profile::...
[17:27:49] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (wkandek) Is this the dashboard? https://grafana.wikimedia.org/d/U7JT--knk/joe-k8s-mwdebug?viewPanel=70&orgId=1&from=1625227688488&to=1625246654342
[17:39:59] wkandek: yes
[17:40:04] (that's the dashboard)
[18:46:36] oooooooo
[18:49:03] serviceops, SRE, Traffic, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm) It would also be nice if the cookbook could check all services, and then fail if at least one didn't verify...
[19:38:06] serviceops, Technical-blog-posts, Datacenter-Switchover: Story idea for Blog: June 2021 DC Switchover - https://phabricator.wikimedia.org/T286080 (Legoktm)
[23:28:31] serviceops, Wikimedia-Logstash, observability, GitLab (Initialization), User-brennen: Logging for GitLab - https://phabricator.wikimedia.org/T274462 (brennen) Thanks for digging into the layout of this so thoroughly. > What is the "current" file? Is this file different than gitaly_hooks or g...
[23:48:02] serviceops, Parsoid-Tests, SRE, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry)
[23:48:55] serviceops, Parsoid-Tests, SRE, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (ssastry) Resolved→Open Reopening to follow up on the failure to fully serve all the static files. To followup...