[08:22:28] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review: Run stress tests on docker images infrastructure - https://phabricator.wikimedia.org/T264209 (10JMeybohm) 05Open→03Resolved I'm boldly closing this.
[08:32:11] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn)
[08:44:52] If somebody has a couple of minutes, I'd appreciate a reading of https://wikitech.wikimedia.org/wiki/Dragonfly
[08:53:15] jayme: I like it, clear and detailed
[08:53:46] one little question - I didn't know about the systemctl revert command, but I am wondering if it could potentially get stale in the future
[08:54:11] say that we apply other changes to the docker unit other than the one for dragonfly (if that is possible)
[08:54:36] jayme: looks good to me
[08:57:34] elukey: I'm not sure about that as well, and I have not figured out how to revert a specific override. The puppet code creates a file puppet-override.conf which looks to me as if every override will end up in there, though
[08:57:42] thanks for reading
[08:57:59] it would make sense to move away from the DOMAIN_NETWORKS srange for supernode/8002 (to only allow connections from the current site, so either codfw/eqiad) to also prevent unencrypted cross-DC connections on the ferm level
[08:58:26] but that's not straightforward with our current constants and would need some more work first
[08:59:42] I also thought about limiting access to the peers of the network. But that would require querying the puppet db, which is not nice. Limiting to a DC network sounds better, though
[09:00:54] we could add a variant of DOMAIN_NETWORKS derived from network constants which also includes the DC e.g.
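Returning to the systemctl revert question above: the puppet-managed override is a systemd drop-in, and removing just that file undoes only the dragonfly change. A minimal sketch of what the drop-in might contain (the proxy address/port is an assumption for illustration, not taken from the chat):

```ini
# /etc/systemd/system/docker.service.d/puppet-override.conf (illustrative)
# Deleting this file, running `systemctl daemon-reload`, and restarting
# docker undoes only this override; `systemctl revert docker.service`
# would instead drop *every* local override of the unit.
[Service]
Environment="HTTPS_PROXY=http://127.0.0.1:65001"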
[09:01:17] elukey: with "systemctl revert" we'd revert to the state of docker.service as shipped in the deb
[09:02:05] and class dragonfly::dfdaemon currently only adds an HTTP_PROXY environment to the service, so that should be robust
[09:03:05] moritzm: yes yes, what I am wondering is if that command executed in $months could be robust as well, namely if there is the use case of any other overrides added elsewhere that could be potentially reverted with systemctl revert. If this is not the case I am fine with the command :)
[09:04:25] I think elukey got a point there. I could change the docs to say to just remove the HTTPS_PROXY env from /etc/systemd/system/docker.service.d/puppet-override.conf, reload the systemd daemon and restart docker
[09:04:41] systemd::unit only allows for a single override, it's hardcoded in puppet based on the name of the systemd unit, so we shouldn't run into any issues there
[09:05:11] until someone fixes that limitation :D
[09:05:20] jayme: boldly made some minor edits/added links: https://wikitech.wikimedia.org/w/index.php?title=Dragonfly&type=revision&diff=1921726&oldid=1921723
[09:05:35] mutante: just saw that, thanks
[09:05:56] let's assume it's not a limitation, but rather a design choice to prevent issues like this :-)
[09:06:38] but regardless, it makes sense to explain what makes docker use the dfdaemon (the HTTPS_PROXY env) for anyone debugging issues
[09:06:53] instead of just mentioning the revert
[09:09:01] ack
[09:11:53] done
[09:12:25] perfect, thanks for the clarification :)
[09:26:44] let me know when ml-serve wants to join the P2P fun :)
[09:28:23] jayme: we have very light docker images, I don't think we'll need it :P
[09:28:53] apart from your 600MB sidecars you mean :-P
[09:29:09] :D :D
[09:29:28] jokes aside, as soon as we reach a "stable" state we'll surely add it
[09:31:03] do if it makes sense. I did not do any testing to try to figure out at which point it does, though. I think it only really makes sense if you have larger layers + deployments with a bunch of replicas + a lot of nodes
[09:41:56] there are 44 lingering "xz -d -c -q" processes on deneb, some dating back to May, all happening at 00:00
[09:42:11] this is when docker-report runs via the systemd timer, does that ring a bell to anyone?
[09:51:11] hm. not here
[09:52:13] <_joe_> not here either...
[09:56:15] moritzm: it looks like you're not the only one to wonder about that https://stackoverflow.com/questions/61869808/why-service-docker-status-has-multiple-process-runing-xz-d-c-q
[09:57:01] (the question was removed from SO, but there was no answer anyways according to the version of the page cached by google)
[09:57:11] ah
[09:57:51] they're all in the /system.slice/docker.service cgroup BTW
[09:57:57] see systemctl status docker
[09:59:17] I'll file a task later
[10:05:29] 10serviceops, 10MW-on-K8s, 10SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) a:03jijiki
[10:08:18] 10serviceops, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (10Zbyszko) a:03Zbyszko
[10:39:02] 10serviceops, 10DBA, 10Toolhub, 10database-backups: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10Marostegui) @bd808 the original launch date 12th is still on? Should we expect traffic on this database from tomorrow?
[10:49:45] <_joe_> jayme: somehow kubernetes10001 has disk-type=ssd
[10:49:48] <_joe_> but that's a lie :)
[10:59:53] kubernetes10001 is a lie on its own :p
[11:02:37] hm, maybe I made a mistake when setting the labels
[11:03:03] there is a chance that 1002 is reporting the same thing :/
[11:05:28] there is a chance that every node that is not ganeti has the wrong label.
[11:05:32] I'll check
[11:08:47] yeah...so. That did absolutely not work :D
[11:20:14] 10serviceops, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (10Zbyszko) The most important parameter to streaming updater is related to storage - we have a huge surplus of computing resour...
[11:46:07] _joe_, effie: also the *017 SSDs no longer have SSD in their model name :/
[11:55:52] <_joe_> jayme: ok we will create a better fact
[11:55:55] _joe_, effie: I fixed the labels manually and merged a code fix
[11:56:10] _joe_: yeah, we definitely need to
[11:56:29] for now it should be fine (as kubelet does not update the labels)
[12:10:38] _joe_: do you know if it is possible to add something to core facts (disks)? Or do we have to come up with a new fact hash for that?
[12:11:57] <_joe_> I think it's possible but I have to look it up
[12:12:31] I'm trying to, but I have kind of a bad google mojo regarding puppet stuff :)
[15:31:58] jayme: the dragonfly page is nice, the suggestion I have is to link to your benchmarks/testing and provide some basic figures on how much better/faster the P2P network is than everything pulling from the registries directly
[15:40:17] 10serviceops, 10MW-on-K8s, 10SRE, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10dancy)
[15:40:39] 10serviceops, 10MW-on-K8s, 10SRE: Add conditional to mediawiki-config for stuff running on kubernetes - https://phabricator.wikimedia.org/T284418 (10dancy) 05Open→03Resolved
[15:54:30] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) Thank you all for this work!
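Since model-name matching broke for the *017 SSDs, a more robust "better fact" could read the kernel's per-device rotational flag instead: Linux exposes /sys/block/<dev>/queue/rotational, where 0 means SSD/NVMe and 1 means a spinning disk. A minimal sketch of that mapping (the function name and the use of sda are illustrative, not from the chat):

```python
# Sketch: map the kernel's "rotational" flag to a disk-type label.
# On a real node the flag would come from something like
# open("/sys/block/sda/queue/rotational").read().
def disk_type(rotational_flag: str) -> str:
    return "ssd" if rotational_flag.strip() == "0" else "hdd"

print(disk_type("0\n"))  # SSDs/NVMe report 0
print(disk_type("1\n"))  # spinning disks report 1
```

This sidesteps vendor model strings entirely, at the cost of having to pick which block device to inspect.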
[16:04:55] 10serviceops, 10GitLab, 10Release-Engineering-Team, 10User-brennen: GitLab patch release: 13.12.10: Resolves "Username ending with MIME type format is not allowed" errors - https://phabricator.wikimedia.org/T288631 (10brennen)
[17:14:56] 10serviceops, 10DBA, 10Toolhub, 10database-backups: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10bd808) >>! In T271480#7275101, @Marostegui wrote: > @bd808 the original launch date 12th is still on? Should we expect traffic on this database from tomorrow? I do not e...
[17:16:54] It seems self-evident that Toolhub will not deploy into the production Kubernetes cluster tomorrow. I really need help from someone with authority to make decisions about things in the k8s cluster to help push over the finish line for T280881.
[17:17:13] * bd808 keeps forgetting that stashbot is not here
[17:21:53] 10serviceops, 10SRE, 10Services, 10Toolhub, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808)
[17:34:47] bd808: If the db / other resources are set up, I think we could get it into the staging cluster today
[17:35:00] let me at least create your k8s accounts, namespaces
[17:38:10] Getting into staging would be pretty awesome, but rushing that today is not necessary. I need to switch focus to getting the 5 Wikimania talks ready, in all honesty.
[17:39:04] * legoktm nods
[17:39:08] I was deluding myself right up to the end of work last night that this might still all happen to "plan", but it would take hero mode from multiple people now and that is just not fair.
[17:39:51] as for someone with authority making decisions, you already got a +1 from j.oe which is pretty much all you need IMO
[17:41:19] I think the main "obstacle" left, if everything else is ready, is getting the LVS setup; that requires pairing with someone from Traffic, and I've never set up a public LVS endpoint before
[17:44:05] *nod* I've seen it done from the watching-irc-discussions side, but never done more than that myself. There also still seem to be some questions about whether or not egress restrictions will be applied to the namespace and if a nutcracker sidecar is needed, from things _j.oe_ has said in passing.
[17:46:07] legoktm: why does it need a public lvs endpoint? It'll be behind the cdn layer afaics
[17:46:40] er, that's what I meant
[17:47:12] https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service where it talks about public-facing
[17:49:05] if I'm using the wrong terminology, I hope it underscores that I've never done it before :p
[17:50:14] majavah: as far as I understand the issue, the k8s cluster does not have a bare-metal ingress solution (like the nginx-ingress we have in Toolforge) and the LVS setup is needed to serve that function (routing from varnish into the k8s cluster to the Toolhub Service object).
[17:51:37] my understanding is that the k8s service needs an internal/"low-traffic" endpoint, not an external one with a public ip, which is what a "public LVS endpoint" would in my view be
[17:54:42] I think you're right
[17:55:23] looking at hieradata/common/service.yaml, all the publicly exposed services are still low-traffic
[17:58:37] you want to have toolhub.wikimedia.org point to text-lb (so a cname to dyna., I think), and then ats can be told to route requests with that host header to toolhub.discovery.wmnet with private addresses
[18:02:12] makes sense
[18:02:18] I got tripped up by "public-facing"
[18:04:15] *nod* sounds right to me too based on what I'm seeing in the wikitech page.
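The DNS side of "point toolhub.wikimedia.org at text-lb" is typically a one-line change in the operations/dns repo. A sketch, under the assumption (made in the chat itself) that it is a CNAME to the dyna geo-routed name; the exact record type and TTL depend on the zone's conventions:

```
; wikimedia.org zone fragment (illustrative; record syntax assumed)
toolhub    600    IN CNAME    dyna.wikimedia.org.
```

With that in place, ATS at the edge routes requests carrying the toolhub.wikimedia.org Host header to toolhub.discovery.wmnet on the private network.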
[18:05:34] legoktm: https://gerrit.wikimedia.org/r/c/operations/dns/+/711637/
[18:07:03] seems too easy :p
[18:07:08] but thank you :)
[18:07:52] also, you'll still want toolhub.wikimedia.org (the public name) on the tls certificates used by envoy in kubernetes, even if it's accessed via toolhub.discovery.wmnet, otherwise ats will not like them
[18:10:38] * legoktm nods
[18:11:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/711648/ does the ats routing bit, but you'll want to set up the discovery name first
[18:13:23] <3
[18:13:46] bd808: does toolhub emit the correct cache-control headers to be cached in varnish?
[18:14:15] also, I'll need the values of DB_PASSWORD and WIKIMEDIA_OAUTH2_SECRET to put in private puppet
[18:14:31] assuming I can randomly generate any value for DJANGO_SECRET_KEY
[18:16:28] legoktm: It emits a "Vary: accept-language, cookie" header, which I think is what varnish will need.
[18:21:12] legoktm: yes, DJANGO_SECRET_KEY is any random string. The DB_PASSWORD is in mwmaint1002.eqiad.wmnet:/home/bd808/toolhub (the "toolhub" value). I have not requested the OAuth grant yet, but can do that today and get the secrets.
[18:22:04] ack
[18:22:35] how long should the secret key be?
[18:28:07] for reference, I'm generating a cert with: alt_names: ['toolhub.wikimedia.org', 'toolhub.discovery.wmnet', 'toolhub.svc.codfw.wmnet', 'toolhub.svc.eqiad.wmnet']
[18:28:13] majavah: ^
[18:28:27] lgtm
[18:28:59] legoktm: most things I'm seeing in random guidance on the web seem to say 50 chars.
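A 50-character key of that shape can be generated with the standard library alone. This sketch mirrors the alphabet Django's own get_random_secret_key() helper uses (django.core.management.utils); the function name here is illustrative:

```python
import secrets

# Alphabet used by Django's get_random_secret_key() helper.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_=+)"

def random_secret_key(length: int = 50) -> str:
    """Return a cryptographically random key, 50 chars by default."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(len(random_secret_key()))  # 50
```

secrets.choice draws from the OS CSPRNG, which is what you want for a key like this; random.choice would not be suitable.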
[18:29:28] ok
[18:30:21] that also matches the horrible "nobody set the key" fallback that Django can generate internally -- https://github.com/django/django/blob/main/django/core/management/utils.py#L77-L82
[18:31:23] * bd808 sneaks out to find food
[18:38:16] ok, you should be able to check the secrets at deploy1002:/etc/helmfile-defaults/private/toolhub/eqiad.yaml
[18:38:45] unless I missed something you should be able to deploy to staging/codfw/eqiad clusters
[18:39:46] 10serviceops, 10SRE, 10Services, 10Toolhub, and 2 others: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10Legoktm)
[18:42:25] 10serviceops, 10SRE, 10Services, 10Toolhub, and 2 others: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10Legoktm) We're still missing the OAuth2 key/secret, but otherwise I think it should be possible to deploy to the staging/eqiad/codfw clusters now once the helmfile.d part is...
[19:25:16] legoktm: <3 thank you
[19:45:12] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Cmjohnson) mw1267-mw1268 decom'd and removed
[19:51:39] <_joe_> legoktm: so you don't need to set up a public endpoint for toolhub, I imagine we'd need to add configuration to the edge though
[20:33:53] yep
[20:39:40] bd808: is it OK if I assign port 4011 to toolhub on https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports ? I need to include a port in the LVS config
[21:17:58] legoktm: sure. And then we need to adjust that port in the charts correct?
[21:18:59] yes, or set it properly in the helmfile.d/ values
[21:54:27] bd808: I posted the patches for adding toolhub to LVS (and rebased majavah's on top of them), once toolhub is deployed to the codfw/eqiad clusters, you can ask someone from ServiceOps/Traffic to roll it out for you
[21:55:20] \o/ thank you for the flurry of attention to this!
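The private values file on deploy1002 presumably carries the secrets discussed earlier. A purely illustrative sketch of its shape; only the three variable names come from the chat, the surrounding key layout is an assumption:

```yaml
# /etc/helmfile-defaults/private/toolhub/eqiad.yaml (illustrative layout)
main_app:
  DB_PASSWORD: "..."
  DJANGO_SECRET_KEY: "..."
  WIKIMEDIA_OAUTH2_SECRET: "..."
```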
[21:56:06] :)) very excited to see this go live
[23:01:03] https://grafana.wikimedia.org/d/IjzWoqG7k/score?orgId=1 <-- shows how often Score needs to shell out via Shellbox (spoiler: not very often)