[00:13:30] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10colewhite)
[07:04:07] 10serviceops, 10MW-on-K8s: Better naming for mw-on-k8s pods - https://phabricator.wikimedia.org/T325071 (10Joe) Right now the name we give to pods is `$chart-$release`; what we want to use is the namespace name and the cluster name, so, going from `mediawiki-main--` to `main.mw-web.eqiad-
10serviceops, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar)
[07:42:27] 10serviceops, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) p:05Triage→03High
[07:52:35] 10serviceops, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) Another note is we have the git repository flagged as shared between users by setting `core.sharedRepository...
[07:58:17] 10serviceops, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) a:03hashar
[08:52:22] 10serviceops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10elukey) a:03elukey
[09:01:39] 10serviceops, 10Data-Engineering-Radar, 10MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (10Joe) Basically we set `wgIPInfoGeoIP2EnterprisePath` to `/usr/share/GeoIPInfo/`. That directory gets populated on the appserve...
[09:37:47] 10serviceops, 10Data-Engineering-Radar, 10MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (10Clement_Goubert) a:03Clement_Goubert
[09:52:54] doing a short, low-weight thumbor test with a single host
[09:59:50] 10serviceops, 10Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (10JMeybohm) p:05Triage→03Medium
[10:20:19] hnowlan: keep us posted, curious to know what happens
[10:24:48] so far looks okay but it's early days yet
[10:31:24] been watching https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?forceLogin&from=now-30m&orgId=1&to=now&refresh=30s
[10:31:47] depooling for the moment as someone is coming to disconnect my floor's internet, sigh
[10:36:16] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Thanks @Dzahn. We should probably add that final step to the [hardware troubleshooting runbook](https://wikitech.wikimedia.org/wiki/SRE/Dc-ope...
[10:43:49] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Added documentation to avoid forgetting this step, DC-Ops feel free to revert or ask me to move it elsewhere if you feel it shouldn't be there.
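A minimal sketch of the renaming idea from T325071 above (07:04), assuming the workload name is built from Helm release and namespace values; the template expression and the `kubernetesCluster` key are illustrative assumptions, not the actual mw-on-k8s chart helpers:

```yaml
# Sketch only; not the real chart template.
apiVersion: apps/v1
kind: Deployment
metadata:
  # current naming is roughly "$chart-$release", e.g. mediawiki-main-...
  # the proposal is release + namespace (+ cluster), e.g. main.mw-web.eqiad
  name: "{{ .Release.Name }}.{{ .Release.Namespace }}.{{ .Values.kubernetesCluster }}"
```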
[10:55:46] 10serviceops, 10MW-on-K8s: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10Joe) p:05Triage→03High In production, we're running exim as `sendmail`, and the configuration has some bits that look mediawiki-related: ` # Catch unqualified e-mail addresses from MediaWiki unquali...
[11:36:40] 10serviceops, 10MW-on-K8s: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10MatthewVernon) `qualify_domain` defaults to the mail server hostname if not otherwise set (and is the domain applied to senders who don't have a domain part, likewise recipients unless that's specified sep...
[11:58:21] repooled again, will add a few more hosts at the same weight
[11:59:31] looking okay resource-wise also https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=thumbor&var-pod=All&from=now-1h&to=now
[12:06:32] per-format processing time in k8s is significantly higher, but there are a lot of variables at work there between the metal and k8s instances I guess
[12:08:57] How significantly?
[12:09:13] You're getting throttled hard on CPU
[12:11:23] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10hnowlan)
[12:13:24] claime: compare processing time for eqiad and eqiad k8s here https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?forceLogin&from=now-30m&orgId=1&to=now&refresh=30s
[12:14:13] oof
[12:16:20] hnowlan: I think that could be caused by the thumbor pods getting CPU throttled, wdyt?
[12:19:24] <_joe_> it is definitely that
[12:19:47] So, give'em more?
[12:27:09] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868075 :)
[12:27:42] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Ladsgroup) This is not really user-impacting, especially given that mw-on-k8s is on test2wiki only, but I think it should show up in next week's Tech news regardle...
[12:51:59] hnowlan: Yeah it's still getting throttled to hell and back
[12:52:41] Down to -28s for one o_o
[12:53:48] lmao yep
[12:53:55] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10akosiaris) >>! In T290536#8466377, @Ladsgroup wrote: > - Inviting tech users to test out the new infra and let us know of issues early on. A related note. 2 th...
[12:56:24] At this point it's either fewer instances per pod (and more replicas) or increasing the per-pod limits for CPU. Given we're already increasing memory limits to give us slack for costly operations, I lean towards the former. We're currently at 4 per pod; going to 3 per pod would give us room to do 2.5 per pod
[12:57:19] I mean there's no harm in trying
[13:00:19] In any case if you want it to perform relatively close to bare metal, it can't get throttled
[13:00:37] As soon as that kicks in...
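For context on the throttling exchange above: a container gets throttled once its processes exhaust the CFS quota implied by the CPU limit, so the two knobs are the per-pod limit and how many thumbor workers share it. A hedged sketch of the kind of block such a values file carries; the numbers are placeholders, not the ones from change 868075 or 868081:

```yaml
# Illustrative values only; not copied from the thumbor deployment chart.
resources:
  requests:
    cpu: "2"      # placeholder
    memory: 2Gi   # placeholder
  limits:
    cpu: "4"      # throttling starts once the workers collectively exceed this quota
    memory: 4Gi   # placeholder
```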
[13:02:55] Average throttling time is still better than before raising CPU limits, so spreading the load out over more replicas with fewer workers each, at the current CPU limit, makes total sense
[13:02:56] given some of the stuff that happens on the metal thumbor instances, throttling is somewhat unavoidable
[13:03:09] but getting to an acceptable level of throttling is going to be the objective
[13:03:17] Yeah of course, but not 9s average on all pods :D
[13:03:44] getting to an acceptable level of throttling is going to be the objective < Agreed
[13:09:05] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868081 if you'd be so kind. I'll deploy/repool post-lunch
[13:15:08] Done, going away to lunch too
[13:16:52] 10serviceops, 10Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (10akosiaris) Triaging for #serviceops, removing #SRE.
[14:26:43] 10serviceops, 10MW-on-K8s: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10Joe) >>! In T325131#8466245, @MatthewVernon wrote: > `qualify_domain` defaults to the mail server hostname if not otherwise set (and is the domain applied to senders who don't have a domain part, likewise...
[14:59:27] 10serviceops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm) With {T270191} I've changed the zone of k8s ganeti workers to their respective ganeti cluster and g...
[15:00:56] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Migrate charts away from deprecated topology annotations - https://phabricator.wikimedia.org/T325066 (10JMeybohm)
[15:04:35] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[15:05:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Add kubernetes 1.17+ topology annotations - https://phabricator.wikimedia.org/T270191 (10JMeybohm) 05Open→03Resolved This is done (and unfortunately led to {T325056}). I've created {T325066} to follow up on removing the deprecated anno...
[15:17:34] 10serviceops, 10SRE, 10Maps (Maps-data): Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10LSobanski)
[15:22:01] 10serviceops, 10SRE, 10Maps (Maps-data): Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10jijiki) 05Open→03Resolved a:03jijiki Given that this task was opened when the infra was completely different, I am bluntly closing this task. I am happy to re-open if/w...
[15:23:53] 10serviceops, 10Observability-Tracing: Add ingress to aux-k8s - https://phabricator.wikimedia.org/T325178 (10Clement_Goubert)
[15:24:24] 10serviceops, 10Observability-Tracing: Add ingress to aux-k8s - https://phabricator.wikimedia.org/T325178 (10Clement_Goubert) 05Open→03In progress
[15:25:05] 10serviceops, 10Observability-Tracing: Add ingress to aux-k8s - https://phabricator.wikimedia.org/T325178 (10Clement_Goubert) p:05Triage→03Medium
[15:33:22] 10serviceops, 10Observability-Tracing: Add ingress to aux-k8s - https://phabricator.wikimedia.org/T325178 (10Clement_Goubert)
[15:36:04] Is anyone opposed to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/867733, which moves the staging database (Cassandra) for echostore & sessionstore from eqiad to codfw?
[15:37:21] It's codfw & eqiad staging moving from an eqiad database to a codfw one, so it seems like a lateral move, but I figure it can't hurt to ask (maybe it creates additional complexity during switchovers?)
[15:38:53] <_joe_> +1
[15:39:06] <_joe_> it's staging, so no
[15:39:11] <_joe_> no added complexity
[15:39:39] <_joe_> we regularly run with sessionstore in just one dc pooled, too, it's fully active/active (when we schedule the pods, that is ;P)
[15:40:06] :)
[16:04:03] _joe_: I've deployed that to echostore (staging), any idea why writes to the eqiad endpoint (https://staging.svc.eqiad.wmnet:8082/echoseen/v1/) would go to the new cluster, but writes to the codfw one would come back to the old eqiad DB?
[16:04:29] <_joe_> the codfw staging?
[16:04:37] <_joe_> it's not deployed to normally
[16:04:43] https://staging.svc.codfw.wmnet:8082/echoseen/v1/
[16:04:44] oh
[16:04:47] <_joe_> jayme: ^^ (sorry, meeting)
[16:04:59] well that would explain it I guess
[16:05:14] hmm?!
[16:05:46] <_joe_> jayme: urandom deployed echostore to staging, and was wondering why the codfw staging is not showing the new behaviour
[16:06:17] got it
[16:06:28] <_joe_> can you explain how that works?
[16:06:39] <_joe_> I guess he'd want to stop echostore in codfw staging
[16:06:55] urandom: we hide a bit the fact that there are two staging clusters so that we (as in sre) can play around with the staging-codfw one
[16:07:26] you could either deploy to staging-codfw as well, or we stop echostore there - depending on your need
[16:08:19] jayme: what happens in a changeover?
[16:08:40] does "staging" then moved to codfw?
[16:08:45] s/moved/move/
[16:09:08] when we do DC switchover you mean? In that case we usually do not move staging over
[16:09:22] I see
[16:09:54] we do so when we want to upgrade staging-eqiad for example. But in that case we would make sure that the latest revisions of "everything" have been deployed to staging-codfw as well before doing so
[16:10:22] I don't know that it matters then, I guess (shutting down v. deploying). 🤔
[16:10:45] and it's usually more a matter of 1,2 days for staging-codfw to be the "active" staging cluster
[16:10:59] *1-2 :)
[16:12:09] jayme: you mean, codfw will eventually catch up to eqiad?
[16:13:08] yeah. If we plan to switch staging to point to staging-codfw rather than staging-eqiad (temporarily) we will deploy all outstanding updates there as well
[16:13:23] but, as said, we can also do so now for consistency's sake
[16:13:29] oh... so it doesn't otherwise happen routinely
[16:13:33] nope
[16:14:37] how many services are running in both DCs? what criteria was used to choose which do?
[16:15:20] I guess I'm thinking we can just shut these down, but I'm wondering whether I'm missing some benefit
[16:15:38] in prod, all services run in both DCs. In staging it's a matter of when we last switched I guess
[16:15:47] auh
[16:15:54] the main benefit is for the k8s maintainers, not the service maintainers tbh
[16:16:20] because we (as in k8s maintainers) have a cluster to test changes without interrupting staging for service owners
[16:16:26] gotcha; I guess we should just them down then.
[16:16:36] yeessh... shut them down, ofc
[16:17:42] but, let me ask: It's not a "real" problem, is it? I'm asking because it might come back at some point because of a staging cluster switch
[16:18:50] it wouldn't normally be a problem, but I will need to shutdown/decommission the db cluster those services are connected to in eqiad
[16:19:19] ok. that's understood.
Just wanted to make sure we're not creating a snowflake situation here :)
[16:19:57] jayme: on an unrelated note, should helmfile -i apply be timing out?
[16:20:29] no, I don't think so. It will sit there forever waiting for your answer :)
[16:20:31] https://www.irccloud.com/pastebin/7tP9pBxt/
[16:20:56] ah, after you hit "y" - I see
[16:21:01] yes :)
[16:21:28] so, yeahno. Your change has not been deployed (or better: your change has been rolled back) because the new Pods did not become ready
[16:21:46] retrying does not seem to be helping matters
[16:21:58] it probably won't, yes :)
[16:22:20] this is more like a "you probably did something wrong" scenario
[16:22:31] those are my fav
[16:22:41] I would suspect missing network policy egress rules to the new host
[16:22:46] *hosts
[16:25:07] jayme: that seems right. Odd that the one service has such fine-grained access defined when the other does not
[16:26:25] ah, I now see the paste says sessionstore not echostore
[16:27:09] urandom: so echostore worked and sessionstore failed?
[16:27:19] jayme: yes.
[16:27:33] should echostore also be that granular?
[16:28:25] it's just tcp/9042
[16:29:08] probably. Let me check
[16:29:38] urandom: check the networkpolicy map in helmfile.d/services/sessionstore/values.yaml
[16:30:04] which lists all the restbase nodes
[16:30:06] yes
[16:30:36] that's what I mean, it specifies each host & port whereas echostore just the port
[16:31:28] oh, interesting. I wasn't aware. Tbh. we should not globally allow 9042 I guess
[16:31:48] so if you don't object, please add granular rules there as well
[16:32:21] 👍
[16:36:22] cool, thanks!
[16:46:57] 10serviceops, 10MW-on-K8s: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10Joe) The simplest solution I can think of is installing something like ssmtp (or the still-maintained msmtp) that don't need a running daemon for sendmail to work, and configure it to talk to the kubernete...
[16:49:19] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan)
[16:49:25] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: 3d2png failing in Kubernetes - https://phabricator.wikimedia.org/T323936 (10hnowlan) 05Open→03Resolved
[16:58:01] claime: there is a pending DNS diff in netbox that needs to be "merged" with the sre.dns.netbox cookbook about aux-k8s-ingress
[16:58:12] (see also the related icinga alert)
[17:05:02] volans: yes, I'm sorry about that, the dns patch isn't merged yet
[17:05:26] volans: https://gerrit.wikimedia.org/r/c/operations/dns/+/868100
[17:07:06] claime: sure, but the change in netbox is already live, so you need to run the dns cookbook, unfortunately, it's a noop in prod as it's the special corner case of svc
[17:07:16] only the gerrit patch is authoritative in this case... :(
[17:07:27] so you can freely run the cookbook now and then merge gerrit later
[17:07:28] volans: ok, doing
[17:07:53] I just merged gerrit
[17:08:07] sorry for the trouble, that's pending T270071
[17:08:08] ack
[17:14:31] volans: I merged gerrit, ran authdns update, and ran the cookbook
[17:14:58] <3 thanks!
[17:15:39] volans: forcing recheck does not trigger recovery though
[17:17:22] it should recover automatically in a few minutes, the check is async
[17:17:32] NRPE just checks a file content
[17:18:00] ah right ok
[17:18:17] Sorry for the noise
[17:18:26] nah no prob, it's icinga...
not your fault :D
[17:18:42] your expectations were plausible in a normal world :D
[17:19:05] heh
[17:31:54] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10jijiki) 05Open→03Resolved All hosts are in production.
[17:38:39] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10jijiki)
[17:45:21] 10serviceops, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10Southparkfan) [[ https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster | Standalon...
[17:51:56] _joe_: I was looking at adding hourly httpbb runs against mw-on-k8s, in addition to the hourly run against the canary appserver -- do you think I should just point the traffic at mw-web.discovery.wmnet, or is there something smarter I can do?
[17:53:18] <_joe_> rzl: either that or mwdebug.discovery.wmnet which is the mw-debug deployment
[17:53:52] okay cool
[17:53:58] Yeah both should work
[18:03:14] 10serviceops, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) For WMCS Standalone puppetmasters I am not sure how it should behave. I have discove...
[18:08:56] 10serviceops, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10taavi) Sort of related for standalone puppetmasters: {T152059}
[18:10:04] 10serviceops, 10MW-on-K8s: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10vaughnwalters) Another example of a place this is used, in the Campaign Events extension: navigate to https://test2.wikipedia.org/wiki/Event:K8testevent click Register for event click Register email shoul...
[20:38:02] 10serviceops, 10Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (10sbassett) >>! In T325147#8466272, @STran wrote: > - afaik, the scope of security-api has changed (for now). Whatever's being implemented is for IPInfo's specific use case. Yes and...
[20:42:14] 10serviceops, 10Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (10sbassett) > @SCherukuwada asked if this had to use a relational database and I don't have a good answer. @sbassett originally specced the schemas and might have a better informed answe...
[20:50:16] 10serviceops: Evaluate out redis_misc cluster - https://phabricator.wikimedia.org/T325243 (10jijiki)
[20:50:29] 10serviceops: Evaluate out redis_misc cluster - https://phabricator.wikimedia.org/T325243 (10jijiki) p:05Triage→03Low
[20:59:08] 10serviceops, 10MW-on-K8s: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10jhathaway) I think @Joe’s suggestion of using msmtp or another sendmail compatible client makes sense for short term solves. Are there any avenues for user injected data into the mail() and subsequent send...
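On the msmtp suggestion in T325131 (discussed at 16:46 and 20:59 above): a hypothetical sketch of shipping a sendmail-compatible client configuration into the container via a ConfigMap; the object name, smarthost, and sender below are placeholders, not the actual mail setup:

```yaml
# Hypothetical sketch only; hosts and names are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: msmtp-config
data:
  msmtprc: |
    # msmtp provides the sendmail entry point MediaWiki's mail path expects,
    # relaying to a smarthost instead of running a local exim daemon.
    account default
    host smtp.example.wmnet   # placeholder smarthost
    port 25
    from wiki@example.org     # placeholder envelope sender
    syslog on
```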
[21:00:41] 10serviceops, 10Cloud-Services: cloudweb hosts are using the profile::mediawiki::nutcracker profile to configure nutcracker - https://phabricator.wikimedia.org/T325244 (10jijiki)
[21:12:53] 10serviceops, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): cloudweb hosts are using the profile::mediawiki::nutcracker profile to configure nutcracker - https://phabricator.wikimedia.org/T325244 (10bd808)
[21:18:26] 10serviceops, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): cloudweb hosts are using the profile::mediawiki::nutcracker profile to configure nutcracker - https://phabricator.wikimedia.org/T325244 (10bd808) It looks like we also have mcrouter on the same hosts since {rOPUP971912ae9d9713eb9c592cf82...
[21:50:06] 10serviceops, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): cloudweb hosts are using the profile::mediawiki::nutcracker profile to configure nutcracker - https://phabricator.wikimedia.org/T325244 (10jijiki) That sounds alright, but if wikitech is still using redis for sessions (via nutcracker), t...
[23:37:46] 10serviceops, 10Data-Engineering-Radar, 10MW-on-K8s, 10Patch-For-Review: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (10Dzahn) >>! In T288375#8465821, @Joe wrote: >...`/usr/share/GeoIPInfo/`. That directory gets populated on...
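On the IPInfo/MaxMind dependency in T288375 (last message above): one possible approach, not necessarily the one the task will settle on, would be mounting the GeoIP directory into the pod from the node; a sketch under that assumption, with the container and volume names made up for illustration:

```yaml
# Hypothetical illustration; T288375 may choose a different mechanism.
spec:
  containers:
    - name: mediawiki                       # placeholder container name
      volumeMounts:
        - name: geoip-info
          mountPath: /usr/share/GeoIPInfo   # path wgIPInfoGeoIP2EnterprisePath points at
          readOnly: true
  volumes:
    - name: geoip-info
      hostPath:
        path: /usr/share/GeoIPInfo          # populated on the node, as on the appservers
        type: Directory
```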