[04:33:37] serviceops: Productionise thumbor1005, thumbor1006, thumbor2005 and thumbor2006 - https://phabricator.wikimedia.org/T285477 (jijiki)
[04:33:55] serviceops, decommission-hardware: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (jijiki)
[04:34:02] serviceops, decommission-hardware: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (jijiki)
[04:36:08] serviceops, ChangeProp, SRE, SCB, and 2 others: Memory consumption in Redis 3.2 vs Redis 2.8 - https://phabricator.wikimedia.org/T209890 (jijiki) Open→Declined Bluntly closing this, no updates/findings for a long time
[05:36:22] serviceops: install racktables on miscweb2002 - https://phabricator.wikimedia.org/T269746 (Marostegui) >>! In T269746#7314773, @Dzahn wrote: > @Kormat @marostegui Per Jaime's comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/715233 I am pinging you guys on the ticket to let you know about chan...
[05:50:54] serviceops, MW-on-K8s, Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (jijiki) We have undeployed mw in staging, so for the time being we could mark this as resolved. In the future, if we deploy to staging again, a bandai...
[06:57:09] serviceops, SRE, Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (fgiunchedi) Essentially a puppet setting yes, `rsync::server::wrap_with_stunnel` for the server bits and then e.g. `rsync::quickdatacopy` has the option to turn on ssl on the...
[07:02:45] serviceops, SRE Observability (FY2021/2022-Q1), User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (fgiunchedi) >>! In T210137#7315629, @colewhite wrote: >>>! In T210137#7314353, @fgiunchedi wrote: >> Ok I have a working patch to parse `omfwd` messages a...
[07:11:04] serviceops, GitLab (Initialization): GitLab Puma reduced availability due to automated restart - https://phabricator.wikimedia.org/T289454 (fgiunchedi) Thank you for taking a look @Jelto ! I think we can fix this properly once the job availability alert is in alertmanager, by e.g. being a little more tol...
[07:17:56] hello folks, I am looking into calico's GlobalNetworkPolicies for ml-serve (ah ah ah I know it will be fun) and as a starting step I am taking a look at what we use for the main clusters
[07:18:26] and I am a little confused by
[07:18:28] > # This allows egress from all pods to all pods. Ingress still needs to be allowed by the destination, though.
[07:18:46] where are the Ingress rules defined??
[07:18:54] if any
[07:19:56] Luca, there is a networkpolicy.yaml file for each service, please drink more coffee
[07:21:20] at this point it may be more convenient to create the "ml-services" or whatever structure under helmfile.d
[07:50:44] serviceops, SRE, Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (fgiunchedi) p:Triage→Medium
[07:50:51] serviceops, SRE, Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (fgiunchedi) p:Triage→Medium
[07:55:13] serviceops, Prod-Kubernetes, SRE, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[07:56:30] serviceops, Prod-Kubernetes, SRE, Kubernetes, Patch-For-Review: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (JMeybohm) Open→Resolved
[08:02:07] serviceops, MW-on-K8s, SRE, SRE Observability: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (fgiunchedi) p:Triage→Medium
[08:03:48] serviceops, Analytics, Analytics-Kanban, Prod-Kubernetes, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) >>! In T255871#7261361, @Ottomata wrote: > I think that will do it. helm template looks good locally. > > @JMeybohm is it ok tha...
[08:04:43] serviceops, Anti-Harassment, IP Info, SRE: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (fgiunchedi) p:Triage→Medium
[08:05:10] serviceops, Prod-Kubernetes, SRE, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:16] serviceops, Prod-Kubernetes, SRE, Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (JMeybohm) Open→Resolved a:JMeybohm Removing the non-TLS k8s service will be handled via T255871
[08:05:22] serviceops, Prod-Kubernetes, SRE, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:32] serviceops, Prod-Kubernetes, SRE, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:44] serviceops, Prod-Kubernetes, SRE, Kubernetes: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (JMeybohm) Open→Resolved Removing the non-TLS k8s service will be handled via T255871
[08:06:02] serviceops, Prod-Kubernetes, SRE, Kubernetes: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (JMeybohm) Open→Resolved Removing the non-TLS k8s service will be handled via T255871
[08:07:53] serviceops, MediaWiki-Uploading, SRE, Traffic, Wikimedia-production-error: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (fgiunchedi) p:Triage→Medium
[08:07:59] serviceops, SRE, docker-pkg: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (fgiunchedi) p:Triage→Medium
[08:57:07] serviceops: install racktables on miscweb2002 - https://phabricator.wikimedia.org/T269746 (Kormat) >>! In T269746#7317485, @Marostegui wrote: > They do share the same password, however there should not be any cross-dc queries and if they cannot be avoided by any means, they do need to be done via SSL. **N.B...
[09:06:00] serviceops: install racktables on miscweb2002 - https://phabricator.wikimedia.org/T269746 (Marostegui) Very good point, thanks @Kormat!
[09:14:09] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 (JMeybohm) Open→Resolved This is done
[09:14:35] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm) a:JMeybohm→Jelto
[09:19:49] serviceops, Prod-Kubernetes, Kubernetes: Use a separate key for service account token issuer - https://phabricator.wikimedia.org/T275026 (JMeybohm)
[09:23:40] serviceops, Prod-Kubernetes, Kubernetes: envoy service proxy: Add networkpolicy egress rule for enabled listeners - https://phabricator.wikimedia.org/T264076 (JMeybohm)
[09:23:49] serviceops, Analytics, Event-Platform, SRE, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (JMeybohm)
[09:29:12] serviceops, Prod-Kubernetes, Kubernetes: envoy service proxy: Add networkpolicy egress rule for enabled listeners - https://phabricator.wikimedia.org/T264076 (JMeybohm) While not exactly a duplicate, this has been implemented as part of T253058 already.
[12:34:19] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, and 2 others: Deploy Flink (rdf-streaming-updater) to kubernetes (k8s) - https://phabricator.wikimedia.org/T264006 (Gehel)
[12:35:22] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, and 2 others: Deploy Flink (rdf-streaming-updater) to kubernetes (k8s) - https://phabricator.wikimedia.org/T264006 (Gehel) Open→Resolved a:Gehel
[15:01:30] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (akosiaris) Thanks for this writeup Let me start by saying that of the 3 solutions, the basic idea of the 3rd one should be the one that we aim for in the long run (but not now). Deploym...
[15:35:58] akosiaris: jelto: If you have another 10 min after the meeting we could have a short discussion about the rbac/helm3 thing to unblock
[15:36:14] sure
[15:36:19] ack
[16:39:07] serviceops, Kubernetes: Evaluate and enable audit logging for kubeapi-server - https://phabricator.wikimedia.org/T290020 (Jelto)
[16:39:15] serviceops, Kubernetes: Evaluate and enable audit logging for kubeapi-server - https://phabricator.wikimedia.org/T290020 (Jelto) p:Triage→Low
[16:43:11] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (elukey) Adding a comment in here since I am trying to figure out a similar thing (although I have way less context) for what we'll probably call `ml-services` dir under `helmfile.d` (see...
[16:45:07] serviceops, Prod-Kubernetes, Kubernetes: Evaluate and enable audit logging for kubeapi-server - https://phabricator.wikimedia.org/T290020 (JMeybohm)
[18:06:06] serviceops, MW-on-K8s, Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (dancy) >>! In T284628#7317547, @jijiki wrote: > We have undeployed mw in staging, so for the time being we could mark this as resolved. In the future, if we deploy to staging again,...
[18:42:00] serviceops, MW-on-K8s, Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (Legoktm) We discussed a few different options during the ServiceOps meeting today: * Do testing on production clusters which have SSDs - prefer not to...
[19:25:23] serviceops, Wikimedia-Site-requests, Technical-Debt, User-Majavah: Split search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 (Jdforrester-WMF) [[https://integration.wikimedia.org/ci/job/apple-search-pipeline-publish/1/console]]: `'pu...
[20:35:41] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-8), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (ldelench_wmf)
[20:42:38] serviceops, Analytics, Analytics-Kanban, Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) @JMeybohm, I merged that and am trying to apply for eventgate-logging-external staging. Diff looks good: ` 20:23:31 [@deploy1002...
[21:21:50] serviceops, GitLab, Release-Engineering-Team (Next), User-brennen: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802 (brennen)
[21:22:21] serviceops, MW-on-K8s, Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (dancy) >>! In T284628#7319705, @Legoktm wrote: > A timer is simpler, we could just pull `:latest` every minute or so. I'm not exactly sure how CI woul...