[00:57:04] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) @jijiki @Dzahn this is all ready for service Thank you. [05:34:35] 10serviceops, 10Analytics-Radar, 10WikimediaDebug, 10observability, and 4 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki) @Krinkle I agree that we should come up with a complete solution for this. I will close this task and we can continue this discussion... [05:36:33] 10serviceops, 10Analytics-Radar, 10WikimediaDebug, 10observability, and 4 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki) 05Open→03Resolved p:05Triage→03Medium [05:43:56] 10serviceops, 10Analytics-Radar, 10WikimediaDebug, 10observability, and 4 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10Joe) For the record, the mwdebug cluster on kubernetes has its own servergroup. [05:46:23] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Next): Provide an mwdebug functionality on kubernetes - https://phabricator.wikimedia.org/T276994 (10jijiki) p:05Triage→03Medium [06:43:05] 10serviceops, 10MW-on-K8s, 10SRE: Repartition mediawiki servers - https://phabricator.wikimedia.org/T291918 (10Joe) p:05Triage→03High I think the title is misleading, I spent 10 minutes trying to figure out what partitioning schemes had to do with moving to kubernetes :D Amending it. [06:43:37] 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) [07:18:12] 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) The first scenario I proposed in T290536 goes as follows: * One cluster for first deploy/debug purposes (kube-mwdebug) * One cluster to serve internal requests to t... 
[07:49:58] 10serviceops, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 (10hashar) The servergroup comes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/546448 which set the environment variable in Apache.... [07:58:55] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Volans) >>! In T290190#7386065, @Papaul wrote: > @Volans I was able to get thumbor2005 installed without adding the MAC address but the install failed a... [08:15:33] 10serviceops, 10Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (10Mvolz) >>! In T291707#7383868, @akosiaris wrote: >>>! In T291707#7383407, @Mvolz wrote: >> The PDF connection might be a red herring because although that's what happened in the past, attemptin... [08:19:30] 10serviceops, 10Release-Engineering-Team, 10Scap: scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 (10hashar) a:03hashar [08:29:55] 10serviceops, 10Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (10akosiaris) >>! In T291707#7387047, @Mvolz wrote: >>>! In T291707#7383868, @akosiaris wrote: >>>>! In T291707#7383407, @Mvolz wrote: >>> The PDF connection might be a red herring because althoug... 
[08:37:04] I haven't written any puppet in ages, but here is my latest crazy idea: add a `canary: true` field to mediawiki logs originating from mediawiki servers [08:37:35] the ultimate use is to have a Kibana dashboard that only lists errors originating from canaries, which we can point to if scap aborts due to errors on canaries [08:38:03] the series of patches adds a `CANARY` env variable much like `SERVERGROUP`: https://gerrit.wikimedia.org/r/q/bug:T291870 [08:41:11] <_joe_> hashar: I don't think adding a field to logs really works for logstash indexing without further work [08:41:30] <_joe_> but I'd rather modify the way we treat SERVERGROUP, which is something I wanted to do anyways [08:42:13] I talked quickly about it with Filippo, it seems the log stack will be able to detect that `extra.canary` is a boolean and index it as such [08:42:45] <_joe_> yeah I'm saying we should have an easier way than that using the servergroup variable [08:42:47] I thought about introducing some new servergroup such as jobrunner_canaries, but for most purposes they are just jobrunners and should share the same servergroup [08:42:55] hence why I added an extra field [08:43:18] <_joe_> hashar: I think the better approach is to tweak management of SERVERGROUP in mediawiki-config [08:43:28] which seems straightforward and with little impact.
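[editor's note] The conditional-injection idea hashar describes above (add the field only on canary hosts, keyed off a `CANARY` env variable, so ordinary records stay untouched) can be sketched roughly as follows. This is an illustrative Python sketch, not MediaWiki's actual (PHP) logging pipeline; the function name and record shape are assumptions.

```python
import os


def add_canary_field(record: dict) -> dict:
    """Hypothetical log processor: inject a boolean `extra.canary` field
    only when the CANARY environment variable marks this host as a canary,
    so non-canary records carry no extra field at all."""
    if os.environ.get("CANARY") == "1":
        record.setdefault("extra", {})["canary"] = True
    return record
```

A Kibana filter such as `extra.canary: true` would then isolate canary-only errors, which is the dashboard use case described above.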
I even made it so it is only injected when the host is a canary, to avoid adding that field to every single record [08:43:49] well I explicitly did not want to mess with servergroup ;) [08:44:06] <_joe_> oh we are messing with it for kubernetes anyways [08:44:17] else we would have to adjust any logic doing stuff like SERVERGROUP==='appserver' to also OR SERVERGROUP==='appserver_canary' [08:44:53] 10serviceops, 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10WMDE-Fisch) [08:45:48] and for k8s I don't have the use case for scap, I have no idea how we batch deployments to k8s [08:46:23] <_joe_> hashar: let me take the time to try to show my idea for handling SERVERGROUP [08:46:47] I explicitly do not want to touch servergroup! :] [08:49:44] then maybe it is not that complicated to expand servergroup for canaries, I just felt it was easier to add a dedicated field for my purpose [09:31:52] --- [09:32:24] hello folks, I am trying to add a new cluster role 'deploy-kserve' (like the flink one), does https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724448/ make sense? [09:32:37] (or any variation of it, just to keep things separate between clusters) [09:56:23] 10serviceops, 10Observability-Metrics, 10Patch-For-Review, 10User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (10fgiunchedi) While investigating this with @elukey I noticed `mtail_lines_total` has stopped increasing for centrallog in march (!), a... [11:14:51] 10serviceops, 10MW-on-K8s, 10Performance-Team, 10SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) >>! In T290536#7383383, @akosiaris wrote: >>>! In T290536#7383272, @jijiki wrote: > > > That's currently my preferred way cause it's determi...
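[editor's note] The 'deploy-kserve' cluster role mentioned above would, in Kubernetes RBAC terms, look roughly like the sketch below. This is a hypothetical fragment modelled only on the "like the flink one" description; the API group, resource names, and verbs are assumptions, and the actual definition lives in the linked deployment-charts change.

```yaml
# Hypothetical ClusterRole sketch for deploying kserve-managed services.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deploy-kserve
rules:
  - apiGroups: ["serving.kserve.io"]   # assumed API group
    resources: ["inferenceservices"]   # assumed resource
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

Keeping one such role per cluster workload type (flink, kserve, ...) matches the "keep things separate between clusters" intent stated above.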
[11:36:59] 10serviceops, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (10jijiki) [11:42:24] 10serviceops, 10Scap, 10Patch-For-Review: Scap error when deploying kartotherian - https://phabricator.wikimedia.org/T291990 (10jijiki) >>! In T291990#7387469, @Jgiannelos wrote: > Just a heads up, this is currently blocking us from pushing a couple of changes to kartotherian to test our prod environments in... [11:42:42] 10serviceops, 10Scap, 10Patch-For-Review: Scap error when deploying kartotherian - https://phabricator.wikimedia.org/T291990 (10jijiki) p:05Triage→03High [12:27:17] 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#7386927, @Joe wrote: > The first scenario I proposed in T290536 goes as follows: > * One cluster for first deploy/debug purposes (kube-mwdebug) >... [13:09:35] 10serviceops, 10Prod-Kubernetes, 10Shellbox, 10Kubernetes, 10Patch-For-Review: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (10akosiaris) Change merged and deployed. All pods restarted to pick up the change as well. Let's monitor this over the ne... [13:12:04] 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) >>! In T291918#7387656, @jijiki wrote: > Naming things is hard though, I do not agree with the `kube` prefix, in the future after baremetal mediawiki servers are go... [13:14:03] 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) I forgot to add: we probably also want to migrate wikitech early in the process. It will need us to add php-ldap to our debug image, but it should allow us to dogfo... 
[14:04:37] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet [14:28:36] 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#7387775, @Joe wrote: >>>! In T291918#7387656, @jijiki wrote: >> Naming things is hard though, I do not agree with the `kube` prefix, in the future... [14:30:22] 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#7387778, @Joe wrote: > I forgot to add: we probably also want to migrate wikitech early in the process. It will need us to add php-ldap to our deb... [14:39:27] 10serviceops, 10Kubernetes: Document how k8s logging works - https://phabricator.wikimedia.org/T289639 (10akosiaris) p:05Triage→03Low I've gone ahead and added quite a bit of information to the https://wikitech.wikimedia.org/wiki/Kubernetes/Logging page. Let me know if that helps and what more you would like... [14:43:06] jelto: o/ (if you have time) - I am getting an error similar to https://github.com/roboll/helmfile/issues/1536 when deploying the ml-service with helmfile, it seems that the easy solution is just to add createNamespace: false to helmDefaults [14:43:31] the error is: [14:43:31] Error: namespaces is forbidden: User "revscoring-editquality-deploy" cannot create resource "namespaces" in API group "" at the cluster scope [14:43:49] if I check with --debug the helm3 command indeed tries to --create-namespace etc.. [14:44:19] so I am wondering, what is the best path for the moment? Add the createNamespace: false setting to the service helmfile config, or a more global one?
[14:46:37] you got a helm chart that tries to create the namespace too ? [14:47:23] nono I checked, I am reasonably sure that it is not the kubeflow chart [14:49:03] I wonder what that --create-namespace does [14:49:21] you probably want to avoid it and handle namespaces on your own in admin_ng/ [14:49:38] cause I doubt it sets all the rules we do [14:53:09] so yeah, I would go down the createNamespace: false route if it helps [14:53:16] akosiaris: yep yep I am already doing it, this is why I was asking.. I can add the createNamespace: false flag to the ml service dir only for the moment [14:53:26] and then we can decide later on where to put it [14:53:37] for the moment IIUC admin_ng is the only other thing that uses helm3 [14:56:30] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724757 :) [14:59:09] (going to test it) [15:02:59] worked! [15:03:00] * elukey dances [15:09:27] <_joe_> these are the moments where we're grateful we work remotely [15:09:47] <_joe_> (we can only imagine you dancing vs actually seeing you dance) [15:11:02] yes yes agreed! [15:13:57] <_joe_> elukey: so you deployed a service with helm3 successfully, it's indeed cause for celebration [15:25:00] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet completed: - thumbor2006 (*... [15:36:12] _joe_ at the moment it doesn't pull the docker image for some i/o timeout error that I am trying to debug, I am going to celebrate once solved it :D [15:36:46] <_joe_> how big is that image? 
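[editor's note] The fix agreed on above is a small helmfile configuration change. The `helmDefaults.createNamespace` key is part of helmfile's documented defaults for helm 3; the exact file layout inside the deployment-charts repo is an assumption here.

```yaml
# helmfile.yaml fragment: stop helmfile from passing --create-namespace
# to helm 3, so namespaces remain managed separately (e.g. via admin_ng),
# and deploy users without cluster-scope rights no longer hit
# "namespaces is forbidden".
helmDefaults:
  createNamespace: false
```

Scoping this to the ml-service directory first (as done in the linked change) and deciding later whether to make it global matches the conversation above.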
[15:42:16] I think around 1.1G (it was worse weeks ago, this is the "slim" version :) [15:43:08] but it is weird since I have deployed with kubectl some time ago and the image was pulled correctly (and it was bigger) [15:43:18] so it may be some weird knative setting/corner-case [15:43:54] (knative is trying to create the first revision of the service but it fails due to the i/o timeout) [15:44:37] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet [15:51:10] 10serviceops, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 (10Krinkle) If I understand correctly, the now-removed approach involved a single Logstash link that would enumerate all the canaries. The cana... [15:57:34] <_joe_> elukey: it's strange indeed [15:57:43] <_joe_> look at the docker logs of the involved node [15:58:24] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet executed with errors: - thu... [16:11:24] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) Quick note: I had to add `createNamespace: false` to my service's `helmDefaults` to avoid the following error: ` Error: namespaces is forbidden: User "revscoring-editquality-dep...
[16:23:55] _joe_ so the error is more subtle, it seems that knative's controller tries to get the digest/sha for the docker image, failing with "failed to resolve image to digest: Get \"https://docker-registry.discovery.wmnet/v2/\": dial tcp 10.2.1.44:443: i/o timeout". So I think it fails before docker even pulls, like a weird connection issue (but we don't really have firewall rules yet..) [16:24:22] and previously I was able to pull an image from docker-registry.wikimedia.org, as I now see from docker image list [16:24:28] (so not the discovery endpoint) [16:24:47] <_joe_> elukey: so it works with the external images and not the internal endpoint? [16:25:19] _joe_ I can retry and check, not sure about this last one (I was just checking the old images on the nodes) [16:25:48] <_joe_> elukey: do you have a hole in the egress to allow reaching the docker registry from a kubernetes pod? I guess not [16:26:07] <_joe_> but I fear the problem is that the internal registry refuses to talk to you anyways [16:26:32] _joe_: I have no GlobalNetworkPolicy for the moment, so IIUC egress holes are not needed [16:34:26] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet [16:47:57] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet completed: - thumbor2006 (*...
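[editor's note] The "resolve image to digest" step that fails above is the knative controller talking to the registry's `/v2/` HTTP API from inside the cluster, before any node-level docker pull happens. A rough Python sketch of that resolution step (knative actually does this in Go via go-containerregistry; the function names and repo path here are illustrative assumptions):

```python
import urllib.request

MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def manifest_url(registry: str, repo: str, tag: str) -> str:
    # The /v2/ endpoint the controller contacts before the node pulls.
    return f"https://{registry}/v2/{repo}/manifests/{tag}"


def resolve_digest(registry: str, repo: str, tag: str, timeout: float = 5.0) -> str:
    """Fetch the manifest headers and return the Docker-Content-Digest value.
    A connect failure at this stage surfaces as the 'failed to resolve
    image to digest ... i/o timeout' error quoted above."""
    req = urllib.request.Request(
        manifest_url(registry, repo, tag),
        headers={"Accept": MANIFEST_V2},
        method="HEAD",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers["Docker-Content-Digest"]
```

This is why the pod-network path to `docker-registry.discovery.wmnet` matters independently of whether the node itself can pull images.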
[17:00:50] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) [17:01:01] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) This is ready for service [17:10:40] akosiaris: did you have time to figure out yet if you want to tie giving `exec` rights in the k8s namespaces to a full permissions redesign or not? Thanks to help from l.egoktm and _j.oe_ yesterday I have exciting new errors to figure out for Toolhub related to the network protections and no known host that I can recreate them on interactively. [17:11:44] 10serviceops, 10Kubernetes, 10Sustainability (Incident Followup): Investigate whether running >1 replicas of calico-typha is feasible and prudent - https://phabricator.wikimedia.org/T292077 (10Krinkle) [17:48:01] ottomata: did you end up deploying my keepAlive change to all eventgates, or just the main? [17:55:10] finished all this morning [17:55:44] 10serviceops, 10Wikimedia-JobQueue, 10Platform Team Workboards (Clinic Duty Team), 10User-brennen, 10Wikimedia-production-error: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Pchelolo) So, 503 is solved - we have zero 503 e... [17:56:07] oh cool. Now we don't have any 503s. But, we still have 504s [17:56:19] that's gonna be my next adventure. [18:07:26] you are a hero [20:20:18] 10serviceops, 10Observability-Metrics, 10User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (10herron) >>! In T246470#7387278, @fgiunchedi wrote: > While investigating this with @elukey I noticed `mtail_lines_total` has stopped increasing for centrall...
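[editor's note] The `exec` rights question above maps to an RBAC rule on the `pods/exec` subresource. A minimal sketch, assuming a namespace-scoped Role; the role and namespace names are hypothetical and any real grant would depend on the permissions redesign being discussed:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-exec        # hypothetical name
  namespace: toolhub    # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]   # `kubectl exec` issues a create on pods/exec
```

In practice the subject also needs `get` on `pods` in the same namespace for `kubectl exec` to locate the target pod.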