[00:57:04] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) @jijiki @Dzahn this is all ready for service   Thank you.
[05:34:35] <wikibugs>	 10serviceops, 10Analytics-Radar, 10WikimediaDebug, 10observability, and 4 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki) @Krinkle I agree that we should come up with a complete solution for this. I will close this task and we can continue this discussion...
[05:36:33] <wikibugs>	 10serviceops, 10Analytics-Radar, 10WikimediaDebug, 10observability, and 4 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki) 05Open→03Resolved p:05Triage→03Medium
[05:43:56] <wikibugs>	 10serviceops, 10Analytics-Radar, 10WikimediaDebug, 10observability, and 4 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10Joe) For the record, the mwdebug cluster on kubernetes has its own servergroup.
[05:46:23] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Next): Provide an mwdebug functionality on kubernetes  - https://phabricator.wikimedia.org/T276994 (10jijiki) p:05Triage→03Medium
[06:43:05] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Repartition mediawiki servers - https://phabricator.wikimedia.org/T291918 (10Joe) p:05Triage→03High I think the title is misleading, I spent 10 minutes trying to figure out what partitioning schemes had to do with moving to kubernetes :D Amending it.
[06:43:37] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe)
[07:18:12] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) The first scenario I proposed in T290536 goes as follows: * One cluster for first deploy/debug purposes (kube-mwdebug) * One cluster to serve internal requests to t...
[07:49:58] <wikibugs>	 10serviceops, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 (10hashar) The servergroup comes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/546448 which set the environment variable in Apache....
[07:58:55] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Volans) >>! In T290190#7386065, @Papaul wrote: > @Volans I was able to get thumbor2005 installed without adding the MAC address but the install failed a...
[08:15:33] <wikibugs>	 10serviceops, 10Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (10Mvolz) >>! In T291707#7383868, @akosiaris wrote: >>>! In T291707#7383407, @Mvolz wrote: >> The PDF connection might be a red herring because although that's what happened in the past, attemptin...
[08:19:30] <wikibugs>	 10serviceops, 10Release-Engineering-Team, 10Scap: scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 (10hashar) a:03hashar
[08:29:55] <wikibugs>	 10serviceops, 10Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (10akosiaris) >>! In T291707#7387047, @Mvolz wrote: >>>! In T291707#7383868, @akosiaris wrote: >>>>! In T291707#7383407, @Mvolz wrote: >>> The PDF connection might be a red herring because althoug...
[08:37:04] <hashar>	 I haven't written any puppet in ages, but here is my latest crazy idea:  add a `canary: true` field to mediawiki logs originating from mediawiki servers
[08:37:35] <hashar>	 the ultimate use is to have a Kibana dashboard that only list errors originating from canaries which we can point to if scap abort due to errors on canaries
[08:38:03] <hashar>	 the series of patches adds a `CANARY` env variable much like `SERVERGROUP`: https://gerrit.wikimedia.org/r/q/bug:T291870
[08:41:11] <_joe_>	 hashar: I dont' think adding a field to logs really works for logstash indexing without further work
[08:41:30] <_joe_>	 but I'd rather modify the way we treat SERVERGROUP, which is something I wanted to do anyways
[08:42:13] <hashar>	 I talked quickly about it with Filippo, it seems the log stack will be able to find `extra.canary` is a boolean and index it as such
[08:42:45] <_joe_>	 yeah I'm saying we should have an easier way than that using the servergroup variable
[08:42:47] <hashar>	 I thought about introducing some new servergroup such as jobrunner_canaries,  then for most purposes they are just jobrunner and should share the same servergroup
[08:42:55] <hashar>	 hence why I added an extra field
[08:43:18] <_joe_>	 hashar: I think the better approach is to tweak management of SERVERGROUP in mediawiki-config
[08:43:28] <hashar>	 which seems straightforward and with little impact. I even made it to only be injected when the host is a canary to avoid adding that field to every single records
[08:43:49] <hashar>	 well I explicitly did not want to mess with servergroup ;)
[08:44:06] <_joe_>	 oh we are messing with it for kubernetes anyways
[08:44:17] <hashar>	 else we would have to adjust any logic doing stuff like SERVERGROUP==='appserver' to also OR SERVERGROUP==='appserver_canary'
[08:44:53] <wikibugs>	 10serviceops, 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10WMDE-Fisch)
[08:45:48] <hashar>	 and for k8s I don't have the use case for scap, I have no idea how we batch deployment to k8s
[08:46:23] <_joe_>	 hashar: let me take the time to try to show my idea for handling SERVERGROUP
[08:46:47] <hashar>	 I explicitly do not want to touch servergroup! :]
[08:49:44] <hashar>	 then maybe it is not that complicated to expand servergroup for canaries, I just felt it was easier to add a dedicated field for my purpose
[09:31:52] <elukey>	 ---
[09:32:24] <elukey>	 hello folks, I am trying to add a new cluster role 'deploy-kserve' (like the flink one), does https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724448/ make sense?
[09:32:37] <elukey>	 (or any variation of it, just to keep things separate between clusters)
[09:56:23] <wikibugs>	 10serviceops, 10Observability-Metrics, 10Patch-For-Review, 10User-jijiki: Measure  segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (10fgiunchedi) While investigating this with @elukey I noticed `mtail_lines_total` has stopped increasing for centrallog in march (!), a...
[11:14:51] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Performance-Team, 10SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki)  >>! In T290536#7383383, @akosiaris wrote: >>>! In T290536#7383272, @jijiki wrote: >   >  > That's currently my preferred way cause it's determi...
[11:36:59] <wikibugs>	 10serviceops, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (10jijiki)
[11:42:24] <wikibugs>	 10serviceops, 10Scap, 10Patch-For-Review: Scap error when deploying kartotherian - https://phabricator.wikimedia.org/T291990 (10jijiki) >>! In T291990#7387469, @Jgiannelos wrote: > Just a heads up, this is currently blocking us from pushing a couple of changes to kartotherian to test our prod environments in...
[11:42:42] <wikibugs>	 10serviceops, 10Scap, 10Patch-For-Review: Scap error when deploying kartotherian - https://phabricator.wikimedia.org/T291990 (10jijiki) p:05Triage→03High
[12:27:17] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#7386927, @Joe wrote: > The first scenario I proposed in T290536 goes as follows: > * One cluster for first deploy/debug purposes (kube-mwdebug) >...
[13:09:35] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Shellbox, 10Kubernetes, 10Patch-For-Review: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (10akosiaris) Change merged and deployed. All pods restarted to pick up the change as well. Let's monitor this over the ne...
[13:12:04] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) >>! In T291918#7387656, @jijiki wrote: > Naming things is hard though, I do not agree with the `kube` prefix, in the future after baremetal mediawiki servers are go...
[13:14:03] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) I forgot to add: we probably also want to migrate wikitech early in the process. It will need us to add php-ldap to our debug image, but it should allow us to dogfo...
[14:04:37] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet
[14:28:36] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#7387775, @Joe wrote: >>>! In T291918#7387656, @jijiki wrote: >> Naming things is hard though, I do not agree with the `kube` prefix, in the future...
[14:30:22] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#7387778, @Joe wrote: > I forgot to add: we probably also want to migrate wikitech early in the process. It will need us to add php-ldap to our deb...
[14:39:27] <wikibugs>	 10serviceops, 10Kubernetes: Document how k8s logging works - https://phabricator.wikimedia.org/T289639 (10akosiaris) p:05Triage→03Low I 've went ahead and added quite a bit of information to https://wikitech.wikimedia.org/wiki/Kubernetes/Logging page. Let me know if that helps and what more you would like...
[14:43:06] <elukey>	 jelto: o/ (if you have time) - I am getting an error similar to https://github.com/roboll/helmfile/issues/1536 when deploying the ml-service with helmfile, it seems that the easy solution is just to add createNamespace: false to helfileDefaults
[14:43:31] <elukey>	 the error is:
[14:43:31] <elukey>	 Error: namespaces is forbidden: User "revscoring-editquality-deploy" cannot create resource "namespaces" in API group "" at the cluster scope
[14:43:49] <elukey>	 if I check with --debug the helm3 command indeed tries to --create-namespace etc..
[14:44:19] <elukey>	 so I am wondering, what is the best path for the moment? Add the createNaspace: false setting to the service helmfile config, or a more global one?
[14:46:37] <akosiaris>	 you got a helm chart that tries to create the namespace too ?
[14:47:23] <elukey>	 nono I checked, I am reasonably sure that it is not the kubeflow chart
[14:49:03] <akosiaris>	 I wonder what that --create-namespace does
[14:49:21] <akosiaris>	 you probably want to avoid it and handle namespaces on your own in admin_ng/
[14:49:38] <akosiaris>	 cause I doubt it sets all the rules we do
[14:53:09] <akosiaris>	 so yeah, I would go down the createNamespace: false route if it helps
[14:53:16] <elukey>	 akosiaris: yep yep I am already doing it, this is why I was asking.. I can add the createNamespace: false flag to the ml service dir only for the moment
[14:53:26] <elukey>	 and then we can decide later on where to put it
[14:53:37] <elukey>	 for the moment IIUC admin_ng is the only other thing that uses helm3
[14:56:30] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724757 :)
[14:59:09] <elukey>	 (going to test it)
[15:02:59] <elukey>	 worked!
[15:03:00] * elukey dances
[15:09:27] <_joe_>	 these are the moments where we're grateful we work remotely
[15:09:47] <_joe_>	 (we can only imagine you dancing vs actually seeing you dance)
[15:11:02] <elukey>	 yes yes agreed!
[15:13:57] <_joe_>	 elukey: so you deployed a service with helm3 successfully, it's indeed cause for celebration
[15:25:00] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet completed: - thumbor2006 (*...
[15:36:12] <elukey>	 _joe_ at the moment it doesn't pull the docker image for some i/o timeout error that I am trying to debug, I am going to celebrate once solved it :D
[15:36:46] <_joe_>	 how big is that image?
[15:42:16] <elukey>	 I think around 1.1G (it was worse weeks ago, this is the "slim" version :)
[15:43:08] <elukey>	 but it is weird since I have deployed with kubectl some time ago and the image was pulled correctly (and it was bigger)
[15:43:18] <elukey>	 so it may be some knative weird setting/corner-case
[15:43:54] <elukey>	 (knative is trying to create the first revision of the service but it failed due to the i/o timeout
[15:44:37] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet
[15:51:10] <wikibugs>	 10serviceops, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 (10Krinkle) If I understand correctly, the now-removed approach involved a single Logstash link that would enumerate all the canaries. The cana...
[15:57:34] <_joe_>	 elukey: it's strange indeed
[15:57:43] <_joe_>	 look at the docker logs of the involved node
[15:58:24] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet executed with errors: - thu...
[16:11:24] <wikibugs>	 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) Quick note: I had to add `createNamespace: false` to my service's `helmDefaults` to avoid the following error:  ` Error: namespaces is forbidden: User "revscoring-editquality-dep...
[16:23:55] <elukey>	 _joe_ so the error is more subtle, it seems that knative's controller tries to get the digest/sha for the docker image, failing with "failed to resolve image to digest: Get \"https://docker-registry.discovery.wmnet/v2/\": dial tcp 10.2.1.44:443: i/o timeout". So I think before docker pulls, like a weird connection issue (but we don't really have firewall rules yet..)
[16:24:22] <elukey>	 and before I was able to pull an image from docker-registry.wikimedia.org now that I see from docker image list
[16:24:28] <elukey>	 (so not the discovery endpoint)
[16:24:47] <_joe_>	 elukey: so it works with the external images and not the internal endpoint?
[16:25:19] <elukey>	 _joe_ I can retry and check, not sure about this last one (I was just checking the old images on the nodes)
[16:25:48] <_joe_>	 elukey: do you have a hole in the egress to allow reaching the docker registry from a kubernetes pod? I guess not
[16:26:07] <_joe_>	 but I fear the problem is that the internal registry refuses to talk to you anyways
[16:26:32] <elukey>	 _joe_: I have no GlobalNetworkPolicy for the moment, so IIUC egress holes are not needed
[16:34:26] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet
[16:47:57] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet completed: - thumbor2006 (*...
[17:00:50] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul)
[17:01:01] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) This is ready for service
[17:10:40] <bd808>	 akosiaris: did you have time to figure out yet if you want to tie giving `exec` rights in the k8s namespaces to a full permissions redesign or not? Thanks to help from l.egoktm and _j.oe_ yesterday I have exciting new errors to figure out for Toolhub related to the network protections and no known host that I can recreate them on interactively.
[17:11:44] <wikibugs>	 10serviceops, 10Kubernetes, 10Sustainability (Incident Followup): Investigate whether running >1 replicas of calico-typha is feasible and prudent - https://phabricator.wikimedia.org/T292077 (10Krinkle)
[17:48:01] <Pchelolo>	 ottomata: did you end up deploying my keepAlive change to all eventgates, or just the main?
[17:55:10] <ottomata>	 finished all this morning
[17:55:44] <wikibugs>	 10serviceops, 10Wikimedia-JobQueue, 10Platform Team Workboards (Clinic Duty Team), 10User-brennen, 10Wikimedia-production-error: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Pchelolo) So, 503 is solved - we have zero 503 e...
[17:56:07] <Pchelolo>	 oh cool. Now we don't have any 503s. But, we still have 504s
[17:56:19] <Pchelolo>	 that's gonna be my next adventure.
[18:07:26] <ottomata>	 you are a hero
[20:20:18] <wikibugs>	 10serviceops, 10Observability-Metrics, 10User-jijiki: Measure  segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (10herron) >>! In T246470#7387278, @fgiunchedi wrote: > While investigating this with @elukey I noticed `mtail_lines_total` has stopped increasing for centrall...