[02:20:21] 10serviceops, 10Wikimedia-Logstash, 10GitLab (Initialization), 10SRE Observability (FY2021/2022-Q1), 10User-brennen: Logging for GitLab - https://phabricator.wikimedia.org/T274462 (10lmata)
[02:22:25] 10serviceops, 10Citoid, 10SRE, 10SRE Observability, and 2 others: Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10lmata)
[02:24:52] 10serviceops, 10Icinga, 10SRE, 10SRE Observability: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10lmata)
[02:24:58] 10serviceops, 10MW-on-K8s, 10SRE, 10SRE Observability: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (10lmata)
[02:27:38] 10serviceops, 10SRE, 10SRE Observability, 10Patch-For-Review: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10lmata)
[02:28:51] 10serviceops, 10SRE, 10SRE Observability, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10lmata)
[02:32:32] 10serviceops, 10SRE, 10SRE Observability: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (10lmata)
[02:40:19] 10serviceops, 10SRE Observability, 10Wikimedia-General-or-Unknown, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (10lmata)
[08:08:14] 10serviceops, 10CX-cxserver, 10Wikidata, 10wdwb-tech, 10Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (10Pginer-WMF)
[08:12:03] 10serviceops, 10Machine-Learning-Team, 10SRE, 10Kubernetes, 10Patch-For-Review: Add the
possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) `base::expose_puppet_certs` is used in both master and node profiles, with different settings: * on master...
[08:14:02] hello folks
[08:14:08] * jayme afraid
[08:14:35] ahhh I love when my presence is so well received by colleagues
[08:15:39] I just left a note on the calico-on-master-nodes adventure related to expose puppet certs, it seems to be the last obstacle
[08:16:04] (at least on paper, I am sure there will be more fireworks when we deploy it on the ml nodes)
[08:19:09] elukey: you mean what's on the task I guess?
[08:20:43] jayme: yes exactly, I assumed that wikibugs left a note related to me spamming this chan :D
[08:21:15] yeah...just wanted to make sure I'm not missing anything in gerrit or so
[08:21:52] nono I have another CR opened but it is blocked by duplicate declarations (the last one being expose_puppet_certs)
[08:22:32] or not, maybe it is only permissions related, not a duplicate declaration, but anyway an issue
[08:22:40] going to re-run pcc just to make sure
[08:29:36] elukey: in the case of the master, is the puppet cert and key even used?
[08:30:02] I can't seem to find a reference to it and lsof says no
[08:30:59] for service accounts (k8s) and tls (apiserver) we use the dedicated certificates
[08:37:38] jayme: I was checking that too, I assumed there was a reason to expose the certs
[08:38:03] I'm sure there was one at some point :)
[08:38:19] checking git blame
[08:39:02] ahh interesting https://gerrit.wikimedia.org/r/c/operations/puppet/+/343787/2/modules/profile/manifests/kubernetes/master.pp
[08:39:10] so it seems the answer is no
[08:39:33] or maybe yes for toolforge, not for "prod"
[08:39:45] I'd assume something around the ssl_*_path, yes
[08:40:09] not confusing at all
[08:40:14] IDK exactly about toolforge but my understanding always was that they use something completely different
[08:40:17] indeed
[08:40:39] did I mention the puppet code around k8s could use some love? :D
[08:41:06] jayme: I can turn the option off for ml-serve and check that it works fine, then add a comment around the expose puppet cert explaining what we went through
[08:42:04] elukey: yes please. We should use your ticket to figure this out and refactor the code into something better
[08:43:21] 10serviceops, 10Machine-Learning-Team, 10SRE, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) After a chat with Janis we reviewed the master's code and found https://gerrit.wikimedia.org/r/c/operation...
[08:43:27] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10JMeybohm) We also talked about using Istio Ingress in the past (envoy-based) which could be a good fit as well and we could share technology and resources with ML...
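The "lsof says no" check above (does any process actually hold the exposed puppet cert open?) can be reproduced by scanning `/proc/*/fd`, which is essentially what lsof does. A minimal sketch, Linux-only; the cert path is made up for the demo, not the one managed by `base::expose_puppet_certs`:

```python
# Minimal stand-in for `lsof <file>`: scan /proc/*/fd symlinks for a
# given path. The cert path used in the demo is illustrative only.
import glob
import os

def holders(path):
    """Return the pids of processes that currently hold `path` open."""
    target = os.path.realpath(path)
    pids = set()
    for fd in glob.glob('/proc/[0-9]*/fd/*'):
        try:
            if os.path.realpath(fd) == target:
                pids.add(int(fd.split('/')[2]))
        except OSError:
            pass  # process or fd went away mid-scan
    return sorted(pids)

# Demo: a file we keep open ourselves shows up in the result.
f = open('/tmp/demo-cert.pem', 'w')
print(os.getpid() in holders('/tmp/demo-cert.pem'))
```

Note the same blind spot lsof has: a daemon that reads the cert once at startup and closes it again will not show up, so an empty result is strong but not conclusive evidence that the cert can stop being exposed.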
[08:43:41] or just turn it off for our use cases, leaving a comment in the code; we can keep it in there for flexibility
[08:43:57] (turn off the hiera flag for masters I mean)
[08:44:01] going to test it :)
[08:46:17] s/flexibility/confusion/ ;-)
[08:49:09] hahahaha
[09:07:01] just did a roll restart of all daemons on ml-serve-ctrl, all good
[09:12:18] nice
[09:26:34] so now https://gerrit.wikimedia.org/r/c/operations/puppet/+/702645 looks clean in pcc
[09:26:39] the idea is to
[09:26:53] 1) add /dev/vdb to ml-serve-ctrl*
[09:27:05] 2) merge the change and hope that the kubelet comes up without horrors
[09:27:11] 3) add BGP rules
[09:27:15] does it make sense?
[09:29:01] yeah
[09:29:19] one thing you could double check is metrics of the "node services"
[09:29:38] how nodes are selected for scraping by prometheus, I mean
[09:29:51] to make sure the master nodes are scraped as well
[09:29:58] good point!
[09:32:22] 10serviceops, 10SRE, 10Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (10Kormat) >>! In T285806#7191055, @wkandek wrote: > Thanks everybody for the feedback on the communications for the DC switchover process. We will spe...
[10:32:28] 10serviceops, 10CX-cxserver, 10Wikidata, 10wdwb-tech, 10Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (10Nikerabbit) My current understandin...
[10:36:12] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @Cmjohnson @Jclark-ctr We would like to start putting those servers in production, is it possible to update or complete any actions...
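The scrape double-check suggested above can be done against Prometheus' `/api/v1/targets` HTTP API: if a master node is missing from `activeTargets` entirely, it's the service discovery that needs fixing, not the scrape. A sketch against a canned response; the job and instance names are invented, not the real ml-serve ones:

```python
# Sketch of "are the master nodes scraped too?": filter the JSON shape
# that Prometheus' /api/v1/targets endpoint returns. The response below
# is canned and the job/instance names are invented.
import json

canned = json.loads("""
{"status": "success",
 "data": {"activeTargets": [
   {"labels": {"job": "node", "instance": "ml-serve1001:9100"}, "health": "up"},
   {"labels": {"job": "node", "instance": "ml-serve-ctrl1001:9100"}, "health": "up"},
   {"labels": {"job": "node", "instance": "ml-serve-ctrl1002:9100"}, "health": "down"}
 ]}}
""")

def unhealthy(targets, job):
    """Instances of `job` that are discovered but not scraping cleanly."""
    return sorted(t["labels"]["instance"]
                  for t in targets["data"]["activeTargets"]
                  if t["labels"]["job"] == job and t["health"] != "up")

print(unhealthy(canned, "node"))  # ['ml-serve-ctrl1002:9100']
```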
[10:36:32] coming back to https://phabricator.wikimedia.org/T285219#7205091 after vacation – this issue does not seem to be confined to cxserver or wikidata. Can someone help to debug it further?
[10:37:02] mutante: here?
[10:58:03] first attempt of the kubelet on a master was not great, I forgot the docker profiles :D
[11:27:53] effie: what's up? currently meeting Jelto
[12:26:30] jayme: kubelet + docker up on ml-serve-ctrl!
[13:28:09] I guess that now there is
[13:28:12] 1) https://gerrit.wikimedia.org/r/c/operations/homer/public/+/704104
[13:28:33] 2) helmfile -e -l name=calico sync (to create the pods on ml-serve-ctrl)
[13:29:10] I am following https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Networking, even if 2) then 1) looks more reasonable in my head
[13:29:19] but I guess that both need to be run close together
[13:34:50] mutante: I was looking at the decom tasks, as we need to make some room for mc* hosts, but DCops is out this week
[13:44:05] effie: ACK, no problem. I/we will remove some more (after the canary replacement). Actually you could help us by reviewing Jelto's change that is part of replacing canaries, and meanwhile I will make some more patches
[14:32:53] elukey: hmm...hasn't calico already been scheduled on the master nodes?
[14:33:56] read: I'd assume the sync is not necessary
[14:35:27] jayme: interesting, now I see the calico pod (via docker ps) only on ml-serve-ctrl1002
[14:35:44] but on 1001 I see the /pause one
[14:35:49] mutante: k
[14:36:35] jayme: ah my bad, it looks ok yes, no need for sync.. so probably the homer CR is enough
[14:39:34] 10serviceops, 10Release Pipeline: Production buster-nodejs10-devel image has npm 5.x, which is not actually compatible with node 10.x - https://phabricator.wikimedia.org/T284112 (10Jdforrester-WMF) >>! In T284112#7198157, @Legoktm wrote: > Is npm 5.x flat out incompatible with node10, or is it more subtle than...
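For context on the /pause container seen on 1001 above: it is the pod's sandbox (infra) container, which holds the pod's network namespace; the workload containers join it. So its presence alone doesn't prove the calico container started. A sketch of telling the two apart from `docker ps`-style data; the node names and images below are fabricated sample data:

```python
# The /pause container is only the pod's network sandbox; the workload
# containers join its namespaces. Split fabricated `docker ps`-style
# (node, image) entries into sandbox vs. workload containers per node.
entries = [
    ("ml-serve-ctrl1001", "k8s.gcr.io/pause:3.2"),
    ("ml-serve-ctrl1002", "k8s.gcr.io/pause:3.2"),
    ("ml-serve-ctrl1002", "calico/node:v3.x"),
]

def workloads(rows, node):
    """Non-pause containers running on `node`."""
    return [img for (n, img) in rows if n == node and "pause" not in img]

# A node with only the sandbox means the workload container never started.
print(workloads(entries, "ml-serve-ctrl1001"))  # []
print(workloads(entries, "ml-serve-ctrl1002"))  # ['calico/node:v3.x']
```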
[14:40:26] double checked IPs - LGTM
[14:41:47] <3
[14:44:11] not sure if bird (inside of calico-node) will retry all the time, so maybe you need to kill the calico-node pods on the masters. But it would be nice to know if it just starts working after the homer CR is applied
[14:46:00] jayme: ack yes, will report back once applied
[14:46:07] cool
[15:38:04] so the pods don't come up nicely, I see
[15:38:05] bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
[15:38:19] so I have forgotten something for sure
[15:39:28] elukey: homer change already merged?
[15:40:21] ahhh there is profile::calico::kubernetes::bgp_peers in puppeeeeettt
[15:40:40] jayme: yep! but the ibgp session stays in "Connect" so bird is not collaborating
[15:40:42] ouch...sorry for not spotting
[15:42:01] nono my bad
[15:42:42] basically https://gerrit.wikimedia.org/r/c/operations/puppet/+/704131
[15:47:33] elukey: you'll have to add those to the masters role as well
[15:47:56] to have ferm open the ports
[15:48:12] sigh right
[15:55:48] hi, I'd like to deploy the flink chart to staging tomorrow if this is ok with you
[16:02:27] I think there is some work left for us to do. I can probably take a look tomorrow morning if that's okay
[16:02:31] dcausse: 1
[16:03:18] jayme: it would be great, thanks!
[16:04:59] jayme: bgp sessions established! \o/
[16:05:08] elukey: yay!
[16:08:03] dcausse: but ofc.
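The state transition above ("Connect" → "Established") can be watched per host with `calicoctl node status`, which prints a table of BGP peers. This sketch parses output of roughly that shape to flag stuck sessions; the table here is fabricated sample output, not taken from the ml-serve hosts:

```python
# Pull out any BGP peer whose session is not yet Established from
# `calicoctl node status`-style output. The table is fabricated.
sample = """\
IPv4 BGP status
+--------------+-----------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE    | INFO        |
+--------------+-----------+-------+----------+-------------+
| 10.64.0.1    | node mesh | up    | 16:04:59 | Established |
| 10.64.0.2    | node mesh | start | 15:40:40 | Connect     |
+--------------+-----------+-------+----------+-------------+
"""

def stuck_peers(text):
    """Peer addresses whose BGP session is not Established."""
    rows = [l for l in text.splitlines()
            if l.startswith('|') and 'PEER ADDRESS' not in l]
    return [r.split('|')[1].strip() for r in rows if 'Established' not in r]

# A peer stuck in Connect usually means the TCP session (BGP, port 179)
# cannot be set up at all -- e.g. a host firewall (ferm) not opening the
# port, as turned out to be the case above.
print(stuck_peers(sample))  # ['10.64.0.2']
```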
feel free to merge already to have the chain green
[16:17:37] thanks, I'll merge tomorrow morning and make sure https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/693416 is green
[16:48:42] 10serviceops, 10SRE, 10Traffic, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10jijiki)
[16:48:56] 10serviceops, 10SRE, 10Traffic, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10jijiki)
[16:49:04] 10serviceops, 10MW-on-K8s, 10SRE, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10jijiki)
[16:54:39] 10serviceops, 10Maps, 10Patch-For-Review, 10User-jijiki: Deploy tegola-vector-tiles to kubernetes - https://phabricator.wikimedia.org/T283159 (10jijiki) tegola-vector-tiles is deployed to staging, but it is non functional as we need to create postgres users which are allowed to connect from the kubernetes...
[16:58:03] 10serviceops, 10Maps, 10User-jijiki: tegola-vector-tiles: load balancing reads between postgress servers - https://phabricator.wikimedia.org/T286494 (10jijiki)
[16:58:15] 10serviceops, 10Maps, 10User-jijiki: tegola-vector-tiles: load balancing reads between postgres servers - https://phabricator.wikimedia.org/T286494 (10jijiki)
[17:38:01] 10serviceops, 10Machine-Learning-Team, 10SRE, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) Kubelets / calico / bird are deployed on the ml-serve-ctrl nodes, but the istio webhook svc seems not reac...
[21:10:48] 10serviceops, 10MW-on-K8s, 10SRE, 10Patch-For-Review, 10Release-Engineering-Team (Doing): Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (10jeena) a:05jeena→03None
[22:10:04] 10serviceops, 10Shellbox: Benchmark Shellbox - https://phabricator.wikimedia.org/T286384 (10Legoktm) thanks! Forked to https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&refresh=1m&from=now-6h&to=now Note that `deployment="$namespace"` didn't work for me, I had to switch it to `kubernetes_namespace="...