[02:20:21] <wikibugs>	 10serviceops, 10Wikimedia-Logstash, 10GitLab (Initialization), 10SRE Observability (FY2021/2022-Q1), 10User-brennen: Logging for GitLab - https://phabricator.wikimedia.org/T274462 (10lmata)
[02:22:25] <wikibugs>	 10serviceops, 10Citoid, 10SRE, 10SRE Observability, and 2 others: Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10lmata)
[02:24:52] <wikibugs>	 10serviceops, 10Icinga, 10SRE, 10SRE Observability: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10lmata)
[02:24:58] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE, 10SRE Observability: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (10lmata)
[02:27:38] <wikibugs>	 10serviceops, 10SRE, 10SRE Observability, 10Patch-For-Review: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10lmata)
[02:28:51] <wikibugs>	 10serviceops, 10SRE, 10SRE Observability, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10lmata)
[02:32:32] <wikibugs>	 10serviceops, 10SRE, 10SRE Observability: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (10lmata)
[02:40:19] <wikibugs>	 10serviceops, 10SRE Observability, 10Wikimedia-General-or-Unknown, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (10lmata)
[08:08:14] <wikibugs>	 10serviceops, 10CX-cxserver, 10Wikidata, 10wdwb-tech, 10Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (10Pginer-WMF)
[08:12:03] <wikibugs>	 10serviceops, 10Machine-Learning-Team, 10SRE, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) `base::expose_puppet_certs` is used in both master and node profiles, with different settings: * on master...
[08:14:02] <elukey>	 hello folks
[08:14:08] * jayme afraid
[08:14:35] <elukey>	 ahhh I love when my presence is so well received by colleagues
[08:15:39] <elukey>	 I just left a note on the calico-on-master-nodes adventure related to expose puppet certs, it seems the last obstacle
[08:16:04] <elukey>	 (at least on paper, I am sure there will be more fireworks when we'll deploy it on ml nodes)
[08:19:09] <jayme>	 elukey: you mean what's on the task I guess?
[08:20:43] <elukey>	 jayme: yes exactly, I assumed that wikibugs left a note related to me spamming this chan :D
[08:21:15] <jayme>	 yeah...just wanted to make sure I'm not missing anything in gerrit or so
[08:21:52] <elukey>	 nono I have another CR opened but it is blocked by duplicate declarations (the last one being expose_puppet_certs)
[08:22:32] <elukey>	 or not, maybe it is only permissions related, not duplicate declaration, but anyway an issue
[08:22:40] <elukey>	 going to re-run pcc just to make sure
[08:29:36] <jayme>	 elukey: in case of the master, is the puppet cert and key even used?
[08:30:02] <jayme>	 I can't seem to find a reference to it and lsof sais no
[08:30:59] <jayme>	 for service accounts (k8s) and tls (apiserver) we use the dedicated certificates
[08:37:38] <elukey>	 jayme: I was checking that too, I assumed there was a reason to expose the certs
[08:38:03] <jayme>	 I'm sure there was one at some point :)
[08:38:19] <elukey>	 checking git blame
[08:39:02] <elukey>	 ahh interesting https://gerrit.wikimedia.org/r/c/operations/puppet/+/343787/2/modules/profile/manifests/kubernetes/master.pp
[08:39:10] <elukey>	 so it seems the answer is no
[08:39:33] <elukey>	 or maybe yes for toolforge, not for "prod"
[08:39:45] <jayme>	 I'd assume something around the ssl_*_path, yes
[08:40:09] <elukey>	 not confusing at all
[08:40:14] <jayme>	 IDK exactly about toolforge but my understanding always was that they use something completely different
[08:40:17] <jayme>	 indeed
[08:40:39] <jayme>	 did I mention the puppet code around k8s could need some love? :D
[08:41:06] <elukey>	 jayme: I can turn the option off for ml-serve and check that it works fine, then add a comment around the expose puppet cert explaining what we went through
[08:42:04] <jayme>	 elukey: yes please. We should use your ticket to figure this out and refactor the code to something better
[08:43:21] <wikibugs>	 10serviceops, 10Machine-Learning-Team, 10SRE, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) After a chat with Janis we reviewed the master's code and found https://gerrit.wikimedia.org/r/c/operation...
[08:43:27] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10JMeybohm) We also talked about using Istio Ingress in the past (envoy-based) which could be a good fit as well and we could share technology and resources with ML...
[08:43:41] <elukey>	 or just turn it off for our use cases, leaving a comment in the code, we can keep it in there for flexibility
[08:43:57] <elukey>	 (turn off the hiera flag for masters I mean)
[08:44:01] <elukey>	 going to test it :)
[08:46:17] <jayme>	 s/flexibility/cofusion/  ;-)
[08:49:09] <elukey>	 hahahaha
[09:07:01] <elukey>	 just did a roll restart of all daemons on ml-serve-ctrl, all good
[09:12:18] <jayme>	 nice
[09:26:34] <elukey>	 so now https://gerrit.wikimedia.org/r/c/operations/puppet/+/702645 looks clean in pcc
[09:26:39] <elukey>	 the idea is to 
[09:26:53] <elukey>	 1) add /dev/vdb to ml-serve-ctrl*
[09:27:05] <elukey>	 2) merge the change and hope that the kubelet comes up without horrors
[09:27:11] <elukey>	 3) add BGP rules
[09:27:15] <elukey>	 does it make sense?
[09:29:01] <jayme>	 yeah
[09:29:19] <jayme>	 one thing you could double check is metrics of the "node services"
[09:29:38] <jayme>	 How nodes are selected for scraping by prometheus I mean
[09:29:51] <jayme>	 to make sure the master nodes are scraped as well
[09:29:58] <elukey>	 good point!
[09:32:22] <wikibugs>	 10serviceops, 10SRE, 10Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (10Kormat) >>! In T285806#7191055, @wkandek wrote: > Thanks everybody for the feedback on the communications for the DC switchover process. We will spe...
[10:32:28] <wikibugs>	 10serviceops, 10CX-cxserver, 10Wikidata, 10wdwb-tech, 10Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (10Nikerabbit) My current understandin...
[10:36:12] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @Cmjohnson  @Jclark-ctr  We would like to start putting those servers in production, is it possible to update or complete any actions...
[10:36:32] <Nikerabbit>	 coming back to https://phabricator.wikimedia.org/T285219#7205091 after vacation – this issue does not seem to confined to cxserver or wikidata. Can someone help to debug it further?
[10:37:02] <effie>	 mutante: here?
[10:58:03] <elukey>	 first attempt of kubelet on master not great, I forgot the docker profiles :D
[11:27:53] <mutante>	 effie: what's up? currently meeting Jelto
[12:26:30] <elukey>	 jayme: kubelet + docker up on ml-serve-ctrl!
[13:28:09] <elukey>	 I guess that now there is
[13:28:12] <elukey>	 1) https://gerrit.wikimedia.org/r/c/operations/homer/public/+/704104
[13:28:33] <elukey>	 2) helmfile -e <my_cluster> -l name=calico sync (to create the pods on ml-serve-ctrl)
[13:29:10] <elukey>	 I am following https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Networking, even if 2) then 1) looks more reasonable in my head
[13:29:19] <elukey>	 but I guess that both needs to be run close together
[13:34:50] <effie>	 mutante: I was looking at the decom tasks, as we need to make some room for mc* hosts, but DCops is out this week 
[13:44:05] <mutante>	 effie: ACK, no problem. I/we will remove some more (after the canary replacement). actually you could help us by reviewing Jelto's change that is part of replacing canaries, and meanwhile I will make some more patches 
[14:32:53] <jayme>	 elukey: hmm...hasn't calico already been scheduled on the master nodes?
[14:33:56] <jayme>	 read: I'd assume the sync is not necessary 
[14:35:27] <elukey>	 jayme: interesting, now I see the calico pod (via docker ps) only on ml-serve-ctrl1002
[14:35:44] <elukey>	 but on 1001 I see the /pause one
[14:35:49] <effie>	 mutante: k 
[14:36:35] <elukey>	 jayme: ah my bad it looks ok yes, no need for sync.. so probably the homer CR is enough
[14:39:34] <wikibugs>	 10serviceops, 10Release Pipeline: Production buster-nodejs10-devel image has npm 5.x, which is not actually compatible with node 10.x - https://phabricator.wikimedia.org/T284112 (10Jdforrester-WMF) >>! In T284112#7198157, @Legoktm wrote: > Is npm 5.x flat out incompatible with node10, or is it more subtle than...
[14:40:26] <jayme>	 double checked IPs - LGTM
[14:41:47] <elukey>	 <3
[14:44:11] <jayme>	 not sure if bird (inside of calico-node) will retry all the time, so maybe you need to kill the calico-node pods on the masters. But is would be nice to know if if just starts working after the homer CR is applied
[14:46:00] <elukey>	 jayme: ack yes, will report back once applied
[14:46:07] <jayme>	 cool
[15:38:04] <elukey>	 so the pods don't come up nicely, I see
[15:38:05] <elukey>	 bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
[15:38:19] <elukey>	 so I have forgot something for sure
[15:39:28] <jayme>	 elukey: homer change already merged?
[15:40:21] <elukey>	 ahhh there is profile::calico::kubernetes::bgp_peers in puppeeeeettt
[15:40:40] <elukey>	 jayme: yep! but the ibgp session stay in "Connect" so bird is not collaborating
[15:40:42] <jayme>	 ouch...sorry for not spotting
[15:42:01] <elukey>	 nono my bad
[15:42:42] <elukey>	 basically https://gerrit.wikimedia.org/r/c/operations/puppet/+/704131
[15:47:33] <jayme>	 elukey: you'll have to add those to the masters role as well
[15:47:56] <jayme>	 to have ferm open the ports
[15:48:12] <elukey>	 sigh right
[15:55:48] <dcausse>	 hi, I'd like to deploy the flink chart to staging tomorrow if this is ok with you
[16:02:27] <jayme>	 I think there is some work left for us to do. I can probably take a look tomorrow morning if that's okay
[16:02:31] <jayme>	 dcausse: 1
[16:03:18] <dcausse>	 jayme: it would be great, thanks!
[16:04:59] <elukey>	 jayme: bgp sessions established! \o/
[16:05:08] <jayme>	 elukey: yay!
[16:08:03] <jayme>	 dcausse: but ofc. feel free to merge already to have the chain green
[16:17:37] <dcausse>	 thanks, I'll merge tomorrow morning and make sure https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/693416 is green
[16:48:42] <wikibugs>	 10serviceops, 10SRE, 10Traffic, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10jijiki)
[16:48:56] <wikibugs>	 10serviceops, 10SRE, 10Traffic, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10jijiki)
[16:49:04] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10jijiki)
[16:54:39] <wikibugs>	 10serviceops, 10Maps, 10Patch-For-Review, 10User-jijiki: Deploy tegola-vector-tiles to kubernetes - https://phabricator.wikimedia.org/T283159 (10jijiki) tegola-vector-tiles is deployed to staging, but it is non functional as we need to create postgres users which are allowed to connect from the kubernetes...
[16:58:03] <wikibugs>	 10serviceops, 10Maps, 10User-jijiki: tegola-vector-tiles: load balancing reads between postgress servers - https://phabricator.wikimedia.org/T286494 (10jijiki)
[16:58:15] <wikibugs>	 10serviceops, 10Maps, 10User-jijiki: tegola-vector-tiles: load balancing reads between postgres servers - https://phabricator.wikimedia.org/T286494 (10jijiki)
[17:38:01] <wikibugs>	 10serviceops, 10Machine-Learning-Team, 10SRE, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) Kubelets / calico / bird are deployed on the ml-serve-ctrl nodes, but the istio webhook svc seems not reac...
[21:10:48] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE, 10Patch-For-Review, 10Release-Engineering-Team (Doing): Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (10jeena) a:05jeena→03None
[22:10:04] <wikibugs>	 10serviceops, 10Shellbox: Benchmark Shellbox - https://phabricator.wikimedia.org/T286384 (10Legoktm) thanks! Forked to https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&refresh=1m&from=now-6h&to=now   Note that `deployment="$namespace"` didn't work for me, I had to switch it to `kubernetes_namespace="...