[00:22:59] serviceops, SRE, Patch-For-Review: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) Hourly appserver tests are running on both cumin1001 (to mw1414) and cumin2001 (to mw2271). Weirdly, the tests time out in eqiad about half the time: ` Aug 25 20:53:28 cumin1001 systemd[...
[06:38:36] bd808: wkandek: legoktm: There is deployment-charts/helmfile.d/services/_example_ which is/should be the building block for new service dirs
[06:42:12] and yes, the service.port.nodePort is from when we had services running without TLS/envoy initially. So assert(service.port.nodePort === tls.public_port) == false in some cases, still
[08:21:57] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) And some more input regarding RBAC and the replacement of Tiller service account: Currently two different users exist for a service deployment. One is the less privileged viewer...
[09:01:52] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm) Thanks for writing this up. As discussed already I'm in for option 2 as well as it keeps things "mostly" as they are. As you said someone with access to tiller (e.g. every depl...
[09:09:04] bd808: ping me for your memcached needs when you are around
[09:09:23] I am happy to sell you mcrouter if needed
[09:21:02] ...along with the brooklyn bridge. cheap.
[09:21:04] * apergos runs
[09:37:49] serviceops, MW-on-K8s, Release Pipeline, Release-Engineering-Team, Kubernetes: Unable to pull restricted/mediawiki-multiversion image to kubestage1002.eqiad.wmnet - https://phabricator.wikimedia.org/T289737 (JMeybohm)
[09:38:37] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (jijiki)
[09:46:35] serviceops, Platform Engineering, SRE, wikidiff2, Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (ArielGlenn) Platform Engineering will take this, but if there are complications, we'll be back...
[09:51:16] serviceops, Platform Engineering, SRE, wikidiff2, Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (jcrespo) Thank you @ArielGlenn
[09:58:21] serviceops, Observability-Logging, Prod-Kubernetes, Kubernetes: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 (JMeybohm) p:Triage→High
[12:06:42] serviceops, Observability-Logging, Prod-Kubernetes, Kubernetes: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 (JMeybohm)
[13:08:43] lol apergos, I am not selling other people's stuff !
[13:18:45] serviceops, Prod-Kubernetes, Shellbox, Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (JMeybohm)
[13:23:08] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) There is a ClusterRole named `deploy` already for the aggregation of `view` and `pods/portForward` permissions.
So I would prefer using the names `` and `
serviceops, Observability-Logging, Prod-Kubernetes, Kubernetes: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 (JMeybohm) AIUI the logs do reach Elasticsearch, but they lack the Kubernetes API metadata and therefore do...
[13:27:31] serviceops, Observability-Logging, Prod-Kubernetes, Kubernetes: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 (JMeybohm) And that since... ` /var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: mmkubernetes: fa...
[13:40:25] serviceops, Observability-Logging, Prod-Kubernetes, Kubernetes: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 (JMeybohm) Yeah...great: >>! From https://www.rsyslog.com/doc/master/configuration/modules/mmkubernetes.htm...
[13:47:38] serviceops, Prod-Kubernetes, Shellbox, Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (akosiaris) I've met a similar issue the other day on kubernetes1014. It was a wikifeeds pod running for a good 3 months, having restarted 1...
[13:58:58] serviceops, Prod-Kubernetes, Shellbox, Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (JMeybohm) Did you try the logrotate approach? >>! From https://kubernetes.io/docs/concepts/cluster-administration/logging/ > An important con...
[14:39:11] effie: o/ I'm finally awake. And with ~20 minutes until a wall of meetings hits me.
[14:40:38] Toolhub prefixes its keys, so we do not think we need to worry about namespace collisions in memcached. Usage should be very, very small by prod standards. Pretty much just a write through cache for sessions.
[14:47:59] jayme: thanks for the pointer!
My eye had completely missed that telling directory name.
[14:50:40] yw. Added to wikitech as well now
[14:55:19] bd808: let me discuss it a bit with service ops, if your dataset is very small
[14:55:31] it would be an overkill to shard it across 18 servers per DC
[14:56:14] otoh this infra is there, working, and you can easily replicate keys to both DCs if you want
[14:57:20] yeah, I guess I do not have any strong opinion other than wanting to make this as close to normal as possible so that folks don't find strange surprises because of my django app hiding in a sea of MW and node things. :)
[15:18:14] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-7), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (WDoranWMF)
[15:18:36] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-7), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (WDoranWMF) p:Triage→High
[15:26:36] serviceops, MW-on-K8s, Release Pipeline, Release-Engineering-Team, Kubernetes: Unable to pull restricted/mediawiki-multiversion image to kubestage1002.eqiad.wmnet - https://phabricator.wikimedia.org/T289737 (dancy)
[16:02:01] legoktm: I'm audio broken again/still. Will join in ~2m
[16:06:04] OK :)
[17:37:07] rzl: fyi https://phabricator.wikimedia.org/T289779 relates to new ldap groups for sre's. if that sounds good i can look to add the group tomorrow
[17:37:22] jbond: yeah, I was just looking! sounds great, thanks for doing the work
[17:37:37] cool and no problem
[17:37:42] cc arnoldokoth ^
[18:59:32] serviceops, GitLab, Release-Engineering-Team, User-brennen: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802 (brennen)