[00:00:27] what's the best way to make sure it's still working as intended before proceeding to codfw/eqiad? the change should be a no-op but I would think that the change of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713934/4/helmfile.d/services/linkrecommendation/values.yaml wouldn't take effect until after the pod is kicked over?
[05:55:53] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2037.codfw.wmnet'...
[06:28:06] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2037.codfw.wmnet'] ` and were **ALL** successful.
[06:31:13] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki)
[06:31:24] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki)
[07:00:21] ryankemper: your apply has roll-restarted linkrecommendation (as far as I can tell now). Although the process of running helmfile checks for readiness of the new pod and rolls back to the previous state if it does not get ready
[07:13:24] vgutierrez: do you need the follow ups reviewed as well already? Or is that something for later™
[07:26:44] That would be great
[07:27:00] 10serviceops, 10SRE, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10Ladsgroup)
[07:35:42] damn :D
[07:42:00] O:)
[07:42:27] you'll get a few beers on the next IRL offsite
[07:44:15] we all will be sooo drunk on the next offsite :D .. the more the longer this pandemic goes
[07:44:30] 🍻
[07:53:59] beercounter++
[07:54:22] I hope that's an uint64..
[07:54:32] lol:)
[08:39:33] vgutierrez: do you plan on upgrading envoy to 1.16?
[08:39:36] 1.16+
[08:40:00] jayme: right now I'm using the envoy-future component, using 1.18.3 on my tests in our WCMS environment
[08:40:32] ii envoyproxy 1.18.3-1 amd64 Cloud-native high-performance edge/middle/service proxy
[08:40:45] vgutierrez: ah, okay. AIUI the API v3 stuff you added will work only with >= 1.16
[08:40:59] not sure what compatibility assumptions we currently have with the puppet code
[08:41:22] yeah, some stuff like OCSP stapling from prefetched responses are only available on V3
[08:41:31] and we definitely need that
[08:42:00] not only available in API v3 but also only available in API v3 with envoy 1.16+, that's what I mean
[08:42:26] so if someone comes around and configures v3 stuff with current fleet wide envoy version, it will break the config
[08:43:53] hmm yeah, but the existing profile uses v2
[08:44:20] we will be deploying our own.. profile::cache::envoy
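The compatibility worry at 08:42 (API v3-only features rendered for the older fleet-wide envoy would break the config) is essentially a version-gating problem, and it resurfaces just below as a suggestion for envoyproxy::tls_terminator. A minimal sketch of such a guard, written in Python purely for illustration since the real logic would live in the puppet profiles discussed here; the function names and the parsing of `envoy --version` output are assumptions, not existing WMF code:

```python
# Sketch of the kind of guard discussed in the log, not the real
# envoyproxy::tls_terminator implementation (which is puppet code).
import re
import subprocess


def installed_envoy_version():
    """Best-effort parse of `envoy --version` output into a comparable tuple."""
    out = subprocess.run(
        ["envoy", "--version"], capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", out)
    if match is None:
        raise RuntimeError("could not determine installed envoy version")
    return tuple(int(part) for part in match.groups())


def check_listener_config(uses_prefetched_ocsp: bool) -> None:
    """Refuse to render config that the installed envoy cannot load.

    OCSP stapling from prefetched responses is an API v3 feature that also
    needs envoy >= 1.16, so rendering it for an older fleet-wide binary
    would break the config at startup.
    """
    if uses_prefetched_ocsp and installed_envoy_version() < (1, 16, 0):
        raise RuntimeError("prefetched OCSP stapling requires envoy >= 1.16")
```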
[08:44:25] I see...then it's probably okay to just mention it somewhere
[08:50:40] BTW, our version requirement for the edge instance is actually > 1.17
[08:50:50] TLS handshake timeout has been released with 1.17.0
[08:53:32] maybe envoyproxy::tls_terminator should check for the envoy version installed (in case a TlsconfigV3 is provided) and fail if it
[08:53:57] oh well, ignore that
[08:54:51] your ocsp_path is optional, so it's totally possible to create API v3 config valid for envoy <1.16 I guess
[08:54:57] sure
[08:55:12] nothing that I've added affecting v2 is mandatory
[08:55:37] actually ocsp_path isn't included on the v2 listener template
[08:55:40] only in the v3 one
[08:56:05] oh sorry, you were mentioning v3 already :_)
[08:56:09] I know. I was just talking about the v3 one as we want to migrate to it as well
[08:56:13] :)
[08:56:51] so we both got carried away from thinking about the potential beers we could potentially have on a potential offsite :D
[08:58:45] 10serviceops, 10SRE, 10Traffic: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10jcrespo) Hey, @aborrero, I cannot speak on behalf of the traffic/netops/serviceops team, but given that large files -IIRC- have a different workflow than smaller ones (multi-part upload) and spe...
[09:00:58] 10serviceops, 10SRE, 10Traffic, 10Wikimedia-production-error: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10Majavah)
[09:35:50] 10serviceops, 10CX-cxserver, 10Wikidata, 10wdwb-tech, 10Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (10Addshore) Should we close this?
[10:29:45] 10serviceops, 10CX-cxserver, 10Wikidata, 10wdwb-tech, 10Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (10Nikerabbit) 05Open→03Resolved a...
[11:05:10] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2029.codfw.wmnet'...
[11:37:14] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2029.codfw.wmnet'] ` and were **ALL** successful.
[13:57:18] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2031.codfw.wmnet'...
[14:01:29] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) The helm binary in `helmfile` can be set using the `--helm-binary` option or by setting `helmBinary` in the `helmfile.yaml`. It can be set globally (like in [admin-ng](https://ger...
[14:25:20] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) >>! In T251305#7304745, @Jelto wrote: > The helm binary in `helmfile` can be set using the `--helm-binary` option or by setting `helmBinary` in the `helmfile.yaml`. > [...] Tha...
[14:34:21] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2031.codfw.wmnet'] ` and were **ALL** successful.
[15:13:02] 10serviceops, 10MW-on-K8s, 10SRE, 10SRE Observability: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10lmata)
[15:28:02] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (10JMeybohm) p:05Triage→03High
[19:31:16] maybe someone can take a look at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/714606 and figure out what I'm doing wrong
[19:31:37] https://integration.wikimedia.org/ci/job/helm-lint/4979/console shows that it's falling back to the chart's default of main_app.version to "score" instead of using the ones specified in helmfile.d
[19:32:43] 10serviceops, 10Prod-Kubernetes, 10Shellbox, 10Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (10Legoktm)
[19:39:59] 10serviceops, 10Prod-Kubernetes, 10Shellbox, 10Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (10Legoktm) Just to clarify, are these the logs that are stored at `/var/log/pods///*.log`? I think setting up a quick lo...
[19:43:20] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki)
[21:56:32] re my deployment-charts question, I wasn't doing anything wrong, it was CI using the published shellbox chart, not the one I had modified in Git
[22:45:21] Do we have any charts that add a sidecar to do log shipping? I'm pretty close to having ECS formatted log output from Toolhub to stderr. Now I need to figure out how to route that to the ELK cluster and I was kind of hoping that I don't have to implement my own Kafka client to do that.
[22:46:16] * bd808 assumes that Kafka queues are still the preferred input to ELK but may have missed a newer thing
[23:01:58] * legoktm was just discussing this earlier today
[23:03:26] flink-session-cluster appears to use log4j with kafka
[23:03:51] (just randomly grepping)
[23:04:07] bd808: note that all stderr from pods ends up in logstash automatically
[23:04:38] I don't know how the ECS part gets interpreted
[23:05:28] hmmm... do you know what routes stderr to ELK? Maybe I can figure out if any special magic is needed.
[23:05:51] I don't, let me see
[23:06:07] the ECS bit in my setup is just the log formatter emitting as json records with the expected ECS layout
[23:08:00] https://stackoverflow.com/questions/21102293/how-to-write-to-kafka-from-python-logging-module actually doesn't look too ugly if I need to send directly to kafka, but it would be nice to not care :)
[23:10:05] seems like it's rsyslog, per https://phabricator.wikimedia.org/T207200 and profile::rsyslog::kubernetes
[23:11:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/539978/ might be helpful, "parse nested json from mmkubernetes"
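The thread ends with stderr being picked up by rsyslog (mmkubernetes) and routed to logstash, so an in-process Kafka client is probably unnecessary for Toolhub. If sending directly to Kafka ever did become necessary, though, the approach in the stackoverflow link above amounts to a logging.Handler wrapping a producer. A rough sketch, assuming the kafka-python library; the broker address, topic name, and the exact ECS fields below are placeholders, not the actual ELK pipeline contract:

```python
# Illustrative "send directly to kafka" fallback; broker, topic and ECS
# field choices are assumptions for the sake of the example.
import datetime
import json
import logging

from kafka import KafkaProducer


class KafkaEcsHandler(logging.Handler):
    """Ship log records to a Kafka topic as ECS-style JSON documents."""

    def __init__(self, bootstrap_servers, topic):
        super().__init__()
        self.topic = topic
        self.producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
        )

    def emit(self, record):
        doc = {
            "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
            "message": record.getMessage(),
            "log": {"level": record.levelname.lower(), "logger": record.name},
            "ecs": {"version": "1.7.0"},
        }
        self.producer.send(self.topic, doc)


# Example wiring; hostnames and topic are hypothetical.
logger = logging.getLogger("toolhub")
logger.addHandler(KafkaEcsHandler(["kafka1001:9092"], "logstash-toolhub"))
```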