[00:00:27] what's the best way to make sure it's still working as intended before proceeding to codfw/eqiad? the change should be a no-op but I would think that the change of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713934/4/helmfile.d/services/linkrecommendation/values.yaml wouldn't take effect until after the pod is kicked over?
[05:55:53] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2037.codfw.wmnet'...
[06:28:06] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2037.codfw.wmnet'] ` and were **ALL** successful.
[06:31:13] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki)
[06:31:24] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki)
[07:00:21] ryankemper: your apply has roll-restarted linkrecommendation (as far as I can tell now). Although the process of running helmfile checks for readiness of the new pod and rolls back to the previous state if it does not get ready
[07:13:24] vgutierrez: do you need the follow ups reviewed as well already? Or is that something for later™
[07:26:44] That would be great
[07:27:00] 10serviceops, 10SRE, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10Ladsgroup)
[07:35:42] damn :D
[07:42:00] O:)
[07:42:27] you'll get a few beers on the next IRL offsite
[07:44:15] we all will be sooo drunk on the next offsite :D .. the more the longer this pandemic goes
[07:44:30] 🍻
[07:53:59] beercounter++
[07:54:22] I hope that's an uint64..
[07:54:32] lol:)
[08:39:33] vgutierrez: do you plan on upgrading envoy to 1.16?
[08:39:36] 1.16+
[08:40:00] jayme: right now I'm using the envoy-future component, using 1.18.3 on my tests in our WCMS environment
[08:40:32] ii envoyproxy 1.18.3-1 amd64 Cloud-native high-performance edge/middle/service proxy
[08:40:45] vgutierrez: ah, okay. AIUI the API v3 stuff you added will work only with >= 1.16
[08:40:59] not sure what compatibility assumptions we currently have with the puppet code
[08:41:22] yeah, some stuff like OCSP stapling from prefetched responses are only available on V3
[08:41:31] and we definitely need that
[08:42:00] not only available in API v3 but also only available in API v3 with envoy 1.16+, that's what I mean
[08:42:26] so if someone comes around and configures v3 stuff with current fleet wide envoy version, it will break the config
[08:43:53] hmm yeah, but the existing profile uses v2
[08:44:20] we will be deploying our own.. profile::cache::envoy
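The compatibility worry at 08:42 (API v3-only features rendered for the older fleet-wide envoy would break the config) is essentially a version-gating problem, and it resurfaces just below as a suggestion for envoyproxy::tls_terminator. A minimal sketch of such a guard, written in Python purely for illustration since the real logic would live in the puppet profiles discussed here; the function names and the parsing of `envoy --version` output are assumptions, not existing WMF code:

```python
# Sketch of the kind of guard discussed in the log, not the real
# envoyproxy::tls_terminator implementation (which is puppet code).
import re
import subprocess


def installed_envoy_version():
    """Best-effort parse of `envoy --version` output into a comparable tuple."""
    out = subprocess.run(
        ["envoy", "--version"], capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", out)
    if match is None:
        raise RuntimeError("could not determine installed envoy version")
    return tuple(int(part) for part in match.groups())


def check_listener_config(uses_prefetched_ocsp: bool) -> None:
    """Refuse to render config that the installed envoy cannot load.

    OCSP stapling from prefetched responses is an API v3 feature that also
    needs envoy >= 1.16, so rendering it for an older fleet-wide binary
    would break the config at startup.
    """
    if uses_prefetched_ocsp and installed_envoy_version() < (1, 16, 0):
        raise RuntimeError("prefetched OCSP stapling requires envoy >= 1.16")
```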
[08:44:25] I see...then it's probably okay to just mention it somewhere
[08:50:40] BTW, our version requirement for the edge instance is actually > 1.17
[08:50:50] TLS handshake timeout has been released with 1.17.0
[08:53:32] maybe envoyproxy::tls_terminator should check for the envoy version installed (in case a TlsconfigV3 is provided) and fail if it
[08:53:57] oh well, ignore that
[08:54:51] your ocsp_path is optional, so it's totally possible to create API v3 config valid for envoy <1.16 I guess
[08:54:57] sure
[08:55:12] nothing that I've added affecting v2 is mandatory
[08:55:37] actually ocsp_path isn't included on the v2 listener template
[08:55:40] only in the v3 one
[08:56:05] oh sorry, you were mentioning v3 already :_)
[08:56:09] I know. I was just talking about the v3 one as we want to migrate to it as well
[08:56:13] :)
[08:56:51] so we both got carried away from thinking about the potential beers we could potentially have on a potential offsite :D
[08:58:45] 10serviceops, 10SRE, 10Traffic: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10jcrespo) Hey, @aborrero, I cannot speak on behalf of the traffic/netops/serviceops team, but given that large files -IIRC- have a different workflow than smaller ones (multi-part upload) and spe...
[09:00:58] 10serviceops, 10SRE, 10Traffic, 10Wikimedia-production-error: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10Majavah)
[09:35:50] 10serviceops, 10CX-cxserver, 10Wikidata, 10wdwb-tech, 10Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (10Addshore) Should we close this?
[10:29:45] 10serviceops, 10CX-cxserver, 10Wikidata, 10wdwb-tech, 10Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (10Nikerabbit) 05Open→03Resolved a...
[11:05:10] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2029.codfw.wmnet'...
[11:37:14] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2029.codfw.wmnet'] ` and were **ALL** successful.
[13:57:18] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2031.codfw.wmnet'...
[14:01:29] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) The helm binary in `helmfile` can be set using the `--helm-binary` option or by setting `helmBinary` in the `helmfile.yaml`. It can be set globally (like in [admin-ng](https://ger...
[14:25:20] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) >>! In T251305#7304745, @Jelto wrote: > The helm binary in `helmfile` can be set using the `--helm-binary` option or by setting `helmBinary` in the `helmfile.yaml`. > [...] Tha...
[14:34:21] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2031.codfw.wmnet'] ` and were **ALL** successful.
[15:13:02] 10serviceops, 10MW-on-K8s, 10SRE, 10SRE Observability: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10lmata)
[15:28:02] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (10JMeybohm) p:05Triage→03High
[19:31:16] maybe someone can take a look at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/714606 and figure out what I'm doing wrong
[19:31:37] https://integration.wikimedia.org/ci/job/helm-lint/4979/console shows that it's falling back to the chart's default of main_app.version to "score" instead of using the ones specified in helmfile.d
[19:32:43] 10serviceops, 10Prod-Kubernetes, 10Shellbox, 10Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (10Legoktm)
[19:39:59] 10serviceops, 10Prod-Kubernetes, 10Shellbox, 10Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (10Legoktm) Just to clarify, are these the logs that are stored at `/var/log/pods///*.log`? I think setting up a quick lo...
[19:43:20] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki)
[21:56:32] re my deployment-charts question, I wasn't doing anything wrong, it was CI using the published shellbox chart, not the one I had modified in Git
[22:45:21] Do we have any charts that add a sidecar to do log shipping? I'm pretty close to having ECS formatted log output from Toolhub to stderr. Now I need to figure out how to route that to the ELK cluster and I was kind of hoping that I don't have to implement my own Kafka client to do that.
[22:46:16] * bd808 assumes that Kafka queues are still the preferred input to ELK but may have missed a newer thing
[23:01:58] * legoktm was just discussing this earlier today
[23:03:26] flink-session-cluster appears to use log4j with kafka
[23:03:51] (just randomly grepping)
[23:04:07] bd808: note that all stderr from pods ends up in logstash automatically
[23:04:38] I don't know how the ECS part gets interpreted
[23:05:28] hmmm... do you know what routes stderr to ELK? Maybe I can figure out if any special magic is needed.
[23:05:51] I don't, let me see
[23:06:07] the ECS bit in my setup is just the log formatter emitting as json records with the expected ECS layout
[23:08:00] https://stackoverflow.com/questions/21102293/how-to-write-to-kafka-from-python-logging-module actually doesn't look too ugly if I need to send directly to kafka, but it would be nice to not care :)
[23:10:05] seems like it's rsyslog, per https://phabricator.wikimedia.org/T207200 and profile::rsyslog::kubernetes
[23:11:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/539978/ might be helpful, "parse nested json from mmkubernetes"
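The thread ends with stderr being picked up by rsyslog (mmkubernetes) and routed to logstash, so an in-process Kafka client is probably unnecessary for Toolhub. If sending directly to Kafka ever did become necessary, though, the approach in the stackoverflow link above amounts to a logging.Handler wrapping a producer. A rough sketch, assuming the kafka-python library; the broker address, topic name, and the exact ECS fields below are placeholders, not the actual ELK pipeline contract:

```python
# Illustrative "send directly to kafka" fallback; broker, topic and ECS
# field choices are assumptions for the sake of the example.
import datetime
import json
import logging

from kafka import KafkaProducer


class KafkaEcsHandler(logging.Handler):
    """Ship log records to a Kafka topic as ECS-style JSON documents."""

    def __init__(self, bootstrap_servers, topic):
        super().__init__()
        self.topic = topic
        self.producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
        )

    def emit(self, record):
        doc = {
            "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
            "message": record.getMessage(),
            "log": {"level": record.levelname.lower(), "logger": record.name},
            "ecs": {"version": "1.7.0"},
        }
        self.producer.send(self.topic, doc)


# Example wiring; hostnames and topic are hypothetical.
logger = logging.getLogger("toolhub")
logger.addHandler(KafkaEcsHandler(["kafka1001:9092"], "logstash-toolhub"))
```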