[07:25:12] 10serviceops, 10MW-on-K8s, 10SRE, 10observability, 10Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe) The two attached patches implement proposal #3 Now we just need to create the appropriate topic, named `mediawiki.httpd.accesslog` on both ka...
[07:45:25] o>
[07:45:29] * claime yawns
[08:29:17] 10serviceops: deploy1002: Check for large files in client bucket - https://phabricator.wikimedia.org/T324437 (10Clement_Goubert)
[08:29:40] 10serviceops: deploy1002: Check for large files in client bucket - https://phabricator.wikimedia.org/T324437 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Low
[08:38:36] 10serviceops, 10MW-on-K8s, 10SRE, 10observability, 10Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Clement_Goubert) a:03Clement_Goubert
[09:12:16] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert)
[09:12:29] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) a:05Clement_Goubert→03None
[09:33:09] claime: o/
[09:33:38] :)
[09:34:04] saw the task passing by --^ - given the size of logs that the kafka topic will have to handle, I'd recommend having a number of partitions that is not the default (1, replicated three times)
[09:34:30] the number of partitions can be changed dynamically after the topic creation, so it can be done anytime
[09:34:57] but keep it in mind (so traffic will be balanced and you will be able to use more consumers downstream etc..)
[09:35:18] Did I forget to copy from the original task that we need a high partition number? Yes I did lol.
[09:35:19] and logstash will likely have an easier life pulling those msgs
[09:36:10] I don't know how to calculate the partition number, or enforce it at creation with the open topic creation. Advice?
[09:43:29] this is a good point, I think that keeping each partition at max 2-3k msg/s is probably better, but we'd also need to consider the number of brokers and the downstream consumers.. we should have the same number of partitions on each broker (so traffic in/out is spread evenly) and the consumers should be able to leverage the high number of partitions (like having one thread/process for each
[09:43:35] partition). Logstash should be good without a lot of fine tuning, but Cole will likely have more insights on that pipeline (very ignorant about it)
[09:44:25] a'ight, I'll add a bit more info to the task and I'll reach out to cole when he gets here
[09:46:23] super
[09:47:02] (didn't want to intrude in the task, but I worked on rebalancing topics in the past and it was so painful so now I try to warn people as much as I can :D)
[09:48:31] elukey: No worries, I'll add your input to the task, it's important to keep in mind, you're right
[09:50:57] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) As [[ https://phabricator.wikimedia.org/T265876#6559439 | noted in the parent task]], and quite an important infor...
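[Editor's note: to make elukey's sizing advice concrete, here is a minimal back-of-the-envelope sketch, not part of the conversation; the expected message rate and broker count are placeholder assumptions, and only the ~2-3k msg/s per partition guideline comes from the discussion above.]

```sh
# Rough partition-count sketch (placeholder numbers, not from the log)
expected_mps=10000        # assumed total msg/s once all traffic is on k8s
per_partition_mps=2500    # per-partition ceiling, from the 2-3k msg/s guideline
brokers=5                 # assumed broker count; round up to a multiple of this
                          # so partitions spread evenly across brokers

partitions=$(( (expected_mps + per_partition_mps - 1) / per_partition_mps ))
partitions=$(( ( (partitions + brokers - 1) / brokers ) * brokers ))
echo "suggested partitions: ${partitions}"
```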
[09:51:50] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert)
[09:53:57] the topic can be created manually anytime, the last time I used
[09:54:00] kafka topics --create --topic webrequest_sampled --partitions 3 --replication-factor 3
[09:54:03] for example
[09:54:07] (needs to be run on a kafka node)
[10:19:57] ack re: kafka topic, good point(s)
[10:20:14] do you know if I can re-enable puppet on phab1001 and then disable it again?
[10:20:21] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'T280597 - dzahn');
[10:20:41] Hmm that's probably for the phab failover the other day
[10:21:00] yeah looks like it
[10:21:34] sobanski: ^ ?
[10:22:19] Looking
[10:24:41] godog: phab1001 is on its way out and shouldn't be serving any actual purpose at this point. Why do we need to run Puppet on it?
[10:25:24] sobanski: I'm decom'ing icinga mgmt checks which depend on puppet running on the end hosts to actually remove the check
[10:25:57] (not a great system, agreed)
[10:26:30] :)
[10:27:55] Is it urgent or can it wait for Daniel to be around? This migration was a bit messy with disabling a few Phab features and I'm not 100% sure it'd be safe to do.
[10:28:25] Also, side question, if we decommissioned the host in the meantime, would you still be able to remove the icinga checks with it no longer physically present?
[10:28:41] Or do you need to be kept in the loop?
[10:29:24] on decom the host disappears from puppet so it is all good, no involvement on my part required yeah
[10:29:34] but yes it can wait
[10:30:15] If that's the case then it will disappear on its own soon enough anyway.
[10:30:23] mutante: ^ for when you are around, I'd need a single puppet run on phab1001 (or decom)
[10:30:29] If that's an acceptable solution and not blocking any other stuff.
[10:31:11] the puppet run would be best, unless the decom happens today, we'll see what Daniel says in a few hours
[10:31:30] 👍
[10:33:38] 10serviceops: Upgrade ICU version for MediaWiki in preparation to move to debian bullseye - https://phabricator.wikimedia.org/T324447 (10Joe)
[10:33:57] 10serviceops: Upgrade ICU version for MediaWiki in preparation to move to debian bullseye - https://phabricator.wikimedia.org/T324447 (10Joe) p:05Triage→03Low
[10:35:23] 10serviceops, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10fgiunchedi) FWIW while looking into something unrelated I found contint1001 has been crashed for two days
[11:12:47] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) Volume recommendation is apparently ~2-3k msg/s per partition, so we may want 5 partitions, not considering broker equil...
[11:21:14] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) p:05Triage→03Medium
[11:37:16] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) 05Open→03Resolved All done.
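[Editor's note: adapting elukey's quoted command to the new topic would look roughly like the sketch below. The topic name comes from T324439; the partition count is the provisional figure from Clement's task update, not a final decision, and as noted the command has to be run on a kafka-logging node.]

```sh
# Hypothetical adaptation of the command quoted above; numbers still under discussion
kafka topics --create --topic mediawiki.httpd.accesslog \
  --partitions 5 --replication-factor 3
```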
[11:42:42] <_joe_> elukey: 2-3k msg/s would mean 10 partitions probably :P
[11:42:52] <_joe_> not now, but when all traffic is moved
[11:43:07] <_joe_> also, can benthos also make use of multiple partitions?
[11:47:38] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) Amazing, thank you!
[12:38:18] going to deploy the limit bumps for thumbor to admin_ng
[12:38:54] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Clement_Goubert) Seems like it's always the same pod having trouble, it's the only one with events: ` Warning Unhealthy 33m (x19 over 4d19h) kubelet, kubernetes2017.codfw.wmnet Readiness prob...
[12:39:41] hnowlan: ack
[12:50:49] staging is done, but before I continue to anything prod-affecting: it's been a while since I've done an admin_ng change and I noticed that releases for unrelated services get recreated/updated when doing a sync. I assume this is all fine given that the change was just for pod limits, but is that anything to be concerned or conservative about in prod?
[12:52:01] hnowlan: what got recreated/updated in staging?
[12:52:02] How do you mean, for unrelated services?
[12:52:06] Jej
[12:53:00] jayme: https://phabricator.wikimedia.org/P42244
[12:53:07] they didn't show in the helmfile diff
[12:53:44] ah, okay. in that case they did not change. Did you run "helmfile -e ... apply" or "sync"?
[12:53:55] jayme: sync
[12:55:04] that's probably it then. You can go with "-i apply" to get the diff and a user prompt afterwards for confirmation. IIRC those updated releases should not show up then
[12:56:30] iirc sync will run upgrade for all releases
[12:56:32] sync will run "helm upgrade --install" for every release regardless of whether there is a diff or not. apply only does so for releases that have a diff
[12:56:44] ah, okay. thanks!
[13:02:30] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Clement_Goubert) Apparently that didn't fix it, it just moved the issue to a different pod...
[13:20:31] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: Enable traffic mirroring from codfw to eqiad - https://phabricator.wikimedia.org/T324459 (10Jgiannelos)
[13:29:41] 10serviceops, 10Content-Transform-Team-WIP, 10Maps, 10Patch-For-Review: Enable traffic mirroring from codfw to eqiad - https://phabricator.wikimedia.org/T324459 (10Jgiannelos) a:05jijiki→03Jgiannelos
[14:05:46] _joe_ yes benthos uses a consumer group so it can work with multiple partitions (but it needs to have multiple working threads of course)
[14:19:52] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10JMeybohm) There is a huge increase in 503 errors from envoy (service proxy): `upstream connect error or disconnect/reset before headers. reset reason: connection failure` but only in codfw: https...
[14:36:07] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Clement_Goubert) The sharp increase coincides with its redeployment https://sal.toolforge.org/log/OCvuyIQB8Fs0LHO5dmc0 I presume for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/...
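[Editor's note: a quick illustration of the sync/apply distinction jayme describes above, as a sketch only; the environment name is a placeholder, not the actual admin_ng invocation.]

```sh
# Show what would change, without applying anything
helmfile -e staging diff

# Diff first, prompt for confirmation, then upgrade only releases that differ
helmfile -e staging -i apply

# Run "helm upgrade --install" for every release, whether or not it has a diff
helmfile -e staging sync
```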
[14:44:17] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Joe) I also deployed in eqiad at the same time, so there's no reason for that really. We can try deploying it again, but I'd rather look at the traffic patterns.
[14:50:33] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10JMeybohm) It's mostly `/en.wikipedia.org/v1/page/random/title` that fails; overall req/s does not seem to have changed much.
[14:54:06] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Joe) Most of the errors are coming from a single pod: wikifeeds-production-67959bd8df-jwhks
[15:08:30] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), 10WMDE-TechWish-Sprint-2022-11-29: Migrate our draft charts to newer scaffolding - https://phabricator.wikimedia.org/T324471 (10awight)
[15:08:49] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), 10WMDE-TechWish-Sprint-2022-11-29: Migrate our draft charts to newer scaffolding - https://phabricator.wikimedia.org/T324471 (10awight) a:03awight
[15:21:57] thumbor limits have been increased, I'd like to bump the per-instance memory if anyone has a sec https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/864773
[17:08:17] godog: closing the loop, Daniel will decommission the host today
[17:52:54] 10serviceops, 10SRE, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10LSobanski)
[17:54:48] 10serviceops, 10SRE, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10akosiaris) 05Open→03Resolved a:03akosiaris ` ssh kubernetes1007.eqiad.wmnet dpkg -l docker.io |grep docker.io ii...
[18:01:48] godog: I will decom phab1001 today to solve that. But also, it was supposed to have 14 days of downtime so I wonder why it alerted you or where it got in the way
[18:34:18] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Dzahn)
[18:59:58] 10serviceops, 10SRE, 10Toolhub, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) 05Open→03Resolved a:03Legoktm
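[Editor's note: for reference, a minimal sketch of how the single misbehaving wikifeeds pod named above could be inspected; the namespace and the specific kubectl invocations are assumptions on my part, not taken from the log.]

```sh
# Inspect the pod blamed for most of the 503s (namespace assumed to be "wikifeeds")
POD=wikifeeds-production-67959bd8df-jwhks

kubectl -n wikifeeds describe pod "$POD"                                      # readiness-probe failures, restarts
kubectl -n wikifeeds get events --field-selector involvedObject.name="$POD"   # events for just this pod
kubectl -n wikifeeds logs "$POD" --all-containers --tail=100                  # recent logs from all containers
```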