[07:25:12] 10serviceops, 10MW-on-K8s, 10SRE, 10observability, 10Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe) The two attached patches implement proposal #3 Now we just need to create the appropriate topic, named `mediawiki.httpd.accesslog` on both ka...
[07:45:25] o>
[07:45:29] * claime yawns
[08:29:17] 10serviceops: deploy1002: Check for large files in client bucket - https://phabricator.wikimedia.org/T324437 (10Clement_Goubert)
[08:29:40] 10serviceops: deploy1002: Check for large files in client bucket - https://phabricator.wikimedia.org/T324437 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Low
[08:38:36] 10serviceops, 10MW-on-K8s, 10SRE, 10observability, 10Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Clement_Goubert) a:03Clement_Goubert
[09:12:16] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert)
[09:12:29] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) a:05Clement_Goubert→03None
[09:33:09] claime: o/
[09:33:38] :)
[09:34:04] saw the task passing by --^ - given the size of logs that the kafka topic will have to handle, I'd recommend having a number of partitions that is not the default (1, replicated three times)
[09:34:30] the number of partitions can be changed dynamically after the topic creation, so it can be done anytime
[09:34:57] but keep it in mind (so traffic will be balanced and you will be able to use more consumers downstream etc..)
[09:35:18] Did I forget to copy from the original task that we need a high partition number? Yes I did lol.
[09:35:19] and logstash will likely have an easier life pulling those msgs
[09:36:10] I don't know how to calculate the partition number, or enforce it at creation with the open topic creation. Advice?
[09:43:29] this is a good point, I think that keeping each partition at max 2-3k msg/s is probably better, but we'd also need to consider the number of brokers and the downstream consumers.. we should have the same number of partitions on each broker (so traffic in/out is spread evenly) and the consumers should be able to leverage the high number of partitions (like having one thread/process for each
[09:43:35] partition). Logstash should be good without a lot of fine tuning, but Cole will likely have more insights on that pipeline (very ignorant about it)
[09:44:25] a'ight, I'll add a bit more info to the task and I'll reach out to cole when he gets here
[09:46:23] super
[09:47:02] (didn't want to intrude in the task, but I worked on rebalancing topics in the past and it was so painful so now I try to warn people as much as I can :D)
[09:48:31] elukey: No worries, I'll add your input to the task, it's important to keep in mind, you're right
[09:50:57] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) As [[ https://phabricator.wikimedia.org/T265876#6559439 | noted in the parent task]], and quite an important infor...
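[Editor's note: to make elukey's sizing advice concrete, here is a minimal back-of-the-envelope sketch, not part of the conversation; the expected message rate and broker count are placeholder assumptions, and only the ~2-3k msg/s per partition guideline comes from the discussion above.]

```sh
# Rough partition-count sketch (placeholder numbers, not from the log)
expected_mps=10000        # assumed total msg/s once all traffic is on k8s
per_partition_mps=2500    # per-partition ceiling, from the 2-3k msg/s guideline
brokers=5                 # assumed broker count; round up to a multiple of this
                          # so partitions spread evenly across brokers

partitions=$(( (expected_mps + per_partition_mps - 1) / per_partition_mps ))
partitions=$(( ( (partitions + brokers - 1) / brokers ) * brokers ))
echo "suggested partitions: ${partitions}"
```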
[09:51:50] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert)
[09:53:57] the topic can be created manually anytime, the last time I used
[09:54:00] kafka topics --create --topic webrequest_sampled --partitions 3 --replication-factor 3
[09:54:03] for example
[09:54:07] (needs to be run on a kafka node)
[10:19:57] ack re: kafka topic, good point(s)
[10:20:14] do you know if I can re-enable puppet on phab1001 and then disable it again?
[10:20:21] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'T280597 - dzahn');
[10:20:41] Hmm that's probably for the phab failover the other day
[10:21:00] yeah looks like it
[10:21:34] sobanski: ^ ?
[10:22:19] Looking
[10:24:41] godog: phab1001 is on its way out and shouldn't be serving any actual purpose at this point. Why do we need to run Puppet on it?
[10:25:24] sobanski: I'm decom'ing icinga mgmt checks which depend on puppet running on the end hosts to actually remove the check
[10:25:57] (not a great system, agreed)
[10:26:30] :)
[10:27:55] Is it urgent or can it wait for Daniel to be around? This migration was a bit messy with disabling a few Phab features and I'm not 100% sure it'd be safe to do.
[10:28:25] Also, side question, if we decommissioned the host in the meantime, would you still be able to remove the icinga checks with it no longer physically present?
[10:28:41] Or do you need to be kept in the loop?
[10:29:24] on decom the host disappears from puppet so it is all good, no involvement on my part required yeah
[10:29:34] but yes it can wait
[10:30:15] If that's the case then it will disappear on its own soon enough anyway.
[10:30:23] mutante: ^ for when you are around, I'd need a single puppet run on phab1001 (or decom)
[10:30:29] If that's an acceptable solution and not blocking any other stuff.
[10:31:11] the puppet run would be best, unless the decom happens today, we'll see what Daniel says in a few hours
[10:31:30] 👍
[10:33:38] 10serviceops: Upgrade ICU version for MediaWiki in preparation to move to debian bullseye - https://phabricator.wikimedia.org/T324447 (10Joe)
[10:33:57] 10serviceops: Upgrade ICU version for MediaWiki in preparation to move to debian bullseye - https://phabricator.wikimedia.org/T324447 (10Joe) p:05Triage→03Low
[10:35:23] 10serviceops, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10fgiunchedi) FWIW while looking into something unrelated I found contint1001 has been crashed for two days
[11:12:47] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) Volume recommendation is apparently ~2-3k msg/s per partition, so we may want 5 partitions, not considering broker equil...
[11:21:14] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) p:05Triage→03Medium
[11:37:16] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) 05Open→03Resolved All done.
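[Editor's note: adapting elukey's quoted command to the new topic would look roughly like the sketch below. The topic name comes from T324439; the partition count is the provisional figure from Clement's task update, not a final decision, and as noted the command has to be run on a kafka-logging node.]

```sh
# Hypothetical adaptation of the command quoted above; numbers still under discussion
kafka topics --create --topic mediawiki.httpd.accesslog \
  --partitions 5 --replication-factor 3
```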
[11:42:42] <_joe_> elukey: 2-3k msg/s would mean 10 partitions probably :P
[11:42:52] <_joe_> not now, but when all traffic is moved
[11:43:07] <_joe_> also, can benthos also make use of multiple partitions?
[11:47:38] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) Amazing, thank you!
[12:38:18] going to deploy the limit bumps for thumbor to admin_ng
[12:38:54] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Clement_Goubert) Seems like it's always the same pod having trouble, it's the only one with events: ` Warning Unhealthy 33m (x19 over 4d19h) kubelet, kubernetes2017.codfw.wmnet Readiness prob...
[12:39:41] hnowlan: ack
[12:50:49] staging is done, but before I continue to anything prod-affecting: it's been a while since I've done an admin_ng change and I noticed that releases for unrelated services get recreated/updated when doing a sync. I assume this is all fine given that the change was just for pod limits, but is that anything to be concerned or conservative about in prod?
[12:52:01] hnowlan: what got recreated/updated in staging?
[12:52:02] How do you mean, for unrelated services?
[12:52:06] Jej
[12:53:00] jayme: https://phabricator.wikimedia.org/P42244
[12:53:07] they didn't show in the helmfile diff
[12:53:44] ah, okay. in that case they did not change. Did you run "helmfile -e ... apply" or "sync"?
[12:53:55] jayme: sync
[12:55:04] that's probably it then. You can go with "-i apply" to get the diff and a user prompt afterwards for confirmation. IIRC those updated releases should not show up then
[12:56:30] iirc sync will run upgrade for all releases
[12:56:32] sync will run "helm upgrade --install" for every release regardless of whether there is a diff or not. apply only does so for releases that have a diff
[12:56:44] ah, okay. thanks!
[13:02:30] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Clement_Goubert) Apparently that didn't fix it, it just moved the issue to a different pod...
[13:20:31] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: Enable traffic mirroring from codfw to eqiad - https://phabricator.wikimedia.org/T324459 (10Jgiannelos)
[13:29:41] 10serviceops, 10Content-Transform-Team-WIP, 10Maps, 10Patch-For-Review: Enable traffic mirroring from codfw to eqiad - https://phabricator.wikimedia.org/T324459 (10Jgiannelos) a:05jijiki→03Jgiannelos
[14:05:46] _joe_ yes benthos uses a consumer group so it can work with multiple partitions (but it needs to have multiple working threads of course)
[14:19:52] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10JMeybohm) There is a huge increase in 503 errors from envoy (service proxy): `upstream connect error or disconnect/reset before headers. reset reason: connection failure` but only in codfw: https...
[14:36:07] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Clement_Goubert) The sharp increase coincides with its redeployment https://sal.toolforge.org/log/OCvuyIQB8Fs0LHO5dmc0 I presume for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/...
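[Editor's note: a quick illustration of the sync/apply distinction jayme describes above, as a sketch only; the environment name is a placeholder, not the actual admin_ng invocation.]

```sh
# Show what would change, without applying anything
helmfile -e staging diff

# Diff first, prompt for confirmation, then upgrade only releases that differ
helmfile -e staging -i apply

# Run "helm upgrade --install" for every release, whether or not it has a diff
helmfile -e staging sync
```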
[14:44:17] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Joe) I also deployed in eqiad at the same time, so there's no reason for that really. We can try deploying it again, but I'd rather look at the traffic patterns.
[14:50:33] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10JMeybohm) It's mostly `/en.wikipedia.org/v1/page/random/title` that fails; overall req/s does not seem to have changed much.
[14:54:06] 10serviceops, 10Wikifeeds: wikifeeds.svc.codfw.wmnet flapping alerts - https://phabricator.wikimedia.org/T324412 (10Joe) Most of the errors are coming from a single pod: wikifeeds-production-67959bd8df-jwhks
[15:08:30] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), 10WMDE-TechWish-Sprint-2022-11-29: Migrate our draft charts to newer scaffolding - https://phabricator.wikimedia.org/T324471 (10awight)
[15:08:49] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), 10WMDE-TechWish-Sprint-2022-11-29: Migrate our draft charts to newer scaffolding - https://phabricator.wikimedia.org/T324471 (10awight) a:03awight
[15:21:57] thumbor limits have been increased, I'd like to bump the per-instance memory if anyone has a sec https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/864773
[17:08:17] godog: closing the loop, Daniel will decommission the host today
[17:52:54] 10serviceops, 10SRE, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10LSobanski)
[17:54:48] 10serviceops, 10SRE, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10akosiaris) 05Open→03Resolved a:03akosiaris ` ssh kubernetes1007.eqiad.wmnet dpkg -l docker.io |grep docker.io ii...
[18:01:48] godog: I will decom phab1001 today to solve that. But also, it was supposed to have 14 days of downtime so I wonder why it alerted you or where it got in the way
[18:34:18] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Dzahn)
[18:59:58] 10serviceops, 10SRE, 10Toolhub, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) 05Open→03Resolved a:03Legoktm
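[Editor's note: for reference, a minimal sketch of how the single misbehaving wikifeeds pod named above could be inspected; the namespace and the specific kubectl invocations are assumptions on my part, not taken from the log.]

```sh
# Inspect the pod blamed for most of the 503s (namespace assumed to be "wikifeeds")
POD=wikifeeds-production-67959bd8df-jwhks

kubectl -n wikifeeds describe pod "$POD"                                      # readiness-probe failures, restarts
kubectl -n wikifeeds get events --field-selector involvedObject.name="$POD"   # events for just this pod
kubectl -n wikifeeds logs "$POD" --all-containers --tail=100                  # recent logs from all containers
```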