[03:54:36] 06serviceops, 10MW-on-K8s, 10Observability-Logging: glogger produces invalid JSON when given input with non-printable characters - https://phabricator.wikimedia.org/T368640#9933335 (10colewhite) [07:09:50] 06serviceops, 06SRE, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9933422 (10SGupta-WMF) Thank you @Scott_French and @mforns . I re-ran the pipe... [07:13:24] 06serviceops, 06Infrastructure-Foundations, 10netops, 06Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9933423 (10ayounsi) IPIP encapsulation is a necessary step in the good direction, whatever solution we decide on for load balancing, for th... [07:25:21] 06serviceops: Migrate docker registry hosts to bookworm - https://phabricator.wikimedia.org/T332016#9933445 (10JMeybohm) [07:43:19] 06serviceops, 10Prod-Kubernetes, 13Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9933471 (10JMeybohm) Great news! I'd say that concludes this task. Thanks for all of the help and patience getting this over the finish... [07:43:28] 06serviceops, 10Prod-Kubernetes, 13Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9933474 (10JMeybohm) 05Open→03Resolved [09:28:08] 06serviceops, 10Prod-Kubernetes, 13Patch-For-Review: PodSecurityPolicies will be deprecated with Kubernetes 1.21 - https://phabricator.wikimedia.org/T273507#9933638 (10JMeybohm) [09:28:38] claime: sorry to pick on you but just wanted to check if you - or the team overall - are aware of the upcoming switch upgrades in Eqiad row e/f? 
[09:29:04] there are a handful of kubernetes hosts in racks E1 and E2 that we'll be doing next Tue/Wed [09:29:30] topranks: send me the task, I'll make sure we drain and cordon those nodes when you need them down [09:29:32] but some more in the other racks, Tue July 9th and Thu Jul 18th particularly busy [09:29:37] https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM/edit?gid=46473806#gid=46473806 [09:29:52] the tasks are linked on the tabs, if you look under 'all hosts' there for service ops you can get an overview [09:30:03] you're a gent thank you :) [09:33:06] hi folks, I know that you will probably hate me a little but I have a new envoy docker image to roll out (same version, just using Bookworm) [09:34:28] we're such a bothersome team elukey :P [09:35:05] aahahaha [09:35:20] topranks: is the data on the All Hosts tab pulled from the other tabs? [09:35:40] I've changed the server type, all the mw* servers there are actually kubernetes servers [09:35:42] so I see that the default version for k8s is defined in puppet, I'd say that we roll out a mesh change to some service as canary first.. preferences? [09:35:47] (not today) [09:37:37] claime: the answer is yes for the most part, but yep it's fine to change the server type for those [09:38:02] topranks: I've changed it on all the tabs, but I can't change the main tab [09:38:03] leave it with me - all the 'mw' hosts are kubernetes ones are they?
[09:38:09] yeah [09:38:59] leave it with me, it was all basically keyed off the hostname's prefix, I didn't properly anticipate we'd have a different 'type' than that but let me make sure all is good [09:39:12] what you did is fine I'll update the main tab, it might be just locked [09:40:39] 06serviceops: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214#9933671 (10JMeybohm) a:03JMeybohm [09:41:41] fixed that now, if I use this template again I'll change it so that's more flexible [09:41:43] cheers [09:41:54] 06serviceops, 10MW-on-K8s, 10Observability-Logging: glogger produces invalid JSON when given input with non-printable characters - https://phabricator.wikimedia.org/T368640#9933672 (10Joe) a:03Joe [09:44:09] topranks: ok I've added all of them to my calendar, if for some reason I'm not there just ping anyone in the team [09:44:43] great I will do - thanks a bunch for the help! [09:50:59] 06serviceops: serviceops kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#9933721 (10JMeybohm) a:03JMeybohm [09:51:34] 06serviceops: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#9933725 (10JMeybohm) [10:28:49] 06serviceops, 103D, 06Commons, 13Patch-For-Review, 07Regression: STL 3D models broken: "Sorry, the file Undefined cannot be displayed since it is not present on the current page." - https://phabricator.wikimedia.org/T368301#9933826 (10hnowlan) The thumbor-side issues were a side-effect of the upgrade tha... [10:28:50] 06serviceops, 06Infrastructure-Foundations, 10netops, 06Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9933828 (10cmooney) >>! In T368544#9933423, @ayounsi wrote: > An `ip route 0/0` rule would be needed to "clamp" the outbound MTU or MSS (us... 
[10:34:57] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9933846 (10MoritzMuehlenhoff) Let's directly install this server with Puppet 7, there should be no issues in the deployment-server manifests in terms of Puppet 5/7 compat at this point. [10:36:29] 06serviceops, 103D, 06Commons, 07Regression: STL 3D models broken: "Sorry, the file Undefined cannot be displayed since it is not present on the current page." - https://phabricator.wikimedia.org/T368301#9933850 (10TheDJ) Thank you hnowlan In ` route(fileName) {... [10:45:16] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1412 to wikikube-worker1027 completed: - mw1412 (**PASS*... [10:45:54] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye [10:50:58] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1413 to wikikube-worker1028 completed: - mw1413 (**PASS*... 
[10:51:38] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1028.eqiad.wmnet with OS bullseye [10:58:17] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1417 to wikikube-worker1029 completed: - mw1417 (**PASS*... [10:58:39] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye exec... [11:00:07] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1029.eqiad.wmnet with OS bullseye [11:02:00] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933879 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye [11:07:09] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye exec... 
[11:07:26] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye [11:09:03] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1418 to wikikube-worker1030 completed: - mw1418 (**PASS*... [11:11:39] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933923 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1030.eqiad.wmnet with OS bullseye [11:15:46] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933931 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1028.eqiad.wmnet with OS bullseye exec... [11:15:53] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933932 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1028.eqiad.wmnet with OS bullseye [11:17:46] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933935 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1450 to wikikube-worker1031 completed: - mw1450 (**PASS*... 
[11:18:33] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1031.eqiad.wmnet with OS bullseye [11:24:07] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933941 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1028.eqiad.wmnet with OS bullseye exec... [11:24:22] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933942 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1028.eqiad.wmnet with OS bullseye [11:29:46] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1029.eqiad.wmnet with OS bullseye exec... [11:30:03] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933947 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1029.eqiad.wmnet with OS bullseye [11:30:40] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye exec... 
[11:31:22] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9933953 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye [11:36:12] 06serviceops: Reduce disk usage of kafka-main - https://phabricator.wikimedia.org/T368714 (10JMeybohm) 03NEW [11:37:01] 06serviceops: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#9933991 (10JMeybohm) [11:37:02] 06serviceops: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214#9933990 (10JMeybohm) [11:37:03] 06serviceops: Reduce disk usage of kafka-main - https://phabricator.wikimedia.org/T368714#9933989 (10JMeybohm) [11:37:04] 06serviceops: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214#9933992 (10JMeybohm) 05Open→03Stalled [11:37:08] 06serviceops: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#9933995 (10JMeybohm) 05Open→03Stalled [11:38:36] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934012 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1030.eqiad.wmnet with OS bullseye executed with errors: - wi... 
[11:38:40] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934013 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1030.eqiad.wmnet with OS bullseye [11:44:14] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934031 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1029.eqiad.wmnet with OS bullseye executed with errors: - wi... [11:44:25] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934032 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1029.eqiad.wmnet with OS bullseye [11:54:06] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934055 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1031.eqiad.wmnet with OS bullseye completed: - wikikube-work... [12:05:26] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934085 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye executed with errors: - wi... 
[12:05:56] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye [12:06:13] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934089 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1028.eqiad.wmnet with OS bullseye completed: - wikikube-work... [12:14:04] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934115 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1030.eqiad.wmnet with OS bullseye completed: - wikikube-work... [12:28:21] 06serviceops: Reduce disk usage of kafka-main - https://phabricator.wikimedia.org/T368714#9934154 (10JMeybohm) p:05Triage→03High [12:35:38] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934217 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1029.eqiad.wmnet with OS bullseye executed with errors: - wi... 
[12:37:17] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934222 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1029.eqiad.wmnet with OS bullseye [12:44:46] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934237 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye executed with errors: - wi... [12:47:12] 06serviceops, 10Wikidata, 10wmde-wikidata-tech, 03Discovery-Search (Current work), 13Patch-For-Review: Ensure that WDQS query throttling does not interfere with federation - https://phabricator.wikimedia.org/T361950#9934238 (10dcausse) [12:47:16] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934241 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye [12:47:26] 06serviceops: Reduce disk usage of kafka-main - https://phabricator.wikimedia.org/T368714#9934243 (10JMeybohm) [12:47:30] 06serviceops: kafka-main replacement nodes don't fit kafka-main (storage wise) - https://phabricator.wikimedia.org/T368714#9934244 (10JMeybohm) [12:51:59] 06serviceops, 06SRE, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9934253 (10mforns) Yay! 
Thanks @SGupta-WMF [13:01:17] 06serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#9934300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bookworm [13:11:17] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1029.eqiad.wmnet with OS bullseye completed: - wikikube-work... [13:27:04] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934344 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye executed with errors: - wi... [13:30:07] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye [13:41:18] 06serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#9934411 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bookworm completed: - deploy1003 (**WARN**) - Downtimed on Icinga/Alertmanager... 
[13:42:18] 06serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#9934424 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye [13:46:37] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2300 to wikikube-worker2026 completed: - mw2300 (**WARN**... [13:59:41] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934493 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2298 to wikikube-worker2025 completed: - mw2298 (**PASS**... [14:05:27] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934505 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1027.eqiad.wmnet with OS bullseye comp... [14:07:35] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2306 to wikikube-worker2027 completed: - mw2306 (**PASS**... [14:10:47] 06serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#9934528 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye completed: - deploy1003 (**PASS**) - Downtimed on Icinga/Alertmanager... 
[14:22:31] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2308 to wikikube-worker2028 completed: - mw2308 (**PASS**... [14:27:39] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2330 to wikikube-worker2029 completed: - mw2330 (**PASS**... [14:28:29] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2025.codfw.wmnet with OS bullseye [14:29:25] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934618 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2027.codfw.wmnet with OS bullseye [14:30:21] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2028.codfw.wmnet with OS bullseye [14:30:43] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2029.codfw.wmnet with OS bullseye [15:07:17] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from 
the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2025.codfw.wmnet with OS bullseye compl... [15:09:33] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934813 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2029.codfw.wmnet with OS bullseye compl... [15:10:35] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934816 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2028.codfw.wmnet with OS bullseye compl... [15:11:19] 06serviceops, 06DC-Ops, 10ops-eqiad, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T368639#9934831 (10Clement_Goubert) [15:14:28] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9934852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2027.codfw.wmnet with OS bullseye compl... [15:14:34] swfrench-wmf: so, with your patch yesterday, it looks like all our mediawiki traffic is under service name `mediawiki-main` in jaeger. 
previously it was using the k8s namespace which meant you had separate `mw-web` and `mw-api-ext` service names searchable, for instance [15:15:14] it's not a huge deal, mostly I'm just wondering what the better UX is for users of tracing [15:17:06] cdanis: I think we should find a way to keep the k8s namespace [15:17:19] claime: it is an attribute (albeit in a kind of confusing way), it's just not in the service name [15:17:20] Especially now that we're rather clearly delimiting use cases by cluster [15:17:26] and yeah, I'm leaning towards that as well [15:17:39] service name is like *the* top-level field in traces, you *have* to specify one to do a search for instance [15:18:27] to be fair, it was me deploying your patch :) [15:18:39] right [15:18:44] haha [15:18:50] but yeah, if we can get "something that includes release" (i.e., canary or not) + namespace, that would be great [15:18:56] (I'm kidding of course) [15:19:04] (about my first comment) [15:19:04] is canary actually important to include here? 
[15:19:23] you'll be able to see it in the `node_id` field of a span, for instance [15:19:24] we do use it as a stage of release [15:19:55] IMO, we should be doing a longer canary soak (which makes the differentiation more relevant), but that's kind of a separate problem [15:19:56] it would be easy to include `canary_server=1` or something as an attribute using an ottl rule [15:20:11] so you could search based on that within `mw-api-ext` service name, or whatever [15:20:31] idk, some of this question is understanding what our most important tracing use cases are [15:21:12] as is right now we have `mediawiki-canary` service with all of web, api-{int,ext}, etc, under it, which also seems not ideal [15:21:34] oh and `mw-debug` has instead become `mediawiki-pinkunicorn` [15:22:38] yeah it's app-release [15:22:46] $app-$release [15:22:49] yeah [15:25:04] +1 to having canary be an easily usable attribute, but entirely agreed that the logical service (~ namespace) should have greater priority / visibility from a UX perspective [15:25:47] cdanis: is this causing problems with existing tracing use cases in a way where I should look at how to split out the local_service change? [15:25:57] swfrench-wmf: it's not urgent [15:26:07] cool cool, just wanted to confirm :) [15:26:13] and the plan I had laid out wrt the LOCAL_ change did allow us flexibility [15:26:37] it's just the old default was grabbing the namespace as the fallback, which I liked and which j.ayme didn't ;) [15:29:22] 06serviceops, 06DC-Ops, 10ops-codfw, 10Prod-Kubernetes, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T368743 (10hnowlan) 03NEW [15:31:44] is it possible that the way we deploy mediawiki might just be hard to support with the defaults that generally make sense for other services?
[15:31:59] yes [15:32:10] i.e., some other escape hatch for specifying this explicitly (rather than trying to construct it) makes sense [15:32:33] I can think of a couple of other services that all derive from the same chart, and might all have the same release name [15:33:42] e.g., I think the AQS 2.0 services might be in a similar situation (where LOCAL_{app + release name} does not distinguish them) [15:33:58] but namespace does? [15:36:39] yeah, namespace is the service (e.g., "media-analytics") whereas I think they would all end up with "aqs-http-gateway(-main)" as the release name [15:36:48] yeah... [15:37:26] well, it's even more complicated than that, actually [15:37:54] they would get aqs-http-gateway-main as the servicename on the spans that come from their local Envoy tls terminator, if we enabled tracing on there (we haven't yet, for many services) [15:38:26] but they'll *also* get a span with whatever the upstream_cluster name is for that AQS service, when a mediawiki pod runs a request against them, or another pod of another service that uses the service mesh to reach them [15:38:50] (and that's why for instance we also have service names like `mw-api-int-async` appearing in traces) [15:41:02] ah, interesting! yeah, that makes sense - I'd not thought deeply about consistency of identifiers across hosts/contexts, but I could see this getting tricky [15:41:59] it's also okay if the names are different sometimes [15:45:18] 06serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9935038 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2026.codfw.wmnet with OS bullseye [15:45:22] but it's ideal if we have tracing enabled on both 'sides' of wherever that's true, so we get both names searchable [15:48:01] swfrench-wmf: can you think of any other services offhand where $app-$release hides a lot of data?
[15:51:56] I guess I can figure out how ot dog through various metrics of he k8s api itself for this [15:52:01] s/ot dog/to dig/ [15:52:07] the services that use the aqs-http-gateway chart (7-ish services?) are the only ones that come to mind at the moment ... I'll give it some thought and let you know if I can think of more [15:52:08] s/of he/or the/ [15:53:07] heh, yeah I was going to just cobble together a prom query to give me all existing combos of namespace and LOCAL_ upstream cluster :) [15:54:24] or local_service for older mesh deployments ;) [15:55:01] exactly, yeah [15:55:18] which come to think of it, AQS 2.0 services are probably still on [15:55:59] yup, mesh.configuration 1.7.0 [15:58:24] swfrench-wmf: eventgate is another such app [15:58:33] and miscweb although that's not as concerning [15:59:12] ah, yeah that makes sense [16:01:39] and shellbox :D [16:01:57] https://grafana.wikimedia.org/goto/PqgxJxQIg?orgId=1 [16:02:46] ah, yeah shellbox indeed might already be on 1.7.1 [16:04:44] alright, so I guess I would say this is a general enough problem that mediawiki doesn't need a one-off solution [16:06:20] the flink app names are also... interesting [16:11:44] 06serviceops, 06SRE, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9935144 (10Scott_French) Thanks so much, @SGupta-WMF. Alright, so I think we'...
[16:12:10] 06serviceops, 06SRE, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9935154 (10Scott_French) [16:24:58] <_joe_> cdanis: re: Node stanza in the envoy conf [16:25:04] <_joe_> we 100% need to start adding it [16:25:57] Envoy populates node_id from the hostname, so you get the pod name _joe_ _joe_ [16:26:17] and also since the trace comes from the pod IP, the otel collector also adds several attributes [16:26:18] <_joe_> cdanis: envoy allows you to add a "node" stanza to your config [16:26:39] <_joe_> so we can add information there, including ensuring node_id is what you want [16:27:57] I had wanted to add zone as well :) [16:28:02] but let’s discuss next week
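Editor's note on the service-name thread (15:14 to 16:04): the point raised there is that a trace service name built as `$app-$release` folds several k8s namespaces into a single name, while the namespace keeps them apart. A minimal Python sketch of that comparison follows. The namespace and release names come from the conversation, but the `DEPLOYMENTS` list and all function names are hypothetical illustrations, not real cluster state and not the mesh chart's actual logic.

```python
from collections import defaultdict

# Hypothetical sample of (namespace, app, release) tuples. The individual
# names appear in the discussion; the list itself is made up for illustration.
DEPLOYMENTS = [
    ("mw-web", "mediawiki", "main"),
    ("mw-api-ext", "mediawiki", "main"),
    ("mw-web", "mediawiki", "canary"),
    ("mw-api-ext", "mediawiki", "canary"),
    ("mw-debug", "mediawiki", "pinkunicorn"),
    ("media-analytics", "aqs-http-gateway", "main"),
]


def by_app_release(namespace, app, release):
    """Service name in the $app-$release style discussed in the log."""
    return f"{app}-{release}"


def by_namespace(namespace, app, release):
    """Service name taken from the k8s namespace (the earlier fallback)."""
    return namespace


def group(deployments, scheme):
    """Map each derived service name to the set of namespaces behind it."""
    groups = defaultdict(set)
    for namespace, app, release in deployments:
        groups[scheme(namespace, app, release)].add(namespace)
    return dict(groups)


def ambiguous(groups):
    """Keep only service names that hide more than one namespace."""
    return {name: sorted(nss) for name, nss in groups.items() if len(nss) > 1}


if __name__ == "__main__":
    print(ambiguous(group(DEPLOYMENTS, by_app_release)))
    print(ambiguous(group(DEPLOYMENTS, by_namespace)))
```

Against real data, the same grouping could be fed from the "combos of namespace and LOCAL_ upstream cluster" query mentioned at 15:53:07 instead of a hand-written list.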