[07:53:10] hello folks
[07:53:22] I have merged the last refactoring patch for changeprop
[07:54:04] deployed in staging and checked all the diffs, it is basically a no-op except some comments removed etc.. I'd avoid to deploy all cp instances just for those values, but lemme know if you prefer so
[09:28:01] serviceops, CirrusSearch, Discovery-Search: Envoy telemetry not available for cirrus-streaming-updater@staging-eqiad - https://phabricator.wikimedia.org/T353224 (dcausse)
[09:40:22] serviceops, CirrusSearch, Discovery-Search: Envoy telemetry not available for cirrus-streaming-updater@staging-eqiad - https://phabricator.wikimedia.org/T353224 (dcausse)
[09:55:08] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984 (ayounsi)
[09:55:12] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) Stalled→Resolved Automation is up and running. Doc updated: https://wikitech.wikimedia.org/w/in...
[09:55:20] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (ayounsi)
[09:59:16] serviceops, Prod-Kubernetes, Kubernetes: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (JMeybohm)
[10:15:53] elukey: sounds good
[10:16:18] hnowlan: ack thanks!
[10:54:22] serviceops, Kubernetes: kubernetes2047 lost all pods (unhealthy) - https://phabricator.wikimedia.org/T353233 (Jelto)
[11:50:11] serviceops, Kubernetes: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (JMeybohm)
[12:43:08] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (Clement_Goubert)
[12:43:30] serviceops, SRE: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (Clement_Goubert) Open→Resolved Nodes are in production.
[12:51:49] serviceops, Kubernetes: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (ops-monitoring-bot) VM kubemaster2001.codfw.wmnet rebooted by jayme@cumin2002 with reason: increase from 4G to 12G
[12:55:57] serviceops, Kubernetes: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (ops-monitoring-bot) VM kubemaster1001.eqiad.wmnet rebooted by jayme@cumin2002 with reason: increase from 4G to 12G
[13:09:28] serviceops, Kubernetes: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (ops-monitoring-bot) VM kubemaster2002.codfw.wmnet rebooted by jayme@cumin2002 with reason: increase from 4G to 12G
[13:09:53] serviceops, Kubernetes: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (ops-monitoring-bot) VM kubemaster1002.eqiad.wmnet rebooted by jayme@cumin2002 with reason: increase from 4G to 12G
[13:23:19] serviceops, Kubernetes: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (JMeybohm)
[13:45:33] jelto: eoghan: hello :) May I get a merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/922555 please?
That is to rename the Unix user used by Jenkins :)
[13:46:19] I have rereviewed the list of steps I have to do and which is in the commit message
[13:46:57] I will do a host at a time and I don't anticipate much issues (the primary use case is for pipeline lib jobs and lot of them have been moved to gitlab)
[14:00:49] hashar: I'm quite busy with a GitLab security topic probably most of today. We have office hours in 2 hours. Are you able to join them? I added the topic to our office hours and forwarded the invite
[14:01:02] I just need a merge :)
[14:01:48] I will not be available later this afternoon though :)
[14:01:51] I will ask around
[14:07:06] the submitted patch is totally new for me and I'm not on reviewer or cc. I'd like to take a look before merging this. So earliest would be tomorrow
[14:07:35] I'll take a look shortly, just grabbing some lunch now
[14:14:10] it is a bit messy :) it is some cleanup I have encountered with John and I kind of forgot about the patch
[14:15:35] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (VRiley-WMF) sessionstore1004 Rack: A3 U: 23 CableID: 1865 Port: 21 sessionstore1005 Rack: C5 U:29 CableID: 1957 Port: 30 sessionstore1006 Rack: D6 U: 40 CableID: 5...
[14:15:40] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (VRiley-WMF)
[14:22:02] serviceops, Kubernetes, Patch-For-Review: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (JMeybohm) a:JMeybohm
[14:28:44] eoghan: I will ask later tonight (I am jetlagged). I might have to grab kid from school in an hour or so which is probably too short of a window
[14:29:23] Ok. I'm looking through it now, will throw any questions into the change.
[14:30:58] I guess that can help build confidence :)
[14:41:58] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jclark-ctr)
[14:46:21] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (VRiley-WMF)
[14:46:45] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jclark-ctr) @Jhancock.wm I finished imaging these if you want to verify anything before closing out ticket
[14:46:50] hashar: Is it possible to do this one instance at a time? i.e., disable puppet on both, re-enable on one and work through the steps, then re-enable on the other?
[14:47:23] yeah kind of
[14:48:19] the idea I had was to first disconnect the Jenkins agent
[14:48:38] and yes run puppet instance by instance
[14:48:43] then reconnect the jenkins agents
[14:55:24] And the expected outcome of this is that we'll temporarily end up with two users, until we put out a new change removing the old user? Or do you intend to do the usermod command that jbond mentioned
[14:55:42] manual deletion
[14:56:08] to save on having to craft a change to ensure => absent the users
[14:56:15] then another change to remove it
[14:56:50] serviceops, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (Jclark-ctr)
[14:57:25] hashar: Ok. Will you be around in 15 minutes? I need to pick up the kids but I can merge it when I get back and we can walk through deployment in case anything surprising pops up.
[14:59:08] it is going to be late, I have to grab my kids too :)
[14:59:32] Ok.
I'm not going to merge it until we're both around, but drop me a message when it suits you and we can work on it
[14:59:32] we can do it tomorrow though, anytime will work
[14:59:42] Sure
[14:59:46] it is probably more reasonable :)
[15:01:19] eoghan: would it work in the morning when you join?
[15:03:57] serviceops, WMF-JobQueue, Patch-For-Review, Wikimedia-production-error: Make changeprop-jobqueue error handling/httpbb tests better behaved: Uncaught Error: Class 'MWExceptionHandler' not found in /srv/mediawiki/rpc/RunSingleJob.php:42 - https://phabricator.wikimedia.org/T352265 (matmarex) a:m...
[15:22:12] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm) Open→Resolved @Jclark-ctr looks good ty for your help! @Eevans all yours
[15:23:42] serviceops, Machine-Learning-Team: Bump istio and Cert Manager Docker images to Bullseye - https://phabricator.wikimedia.org/T351933 (elukey) Open→Resolved a:elukey
[15:46:50] serviceops, Prod-Kubernetes, Kubernetes, Technical-Debt: Find a replacement for the unmaintained eventrouter - https://phabricator.wikimedia.org/T343787 (fgiunchedi) Untagging o11y as there doesn't seem any immediate action, please reach out otherwise!
[16:32:24] serviceops, CirrusSearch, Discovery-Search: Envoy telemetry not available for cirrus-streaming-updater@staging-eqiad - https://phabricator.wikimedia.org/T353224 (JMeybohm) I'm not sure why it works for the other two. Prometheus does have established tcp connections to pods from mw-p-c-c-e but I can't...
[16:51:32] serviceops, CirrusSearch, Discovery-Search, Patch-For-Review: Envoy telemetry not available for cirrus-streaming-updater@staging-eqiad - https://phabricator.wikimedia.org/T353224 (dcausse) @JMeybohm thanks for taking a look! we'll include this template to see if this solves the issue.
[16:51:46] serviceops, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: Envoy telemetry not available for cirrus-streaming-updater@staging-eqiad - https://phabricator.wikimedia.org/T353224 (dcausse)
[17:24:51] serviceops, CirrusSearch, Discovery-Search (Current work): Envoy telemetry not available for cirrus-streaming-updater@staging-eqiad - https://phabricator.wikimedia.org/T353224 (dcausse) Confirming that envoy metrics are now properly flowing to prometheus for the cirrus-streaming-updater namespace