[06:22:10] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, wdwb-tech: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 (Joe) p:Triage→Medium
[06:25:37] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, wdwb-tech: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (Joe)
[06:56:11] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[07:00:35] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui) m1-master and m2-master proxies failed over
[07:01:15] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[07:24:10] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, wdwb-tech: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (Joe) Re-thinking about this: what we're really interes...
[07:45:55] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T331126 (akosiaris)
[07:46:23] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T331126 (akosiaris) wdqs was repooled yesterday, only things left are some old IP ranges cleanups and adding the 2 new nodes in the cluster.
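The retitled T331405 says the query service maxlag should exclude datacenters that don't receive traffic and where the updater is turned off. Below is a minimal Python sketch of that filtering rule, read literally: a datacenter is dropped only when both conditions hold. The host names, the `DatacenterState` flags and `query_service_max_lag` are hypothetical and do not come from the Wikidata/WDQS code. Taking the plain maximum is also an assumption, because Joe's 07:24 follow-up about what "we're really interes..." is truncated.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DatacenterState:
    # Hypothetical per-datacenter flags; real values would come from pooling/updater state.
    receives_traffic: bool
    updater_enabled: bool


def query_service_max_lag(
    lag_by_host: Mapping[str, float],
    dc_of_host: Mapping[str, str],
    dc_state: Mapping[str, DatacenterState],
) -> Optional[float]:
    """Max lag (seconds) over hosts whose datacenter still counts.

    Following the task title literally, a datacenter is excluded only when it
    receives no traffic AND its updater is turned off. Hosts whose datacenter
    is unknown are ignored. Returns None when no host qualifies.
    """
    relevant = []
    for host, lag in lag_by_host.items():
        state = dc_state.get(dc_of_host.get(host, ""))
        if state is None:
            continue  # no known datacenter for this host
        if not state.receives_traffic and not state.updater_enabled:
            continue  # depooled DC with the updater off: its stale lag is irrelevant
        relevant.append(lag)
    return max(relevant, default=None)


if __name__ == "__main__":
    # Illustrative data only: a depooled eqiad whose updater is stopped, and a live codfw.
    lags = {"wdqs-a": 86400.0, "wdqs-b": 30.0}
    dcs = {"wdqs-a": "eqiad", "wdqs-b": "codfw"}
    states = {
        "eqiad": DatacenterState(receives_traffic=False, updater_enabled=False),
        "codfw": DatacenterState(receives_traffic=True, updater_enabled=True),
    }
    print(query_service_max_lag(lags, dcs, states))
```

Whether a maximum, a median or another aggregate is the right signal is what the task is discussing. The sketch only shows where a per-datacenter filter would sit before any aggregation.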
[08:15:01] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MoritzMuehlenhoff)
[08:35:43] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI - https://phabricator.wikimedia.org/T329717 (JMeybohm)
[08:36:33] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: K8s etcd on bullseye show TLS errors in logs - https://phabricator.wikimedia.org/T329556 (JMeybohm)
[08:36:36] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI - https://phabricator.wikimedia.org/T329717 (JMeybohm) Open→Resolved I've removed all cergen certs and config for wikikube and ml clusters from private p...
[08:47:35] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Decide on new Pod and Sevice IPv4 ranges for wikikube clusters - https://phabricator.wikimedia.org/T326617 (Marostegui)
[08:59:58] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[09:00:26] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[09:02:02] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[09:20:54] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Allow to address Kubernets API servers from NetworkPolicy - https://phabricator.wikimedia.org/T287491 (JMeybohm) I've added a CR leveraging the service selector in calico network policies: https://gerrit.wikimedia.org/r/c/operations/deplo...
[09:59:16] serviceops, Kubernetes, Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (jijiki) a:JMeybohm→jijiki
[10:02:49] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Clement_Goubert)
[10:03:01] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[10:22:07] Hello! If anyone has some free time today, I'd like to start moving device-analytics towards deployment in k8s: https://gerrit.wikimedia.org/r/c/operations/puppet/+/889960 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886358
[10:31:37] hnowlan: looking. You haven't reserved the IPs in netbox yet, right?
[10:35:06] claime: no, not yet
[10:35:10] ack
[10:58:12] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (cmooney) Open→Resolved
[11:13:01] hnowlan: apart from the missing dns, conftool-data and netbox IP config, it looks good to me. I suppose you plan on adding conftool-data in a separate patch?
[11:14:57] claime: netbox done (as of a few minutes ago), dns here! https://gerrit.wikimedia.org/r/c/operations/dns/+/890398
[11:15:41] added conftool-data to the puppet CR, although is that too many moving parts in one review now?
[11:16:44] Hmm, maybe do dns + conftool data first, then service catalog and the rest?
[11:17:03] That's how it's described in the doc iirc
[11:18:46] cool, works for me
[11:20:47] conftool stuff: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895716
[11:26:11] the IPs in netbox need to be active, and then you need to run the sre.dns.netbox cookbook; it's a noop for prod (as the data still comes from the ops/dns repo): https://netbox.wikimedia.org/search/?q=device-analytics
[11:27:28] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[11:28:05] (as in, it doesn't matter if you do that now or later)
[11:28:14] volans: ack, thanks :) I thought that having them as reserved wouldn't do much, given the need for the manual update in the repo
[11:29:13] fixed
[11:29:36] the idea is that they should come from netbox; we should resurrect the old, abandoned T270071
[11:29:40] hnowlan: Do you want device-analytics to be active/active with geoip? Because if you do, you're missing a declaration under $ORIGIN discovery.wmnet.
[11:30:37] so we're keeping the 2 sources in sync for now; given they are active now, you have to run the sre.dns.netbox cookbook or icinga will complain in a bit :)
[11:35:55] claime: added, ty!
[11:43:25] hnowlan: lgtm, don't forget to run the sre.dns.netbox cookbook
[11:44:40] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc1039.eqiad.wmnet with OS bullseye
[11:45:30] serviceops, SRE, Traffic, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert)
[11:45:40] serviceops, SRE, Traffic, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert) Open→Resolved
[11:45:47] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[11:46:55] serviceops, SRE, Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) Resolved→In progress
[11:47:04] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[11:47:53] serviceops, SRE, Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) {T331285} done, switching `restbase-async` back to its standard state.
[11:51:27] serviceops, SRE, Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert)
[11:53:41] claime: thanks!
[11:53:53] serviceops, SRE, Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert)
[11:53:57] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert)
[11:54:36] serviceops, SRE, Traffic, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert)
[11:54:43] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert)
[11:56:06] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert)
[11:57:07] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T331126 (akosiaris)
[11:57:29] serviceops, SRE, Patch-For-Review: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (akosiaris)
[11:57:36] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (akosiaris)
[11:57:46] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T331126 (akosiaris) Open→Resolved All tasks done. Resolving
[11:58:03] hnowlan: so you don't have to look for why the dns patch got V-1, I commented on the CR
[11:58:04] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (akosiaris)
[11:58:48] serviceops, SRE, Patch-For-Review: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (akosiaris) Open→Resolved Nodes added, resolving. Many thanks @jijiki
[11:59:09] claime: just pushed a fix as you sent that :) Need to have the service fully set up in state production before adding to the mocks, so those records can wait
[11:59:16] ack
[11:59:46] That's my bad, I asked you to add them lol
[12:01:18] heh nbd
[12:02:15] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert)
[12:02:25] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[12:02:33] serviceops, SRE, Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) In progress→Resolved
[12:03:47] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert) Open→Resolved We are now out of the window of eqiad complete depool, according to schedule.
[12:03:57] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[12:10:59] serviceops, MW-on-K8s, SRE, Traffic: Insert a header for specific domains at haproxy layer to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Clement_Goubert)
[12:18:03] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1039.eqiad.wmnet with OS bullseye completed: - mc1039 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa...
[12:27:05] serviceops, Prod-Kubernetes, Kubernetes: Decide on new Pod and Sevice IPv4 ranges for wikikube clusters - https://phabricator.wikimedia.org/T326617 (akosiaris) Open→Resolved Cleanups in the k8s related infra done. toolhub m5 grants are being cleaned up in their own task T331508.
[12:27:10] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (akosiaris)
[12:31:18] serviceops, MW-on-K8s, SRE, Traffic: Insert a header for specific domains at haproxy layer to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Vgutierrez) We traditionally perform that kind of header mangling in varnish rather than on the TLS termination layer as we try...
[12:40:05] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (jijiki) Open→Resolved
[12:40:12] serviceops, MW-on-K8s, SRE, Traffic: Find a sensible way to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Clement_Goubert)
[12:40:55] serviceops, MW-on-K8s, SRE, Traffic: Find a sensible way to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Clement_Goubert) Changed the task title to reflect the direction of the discussion.
[13:17:30] this is the week of helm being even more strange than usual
[13:37:12] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/895765/comments/7d388921_7615da59
[13:53:03] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Remove the .Values.kubernetesApi hack - https://phabricator.wikimedia.org/T326729 (JMeybohm)
[13:54:49] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[13:55:13] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Remove the .Values.kubernetesApi hack - https://phabricator.wikimedia.org/T326729 (JMeybohm) Open→Resolved flink-session-cluster still carries the hack (see description). As that isn't a problem (it will still work) and the chart...
[14:23:35] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert) Resolved→In progress
[14:23:46] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[14:25:29] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert)
[14:49:33] serviceops, SRE: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (akosiaris)
[14:59:15] <_joe_> akosiaris, claime <3
[14:59:27] I've done nothing yet :D
[14:59:39] But I'll take the <3
[15:00:27] <_joe_> claime: oh so you weren't involved with the balancing act?
[15:01:46] _joe_: No, akosiaris did that all on his own
[15:02:19] <_joe_> akosiaris: we need to remember to give new hires the session about how to grab unearned merit
[15:02:29] hahaha
[15:03:56] lol
[15:14:36] I helped too (see what I did there?)
[15:15:26] <_joe_> Amir1: we know you learned that lesson
[15:15:33] <_joe_> I taught you
[15:15:48] <_joe_> (see, that's mastery, right there)
[15:16:03] I already have trouble claiming earned kudos
[15:16:10] Unearned is a step too far
[15:17:33] claime: noob
[15:18:00] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert)
[15:18:28] Amir1: careerist :p
[15:18:51] well done :D
[15:19:58] so, I was told that capable people are rewarded by being given more work. Who wants to put the cluster a mw host belongs to in netbox? Cause I had to paste datasets together to get the data pasted in the task above :P
[15:20:43] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[15:20:56] Sorry, I'm ooo until the end of March; Manuel sounds like a good candidate to help out though!
[15:23:04] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (cmooney)
[15:25:26] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (cmooney)
[15:29:12] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[15:29:34] akosiaris: you mean this?
[15:29:35] for R in {A..D}; do for K in {1..8}; do echo -n "$R$K: "; sudo cumin "A:mw-jobrunner and P{P:netbox::host%location ~ '${R}${K}.*codfw'}" 2>&1 | grep -v "DRY-RUN" ; done; done
[15:29:49] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[15:30:27] akosiaris: sounds boring, can do that tomorrow
[15:30:32] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert) p:Triage→Medium
[15:55:50] volans: AKSHUALLY, I went for: sudo cumin --force 'mw2*' 'echo -n $(hostname) " " ; find /etc/update-motd.d/ -name "05-role-*" -exec grep -o mediawiki::.* {} \;' |grep mediawiki | sort | sed -e 's/mediawiki:://' -e 's/appserver:://' -e 's/"//' > a
[15:56:06] which is clearly O(1) while yours is O(n^2)
[15:56:43] but I'll give you a passing grade anyway. B- it is :P
[15:56:52] mine doesn't do ssh...
[15:56:59] just some queries
[15:58:24] queries? I am sorry, you need to take this in #wikimedia-data-persistence
[15:59:02] rotfl
[15:59:09] API calls
[15:59:10] sorry
[15:59:11] :-P
[15:59:17] back to you
[15:59:26] unfortunately, meeting
[15:59:33] I was enjoying this though ;-)
[16:18:55] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF)
[17:00:40] serviceops, SRE, Thumbor, Thumbor Migration, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (akosiaris) >>! In T216815#8672370, @jnuche wrote: > @akosiaris thanks for the feedback. > > Just to clarify, we can work around the issue currently, but it makes t...
[20:38:05] serviceops, Keyholder, Scap, Datacenter-Switchover: scap can not ssh with keyholder on deploy2002 - https://phabricator.wikimedia.org/T331568 (hashar)
[20:47:41] serviceops, Keyholder, Scap, Datacenter-Switchover: scap can not ssh with keyholder on deploy2002 - https://phabricator.wikimedia.org/T331568 (Dzahn) Open→Resolved a:Dzahn should be fixed. By restarting the proxy as you suggested. Test below works: ` [deploy2002:~] $ SSH_AUTH_SOCK=/r...
[20:51:26] serviceops, Keyholder, Scap, serviceops-collab, Datacenter-Switchover: scap can not ssh with keyholder on deploy2002 - https://phabricator.wikimedia.org/T331568 (Dzahn)
[21:02:33] serviceops, Keyholder, Scap, serviceops-collab, Datacenter-Switchover: scap can not ssh with keyholder on deploy2002 - https://phabricator.wikimedia.org/T331568 (hashar) Resolved→Open I have confirmed it works. I am reopening so that the #datacenter-switchover documentation gets updat...
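The two cumin one-liners from 15:29 and 15:55 reach the same host-to-cluster table by different routes. volans's loop builds one host-selection query per codfw rack (rows A to D, racks 1 to 8, so 32 queries), and per the follow-up it makes only API calls, no ssh. akosiaris's version runs a command on every mw2* host and then cleans up the text with grep, sort and sed. The Python sketch below only mirrors the string handling of both approaches; it does not call cumin. The function names and the sample output lines are hypothetical, and the real format of the 05-role-* motd files is assumed, not known.

```python
from collections import Counter


def rack_queries(rows="ABCD", racks=range(1, 9)):
    """Rebuild the query strings the 15:29 bash loop passes to cumin:
    one per codfw rack, rows A..D and racks 1..8 (32 queries in total)."""
    return [
        # Doubled braces emit the literal P{...} of the cumin query.
        f"A:mw-jobrunner and P{{P:netbox::host%location ~ '{row}{rack}.*codfw'}}"
        for row in rows
        for rack in racks
    ]


def clean_role_lines(raw_lines):
    """Mirror `grep mediawiki | sort | sed -e 's/mediawiki:://' -e 's/appserver:://' -e 's/"//'`.

    None of the sed expressions has the /g flag, so only the first match on
    each line is removed, hence count=1 in str.replace. Note that Python sorts
    by code point, while `sort` follows the locale, so ordering can differ.
    """
    kept = sorted(line for line in raw_lines if "mediawiki" in line)
    cleaned = []
    for line in kept:
        line = line.replace("mediawiki::", "", 1)
        line = line.replace("appserver::", "", 1)
        line = line.replace('"', "", 1)
        cleaned.append(line)
    return cleaned


def count_hosts_per_role(cleaned_lines):
    """Count hosts per role, assuming each cleaned line looks like '<host>  <role>'."""
    return Counter(line.split()[-1] for line in cleaned_lines if line.strip())


if __name__ == "__main__":
    # Hypothetical output: one non-matching header line, then '<host>  <role>"' lines.
    sample = [
        "(1) mw2420.codfw.wmnet",
        'mw2422  mediawiki::appserver::api"',
        'mw2420  mediawiki::appserver::api"',
        'mw2421  mediawiki::jobrunner"',
    ]
    print(len(rack_queries()))  # 32
    cleaned = clean_role_lines(sample)
    print(cleaned)
    print(count_hosts_per_role(cleaned))
```

The trade-off from the banter is visible here. The rack loop fans out into 32 lookups but never touches the hosts. The motd approach reaches every host in one cumin run and leaves the grouping to local text processing.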
[23:13:20] serviceops, SRE, Traffic-Icebox, VPS-project-Codesearch, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall) Open→In progress a:BCornwall
[23:25:42] serviceops, SRE, Traffic-Icebox, VPS-project-Codesearch, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall)
[23:26:30] serviceops, SRE, Traffic-Icebox, VPS-project-Codesearch, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall) Removed `ircecho` from the list as it had `Requires=network.target`, which...