[00:17:16] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [00:29:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:29:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:29:58] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:33:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.539 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:33:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:34:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:38:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/962238 [00:38:26] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/962238 (owner: 10TrainBranchBot) [00:55:08] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/962238 (owner: 10TrainBranchBot) [01:31:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:30] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:38:47] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:47] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:32:28] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:32:32] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: 
hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:29] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-09-28-043003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/961979 (https://phabricator.wikimedia.org/T343450) (owner: 10KartikMistry) [03:46:30] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:47:16] (03Merged) 10jenkins-bot: Update cxserver to 2023-09-28-043003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/961979 (https://phabricator.wikimedia.org/T343450) (owner: 10KartikMistry) [03:47:42] PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:27] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [03:48:51] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [03:51:17] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [03:51:52] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [03:52:12] RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:52:30] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:53:26] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:54:50] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:55:46] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [03:56:20] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [03:56:51] !log Updated cxserver to 2023-09-28-043003-production (T343450, T347389, T338689) [03:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:58] T343450: Enable MinT for closely-related languages based on community input - https://phabricator.wikimedia.org/T343450 [03:56:58] T338689: error translating court cases out of english - https://phabricator.wikimedia.org/T338689 [03:56:59] T347389: Integrate improved sentence segmentation algorithm in CXServer - https://phabricator.wikimedia.org/T347389 [04:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:41:44] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:44:24] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:44:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:46:20] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:47:24] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:49:14] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:05:52] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:06:46] PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:09:48] RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:22] RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:21:30] (03PS6) 10Anzx: fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) [05:43:39] !log sg912@deploy2002 Started deploy [analytics/refinery@e954b12]: Regular analytics weekly train [analytics/refinery@e954b12a] [05:44:35] Refinery deployment in progress 
[05:49:41] !log sg912@deploy2002 Finished deploy [analytics/refinery@e954b12]: Regular analytics weekly train [analytics/refinery@e954b12a] (duration: 06m 02s) [05:50:10] !log sg912@deploy2002 Started deploy [analytics/refinery@e954b12] (thin): Regular analytics weekly train THIN [analytics/refinery@e954b12a] [05:50:17] !log sg912@deploy2002 Finished deploy [analytics/refinery@e954b12] (thin): Regular analytics weekly train THIN [analytics/refinery@e954b12a] (duration: 00m 06s) [05:51:02] !log sg912@deploy2002 Started deploy [analytics/refinery@e954b12] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@e954b12a] [05:54:03] !log sg912@deploy2002 Finished deploy [analytics/refinery@e954b12] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@e954b12a] (duration: 03m 00s) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T0600) [06:19:43] !log Deployed refinery using scap, then deployed onto hdfs [06:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:53] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ayounsi) Indeed, looks about right :) For Puppet, if we can change the Hiera merge strategy to `hash` for `profile::bird::adve... [06:30:07] !log installing glibc security updates [06:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Sfaci) [06:58:42] 10SRE, 10SRE-Access-Requests: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (10Nahid) [07:00:05] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T0700) [07:00:05] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:03:48] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:05:53] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Jelto) [07:12:05] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/962636 (owner: 10EoghanGaffney) [07:12:16] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10dcaro) [07:19:53] !log Remove static routes for anycast prefixes - T347494 [07:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:57] T347494: Remove static routes for anycast prefixes - https://phabricator.wikimedia.org/T347494 [07:22:08] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) [07:22:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you" [cookbooks] - 10https://gerrit.wikimedia.org/r/963189 (https://phabricator.wikimedia.org/T347954) (owner: 10Volans) [07:24:57] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) p:05Triage→03Medium 
[07:25:05] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) Thanks for opening the access request. There is an official [access request form](https://phabricator.wi... [07:27:28] 10SRE, 10Infrastructure-Foundations, 10netops: Remove static routes for anycast prefixes - https://phabricator.wikimedia.org/T347494 (10ayounsi) 05Open→03Resolved All done. [07:30:43] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) [07:32:42] (03CR) 10David Caro: [C: 03+2] ci: manage cinder volume on Castor instance [puppet] - 10https://gerrit.wikimedia.org/r/961844 (https://phabricator.wikimedia.org/T304080) (owner: 10Hashar) [07:33:06] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) @Antoine_Quhen Can you confirm and add your wikitech username and email address in the task description... [07:34:26] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2003.codfw.wmnet with OS bullseye [07:34:31] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... [07:36:03] 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ayounsi) I was wondering what to do for all the appliances that have ntp.site.wikimedia.org configured. 
To me the best here is to... [07:36:51] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: fix confctl repool command [cookbooks] - 10https://gerrit.wikimedia.org/r/963189 (https://phabricator.wikimedia.org/T347954) (owner: 10Volans) [07:38:55] (03CR) 10Hashar: gerrit: remove duplicate $gerrit_site definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [07:39:25] (03Merged) 10jenkins-bot: sre.hosts.reimage: fix confctl repool command [cookbooks] - 10https://gerrit.wikimedia.org/r/963189 (https://phabricator.wikimedia.org/T347954) (owner: 10Volans) [07:42:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:43:03] (03PS1) 10Muehlenhoff: Remove access for essexigyan [puppet] - 10https://gerrit.wikimedia.org/r/963256 [07:47:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:47:15] 10SRE, 10SRE-Access-Requests: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (10Aklapper) For context, https://meta.wikimedia.org/wiki/Special:CentralAuth?target=ZSoo%20(WMF) [07:50:34] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036 (10Volans) IMHO I think we should stick to the agreed format in T284614#7214588 and T284614#7222919 and rename (and re-slug) the 3 non matching ones into the... 
[07:53:33] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2003.codfw.wmnet with reason: host reimage [07:55:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [07:56:05] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2003.codfw.wmnet with reason: host reimage [07:59:29] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for essexigyan [puppet] - 10https://gerrit.wikimedia.org/r/963256 (owner: 10Muehlenhoff) [08:00:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [08:00:57] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Eigyan out of all services on: 2176 hosts [08:01:18] (03PS5) 10Clément Goubert: P:mw::deployment::server: Don't alert for train-presync [puppet] - 10https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) [08:01:27] (03CR) 10Clément Goubert: P:mw::deployment::server: Don't alert for train-presync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: 10Clément Goubert) [08:01:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Eigyan out of all services on: 2176 hosts [08:08:48] (03PS1) 10Muehlenhoff: Remove LDAP access for whatamidoing [puppet] - 
10https://gerrit.wikimedia.org/r/963257 [08:11:23] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for whatamidoing [puppet] - 10https://gerrit.wikimedia.org/r/963257 (owner: 10Muehlenhoff) [08:14:26] (03PS1) 10Slyngshede: P:IDM Enable logging of remote IPs. [puppet] - 10https://gerrit.wikimedia.org/r/963258 [08:14:27] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2003.codfw.wmnet with OS bullseye [08:14:32] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... [08:15:59] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43858/console" [puppet] - 10https://gerrit.wikimedia.org/r/963258 (owner: 10Slyngshede) [08:19:10] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye [08:19:15] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... [08:23:20] (03CR) 10Volans: "Thanks for taking the time to migrate this cookbook to the newer class API." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [08:30:39] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/963160 (https://phabricator.wikimedia.org/T345531) (owner: 10EoghanGaffney) [08:31:33] (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/963161 (https://phabricator.wikimedia.org/T345531) (owner: 10EoghanGaffney) [08:37:16] (03PS1) 10Muehlenhoff: Remove access for erayfield [puppet] - 10https://gerrit.wikimedia.org/r/963259 [08:38:35] (03PS1) 10David Caro: disable_tool: use the gitlab repository [puppet] - 10https://gerrit.wikimedia.org/r/963260 (https://phabricator.wikimedia.org/T327057) [08:42:52] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for erayfield [puppet] - 10https://gerrit.wikimedia.org/r/963259 (owner: 10Muehlenhoff) [08:43:21] 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) a:03Jclark-ctr Hi! Can we please have `cloudvirt-wdqs100[1-3]` moved to the WMCS racks, preferably `E4` or `F4`? They will all need a s... 
[08:43:39] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging EllenR out of all services on: 2175 hosts [08:44:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging EllenR out of all services on: 2175 hosts [08:52:04] (03PS1) 10Muehlenhoff: Remove LDAP access for agueyte [puppet] - 10https://gerrit.wikimedia.org/r/963262 [08:52:43] (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Bump codfw to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/962645 (owner: 10Jgiannelos) [08:55:34] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for agueyte [puppet] - 10https://gerrit.wikimedia.org/r/963262 (owner: 10Muehlenhoff) [08:57:03] (03CR) 10Jgiannelos: [C: 03+2] tegola: Bump codfw to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/962645 (owner: 10Jgiannelos) [08:57:07] (03PS1) 10Jelto: admin: add scampos to ldap_only users [puppet] - 10https://gerrit.wikimedia.org/r/963265 (https://phabricator.wikimedia.org/T348001) [08:57:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T343198)', diff saved to https://phabricator.wikimedia.org/P52813 and previous config saved to /var/cache/conftool/dbconfig/20231004-085739-arnaudb.json [08:57:44] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [08:58:28] (03Merged) 10jenkins-bot: tegola: Bump codfw to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/962645 (owner: 10Jgiannelos) [08:59:00] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10Jelto) p:05Triage→03Medium [09:01:58] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [09:02:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] rabbitmq: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945750 (owner: 10Muehlenhoff) [09:02:43] (03PS2) 10Arturo Borrero 
Gonzalez: rabbitmq: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945750 (owner: 10Muehlenhoff) [09:02:48] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [09:04:53] (03PS1) 10Muehlenhoff: Remove access for kmorgan [puppet] - 10https://gerrit.wikimedia.org/r/963267 [09:06:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for kmorgan [puppet] - 10https://gerrit.wikimedia.org/r/963267 (owner: 10Muehlenhoff) [09:08:07] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging KMorgan out of all services on: 2175 hosts [09:08:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging KMorgan out of all services on: 2175 hosts [09:12:29] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10jbond) Proposal looks good to me, minor nit would be to rename `ACAST_PS_ADVERTISE` to remove references to anycast to avoid con... 
[09:12:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P52814 and previous config saved to /var/cache/conftool/dbconfig/20231004-091245-arnaudb.json [09:13:54] (03PS1) 10Muehlenhoff: Remove access for tsepothoabala [puppet] - 10https://gerrit.wikimedia.org/r/963272 [09:14:09] (03CR) 10CI reject: [V: 04-1] Remove access for tsepothoabala [puppet] - 10https://gerrit.wikimedia.org/r/963272 (owner: 10Muehlenhoff) [09:16:30] (03PS1) 10Jgiannelos: tegola: Use latest image for all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963273 [09:19:43] (03PS2) 10Muehlenhoff: Remove access for tsepothoabala [puppet] - 10https://gerrit.wikimedia.org/r/963272 [09:20:04] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) [09:21:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:22:32] (03CR) 10Muehlenhoff: [C: 03+1] admin: add scampos to ldap_only users [puppet] - 10https://gerrit.wikimedia.org/r/963265 (https://phabricator.wikimedia.org/T348001) (owner: 10Jelto) [09:25:01] (03CR) 10Jelto: [C: 03+2] admin: add scampos to ldap_only users [puppet] - 10https://gerrit.wikimedia.org/r/963265 (https://phabricator.wikimedia.org/T348001) (owner: 10Jelto) [09:25:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for tsepothoabala [puppet] - 10https://gerrit.wikimedia.org/r/963272 (owner: 10Muehlenhoff) [09:25:50] !log sg912@deploy2002 Started deploy [airflow-dags/analytics@3b374a9]: (no justification provided) [09:26:11] (03PS1) 10Filippo Giunchedi: mw: allow egress to excimer [deployment-charts] - 10https://gerrit.wikimedia.org/r/963274 (https://phabricator.wikimedia.org/T347926) [09:26:35] !log 
sg912@deploy2002 Finished deploy [airflow-dags/analytics@3b374a9]: (no justification provided) (duration: 00m 45s) [09:27:43] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging TsepoThoabala out of all services on: 2175 hosts [09:27:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P52815 and previous config saved to /var/cache/conftool/dbconfig/20231004-092752-arnaudb.json [09:28:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging TsepoThoabala out of all services on: 2175 hosts [09:28:40] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [09:33:06] (03CR) 10Clément Goubert: [C: 03+1] mw: allow egress to excimer [deployment-charts] - 10https://gerrit.wikimedia.org/r/963274 (https://phabricator.wikimedia.org/T347926) (owner: 10Filippo Giunchedi) [09:33:55] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe2004.codfw.wmnet with OS bullseye [09:34:01] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... [09:34:06] (03CR) 10Cathal Mooney: [C: 03+1] "As discussed there are some more improvements we can make here I think. The /30 tricks are cool but probably better to use /29." 
[puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [09:34:37] (03CR) 10Filippo Giunchedi: [C: 03+2] mw: allow egress to excimer [deployment-charts] - 10https://gerrit.wikimedia.org/r/963274 (https://phabricator.wikimedia.org/T347926) (owner: 10Filippo Giunchedi) [09:35:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refactor to set up routes independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [09:35:24] (03Merged) 10jenkins-bot: mw: allow egress to excimer [deployment-charts] - 10https://gerrit.wikimedia.org/r/963274 (https://phabricator.wikimedia.org/T347926) (owner: 10Filippo Giunchedi) [09:35:50] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye [09:35:55] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... 
[09:37:07] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:37:23] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:37:24] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:37:31] hold steady for !log wall [09:37:41] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:37:42] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [09:37:51] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [09:37:52] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [09:38:18] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [09:38:19] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [09:38:32] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [09:38:33] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [09:38:45] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [09:38:46] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [09:38:55] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [09:38:56] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [09:39:08] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [09:39:09] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [09:39:16] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [09:39:17] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [09:39:31] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [09:39:49] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - 
https://phabricator.wikimedia.org/T348001 (10Jelto) 05Open→03Resolved a:03Jelto Thanks for the request. I can confirm the accounts are linked since 2023-10-03. Sara Campos was added to wmf ldap group. I'm closing this task. Feel free t... [09:42:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T343198)', diff saved to https://phabricator.wikimedia.org/P52816 and previous config saved to /var/cache/conftool/dbconfig/20231004-094258-arnaudb.json [09:43:00] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [09:43:03] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:43:14] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [09:43:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T343198)', diff saved to https://phabricator.wikimedia.org/P52817 and previous config saved to /var/cache/conftool/dbconfig/20231004-094320-arnaudb.json [09:50:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto) p:05Triage→03Medium [09:50:25] (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Use latest image for all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963273 (owner: 10Jgiannelos) [09:50:30] (03CR) 10Jgiannelos: [C: 03+2] tegola: Use latest image for all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963273 (owner: 10Jgiannelos) [09:51:18] (03Merged) 10jenkins-bot: tegola: Use latest image for all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963273 (owner: 10Jgiannelos) [09:51:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto) Thanks for the access request. 
I need approval from @WDoranWMF as the manager and @odimitrijevic or @BTullis as the group owners for `analytics-admins`. [09:52:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10BTullis) Approved. [09:53:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto) [09:56:02] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Limit global account linking to LDAP properties page. [software/bitu] - 10https://gerrit.wikimedia.org/r/961702 (owner: 10Slyngshede) [09:58:35] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: interfaces: set up cloudgw <-> cloudnet routes in the right interface [puppet] - 10https://gerrit.wikimedia.org/r/963279 (https://phabricator.wikimedia.org/T347687) [09:59:19] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Lucas_Werkmeister_WMDE) [09:59:47] (03CR) 10Cathal Mooney: [C: 03+1] "Sry that was my bad." [puppet] - 10https://gerrit.wikimedia.org/r/963279 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1000) [10:00:36] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (10Jelto) This task popped up in the clinic duty board because it's a #sre-access-requests. However [wikitech](https://wikitech.wikimedia.org/wi... 
[10:01:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: interfaces: set up cloudgw <-> cloudnet routes in the right interface [puppet] - 10https://gerrit.wikimedia.org/r/963279 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [10:02:35] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe2004.codfw.wmnet with OS bullseye [10:02:39] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... [10:02:51] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye [10:02:56] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... [10:04:34] (03PS1) 10Muehlenhoff: Remove LDAP access for elitre [puppet] - 10https://gerrit.wikimedia.org/r/963280 [10:07:58] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for elitre [puppet] - 10https://gerrit.wikimedia.org/r/963280 (owner: 10Muehlenhoff) [10:15:28] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10cmooney) > Otherwise, it should be fairly straightforward: we add the VIP the same way we do for the anycast IPs, making sure to... 
[10:15:54] (03PS1) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [10:18:31] (03CR) 10Jbond: "see inline for comments questions" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) (owner: 10Cathal Mooney) [10:18:50] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:20:08] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [10:20:11] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [10:20:16] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [10:20:18] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [10:20:22] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [10:20:35] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-fe2004.codfw.wmnet with OS bullseye [10:20:41] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... 
[10:21:00] 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10elukey) [10:21:12] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [10:24:54] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) [10:25:25] (03CR) 10Cathal Mooney: [C: 03+1] cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [10:25:30] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) [10:25:41] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [10:27:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:27:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [10:27:47] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [10:27:52] 10SRE, 10Infrastructure-Foundations, 10Epic: 
Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10MoritzMuehlenhoff) >>! In T247045#9212891, @nshahquinn-wmf wrote: > https://os-reports.wikimedia.org/stretch.html now reports: >> A total of 0 hosts are runni... [10:29:11] !log filippo@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2004'] [10:29:34] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['thanos-fe2004'] [10:30:44] (03PS1) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 [10:31:22] (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [10:31:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43859/console" [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [10:32:51] (03PS2) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 [10:33:03] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Peachey88) [10:33:17] (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [10:33:18] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: thanos-fe2004 fail to boot into PXE - https://phabricator.wikimedia.org/T348119 (10fgiunchedi) [10:33:48] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:55] (03CR) 10Jbond: [V: 
03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43860/console" [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [10:37:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:39:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/963052 (https://phabricator.wikimedia.org/T304491) (owner: 10MVernon) [10:39:11] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:39:42] (03PS3) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 [10:40:08] 10SRE, 10Infrastructure-Foundations, 10netops: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120 (10cmooney) p:05Triage→03Low [10:40:16] (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [10:40:20] (03CR) 10Ladsgroup: mariadb: update the ssl-ca value used by mariadb (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [10:40:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43861/console" [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [10:43:03] (03PS4) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 [10:43:29] (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 
10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [10:44:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43862/console" [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [10:46:06] (03PS2) 10Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) [10:47:36] (03CR) 10Kevin Bazira: "elukey, in T347475#9224115 you had suggested the number 2 but for this test I've used 8, we can always scale them down once we've tested a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [10:49:06] (03CR) 10Elukey: [C: 04-1] ml-services: increase recommendation-api-ng uwsgi workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [10:50:37] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff >>! In T348103#9224094, @Jelto wrote: > @MoritzMuehlenhoff can I hand this over to you/your team?... 
[10:51:25] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10Clement_Goubert) [10:52:05] (03PS1) 10Muehlenhoff: Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) [10:52:46] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [10:52:56] (03CR) 10CI reject: [V: 04-1] Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) (owner: 10Muehlenhoff) [10:54:06] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:54:16] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:54:47] (03PS2) 10Muehlenhoff: Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) [10:55:34] (03CR) 10CI reject: [V: 04-1] Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) (owner: 10Muehlenhoff) [10:57:43] (03PS3) 10Muehlenhoff: Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) [10:58:08] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye [10:58:14] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... 
[10:58:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:27] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) (owner: 10Muehlenhoff) [11:00:44] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Server moves in codfw to support switch numbering scheme - https://phabricator.wikimedia.org/T348125 (10cmooney) p:05Triage→03Medium [11:00:59] (03PS3) 10Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) [11:02:42] (03CR) 10Elukey: [C: 03+1] ml-services: increase recommendation-api-ng uwsgi workers (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [11:02:54] (03CR) 10Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [11:02:56] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Zsoo out of all services on: 2175 hosts [11:03:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:03:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Zsoo out of all services on: 2175 hosts [11:03:58] (03CR) 10Jbond: "ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:04:11] (03Abandoned) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond) [11:04:13] (03CR) 10Kevin Bazira: [C: 03+2] ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [11:04:20] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: thanos-fe2004 fail to boot into PXE - https://phabricator.wikimedia.org/T348119 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I'm not sure how it happened but from the `ctrl-s` menu from broadcom... [11:04:22] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10fgiunchedi) [11:05:00] (03Merged) 10jenkins-bot: ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [11:05:07] (03PS2) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [11:06:30] (03PS2) 10Hnowlan: wmnet: add records for edit-, editor- and page-analytics [dns] - 10https://gerrit.wikimedia.org/r/963106 (https://phabricator.wikimedia.org/T336391) [11:06:35] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Patch-For-Review: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (10MoritzMuehlenhoff) 05Open→03Resolved @Nahid: I have removed Zxane's access to the "restricted" and "analytics-priv... 
[11:07:20] (03PS2) 10Jbond: mariadb: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) [11:07:25] (03CR) 10Jbond: "updated thanks for the feedback" [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [11:08:25] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B server moves - port-block constraint / numbering - https://phabricator.wikimedia.org/T348125 (10cmooney) [11:10:01] (03PS9) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [11:10:04] (03CR) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:10:13] (03CR) 10Clément Goubert: [C: 03+1] wmnet: add records for edit-, editor- and page-analytics [dns] - 10https://gerrit.wikimedia.org/r/963106 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [11:10:15] (03CR) 10CI reject: [V: 04-1] prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:10:17] (03CR) 10Clément Goubert: [C: 03+1] service: add {edit,editor,page}-analytics services [puppet] - 10https://gerrit.wikimedia.org/r/962570 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [11:11:03] (03CR) 10Jbond: "fyi ill comeback to theses patches after the get_clusters patch is merged and working" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:11:13] (03CR) 10MVernon: [C: 03+2] aptrepo: install zip on aptrepo servers [puppet] - 10https://gerrit.wikimedia.org/r/963052 (https://phabricator.wikimedia.org/T304491) (owner: 10MVernon) 
[11:12:23] (03CR) 10Jbond: [C: 03+1] gerrit: remove duplicate $gerrit_site definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [11:12:40] (03CR) 10Hnowlan: [C: 03+2] wmnet: add records for edit-, editor- and page-analytics [dns] - 10https://gerrit.wikimedia.org/r/963106 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [11:12:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:dumps::distribution::ferm: pass array directly do ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/963062 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [11:13:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10WDoranWMF) Approved [11:14:17] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:14:35] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2004.codfw.wmnet with reason: host reimage [11:16:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto) [11:17:47] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2004.codfw.wmnet with reason: host reimage [11:18:32] (03PS1) 10Jelto: admin: add sfaci to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/963288 (https://phabricator.wikimedia.org/T348101) [11:20:43] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[11:23:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:25:31] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:25:58] 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10cmooney) p:05Triage→03Medium [11:26:59] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney) [11:26:59] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:23] PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:27:59] PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[11:30:27] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:33:29] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2004.codfw.wmnet with OS bullseye [11:33:34] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... [11:34:13] (03CR) 10Hnowlan: [C: 03+2] service: add {edit,editor,page}-analytics services [puppet] - 10https://gerrit.wikimedia.org/r/962570 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [11:35:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/963288 (https://phabricator.wikimedia.org/T348101) (owner: 10Jelto) [11:38:07] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ayounsi) `ACAST_PS_ADVERTISE` is hardcoded in [[ https://github.com/unixsurfer/anycast_healthchecker | anycast_healthchecker ]]... 
[11:41:29] (03PS1) 10Jgiannelos: tegola: Enable structured logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/963289 [11:42:55] (03PS2) 10Jgiannelos: tegola: Enable structured logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/963289 (https://phabricator.wikimedia.org/T347717) [11:43:12] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [11:43:14] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete, `thanos-fe*` ho... [11:43:28] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [11:43:33] (03CR) 10Jgiannelos: "Now that we use latest tegola we can enable JSON logs." [deployment-charts] - 10https://gerrit.wikimedia.org/r/963289 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos) [11:44:17] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done, we have split `titan*` hosts a... 
[11:44:40] (03CR) 10Btullis: "check automatic" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:45:13] RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:45:34] !log installing exim4 security updates [11:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:01] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:34] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:49:05] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:49:19] PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:04] (03CR) 10Jelto: [C: 03+2] admin: add sfaci to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/963288 (https://phabricator.wikimedia.org/T348101) (owner: 10Jelto) [11:52:19] (03PS3) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [11:52:59] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:53:29] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:34] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [12:02:04] (03PS4) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [12:02:41] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [12:05:30] (03PS5) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [12:06:53] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [12:08:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto) 05Open→03Resolved a:03Jelto sfaci has access to `analytics-admins` now (in the next 30 minute). I'm closing the task. Feel free to reopen if you have p... [12:11:15] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:11:24] 10SRE, 10Infrastructure-Foundations, 10netops: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120 (10ayounsi) Nice rabbit hole! 
I found this: https://www.reddit.com/r/Juniper/comments/g12qxh/the_right_way_to_allow_traceroute_in_re_filter/ So it's possible... [12:11:27] RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:07] (03PS6) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [12:12:32] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [12:14:31] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:31] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:33] PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:45:43] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:40] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 90 days, 0:00:00 on 22 hosts with reason: Downtime for graceful shutdown and later decom [12:46:58] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 90 days, 0:00:00 on 22 hosts with reason: Downtime for graceful shutdown and later decom [12:47:41] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - 
https://phabricator.wikimedia.org/T318783 (10cmooney) 05Open→03Resolved I am going to close this task, the FPC issue was addressed through card replacement (although we decom'd router in the meantime). Despite my best efforts i... [12:51:51] !log powering off ores100{2..9}.eqiad.wmnet (1001 is kept powered-on in case we need access to files from the old install). The machines have a 90d downtime already put in [12:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:55] RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:53:07] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:14] !log powering off ores200{2..9}.codfw.wmnet (2001 is kept powered-on in case we need access to files from the old install). The machines have a 90d downtime already put in. [12:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:00] !log powering off orespoolcounter{1004,2003,2004}.{eqiad,codfw}.wmnet (1003 is kept powered-on in case we need access to files from the old install). The machines have a 90d downtime already put in. 
[12:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:48] (03CR) 10Filippo Giunchedi: [C: 03+1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:56:55] (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:57:13] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:57:58] 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Aklapper) @DennisJJackson Hi and welcome to Phabricator! What //in this ticket// led you to asking for "retriage" (and what does that mean)? [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1300). [13:00:05] aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] !log rook@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage [13:00:12] (03CR) 10Muehlenhoff: Implement Codex design, from design team. 
(033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) (owner: 10Slyngshede) [13:00:19] o/ [13:03:15] !log rook@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage [13:03:22] o/ I can deploy in a moment [13:03:44] o/ [13:03:53] (unless Lucas is faster) [13:03:56] I can also deploy [13:04:27] please do [13:05:53] ok [13:06:57] (03PS3) 10Lucas Werkmeister (WMDE): fonwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:07:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:07:47] (03Merged) 10jenkins-bot: fonwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:08:31] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) p:05Triage→03Low [13:08:41] (03PS8) 10Arturo Borrero Gonzalez: cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) [13:08:45] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) [13:08:55] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) [13:08:58] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963066|fonwiki: add logos (T347939)]] 
[13:09:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:09:08] T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939 [13:09:09] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) [13:09:34] (03CR) 10Vgutierrez: [C: 03+1] "LGTM but please fix the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/963147 (owner: 10Fabfur) [13:10:27] nemo-yiannis: latest deployment of tegola is spamming logs and dumping its full request/response, please mitigate or revert, it is overwhelming logstash :( [13:10:29] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Backport for [[gerrit:963066|fonwiki: add logos (T347939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:33] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) [13:10:35] filing a task now [13:10:48] aanzx: please test [13:10:50] checking [13:11:02] godog: should I hold off deploying or is this sufficiently unrelated to mw? [13:11:13] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:11:28] godog: on it [13:11:41] (03PS1) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) [13:11:42] Lucas_WMDE: some mw logs will be delayed [13:11:43] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:11:45] nemo-yiannis: thank you [13:11:45] (03PS1) 10Jbond: realm: test monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/963300 [13:11:49] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [13:11:50] ok [13:11:58] Lucas_WMDE: look good [13:12:28] (03CR) 10CI reject: [V: 04-1] wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [13:12:59] (03PS1) 10Jgiannelos: Revert "Enable aws-sdk (s3) debug logging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/963301 [13:13:52] there’s an “expected but failed to find position index” error in mwdebug logstash but that seems to be unrelated [13:13:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43863/console" [puppet] - 10https://gerrit.wikimedia.org/r/963300 (owner: 10Jbond) [13:13:59] there’s lots of instances of that on mediawiki-warnings [13:13:59] godog: can you take a look at this patch? 
https://gerrit.wikimedia.org/r/c/operations/software/tegola/+/963301 [13:14:05] (03PS2) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837) [13:14:10] let’s sync then [13:14:11] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Continuing with sync [13:14:18] (03CR) 10Fabfur: purged: use unix socket for varnish in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:14:23] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "Enable aws-sdk (s3) debug logging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/963301 (owner: 10Jgiannelos) [13:14:26] (03CR) 10Cathal Mooney: "Overall looks good to me, one comment. I'm not familiar with those interface:: classes but agree if it works how it looks it seems cleane" [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [13:14:28] !log Cassandra bootstrap, restbase1030-a (`auto_bootstrap: false`) — T346803 [13:14:28] (03PS9) 10Arturo Borrero Gonzalez: cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) [13:14:29] nemo-yiannis: for sure! 
LGTM [13:14:30] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) [13:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:32] T346803: Unable to bootstrap restbase1030-{a,b,c} - https://phabricator.wikimedia.org/T346803 [13:14:32] (03CR) 10CI reject: [V: 04-1] purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:14:44] filed related task as https://phabricator.wikimedia.org/T348141 [13:14:58] (03CR) 10Jgiannelos: [C: 03+2] Revert "Enable aws-sdk (s3) debug logging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/963301 (owner: 10Jgiannelos) [13:15:12] (03Abandoned) 10Jbond: realm: test monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/963300 (owner: 10Jbond) [13:15:46] (03Merged) 10jenkins-bot: Revert "Enable aws-sdk (s3) debug logging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/963301 (owner: 10Jgiannelos) [13:15:48] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) [13:16:07] Lucas_WMDE: Hi, would you be willing to deploy a couple of last-minute patches? [13:16:36] Daimona: there’s still one regular config change in the queue [13:16:38] are they urgent? [13:16:40] RECOVERY - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.234 port 9042 https://phabricator.wikimedia.org/T93886 [13:16:47] ("What patches", you may ask. I still have to write them...) 
[13:17:05] (also kinda waiting for godog / nemo-yiannis to be done fixing logstash – I only realized after continuing the scap that delayed logstash makes the canaries less useful) [13:17:18] Not urgent in the sense that something's broken. Just rolling out a new feature. It can wait if we're having infra issues [13:17:44] (03PS10) 10Arturo Borrero Gonzalez: cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) [13:17:46] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) [13:17:56] (The feature rollout itself is overdue, but I can schedule it for another window today or tomorrow) [13:18:20] (03PS1) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963302 (https://phabricator.wikimedia.org/T347837) [13:18:30] I don’t think the infra issues will persist long [13:18:34] (a patch was already merged) [13:18:35] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [13:18:40] Lucas_WMDE: thank you, yes you are correct that the lag might impact the canaries check [13:18:57] (03CR) 10CI reject: [V: 04-1] purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963302 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:19:22] ok, the config changes are low risk I think but let’s still wait with the second one then [13:19:26] !log rook@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet with OS bullseye [13:19:29] (first one is currently at 62% php-fpm-restart) [13:19:42] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:19:59] (03PS1) 10Jgiannelos: tegola: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963303 [13:20:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:20:42] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963066|fonwiki: add logos (T347939)]] (duration: 11m 43s) [13:20:45] (03PS2) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963302 (https://phabricator.wikimedia.org/T347837) [13:20:46] T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939 [13:20:59] Daimona: feel free to start uploading patches, at least ^^ [13:21:12] Yup, writing them now, sorry [13:21:25] (03CR) 10Jgiannelos: [C: 03+2] tegola: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963303 (owner: 10Jgiannelos) [13:21:59] And it's actually 3 patches, but one is just beta [13:22:17] (03Merged) 10jenkins-bot: tegola: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963303 (owner: 10Jgiannelos) [13:22:57] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:23:35] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [13:23:44] 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10DennisJJackson) @Aklapper - It looks like this issue was originally raised several years ago and put in the icebox. 
I'm flagging that the situation around standardization and deploy... [13:24:05] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [13:24:05] (03PS1) 10Daimona Eaytoy: beta: Explicitly assign campaignevents-email-participants to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) [13:24:11] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [13:24:27] (03Abandoned) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963302 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:24:49] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [13:25:00] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [13:25:05] (03PS3) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837) [13:25:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:25:40] (03CR) 10Fabfur: [C: 03+2] purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:25:48] 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Jhancock.wm) [13:25:55] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [13:25:56] (03PS5) 10Slyngshede: Implement Codex design, from design team. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) [13:26:54] (03CR) 10Slyngshede: "Tool tips are back 😊" [software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) (owner: 10Slyngshede) [13:27:08] (03PS1) 10Daimona Eaytoy: metawiki: Restrict campaignevents-email-participants right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963306 (https://phabricator.wikimedia.org/T336939) [13:27:30] godog: tegola log rate should be more reasonable now [13:27:36] !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963147 (T347837). `purged` daemon will be restarted by puppet in eqiad in the next 30m [13:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:40] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [13:28:17] nemo-yiannis: indeed, I can confirm the kafka lag is going down, thank you for the quick action on this [13:28:23] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [13:28:29] (03PS2) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) [13:28:39] alright, then I’ll continue now [13:28:52] (03PS7) 10Lucas Werkmeister (WMDE): fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:29:00] (03CR) 10CI reject: [V: 04-1] wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [13:29:03] Lucas_WMDE: there's still some lag though it should clear in 10m or so FYI, should be safe to proceed [13:29:08] ah ok [13:29:13] yeah it’ll have to go out to mwdebug first and all [13:29:23] (03CR) 10TrainBranchBot: [C: 03+2] 
"Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:29:28] can I see the lag somewhere? [13:29:45] yes sorry https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus%2Fops&orgId=1&from=now-30m&to=now&var-topic=All&var-consumer_group=All [13:30:01] err, this one [13:30:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43864/console" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [13:30:02] cool, thank you! [13:30:03] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus%2Fops&orgId=1&from=now-30m&to=now&var-topic=All&var-consumer_group=logstash7-eqiad [13:30:10] (03PS1) 10Daimona Eaytoy: prod: Enable wgCampaignEventsEnableEmail in meta and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963307 (https://phabricator.wikimedia.org/T347065) [13:30:14] I’ll look at that before continuing the sync at the end then [13:30:18] (03CR) 10CI reject: [V: 04-1] fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:30:20] (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963308 [13:30:45] Lucas_WMDE: ack [13:31:17] oops, scap failed actually [13:31:19] * Lucas_WMDE looks [13:31:35] “mkdir: cannot create directory ‘log’: Permission denied” https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-test-docker/6591/console [13:31:39] that’s a transient one isn’t it [13:31:43] I’ll try again [13:31:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by 
lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:32:00] quite poetic [13:32:25] (03CR) 10Lucas Werkmeister (WMDE): "trying again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:32:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:32:37] Lucas_WMDE: thanks for bearing with me, my patches are https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/963307 and its 2 dependencies [13:32:40] (03Merged) 10jenkins-bot: fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [13:32:47] yay now it went through [13:32:48] * Lucas_WMDE looks [13:33:06] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963036|fonwiki: add wgSiteName, wgMetaNamespace and timezone (T347939)]] [13:33:10] (03CR) 10FNegri: [C: 03+2] Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [13:33:22] T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939 [13:33:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43866/console" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [13:33:47] jouncebot: next [13:33:47] In 0 hour(s) and 26 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1400) [13:34:08] 10SRE, 
10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) [13:34:30] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Backport for [[gerrit:963036|fonwiki: add wgSiteName, wgMetaNamespace and timezone (T347939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:34:36] aanzx: please test :) [13:34:37] checking [13:34:38] ok [13:36:45] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) [13:37:06] (03PS3) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) [13:37:21] (03CR) 10CI reject: [V: 04-1] cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:37:40] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963308 (owner: 10Volans) [13:38:06] 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10ssingh) Hi @DennisJJackson: Thanks for the question. We do plan to work on ECH and enable it for our sites and have had some discussions internally. There is no timeline yet as such... [13:38:20] seems to be working for me at least [13:38:24] Lucas_WMDE: look good [13:38:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43867/console" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [13:38:30] ok thanks! 
[13:39:13] hm, logstash hasn’t caught up yet [13:39:20] (though it’s definitely improving) [13:39:45] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) [13:39:54] I’ll start it, this should be low risk [13:39:56] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Continuing with sync [13:40:01] (03CR) 10CI reject: [V: 04-1] wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [13:40:13] (03CR) 10David Caro: "It seems that cloudbackups are being resolved as ip6 (from pcc):" [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah) [13:40:23] Lucas_WMDE: +1 [13:40:28] PROBLEM - Check systemd state on releases2003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:48] 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Papaul) @cmooney this should be a complication if we did have a mixed of 1G and 10G servers within the same rack which is not the case. In all exist... [13:42:18] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v7.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963308 (owner: 10Volans) [13:43:09] 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) >>! In T347054#9223568, @ayounsi wrote: > I was wondering what to do for all the appliances that have ntp.site.wikimedia.o... [13:44:12] (03CR) 10Klausman: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: 10Klausman) [13:44:36] (03PS1) 10Volans: Upstream release v7.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963314 [13:45:04] Lucas_WMDE: can you run namespaceDupes.php after sync [13:45:14] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:45:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:45:37] aanzx: so far it says there’s nothing to do [13:45:48] and I assume mwmaint has the updated code already [13:45:51] but I can check again after the scap is done [13:46:49] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) (owner: 10Slyngshede) [13:46:53] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963036|fonwiki: add wgSiteName, wgMetaNamespace and timezone (T347939)]] (duration: 13m 46s) [13:46:57] T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939 [13:47:15] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Implement Codex design, from design team. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) (owner: 10Slyngshede) [13:47:25] (03PS4) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) [13:47:27] !log mwscript namespaceDupes fonwiki --fix # T347939 – 0 pages to fix, 0 resolvable; 0 links to fix, 0 resolvable, 0 deleted [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:35] aanzx: ^ [13:47:44] thanks [13:47:49] (03PS2) 10Lucas Werkmeister (WMDE): beta: Explicitly assign campaignevents-email-participants to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy) [13:47:55] (03PS2) 10Lucas Werkmeister (WMDE): metawiki: Restrict campaignevents-email-participants right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963306 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy) [13:48:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy) [13:48:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963306 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy) [13:48:27] Daimona: deploying the first two changes for now [13:48:31] we’ll see if there’s time for the third one [13:48:36] (03CR) 10CI reject: [V: 04-1] beta: Explicitly assign campaignevents-email-participants to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy) [13:48:37] (maybe after the wikifunctions window) [13:48:42] oops, CI reject :( [13:48:57] (03CR) 10Volans: [C: 03+2] Upstream 
release v7.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963314 (owner: 10Volans) [13:48:59] (03Merged) 10jenkins-bot: beta: Explicitly assign campaignevents-email-participants to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy) [13:49:00] ugh, the same operation not permitted error again [13:49:03] (03Merged) 10jenkins-bot: metawiki: Restrict campaignevents-email-participants right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963306 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy) [13:49:07] Yup, sounds good to me, ty! [13:49:10] ok, it was only the test, not the gate-and-submit 🤷 [13:49:29] ugh, but now it got merged and the scap backport exited… [13:49:30] The first 2 should also be no-ops, so... [13:49:32] (03CR) 10David Caro: P:cloudceph: cleanup firewall rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah) [13:49:38] * Lucas_WMDE restarts it [13:49:54] (03PS1) 10Jclark-ctr: correct an-master1003,4 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/963315 (https://phabricator.wikimedia.org/T342291) [13:49:55] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963305|beta: Explicitly assign campaignevents-email-participants to all users (T336939)]], [[gerrit:963306|metawiki: Restrict campaignevents-email-participants right (T336939)]] [13:49:58] T336939: Add new user right to meta - https://phabricator.wikimedia.org/T336939 [13:50:21] (03CR) 10Ladsgroup: "I don't know this well enough to confidently say it should be merged but generally speaking it looks okay." [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:50:22] Daimona: shouldn’t the second one have an effect? [13:50:28] or do you mean it’s a no-op because the right doesn’t do anything yet? 
[13:50:35] (KubernetesAPILatency) resolved: (26) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:50:42] Yeah, it has an effect but there should be nothing using that right [13:50:51] ok, got it [13:50:56] but I can still test it with the siteinfo api ^^ [13:50:59] Still testable with UserGroupRights though I guess [13:51:02] ^^^ [13:51:12] (03CR) 10Jclark-ctr: [C: 03+2] correct an-master1003,4 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/963315 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [13:51:16] !log lucaswerkmeister-wmde@deploy2002 daimona and lucaswerkmeister-wmde: Backport for [[gerrit:963305|beta: Explicitly assign campaignevents-email-participants to all users (T336939)]], [[gerrit:963306|metawiki: Restrict campaignevents-email-participants right (T336939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:52:21] (03PS7) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [13:52:26] okay, API output diff looks good to me [13:52:34] Daimona: agree? 
^^ [13:53:10] (03Merged) 10jenkins-bot: Upstream release v7.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963314 (owner: 10Volans) [13:53:32] Lemme see [13:53:51] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [13:54:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [13:54:11] Yup, LGTM [13:54:14] !log lucaswerkmeister-wmde@deploy2002 daimona and lucaswerkmeister-wmde: Continuing with sync [13:54:19] \o/ [13:56:25] Lucas_WMDE: As for the third patch: I was waiting for someone from my team to volunteer for testing it, but nobody seems to be available, and I don't have those rights on meta [13:56:34] ah, hm [13:56:40] So, given that we're also approaching the end of the window, I think it should be done another time [13:56:55] ok [13:57:04] (03CR) 10Ottomata: "I'd like to take a stab at doing the broader changes needed for automating this, but I probably won't have time very soon. Don't want to " [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [13:57:10] Oh wait [13:57:27] There's actually someone available. 
But still, it's late, so I'm also still fine with doing that another time [13:57:49] I think this deployment will already overrun a little bit into the wikifunctions window [13:57:54] !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2008.codfw.wmnet [13:57:58] it only just started php-fpm-restart [13:58:07] let’s see how much wikifunctions stuff there is to do, I guess [13:58:30] (03PS8) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [13:58:31] Ok, ty [14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1400) [14:00:12] I’m still deploying, please hold [14:00:32] (03PS1) 10Fabfur: purged: use unix socket for varnish in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) [14:00:35] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963305|beta: Explicitly assign campaignevents-email-participants to all users (T336939)]], [[gerrit:963306|metawiki: Restrict campaignevents-email-participants right (T336939)]] (duration: 10m 40s) [14:00:39] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) Thanks everyone for the discussion and feedback above! So it seems like two main points have come up above: 1. We can c... [14:00:40] alright, I’m done for now [14:00:45] T336939: Add new user right to meta - https://phabricator.wikimedia.org/T336939 [14:00:50] is there anything to deploy from wikifunctions? 
[14:00:59] otherwise I have one more config change I’d like to do [14:01:01] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10jbond) >ACAST_PS_ADVERTISE is hardcoded in anycast_healthchecker (the tool we use to monitor services). in that case agree its t... [14:01:04] * Lucas_WMDE will wait a few minutes [14:01:04] (KubernetesAPILatency) firing: (22) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:01:38] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Jhancock.wm) [14:03:12] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) >>! In T348041#9224990, @jbond wrote: >>ACAST_PS_ADVERTISE is hardcoded in anycast_healthchecker (the tool we use to mon... 
[14:03:41] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43868/console" [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [14:04:32] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:05:35] (03PS2) 10Fabfur: purged: use unix socket for varnish in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) [14:05:47] !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2007.codfw.wmnet [14:06:05] (KubernetesAPILatency) resolved: (25) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:44] !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:08:34] !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2006.codfw.wmnet [14:08:47] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43869/console" [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [14:08:56] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:08:56] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:08:57] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for 
hosts ores2008.codfw.wmnet [14:10:13] !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2005.codfw.wmnet [14:10:36] doesn’t sound like there’s anything to do for wikifunctions today [14:10:45] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:10:48] Daimona: if you still have a tester available then I think we can go ahead [14:10:56] Yup, I do, thank you [14:11:00] ok, let’s go [14:11:13] (03PS2) 10Lucas Werkmeister (WMDE): prod: Enable wgCampaignEventsEnableEmail in meta and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963307 (https://phabricator.wikimedia.org/T347065) (owner: 10Daimona Eaytoy) [14:11:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963307 (https://phabricator.wikimedia.org/T347065) (owner: 10Daimona Eaytoy) [14:12:15] (03Merged) 10jenkins-bot: prod: Enable wgCampaignEventsEnableEmail in meta and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963307 (https://phabricator.wikimedia.org/T347065) (owner: 10Daimona Eaytoy) [14:12:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:12:44] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963307|prod: Enable wgCampaignEventsEnableEmail in meta and officewiki (T347065)]] [14:12:47] T347065: Release the email participants feature - https://phabricator.wikimedia.org/T347065 [14:12:52] !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2009.codfw.wmnet [14:13:51] !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2007.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:14:07] 
(03PS9) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [14:14:12] !log lucaswerkmeister-wmde@deploy2002 daimona and lucaswerkmeister-wmde: Backport for [[gerrit:963307|prod: Enable wgCampaignEventsEnableEmail in meta and officewiki (T347065)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:14:44] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:15:03] (03CR) 10Papaul: [V: 03+2] correct an-master1003,4 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/963315 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [14:15:50] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:16:30] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2007.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:16:31] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:16:31] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores2007.codfw.wmnet [14:16:45] !log starting Cassandra rebuild, restbase1030-a — T346803 [14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:48] T346803: Unable to bootstrap restbase1030-{a,b,c} - https://phabricator.wikimedia.org/T346803 [14:17:04] !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:17:09] Daimona: can you test the change? 
[14:17:13] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:13] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores2005.codfw.wmnet [14:17:16] Yup, coordinating right now [14:17:21] ok thanks [14:17:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:17:55] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:17:55] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:56] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores2006.codfw.wmnet [14:17:57] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:18:01] !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores[1002-1009].eqiad.wmnet [14:19:01] (03PS1) 10Andrew Bogott: Move radosgw/swift API to port 443, the standard swift port [puppet] - 10https://gerrit.wikimedia.org/r/963325 (https://phabricator.wikimedia.org/T341380) [14:20:16] !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:21:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:21:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:19] !log klausman@cumin1001 END (PASS) - 
Cookbook sre.hosts.decommission (exit_code=0) for hosts ores2009.codfw.wmnet [14:21:46] (03CR) 10Vgutierrez: [C: 03+1] purged: use unix socket for varnish in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [14:21:54] (03PS10) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [14:22:03] !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores[2001-2004].codfw.wmnet [14:22:21] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [14:22:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:23:27] (03CR) 10Fabfur: [V: 03+1 C: 03+2] purged: use unix socket for varnish in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [14:23:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [14:24:01] !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963321 (T347837). 
`purged` daemon will be restarted by puppet in drmrs in the next 30m [14:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:05] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [14:25:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Move radosgw/swift API to port 443, the standard swift port [puppet] - 10https://gerrit.wikimedia.org/r/963325 (https://phabricator.wikimedia.org/T341380) (owner: 10Andrew Bogott) [14:25:09] Lucas_WMDE: It's working! [14:25:20] \o/ [14:25:22] !log lucaswerkmeister-wmde@deploy2002 daimona and lucaswerkmeister-wmde: Continuing with sync [14:25:37] anakin_phantom_menace.gif [14:25:55] 10SRE, 10Traffic, 10Patch-For-Review: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837 (10Fabfur) [14:26:05] (03CR) 10Andrew Bogott: [C: 03+2] Move radosgw/swift API to port 443, the standard swift port [puppet] - 10https://gerrit.wikimedia.org/r/963325 (https://phabricator.wikimedia.org/T341380) (owner: 10Andrew Bogott) [14:26:43] * Daimona staring at my IRC client that does not display GIFs, I guess :( [14:26:56] But google always has an answer for you :D [14:27:21] I just typed a fake file name and trusted your brain to fill it in :P [14:27:31] my client definitely doesn’t support gifs either [14:28:02] Oooooooh :D I just by default assumed that gifs were too much for ye olde hexchat [14:28:09] :D [14:29:03] (03PS11) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [14:29:18] * Daimona is happy because, OTOH, his IRC client automatically replaces passwords with ********* when you type them :P [14:29:44] ah, good ole' hunter1 [14:29:52] After all, that's the must-have feature for all IRC clients [14:30:04] :D [14:31:10] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963307|prod: Enable 
wgCampaignEventsEnableEmail in meta and officewiki (T347065)]] (duration: 18m 26s) [14:31:14] T347065: Release the email participants feature - https://phabricator.wikimedia.org/T347065 [14:31:16] * Lucas_WMDE observes klausman has an older version of hunter [14:31:48] it's a shame about bash.org! end of an era [14:32:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:33:08] a german one (http://ibash.de/) grew a weird second life – last quote from ten months ago, but people are just chatting in the comments now, https://xkcd.com/1305/ -style [14:33:22] Oh yeah, the new version is ******* [14:34:16] (03PS1) 10Fabfur: purged: use unix socket for varnish in all DCs. [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) [14:34:28] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:34:42] (03CR) 10CI reject: [V: 04-1] purged: use unix socket for varnish in all DCs. [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [14:36:36] !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores[2001-2004].codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:36:45] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ayounsi) Ah right! My bad. Unrelated and maybe a scope creep, but we could also start by advertising a unicast v6 IP to validat... 
[14:37:23] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:37:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:38:28] !log spontaneously extended UTC afternoon backport+config window done now [14:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:47] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:48] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:48] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ores[1002-1009].eqiad.wmnet [14:39:07] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores[2001-2004].codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001" [14:39:07] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:39:08] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ores[2001-2004].codfw.wmnet [14:39:43] (03PS1) 10Btullis: Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) [14:40:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) 05Open→03In progress p:05Medium→03Low a:03bking 
[14:41:29] (03PS2) 10Btullis: Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) [14:41:31] (03CR) 10CI reject: [V: 04-1] Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) (owner: 10Btullis) [14:41:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Taking this back, as I was able to get the host to boot by changing the boot option for the 2nd NIC interfac... [14:41:49] (03CR) 10Ahmon Dancy: [C: 03+1] P:mw::deployment::server: Don't alert for train-presync [puppet] - 10https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: 10Clément Goubert) [14:42:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:42:43] (03CR) 10CI reject: [V: 04-1] Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) (owner: 10Btullis) [14:43:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:45:04] (03PS3) 10Btullis: Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] 
- 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) [14:48:52] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:11] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:56] (03PS1) 10Bking: partman: fix raid0-3dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963328 (https://phabricator.wikimedia.org/T342463) [14:52:26] (03CR) 10Filippo Giunchedi: [C: 03+1] partman: fix raid0-3dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963328 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [14:53:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:54:01] (03CR) 10Bking: [C: 03+2] partman: fix raid0-3dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963328 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [14:55:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [14:55:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... 
[14:55:52] (03PS1) 10Majavah: wikimediacloud: Add a dedicated CNAME for object storage [dns] - 10https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) [14:56:11] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye [14:56:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [14:56:44] (03CR) 10CI reject: [V: 04-1] wikimediacloud: Add a dedicated CNAME for object storage [dns] - 10https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) (owner: 10Majavah) [14:57:21] (03PS2) 10Majavah: wikimediacloud: Add a dedicated CNAME for object storage [dns] - 10https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) [14:59:19] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B server moves - port-block constraint / numbering - https://phabricator.wikimedia.org/T348125 (10cmooney) 05Open→03Resolved @papaul answered in T348129#9224878, seems like we're in a good place given previous rack assignment as '1... [14:59:22] !log revoke a bot password, https://phabricator.wikimedia.org/T348132 [14:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:25] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [14:59:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10procurement: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10wiki_willy) a:03RobH Adding the procurement project tag. @RobH - can you move this to the S4 space as well? 
Thanks, Willy [15:00:10] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [15:00:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [15:01:41] 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10cmooney) >>! In T348129#9224878, @Papaul wrote: > @cmooney this should be a complication if we did have a mixed of 1G and 10G servers within the sam... [15:02:22] (03PS1) 10Slyngshede: Handle mobile viewport correct. [software/bitu] - 10https://gerrit.wikimedia.org/r/963331 [15:03:02] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Handle mobile viewport correct. [software/bitu] - 10https://gerrit.wikimedia.org/r/963331 (owner: 10Slyngshede) [15:05:02] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:07:07] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudvirt1062-67 - jclark@cumin1001" [15:07:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Jclark-ctr) [15:08:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudvirt1062-67 - jclark@cumin1001" [15:08:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:08:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:22] (03PS1) 10RLazarus: admin: Temporarily add a second ssh key for rzl [puppet] - 10https://gerrit.wikimedia.org/r/963333 [15:12:46] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10cmooney) >>! In T348041#9222035, @ssingh wrote: > We can and probably should have a backup static routes for each of `ns[01]` bu... [15:12:48] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1062.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:50] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1063.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:52] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1064.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:54] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1065.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:55] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1066.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:58] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1067.mgmt.eqiad.wmnet with reboot policy FORCED [15:13:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:47] (03PS1) 10Bking: cloudelastic: include raid0.cfg in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963334 (https://phabricator.wikimedia.org/T342463) [15:14:22] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/963335 
[15:14:57] (03CR) 10Filippo Giunchedi: [C: 03+1] cloudelastic: include raid0.cfg in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963334 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [15:15:07] (03CR) 10BBlack: [C: 03+2] admin: Temporarily add a second ssh key for rzl [puppet] - 10https://gerrit.wikimedia.org/r/963333 (owner: 10RLazarus) [15:15:24] (03CR) 10Bking: [C: 03+2] cloudelastic: include raid0.cfg in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963334 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [15:16:04] (03PS1) 10Slyngshede: Add viewport meta tag [software/bitu] - 10https://gerrit.wikimedia.org/r/963336 [15:16:41] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add viewport meta tag [software/bitu] - 10https://gerrit.wikimedia.org/r/963336 (owner: 10Slyngshede) [15:17:21] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye [15:17:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [15:18:09] (03PS4) 10Btullis: Bump the maximum number of HDFS files before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) [15:21:21] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [15:21:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[15:22:10] (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/963335 (owner: Muehlenhoff)
[15:23:13] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (cmooney) >>! In T348041#9222035, @ssingh wrote: > We can and probably should have a backup static routes for each of `ns[01]` bu...
[15:24:05] (PS1) Slyngshede: Better wording for sign in text. [software/bitu] - https://gerrit.wikimedia.org/r/963339
[15:24:28] (CR) Slyngshede: [V: +2 C: +2] Better wording for sign in text. [software/bitu] - https://gerrit.wikimedia.org/r/963339 (owner: Slyngshede)
[15:25:09] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ayounsi) Oops, I missed some of the comments. * I'm in favor of ditching the statics * Changing the Hiera merge strategy seems...
[15:26:05] (PS1) Jclark-ctr: corrected an-master100[3-4] in site.ppi [puppet] - https://gerrit.wikimedia.org/r/963340 (https://phabricator.wikimedia.org/T342291)
[15:26:12] (CR) Clément Goubert: [C: +2] P:mw::deployment::server: Don't alert for train-presync [puppet] - https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: Clément Goubert)
[15:26:49] (CR) Jclark-ctr: [C: +2] corrected an-master100[3-4] in site.ppi [puppet] - https://gerrit.wikimedia.org/r/963340 (https://phabricator.wikimedia.org/T342291) (owner: Jclark-ctr)
[15:30:27] (PS1) Jbond: test_init: correctly mock spicerack.Dns [software/spicerack] - https://gerrit.wikimedia.org/r/963343
[15:32:21] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:41] !log hashar@deploy2002 Started deploy [integration/docroot@b3b712f]: (no justification provided)
[15:32:47] !log hashar@deploy2002 Finished deploy [integration/docroot@b3b712f]: (no justification provided) (duration: 00m 06s)
[15:32:49] SRE, Wikimedia-Mailing-lists: Undelivered mail posted to wikimediacz-l - https://phabricator.wikimedia.org/T348158 (Urbanecm)
[15:32:56] (PS2) Fabfur: purged: use unix socket for varnish in all DCs. [puppet] - https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837)
[15:33:48] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:35:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:36:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:37:29] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:37:37] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad...
[15:38:09] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (cmooney) p:Triage→Medium
[15:38:29] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (cmooney)
[15:38:39] SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney)
[15:39:37] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ssingh) >>! In T348041#9225321, @cmooney wrote: >>>! In T348041#9222035, @ssingh wrote: >> We can and probably should have a bac...
[15:39:54] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:40:05] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[15:40:31] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:40:31] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:41:18] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ssingh) >>! In T348041#9225405, @ayounsi wrote: > Oops, I missed some of the comments. > > * I'm in favor of ditching the stati...
[15:42:18] (03PS1) 10Jclark-ctr: add cloudvirt10[62-67] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963366 (https://phabricator.wikimedia.org/T342537) [15:43:48] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1063.mgmt.eqiad.wmnet with reboot policy FORCED [15:44:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1064.mgmt.eqiad.wmnet with reboot policy FORCED [15:44:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1067.mgmt.eqiad.wmnet with reboot policy FORCED [15:44:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1066.mgmt.eqiad.wmnet with reboot policy FORCED [15:44:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1065.mgmt.eqiad.wmnet with reboot policy FORCED [15:44:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:45:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1062.mgmt.eqiad.wmnet with reboot policy FORCED [15:45:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:45:35] (03CR) 10Jclark-ctr: [C: 03+2] add cloudvirt10[62-67] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963366 (https://phabricator.wikimedia.org/T342537) (owner: 10Jclark-ctr) [15:45:49] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for 
ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10jbond) >>! In T348041#9225478, @ssingh wrote: >>>! In T348041#9225405, @ayounsi wrote: >> * Changing the Hiera merge strategy s... [15:46:54] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) For posterity: - no static routes - merge strategy Arzhel mentioned above - I am going to rename `ACAST_PS_ADVERTISE`... [15:47:15] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [15:47:19] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [15:47:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [15:47:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [15:49:06] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43870/console" [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [15:51:09] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:52:13] (03PS1) 10Andrea Denisse: alertmanager: Add the "Auto-Submitted: auto-generated" header to AM emails [puppet] - 
10https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) [15:55:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:56:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:56:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:58:17] (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/output/908604/2460/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:58:34] (03CR) 10Hashar: [C: 03+1] "I guess this can be merged at any time." [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:59:20] 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Papaul) I am thinking about something to consider when going servers refresh or new servers [16:00:04] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST clusterissuers) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:05:04] (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST clusterissuers) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:05:41] (03PS14) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 
10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) [16:05:43] (03CR) 10Ebernhardson: Draft: cirrus streaming updater service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [16:05:52] (03Abandoned) 10Jforrester: mw-on-k8s: Serve 100% of wikifunctions.org traffic [puppet] - 10https://gerrit.wikimedia.org/r/961413 (https://phabricator.wikimedia.org/T347509) (owner: 10Jforrester) [16:06:29] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/963368/43872/" [puppet] - 10https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) (owner: 10Andrea Denisse) [16:06:44] (03CR) 10CI reject: [V: 04-1] Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [16:07:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bullseye [16:07:18] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [16:07:20] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [16:07:21] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye [16:07:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS b... 
[16:07:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS b... [16:07:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS b... [16:07:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS b... [16:09:41] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10cmooney) p:05Triage→03Medium [16:11:11] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney) [16:11:17] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10cmooney) [16:15:12] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1062'] [16:15:18] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1063'] [16:15:24] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1064'] [16:15:49] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts ['cloudvirt1065'] [16:21:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1062'] [16:21:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1064'] [16:21:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1065'] [16:21:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1063'] [16:21:34] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:21:40] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067'] [16:21:46] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066'] [16:21:48] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067'] [16:22:06] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:22:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066'] [16:22:28] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:22:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066'] [16:22:53] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (10ayounsi) Yeah, that's perfect. We can revisit the day it dies and needs to be migrated to a VM. 
[16:23:15] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bullseye [16:23:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [16:23:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bullseye [16:23:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [16:23:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:23:44] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067'] [16:23:54] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066'] [16:23:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067'] [16:24:23] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:24:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067'] [16:24:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066'] [16:24:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067'] [16:24:54] !log jclark@cumin1001 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:24:57] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067'] [16:25:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067'] [16:25:44] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:25:46] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:25:54] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066'] [16:25:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066'] [16:26:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:26:50] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067'] [16:26:58] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066'] [16:27:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067'] [16:28:07] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067'] [16:28:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [16:31:50] (03PS1) 10Cathal Mooney: Add ns0 and ns1 /32 routes to anycast_prefixes list [homer/public] - 10https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041) [16:34:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts 
['cloudvirt1066'] [16:34:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1067'] [16:34:49] (03CR) 10Vgutierrez: [C: 03+1] "change itself looks good, PCC should cover each DC though" [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [16:35:13] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:36:58] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (10cmooney) [16:39:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Hello DC Ops, I've confirmed that our new partman recipe works in T342463 , but the reimage for `cloudelas... 
[16:39:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) p:05Low→03Medium a:05bking→03None [16:40:12] (03PS1) 10Andrew Bogott: openstack admin scripts: remove wmcs-ceph-migrate [puppet] - 10https://gerrit.wikimedia.org/r/963377 [16:40:33] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:40:33] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:41:22] (03PS5) 10Brion VIBBER: Drop old VP8 video transcodes, enable HLS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) [16:42:19] (03CR) 10FNegri: [C: 03+1] openstack admin scripts: remove wmcs-ceph-migrate [puppet] - 10https://gerrit.wikimedia.org/r/963377 (owner: 10Andrew Bogott) [16:48:01] jouncebot: nowandnext [16:48:01] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [16:48:01] In 0 hour(s) and 11 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1700) [16:49:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [16:49:07] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [16:49:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... 
[16:49:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [16:49:18] !log taavi@mwmaint2002 ~ $ mwscript extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php metawiki | tee T242031-sul.log # T242031 [16:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:22] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [16:53:43] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:53:43] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:54:56] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye [16:55:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [16:55:11] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:55:11] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:56:35] (03CR) 10Fabfur: [V: 03+1] purged: use unix socket for varnish in all DCs. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [16:57:23] (03PS1) 10Andrew Bogott: Keystone: upgrade init scripts for Antelope [puppet] - 10https://gerrit.wikimedia.org/r/963378 [16:58:07] (03CR) 10Andrew Bogott: [C: 03+2] openstack admin scripts: remove wmcs-ceph-migrate [puppet] - 10https://gerrit.wikimedia.org/r/963377 (owner: 10Andrew Bogott) [16:59:10] (03CR) 10Ayounsi: [C: 03+1] "one nit" [homer/public] - 10https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041) (owner: 10Cathal Mooney) [16:59:12] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 10 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43873/console" [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [16:59:54] !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963326 (T347837). `purged` daemon will be restarted by puppet in esams in the next 30m [16:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:58] T347837: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837 [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1700) [17:00:06] (03CR) 10Fabfur: [V: 03+1 C: 03+2] purged: use unix socket for varnish in all DCs. 
[puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [17:00:43] 10SRE, 10Traffic, 10Patch-For-Review: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837 (10Fabfur) [17:01:07] 10SRE, 10Traffic, 10Patch-For-Review: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837 (10Fabfur) 05Open→03Resolved [17:03:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1062.eqiad.wmnet with OS bullseye [17:03:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bullseye [17:06:11] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [17:10:36] (03PS1) 10DCausse: rdf-streaming-updater: bump to 0.3.135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963383 (https://phabricator.wikimedia.org/T326914) [17:16:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1066.eqiad.wmnet with OS bullseye [17:19:28] (03CR) 10Andrew Bogott: [C: 03+1] "Are you imagining that we'd also move the openstack swift endpoint in the keystone catalog, or just keep this around as a fallback? 
(Or, " [dns] - 10https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) (owner: 10Majavah) [17:20:22] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: bump to 0.3.135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963383 (https://phabricator.wikimedia.org/T326914) (owner: 10DCausse) [17:21:12] (03Merged) 10jenkins-bot: rdf-streaming-updater: bump to 0.3.135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963383 (https://phabricator.wikimedia.org/T326914) (owner: 10DCausse) [17:22:17] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:22:34] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:23:47] 10SRE, 10Cloud-VPS: cloudlb2001-dev and cloudlb2002-dev connected at different speeds - https://phabricator.wikimedia.org/T348173 (10cmooney) p:05Triage→03Low [17:24:39] 10SRE, 10Traffic: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (10ssingh) [17:24:51] 10SRE, 10Traffic: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (10ssingh) [17:24:58] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) [17:26:23] RECOVERY - Disk space on testreduce1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops [17:27:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye [17:27:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS 
bullseye [17:27:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro... [17:27:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro... [17:29:12] jouncebot: nowandnext [17:29:12] For the next 0 hour(s) and 30 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1700) [17:29:12] In 0 hour(s) and 30 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1800) [17:29:12] In 0 hour(s) and 30 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1800) [17:29:58] (03PS1) 10Ssingh: bird: rename ACAST_PS_ADVERTISE to BIRD_IP{4,6}_ADVERTISE [puppet] - 10https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174) [17:30:16] (03PS1) 10Urbanecm: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) [17:30:31] (03PS1) 10Urbanecm: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) [17:31:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1062'] [17:32:06] !log jclark@cumin1001 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1063'] [17:32:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1064'] [17:32:20] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1065'] [17:32:23] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066'] [17:33:36] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bullseye [17:33:38] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [17:33:39] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [17:33:41] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye [17:33:42] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bullseye [17:33:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bullseye [17:33:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [17:33:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [17:33:55] 10SRE, 10ops-eqiad, 
10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye [17:34:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bullseye [17:37:22] (03PS1) 10Majavah: Set READ_NEW for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963388 (https://phabricator.wikimedia.org/T242031) [17:37:27] (03CR) 10Andrew Bogott: [C: 03+2] designate pools.yaml: remove a domain-terminating '.' [puppet] - 10https://gerrit.wikimedia.org/r/961170 (owner: 10Andrew Bogott) [17:41:28] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10ssingh) [17:42:42] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10ssingh) p:05Triage→03Medium [17:43:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testreduce1002.eqiad.wmnet [17:43:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye [17:43:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro... 
[17:47:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testreduce1002.eqiad.wmnet [17:47:54] (03CR) 10CI reject: [V: 04-1] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [17:52:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [17:52:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [18:00:07] jeena and dduvall: My dear minions, it's time we take the moon! Just kidding. Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1800). [18:00:07] jeena and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1800). 
[18:00:51] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963391 (https://phabricator.wikimedia.org/T347080) [18:00:53] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963391 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:00:58] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) p:05Triage→03Medium [18:01:35] (03PS1) 10Subramanya Sastry: parsoid-rt-client: Reduce worker pool to 24 clients [puppet] - 10https://gerrit.wikimedia.org/r/963392 (https://phabricator.wikimedia.org/T345220) [18:01:47] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) [18:01:53] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney) [18:02:03] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963391 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:08:28] Lots of errors are being logged. [18:08:40] jeena: Roll back! [18:09:00] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.29 refs T347080 [18:09:02] rolling back [18:09:04] T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080 [18:09:32] dancy: can I just cancel this deploy if it's not done? 
[18:09:45] that should work [18:09:47] yes [18:10:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:10:41] bah deprecations, looks like this one was filed already. Adding as a blocker. [18:10:49] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963394 (https://phabricator.wikimedia.org/T347080) [18:10:51] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963394 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:11:35] fun [18:11:41] Hmm.. that commit message title is wrong. [18:11:43] I thought it was fine since they were deprecation warnings [18:11:43] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963394 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:11:46] Was that autogenerated? [18:11:48] but they did increase a lot [18:11:50] yeah [18:12:05] https://phabricator.wikimedia.org/T348180 [18:12:20] thcipriani: oh, it was? [18:12:38] (03PS12) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [18:13:02] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [18:13:17] jeena: Can you send me a transcript of what you ran to rollback? I want to fix that. [18:13:47] okay [18:14:00] looks like the deprecation errors are known and filtered out on the New Errors dash. 
there are others that spiked, however [18:14:33] Yeah I also see a cirrusSearchHandler error, but not as many as the deprecation warnings [18:14:35] dduvall: hrm, coming from the same place, but slightly different path https://phabricator.wikimedia.org/T348134 (not an api error) [18:14:47] i see [18:15:04] (03CR) 10CI reject: [V: 04-1] [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [18:15:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:17:31] (03CR) 10Muehlenhoff: [C: 03+2] "Looks good, merging." [puppet] - 10https://gerrit.wikimedia.org/r/963392 (https://phabricator.wikimedia.org/T345220) (owner: 10Subramanya Sastry) [18:18:15] PROBLEM - Disk space on testreduce1001 is CRITICAL: DISK CRITICAL - free space: /srv/data 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops [18:18:35] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificaterequests) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:19:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1062.eqiad.wmnet with OS bullseye [18:19:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1066.eqiad.wmnet with OS bullseye [18:19:44] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.29 refs T347080 
[18:19:48] T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080 [18:20:50] jeena, thcipriani i went ahead and filed https://phabricator.wikimedia.org/T348181 as well [18:21:04] thanks dduvall [18:21:16] np [18:21:27] dduvall: thanks – merged your other task with the existing one and noted the different stack trace, added as a blocker [18:21:41] k [18:23:35] (KubernetesAPILatency) firing: (29) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:28:35] (KubernetesAPILatency) resolved: (29) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:33:37] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [18:38:43] RECOVERY - Disk space on testreduce1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops [18:45:21] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [18:53:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
cloudvirt1064.eqiad.wmnet with OS bullseye [18:54:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro... [18:54:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye [18:54:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro... [19:04:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T343198)', diff saved to https://phabricator.wikimedia.org/P52822 and previous config saved to /var/cache/conftool/dbconfig/20231004-190427-arnaudb.json [19:04:35] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:05:56] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Eevans) Speaking here only with respect to the data model: TL;DR I think you need to change the schema like so... `lang=diff ---... [19:07:02] 10SRE, 10Data Products, 10Traffic: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577 (10VirginiaPoundstone) @Milimetric What is the status on this task? 
[19:12:33] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye [19:12:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro... [19:19:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [19:19:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P52823 and previous config saved to /var/cache/conftool/dbconfig/20231004-191933-arnaudb.json [19:34:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P52824 and previous config saved to /var/cache/conftool/dbconfig/20231004-193439-arnaudb.json [19:43:48] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:49:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T343198)', diff saved to https://phabricator.wikimedia.org/P52825 and previous config saved to /var/cache/conftool/dbconfig/20231004-194946-arnaudb.json [19:49:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [19:49:51] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:50:03] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with 
reason: Maintenance [19:50:04] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [19:50:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [19:50:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T343198)', diff saved to https://phabricator.wikimedia.org/P52826 and previous config saved to /var/cache/conftool/dbconfig/20231004-195023-arnaudb.json [19:52:44] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Htriedman) @Eevans Understood! I'll make that change to the schema soon. As far as returning a single `DPPageviews` vs. an array w... [19:58:36] (03PS1) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [19:59:40] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Eevans) >>! In T343855#9226451, @Htriedman wrote: > @Eevans Understood! I'll make that change to the schema soon. > > As far as re... [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T2000). [20:00:04] No Gerrit patches in the queue for this window AFAICS. 
[20:00:23] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for postal32 - https://phabricator.wikimedia.org/T348197 (10Postal32) [20:00:45] i'll steal the window [20:00:59] (03CR) 10Urbanecm: [C: 03+2] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:01:49] (03PS1) 10Urbanecm: Fix phan for GrowthExperiments [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963349 (https://phabricator.wikimedia.org/T347571) [20:02:03] (03PS2) 10Urbanecm: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) [20:02:42] (03CR) 10Urbanecm: [C: 03+2] Fix phan for GrowthExperiments [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963349 (https://phabricator.wikimedia.org/T347571) (owner: 10Urbanecm) [20:02:45] (03CR) 10Urbanecm: [C: 03+2] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:03:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:03:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:03:41] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: upgrade init scripts for Antelope [puppet] - 10https://gerrit.wikimedia.org/r/963378 (owner: 10Andrew Bogott) [20:04:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.197 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:04:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:14:48] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for postal32 - https://phabricator.wikimedia.org/T348197 (10Aklapper) 05Open→03Stalled Hi @Postal32, thanks for taking the time to report this and welcome to Wikimedia Phabricator! Please provide reasons why you'd like to to access Netbox and how you plan... [20:21:41] (03CR) 10CI reject: [V: 04-1] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:22:16] (03CR) 10Urbanecm: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:22:19] (03CR) 10Urbanecm: [C: 03+2] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:23:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:23:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:23:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] 
(wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963349 (https://phabricator.wikimedia.org/T347571) (owner: 10Urbanecm) [20:29:11] (03Merged) 10jenkins-bot: Fix phan for GrowthExperiments [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963349 (https://phabricator.wikimedia.org/T347571) (owner: 10Urbanecm) [20:29:13] (03CR) 10CI reject: [V: 04-1] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:30:12] (03CR) 10Ebernhardson: "I certainly think the wider goal is still worth pursing, but indeed this is also blocking our staging deployment of the cirrus service so " [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [20:31:27] (03PS9) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) [20:31:29] (03PS15) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) [20:31:31] (03PS1) 10Ebernhardson: rdf-streaming-updater: dse: Rename release to wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/963409 [20:32:18] (03CR) 10Ebernhardson: "This also ensures that when we use .Release.Name in a path it has a more decriptive path." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/963409 (owner: 10Ebernhardson) [20:32:36] (03CR) 10CI reject: [V: 04-1] Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [20:45:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bullseye [20:45:29] (03Merged) 10jenkins-bot: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:45:51] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bullseye [20:45:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bullseye [20:45:59] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [20:46:03] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [20:46:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [20:46:09] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye [20:46:11] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 
(https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm) [20:46:12] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bullseye [20:46:14] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [20:46:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [20:46:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye [20:46:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bullseye [20:46:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [20:46:55] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:963347|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963348|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963349|Fix phan for GrowthExperiments (T347571)]] [20:47:00] T346760: Pages transcluding Special:ManageMentors are sometimes being rendered in the default skin - https://phabricator.wikimedia.org/T346760 [20:47:00] 
T347571: GrowthExperiments fails CI: mwext-php74-phan-docker - https://phabricator.wikimedia.org/T347571 [20:48:20] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:963347|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963348|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963349|Fix phan for GrowthExperiments (T347571)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:48:39] !log urbanecm@deploy2002 urbanecm: Continuing with sync [20:53:03] RECOVERY - Check systemd state on releases2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:45] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:963347|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963348|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963349|Fix phan for GrowthExperiments (T347571)]] (duration: 07m 49s) [20:54:56] T346760: Pages transcluding Special:ManageMentors are sometimes being rendered in the default skin - https://phabricator.wikimedia.org/T346760 [20:54:56] T347571: GrowthExperiments fails CI: mwext-php74-phan-docker - https://phabricator.wikimedia.org/T347571 [20:56:34] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (POST events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:57:04] scap's quick again. yay! :) [20:57:05] (03CR) 10Gehel: [C: 04-1] "Some comments about the structure, see inline, of ping me if you want more context." 
[puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [20:58:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2054.codfw.wmnet with OS bullseye [20:58:15] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2054.codfw.wmnet with OS bullseye [20:59:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1062.eqiad.wmnet with OS bullseye [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T2100) [21:01:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (POST events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:02:33] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bullseye [21:02:37] PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bullseye [21:04:05] RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:25] PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL - degraded: The following units 
failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:10:49] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:11:36] (03PS3) 10JHathaway: postgresql: fix ordering on a new install [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) [21:13:37] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:14:31] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:38] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (POST events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:16:16] (03CR) 10JHathaway: postgresql: fix ordering on a new install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [21:20:19] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage [21:23:34] !log jclark@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage [21:26:34] (03PS1) 10Subramanya Sastry: parsoid-rt-client: Further reduce worker pool to 16 clients [puppet] - 10https://gerrit.wikimedia.org/r/963413 (https://phabricator.wikimedia.org/T345220) [21:30:35] RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:03] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:31:13] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:31:21] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:19] (03CR) 10Volans: [C: 03+2] "LGTM, thx for the fix" [software/spicerack] - 10https://gerrit.wikimedia.org/r/963343 (owner: 10Jbond) [21:33:55] (03PS1) 10Subramanya Sastry: Revert "Deprecate TOC mutation in OutputPageParserOutput hook" [core] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963351 (https://phabricator.wikimedia.org/T348134) [21:34:51] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [21:36:24] (03Merged) 10jenkins-bot: test_init: correctly mock spicerack.Dns [software/spicerack] - 10https://gerrit.wikimedia.org/r/963343 (owner: 10Jbond) [21:36:41] PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The 
following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:05] PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:37:15] jouncebot nowandnext [21:37:16] For the next 0 hour(s) and 22 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T2100) [21:37:16] In 8 hour(s) and 22 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0600) [21:37:16] In 8 hour(s) and 22 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0600) [21:38:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [21:38:49] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:39:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963351 (https://phabricator.wikimedia.org/T348134) (owner: 10Subramanya Sastry) [21:39:29] ^ cc: jeena [21:40:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:40:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1062.eqiad.wmnet with OS bullseye [21:40:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - 
https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bullseye completed: - cloud... [21:42:27] brennen: +1 I think there is another blocker still before we can roll forward [21:43:14] yeah. [21:43:15] oh, looks like it has a fix as well https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/963405/ [21:43:16] PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:43:35] hmm, is that one a straightforward revert or should we wait for review? [21:44:04] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:07] I'm not sure, they said revert but the patch doesn't so I'd prefer to wait [21:44:23] brennen, afk for about 10 mins. but will be here again then if you need anything from me before i bail for the evening. [21:44:31] looks like zuul needs that time anyway. [21:44:32] there's no reviewer added though [21:44:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bullseye [21:44:37] (03CR) 10Bking: [C: 03+1] rdf-streaming-updater: dse: Rename release to wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/963409 (owner: 10Ebernhardson) [21:44:44] subbu: thanks - i'm guessing there's not much to test with this one?
[21:44:51] (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.3.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963414 [21:44:59] (03CR) 10Bking: [C: 03+1] Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [21:45:02] dont think so? it is just going to stop emitting the deprecations. [21:45:08] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.3.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963414 (owner: 10Volans) [21:45:23] cool. [21:46:05] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye [21:46:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye [21:46:53] jeena: i asked - https://phabricator.wikimedia.org/T348181#9226698 [21:47:20] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:34] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:49:00] RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:49:01] (03PS3) 10JHathaway: puppetdb: avoid creating database users via dbconfig [puppet] - 10https://gerrit.wikimedia.org/r/959231 
(https://phabricator.wikimedia.org/T346842) [21:49:38] (03CR) 10JHathaway: puppetdb: avoid creating database users via dbconfig (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [21:50:55] (03PS1) 10Volans: Upstream release v7.3.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963415 [21:51:06] (03CR) 10Volans: [C: 03+2] Upstream release v7.3.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963415 (owner: 10Volans) [21:51:20] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:06] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:52:20] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:46] (03Merged) 10jenkins-bot: Revert "Deprecate TOC mutation in OutputPageParserOutput hook" [core] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963351 (https://phabricator.wikimedia.org/T348134) (owner: 10Subramanya Sastry) [21:53:14] !log brennen@deploy2002 Started scap: Backport for [[gerrit:963351|Revert "Deprecate TOC mutation in OutputPageParserOutput hook" (T348134)]] [21:53:18] T348134: PHP Deprecated: Use of OutputPageParserOutput hook to mutate TOC was deprecated in MediaWiki 1.41 - https://phabricator.wikimedia.org/T348134 [21:53:26] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:54:39] !log brennen@deploy2002 brennen and ssastry: Backport for [[gerrit:963351|Revert 
"Deprecate TOC mutation in OutputPageParserOutput hook" (T348134)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:54:42] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:54:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1066.eqiad.wmnet with OS bullseye [21:54:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bullseye completed: - cloud... [21:55:53] (03PS3) 10Jforrester: wikifunctions: Use function-orchestrator image with better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/962650 (https://phabricator.wikimedia.org/T346264) [21:55:55] (03PS2) 10Jforrester: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - 10https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388) [21:55:57] (03PS2) 10Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) [21:55:59] (03PS2) 10Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) [21:56:12] RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:56:25] !log brennen@deploy2002 brennen and ssastry: Continuing with sync [21:56:58] 
RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:06] (03PS3) 10JHathaway: puppetdb: add ability to configure db_ro_host [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) [21:58:10] 10SRE, 10All-and-every-Wiktionary, 10Product-Analytics, 10Wikimedia-Fundraising, 10SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (10Pols12) [21:58:43] !log uploaded spicerack_7.3.1 to apt.wikimedia.org bullseye-wikimedia [21:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bullseye [21:59:17] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: dse: Rename release to wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/963409 (owner: 10Ebernhardson) [21:59:25] (03CR) 10Bking: [C: 03+2] Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [22:00:13] brennen, how is it looking? 
[22:00:17] (03Merged) 10jenkins-bot: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [22:00:35] just restarting php now, we'll know shortly [22:02:18] RECOVERY - cassandra-b CQL 10.64.48.235:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.235 port 9042 https://phabricator.wikimedia.org/T93886 [22:02:28] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:963351|Revert "Deprecate TOC mutation in OutputPageParserOutput hook" (T348134)]] (duration: 09m 13s) [22:02:32] !log starting Cassandra rebuild, restbase1030-b — T346803 [22:02:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:02:40] T348134: PHP Deprecated: Use of OutputPageParserOutput hook to mutate TOC was deprecated in MediaWiki 1.41 - https://phabricator.wikimedia.org/T348134 [22:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:53] T346803: Unable to bootstrap restbase1030-{a,b,c} - https://phabricator.wikimedia.org/T346803 [22:03:27] subbu: last deprecation notice at 21:57 UTC, if i don't see any in the next 10 min or so i'll assume that's fixed. [22:03:49] ok .. i would be surprised if you saw any new ones .. [22:04:09] i am going to sign off now .. but will check in again in a couple hours. [22:04:22] (will look at the phab task). 
[22:05:14] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye [22:05:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye [22:06:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye [22:06:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye [22:06:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro... [22:06:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro... [22:06:30] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye [22:06:30] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Htriedman) @Eevans In that case, I'll change the data model to drop it! Will update this thread when it's done. 
[22:06:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro... [22:07:35] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:09:21] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) (owner: 10Andrea Denisse) [22:11:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [22:11:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [22:13:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2054.codfw.wmnet with OS bullseye [22:13:23] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2054.codfw.wmnet with OS bullseye executed with errors: - kubernetes2054 (**FAIL**) - Removed from... 
[22:18:41] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [22:18:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [22:21:25] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [22:22:44] 10SRE, 10All-and-every-Wiktionary, 10Language-Team, 10Product-Analytics, 10SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (10greg) [22:23:54] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [22:24:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [22:24:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [22:27:54] (03PS16) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) [22:28:27] (03PS17) 10Ebernhardson: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) [22:39:35] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [22:40:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [22:40:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1065.eqiad.wmnet with OS bullseye [22:40:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye completed: - cloud... [23:00:22] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:32:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye [23:32:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro... [23:38:58] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye [23:39:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro... 
[23:43:48] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:44:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye [23:44:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro...