[00:00:20] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1049275 (owner: 10TrainBranchBot) [00:01:08] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on 8 hosts with reason: T365763 [00:01:19] T365763: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763 [00:01:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: T365763 [00:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:30] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [00:08:45] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [00:10:04] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [00:14:32] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:14:38] RECOVERY - PyBal backends health check on 
lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:18:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:18:38] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:43:16] !log sudo pkill mpeg: mw1438, high CPU usage, ffmpeg processes [00:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:29] !log [correction of command] sudo pkill ffmpeg: mw1438, high CPU usage, ffmpeg processes [00:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:10] ok should recover now [00:44:36] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:44:42] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:45:12] PROBLEM - PyBal IPVS diff check on lvs5004 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [00:45:20] PROBLEM - PyBal IPVS diff check on lvs5006 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [00:45:32] FIRING: [2x] JobUnavailable: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:45:43] ^ expected [00:46:06] silencing [00:49:14] FIRING: [7x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@eqsin - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:49:21] ^silenced [00:49:27] thanks [00:58:30] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [00:58:45] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [01:03:30] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [01:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:07:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.11 [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049282 (https://phabricator.wikimedia.org/T366956) [01:07:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Popups] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049181 
(https://phabricator.wikimedia.org/T366419) (owner: 10Func) [01:07:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.11 [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049282 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [01:11:28] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919775 (10BCornwall) [01:32:33] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.11 [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049282 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [01:36:05] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919804 (10BCornwall) [01:40:11] !log Removing downtime for cp[5017-5024] as nvme drives are installed and hosts back online - T365763 [01:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:16] T365763: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763 [01:42:58] FIRING: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:43:01] ha [01:43:05] all good [01:43:18] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=cache_text,dc=eqsin [01:43:37] (03PS1) 10BCornwall: Revert "depool eqsin for text cluster drive upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1049286 [01:44:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:10] RECOVERY - PyBal IPVS diff check on lvs5004 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [01:45:20] RECOVERY - PyBal IPVS diff check on lvs5006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [01:47:21] (03CR) 10Ssingh: [C:03+1] Revert "depool eqsin for text cluster drive upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1049286 (owner: 10BCornwall) [01:47:45] (03CR) 10BCornwall: [C:03+2] Revert "depool eqsin for text cluster drive upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1049286 (owner: 10BCornwall) [01:47:58] RESOLVED: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:48:27] !log Running authdns-update on dns1004 to pool eqsin - T365763 [01:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:32] T365763: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763 [01:48:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:54:47] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919821 (10BCornwall) [01:55:00] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919822 (10BCornwall) 
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0200) [02:18:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST services) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:23:53] FIRING: [5x] KubernetesAPILatency: High Kubernetes API latency (LIST csidrivers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:28:53] RESOLVED: [5x] KubernetesAPILatency: High Kubernetes API latency (LIST csidrivers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0300) [03:01:52] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049289 (https://phabricator.wikimedia.org/T366956) [03:01:54] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049289 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [03:02:35] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.11 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049289 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [03:03:02] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.11 refs T366956 [03:03:08] T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956 [03:28:10] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:29:14] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:38:10] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:55:21] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.11 refs T366956 (duration: 52m 19s) [03:55:26] T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0400) [04:01:03] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.8 (duration: 00m 55s) [04:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:12:54] (03CR) 10Ayounsi: [C:03+2] "self merging as it's only for the dev instance and CI/PCC is happy." [puppet] - 10https://gerrit.wikimedia.org/r/1049263 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [05:24:20] (03PS1) 10Ayounsi: Netbox 4: rename device_role to role [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1049292 (https://phabricator.wikimedia.org/T336275) [05:32:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [05:32:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [05:32:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T367856)', diff saved to https://phabricator.wikimedia.org/P65394 and previous config saved to /var/cache/conftool/dbconfig/20240625-053239-marostegui.json [05:32:44] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:32:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [05:32:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [05:33:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:33:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:33:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T364069)', diff saved to 
https://phabricator.wikimedia.org/P65395 and previous config saved to /var/cache/conftool/dbconfig/20240625-053312-marostegui.json [05:33:18] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [05:35:52] (03CR) 10Marostegui: mariadb: monitoring memory pressure (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [05:48:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:59:06] (03CR) 10Marostegui: mariadb: add monitoring on io pressure for mariadb hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0600) [06:00:05] marostegui, Amir1, and arnaudb: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0600). 
[06:02:34] (03PS2) 10Arnaudb: dbconfig: temporary disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047910 (https://phabricator.wikimedia.org/T368020) [06:02:51] !log Drop ipblocks from s6 T367632 [06:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:56] T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632 [06:04:04] (03CR) 10Marostegui: dbconfig: temporary disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047910 (https://phabricator.wikimedia.org/T368020) (owner: 10Arnaudb) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:29] (03CR) 10Arnaudb: [C:03+2] dbconfig: temporary disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047910 (https://phabricator.wikimedia.org/T368020) (owner: 10Arnaudb) [06:05:23] !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1047910|dbconfig: temporary disable writes on es7 (T368020)]] [06:05:41] T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020 [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:11:56] !log Drop ipblocks from s7 T367632 [06:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:02] T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632 [06:17:16] 
!log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1047910|dbconfig: temporary disable writes on es7 (T368020)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:17:22] T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020 [06:19:05] !log arnaudb@deploy1002 arnaudb: Continuing with sync [06:24:10] !log arnaudb@deploy1002 Finished scap: Backport for [[gerrit:1047910|dbconfig: temporary disable writes on es7 (T368020)]] (duration: 18m 47s) [06:24:15] T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020 [06:25:16] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1049310 (https://phabricator.wikimedia.org/T368355) [06:25:22] (03CR) 10Muehlenhoff: [C:03+2] Remove apereo spec test [puppet] - 10https://gerrit.wikimedia.org/r/1049139 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [06:25:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es7 T368020 [06:25:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T368020 [06:26:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set es1039 with weight 0 T368020', diff saved to https://phabricator.wikimedia.org/P65396 and previous config saved to /var/cache/conftool/dbconfig/20240625-062640-arnaudb.json [06:27:35] (03Abandoned) 10Muehlenhoff: mailman: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971412 (owner: 10Muehlenhoff) [06:27:43] (03PS2) 10Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1047906 (https://phabricator.wikimedia.org/T368020) [06:31:28] (03PS1) 10Muehlenhoff: Revert "Point codfw and codfw1dev to use the eqiad LDAP ro servers as well" [puppet] - 10https://gerrit.wikimedia.org/r/1049378 
(https://phabricator.wikimedia.org/T367861) [06:32:26] (03PS1) 10Marostegui: db2129: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1049379 [06:32:31] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote es1039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1047905 (https://phabricator.wikimedia.org/T368020) (owner: 10Gerrit maintenance bot) [06:33:14] (03CR) 10Marostegui: [C:03+2] db2129: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1049379 (owner: 10Marostegui) [06:33:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T364069)', diff saved to https://phabricator.wikimedia.org/P65397 and previous config saved to /var/cache/conftool/dbconfig/20240625-063334-marostegui.json [06:33:39] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [06:33:46] !log Starting es7 eqiad failover from es1035 to es1039 - T368020 [06:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:52] T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020 [06:34:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote es1039 to es7 primary T368020', diff saved to https://phabricator.wikimedia.org/P65398 and previous config saved to /var/cache/conftool/dbconfig/20240625-063453-arnaudb.json [06:36:18] (03PS1) 10Arnaudb: Revert "dbconfig: temporary disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049386 [06:36:58] (03CR) 10Arnaudb: [C:03+2] wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1047906 (https://phabricator.wikimedia.org/T368020) (owner: 10Gerrit maintenance bot) [06:38:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T368355 [06:38:59] T368355: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T368355 [06:39:08] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Set db2161 with weight 0 T368355', diff saved to https://phabricator.wikimedia.org/P65399 and previous config saved to /var/cache/conftool/dbconfig/20240625-063908-root.json [06:39:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T368355 [06:40:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T368020', diff saved to https://phabricator.wikimedia.org/P65400 and previous config saved to /var/cache/conftool/dbconfig/20240625-064000-arnaudb.json [06:40:06] T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020 [06:40:14] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1049310 (https://phabricator.wikimedia.org/T368355) (owner: 10Gerrit maintenance bot) [06:40:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arnaudb@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049386 (owner: 10Arnaudb) [06:41:37] (03Merged) 10jenkins-bot: Revert "dbconfig: temporary disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049386 (owner: 10Arnaudb) [06:42:14] !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]] [06:45:16] !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:45:17] !log arnaudb@deploy1002 Sync cancelled. 
[06:45:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:46:51] ah [06:47:11] this explains why I could not reenable writes on es7 [06:47:12] :D [06:48:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P65401 and previous config saved to /var/cache/conftool/dbconfig/20240625-064841-marostegui.json [06:50:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:50:32] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:52:42] !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]] [06:52:51] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update articlequality image and storage URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) (owner: 10AikoChou) [06:53:03] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update articlequality image and storage URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) (owner: 10AikoChou) [06:53:16] its back! 
[06:53:54] (03Merged) 10jenkins-bot: ml-services: update articlequality image and storage URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) (owner: 10AikoChou) [06:54:14] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:54:32] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [06:55:13] !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:55:23] !log arnaudb@deploy1002 arnaudb: Continuing with sync [06:55:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:00:05] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0700). [07:00:05] Func: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:15] o/ [07:00:29] !log arnaudb@deploy1002 Finished scap: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]] (duration: 07m 47s) [07:01:00] !log Starting s8 codfw failover from db2165 to db2161 - T368355 [07:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:06] T368355: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T368355 [07:01:07] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9920235 (10SLyngshede-WMF) @odimitrijevic / @Ottomata / @WDoranWMF Would either of you approve? [07:01:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2161 to s8 primary T368355', diff saved to https://phabricator.wikimedia.org/P65402 and previous config saved to /var/cache/conftool/dbconfig/20240625-070127-marostegui.json [07:02:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2165 T368355', diff saved to https://phabricator.wikimedia.org/P65403 and previous config saved to /var/cache/conftool/dbconfig/20240625-070252-marostegui.json [07:03:27] (03PS2) 10Kevin Bazira: ml-services: return logo-detection latency metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049082 (https://phabricator.wikimedia.org/T367962) [07:03:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P65404 and previous config saved to /var/cache/conftool/dbconfig/20240625-070348-marostegui.json [07:06:01] (03PS1) 10Marostegui: db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1049388 [07:06:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Long schema change [07:06:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with 
reason: Long schema change [07:06:56] (03CR) 10Marostegui: [C:03+2] db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1049388 (owner: 10Marostegui) [07:07:33] (03PS1) 10David Caro: p:prometheus::cloud: add temporary ebpf scraping [puppet] - 10https://gerrit.wikimedia.org/r/1049389 (https://phabricator.wikimedia.org/T348643) [07:07:55] (03CR) 10CI reject: [V:04-1] p:prometheus::cloud: add temporary ebpf scraping [puppet] - 10https://gerrit.wikimedia.org/r/1049389 (https://phabricator.wikimedia.org/T348643) (owner: 10David Caro) [07:09:07] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3060/console" [puppet] - 10https://gerrit.wikimedia.org/r/1049389 (https://phabricator.wikimedia.org/T348643) (owner: 10David Caro) [07:09:49] (03Abandoned) 10David Caro: p:prometheus::cloud: add temporary ebpf scraping [puppet] - 10https://gerrit.wikimedia.org/r/1049389 (https://phabricator.wikimedia.org/T348643) (owner: 10David Caro) [07:10:28] (03CR) 10Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [07:13:03] (03CR) 10Muehlenhoff: [C:03+2] Revert "Point codfw and codfw1dev to use the eqiad LDAP ro servers as well" [puppet] - 10https://gerrit.wikimedia.org/r/1049378 (https://phabricator.wikimedia.org/T367861) (owner: 10Muehlenhoff) [07:14:37] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9920267 (10dcaro) >>! In T348643#9919050, @CDanis wrote: > Unfortunately `cloudcephosd1020` has too old a Debian / kernel for this without some mor... 
[07:14:46] !log Optimize pagelinks on old s8 codfw master db2165 dbmaint T364069 [07:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:52] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:18:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T364069)', diff saved to https://phabricator.wikimedia.org/P65405 and previous config saved to /var/cache/conftool/dbconfig/20240625-071855-marostegui.json [07:18:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:19:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:26:50] (PS1) Slyngshede: data.yaml: Add daphnesmit to deployment [puppet] - https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) [07:28:02] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [07:29:14] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:00] (PS1) Brouberol: amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424 [07:31:46] (CR) Brouberol: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3061/console" [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol) [07:32:21] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3062/co" 
[puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [07:32:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org [07:33:58] (CR) CI reject: [V:-1] amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol) [07:36:08] (PS2) Brouberol: amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424 [07:36:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org [07:41:39] (CR) Elukey: [C:+1] No longer refer to setting the acmechief hosts [cookbooks] - https://gerrit.wikimedia.org/r/1047444 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [07:42:23] (CR) Slyngshede: [V:+1 C:+2] C:apereo_cas check for tomcat 10 on CAS 7 only variables. [puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [07:42:46] (CR) Elukey: [C:+1] amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol) [07:43:35] (CR) Brouberol: [C:+2] amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol) [07:44:00] brouberol: o/ [07:44:13] hi elukey o/ [07:44:18] sorry my brain is still foggy, isn't the change the same as we have now? [07:44:51] ah no in theory it is fine [07:44:54] nevermind :D [07:44:57] mondays [07:45:12] am I right? [07:45:36] puppet is now compiling and running on dse-k8s-worker1001 [07:45:43] thanks for the quick review! 
[07:45:51] (PS1) Muehlenhoff: Point eqiad and cloud/eqiad to use the codfw LDAP ro servers as well [puppet] - https://gerrit.wikimedia.org/r/1049446 (https://phabricator.wikimedia.org/T367861) [07:45:54] I am more worried that it is not monday and I am still doing pebcak :D [07:46:06] anyway, thanks for fixing! [07:46:24] now that I recall we can probably remove the rocm stuff from the DSE workers [07:46:50] https://phabricator.wikimedia.org/T363191 [07:47:37] I have reopened it [07:47:56] (CR) Muehlenhoff: [C:+2] Default to use acmechief1002 [puppet] - https://gerrit.wikimedia.org/r/1047443 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [07:50:44] (CR) Muehlenhoff: [C:+2] No longer refer to setting the acmechief hosts [cookbooks] - https://gerrit.wikimedia.org/r/1047444 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [07:54:10] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9920337 (dcaro) Just created a silly dashboard with the data that's coming in: https://grafana-rw.wikimedia.org/d/... 
[07:54:15] (PS1) David Caro: p:prometheus::cloud: use cloudcephosd1010 instead of 1020 [puppet] - https://gerrit.wikimedia.org/r/1049449 (https://phabricator.wikimedia.org/T348643) [07:55:35] (PS1) Muehlenhoff: Remove acmechief annotations for mariadb hosts [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799) [07:57:59] (PS1) Muehlenhoff: Remove acmechief annotations for IDM/IDP [puppet] - https://gerrit.wikimedia.org/r/1049453 (https://phabricator.wikimedia.org/T365799) [07:58:01] (PS2) Muehlenhoff: Remove acmechief annotations for mariadb hosts [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799) [07:58:09] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049453 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:03:33] SRE, Data-Engineering, Dumps-Generation, Data Products (Data Products Sprint 15), Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9920355 (Joe) >>! In T368098#9918924, @xcollazo wrote: > Ok after ob... 
[08:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:54] (PS2) Ayounsi: Netbox 4: rename device_role to role in validators [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049292 (https://phabricator.wikimedia.org/T336275) [08:05:54] (PS1) Ayounsi: Netbox 4: fix cable terminations breaking changes [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275) [08:06:22] (CR) David Caro: [C:+2] p:prometheus::cloud: use cloudcephosd1010 instead of 1020 [puppet] - https://gerrit.wikimedia.org/r/1049449 (https://phabricator.wikimedia.org/T348643) (owner: David Caro) [08:06:52] (CR) CI reject: [V:-1] Netbox 4: fix cable terminations breaking changes [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:09:43] (PS1) Slyngshede: IDP-Test: Switch to CAS 7 on idp-test1002 [dns] - https://gerrit.wikimedia.org/r/1049456 (https://phabricator.wikimedia.org/T367487) [08:10:33] (PS1) Elukey: prometheus-amd-rocm-stats.py: fix edge case for temperature reading [puppet] - https://gerrit.wikimedia.org/r/1049457 [08:11:33] (PS2) Ayounsi: Netbox 4: fix cable terminations breaking changes [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275) [08:13:58] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:16:34] (CR) Ayounsi: [C:+2] "Tested locally, merging into dev, post merge reviews welcome before moving to prod." 
[software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049292 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:16:44] (CR) Ayounsi: [C:+2] "Tested locally, merging into dev, post merge reviews welcome before moving to prod." [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:16:52] (PS1) Muehlenhoff: Remove acmechief annotations for gitlab [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) [08:18:23] (Merged) jenkins-bot: Netbox 4: rename device_role to role in validators [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049292 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:18:58] (Merged) jenkins-bot: Netbox 4: fix cable terminations breaking changes [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:22:31] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:25:05] ops-eqiad, SRE, Cloud-VPS, DC-Ops, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9920418 (dcaro) Data is coming in now from both nodes, latencies look similar so far, with sdc on 1034 being different and having less spread (no... [08:26:50] !log jynus@cumin1002 dbctl commit (dc=all): 'Depool es2022', diff saved to https://phabricator.wikimedia.org/P65406 and previous config saved to /var/cache/conftool/dbconfig/20240625-082649-jynus.json [08:28:13] (CR) Btullis: [C:+1] "Late to the party, but +1 and thanks." 
[puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol) [08:30:55] I see no errors, downtiming and depooling another [08:31:07] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: full dump [08:31:20] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: full dump [08:32:17] !log jynus@cumin1002 dbctl commit (dc=all): 'Depool es2025', diff saved to https://phabricator.wikimedia.org/P65407 and previous config saved to /var/cache/conftool/dbconfig/20240625-083216-jynus.json [08:32:33] (CR) Marostegui: [C:+1] Remove acmechief annotations for mariadb hosts [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:35:13] (CR) Santiago Faci: [C:+1] "Looks good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) (owner: KCVelaga) [08:38:04] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3063/co" [puppet] - https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: Ahmon Dancy) [08:38:13] (PS1) Muehlenhoff: Remove acmechief annotations for dns/ncredir/durum/doh [puppet] - https://gerrit.wikimedia.org/r/1049461 (https://phabricator.wikimedia.org/T365799) [08:39:07] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049461 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:39:14] (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for mariadb hosts [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:39:29] (CR) Vgutierrez: [C:+1] "+1 on the ncredir side of things :)" [puppet] - 
https://gerrit.wikimedia.org/r/1049461 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:43:25] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1049456 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [08:44:39] (CR) Jelto: [V:+1 C:+1] "lgtm, let me know when this should be merged. We could merge that in today office hours." [puppet] - https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: Ahmon Dancy) [08:45:23] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: full dump [08:45:37] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: full dump [08:46:36] (PS1) Muehlenhoff: Remove access for twentyafterfour [puppet] - https://gerrit.wikimedia.org/r/1049465 [08:47:06] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:49:36] (PS1) Muehlenhoff: Remove acmechief annotations for apt hosts [puppet] - https://gerrit.wikimedia.org/r/1049466 (https://phabricator.wikimedia.org/T365799) [08:50:55] (CR) Jelto: [V:+1 C:+1] "lgtm. Can you double check the diff for gitlab2002 https://puppet-compiler.wmflabs.org/output/1049459/3064/gitlab2002.wikimedia.org/index." 
[puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:52:30] (CR) Jelto: [V:+1 C:+2] package_builder: don't install python-all on bookworm [puppet] - https://gerrit.wikimedia.org/r/1049180 (https://phabricator.wikimedia.org/T367544) (owner: Jelto) [08:53:23] (PS1) Muehlenhoff: Remove acmechief annotations for various DE roles [puppet] - https://gerrit.wikimedia.org/r/1049469 (https://phabricator.wikimedia.org/T365799) [08:55:16] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:57] (CR) Muehlenhoff: "Indeed, that is expected. Before we had the split between P5 and P7 acmechief hosts, all clients defaulted to acmechief1001. This after th" [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:57:39] (PS1) Jelto: gitlab: add missing custom nginx config also to WMCS [puppet] - https://gerrit.wikimedia.org/r/1049472 (https://phabricator.wikimedia.org/T366786) [08:58:17] (CR) Jelto: [V:+1 C:+1] "thanks for the explanation, lgtm!" 
[puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [08:59:42] (CR) Jelto: [C:+2] gitlab: add missing custom nginx config also to WMCS [puppet] - https://gerrit.wikimedia.org/r/1049472 (https://phabricator.wikimedia.org/T366786) (owner: Jelto) [09:01:03] (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for apt hosts [puppet] - https://gerrit.wikimedia.org/r/1049466 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [09:03:37] (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for gitlab [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [09:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:06:19] (PS1) Elukey: docker::reporter: remove Stretch/Jessie restrictions [puppet] - https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) [09:08:24] (PS1) Muehlenhoff: Remove acmechief annotations for mirrors [puppet] - https://gerrit.wikimedia.org/r/1049475 (https://phabricator.wikimedia.org/T365799) [09:18:01] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920631 (ABran-WMF) [09:18:23] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920630 (Marostegui) >>! In T365995#9883497, @jcrespo wrote: > backup1009 is the main backup node for bacula on eqiad. Most ba... 
[09:19:17] (PS1) Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1049478 (https://phabricator.wikimedia.org/T368371) [09:19:22] (PS1) Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1049479 (https://phabricator.wikimedia.org/T368371) [09:19:42] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920643 (Marostegui) [09:20:25] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9920645 (ABran-WMF) [09:21:19] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9920659 (ABran-WMF) [09:22:05] (PS1) MVernon: ceph: install wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049480 (https://phabricator.wikimedia.org/T279621) [09:22:50] (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for mirrors [puppet] - https://gerrit.wikimedia.org/r/1049475 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [09:23:11] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9920670 (ABran-WMF) [09:24:21] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) (owner: Elukey) [09:24:35] (CR) Muehlenhoff: [C:+2] Remove access for twentyafterfour [puppet] - https://gerrit.wikimedia.org/r/1049465 (owner: Muehlenhoff) [09:24:51] (CR) Arnaudb: [C:+1] ceph: install wmf-certificates package 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1049480 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [09:26:33] (PS1) Muehlenhoff: Remove acmechief annotations for MX hosts [puppet] - https://gerrit.wikimedia.org/r/1049481 (https://phabricator.wikimedia.org/T365799) [09:28:13] (PS1) Ayounsi: Allow Ganeti RAPI access from netbox-dev2003 [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) [09:28:42] (PS2) Ayounsi: Allow Ganeti RAPI access from netbox-dev2003 [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) [09:29:09] (PS1) Cathal Mooney: Adjust labs-in policy after clouddb is replaced with an-redacteddb [homer/public] - https://gerrit.wikimedia.org/r/1049483 (https://phabricator.wikimedia.org/T368316) [09:29:35] (PS1) Muehlenhoff: Remove acmechief annotations for cloudlb/clouddumps/cloudservices-codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1049484 (https://phabricator.wikimedia.org/T365799) [09:30:26] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [09:30:36] RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2197) taken on 2024-06-25 08:42:03 (463 GiB, -3.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:31:06] (PS4) Clément Goubert: mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655) [09:31:09] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049481 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [09:31:43] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049484 (https://phabricator.wikimedia.org/T365799) (owner: 
Muehlenhoff) [09:31:53] (CR) Slyngshede: [C:+2] IDP-Test: Switch to CAS 7 on idp-test1002 [dns] - https://gerrit.wikimedia.org/r/1049456 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [09:32:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1228 T368374', diff saved to https://phabricator.wikimedia.org/P65408 and previous config saved to /var/cache/conftool/dbconfig/20240625-093221-root.json [09:32:23] (PS1) Marostegui: instances.yaml: Remove db1228 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1049485 (https://phabricator.wikimedia.org/T368374) [09:32:28] T368374: Move one host temporarily to m2 - https://phabricator.wikimedia.org/T368374 [09:32:40] (CR) Ayounsi: [V:+1] Allow Ganeti RAPI access from netbox-dev2003 [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [09:33:04] (CR) Marostegui: [C:+2] instances.yaml: Remove db1228 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1049485 (https://phabricator.wikimedia.org/T368374) (owner: Marostegui) [09:34:12] !log Switching idp-test.wikimedia.org to CAS 7 [09:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:19] (CR) Muehlenhoff: "Needs manager approval, otherwise looks good" [puppet] - https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: Slyngshede) [09:34:21] (PS1) Hnowlan: shellbox-video: fix typo [deployment-charts] - https://gerrit.wikimedia.org/r/1049487 [09:34:52] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [09:34:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1228 from dbctl T368374', diff saved to https://phabricator.wikimedia.org/P65409 and previous config saved to /var/cache/conftool/dbconfig/20240625-093454-marostegui.json [09:35:03] (CR) MVernon: 
[V:+2 C:+2] ceph: install wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049480 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [09:35:46] (CR) Ayounsi: [V:+1 C:+2] Allow Ganeti RAPI access from netbox-dev2003 [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [09:35:47] (CR) Clément Goubert: trafficserver: complete switch to mw-on-k8s (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [09:36:14] (PS1) Marostegui: db1228: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1049488 [09:36:59] (CR) Marostegui: [C:+2] db1228: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1049488 (owner: Marostegui) [09:38:48] (PS4) Clément Goubert: envoy: Ensure legacy listeners point to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) [09:38:49] (PS3) Clément Goubert: trafficserver: complete switch to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) [09:42:16] (CR) Majavah: [C:+1] Remove acmechief annotations for cloudlb/clouddumps/cloudservices-codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1049484 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [09:42:49] (PS1) Slyngshede: Update Thymeleaf syntax to remove deprecation warning. 
[software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049492 (https://phabricator.wikimedia.org/T367487) [09:43:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:44:05] (PS1) Majavah: P:puppet: Remove Puppet 7 MOTD [puppet] - https://gerrit.wikimedia.org/r/1049494 [09:44:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db[1217,1228].eqiad.wmnet with reason: Cloning [09:44:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db[1217,1228].eqiad.wmnet with reason: Cloning [09:47:18] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:47:28] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:47:39] ^ known and expected [09:49:40] (PS1) Marostegui: mariadb: Move db1228 to m2 [puppet] - https://gerrit.wikimedia.org/r/1049497 (https://phabricator.wikimedia.org/T368374) [09:49:45] (CR) Aklapper: [V:+2 C:+2] Add fallback languages for Phabricator [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1047593 (owner: Pppery) [09:49:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 21 days, 0:00:00 on 25 hosts with reason: Turning down appserver clusters [09:49:53] (CR) Muehlenhoff: [C:+1] "Good idea!" 
[puppet] - https://gerrit.wikimedia.org/r/1049494 (owner: Majavah) [09:50:04] (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for cloudlb/clouddumps/cloudservices-codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1049484 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [09:50:07] (CR) Majavah: [C:+2] P:puppet: Remove Puppet 7 MOTD [puppet] - https://gerrit.wikimedia.org/r/1049494 (owner: Majavah) [09:50:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 21 days, 0:00:00 on 25 hosts with reason: Turning down appserver clusters [09:50:30] SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920844 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ca43ab0-579a-4f82-97aa-11720f300bd7) set by cgoubert@cumin1002 for 21 days, 0:00... [09:50:51] (PS1) Fabfur: benthos:cache: added catch resource to log errors in parse_log [puppet] - https://gerrit.wikimedia.org/r/1049498 (https://phabricator.wikimedia.org/T365718) [09:53:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 21 days, 0:00:00 on 11 hosts with reason: Turning down appserver clusters [09:53:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 21 days, 0:00:00 on 11 hosts with reason: Turning down appserver clusters [09:54:13] SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920870 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=046a1781-9fad-454c-b26b-ad2c96d2d8b2) set by cgoubert@cumin1002 for 21 days, 0:00... [09:55:24] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9920871 (cmooney) >>! 
In T326322#9650260, @cmooney wrote: >>>! In T326322#9130092, @ayounsi wrote: >> @cmooney I came across https://w... [09:55:40] (PS1) Ayounsi: Netbox puppet import: ignore ipip interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/1049499 [09:56:59] (CR) Ayounsi: "So far not useful to have them in Netbox as they don't hold any info (IP or other). Please let me know if you think we should import them." [software/netbox-extras] - https://gerrit.wikimedia.org/r/1049499 (owner: Ayounsi) [09:58:17] (CR) Marostegui: [C:+2] mariadb: Move db1228 to m2 [puppet] - https://gerrit.wikimedia.org/r/1049497 (https://phabricator.wikimedia.org/T368374) (owner: Marostegui) [09:58:48] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9920891 (dcaro) >>! In T348643#9920418, @dcaro wrote: > Data is coming in now from both nodes, latencies look simi... 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1000) [10:01:39] (PS1) Superpes15: Removing 'spamblacklistlog' rights to usergroups [mediawiki-config] - https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) [10:02:21] (PS2) Superpes15: Removing 'spamblacklistlog' right to usergroups [mediawiki-config] - https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) [10:02:26] (PS3) Superpes15: Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) [10:04:19] (PS1) Muehlenhoff: Remove acmechief annotations for caches [puppet] - https://gerrit.wikimedia.org/r/1049501 (https://phabricator.wikimedia.org/T365799) [10:05:38] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049501 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:05:53] (CR) Muehlenhoff: [C:+2] irc.wikimedia.org: Stop sending broadcast events to the old buster nodes [deployment-charts] - https://gerrit.wikimedia.org/r/1049137 (https://phabricator.wikimedia.org/T331702) (owner: Muehlenhoff) [10:07:04] (PS1) Muehlenhoff: Remove acmechief annotations for icinga [puppet] - https://gerrit.wikimedia.org/r/1049502 (https://phabricator.wikimedia.org/T365799) [10:08:04] (PS2) Fabfur: benthos:cache: added catch resource to log errors in parse_log [puppet] - https://gerrit.wikimedia.org/r/1049498 (https://phabricator.wikimedia.org/T365718) [10:08:18] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049502 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:09:13] (PS7) Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 
(https://phabricator.wikimedia.org/T364797) [10:09:17] (PS7) Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) [10:09:21] (PS7) Brouberol: cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) [10:09:25] (PS12) Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) [10:09:50] (CR) CI reject: [V:-1] cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [10:09:57] (CR) CI reject: [V:-1] cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [10:10:08] (CR) CI reject: [V:-1] cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [10:10:19] (CR) CI reject: [V:-1] Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [10:11:11] !log jmm@deploy1002 Started scap: (no justification provided) [10:12:43] !log jmm@deploy1002 Finished scap: (no justification provided) (duration: 03m 30s) [10:13:22] (PS1) Muehlenhoff: Switch old irc hosts to insetup::buster [puppet] - https://gerrit.wikimedia.org/r/1049503 (https://phabricator.wikimedia.org/T331702) [10:15:08] (PS1) Muehlenhoff: Remove acmechief annotations for netmon [puppet] - https://gerrit.wikimedia.org/r/1049504 (https://phabricator.wikimedia.org/T365799) 
[10:15:52] (PS1) Muehlenhoff: Remove acmechief annotations for orchestrator [puppet] - https://gerrit.wikimedia.org/r/1049505 (https://phabricator.wikimedia.org/T365799) [10:17:32] FIRING: [2x] UdpMxIrcEchoThroughput: irc1001:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [10:18:04] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9920991 (Dreamy_Jazz) @Dzahn I am requesting membership with access to Kerberos. [10:19:31] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1049506 [10:20:03] (PS4) Clément Goubert: trafficserver: complete switch to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) [10:20:08] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [10:21:47] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [10:23:05] (CR) Muehlenhoff: [C:+1] "LGTM" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049492 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [10:23:39] (PS8) Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) [10:23:39] (PS8) Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) [10:23:40]
(PS8) Brouberol: cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) [10:23:40] (PS13) Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) [10:24:14] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:25:35] (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for netmon [puppet] - https://gerrit.wikimedia.org/r/1049504 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:26:53] (PS1) Slyngshede: Move fonts to CSS directory. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049508 (https://phabricator.wikimedia.org/T367487) [10:27:02] (CR) Slyngshede: [C:+2] Update Thymeleaf syntax to remove deprecation warning. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049492 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [10:27:04] (CR) Slyngshede: [V:+2 C:+2] Update Thymeleaf syntax to remove deprecation warning.
[software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049492 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [10:29:15] (CR) Muehlenhoff: [C:+2] Switch old irc hosts to insetup::buster [puppet] - https://gerrit.wikimedia.org/r/1049503 (https://phabricator.wikimedia.org/T331702) (owner: Muehlenhoff) [10:29:21] SRE, cloud-services-team, Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9921045 (fnegri) This was linked in the parent task but I'm not sure if it's really a blocker here: T103011 [10:32:06] (CR) Muehlenhoff: [C:+1] "LGTM" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049508 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [10:32:25] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049505 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:35:13] PROBLEM - ircecho bot process on irc1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [10:35:15] (CR) JMeybohm: [C:-1] "I think this won't work as is.
docker-report does install the debmonitor client via apt-get and that will most likely fail for old distros" [puppet] - https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) (owner: Elukey) [10:36:23] PROBLEM - ircecho bot process on irc2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [10:37:29] (CR) Vgutierrez: [C:+1] Remove acmechief annotations for caches [puppet] - https://gerrit.wikimedia.org/r/1049501 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:37:32] RESOLVED: [2x] UdpMxIrcEchoThroughput: irc1001:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [10:37:41] (PS1) Muehlenhoff: Remove acmechief annotations for ML hosts [puppet] - https://gerrit.wikimedia.org/r/1049511 (https://phabricator.wikimedia.org/T365799) [10:38:27] (CR) Klausman: [C:+1] prometheus-amd-rocm-stats.py: fix edge case for temperature reading [puppet] - https://gerrit.wikimedia.org/r/1049457 (owner: Elukey) [10:38:52] (PS1) Muehlenhoff: Remove acmechief annotations for cloudelastic [puppet] - https://gerrit.wikimedia.org/r/1049512 (https://phabricator.wikimedia.org/T365799) [10:39:14] FIRING: [4x] ProbeDown: Service irc1001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:39:14] FIRING: JobUnavailable: Reduced availability for job udpmxircecho in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:39:57] SRE, cloud-services-team, Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9921082 (Marostegui) >>! In T368136#9919314, @bd808 wrote: > What sort of data y'all are concerned about exposing to new roots on the replica db ho... [10:40:06] !log m2 dbmaint eqiad Stop db1217:3322 to clone db1228 T368374 [10:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:11] T368374: Move one host temporarily to m2 - https://phabricator.wikimedia.org/T368374 [10:40:32] FIRING: [2x] JobUnavailable: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:41:39] (PS1) Muehlenhoff: Switch the mw_rc_irc role to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1049513 (https://phabricator.wikimedia.org/T349619) [10:41:53] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049512 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:43:00] (CR) Muehlenhoff: [C:+1] "The idea is to obtain the list of legacy images and then remove them in the docker registry instead."
[puppet] - https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) (owner: Elukey) [10:43:18] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049513 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [10:44:41] (CR) Majavah: Adjust labs-in policy after clouddb is replaced with an-redacteddb (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1049483 (https://phabricator.wikimedia.org/T368316) (owner: Cathal Mooney) [10:44:42] (PS1) Muehlenhoff: Remove acmechief annotations for archiva [puppet] - https://gerrit.wikimedia.org/r/1049515 (https://phabricator.wikimedia.org/T365799) [10:45:00] (CR) Majavah: [V:+1 C:+2] dynamicproxy: Clarify error page titles [puppet] - https://gerrit.wikimedia.org/r/1049145 (owner: Majavah) [10:45:33] (PS1) Muehlenhoff: Remove acmechief annotations for lists [puppet] - https://gerrit.wikimedia.org/r/1049516 (https://phabricator.wikimedia.org/T365799) [10:46:29] (CR) Muehlenhoff: [C:+2] Switch the mw_rc_irc role to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1049513 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [10:48:41] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9921167 (MoritzMuehlenhoff) [10:49:14] FIRING: [4x] ProbeDown: Service irc1001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:49:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:49:20] (CR)
Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049516 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:50:32] RESOLVED: [4x] ProbeDown: Service irc1001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:50:39] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9921174 (jcrespo) > Is there a procedure for that so we know how to do so? Sadly, there is not. The code changes for implemen... [10:50:51] (CR) Slyngshede: [C:+2] Move fonts to CSS directory. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049508 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [10:51:23] (CR) Slyngshede: [V:+2 C:+2] Move fonts to CSS directory.
[software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049508 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede) [10:51:29] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049515 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:54:41] (PS1) Muehlenhoff: Remove acmechief annotations for [puppet] - https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799) [10:55:25] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:55:32] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:56:13] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9921190 (Marostegui) I will try - but just in case @ABran-WMF please take some notes! [10:56:36] (PS2) Muehlenhoff: Remove acmechief annotations for wikikube [puppet] - https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799) [10:57:27] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [10:59:23] SRE, cloud-services-team, Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9921203 (fnegri) > I assume wmcs-roots is just WMCS staff and those would be the ones having root access? wmcs-roots is defined in [admin/data/da...
[11:03:59] (CR) JMeybohm: [C:-1] "Yeah, that's understood. But removing the rules will still make docker-report fail for all of those images and they would not be reported " [puppet] - https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) (owner: Elukey) [11:36:48] (PS1) Majavah: P:netbox: Don't show status MOTD for active hosts [puppet] - https://gerrit.wikimedia.org/r/1049525 [11:40:10] (CR) Giuseppe Lavagetto: [C:+1] mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655) (owner: Clément Goubert) [11:40:36] (PS8) Clément Goubert: mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655) [11:42:41] (CR) Clément Goubert: [C:+2] mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655) (owner: Clément Goubert) [11:42:45] SRE, Data-Engineering, Dumps-Generation, Data Products (Data Products Sprint 15), Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9921290 (Ladsgroup) The trigger seems to be a duress imposed on s4:...
[11:44:17] (Merged) jenkins-bot: mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655) (owner: Clément Goubert) [11:45:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:45:15] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:46:06] (CR) Vgutierrez: [C:+1] trafficserver: complete switch to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [11:46:42] (CR) Vgutierrez: [C:+1] Point eqiad and cloud/eqiad to use the codfw LDAP ro servers as well [puppet] - https://gerrit.wikimedia.org/r/1049446 (https://phabricator.wikimedia.org/T367861) (owner: Muehlenhoff) [11:49:06] (PS3) Jforrester: [WIP] Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) [11:49:24] (PS4) Jforrester: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) [11:51:04] (CR) Jforrester: "As merging this will make then next scap run deploy it immediately, we shouldn't do this until we're sure (unless there's a nicer way to r" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) (owner: Jforrester) [11:51:10] (PS1) Vgutierrez: hiera: Set fifo-log-demux prometheus port for esams [puppet] - https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383) [11:51:52] (CR) Jforrester: Switch php7.4-cli to bullseye and cascade (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) (owner: Jforrester) [11:52:02] (PS2)
Vgutierrez: hiera: Set fifo-log-demux prometheus port for esams [puppet] - https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383) [11:52:09] (PS1) Jgiannelos: pcs: Enable resource change events on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1049530 [11:52:49] (PS3) Vgutierrez: hiera: Set fifo-log-demux prometheus port for esams [puppet] - https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383) [11:52:50] (CR) CI reject: [V:-1] pcs: Enable resource change events on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1049530 (owner: Jgiannelos) [11:53:06] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383) (owner: Vgutierrez) [11:53:51] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:54:14] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:55:59] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:56:14] !log disable puppet on A:cp-esams before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049529 - T364383 [11:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:20] T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383 [11:57:59] (CR) Vgutierrez: [C:+2] hiera: Set fifo-log-demux prometheus port for esams [puppet] - https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383) (owner: Vgutierrez) [11:58:29] !log rolling upgrade of fifo-log-demux on A:cp-esams - T364383 [11:58:34] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1200) [12:00:36] RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2197) taken on 2024-06-25 10:54:18 (817 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:02:37] (PS1) Clément Goubert: mw-on-k8s: Rate limit udp2log [deployment-charts] - https://gerrit.wikimedia.org/r/1049532 (https://phabricator.wikimedia.org/T365655) [12:03:47] (CR) Ayounsi: [C:+1] "Thanks !! Could be worth linking it to https://phabricator.wikimedia.org/T352957 :)" [puppet] - https://gerrit.wikimedia.org/r/1049525 (owner: Majavah) [12:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:56] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:05:40] (PS2) Jgiannelos: pcs: Enable resource change events on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1049530 [12:05:56] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:06:24] (CR) CI reject: [V:-1] pcs: Enable resource change events on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1049530 (owner: Jgiannelos) [12:08:02] (CR) Ayounsi: Adjust labs-in policy after clouddb is replaced with an-redacteddb (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1049483 (https://phabricator.wikimedia.org/T368316) (owner: Cathal Mooney) [12:09:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00
on db1172.eqiad.wmnet with reason: Maintenance [12:09:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [12:09:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T364069)', diff saved to https://phabricator.wikimedia.org/P65411 and previous config saved to /var/cache/conftool/dbconfig/20240625-120926-marostegui.json [12:42:01] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:42:32] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:44:42] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [12:44:58] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:45:25] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:28] !log cgoubert@deploy1002 Started scap: Deploy udp2log rate-limiting - T365655 - T368098 [12:46:35] T365655: mw-api-ext unavailability 2024-05-22 18:30 UTC - https://phabricator.wikimedia.org/T365655 [12:46:35] T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098 [12:51:40] !log cgoubert@deploy1002 Finished scap: Deploy udp2log rate-limiting - T365655 - T368098 (duration: 05m 49s) [12:51:46] T365655: mw-api-ext unavailability 2024-05-22 18:30 UTC - https://phabricator.wikimedia.org/T365655 [12:51:47] T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098 [12:51:52] jouncebot: nowandnext [12:51:53] For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1200) [12:51:53] In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1300) [12:52:03] let's start CI [12:54:02] * MichaelG_WMF is here to observe [12:54:26] probably the change-tag change can only be tested in a sensible way on testwiki? [12:54:40] yep [12:54:47] or in prod [12:54:55] eg. on cswiki, where i'm an admin as a volunteer [12:56:21] true, but that would mean we would have to change live prod config. not sure if that is possible in a way that is not mild vandalism [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1300) [13:00:05] Func and urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:13] o/ [13:00:41] i can deploy today [13:00:44] I take it urbanecm is deploying ^^ [13:00:47] ah, jinx [13:02:48] Func: i'm afraid that change is not safe to backport, as extension.json change takes effect immediately, but PHP waits for the rolling restart. [13:02:57] how critical is backporting that change? [13:03:48] not critical, since it has been broken for 3 weeks... [13:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:04:32] urbanecm: IIUC that shouldn’t be relevant anymore now that 100% of traffic goes to mw-on-k8s [13:04:39] was about to say [13:04:40] Lucas_WMDE: oh, we're on 100%? 
[13:04:44] they'll be deployed at the same time [13:04:48] we are \o/ [13:04:50] (except on videoscalers) [13:04:51] i missed that announcement, i thought it's still 80:20 or something [13:04:52] for about a week now IIRC [13:05:00] then what i said doesn't matter :) [13:05:12] https://phabricator.wikimedia.org/T362323#9903574 [13:05:12] urbanecm: we didn't do a big announcement yet, still got a little cleanup to do [13:05:18] heh, almost exactly a week indeed [13:05:23] ah, that's why i missed it! [13:05:35] Func: i'll backport your [13:05:38] *your patch [13:05:42] (it makes me happy that nobody noticed) [13:05:43] thanks [13:07:26] !log temporary disabled puppet on cp4037 to test benthos configuration (T367756) [13:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:31] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [13:18:54] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1049534|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049535|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049539|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] [13:19:02] T366989: Edits made via Special:CommunityConfiguration should have a CommunityConfiguration tag attached - https://phabricator.wikimedia.org/T366989 [13:19:02] T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275 [13:22:50] scap, scap faster, please! 
[13:26:48] !log rolling restart of pybal on lvs1020 and lvs1018 - T367861 [13:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:53] T367861: Migrate ldap-ro and ldap-ro-ssl to IPIP encapsulation - https://phabricator.wikimedia.org/T367861 [13:27:02] scap should be somewhat faster when it doesn’t do the php-fpm-restart on the bare-metal servers anymore [13:27:11] (but I don’t know if that’s happened yet) [13:27:34] at least in my subjective experience the k8s restarts have stayed faster than the bare-metal restarts even as the k8s cluster grew and bare-metal shrunk [13:29:58] !log IPIP encapsulation enabled on ldap-ro.eqiad.wikimedia.org - T367861 [13:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:26] Lucas_WMDE: it's still building the images [13:30:32] it's not even at the mwdebug stage [13:30:36] RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2197) taken on 2024-06-25 12:45:43 (506 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:30:47] and we're half through the window already [13:31:18] I guess it’s taking longer due to the backports including i18n changes 😔 [13:31:40] possible [13:31:43] hard to say [13:32:46] docker pull! progress. 
[13:36:32] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [13:36:46] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [13:37:55] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: cluster=apus [13:39:28] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [13:42:02] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt forkrb1002 - jclark@cumin1002" [13:43:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt forkrb1002 - jclark@cumin1002" [13:43:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:26] !log disable puppet on A:lvs and A:codfw for CR 1049560 [13:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:07] jouncebot: nowandnext [13:48:07] For the next 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1300) [13:48:07] In 1 hour(s) and 11 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1500) [13:48:41] urbanecm: hm I wonder if it is i18n taking so long [13:49:09] I’m pretty sure it is tbh, I thought that’s been a known thing for a long time now [13:49:18] that as soon as i18n is touched it has to rebuild the whole cache [13:49:21] or something like that [13:51:03] I think there was a related issue a year or two ago, maybe I can find it [13:51:08] !log restart pybal on lvs1020 [13:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:25] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs [13:59:15] 
!log eevans@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [13:59:25] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [14:00:04] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1049534|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049535|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049539|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:00:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs [14:00:15] T366989: Edits made via Special:CommunityConfiguration should have a CommunityConfiguration tag attached - https://phabricator.wikimedia.org/T366989 [14:00:15] T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275 [14:00:22] finally [14:01:00] sheesh [14:01:19] let's test [14:01:24] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [14:01:45] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [14:02:34] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [14:02:38] Func: your patch also made it to mwdebug. can you test it there please? [14:02:50] ok [14:02:55] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [14:03:20] (haven’t managed to find the task I remembered so far, I’m afraid) [14:04:35] urbanecm: Even with mwdebug I'm not seeing the tag on https://cs.wikipedia.org/wiki/Speci%C3%A1ln%C3%AD:Zna%C4%8Dky - shouldn't it be there?
[14:04:42] urbanecm: looks good [14:05:00] !log restart pybal on lvs2014 [14:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:23] urbanecm: oh wait, you only backported the change to wmf.11, so I need to find a group 0 wiki [14:06:23] MichaelG_WMF: testwiki should work [14:06:29] i have the wmf.10 lined up [14:06:42] let me run the migration on ptwiki and mwdebug [14:06:51] urbanecm: But I'm not seeing it on https://test.wikipedia.org/wiki/Special:Tags either... [14:07:43] hmm... [14:07:49] but it does work at https://test.wikipedia.org/w/index.php?title=MediaWiki:GrowthExperimentsMentorship.json&diff=prev&oldid=601004 [14:07:52] i just made an edit [14:08:05] but still not on special:tags [14:08:14] i'm willing to bet on a cache, given it works on edit [14:08:17] MichaelG_WMF: thoughts? [14:09:06] oh, not [14:09:13] that's what i only did for wmf.10 [14:09:18] !log urbanecm@deploy1002 urbanecm: Continuing with sync [14:09:33] proceeding then [14:09:48] urbanecm: cache sounds most plausible, I agree. [14:09:56] (CR) Urbanecm: [C:+2] WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1049538 (https://phabricator.wikimedia.org/T368275) (owner: Urbanecm) [14:10:48] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs [14:11:13] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs [14:11:21] urbanecm: I think we can move forward, adding the tag on edit is the important part.
And that would not work if the hooks would not work [14:11:59] yup [14:12:30] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs[1011-1021].eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [14:12:35] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [14:15:34] !log sudo cumin "A:dnsbox" 'disable-puppet "rolling out CR 1049165"' [14:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:23] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1049534|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049535|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049539|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] (duration: 58m 28s) [14:17:29] T366989: Edits made via Special:CommunityConfiguration should have a CommunityConfiguration tag attached - https://phabricator.wikimedia.org/T366989 [14:17:30] T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275 [14:17:58] (03CR) 10Ssingh: [V:03+1 C:03+2] P:ntp: generate $time_acl from network::constants [puppet] - 10https://gerrit.wikimedia.org/r/1049165 (owner: 10Ssingh) [14:22:25] (03PS1) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1049570 (https://phabricator.wikimedia.org/T364383) [14:23:00] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049570 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [14:23:44] urbanecm: are you done with the window? 
[14:23:46] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:24:29] !log re-indexing all wikidata entity schemas (T368010) [14:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:34] T368010: Search not working for entity schemas - https://phabricator.wikimedia.org/T368010 [14:24:48] (03PS1) 10Effie Mouzeli: app.job: update module (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049571 (https://phabricator.wikimedia.org/T356885) [14:25:28] (03PS3) 10Clément Goubert: mw-api-ext: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043706 (https://phabricator.wikimedia.org/T365265) [14:25:28] (03PS3) 10Clément Goubert: mw-api-int: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265) [14:25:28] (03PS3) 10Clément Goubert: mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) [14:26:38] (03CR) 10Ladsgroup: [C:03+1] mariadb: disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049555 (https://phabricator.wikimedia.org/T368401) (owner: 10Arnaudb) [14:27:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049538 (https://phabricator.wikimedia.org/T368275) (owner: 10Urbanecm) [14:27:15] cdanis: not yet [14:27:55] ok np [14:29:12] one last patch [14:30:52] !log disable puppet on A:cp-eqiad before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049570 - T364383 [14:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:01] T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383 [14:31:53] (03CR) 10Vgutierrez: [C:03+2] hiera: Set 
fifo-log-demux prometheus port for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1049570 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [14:32:59] (03Merged) 10jenkins-bot: WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049538 (https://phabricator.wikimedia.org/T368275) (owner: 10Urbanecm) [14:33:05] urbanecm: wait, it seems the version of Popups on Special:Version is still not the new one? (I don't have a user created before 2017, and after the patch there should be no logical difference by user creation date, so I only tested with my own account.) [14:33:06] finally [14:33:34] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1049538|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] [14:33:37] !log rolling upgrade of fifo-log-demux on A:cp-eqiad - T364383 [14:33:39] T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275 [14:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:35] (03CR) 10Scott French: "Thanks, Janis!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1049260 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:34:38] (03CR) 10Scott French: [C:03+2] mw-on-k8s: extend envoy_cluster_name to new format [alerts] - 10https://gerrit.wikimedia.org/r/1049260 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:35:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbproxy2005 to codfw - jhancock@cumin2002" [14:35:18] !log sudo cumin -b1 -s900 "A:dnsbox" "run-puppet-agent --enable 'rolling out CR 1049165' && systemctl restart ntp.service" [14:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:23] Func: i wouldn't bet on the version ID tbh. i'm not sure how reliable it is wrt backports. [14:36:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbproxy2005 to codfw - jhancock@cumin2002" [14:36:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:36:13] (03Merged) 10jenkins-bot: mw-on-k8s: extend envoy_cluster_name to new format [alerts] - 10https://gerrit.wikimedia.org/r/1049260 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:37:02] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15): Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9921990 (10Ladsgroup) >>! In T368098#9921528, @jcrespo wrote: > Question, what went wrong wi... 
[14:37:50] (03CR) 10Krinkle: [C:03+1] wmerrors: add config and code to copy stats to dogstatsd [puppet] - 10https://gerrit.wikimedia.org/r/1017078 (https://phabricator.wikimedia.org/T356814) (owner: 10Cwhite) [14:38:09] (03PS1) 10Muehlenhoff: Add krb1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1049572 (https://phabricator.wikimedia.org/T365165) [14:38:43] (03CR) 10Muehlenhoff: [C:03+2] Add krb1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1049572 (https://phabricator.wikimedia.org/T365165) (owner: 10Muehlenhoff) [14:39:19] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9921994 (10Jhancock.wm) [14:39:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9921995 (10MoritzMuehlenhoff) >>! In T365165#9921708, @Jclark-ctr wrote: > @MoritzMuehlenhoff would you be able to update site.pp file for this server... [14:39:40] urbanecm: Ack. Actually, it's a surprise to me that backports can be done without bots leaving any messages on the task or the change. 
[14:39:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9921996 (10MoritzMuehlenhoff) [14:40:04] (03CR) 10Muehlenhoff: [C:03+2] Revert "Point eqiad and cloud/eqiad to use the codfw LDAP ro servers as well" [puppet] - 10https://gerrit.wikimedia.org/r/1049568 (https://phabricator.wikimedia.org/T367861) (owner: 10Muehlenhoff) [14:40:10] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1049538|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:40:16] !log urbanecm@deploy1002 urbanecm: Continuing with sync [14:40:17] T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275 [14:40:34] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9922000 (10Scott_French) Thanks, @SGupta-WMF! Ahmon tends to be quite responsi... [14:43:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9922006 (10Ottomata) Approved! [14:45:19] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1049538|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] (duration: 11m 45s) [14:45:26] T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275 [14:46:53] cdanis: i'm done [14:46:57] thanks! 
[14:47:06] (03CR) 10CDanis: [C:03+2] Sampled tracing (0.1%) for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049202 (https://phabricator.wikimedia.org/T367915) (owner: 10CDanis) [14:47:47] (03PS1) 10Effie Mouzeli: modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [14:48:10] (03Merged) 10jenkins-bot: Sampled tracing (0.1%) for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049202 (https://phabricator.wikimedia.org/T367915) (owner: 10CDanis) [14:48:45] (03CR) 10CI reject: [V:04-1] modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [14:49:06] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:50:29] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:54:13] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9922021 (10bd808) >>! In T368136#9921082, @Marostegui wrote: > Also, the issue with root is that that user can make changes to replication, grants, s... 
[14:55:19] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-e5-eqiad.mgmt with reason: prep JunOS upgrade lsw1-e5-eqiad [14:55:31] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:55:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-e5-eqiad.mgmt with reason: prep JunOS upgrade lsw1-e5-eqiad [14:55:40] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922024 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7a21c2a6-e267-4150-8111-b348788c4a9b)... [14:55:55] (03PS1) 10Vgutierrez: prometheus::ops: Pull fifo_log_demux metrics [puppet] - 10https://gerrit.wikimedia.org/r/1049574 (https://phabricator.wikimedia.org/T364383) [14:55:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T365986 - depool es1035', diff saved to https://phabricator.wikimedia.org/P65413 and previous config saved to /var/cache/conftool/dbconfig/20240625-145558-arnaudb.json [14:56:04] T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986 [14:56:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:45:00 on es1035.eqiad.wmnet with reason: T365986 [14:56:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on es1035.eqiad.wmnet with reason: T365986 [14:56:56] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:57:24] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-e5-eqiad,lsw1-e5-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-e5-eqiad [14:57:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-e5-eqiad,lsw1-e5-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with 
reason: JunOS upgrade lsw1-e5-eqiad [14:58:00] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 7 hosts with reason: JunOS upgrade lsw1-e5-eqiad [14:58:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 7 hosts with reason: JunOS upgrade lsw1-e5-eqiad [14:58:19] (03PS2) 10Effie Mouzeli: modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [14:58:33] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922051 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=01b84d43-d6d0-4f45-bc2e-375ff79e21f8)... [14:58:53] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9922052 (10fnegri) > That is true, but also not clearly in the scope of this ticket which seems to be specifically about addressing claims of data pr... [14:59:01] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922053 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=65c438b1-9725-4de3-9a45-8318edea15f1)... [14:59:31] (03CR) 10CI reject: [V:04-1] modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [15:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1500). 
[15:00:09] !log rebooting lsw1-e5-eqiad to upgrade JunOS on switch T365986 [15:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:14] (03PS5) 10Jforrester: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) [15:01:44] 06SRE, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922064 (10Jdforrester-WMF) [15:01:46] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3065/co" [puppet] - 10https://gerrit.wikimedia.org/r/1049574 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [15:02:18] 06SRE, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922079 (10Jdforrester-WMF) [15:02:49] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:03:03] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922086 (10Jdforrester-WMF) [15:03:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:03:23] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:03:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:04:05] (03PS1) 10Elukey: config.yaml: remove wikimedia-stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049576 (https://phabricator.wikimedia.org/T367427) [15:04:07] (03PS1) 
10Elukey: coredns: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049577 (https://phabricator.wikimedia.org/T368366) [15:04:08] (03PS1) 10Elukey: envoy: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049578 (https://phabricator.wikimedia.org/T368366) [15:04:28] !log brennen@deploy1002 Started deploy [phabricator/deployment@f58dd50]: deploy phab2002 for T368392 [15:04:33] T368392: Deploy Phabricator/Phorge 2024-06-25 - https://phabricator.wikimedia.org/T368392 [15:05:01] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f58dd50]: deploy phab2002 for T368392 (duration: 00m 33s) [15:05:07] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922082 (10Jdforrester-WMF) [15:05:22] !log brennen@deploy1002 Started deploy [phabricator/deployment@f58dd50]: deploy phab1004 for T368392 [15:06:12] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f58dd50]: deploy phab1004 for T368392 (duration: 00m 50s) [15:08:25] (03CR) 10Dzahn: [C:03+2] gitlab: remove last reference to ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1049253 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [15:12:59] (03CR) 10Cwhite: [C:03+2] wmerrors: add config and code to copy stats to dogstatsd [puppet] - 10https://gerrit.wikimedia.org/r/1017078 (https://phabricator.wikimedia.org/T356814) (owner: 10Cwhite) [15:17:25] FIRING: SystemdUnitFailed: confd_prometheus_metrics.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:31] (03CR) 10Clément Goubert: [C:03+2] mw-api-ext: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043706 
(https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [15:18:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 5%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65414 and previous config saved to /var/cache/conftool/dbconfig/20240625-151802-arnaudb.json [15:18:09] T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986 [15:18:28] (03Merged) 10jenkins-bot: mw-api-ext: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043706 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [15:18:38] (03PS1) 10Elukey: helm-state-metrics: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049586 (https://phabricator.wikimedia.org/T368366) [15:18:39] (03PS1) 10Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) [15:18:42] (03PS1) 10Elukey: prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366) [15:18:48] (03CR) 10Vgutierrez: acme-chief: Add new certificates and domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047147 (owner: 10BCornwall) [15:19:24] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:20:00] !log Deploying statsd to mw-api-ext - T365265 [15:20:03] cc herron ^ [15:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:07] T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265 [15:20:14] claime: kk [15:20:54] herron: if you're ok, I can do all remaining deployments today, or we can stagger them on other days [15:21:17] !log cgoubert@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-api-ext: apply [15:21:24] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:22:25] FIRING: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:22:42] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:23:47] claime: ok yeah I'd be inclined to stagger them, by days or even a few hours? in case we do run into an issue [15:24:22] herron: sure. we have two remaining major deployments, mw-api-int and mw-web [15:27:54] (03PS4) 10Jdlrobson: Enable dark mode on data table pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373) [15:28:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373) (owner: 10Jdlrobson) [15:28:20] (03CR) 10Jdlrobson: "Deploy scheduled for Wednesday 1pm PST" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373) (owner: 10Jdlrobson) [15:29:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson) [15:29:20] (03PS4) 10Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) [15:31:25] !log Ran `mwscript 
extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --wiki=testwiki` for T366781 [15:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:30] T366781: Run maintenance script to delete entries only for use when reading old on WMF wikis - https://phabricator.wikimedia.org/T366781 [15:32:25] RESOLVED: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65415 and previous config saved to /var/cache/conftool/dbconfig/20240625-153307-arnaudb.json [15:33:13] T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986 [15:33:17] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs[1011-1021].eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [15:33:21] (03PS5) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366378) [15:33:23] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [15:33:36] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [15:33:38] (03Abandoned) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson) [15:39:44] (03PS1) 10Elukey: service-checker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049590 
(https://phabricator.wikimedia.org/T368366) [15:39:47] (03PS1) 10Elukey: nutcracker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366) [15:45:03] (03CR) 10Majavah: [C:03+2] Remove acmechief annotations for remaining WMCS roles [puppet] - 10https://gerrit.wikimedia.org/r/1049559 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [15:48:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 25%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65416 and previous config saved to /var/cache/conftool/dbconfig/20240625-154813-arnaudb.json [15:50:12] (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [15:50:25] jouncebot: nowandnext [15:50:25] For the next 0 hour(s) and 9 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1500) [15:50:25] In 0 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1600) [15:50:41] !ops disabling puppet/stopping pybal on lvs2011 for memory failure maintenance - T368165 [15:51:13] (03CR) 10JMeybohm: [C:03+1] mw-api-int: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [15:51:36] brett: !log? [15:51:41] ..... [15:51:45] !log disabling puppet/stopping pybal on lvs2011 for memory failure maintenance - T368165 [15:51:48] (03CR) 10JMeybohm: [C:03+1] mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [15:52:02] how embarrassing [15:52:08] E_COFFEE? :) [15:52:13] indeed... 
[15:52:16] uh [15:52:21] is logmsgbot broken anyway haha [15:52:26] lol [15:52:32] :S [15:52:41] you mean stashbot? [15:52:41] Did.... !ops break it? [15:52:55] (03PS1) 10Muehlenhoff: Remove acmechief annotations for remaining collab roles [puppet] - 10https://gerrit.wikimedia.org/r/1049592 (https://phabricator.wikimedia.org/T365799) [15:52:58] PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:53:12] PROBLEM - pybal on lvs2011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:53:19] ^expected [15:53:21] brett: no [15:53:33] It and ircservserv-wm quit a bit ago [15:53:42] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:53:43] It didn't auto restart I guess [15:53:59] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs2011.codfw.wmnet with reason: T368165 [15:54:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2011.codfw.wmnet with reason: T368165 [15:54:38] taavi: you can restart stashbot right? [16:00:04] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:02:38] jhathaway, rzl, brett: we need a quick phab deploy for a followup to un-break wikibugs. ok if we use this window? 
[16:02:50] please do [16:02:57] thx [16:03:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 50%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65417 and previous config saved to /var/cache/conftool/dbconfig/20240625-160318-arnaudb.json [16:03:25] T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986 [16:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:03] !log silencing phabricator hosts prior to deploy [16:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:28] !log brennen@deploy1002 Started deploy [phabricator/deployment@72ad841]: deploy phab2002 for T368392 - followup T364728 [16:08:34] T368392: Deploy Phabricator/Phorge 2024-06-25 - https://phabricator.wikimedia.org/T368392 [16:08:35] T364728: Revert or upstream rPHABf2fd14dc1edeb41aa2874336548cfaa7fa0e87a0 (maniphest.gettasktransactions API) - https://phabricator.wikimedia.org/T364728 [16:09:01] !log brennen@deploy1002 Finished deploy [phabricator/deployment@72ad841]: deploy phab2002 for T368392 - followup T364728 (duration: 00m 33s) [16:10:16] !log brennen@deploy1002 Started deploy [phabricator/deployment@72ad841]: deploy phab1004 for T368392 - followup T364728 [16:10:55] !log brennen@deploy1002 Finished deploy [phabricator/deployment@72ad841]: deploy phab1004 for T368392 - followup T364728 (duration: 00m 39s) [16:11:22] (03PS1) 10Pppery: Export source strings again so en.json is indented with tabs [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989) [16:13:18] (03PS1) 10Eevans: ml-cache: Upgrade to Cassandra 4.1.5 [puppet] - 
10https://gerrit.wikimedia.org/r/1049600 (https://phabricator.wikimedia.org/T354970) [16:13:49] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049600 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [16:15:02] !log Extending vg-srv on mw1437 [16:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:22] (03CR) 10Eevans: [C:03+2] ml-cache: Upgrade to Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049600 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [16:17:50] PROBLEM - Disk space on mw1437 is CRITICAL: DISK CRITICAL - free space: / 6395 MB (1% inode=99%): /tmp 6395 MB (1% inode=99%): /var/tmp 6395 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1437&var-datasource=eqiad+prometheus/ops [16:18:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65418 and previous config saved to /var/cache/conftool/dbconfig/20240625-161824-arnaudb.json [16:18:32] T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986 [16:19:29] (03PS1) 10Pppery: Export source strings again so en.json is indented with tabs [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989) [16:19:43] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [16:19:48] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [16:19:55] !log cleaning up shellbox leftover files on mw1437.eqiad.wmnet [16:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:22] (03CR) 10Pppery: Export source strings again so en.json is 
indented with tabs (031 comment) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989) (owner: 10Pppery) [16:23:51] !log depooling mw1437 [16:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:55] !log running requeueTranscodes for missing audio files on commons (mwmaint1002) cf T368364 [16:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:01] T368364: Transcodes of audio-only samples are not running for new uploads - https://phabricator.wikimedia.org/T368364 [16:26:11] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922688 (10RobH) [16:26:50] (03PS2) 10Cathal Mooney: Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) [16:27:01] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1437.eqiad.wmnet with reason: Resizing disk [16:27:10] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922684 (10RobH) a:05RobH→03None [16:27:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1437.eqiad.wmnet with reason: Resizing disk [16:27:31] (03CR) 10Cathal Mooney: Validate IRB interface names correspond to vlan and refactor (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [16:27:52] (03CR) 10CI reject: [V:04-1] Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal 
Mooney) [16:29:06] (03CR) 10Ahmon Dancy: "Sorry I missed office hours today. Feel free to deploy whenever you see fit." [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [16:29:44] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:30:00] RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:30:12] RECOVERY - pybal on lvs2011 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:30:36] RECOVERY - Disk space on mw1437 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1437&var-datasource=eqiad+prometheus/ops [16:30:41] (03CR) 10Cathal Mooney: [C:03+1] "Agreed on my side, can't think of any reason they would be useful." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1049499 (owner: 10Ayounsi) [16:31:35] (03PS1) 10CDanis: haproxy: drop traceparent/tracestate response headers [puppet] - 10https://gerrit.wikimedia.org/r/1049603 (https://phabricator.wikimedia.org/T368428) [16:31:37] (03PS1) 10CDanis: ats: drop traceparent/tracestate response headers [puppet] - 10https://gerrit.wikimedia.org/r/1049604 (https://phabricator.wikimedia.org/T368428) [16:31:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1437.eqiad.wmnet [16:31:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1437.eqiad.wmnet [16:32:08] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922720 (10Jhancock.wm) swapped DIMM_B1 for DIMM_B2 to test. error has cleared. 
[16:33:13] (03CR) 10Cathal Mooney: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [16:33:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65419 and previous config saved to /var/cache/conftool/dbconfig/20240625-163330-arnaudb.json [16:33:36] T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986 [16:34:13] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1049603 (https://phabricator.wikimedia.org/T368428) (owner: 10CDanis) [16:35:09] (03CR) 10CDanis: [C:03+2] haproxy: drop traceparent/tracestate response headers [puppet] - 10https://gerrit.wikimedia.org/r/1049603 (https://phabricator.wikimedia.org/T368428) (owner: 10CDanis) [16:36:38] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922750 (10cmooney) 05Open→03Resolved [16:37:24] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [16:37:29] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [16:39:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9922776 (10Jhancock.wm) Thursday is great, thanks. 
[16:39:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T364069)', diff saved to https://phabricator.wikimedia.org/P65420 and previous config saved to /var/cache/conftool/dbconfig/20240625-163919-marostegui.json [16:39:25] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [16:42:00] PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:24] ^expected [16:43:33] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS bookworm [16:46:58] (03CR) 10Isabelle Hurbain-Palatin: pcs: Enable resource change events on staging (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530 (owner: 10Jgiannelos) [16:49:07] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [16:49:13] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [16:49:50] (03CR) 10Aklapper: [C:03+2] Export source strings again so en.json is indented with tabs [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989) (owner: 10Pppery) [16:50:03] (03CR) 10Aklapper: [V:03+2 C:03+2] "Applies cleanly on latest wmf/stable branch locally" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989) (owner: 10Pppery) [16:54:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P65421 and previous config saved to /var/cache/conftool/dbconfig/20240625-165426-marostegui.json [16:57:18] (03CR) 10CDanis: "not in a rush about this one, please advise about rollout though (ATS restarts required?)" [puppet] - 
10https://gerrit.wikimedia.org/r/1049604 (https://phabricator.wikimedia.org/T368428) (owner: 10CDanis) [16:58:56] (03CR) 10Jgiannelos: pcs: Enable resource change events on staging (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530 (owner: 10Jgiannelos) [16:59:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922861 (10BCornwall) 05Open→03Resolved Linux is happy, too. Thank you, @Jhancock.wm! [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1700) [17:01:09] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922864 (10BCornwall) a:03BCornwall [17:01:22] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [17:01:28] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [17:02:06] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [17:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:04:28] (03CR) 10Ssingh: "Reload should be fine here and is done by Puppet automatically." 
[puppet] - 10https://gerrit.wikimedia.org/r/1049604 (https://phabricator.wikimedia.org/T368428) (owner: 10CDanis) [17:04:32] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [17:04:53] sukhe: ah thanks, I had remembered it being manual restart required for some reason [17:06:01] (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1049592/3067/" [puppet] - 10https://gerrit.wikimedia.org/r/1049592 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [17:06:50] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [17:06:55] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [17:07:11] cdanis: happy to take care of rolling this out if desired (as we do for other such requests) [17:07:36] sukhe: I mean if it is just Puppet auto-reloads I have no problem +2'd and p-merging :) [17:07:56] cdanis: I am pretty sure but if it is not, I take the fall. 
go ahead :) [17:08:04] you can try on one host I guess [17:08:23] cdanis: I am on-call now, so an official excuse [17:08:46] (03CR) 10CDanis: [C:03+2] ats: drop traceparent/tracestate response headers [puppet] - 10https://gerrit.wikimedia.org/r/1049604 (https://phabricator.wikimedia.org/T368428) (owner: 10CDanis) [17:09:04] (03CR) 10Dzahn: [C:03+2] "no changes" [puppet] - 10https://gerrit.wikimedia.org/r/1049592 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [17:09:19] (03CR) 10Jgiannelos: pcs: Enable resource change events on staging (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530 (owner: 10Jgiannelos) [17:09:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P65422 and previous config saved to /var/cache/conftool/dbconfig/20240625-170933-marostegui.json [17:11:58] (03PS5) 10Scott French: mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) [17:11:59] (03PS1) 10Scott French: mediawiki: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978) [17:12:29] sukhe: lol we should have waited for vg, I think I edited an obsolete file [17:12:33] ah well [17:12:53] didn't you get his +1? 
[17:12:58] I thought I saw that [17:13:00] no I thought you gave a +1 [17:13:06] oh I didn't lol [17:13:11] yeah ok lol [17:13:16] well there was no change on a cp-text host [17:13:21] (03PS1) 10Bvibber: Temporarily disable '4K' 2160p and mid 1440p transcodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) [17:13:22] I'll look into it [17:13:22] sorry for the confusion, i was remarking that the reload is not required [17:13:25] yeah npo [17:13:28] my mistake [17:13:28] and I saw another +1 so I confused it with that [17:13:33] didn't sleep great last night lol [17:13:35] there was a +1 right!? [17:13:39] or am I dreaming now [17:13:55] there wasn't [17:14:03] maybe you saw the V+2 [17:14:24] ah +1 on https://gerrit.wikimedia.org/r/1049603 [17:14:43] yeah, that one was the more important one, and it worked ;) [17:14:59] that counts [17:15:30] (03CR) 10Giuseppe Lavagetto: [C:03+1] Temporarily disable '4K' 2160p and mid 1440p transcodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) (owner: 10Bvibber) [17:18:09] (03PS1) 10CDanis: ats: drop traceparent/tracestate response headers [puppet] - 10https://gerrit.wikimedia.org/r/1049609 (https://phabricator.wikimedia.org/T368428) [17:18:43] ok now I'm editing the right file lol [17:24:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T364069)', diff saved to https://phabricator.wikimedia.org/P65423 and previous config saved to /var/cache/conftool/dbconfig/20240625-172440-marostegui.json [17:24:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [17:24:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:24:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance 
[17:25:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T364069)', diff saved to https://phabricator.wikimedia.org/P65424 and previous config saved to /var/cache/conftool/dbconfig/20240625-172502-marostegui.json [17:25:41] (03PS1) 10Eevans: sessionstore2004: Upgrade (canary) to Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049612 (https://phabricator.wikimedia.org/T354970) [17:27:30] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#9922998 (10Dzahn) [17:27:47] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049612 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [17:27:50] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1049502 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [17:28:11] !log Pooling lvs2011 - T368165 [17:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:17] T368165: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165 [17:28:34] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2004-dev.codfw.wmnet with OS bookworm [17:28:55] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1049274 (https://phabricator.wikimedia.org/T368327) (owner: 10Cwhite) [17:29:05] RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:29:12] nice [17:36:01] (03PS1) 10Andrew Bogott: hieradata: Move cloudvirt2004-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049614 (https://phabricator.wikimedia.org/T364457) [17:36:03] (03PS1) 10Andrew Bogott: hieradata: Move cloudvirt2005-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049615 (https://phabricator.wikimedia.org/T364457) [17:36:05] (03PS1) 10Andrew Bogott: hieradata: Move cloudvirt2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049616 (https://phabricator.wikimedia.org/T364457) [17:36:52] !log Depooling lvs2011 due to elevated socket/tcp errors - T368165 [17:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:57] T368165: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165 [17:37:03] (03CR) 10Andrew Bogott: [C:03+2] hieradata: Move cloudvirt2004-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049614 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [17:37:04] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS bookworm [17:38:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) (owner: 10Bvibber) [17:39:05] PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:39:15] ^expected [17:41:11] (03CR) 10Eevans: [C:03+2] sessionstore2004: Upgrade (canary) to 
Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049612 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [17:43:09] !log Re-re-pooling lvs2011 - T368165 [17:43:10] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15): Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9923096 (10xcollazo) >>! In T368098#9921287, @Ladsgroup wrote: >... >That's around 100M hit... [17:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:15] T368165: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165 [17:44:05] RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:44:08] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2004.codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [17:44:14] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [17:44:43] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15): Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9923108 (10xcollazo) >>! In T368098#9921990, @Ladsgroup wrote: >... > - Replicas in dump gr... 
[17:51:18] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2004.codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [17:51:23] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [17:52:37] (03PS3) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) [17:55:16] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [17:57:54] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [18:00:04] jeena and jnuche: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1800). [18:00:20] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9923213 (10Dzahn) 05In progress→03Stalled Hi Andy, this ticket is currently stalled and waiting for your input to continue before we can merge h... 
[18:03:09] o/ [18:04:09] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049618 (https://phabricator.wikimedia.org/T366956) [18:04:10] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049618 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [18:04:56] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049618 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [18:06:15] !log bringing up link from ssw1-a1-codfw to ssw1-d1-codfw T364095 [18:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:20] T364095: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 [18:07:19] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9923230 (10AndyRussG) >>! In T367681#9923213, @Dzahn wrote: > Hi Andy, this ticket is currently stalled and waiting for your input to continue befor... [18:07:26] (03CR) 10Scott French: "Alright, I think we're ready for attempt #2. I'll aim to get this out during tomorrow's UTC-late infrastructure window. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:08:32] (03CR) 10AndyRussG: "thanks so much for working on this, and many apologies for the delay!" [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [18:12:42] (03CR) 10Dzahn: "ah, so this profile::phorge was used to set up a test instance of phorge before we switched phabricator to phorge upstream.
but it's not in" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [18:14:16] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.11 refs T366956 [18:14:21] T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956 [18:16:15] CUSTOM - Host mwlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:16:42] er? [18:16:44] what is this custom thing? [18:16:48] I missed the memo [18:17:19] CUSTOM - Host mwlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:17:48] sukhe: It's me, I'm debugging alerts of that host with mutante. [18:17:58] oh thanks denisse [18:18:25] it's a way to send alerts manually from Icinga web UI [18:18:31] without actually taking something down :) [18:20:12] oh interesting [18:22:28] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2004-dev.codfw.wmnet with OS bookworm [18:25:27] (03CR) 10BCornwall: [C:03+2] cp5017: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049168 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [18:28:51] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [18:31:08] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS bullseye [18:31:19] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923318 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS b... 
[18:43:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367856)', diff saved to https://phabricator.wikimedia.org/P65425 and previous config saved to /var/cache/conftool/dbconfig/20240625-184349-marostegui.json [18:43:55] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [18:49:44] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5017.eqsin.wmnet with OS bullseye [18:49:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS bullseye [18:50:18] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923367 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bulls... [18:50:25] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS b... 
[18:58:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P65426 and previous config saved to /var/cache/conftool/dbconfig/20240625-185856-marostegui.json [18:59:11] (03PS2) 10Dzahn: Phabricator: Add safe.directory directives [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [18:59:33] (03CR) 10CI reject: [V:04-1] Phabricator: Add safe.directory directives [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [19:01:13] (03PS3) 10Dzahn: Phabricator: Add safe.directory directives [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [19:04:24] (03CR) 10Dzahn: [C:04-1] admin: Extend access for AndyRussG (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [19:07:25] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9923444 (10Dzahn) Hey @AndyRussG No worries, and hope you are well / feeling better. There is no particular rush here. We have a couple days until t... [19:14:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P65428 and previous config saved to /var/cache/conftool/dbconfig/20240625-191403-marostegui.json [19:14:56] (03PS4) 10Dzahn: Phabricator: Add safe.directory directive [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [19:15:51] (03CR) 10Dzahn: "the arcanist class is only used on toolforge, not on prod phabricator. 
so this is just a single dir after all" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [19:16:16] (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1025478/3069/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [19:23:45] !log re-enable puppet on lvs2011 [19:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:52] (03CR) 10Dzahn: [C:03+2] "config was created but unfortunately won't work as expected since /srv/phab is a symlink" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [19:24:16] (03CR) 10Dzahn: [C:03+2] "[phab1004:/srv/phab] $ cat /etc/gitconfig.d/10-safe_directory_phabdir.gitconfig" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [19:25:31] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage [19:28:02] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage [19:29:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367856)', diff saved to https://phabricator.wikimedia.org/P65429 and previous config saved to /var/cache/conftool/dbconfig/20240625-192910-marostegui.json [19:29:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [19:29:16] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [19:29:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [19:29:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime 
for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:29:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:29:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T367856)', diff saved to https://phabricator.wikimedia.org/P65430 and previous config saved to /var/cache/conftool/dbconfig/20240625-192947-marostegui.json [19:32:15] (03CR) 10Dzahn: [C:03+2] "There is another issue here. When the git config is in the home dir of the user running git then it works but the same config in /etc/gitc" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper) [19:33:39] (03PS1) 10Cwhite: mediawiki: enable forward of fatal metrics to statsd exporter [puppet] - 10https://gerrit.wikimedia.org/r/1049625 (https://phabricator.wikimedia.org/T356814) [19:41:45] (03PS1) 10Scott French: kubernetes: promote unavailable replicas alert to critical [alerts] - 10https://gerrit.wikimedia.org/r/1049627 (https://phabricator.wikimedia.org/T366932) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T2000). Please do the needful. [20:00:05] ksarabia and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:20] o/ [20:01:33] !log hashar@deploy1002 Started deploy [integration/docroot@1eb5f4c]: remove CollaborationKit T368092 [20:01:38] T368092: Archive the CollaborationKit extension - https://phabricator.wikimedia.org/T368092 [20:01:40] !log hashar@deploy1002 Finished deploy [integration/docroot@1eb5f4c]: remove CollaborationKit T368092 (duration: 00m 07s) [20:01:49] i can deploy o/ [20:01:55] whee [20:02:40] i'll start with yours bvibber since i don't see ksarabia yet [20:03:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5017.eqsin.wmnet with OS bullseye [20:03:20] ok [20:03:27] (03PS2) 10Bvibber: Temporarily disable '4K' 2160p and mid 1440p transcodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) [20:03:29] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bulls... 
[20:03:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) (owner: 10Bvibber) [20:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:16] (03Merged) 10jenkins-bot: Temporarily disable '4K' 2160p and mid 1440p transcodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) (owner: 10Bvibber) [20:05:31] hi [20:05:47] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1049608|Temporarily disable '4K' 2160p and mid 1440p transcodes (T368433)]] [20:05:54] T368433: Disable 1440p and 2160p video transcodes until encoding performance is better - https://phabricator.wikimedia.org/T368433 [20:06:04] hi kim! i'll deploy your patches after bvibber's [20:06:14] sounds good [20:08:39] !log cjming@deploy1002 cjming, bvibber: Backport for [[gerrit:1049608|Temporarily disable '4K' 2160p and mid 1440p transcodes (T368433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:08:42] bvibber: is your patch testable? 
on mwdebug if so [20:08:58] Nah it'll just affect job queue [20:09:04] sounds good - will sync [20:09:07] So should be fine as long as it doesn't kill the site ;) [20:09:11] !log cjming@deploy1002 cjming, bvibber: Continuing with sync [20:11:03] (PS6) Jdlrobson: Enable dark mode on more pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366378) [20:11:06] !log restart swift-proxy on ms-fe2010 ms-fe1011 T360913 [20:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:12] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [20:14:24] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1049608|Temporarily disable '4K' 2160p and mid 1440p transcodes (T368433)]] (duration: 08m 36s) [20:14:29] T368433: Disable 1440p and 2160p video transcodes until encoding performance is better - https://phabricator.wikimedia.org/T368433 [20:14:43] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366378) (owner: Jdlrobson) [20:14:50] bvibber: should be live! [20:14:54] thx :D [20:14:57] yw! [20:15:15] kimberly_sarabia: moving onto your pathes [20:15:18] *patches [20:15:27] sounds good [20:15:50] (Merged) jenkins-bot: Enable dark mode on more pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366378) (owner: Jdlrobson) [20:16:13] [oh, actually i can check that it disabled correctly.... 
and it looks good :D thx] [20:16:21] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1041250|Enable dark mode on more pages (T366378 T367374 T366373 T366520 T366373)]] [20:16:31] T366378: [Config change] Enable night theme on preferences pages - https://phabricator.wikimedia.org/T366378 [20:16:32] T367374: [Config] Enable dark mode on protect and deletion pages - https://phabricator.wikimedia.org/T367374 [20:16:32] T366373: [Config change] Enable night theme on pages which use data tables - https://phabricator.wikimedia.org/T366373 [20:16:32] T366520: [Config] Dark mode is not available on Special:ApiSandbox - https://phabricator.wikimedia.org/T366520 [20:17:08] (PS5) Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) [20:19:01] !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1041250|Enable dark mode on more pages (T366378 T367374 T366373 T366520 T366373)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:19:06] kimberly_sarabia: 1st patch is up on test servers if you want to check [20:19:27] ok taking a look [20:20:10] (CR) Marostegui: [C:+1] Remove acmechief annotations for orchestrator [puppet] - https://gerrit.wikimedia.org/r/1049505 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [20:25:26] kimberly_sarabia: shall i sync? [20:25:46] Ok, it looks good except system-messages, protected-pages is not enabled in dark mode for me. I will ask if we either missed something in the patch or if we are not moving forward with that. But go ahead and move forward with the sync [20:25:56] thank you [20:26:06] alrighty - syncing! 
[20:26:10] !log cjming@deploy1002 jdlrobson, cjming: Continuing with sync [20:26:57] PROBLEM - Disk space on mw1446 is CRITICAL: DISK CRITICAL - free space: / 9479 MB (2% inode=99%): /tmp 9479 MB (2% inode=99%): /var/tmp 9479 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops [20:31:25] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1041250|Enable dark mode on more pages (T366378 T367374 T366373 T366520 T366373)]] (duration: 15m 04s) [20:31:37] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) (owner: Jdlrobson) [20:31:38] T366378: [Config change] Enable night theme on preferences pages - https://phabricator.wikimedia.org/T366378 [20:31:39] T367374: [Config] Enable dark mode on protect and deletion pages - https://phabricator.wikimedia.org/T367374 [20:31:39] T366373: [Config change] Enable night theme on pages which use data tables - https://phabricator.wikimedia.org/T366373 [20:31:39] T366520: [Config] Dark mode is not available on Special:ApiSandbox - https://phabricator.wikimedia.org/T366520 [20:32:12] kimberly_sarabia: 1st patch should be live - started your 2nd patch [20:32:28] (Merged) jenkins-bot: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) (owner: Jdlrobson) [20:32:58] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1043880|Cleanup: Remove wgNavigationTimingSurveyName (T367128)]] [20:33:05] cjming: thank you! [20:33:05] T367128: PHP Deprecated: Use of QuickSurveys survey with link parameter was deprecated in MediaWiki 1.43. 
[Called from QuickSurveys\SurveyFactory::factoryExternal] - https://phabricator.wikimedia.org/T367128 [20:35:34] !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1043880|Cleanup: Remove wgNavigationTimingSurveyName (T367128)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:35:57] kimberly_sarabia: yw! ok to sync 2nd patch? [20:36:16] cjming: yup! [20:36:20] !log cjming@deploy1002 jdlrobson, cjming: Continuing with sync [20:37:31] (PS1) Dzahn: phabricator: configure git safedir for all directories [puppet] - https://gerrit.wikimedia.org/r/1049637 (https://phabricator.wikimedia.org/T360756) [20:41:27] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1043880|Cleanup: Remove wgNavigationTimingSurveyName (T367128)]] (duration: 08m 29s) [20:41:33] T367128: PHP Deprecated: Use of QuickSurveys survey with link parameter was deprecated in MediaWiki 1.43. [Called from QuickSurveys\SurveyFactory::factoryExternal] - https://phabricator.wikimedia.org/T367128 [20:41:59] kimberly_sarabia: and 2nd patch should be live :) [20:42:25] cjming: wonderful! thank you so much [20:42:41] you're very welcome! [20:44:47] !log end of UTC late backport window [20:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:12] (CR) Krinkle: "For the past ~2 years, the CDN config has been a 1-day fresh TTL with a 7-day stale/keep TTL (i.e. akin to stale-while-revalidate). 
In ord" [mediawiki-config] - https://gerrit.wikimedia.org/r/1029187 (owner: Bartosz Dziewoński) [21:04:10] (CR) Dzahn: "This is somewhat "wrong" but the only way to make things work because the deployment dir path changes on EVERY deploy and only scap knows " [puppet] - https://gerrit.wikimedia.org/r/1049637 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn) [21:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:47:37] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2005-dev.codfw.wmnet with OS bookworm [21:47:45] (CR) Andrew Bogott: [C:+2] hieradata: Move cloudvirt2005-dev to OVS [puppet] - https://gerrit.wikimedia.org/r/1049615 (https://phabricator.wikimedia.org/T364457) (owner: Andrew Bogott) [21:57:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T364069)', diff saved to https://phabricator.wikimedia.org/P65431 and previous config saved to /var/cache/conftool/dbconfig/20240625-215705-marostegui.json [21:57:11] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:03:38] hmm [22:03:46] av_interleaved_write_frame(): No space left on device [22:03:46] Error writing trailer of transcoded.webm: No space left on device [22:04:07] is that just the quota or are the video job runners out of space :D [22:05:38] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage [22:06:23] (CR) JHathaway: [C:+1] Remove acmechief annotations for MX hosts [puppet] - https://gerrit.wikimedia.org/r/1049481 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [22:09:08] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudvirt2005-dev.codfw.wmnet with reason: host reimage [22:10:10] !log a webVideoTranscode job reported 'No space left on device' from a failed ffmpeg run on mw1446 recently [22:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P65432 and previous config saved to /var/cache/conftool/dbconfig/20240625-221212-marostegui.json [22:27:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P65433 and previous config saved to /var/cache/conftool/dbconfig/20240625-222719-marostegui.json [22:33:35] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2005-dev.codfw.wmnet with OS bookworm [22:42:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T364069)', diff saved to https://phabricator.wikimedia.org/P65434 and previous config saved to /var/cache/conftool/dbconfig/20240625-224226-marostegui.json [22:42:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [22:42:33] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:42:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [22:42:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T364069)', diff saved to https://phabricator.wikimedia.org/P65435 and previous config saved to /var/cache/conftool/dbconfig/20240625-224249-marostegui.json [22:43:04] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet [22:44:15] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2006-dev.codfw.wmnet with OS bookworm 
[22:47:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:50:32] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:51:20] mutante: ^ gerrit go boom? [22:51:37] (CR) Andrew Bogott: [C:+2] hieradata: Move cloudvirt2006-dev to OVS [puppet] - https://gerrit.wikimedia.org/r/1049616 (https://phabricator.wikimedia.org/T364457) (owner: Andrew Bogott) [22:51:57] bd808: yea, but when I started looking it was already back [22:52:10] ack [22:52:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:54:14] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:55:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:58:42] (PS1) Dzahn: gerrit: add another IP to misbehaving bots [puppet] - https://gerrit.wikimedia.org/r/1049643 
[22:59:19] (CR) Dzahn: [V:+2 C:+2] gerrit: add another IP to misbehaving bots [puppet] - https://gerrit.wikimedia.org/r/1049643 (owner: Dzahn) [23:00:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:02:36] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage [23:05:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage [23:27:41] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2006-dev.codfw.wmnet with OS bookworm [23:35:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367856)', diff saved to https://phabricator.wikimedia.org/P65436 and previous config saved to /var/cache/conftool/dbconfig/20240625-233520-marostegui.json [23:35:26] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:38:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1049644 [23:38:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1049644 (owner: TrainBranchBot) [23:50:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P65437 and previous config saved to /var/cache/conftool/dbconfig/20240625-235027-marostegui.json