[00:00:20] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1049275 (owner: TrainBranchBot)
[00:01:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on 8 hosts with reason: T365763
[00:01:19] <stashbot>	 T365763: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763
[00:01:34] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: T365763
[00:04:14] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:30] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[00:08:45] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[00:10:04] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[00:14:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:14:38] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:18:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:18:38] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:43:16] <sukhe>	 !log sudo pkill mpeg: mw1438, high CPU usage, ffmpeg processes
[00:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:29] <sukhe>	 !log [correction of command] sudo pkill ffmpeg: mw1438, high CPU usage, ffmpeg processes
[00:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:10] <sukhe>	 ok should recover now
[00:44:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:44:42] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:45:12] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs5004 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[00:45:20] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs5006 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[00:45:32] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:45:43] <sukhe>	 ^ expected
[00:46:06] <brett>	 silencing
[00:49:14] <jinxer-wm>	 FIRING: [7x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:49:21] <brett>	 ^silenced
[00:49:27] <sukhe>	 thanks
[00:58:30] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[00:58:45] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:03:30] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:04:14] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:07:44] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.11 [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1049282 (https://phabricator.wikimedia.org/T366956)
[01:07:44] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Popups] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1049181 (https://phabricator.wikimedia.org/T366419) (owner: Func)
[01:07:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.11 [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1049282 (https://phabricator.wikimedia.org/T366956) (owner: TrainBranchBot)
[01:11:28] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919775 (BCornwall)
[01:32:33] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.11 [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1049282 (https://phabricator.wikimedia.org/T366956) (owner: TrainBranchBot)
[01:36:05] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919804 (BCornwall)
[01:40:11] <brett>	 !log Removing downtime for cp[5017-5024] as nvme drives are installed and hosts back online - T365763
[01:40:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:16] <stashbot>	 T365763: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763
[01:42:58] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:43:01] <sukhe>	 ha
[01:43:05] <sukhe>	 all good
[01:43:18] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=cache_text,dc=eqsin
[01:43:37] <wikibugs>	 (PS1) BCornwall: Revert "depool eqsin for text cluster drive upgrade" [dns] - https://gerrit.wikimedia.org/r/1049286
[01:44:14] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:10] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs5004 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[01:45:20] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs5006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[01:47:21] <wikibugs>	 (CR) Ssingh: [C:+1] Revert "depool eqsin for text cluster drive upgrade" [dns] - https://gerrit.wikimedia.org/r/1049286 (owner: BCornwall)
[01:47:45] <wikibugs>	 (CR) BCornwall: [C:+2] Revert "depool eqsin for text cluster drive upgrade" [dns] - https://gerrit.wikimedia.org/r/1049286 (owner: BCornwall)
[01:47:58] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:48:27] <brett>	 !log Running authdns-update on dns1004 to pool eqsin - T365763
[01:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:32] <stashbot>	 T365763: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763
[01:48:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:54:47] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919821 (BCornwall)
[01:55:00] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919822 (BCornwall)
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0200)
[02:18:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST services) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:23:53] <jinxer-wm>	 FIRING: [5x] KubernetesAPILatency: High Kubernetes API latency (LIST csidrivers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:28:53] <jinxer-wm>	 RESOLVED: [5x] KubernetesAPILatency: High Kubernetes API latency (LIST csidrivers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0300)
[03:01:52] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1049289 (https://phabricator.wikimedia.org/T366956)
[03:01:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1049289 (https://phabricator.wikimedia.org/T366956) (owner: TrainBranchBot)
[03:02:35] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1049289 (https://phabricator.wikimedia.org/T366956) (owner: TrainBranchBot)
[03:03:02] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.11  refs T366956
[03:03:08] <stashbot>	 T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956
[03:28:10] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:29:14] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:38:10] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:55:21] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.11  refs T366956 (duration: 52m 19s)
[03:55:26] <stashbot>	 T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0400)
[04:01:03] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.8 (duration: 00m 55s)
[04:04:14] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:14] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:12:54] <wikibugs>	 (CR) Ayounsi: [C:+2] "self merging as it's only for the dev instance and CI/PCC is happy." [puppet] - https://gerrit.wikimedia.org/r/1049263 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[05:24:20] <wikibugs>	 (PS1) Ayounsi: Netbox 4: rename device_role to role [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049292 (https://phabricator.wikimedia.org/T336275)
[05:32:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[05:32:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[05:32:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T367856)', diff saved to https://phabricator.wikimedia.org/P65394 and previous config saved to /var/cache/conftool/dbconfig/20240625-053239-marostegui.json
[05:32:44] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:32:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[05:32:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[05:33:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:33:06] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:33:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T364069)', diff saved to https://phabricator.wikimedia.org/P65395 and previous config saved to /var/cache/conftool/dbconfig/20240625-053312-marostegui.json
[05:33:18] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:35:52] <wikibugs>	 (CR) Marostegui: mariadb: monitoring memory pressure (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: Arnaudb)
[05:48:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:59:06] <wikibugs>	 (CR) Marostegui: mariadb: add monitoring on io pressure for mariadb hosts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: Arnaudb)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0600).
[06:02:34] <wikibugs>	 (PS2) Arnaudb: dbconfig: temporary disable writes on es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047910 (https://phabricator.wikimedia.org/T368020)
[06:02:51] <marostegui>	 !log Drop ipblocks from s6 T367632
[06:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:56] <stashbot>	 T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632
[06:04:04] <wikibugs>	 (CR) Marostegui: dbconfig: temporary disable writes on es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047910 (https://phabricator.wikimedia.org/T368020) (owner: Arnaudb)
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:29] <wikibugs>	 (CR) Arnaudb: [C:+2] dbconfig: temporary disable writes on es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1047910 (https://phabricator.wikimedia.org/T368020) (owner: Arnaudb)
[06:05:23] <logmsgbot>	 !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1047910|dbconfig: temporary disable writes on es7 (T368020)]]
[06:05:41] <stashbot>	 T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:56] <marostegui>	 !log Drop ipblocks from s7 T367632
[06:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:02] <stashbot>	 T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632
[06:17:16] <logmsgbot>	 !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1047910|dbconfig: temporary disable writes on es7 (T368020)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:17:22] <stashbot>	 T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020
[06:19:05] <logmsgbot>	 !log arnaudb@deploy1002 arnaudb: Continuing with sync
[06:24:10] <logmsgbot>	 !log arnaudb@deploy1002 Finished scap: Backport for [[gerrit:1047910|dbconfig: temporary disable writes on es7 (T368020)]] (duration: 18m 47s)
[06:24:15] <stashbot>	 T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020
[06:25:16] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1049310 (https://phabricator.wikimedia.org/T368355)
[06:25:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove apereo spec test [puppet] - https://gerrit.wikimedia.org/r/1049139 (https://phabricator.wikimedia.org/T367487) (owner: Muehlenhoff)
[06:25:32] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es7 T368020
[06:25:49] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T368020
[06:26:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set es1039 with weight 0 T368020', diff saved to https://phabricator.wikimedia.org/P65396 and previous config saved to /var/cache/conftool/dbconfig/20240625-062640-arnaudb.json
[06:27:35] <wikibugs>	 (Abandoned) Muehlenhoff: mailman: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971412 (owner: Muehlenhoff)
[06:27:43] <wikibugs>	 (PS2) Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1047906 (https://phabricator.wikimedia.org/T368020)
[06:31:28] <wikibugs>	 (PS1) Muehlenhoff: Revert "Point codfw and codfw1dev to use the eqiad LDAP ro servers as well" [puppet] - https://gerrit.wikimedia.org/r/1049378 (https://phabricator.wikimedia.org/T367861)
[06:32:26] <wikibugs>	 (PS1) Marostegui: db2129: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1049379
[06:32:31] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: Promote es1039 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1047905 (https://phabricator.wikimedia.org/T368020) (owner: Gerrit maintenance bot)
[06:33:14] <wikibugs>	 (CR) Marostegui: [C:+2] db2129: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1049379 (owner: Marostegui)
[06:33:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T364069)', diff saved to https://phabricator.wikimedia.org/P65397 and previous config saved to /var/cache/conftool/dbconfig/20240625-063334-marostegui.json
[06:33:39] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:33:46] <arnaudb>	 !log Starting es7 eqiad failover from es1035 to es1039 - T368020
[06:33:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:52] <stashbot>	 T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020
[06:34:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote es1039 to es7 primary T368020', diff saved to https://phabricator.wikimedia.org/P65398 and previous config saved to /var/cache/conftool/dbconfig/20240625-063453-arnaudb.json
[06:36:18] <wikibugs>	 (PS1) Arnaudb: Revert "dbconfig: temporary disable writes on es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049386
[06:36:58] <wikibugs>	 (CR) Arnaudb: [C:+2] wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1047906 (https://phabricator.wikimedia.org/T368020) (owner: Gerrit maintenance bot)
[06:38:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T368355
[06:38:59] <stashbot>	 T368355: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T368355
[06:39:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2161 with weight 0 T368355', diff saved to https://phabricator.wikimedia.org/P65399 and previous config saved to /var/cache/conftool/dbconfig/20240625-063908-root.json
[06:39:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T368355
[06:40:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T368020', diff saved to https://phabricator.wikimedia.org/P65400 and previous config saved to /var/cache/conftool/dbconfig/20240625-064000-arnaudb.json
[06:40:06] <stashbot>	 T368020: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T368020
[06:40:14] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1049310 (https://phabricator.wikimedia.org/T368355) (owner: Gerrit maintenance bot)
[06:40:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by arnaudb@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049386 (owner: Arnaudb)
[06:41:37] <wikibugs>	 (Merged) jenkins-bot: Revert "dbconfig: temporary disable writes on es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049386 (owner: Arnaudb)
[06:42:14] <logmsgbot>	 !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]]
[06:45:16] <logmsgbot>	 !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:45:17] <logmsgbot>	 !log arnaudb@deploy1002 Sync cancelled.
[06:45:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:46:51] <arnaudb>	 ah
[06:47:11] <arnaudb>	 this explains why I could not reenable writes on es7
[06:47:12] <arnaudb>	 :D
[06:48:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P65401 and previous config saved to /var/cache/conftool/dbconfig/20240625-064841-marostegui.json
[06:50:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:50:32] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:52:42] <logmsgbot>	 !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]]
[06:52:51] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] ml-services: update articlequality image and storage URI [deployment-charts] - https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) (owner: AikoChou)
[06:53:03] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: update articlequality image and storage URI [deployment-charts] - https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) (owner: AikoChou)
[06:53:16] <arnaudb>	 it's back!
[06:53:54] <wikibugs>	 (Merged) jenkins-bot: ml-services: update articlequality image and storage URI [deployment-charts] - https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) (owner: AikoChou)
[06:54:14] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:54:32] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[06:55:13] <logmsgbot>	 !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:55:23] <logmsgbot>	 !log arnaudb@deploy1002 arnaudb: Continuing with sync
[06:55:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T0700).
[07:00:05] <jouncebot>	 Func: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:15] <Func>	 o/
[07:00:29] <logmsgbot>	 !log arnaudb@deploy1002 Finished scap: Backport for [[gerrit:1049386|Revert "dbconfig: temporary disable writes on es7"]] (duration: 07m 47s)
[07:01:00] <marostegui>	 !log Starting s8 codfw failover from db2165 to db2161 - T368355
[07:01:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:06] <stashbot>	 T368355: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T368355
[07:01:07] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9920235 (SLyngshede-WMF) @odimitrijevic / @Ottomata / @WDoranWMF Would either of you approve?
[07:01:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2161 to s8 primary T368355', diff saved to https://phabricator.wikimedia.org/P65402 and previous config saved to /var/cache/conftool/dbconfig/20240625-070127-marostegui.json
[07:02:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2165 T368355', diff saved to https://phabricator.wikimedia.org/P65403 and previous config saved to /var/cache/conftool/dbconfig/20240625-070252-marostegui.json
[07:03:27] <wikibugs>	 (PS2) Kevin Bazira: ml-services: return logo-detection latency metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1049082 (https://phabricator.wikimedia.org/T367962)
[07:03:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P65404 and previous config saved to /var/cache/conftool/dbconfig/20240625-070348-marostegui.json
[07:06:01] <wikibugs>	 (PS1) Marostegui: db2165: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1049388
[07:06:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Long schema change
[07:06:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Long schema change
[07:06:56] <wikibugs>	 (CR) Marostegui: [C:+2] db2165: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1049388 (owner: Marostegui)
[07:07:33] <wikibugs>	 (PS1) David Caro: p:prometheus::cloud: add temporary ebpf scraping [puppet] - https://gerrit.wikimedia.org/r/1049389 (https://phabricator.wikimedia.org/T348643)
[07:07:55] <wikibugs>	 (CR) CI reject: [V:-1] p:prometheus::cloud: add temporary ebpf scraping [puppet] - https://gerrit.wikimedia.org/r/1049389 (https://phabricator.wikimedia.org/T348643) (owner: David Caro)
[07:09:07] <wikibugs>	 (CR) David Caro: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3060/console" [puppet] - https://gerrit.wikimedia.org/r/1049389 (https://phabricator.wikimedia.org/T348643) (owner: David Caro)
[07:09:49] <wikibugs>	 (Abandoned) David Caro: p:prometheus::cloud: add temporary ebpf scraping [puppet] - https://gerrit.wikimedia.org/r/1049389 (https://phabricator.wikimedia.org/T348643) (owner: David Caro)
[07:10:28] <wikibugs>	 (CR) Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[07:13:03] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Revert "Point codfw and codfw1dev to use the eqiad LDAP ro servers as well" [puppet] - https://gerrit.wikimedia.org/r/1049378 (https://phabricator.wikimedia.org/T367861) (owner: Muehlenhoff)
[07:14:37] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9920267 (dcaro) >>! In T348643#9919050, @CDanis wrote: > Unfortunately `cloudcephosd1020` has too old a Debian / kernel for this without some mor...
[07:14:46] <marostegui>	 !log Optimize pagelinks on old s8 codfw master db2165 dbmaint T364069
[07:14:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:52] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:18:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T364069)', diff saved to https://phabricator.wikimedia.org/P65405 and previous config saved to /var/cache/conftool/dbconfig/20240625-071855-marostegui.json
[07:18:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:19:00] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:26:50] <wikibugs>	 (PS1) Slyngshede: data.yaml: Add daphnesmit to deployment [puppet] - https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159)
[07:28:02] <wikibugs>	 (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:29:14] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:31:00] <wikibugs>	 (PS1) Brouberol: amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424
[07:31:46] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3061/console" [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol)
[07:32:21] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3062/co" [puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:32:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org
[07:33:58] <wikibugs>	 (CR) CI reject: [V:-1] amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol)
[07:36:08] <wikibugs>	 (PS2) Brouberol: amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424
[07:36:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org
[07:41:39] <wikibugs>	 (CR) Elukey: [C:+1] No longer refer to setting the acmechief hosts [cookbooks] - https://gerrit.wikimedia.org/r/1047444 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:42:23] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] C:apereo_cas check for tomcat 10 on CAS 7 only variables. [puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:42:46] <wikibugs>	 (CR) Elukey: [C:+1] amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol)
[07:43:35] <wikibugs>	 (CR) Brouberol: [C:+2] amg_gpu: fix duplicated class declaration [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol)
[07:44:00] <elukey>	 brouberol: o/
[07:44:13] <brouberol>	 hi elukey o/
[07:44:18] <elukey>	 sorry my brain is still foggy, isn't the change the same as we have now?
[07:44:51] <elukey>	 ah no in theory it is fine
[07:44:54] <elukey>	 nevermind :D
[07:44:57] <brouberol>	 mondays
[07:45:12] <brouberol>	 am I right?
[07:45:36] <brouberol>	 puppet is now compiling and running on dse-k8s-worker1001
[07:45:43] <brouberol>	 thanks for the quick review!
[07:45:51] <wikibugs>	 (PS1) Muehlenhoff: Point eqiad and cloud/eqiad to use the codfw LDAP ro servers as well [puppet] - https://gerrit.wikimedia.org/r/1049446 (https://phabricator.wikimedia.org/T367861)
[07:45:54] <elukey>	 I am more worried that it is not monday and I am still doing pebcak :D
[07:46:06] <elukey>	 anyway, thanks for fixing!
[07:46:24] <elukey>	 now that I recall we can probably remove the rocm stuff from the DSE workers
[07:46:50] <elukey>	 https://phabricator.wikimedia.org/T363191
[07:47:37] <elukey>	 I have reopened it
[07:47:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Default to use acmechief1002 [puppet] - https://gerrit.wikimedia.org/r/1047443 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:50:44] <wikibugs>	 (CR) Muehlenhoff: [C:+2] No longer refer to setting the acmechief hosts [cookbooks] - https://gerrit.wikimedia.org/r/1047444 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:54:10] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9920337 (dcaro) Just created a silly dashboard with the data that's coming in: https://grafana-rw.wikimedia.org/d/...
[07:54:15] <wikibugs>	 (PS1) David Caro: p:prometheus::cloud: use cloudcephosd1010 instead of 1020 [puppet] - https://gerrit.wikimedia.org/r/1049449 (https://phabricator.wikimedia.org/T348643)
[07:55:35] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for mariadb hosts [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799)
[07:57:59] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for IDM/IDP [puppet] - https://gerrit.wikimedia.org/r/1049453 (https://phabricator.wikimedia.org/T365799)
[07:58:01] <wikibugs>	 (PS2) Muehlenhoff: Remove acmechief annotations for mariadb hosts [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799)
[07:58:09] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049453 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:03:33] <wikibugs>	 SRE, Data-Engineering, Dumps-Generation, Data Products (Data Products Sprint 15), Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9920355 (Joe) >>! In T368098#9918924, @xcollazo wrote: > Ok after ob...
[08:04:14] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:54] <wikibugs>	 (PS2) Ayounsi: Netbox 4: rename device_role to role in validators [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049292 (https://phabricator.wikimedia.org/T336275)
[08:05:54] <wikibugs>	 (PS1) Ayounsi: Netbox 4: fix cable terminations breaking changes [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275)
[08:06:22] <wikibugs>	 (CR) David Caro: [C:+2] p:prometheus::cloud: use cloudcephosd1010 instead of 1020 [puppet] - https://gerrit.wikimedia.org/r/1049449 (https://phabricator.wikimedia.org/T348643) (owner: David Caro)
[08:06:52] <wikibugs>	 (CR) CI reject: [V:-1] Netbox 4: fix cable terminations breaking changes [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:09:43] <wikibugs>	 (PS1) Slyngshede: IDP-Test: Switch to CAS 7 on idp-test1002 [dns] - https://gerrit.wikimedia.org/r/1049456 (https://phabricator.wikimedia.org/T367487)
[08:10:33] <wikibugs>	 (PS1) Elukey: prometheus-amd-rocm-stats.py: fix edge case for temperature reading [puppet] - https://gerrit.wikimedia.org/r/1049457
[08:11:33] <wikibugs>	 (PS2) Ayounsi: Netbox 4: fix cable terminations breaking changes [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275)
[08:13:58] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:16:34] <wikibugs>	 (CR) Ayounsi: [C:+2] "Tested locally, merging into dev, post merge reviews welcome before moving to prod." [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049292 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:16:44] <wikibugs>	 (CR) Ayounsi: [C:+2] "Tested locally, merging into dev, post merge reviews welcome before moving to prod." [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:16:52] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for gitlab [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799)
[08:18:23] <wikibugs>	 (Merged) jenkins-bot: Netbox 4: rename device_role to role in validators [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049292 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:18:58] <wikibugs>	 (Merged) jenkins-bot: Netbox 4: fix cable terminations breaking changes [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049454 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:22:31] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:25:05] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9920418 (dcaro) Data is coming in now from both nodes, latencies look similar so far, with sdc on 1034 being different and having less spread (no...
[08:26:50] <logmsgbot>	 !log jynus@cumin1002 dbctl commit (dc=all): 'Depool es2022', diff saved to https://phabricator.wikimedia.org/P65406 and previous config saved to /var/cache/conftool/dbconfig/20240625-082649-jynus.json
[08:28:13] <wikibugs>	 (CR) Btullis: [C:+1] "Late to the party, but +1 and thanks." [puppet] - https://gerrit.wikimedia.org/r/1049424 (owner: Brouberol)
[08:30:55] <jynus>	 I see no errors, downtiming and depooling another
[08:31:07] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: full dump
[08:31:20] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: full dump
[08:32:17] <logmsgbot>	 !log jynus@cumin1002 dbctl commit (dc=all): 'Depool es2025', diff saved to https://phabricator.wikimedia.org/P65407 and previous config saved to /var/cache/conftool/dbconfig/20240625-083216-jynus.json
[08:32:33] <wikibugs>	 (CR) Marostegui: [C:+1] Remove acmechief annotations for mariadb hosts [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:35:13] <wikibugs>	 (CR) Santiago Faci: [C:+1] "Looks good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) (owner: KCVelaga)
[08:38:04] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3063/co" [puppet] - https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: Ahmon Dancy)
[08:38:13] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for dns/ncredir/durum/doh [puppet] - https://gerrit.wikimedia.org/r/1049461 (https://phabricator.wikimedia.org/T365799)
[08:39:07] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049461 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:39:14] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for mariadb hosts [puppet] - https://gerrit.wikimedia.org/r/1049452 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:39:29] <wikibugs>	 (CR) Vgutierrez: [C:+1] "+1 on the ncredir side of things :)" [puppet] - https://gerrit.wikimedia.org/r/1049461 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:43:25] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1049456 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:44:39] <wikibugs>	 (CR) Jelto: [V:+1 C:+1] "lgtm, let me know when this should be merged. We could merge that in today office hours." [puppet] - https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: Ahmon Dancy)
[08:45:23] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: full dump
[08:45:37] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: full dump
[08:46:36] <wikibugs>	 (PS1) Muehlenhoff: Remove access for twentyafterfour [puppet] - https://gerrit.wikimedia.org/r/1049465
[08:47:06] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:49:36] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for apt hosts [puppet] - https://gerrit.wikimedia.org/r/1049466 (https://phabricator.wikimedia.org/T365799)
[08:50:55] <wikibugs>	 (CR) Jelto: [V:+1 C:+1] "lgtm. Can you double check the diff for gitlab2002 https://puppet-compiler.wmflabs.org/output/1049459/3064/gitlab2002.wikimedia.org/index." [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:52:30] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] package_builder: don't install python-all on bookworm [puppet] - https://gerrit.wikimedia.org/r/1049180 (https://phabricator.wikimedia.org/T367544) (owner: Jelto)
[08:53:23] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for various DE roles [puppet] - https://gerrit.wikimedia.org/r/1049469 (https://phabricator.wikimedia.org/T365799)
[08:55:16] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:57] <wikibugs>	 (CR) Muehlenhoff: "Indeed, that is expected. Before we had the split between P5 and P7 acmechief hosts, all clients defaulted to acmechief1001. This after th" [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:57:39] <wikibugs>	 (PS1) Jelto: gitlab: add missing custom nginx config also to WMCS [puppet] - https://gerrit.wikimedia.org/r/1049472 (https://phabricator.wikimedia.org/T366786)
[08:58:17] <wikibugs>	 (CR) Jelto: [V:+1 C:+1] "thanks for the explanation, lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:59:42] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: add missing custom nginx config also to WMCS [puppet] - https://gerrit.wikimedia.org/r/1049472 (https://phabricator.wikimedia.org/T366786) (owner: Jelto)
[09:01:03] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for apt hosts [puppet] - https://gerrit.wikimedia.org/r/1049466 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[09:03:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for gitlab [puppet] - https://gerrit.wikimedia.org/r/1049459 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[09:04:14] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:06:19] <wikibugs>	 (PS1) Elukey: docker::reporter: remove Stretch/Jessie restrictions [puppet] - https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427)
[09:08:24] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for mirrors [puppet] - https://gerrit.wikimedia.org/r/1049475 (https://phabricator.wikimedia.org/T365799)
[09:18:01] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920631 (ABran-WMF)
[09:18:23] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920630 (Marostegui) >>! In T365995#9883497, @jcrespo wrote: > backup1009 is the main backup node for bacula on eqiad. Most ba...
[09:19:17] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1049478 (https://phabricator.wikimedia.org/T368371)
[09:19:22] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1049479 (https://phabricator.wikimedia.org/T368371)
[09:19:42] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920643 (Marostegui)
[09:20:25] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9920645 (ABran-WMF)
[09:21:19] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9920659 (ABran-WMF)
[09:22:05] <wikibugs>	 (PS1) MVernon: ceph: install wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049480 (https://phabricator.wikimedia.org/T279621)
[09:22:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for mirrors [puppet] - https://gerrit.wikimedia.org/r/1049475 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[09:23:11] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9920670 (ABran-WMF)
[09:24:21] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) (owner: Elukey)
[09:24:35] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for twentyafterfour [puppet] - https://gerrit.wikimedia.org/r/1049465 (owner: Muehlenhoff)
[09:24:51] <wikibugs>	 (CR) Arnaudb: [C:+1] ceph: install wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049480 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:26:33] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for MX hosts [puppet] - https://gerrit.wikimedia.org/r/1049481 (https://phabricator.wikimedia.org/T365799)
[09:28:13] <wikibugs>	 (PS1) Ayounsi: Allow Ganeti RAPI access from netbox-dev2003 [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275)
[09:28:42] <wikibugs>	 (PS2) Ayounsi: Allow Ganeti RAPI access from netbox-dev2003 [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275)
[09:29:09] <wikibugs>	 (PS1) Cathal Mooney: Adjust labs-in policy after clouddb is replaced with an-redacteddb [homer/public] - https://gerrit.wikimedia.org/r/1049483 (https://phabricator.wikimedia.org/T368316)
[09:29:35] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for cloudlb/clouddumps/cloudservices-codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1049484 (https://phabricator.wikimedia.org/T365799)
[09:30:26] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[09:30:36] <icinga-wm>	 RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2197) taken on 2024-06-25 08:42:03 (463 GiB, -3.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[09:31:06] <wikibugs>	 (PS4) Clément Goubert: mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655)
[09:31:09] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049481 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[09:31:43] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049484 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[09:31:53] <wikibugs>	 (CR) Slyngshede: [C:+2] IDP-Test: Switch to CAS 7 on idp-test1002 [dns] - https://gerrit.wikimedia.org/r/1049456 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[09:32:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1228 T368374', diff saved to https://phabricator.wikimedia.org/P65408 and previous config saved to /var/cache/conftool/dbconfig/20240625-093221-root.json
[09:32:23] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db1228 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1049485 (https://phabricator.wikimedia.org/T368374)
[09:32:28] <stashbot>	 T368374: Move one host temporarily to m2 - https://phabricator.wikimedia.org/T368374
[09:32:40] <wikibugs>	 (CR) Ayounsi: [V:+1] Allow Ganeti RAPI access from netbox-dev2003 [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[09:33:04] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Remove db1228 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1049485 (https://phabricator.wikimedia.org/T368374) (owner: Marostegui)
[09:34:12] <slyngs>	 !log Switching idp-test.wikimedia.org to CAS 7
[09:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:19] <wikibugs>	 (CR) Muehlenhoff: "Needs manager approval, otherwise looks good" [puppet] - https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: Slyngshede)
[09:34:21] <wikibugs>	 (PS1) Hnowlan: shellbox-video: fix typo [deployment-charts] - https://gerrit.wikimedia.org/r/1049487
[09:34:52] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[09:34:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1228 from dbctl T368374', diff saved to https://phabricator.wikimedia.org/P65409 and previous config saved to /var/cache/conftool/dbconfig/20240625-093454-marostegui.json
[09:35:03] <wikibugs>	 (CR) MVernon: [V:+2 C:+2] ceph: install wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049480 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:35:46] <wikibugs>	 (CR) Ayounsi: [V:+1 C:+2] Allow Ganeti RAPI access from netbox-dev2003 [puppet] - https://gerrit.wikimedia.org/r/1049482 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[09:35:47] <wikibugs>	 (CR) Clément Goubert: trafficserver: complete switch to mw-on-k8s (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[09:36:14] <wikibugs>	 (PS1) Marostegui: db1228: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1049488
[09:36:59] <wikibugs>	 (CR) Marostegui: [C:+2] db1228: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1049488 (owner: Marostegui)
[09:38:48] <wikibugs>	 (PS4) Clément Goubert: envoy: Ensure legacy listeners point to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949)
[09:38:49] <wikibugs>	 (PS3) Clément Goubert: trafficserver: complete switch to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949)
[09:42:16] <wikibugs>	 (CR) Majavah: [C:+1] Remove acmechief annotations for cloudlb/clouddumps/cloudservices-codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1049484 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[09:42:49] <wikibugs>	 (PS1) Slyngshede: Update Thymeleaf syntax to remove deprecation warning. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1049492 (https://phabricator.wikimedia.org/T367487)
[09:43:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:05] <wikibugs>	 (PS1) Majavah: P:puppet: Remove Puppet 7 MOTD [puppet] - https://gerrit.wikimedia.org/r/1049494
[09:44:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db[1217,1228].eqiad.wmnet with reason: Cloning
[09:44:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db[1217,1228].eqiad.wmnet with reason: Cloning
[09:47:18] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:47:28] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:47:39] <marostegui>	 ^ known and expected
[09:49:40] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1228 to m2 [puppet] - https://gerrit.wikimedia.org/r/1049497 (https://phabricator.wikimedia.org/T368374)
[09:49:45] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Add fallback languages for Phabricator [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1047593 (owner: Pppery)
[09:49:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 21 days, 0:00:00 on 25 hosts with reason: Turning down appserver clusters
[09:49:53] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Good idea!" [puppet] - https://gerrit.wikimedia.org/r/1049494 (owner: Majavah)
[09:50:04] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for cloudlb/clouddumps/cloudservices-codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1049484 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[09:50:07] <wikibugs>	 (CR) Majavah: [C:+2] P:puppet: Remove Puppet 7 MOTD [puppet] - https://gerrit.wikimedia.org/r/1049494 (owner: Majavah)
[09:50:15] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 21 days, 0:00:00 on 25 hosts with reason: Turning down appserver clusters
[09:50:30] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920844 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ca43ab0-579a-4f82-97aa-11720f300bd7) set by cgoubert@cumin1002 for 21 days, 0:00...
[09:50:51] <wikibugs>	 (PS1) Fabfur: benthos:cache: added catch resource to log errors in parse_log [puppet] - https://gerrit.wikimedia.org/r/1049498 (https://phabricator.wikimedia.org/T365718)
[09:53:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 21 days, 0:00:00 on 11 hosts with reason: Turning down appserver clusters
[09:53:54] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 21 days, 0:00:00 on 11 hosts with reason: Turning down appserver clusters
[09:54:13] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920870 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=046a1781-9fad-454c-b26b-ad2c96d2d8b2) set by cgoubert@cumin1002 for 21 days, 0:00...
[09:55:24] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9920871 (cmooney) >>! In T326322#9650260, @cmooney wrote: >>>! In T326322#9130092, @ayounsi wrote: >> @cmooney I came across https://w...
[09:55:40] <wikibugs>	 (PS1) Ayounsi: Netbox puppet import: ignore ipip interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/1049499
[09:56:59] <wikibugs>	 (CR) Ayounsi: "So far not useful to have them in Netbox as they don't hold any info (IP or other). Please let me know if you think we should import them." [software/netbox-extras] - https://gerrit.wikimedia.org/r/1049499 (owner: Ayounsi)
[09:58:17] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Move db1228 to m2 [puppet] - https://gerrit.wikimedia.org/r/1049497 (https://phabricator.wikimedia.org/T368374) (owner: Marostegui)
[09:58:48] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9920891 (dcaro) >>! In T348643#9920418, @dcaro wrote: > Data is coming in now from both nodes, latencies look simi...
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1000)
[10:01:39] <wikibugs>	 (03PS1) 10Superpes15: Removing 'spamblacklistlog' rights to usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683)
[10:02:21] <wikibugs>	 (03PS2) 10Superpes15: Removing 'spamblacklistlog' right to usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683)
[10:02:26] <wikibugs>	 (03PS3) 10Superpes15: Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683)
[10:04:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for caches [puppet] - 10https://gerrit.wikimedia.org/r/1049501 (https://phabricator.wikimedia.org/T365799)
[10:05:38] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049501 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:05:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] irc.wikimedia.org: Stop sending broadcast events to the old buster nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049137 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[10:07:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for icinga [puppet] - 10https://gerrit.wikimedia.org/r/1049502 (https://phabricator.wikimedia.org/T365799)
[10:08:04] <wikibugs>	 (03PS2) 10Fabfur: benthos:cache: added catch resource to log errors in parse_log [puppet] - 10https://gerrit.wikimedia.org/r/1049498 (https://phabricator.wikimedia.org/T365718)
[10:08:18] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049502 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:09:13] <wikibugs>	 (03PS7) 10Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797)
[10:09:17] <wikibugs>	 (03PS7) 10Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797)
[10:09:21] <wikibugs>	 (03PS7) 10Brouberol: cloudnative-pg: set image values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797)
[10:09:25] <wikibugs>	 (03PS12) 10Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797)
[10:09:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol)
[10:09:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudnative-pg: move queries to configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol)
[10:10:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudnative-pg: set image values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol)
[10:10:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol)
[10:11:11] <logmsgbot>	 !log jmm@deploy1002 Started scap: (no justification provided)
[10:12:43] <logmsgbot>	 !log jmm@deploy1002 Finished scap: (no justification provided) (duration: 03m 30s)
[10:13:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch old irc hosts to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/1049503 (https://phabricator.wikimedia.org/T331702)
[10:15:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for netmon [puppet] - 10https://gerrit.wikimedia.org/r/1049504 (https://phabricator.wikimedia.org/T365799)
[10:15:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/1049505 (https://phabricator.wikimedia.org/T365799)
[10:17:32] <jinxer-wm>	 FIRING: [2x] UdpMxIrcEchoThroughput: irc1001:9221 has relayed less than 100 messages over past 5 minutes - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[10:18:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9920991 (10Dreamy_Jazz) @Dzahn I am requesting membership with access to Kerberos.
[10:19:31] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049506
[10:20:03] <wikibugs>	 (03PS4) 10Clément Goubert: trafficserver: complete switch to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949)
[10:20:08] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[10:21:47] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[10:23:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1049492 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[10:23:39] <wikibugs>	 (03PS8) 10Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797)
[10:23:39] <wikibugs>	 (03PS8) 10Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797)
[10:23:40] <wikibugs>	 (03PS8) 10Brouberol: cloudnative-pg: set image values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797)
[10:23:40] <wikibugs>	 (03PS13) 10Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797)
[10:24:14] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for netmon [puppet] - 10https://gerrit.wikimedia.org/r/1049504 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:26:53] <wikibugs>	 (03PS1) 10Slyngshede: Move fonts to CSS directory. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1049508 (https://phabricator.wikimedia.org/T367487)
[10:27:02] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Update Thymeleaf syntax to remove deprecation warning. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1049492 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[10:27:04] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Update Thymeleaf syntax to remove deprecation warning. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1049492 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[10:29:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch old irc hosts to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/1049503 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[10:29:21] <wikibugs>	 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9921045 (10fnegri) This was linked in the parent task but I'm not sure if it's really a blocker here: T103011
[10:32:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1049508 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[10:32:25] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049505 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:35:13] <icinga-wm>	 PROBLEM - ircecho bot process on irc1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho
[10:35:15] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] "I think this won't work as is. docker-report does install the debmonitor client via apt-get and that will most likely fail for old distros" [puppet] - 10https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey)
[10:36:23] <icinga-wm>	 PROBLEM - ircecho bot process on irc2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho
[10:37:29] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] Remove acmechief annotations for caches [puppet] - 10https://gerrit.wikimedia.org/r/1049501 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:37:32] <jinxer-wm>	 RESOLVED: [2x] UdpMxIrcEchoThroughput: irc1001:9221 has relayed less than 100 messages over past 5 minutes - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[10:37:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for ML hosts [puppet] - 10https://gerrit.wikimedia.org/r/1049511 (https://phabricator.wikimedia.org/T365799)
[10:38:27] <wikibugs>	 (03CR) 10Klausman: [C:03+1] prometheus-amd-rocm-stats.py: fix edge case for temperature reading [puppet] - 10https://gerrit.wikimedia.org/r/1049457 (owner: 10Elukey)
[10:38:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/1049512 (https://phabricator.wikimedia.org/T365799)
[10:39:14] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service irc1001:6667 has failed probes (tcp_mw_rc_irc_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:39:14] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job udpmxircecho in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:39:57] <wikibugs>	 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9921082 (10Marostegui) >>! In T368136#9919314, @bd808 wrote: > What sort of data y'all are concerned about exposing to new roots on the replica db ho...
[10:40:06] <marostegui>	 !log m2 dbmaint eqiad Stop db1217:3322 to clone db1228 T368374
[10:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:11] <stashbot>	 T368374: Move one host temporarily to m2 - https://phabricator.wikimedia.org/T368374
[10:40:32] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:41:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch the mw_rc_irc role to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1049513 (https://phabricator.wikimedia.org/T349619)
[10:41:53] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049512 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:43:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "The idea is to obtain the list of legacy images and then remove them in the docker registry instead." [puppet] - 10https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey)
[10:43:18] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049513 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:44:41] <wikibugs>	 (03CR) 10Majavah: Adjust labs-in policy after clouddb is replaced with an-redacteddb (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1049483 (https://phabricator.wikimedia.org/T368316) (owner: 10Cathal Mooney)
[10:44:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for archiva [puppet] - 10https://gerrit.wikimedia.org/r/1049515 (https://phabricator.wikimedia.org/T365799)
[10:45:00] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] dynamicproxy: Clarify error page titles [puppet] - 10https://gerrit.wikimedia.org/r/1049145 (owner: 10Majavah)
[10:45:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for lists [puppet] - 10https://gerrit.wikimedia.org/r/1049516 (https://phabricator.wikimedia.org/T365799)
[10:46:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch the mw_rc_irc role to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1049513 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:48:41] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9921167 (10MoritzMuehlenhoff)
[10:49:14] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service irc1001:6667 has failed probes (tcp_mw_rc_irc_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:49:14] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:49:20] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049516 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:50:32] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service irc1001:6667 has failed probes (tcp_mw_rc_irc_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:50:39] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9921174 (10jcrespo) > Is there a procedure for that so we know how to do so?  Sadly, there is not. The code changes for implemen...
[10:50:51] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Move fonts to CSS directory. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1049508 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[10:51:23] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Move fonts to CSS directory. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1049508 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[10:51:29] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049515 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:54:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove acmechief annotations for [puppet] - 10https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799)
[10:55:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:55:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:56:13] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9921190 (10Marostegui) I will try - but just in case @ABran-WMF please take some notes!
[10:56:36] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove acmechief annotations for wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799)
[10:57:27] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[10:59:23] <wikibugs>	 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9921203 (10fnegri) >  I assume wmcs-roots is just WMCS staff and those would be the ones having root access?  wmcs-roots is defined in [admin/data/da...
[11:03:59] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] "Yeah, that's understood. But removing the rules will still make docker-report fail for all of those images and they would not be reported " [puppet] - 10https://gerrit.wikimedia.org/r/1049474 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey)
[11:36:48] <wikibugs>	 (03PS1) 10Majavah: P:netbox: Don't show status MOTD for active hosts [puppet] - 10https://gerrit.wikimedia.org/r/1049525
[11:40:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655) (owner: 10Clément Goubert)
[11:40:36] <wikibugs>	 (03PS8) 10Clément Goubert: mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655)
[11:42:41] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655) (owner: 10Clément Goubert)
[11:42:45] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9921290 (10Ladsgroup) The trigger seems to be a duress imposed on s4:...
[11:44:17] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Introduce rsyslog udp2log rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049182 (https://phabricator.wikimedia.org/T365655) (owner: 10Clément Goubert)
[11:45:12] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:45:15] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:46:06] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] trafficserver: complete switch to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[11:46:42] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] Point eqiad and cloud/eqiad to use the codfw LDAP ro servers as well [puppet] - 10https://gerrit.wikimedia.org/r/1049446 (https://phabricator.wikimedia.org/T367861) (owner: 10Muehlenhoff)
[11:49:06] <wikibugs>	 (03PS3) 10Jforrester: [WIP] Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981)
[11:49:24] <wikibugs>	 (03PS4) 10Jforrester: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981)
[11:51:04] <wikibugs>	 (03CR) 10Jforrester: "As merging this will make the next scap run deploy it immediately, we shouldn't do this until we're sure (unless there's a nicer way to r" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) (owner: 10Jforrester)
[11:51:10] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for esams [puppet] - 10https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383)
[11:51:52] <wikibugs>	 (03CR) 10Jforrester: Switch php7.4-cli to bullseye and cascade (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) (owner: 10Jforrester)
[11:52:02] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for esams [puppet] - 10https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383)
[11:52:09] <wikibugs>	 (03PS1) 10Jgiannelos: pcs: Enable resource change events on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530
[11:52:49] <wikibugs>	 (03PS3) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for esams [puppet] - 10https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383)
[11:52:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pcs: Enable resource change events on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530 (owner: 10Jgiannelos)
[11:53:06] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez)
[11:53:51] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:54:14] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:55:59] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:56:14] <vgutierrez>	 !log disable puppet on A:cp-esams before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049529 - T364383
[11:56:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:20] <stashbot>	 T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383
[11:57:59] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Set fifo-log-demux prometheus port for esams [puppet] - 10https://gerrit.wikimedia.org/r/1049529 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez)
[11:58:29] <vgutierrez>	 !log rolling upgrade of fifo-log-demux on A:cp-esams - T364383
[11:58:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1200)
[12:00:36] <icinga-wm>	 RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2197) taken on 2024-06-25 10:54:18 (817 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[12:02:37] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Rate limit udp2log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049532 (https://phabricator.wikimedia.org/T365655)
[12:03:47] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "Thanks !! Could be worth linking it to https://phabricator.wikimedia.org/T352957 :)" [puppet] - 10https://gerrit.wikimedia.org/r/1049525 (owner: 10Majavah)
[12:04:14] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:04:56] <icinga-wm>	 PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:05:40] <wikibugs>	 (03PS2) 10Jgiannelos: pcs: Enable resource change events on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530
[12:05:56] <icinga-wm>	 RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:06:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pcs: Enable resource change events on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530 (owner: 10Jgiannelos)
[12:08:02] <wikibugs>	 (03CR) 10Ayounsi: Adjust labs-in policy after clouddb is replaced with an-redacteddb (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1049483 (https://phabricator.wikimedia.org/T368316) (owner: 10Cathal Mooney)
[12:09:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[12:09:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[12:09:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T364069)', diff saved to https://phabricator.wikimedia.org/P65411 and previous config saved to /var/cache/conftool/dbconfig/20240625-120926-marostegui.json
[12:42:01] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[12:42:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[12:44:42] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[12:44:58] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[12:45:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:46:28] <logmsgbot>	 !log cgoubert@deploy1002 Started scap: Deploy udp2log rate-limiting - T365655 - T368098
[12:46:35] <stashbot>	 T365655: mw-api-ext unavailability 2024-05-22 18:30 UTC  - https://phabricator.wikimedia.org/T365655
[12:46:35] <stashbot>	 T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[12:51:40] <logmsgbot>	 !log cgoubert@deploy1002 Finished scap: Deploy udp2log rate-limiting - T365655 - T368098 (duration: 05m 49s)
[12:51:46] <stashbot>	 T365655: mw-api-ext unavailability 2024-05-22 18:30 UTC  - https://phabricator.wikimedia.org/T365655
[12:51:47] <stashbot>	 T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[12:51:52] <urbanecm>	 jouncebot: nowandnext
[12:51:53] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1200)
[12:51:53] <jouncebot>	 In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1300)
[12:52:03] <urbanecm>	 let's start CI
[12:54:02] * MichaelG_WMF is here to observe
[12:54:26] <MichaelG_WMF>	 probably the change-tag change can only be tested in a sensible way on testwiki?
[12:54:40] <urbanecm>	 yep
[12:54:47] <urbanecm>	 or in prod
[12:54:55] <urbanecm>	 eg. on cswiki, where i'm an admin as a volunteer
[12:56:21] <MichaelG_WMF>	 true, but that would mean we would have to change live prod config. not sure if that is possible in a way that is not mild vandalism
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1300)
[13:00:05] <jouncebot>	 Func and urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <Lucas_WMDE>	 o/
[13:00:13] <Func>	 o/
[13:00:41] <urbanecm>	 i can deploy today
[13:00:44] <Lucas_WMDE>	 I take it urbanecm is deploying ^^
[13:00:47] <Lucas_WMDE>	 ah, jinx
[13:02:48] <urbanecm>	 Func: i'm afraid that change is not safe to backport, as extension.json change takes effect immediately, but PHP waits for the rolling restart.
[13:02:57] <urbanecm>	 how critical is backporting that change?
[13:03:48] <Func>	 not critical, since it has been broken for 3 weeks...
[13:04:14] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:04:32] <Lucas_WMDE>	 urbanecm: IIUC that shouldn’t be relevant anymore now that 100% of traffic goes to mw-on-k8s
[13:04:39] <claime>	 was about to say
[13:04:40] <urbanecm>	 Lucas_WMDE: oh, we're on 100%? 
[13:04:44] <claime>	 they'll be deployed at the same time
[13:04:48] <Lucas_WMDE>	 we are \o/
[13:04:50] <claime>	 (except on videoscalers)
[13:04:51] <urbanecm>	 i missed that announcement, i thought it's still 80:20 or something
[13:04:52] <Lucas_WMDE>	 for about a week now IIRC
[13:05:00] <urbanecm>	 then what i said doesn't matter :)
[13:05:12] <Lucas_WMDE>	 https://phabricator.wikimedia.org/T362323#9903574
[13:05:12] <claime>	 urbanecm: we didn't do a big announcement yet, still got a little cleanup to do
[13:05:18] <Lucas_WMDE>	 heh, almost exactly a week indeed
[13:05:23] <urbanecm>	 ah, that's why i missed it!
[13:05:35] <urbanecm>	 Func: i'll backport your
[13:05:38] <urbanecm>	 *your patch
[13:05:42] <claime>	 (it makes me happy that nobody noticed)
[13:05:43] <Func>	 thanks
[13:07:26] <fabfur>	 !log temporarily disabled puppet on cp4037 to test benthos configuration (T367756)
[13:07:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:31] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[13:18:54] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1049534|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049535|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049539|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]]
[13:19:02] <stashbot>	 T366989: Edits made via Special:CommunityConfiguration should have a CommunityConfiguration tag attached - https://phabricator.wikimedia.org/T366989
[13:19:02] <stashbot>	 T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275
[13:22:50] <urbanecm>	 scap, scap faster, please!
[13:26:48] <vgutierrez>	 !log rolling restart of pybal on lvs1020 and lvs1018 - T367861
[13:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:53] <stashbot>	 T367861: Migrate ldap-ro and ldap-ro-ssl to IPIP encapsulation - https://phabricator.wikimedia.org/T367861
[13:27:02] <Lucas_WMDE>	 scap should be somewhat faster when it doesn’t do the php-fpm-restart on the bare-metal servers anymore
[13:27:11] <Lucas_WMDE>	 (but I don’t know if that’s happened yet)
[13:27:34] <Lucas_WMDE>	 at least in my subjective experience the k8s restarts have stayed faster than the bare-metal restarts even as the k8s cluster grew and bare-metal shrank
[13:29:58] <vgutierrez>	 !log IPIP encapsulation enabled on ldap-ro.eqiad.wikimedia.org - T367861
[13:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:26] <urbanecm>	 Lucas_WMDE: it's still building the images
[13:30:32] <urbanecm>	 it's not even at the mwdebug stage
[13:30:36] <icinga-wm>	 RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2197) taken on 2024-06-25 12:45:43 (506 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[13:30:47] <urbanecm>	 and we're half through the window already
[13:31:18] <Lucas_WMDE>	 I guess it’s taking longer due to the backports including i18n changes 😔
[13:31:40] <urbanecm>	 possible
[13:31:43] <urbanecm>	 hard to say
[13:32:46] <urbanecm>	 docker pull! progress.
[13:36:32] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply
[13:36:46] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply
[13:37:55] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: cluster=apus
[13:39:28] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[13:42:02] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for krb1002 - jclark@cumin1002"
[13:43:05] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for krb1002 - jclark@cumin1002"
[13:43:05] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:44:26] <sukhe>	 !log disable puppet on A:lvs and A:codfw for CR 1049560
[13:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:07] <cdanis>	 jouncebot: nowandnext
[13:48:07] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1300)
[13:48:07] <jouncebot>	 In 1 hour(s) and 11 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1500)
[13:48:41] <cdanis>	 urbanecm: hm I wonder if it is i18n taking so long
[13:49:09] <Lucas_WMDE>	 I’m pretty sure it is tbh, I thought that’s been a known thing for a long time now
[13:49:18] <Lucas_WMDE>	 that as soon as i18n is touched it has to rebuild the whole cache
[13:49:21] <Lucas_WMDE>	 or something like that
[13:51:03] <Lucas_WMDE>	 I think there was a related issue a year or two ago, maybe I can find it
[13:51:08] <sukhe>	 !log restart pybal on lvs1020
[13:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:25] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs
[13:59:15] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply
[13:59:25] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply
[14:00:04] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1049534|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049535|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049539|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:00:10] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs
[14:00:15] <stashbot>	 T366989: Edits made via Special:CommunityConfiguration should have a CommunityConfiguration tag attached - https://phabricator.wikimedia.org/T366989
[14:00:15] <stashbot>	 T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275
[14:00:22] <urbanecm>	 finally
[14:01:00] <cdanis>	 sheesh
[14:01:19] <urbanecm>	 let's test
[14:01:24] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[14:01:45] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[14:02:34] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply
[14:02:38] <urbanecm>	 Func: your patch also made it to mwdebug. can you test it there please?
[14:02:50] <Func>	 ok
[14:02:55] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply
[14:03:20] <Lucas_WMDE>	 (haven’t managed to find the task I remembered so far, I’m afraid)
[14:04:35] <MichaelG_WMF>	 urbanecm: Even with mwdebug I'm not seeing the tag on https://cs.wikipedia.org/wiki/Speci%C3%A1ln%C3%AD:Zna%C4%8Dky - shouldn't it be there?
[14:04:42] <Func>	 urbanecm: looks good
[14:05:00] <sukhe>	 !log restart pybal on lvs2014
[14:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:23] <MichaelG_WMF>	 urbanecm: oh wait, you only backported the change to wmf.11, so I need to find a group 0 wiki
[14:06:23] <urbanecm>	 MichaelG_WMF: testwiki should work
[14:06:29] <urbanecm>	 i have the wmf.10 lined up
[14:06:42] <urbanecm>	 let me run the migration on ptwiki and mwdebug
[14:06:51] <MichaelG_WMF>	 urbanecm: But I'm not seeing it on https://test.wikipedia.org/wiki/Special:Tags either...
[14:07:43] <urbanecm>	 hmm...
[14:07:49] <urbanecm>	 but it does work at https://test.wikipedia.org/w/index.php?title=MediaWiki:GrowthExperimentsMentorship.json&diff=prev&oldid=601004
[14:07:52] <urbanecm>	 i just made an edit
[14:08:05] <urbanecm>	 but still not on special:tags
[14:08:14] <urbanecm>	 i'm willing to bet on a cache, given it works on edit
[14:08:17] <urbanecm>	 MichaelG_WMF: thoughts?
[14:09:06] <urbanecm>	 oh, not
[14:09:13] <urbanecm>	 that's what i only did for wmf.10
[14:09:18] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[14:09:33] <urbanecm>	 proceeding then
[14:09:48] <MichaelG_WMF>	 urbanecm: cache sounds most plausible, I agree. 
[14:09:56] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049538 (https://phabricator.wikimedia.org/T368275) (owner: 10Urbanecm)
[14:10:48] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs
[14:11:13] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs
[14:11:21] <MichaelG_WMF>	 urbanecm: I think we can move forward, adding the tag on edit is the important part. And that would not work if the hooks would not work
[14:11:59] <urbanecm>	 yup
[14:12:30] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs[1011-1021].eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[14:12:35] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[14:15:34] <sukhe>	 !log sudo cumin "A:dnsbox" 'disable-puppet "rolling out CR 1049165"'
[14:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:23] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1049534|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049535|Add change tag "Community Configuration" (T366989)]], [[gerrit:1049539|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] (duration: 58m 28s)
[14:17:29] <stashbot>	 T366989: Edits made via Special:CommunityConfiguration should have a CommunityConfiguration tag attached - https://phabricator.wikimedia.org/T366989
[14:17:30] <stashbot>	 T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275
[14:17:58] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:ntp: generate $time_acl from network::constants [puppet] - 10https://gerrit.wikimedia.org/r/1049165 (owner: 10Ssingh)
[14:22:25] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1049570 (https://phabricator.wikimedia.org/T364383)
[14:23:00] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049570 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez)
[14:23:44] <cdanis>	 urbanecm: are you done with the window?
[14:23:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:24:29] <dcausse>	 !log re-indexing all wikidata entity schemas (T368010)
[14:24:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:34] <stashbot>	 T368010: Search not working for entity schemas - https://phabricator.wikimedia.org/T368010
[14:24:48] <wikibugs>	 (03PS1) 10Effie Mouzeli: app.job: update module (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049571 (https://phabricator.wikimedia.org/T356885)
[14:25:28] <wikibugs>	 (03PS3) 10Clément Goubert: mw-api-ext: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043706 (https://phabricator.wikimedia.org/T365265)
[14:25:28] <wikibugs>	 (03PS3) 10Clément Goubert: mw-api-int: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265)
[14:25:28] <wikibugs>	 (03PS3) 10Clément Goubert: mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265)
[14:26:38] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] mariadb: disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049555 (https://phabricator.wikimedia.org/T368401) (owner: 10Arnaudb)
[14:27:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049538 (https://phabricator.wikimedia.org/T368275) (owner: 10Urbanecm)
[14:27:15] <urbanecm>	 cdanis: not yet
[14:27:55] <cdanis>	 ok np
[14:29:12] <urbanecm>	 one last patch
[14:30:52] <vgutierrez>	 !log disable puppet on A:cp-eqiad before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049570 - T364383
[14:30:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:01] <stashbot>	 T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383
[14:31:53] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Set fifo-log-demux prometheus port for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1049570 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez)
[14:32:59] <wikibugs>	 (03Merged) 10jenkins-bot: WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049538 (https://phabricator.wikimedia.org/T368275) (owner: 10Urbanecm)
[14:33:05] <Func>	 urbanecm: wait, it seems the version of Popups on Special:Version is still not the new one? (I don't have a user created before 2017, and after the patch there should be no logical difference by user creation date, so I only tested with my own account.)
[14:33:06] <urbanecm>	 finally
[14:33:34] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1049538|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]]
[14:33:37] <vgutierrez>	 !log rolling upgrade of fifo-log-demux on A:cp-eqiad  - T364383
[14:33:39] <stashbot>	 T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275
[14:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:35] <wikibugs>	 (03CR) 10Scott French: "Thanks, Janis!" [alerts] - 10https://gerrit.wikimedia.org/r/1049260 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[14:34:38] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-on-k8s: extend envoy_cluster_name to new format [alerts] - 10https://gerrit.wikimedia.org/r/1049260 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[14:35:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbproxy2005 to codfw - jhancock@cumin2002"
[14:35:18] <sukhe>	 !log sudo cumin -b1 -s900 "A:dnsbox" "run-puppet-agent --enable 'rolling out CR 1049165' && systemctl restart ntp.service" 
[14:35:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:23] <urbanecm>	 Func: i wouldn't bet on the version ID tbh. i'm not sure how reliable it is wrt backports.
[14:36:11] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbproxy2005 to codfw - jhancock@cumin2002"
[14:36:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:36:13] <wikibugs>	 (03Merged) 10jenkins-bot: mw-on-k8s: extend envoy_cluster_name to new format [alerts] - 10https://gerrit.wikimedia.org/r/1049260 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[14:37:02] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15): Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9921990 (10Ladsgroup) >>! In T368098#9921528, @jcrespo wrote: > Question, what went wrong wi...
[14:37:50] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] wmerrors: add config and code to copy stats to dogstatsd [puppet] - 10https://gerrit.wikimedia.org/r/1017078 (https://phabricator.wikimedia.org/T356814) (owner: 10Cwhite)
[14:38:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Add krb1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1049572 (https://phabricator.wikimedia.org/T365165)
[14:38:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add krb1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1049572 (https://phabricator.wikimedia.org/T365165) (owner: 10Muehlenhoff)
[14:39:19] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9921994 (10Jhancock.wm)
[14:39:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9921995 (10MoritzMuehlenhoff) >>! In T365165#9921708, @Jclark-ctr wrote: > @MoritzMuehlenhoff  would you be able to update site.pp file for this server...
[14:39:40] <Func>	 urbanecm: Ack. Actually, it's a surprise to me that backports can be done without bots leaving any messages on the task or the change.
[14:39:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9921996 (10MoritzMuehlenhoff)
[14:40:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Revert "Point eqiad and cloud/eqiad to use the codfw LDAP ro servers as well" [puppet] - 10https://gerrit.wikimedia.org/r/1049568 (https://phabricator.wikimedia.org/T367861) (owner: 10Muehlenhoff)
[14:40:10] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1049538|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:40:16] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[14:40:17] <stashbot>	 T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275
[14:40:34] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9922000 (10Scott_French) Thanks, @SGupta-WMF! Ahmon tends to be quite responsi...
[14:43:51] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9922006 (10Ottomata) Approved!
[14:45:19] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1049538|WikiPageWriter: Do not run AbuseFilter when UltimateAuthority is used (T368275)]] (duration: 11m 45s)
[14:45:26] <stashbot>	 T368275: Abusefilter prevented GrowthExperiments Migration script from running - https://phabricator.wikimedia.org/T368275
[14:46:53] <urbanecm>	 cdanis: i'm done
[14:46:57] <cdanis>	 thanks!
[14:47:06] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Sampled tracing (0.1%) for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049202 (https://phabricator.wikimedia.org/T367915) (owner: 10CDanis)
[14:47:47] <wikibugs>	 (03PS1) 10Effie Mouzeli: modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885)
[14:48:10] <wikibugs>	 (03Merged) 10jenkins-bot: Sampled tracing (0.1%) for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049202 (https://phabricator.wikimedia.org/T367915) (owner: 10CDanis)
[14:48:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli)
[14:49:06] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:50:29] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:54:13] <wikibugs>	 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9922021 (10bd808) >>! In T368136#9921082, @Marostegui wrote: > Also, the issue with root is that that user can make changes to replication, grants, s...
[14:55:19] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-e5-eqiad.mgmt with reason: prep JunOS upgrade lsw1-e5-eqiad
[14:55:31] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:55:33] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-e5-eqiad.mgmt with reason: prep JunOS upgrade lsw1-e5-eqiad
[14:55:40] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922024 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7a21c2a6-e267-4150-8111-b348788c4a9b)...
[14:55:55] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus::ops: Pull fifo_log_demux metrics [puppet] - 10https://gerrit.wikimedia.org/r/1049574 (https://phabricator.wikimedia.org/T364383)
[14:55:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T365986 - depool es1035', diff saved to https://phabricator.wikimedia.org/P65413 and previous config saved to /var/cache/conftool/dbconfig/20240625-145558-arnaudb.json
[14:56:04] <stashbot>	 T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986
[14:56:20] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:45:00 on es1035.eqiad.wmnet with reason: T365986
[14:56:33] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on es1035.eqiad.wmnet with reason: T365986
[14:56:56] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:57:24] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-e5-eqiad,lsw1-e5-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-e5-eqiad
[14:57:43] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-e5-eqiad,lsw1-e5-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-e5-eqiad
[14:58:00] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 7 hosts with reason: JunOS upgrade lsw1-e5-eqiad
[14:58:19] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 7 hosts with reason: JunOS upgrade lsw1-e5-eqiad
[14:58:19] <wikibugs>	 (03PS2) 10Effie Mouzeli: modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885)
[14:58:33] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922051 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=01b84d43-d6d0-4f45-bc2e-375ff79e21f8)...
[14:58:53] <wikibugs>	 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9922052 (10fnegri) > That is true, but also not clearly in the scope of this ticket which seems to be specifically about addressing claims of data pr...
[14:59:01] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922053 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=65c438b1-9725-4de3-9a45-8318edea15f1)...
[14:59:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli)
[15:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1500).
[15:00:09] <topranks>	 !log rebooting lsw1-e5-eqiad to upgrade JunOS on switch T365986
[15:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:14] <wikibugs>	 (03PS5) 10Jforrester: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293)
[15:01:44] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922064 (10Jdforrester-WMF)
[15:01:46] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3065/co" [puppet] - 10https://gerrit.wikimedia.org/r/1049574 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez)
[15:02:18] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922079 (10Jdforrester-WMF)
[15:02:49] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update
[15:03:03] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922086 (10Jdforrester-WMF)
[15:03:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update
[15:03:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update
[15:03:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update
[15:04:05] <wikibugs>	 (03PS1) 10Elukey: config.yaml: remove wikimedia-stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049576 (https://phabricator.wikimedia.org/T367427)
[15:04:07] <wikibugs>	 (03PS1) 10Elukey: coredns: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049577 (https://phabricator.wikimedia.org/T368366)
[15:04:08] <wikibugs>	 (03PS1) 10Elukey: envoy: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049578 (https://phabricator.wikimedia.org/T368366)
[15:04:28] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@f58dd50]: deploy phab2002 for T368392
[15:04:33] <stashbot>	 T368392: Deploy Phabricator/Phorge 2024-06-25 - https://phabricator.wikimedia.org/T368392
[15:05:01] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@f58dd50]: deploy phab2002 for T368392 (duration: 00m 33s)
[15:05:07] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922082 (10Jdforrester-WMF)
[15:05:22] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@f58dd50]: deploy phab1004 for T368392
[15:06:12] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@f58dd50]: deploy phab1004 for T368392 (duration: 00m 50s)
[15:08:25] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gitlab: remove last reference to ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1049253 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[15:12:59] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] wmerrors: add config and code to copy stats to dogstatsd [puppet] - 10https://gerrit.wikimedia.org/r/1017078 (https://phabricator.wikimedia.org/T356814) (owner: 10Cwhite)
[15:17:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: confd_prometheus_metrics.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:31] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-api-ext: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043706 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[15:18:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 5%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65414 and previous config saved to /var/cache/conftool/dbconfig/20240625-151802-arnaudb.json
[15:18:09] <stashbot>	 T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986
[15:18:28] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-ext: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043706 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[15:18:38] <wikibugs>	 (03PS1) 10Elukey: helm-state-metrics: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049586 (https://phabricator.wikimedia.org/T368366)
[15:18:39] <wikibugs>	 (03PS1) 10Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366)
[15:18:42] <wikibugs>	 (03PS1) 10Elukey: prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366)
[15:18:48] <wikibugs>	 (03CR) 10Vgutierrez: acme-chief: Add new certificates and domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047147 (owner: 10BCornwall)
[15:19:24] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[15:20:00] <claime>	 !log Deploying statsd to mw-api-ext - T365265
[15:20:03] <claime>	 cc herron ^
[15:20:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:07] <stashbot>	 T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265
[15:20:14] <herron>	 claime: kk
[15:20:54] <claime>	 herron: if you're ok, I can do all remaining deployments today, or we can stagger them on other days
[15:21:17] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[15:21:24] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[15:22:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:22:42] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[15:23:47] <herron>	 claime: ok yeah I'd be inclined to stagger them, by days or even a few hours?  in case we do run into an issue
[15:24:22] <claime>	 herron: sure. we have two remaining major deployments, mw-api-int and mw-web
[15:27:54] <wikibugs>	 (PS4) Jdlrobson: Enable dark mode on data table pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373)
[15:28:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373) (owner: Jdlrobson)
[15:28:20] <wikibugs>	 (CR) Jdlrobson: "Deploy scheduled for Wednesday 1pm PST" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373) (owner: Jdlrobson)
[15:29:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) (owner: Jdlrobson)
[15:29:20] <wikibugs>	 (PS4) Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128)
[15:31:25] <Dreamy_Jazz>	 !log Ran `mwscript extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --wiki=testwiki` for T366781
[15:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:30] <stashbot>	 T366781: Run maintenance script to delete entries only for use when reading old on WMF wikis - https://phabricator.wikimedia.org/T366781
[15:32:25] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65415 and previous config saved to /var/cache/conftool/dbconfig/20240625-153307-arnaudb.json
[15:33:13] <stashbot>	 T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986
[15:33:17] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs[1011-1021].eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[15:33:21] <wikibugs>	 (PS5) Jdlrobson: Enable dark mode on more pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366378)
[15:33:23] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[15:33:36] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[15:33:38] <wikibugs>	 (Abandoned) Jdlrobson: Enable dark mode on more pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: Jdlrobson)
[15:39:44] <wikibugs>	 (PS1) Elukey: service-checker: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366)
[15:39:47] <wikibugs>	 (PS1) Elukey: nutcracker: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366)
[15:45:03] <wikibugs>	 (CR) Majavah: [C:+2] Remove acmechief annotations for remaining WMCS roles [puppet] - https://gerrit.wikimedia.org/r/1049559 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[15:48:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 25%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65416 and previous config saved to /var/cache/conftool/dbconfig/20240625-154813-arnaudb.json
[15:50:12] <wikibugs>	 (CR) JMeybohm: [C:+1] "LGTM, thanks" [puppet] - https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[15:50:25] <Amir1>	 jouncebot: nowandnext
[15:50:25] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1500)
[15:50:25] <jouncebot>	 In 0 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1600)
[15:50:41] <brett>	 !ops disabling puppet/stopping pybal on lvs2011 for memory failure maintenance - T368165
[15:51:13] <wikibugs>	 (CR) JMeybohm: [C:+1] mw-api-int: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[15:51:36] <vgutierrez>	 brett: !log?
[15:51:41] <brett>	 .....
[15:51:45] <brett>	 !log disabling puppet/stopping pybal on lvs2011 for memory failure maintenance - T368165
[15:51:48] <wikibugs>	 (CR) JMeybohm: [C:+1] mw-web: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[15:52:02] <brett>	 how embarrassing
[15:52:08] <vgutierrez>	 E_COFFEE? :)
[15:52:13] <brett>	 indeed...
[15:52:16] <cdanis>	 uh
[15:52:21] <cdanis>	 is logmsgbot broken anyway haha
[15:52:26] <vgutierrez>	 lol
[15:52:32] <brett>	 :S
[15:52:41] <taavi>	 you mean stashbot?
[15:52:41] <brett>	 Did.... !ops break it?
[15:52:55] <wikibugs>	 (PS1) Muehlenhoff: Remove acmechief annotations for remaining collab roles [puppet] - https://gerrit.wikimedia.org/r/1049592 (https://phabricator.wikimedia.org/T365799)
[15:52:58] <icinga-wm>	 PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:53:12] <icinga-wm>	 PROBLEM - pybal on lvs2011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:53:19] <brett>	 ^expected
[15:53:21] <RhinosF1>	 brett: no
[15:53:33] <RhinosF1>	 It and ircservserv-wm quit a bit ago
[15:53:42] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[15:53:43] <RhinosF1>	 It didn't auto restart I guess
[15:53:59] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs2011.codfw.wmnet with reason: T368165
[15:54:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2011.codfw.wmnet with reason: T368165
[15:54:38] <RhinosF1>	 taavi: you can restart stashbot right?
[16:00:04] <jouncebot>	 jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:02:38] <brennen>	 jhathaway, rzl, brett: we need a quick phab deploy for a followup to un-break wikibugs.  ok if we use this window?
[16:02:50] <jhathaway>	 please do
[16:02:57] <brennen>	 thx
[16:03:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 50%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65417 and previous config saved to /var/cache/conftool/dbconfig/20240625-160318-arnaudb.json
[16:03:25] <stashbot>	 T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986
[16:04:14] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:03] <brennen>	 !log silencing phabricator hosts prior to deploy
[16:05:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:28] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@72ad841]: deploy phab2002 for T368392 - followup T364728
[16:08:34] <stashbot>	 T368392: Deploy Phabricator/Phorge 2024-06-25 - https://phabricator.wikimedia.org/T368392
[16:08:35] <stashbot>	 T364728: Revert or upstream rPHABf2fd14dc1edeb41aa2874336548cfaa7fa0e87a0 (maniphest.gettasktransactions API) - https://phabricator.wikimedia.org/T364728
[16:09:01] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@72ad841]: deploy phab2002 for T368392 - followup T364728 (duration: 00m 33s)
[16:10:16] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@72ad841]: deploy phab1004 for T368392 - followup T364728
[16:10:55] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@72ad841]: deploy phab1004 for T368392 - followup T364728 (duration: 00m 39s)
[16:11:22] <wikibugs>	 (PS1) Pppery: Export source strings again so en.json is indented with tabs [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989)
[16:13:18] <wikibugs>	 (PS1) Eevans: ml-cache: Upgrade to Cassandra 4.1.5 [puppet] - https://gerrit.wikimedia.org/r/1049600 (https://phabricator.wikimedia.org/T354970)
[16:13:49] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049600 (https://phabricator.wikimedia.org/T354970) (owner: Eevans)
[16:15:02] <claime>	 !log Extending vg-srv on mw1437
[16:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:22] <wikibugs>	 (CR) Eevans: [C:+2] ml-cache: Upgrade to Cassandra 4.1.5 [puppet] - https://gerrit.wikimedia.org/r/1049600 (https://phabricator.wikimedia.org/T354970) (owner: Eevans)
[16:17:50] <icinga-wm>	 PROBLEM - Disk space on mw1437 is CRITICAL: DISK CRITICAL - free space: / 6395 MB (1% inode=99%): /tmp 6395 MB (1% inode=99%): /var/tmp 6395 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1437&var-datasource=eqiad+prometheus/ops
[16:18:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65418 and previous config saved to /var/cache/conftool/dbconfig/20240625-161824-arnaudb.json
[16:18:32] <stashbot>	 T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986
[16:19:29] <wikibugs>	 (PS1) Pppery: Export source strings again so en.json is indented with tabs [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989)
[16:19:43] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[16:19:48] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[16:19:55] <claime>	 !log cleaning up shellbox leftover files on mw1437.eqiad.wmnet
[16:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:22] <wikibugs>	 (CR) Pppery: Export source strings again so en.json is indented with tabs (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989) (owner: Pppery)
[16:23:51] <claime>	 !log depooling mw1437
[16:23:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:55] <bvibber>	 !log running requeueTranscodes for missing audio files on commons (mwmaint1002) cf T368364
[16:23:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:01] <stashbot>	 T368364: Transcodes of audio-only samples are not running for new uploads - https://phabricator.wikimedia.org/T368364
[16:26:11] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922688 (RobH)
[16:26:50] <wikibugs>	 (PS2) Cathal Mooney: Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348)
[16:27:01] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1437.eqiad.wmnet with reason: Resizing disk
[16:27:10] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922684 (RobH) a:RobH→None
[16:27:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1437.eqiad.wmnet with reason: Resizing disk
[16:27:31] <wikibugs>	 (CR) Cathal Mooney: Validate IRB interface names correspond to vlan and refactor (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: Cathal Mooney)
[16:27:52] <wikibugs>	 (CR) CI reject: [V:-1] Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: Cathal Mooney)
[16:29:06] <wikibugs>	 (CR) Ahmon Dancy: "Sorry I missed office hours today.  Feel free to deploy whenever you see fit." [puppet] - https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: Ahmon Dancy)
[16:29:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:30:00] <icinga-wm>	 RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:30:12] <icinga-wm>	 RECOVERY - pybal on lvs2011 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[16:30:36] <icinga-wm>	 RECOVERY - Disk space on mw1437 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1437&var-datasource=eqiad+prometheus/ops
[16:30:41] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Agreed on my side, can't think of any reason they would be useful." [software/netbox-extras] - https://gerrit.wikimedia.org/r/1049499 (owner: Ayounsi)
[16:31:35] <wikibugs>	 (PS1) CDanis: haproxy: drop traceparent/tracestate response headers [puppet] - https://gerrit.wikimedia.org/r/1049603 (https://phabricator.wikimedia.org/T368428)
[16:31:37] <wikibugs>	 (PS1) CDanis: ats: drop traceparent/tracestate response headers [puppet] - https://gerrit.wikimedia.org/r/1049604 (https://phabricator.wikimedia.org/T368428)
[16:31:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1437.eqiad.wmnet
[16:31:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1437.eqiad.wmnet
[16:32:08] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops, Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922720 (Jhancock.wm) swapped DIMM_B1 for DIMM_B2 to test. error has cleared.
[16:33:13] <wikibugs>	 (CR) Cathal Mooney: "recheck" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: Cathal Mooney)
[16:33:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65419 and previous config saved to /var/cache/conftool/dbconfig/20240625-163330-arnaudb.json
[16:33:36] <stashbot>	 T365986: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986
[16:34:13] <wikibugs>	 (CR) Vgutierrez: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1049603 (https://phabricator.wikimedia.org/T368428) (owner: CDanis)
[16:35:09] <wikibugs>	 (CR) CDanis: [C:+2] haproxy: drop traceparent/tracestate response headers [puppet] - https://gerrit.wikimedia.org/r/1049603 (https://phabricator.wikimedia.org/T368428) (owner: CDanis)
[16:36:38] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922750 (cmooney) Open→Resolved
[16:37:24] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[16:37:29] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[16:39:10] <wikibugs>	 ops-codfw, SRE, DC-Ops, Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9922776 (Jhancock.wm) Thursday is great, thanks.
[16:39:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T364069)', diff saved to https://phabricator.wikimedia.org/P65420 and previous config saved to /var/cache/conftool/dbconfig/20240625-163919-marostegui.json
[16:39:25] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[16:42:00] <icinga-wm>	 PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:43:24] <brett>	 ^expected
[16:43:33] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS bookworm
[16:46:58] <wikibugs>	 (CR) Isabelle Hurbain-Palatin: pcs: Enable resource change events on staging (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1049530 (owner: Jgiannelos)
[16:49:07] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[16:49:13] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[16:49:50] <wikibugs>	 (CR) Aklapper: [C:+2] Export source strings again so en.json is indented with tabs [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989) (owner: Pppery)
[16:50:03] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] "Applies cleanly on latest wmf/stable branch locally" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1049599 (https://phabricator.wikimedia.org/T349989) (owner: Pppery)
[16:54:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P65421 and previous config saved to /var/cache/conftool/dbconfig/20240625-165426-marostegui.json
[16:57:18] <wikibugs>	 (CR) CDanis: "not in a rush about this one, please advise about rollout though (ATS restarts required?)" [puppet] - https://gerrit.wikimedia.org/r/1049604 (https://phabricator.wikimedia.org/T368428) (owner: CDanis)
[16:58:56] <wikibugs>	 (CR) Jgiannelos: pcs: Enable resource change events on staging (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1049530 (owner: Jgiannelos)
[16:59:54] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops, Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922861 (BCornwall) Open→Resolved Linux is happy, too. Thank you, @Jhancock.wm!
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1700)
[17:01:09] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922864 (BCornwall) a:BCornwall
[17:01:22] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[17:01:28] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[17:02:06] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage
[17:04:14] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:04:28] <wikibugs>	 (CR) Ssingh: "Reload should be fine here and is done by Puppet automatically." [puppet] - https://gerrit.wikimedia.org/r/1049604 (https://phabricator.wikimedia.org/T368428) (owner: CDanis)
[17:04:32] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage
[17:04:53] <cdanis>	 sukhe: ah thanks, I had remembered it being manual restart required for some reason
[17:06:01] <wikibugs>	 (CR) Dzahn: [C:+2] "https://puppet-compiler.wmflabs.org/output/1049592/3067/" [puppet] - https://gerrit.wikimedia.org/r/1049592 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[17:06:50] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[17:06:55] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[17:07:11] <sukhe>	 cdanis: happy to take care of rolling this out if desired (as we do for other such requests)
[17:07:36] <cdanis>	 sukhe: I mean if it is just Puppet auto-reloads I have no problem +2ing and p-merging :)
[17:07:56] <sukhe>	 cdanis: I am pretty sure but if it is not, I take the fall. go ahead :)
[17:08:04] <sukhe>	 you can try on one host I guess
[17:08:23] <sukhe>	 cdanis: I am on-call now, so an official excuse
[17:08:46] <wikibugs>	 (CR) CDanis: [C:+2] ats: drop traceparent/tracestate response headers [puppet] - https://gerrit.wikimedia.org/r/1049604 (https://phabricator.wikimedia.org/T368428) (owner: CDanis)
[17:09:04] <wikibugs>	 (CR) Dzahn: [C:+2] "no changes" [puppet] - https://gerrit.wikimedia.org/r/1049592 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[17:09:19] <wikibugs>	 (CR) Jgiannelos: pcs: Enable resource change events on staging (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1049530 (owner: Jgiannelos)
[17:09:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P65422 and previous config saved to /var/cache/conftool/dbconfig/20240625-170933-marostegui.json
[17:11:58] <wikibugs>	 (PS5) Scott French: mediawiki: enable securityContext everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978)
[17:11:59] <wikibugs>	 (PS1) Scott French: mediawiki: add securityContext to all containers (attempt 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978)
[17:12:29] <cdanis>	 sukhe: lol we should have waited for vg, I think I edited an obsolete file
[17:12:33] <cdanis>	 ah well
[17:12:53] <sukhe>	 didn't you get his +1?
[17:12:58] <sukhe>	 I thought I saw that
[17:13:00] <cdanis>	 no I thought you gave a +1
[17:13:06] <sukhe>	 oh I didn't lol
[17:13:11] <cdanis>	 yeah ok lol
[17:13:16] <cdanis>	 well there was no change on a cp-text host
[17:13:21] <wikibugs>	 (PS1) Bvibber: Temporarily disable '4K' 2160p and mid 1440p transcodes [mediawiki-config] - https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433)
[17:13:22] <cdanis>	 I'll look into it
[17:13:22] <sukhe>	 sorry for the confusion, i was remarking that the reload is not required
[17:13:25] <cdanis>	 yeah np
[17:13:28] <cdanis>	 my mistake
[17:13:28] <sukhe>	 and I saw another +1 so I confused it with that
[17:13:33] <cdanis>	 didn't sleep great last night lol
[17:13:35] <sukhe>	 there was a +1 right!?
[17:13:39] <sukhe>	 or am I dreaming now
[17:13:55] <cdanis>	 there wasn't
[17:14:03] <cdanis>	 maybe you saw the V+2
[17:14:24] <sukhe>	 ah +1 on https://gerrit.wikimedia.org/r/1049603
[17:14:43] <cdanis>	 yeah, that one was the more important one, and it worked ;)
[17:14:59] <sukhe>	 that counts
[17:15:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] Temporarily disable '4K' 2160p and mid 1440p transcodes [mediawiki-config] - https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) (owner: Bvibber)
[17:18:09] <wikibugs>	 (PS1) CDanis: ats: drop traceparent/tracestate response headers [puppet] - https://gerrit.wikimedia.org/r/1049609 (https://phabricator.wikimedia.org/T368428)
[17:18:43] <cdanis>	 ok now I'm editing the right file lol
[17:24:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T364069)', diff saved to https://phabricator.wikimedia.org/P65423 and previous config saved to /var/cache/conftool/dbconfig/20240625-172440-marostegui.json
[17:24:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[17:24:46] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[17:24:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[17:25:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T364069)', diff saved to https://phabricator.wikimedia.org/P65424 and previous config saved to /var/cache/conftool/dbconfig/20240625-172502-marostegui.json
[17:25:41] <wikibugs>	 (PS1) Eevans: sessionstore2004: Upgrade (canary) to Cassandra 4.1.5 [puppet] - https://gerrit.wikimedia.org/r/1049612 (https://phabricator.wikimedia.org/T354970)
[17:27:30] <wikibugs>	 SRE, Infrastructure Security, Infrastructure-Foundations, Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#9922998 (Dzahn)
[17:27:47] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049612 (https://phabricator.wikimedia.org/T354970) (owner: Eevans)
[17:27:50] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1049502 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[17:28:11] <brett>	 !log Pooling lvs2011 - T368165
[17:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:17] <stashbot>	 T368165: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165
[17:28:34] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2004-dev.codfw.wmnet with OS bookworm
[17:28:55] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1049274 (https://phabricator.wikimedia.org/T368327) (owner: Cwhite)
[17:29:05] <icinga-wm>	 RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:29:12] <sukhe>	 nice
[17:36:01] <wikibugs>	 (PS1) Andrew Bogott: hieradata: Move cloudvirt2004-dev to OVS [puppet] - https://gerrit.wikimedia.org/r/1049614 (https://phabricator.wikimedia.org/T364457)
[17:36:03] <wikibugs>	 (PS1) Andrew Bogott: hieradata: Move cloudvirt2005-dev to OVS [puppet] - https://gerrit.wikimedia.org/r/1049615 (https://phabricator.wikimedia.org/T364457)
[17:36:05] <wikibugs>	 (PS1) Andrew Bogott: hieradata: Move cloudvirt2006-dev to OVS [puppet] - https://gerrit.wikimedia.org/r/1049616 (https://phabricator.wikimedia.org/T364457)
[17:36:52] <brett>	 !log Depooling lvs2011 due to elevated socket/tcp errors - T368165
[17:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:57] <stashbot>	 T368165: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165
[17:37:03] <wikibugs>	 (CR) Andrew Bogott: [C:+2] hieradata: Move cloudvirt2004-dev to OVS [puppet] - https://gerrit.wikimedia.org/r/1049614 (https://phabricator.wikimedia.org/T364457) (owner: Andrew Bogott)
[17:37:04] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS bookworm
[17:38:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) (owner: Bvibber)
[17:39:05] <icinga-wm>	 PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:39:15] <brett>	 ^expected
[17:41:11] <wikibugs>	 (CR) Eevans: [C:+2] sessionstore2004: Upgrade (canary) to Cassandra 4.1.5 [puppet] - https://gerrit.wikimedia.org/r/1049612 (https://phabricator.wikimedia.org/T354970) (owner: Eevans)
[17:43:09] <brett>	 !log Re-re-pooling lvs2011 - T368165
[17:43:10] <wikibugs>	 SRE, Data-Engineering, Dumps-Generation, Data Products (Data Products Sprint 15): Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9923096 (xcollazo) >>! In T368098#9921287, @Ladsgroup wrote: >... >That's around 100M hit...
[17:43:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:15] <stashbot>	 T368165: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165
[17:44:05] <icinga-wm>	 RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:44:08] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2004.codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[17:44:14] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[17:44:43] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15): Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9923108 (10xcollazo) >>! In T368098#9921990, @Ladsgroup wrote: >... >  - Replicas in dump gr...
[17:51:18] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2004.codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002
[17:51:23] <stashbot>	 T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970
[17:52:37] <wikibugs>	 (03PS3) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465)
[17:55:16] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage
[17:57:54] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage
[18:00:04] <jouncebot>	 jeena and jnuche: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T1800).
[18:00:20] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9923213 (10Dzahn) 05In progress→03Stalled Hi Andy, this ticket is currently stalled and waiting for your input to continue before we can merge h...
[18:03:09] <jeena>	 o/
[18:04:09] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049618 (https://phabricator.wikimedia.org/T366956)
[18:04:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049618 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot)
[18:04:56] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049618 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot)
[18:06:15] <topranks>	 !log bringing up link from ssw1-a1-codfw to ssw1-d1-codfw T364095
[18:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:20] <stashbot>	 T364095: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095
[18:07:19] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9923230 (10AndyRussG) >>! In T367681#9923213, @Dzahn wrote: > Hi Andy, this ticket is currently stalled and waiting for your input to continue befor...
[18:07:26] <wikibugs>	 (03CR) 10Scott French: "Alright, I think we're ready for attempt #2. I'll aim to get this out during tomorrow's UTC-late infrastructure window. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[18:08:32] <wikibugs>	 (03CR) 10AndyRussG: "thanks so much for working on this, and many apologies for the delay!" [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková)
[18:12:42] <wikibugs>	 (03CR) 10Dzahn: "ah, so this profile::phorge was used to setup a test instance of phorge before we switched phabricator to phorge upstream. but it's not in" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[18:14:16] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.11  refs T366956
[18:14:21] <stashbot>	 T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956
[18:16:15] <icinga-wm>	 CUSTOM - Host mwlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[18:16:42] <sukhe>	 er?
[18:16:44] <sukhe>	 what is this custom thing?
[18:16:48] <sukhe>	 I missed the memo
[18:17:19] <icinga-wm>	 CUSTOM - Host mwlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[18:17:48] <denisse>	 sukhe: It's me, I'm debugging alerts of that host with mutante.
[18:17:58] <sukhe>	 oh thanks denisse 
[18:18:25] <mutante>	 it's a way to send alerts manually from Icinga web UI
[18:18:31] <mutante>	 without actually taking something down :)
[18:20:12] <sukhe>	 oh interesting
[18:22:28] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2004-dev.codfw.wmnet with OS bookworm
[18:25:27] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] cp5017: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049168 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh)
[18:28:51] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet
[18:31:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS bullseye
[18:31:19] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923318 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS b...
[18:43:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367856)', diff saved to https://phabricator.wikimedia.org/P65425 and previous config saved to /var/cache/conftool/dbconfig/20240625-184349-marostegui.json
[18:43:55] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[18:49:44] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5017.eqsin.wmnet with OS bullseye
[18:49:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS bullseye
[18:50:18] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923367 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bulls...
[18:50:25] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS b...
[18:58:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P65426 and previous config saved to /var/cache/conftool/dbconfig/20240625-185856-marostegui.json
[18:59:11] <wikibugs>	 (03PS2) 10Dzahn: Phabricator: Add safe.directory directives [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[18:59:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Phabricator: Add safe.directory directives [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[19:01:13] <wikibugs>	 (03PS3) 10Dzahn: Phabricator: Add safe.directory directives [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[19:04:24] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] admin: Extend access for AndyRussG (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková)
[19:07:25] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9923444 (10Dzahn) Hey @AndyRussG No worries, and hope you are well / feeling better. There is no particular rush here. We have a couple days until t...
[19:14:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P65428 and previous config saved to /var/cache/conftool/dbconfig/20240625-191403-marostegui.json
[19:14:56] <wikibugs>	 (03PS4) 10Dzahn: Phabricator: Add safe.directory directive [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[19:15:51] <wikibugs>	 (03CR) 10Dzahn: "the arcanist class is only used on toolforge, not on prod phabricator. so this is just a single dir after all" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[19:16:16] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1025478/3069/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[19:23:45] <sukhe>	 !log re-enable puppet on lvs2011
[19:23:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:52] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "config was created but unfortunately won't work as expected since /srv/phab is a symlink" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[19:24:16] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "[phab1004:/srv/phab] $ cat /etc/gitconfig.d/10-safe_directory_phabdir.gitconfig" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[19:25:31] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage
[19:28:02] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage
[19:29:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367856)', diff saved to https://phabricator.wikimedia.org/P65429 and previous config saved to /var/cache/conftool/dbconfig/20240625-192910-marostegui.json
[19:29:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[19:29:16] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[19:29:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[19:29:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[19:29:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[19:29:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T367856)', diff saved to https://phabricator.wikimedia.org/P65430 and previous config saved to /var/cache/conftool/dbconfig/20240625-192947-marostegui.json
[19:32:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "There is another issue here. When the git config is in the home dir of the user running git then it works but the same config in /etc/gitc" [puppet] - 10https://gerrit.wikimedia.org/r/1025478 (https://phabricator.wikimedia.org/T360756) (owner: 10Aklapper)
[19:33:39] <wikibugs>	 (03PS1) 10Cwhite: mediawiki: enable forward of fatal metrics to statsd exporter [puppet] - 10https://gerrit.wikimedia.org/r/1049625 (https://phabricator.wikimedia.org/T356814)
[19:41:45] <wikibugs>	 (03PS1) 10Scott French: kubernetes: promote unavailable replicas alert to critical [alerts] - 10https://gerrit.wikimedia.org/r/1049627 (https://phabricator.wikimedia.org/T366932)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240625T2000). Please do the needful.
[20:00:05] <jouncebot>	 ksarabia and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] <bvibber>	 o/
[20:01:33] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@1eb5f4c]: remove CollaborationKit T368092
[20:01:38] <stashbot>	 T368092: Archive the CollaborationKit extension - https://phabricator.wikimedia.org/T368092
[20:01:40] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@1eb5f4c]: remove CollaborationKit T368092 (duration: 00m 07s)
[20:01:49] <cjming>	 i can deploy o/
[20:01:55] <bvibber>	 whee
[20:02:40] <cjming>	 i'll start with yours bvibber since i don't see ksarabia yet
[20:03:17] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5017.eqsin.wmnet with OS bullseye
[20:03:20] <bvibber>	 ok
[20:03:27] <wikibugs>	 (03PS2) 10Bvibber: Temporarily disable '4K' 2160p and mid 1440p transcodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433)
[20:03:29] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bulls...
[20:03:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) (owner: 10Bvibber)
[20:04:14] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:05:16] <wikibugs>	 (03Merged) 10jenkins-bot: Temporarily disable '4K' 2160p and mid 1440p transcodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049608 (https://phabricator.wikimedia.org/T368433) (owner: 10Bvibber)
[20:05:31] <kimberly_sarabia>	 hi
[20:05:47] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1049608|Temporarily disable '4K' 2160p and mid 1440p transcodes (T368433)]]
[20:05:54] <stashbot>	 T368433: Disable 1440p and 2160p video transcodes until encoding performance is better - https://phabricator.wikimedia.org/T368433
[20:06:04] <cjming>	 hi kim! i'll deploy your patches after bvibber's
[20:06:14] <kimberly_sarabia>	 sounds good
[20:08:39] <logmsgbot>	 !log cjming@deploy1002 cjming, bvibber: Backport for [[gerrit:1049608|Temporarily disable '4K' 2160p and mid 1440p transcodes (T368433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:08:42] <cjming>	 bvibber: is your patch testable? on mwdebug if so
[20:08:58] <bvibber>	 Nah it'll just affect job queue
[20:09:04] <cjming>	 sounds good - will sync
[20:09:07] <bvibber>	 So should be fine as long as it doesn't kill the site ;)
[20:09:11] <logmsgbot>	 !log cjming@deploy1002 cjming, bvibber: Continuing with sync
[20:11:03] <wikibugs>	 (03PS6) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366378)
[20:11:06] <Emperor>	 !log restart swift-proxy on ms-fe2010 ms-fe1011 T360913
[20:11:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:12] <stashbot>	 T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
[20:14:24] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1049608|Temporarily disable '4K' 2160p and mid 1440p transcodes (T368433)]] (duration: 08m 36s)
[20:14:29] <stashbot>	 T368433: Disable 1440p and 2160p video transcodes until encoding performance is better - https://phabricator.wikimedia.org/T368433
[20:14:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson)
[20:14:50] <cjming>	 bvibber: should be live!
[20:14:54] <bvibber>	 thx :D
[20:14:57] <cjming>	 yw!
[20:15:15] <cjming>	 kimberly_sarabia: moving onto your pathes
[20:15:18] <cjming>	 *patches
[20:15:27] <kimberly_sarabia>	 sounds good
[20:15:50] <wikibugs>	 (03Merged) 10jenkins-bot: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson)
[20:16:13] <bvibber>	 [oh, actually i can check that it disabled correctly.... and it looks good :D thx]
[20:16:21] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1041250|Enable dark mode on more pages (T366378 T367374 T366373 T366520 T366373)]]
[20:16:31] <stashbot>	 T366378: [Config change] Enable night theme on preferences pages - https://phabricator.wikimedia.org/T366378
[20:16:32] <stashbot>	 T367374: [Config] Enable dark mode on protect and deletion pages - https://phabricator.wikimedia.org/T367374
[20:16:32] <stashbot>	 T366373: [Config change] Enable night theme on pages which use data tables - https://phabricator.wikimedia.org/T366373
[20:16:32] <stashbot>	 T366520: [Config] Dark mode is not available on Special:ApiSandbox - https://phabricator.wikimedia.org/T366520
[20:17:08] <wikibugs>	 (03PS5) 10Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128)
[20:19:01] <logmsgbot>	 !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1041250|Enable dark mode on more pages (T366378 T367374 T366373 T366520 T366373)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:19:06] <cjming>	 kimberly_sarabia: 1st patch is up on test servers if you want to check
[20:19:27] <kimberly_sarabia>	 ok taking a look
[20:20:10] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Remove acmechief annotations for orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/1049505 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[20:25:26] <cjming>	 kimberly_sarabia: shall i sync?
[20:25:46] <kimberly_sarabia>	 Ok, it looks good except system-messages, protected-pages is not enabled in dark mode for me. I will ask if we either missed something in the patch or if we are not moving forward with that. But go ahead and move forward with the sync
[20:25:56] <kimberly_sarabia>	 thank you
[20:26:06] <cjming>	 alrighty - syncing!
[20:26:10] <logmsgbot>	 !log cjming@deploy1002 jdlrobson, cjming: Continuing with sync
[20:26:57] <icinga-wm>	 PROBLEM - Disk space on mw1446 is CRITICAL: DISK CRITICAL - free space: / 9479 MB (2% inode=99%): /tmp 9479 MB (2% inode=99%): /var/tmp 9479 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops
[20:31:25] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1041250|Enable dark mode on more pages (T366378 T367374 T366373 T366520 T366373)]] (duration: 15m 04s)
[20:31:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson)
[20:31:38] <stashbot>	 T366378: [Config change] Enable night theme on preferences pages - https://phabricator.wikimedia.org/T366378
[20:31:39] <stashbot>	 T367374: [Config] Enable dark mode on protect and deletion pages - https://phabricator.wikimedia.org/T367374
[20:31:39] <stashbot>	 T366373: [Config change] Enable night theme on pages which use data tables - https://phabricator.wikimedia.org/T366373
[20:31:39] <stashbot>	 T366520: [Config] Dark mode is not available on Special:ApiSandbox - https://phabricator.wikimedia.org/T366520
[20:32:12] <cjming>	 kimberly_sarabia: 1st patch should be live - started your 2nd patch
[20:32:28] <wikibugs>	 (03Merged) 10jenkins-bot: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson)
[20:32:58] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1043880|Cleanup: Remove wgNavigationTimingSurveyName (T367128)]]
[20:33:05] <kimberly_sarabia>	 cjming: thank you!
[20:33:05] <stashbot>	 T367128: PHP Deprecated: Use of QuickSurveys survey with link parameter was deprecated in MediaWiki 1.43. [Called from QuickSurveys\SurveyFactory::factoryExternal] - https://phabricator.wikimedia.org/T367128
[20:35:34] <logmsgbot>	 !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1043880|Cleanup: Remove wgNavigationTimingSurveyName (T367128)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:35:57] <cjming>	 kimberly_sarabia: yw! ok to sync 2nd patch?
[20:36:16] <kimberly_sarabia>	 cjming: yup!
[20:36:20] <logmsgbot>	 !log cjming@deploy1002 jdlrobson, cjming: Continuing with sync
[20:37:31] <wikibugs>	 (03PS1) 10Dzahn: phabricator: configure git safedir for all directories [puppet] - 10https://gerrit.wikimedia.org/r/1049637 (https://phabricator.wikimedia.org/T360756)
[20:41:27] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1043880|Cleanup: Remove wgNavigationTimingSurveyName (T367128)]] (duration: 08m 29s)
[20:41:33] <stashbot>	 T367128: PHP Deprecated: Use of QuickSurveys survey with link parameter was deprecated in MediaWiki 1.43. [Called from QuickSurveys\SurveyFactory::factoryExternal] - https://phabricator.wikimedia.org/T367128
[20:41:59] <cjming>	 kimberly_sarabia: and 2nd patch should be live :)
[20:42:25] <kimberly_sarabia>	 cjming: wonderful! thank you so much
[20:42:41] <cjming>	 you're very welcome!
[20:44:47] <cjming>	 !log end of UTC late backport window
[20:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:12] <wikibugs>	 (03CR) 10Krinkle: "For the past ~2 years, the CDN config has been a 1-day fresh TTL with a 7-day stale/keep TTL (i.e. akin to stale-while-revalidate). In ord" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński)
[21:04:10] <wikibugs>	 (03CR) 10Dzahn: "This is somewhat "wrong" but the only way to make things work because the deployment dir path changes on EVERY deploy and only scap knows " [puppet] - 10https://gerrit.wikimedia.org/r/1049637 (https://phabricator.wikimedia.org/T360756) (owner: 10Dzahn)
[21:04:14] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:47:37] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2005-dev.codfw.wmnet with OS bookworm
[21:47:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] hieradata: Move cloudvirt2005-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049615 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott)
[21:57:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T364069)', diff saved to https://phabricator.wikimedia.org/P65431 and previous config saved to /var/cache/conftool/dbconfig/20240625-215705-marostegui.json
[21:57:11] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[22:03:38] <bvibber>	 hmm
[22:03:46] <bvibber>	 av_interleaved_write_frame(): No space left on device
[22:03:46] <bvibber>	 Error writing trailer of transcoded.webm: No space left on device
[22:04:07] <bvibber>	 is that just the quota or are the video job runners out of space :D
[22:05:38] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage
[22:06:23] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Remove acmechief annotations for MX hosts [puppet] - 10https://gerrit.wikimedia.org/r/1049481 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff)
[22:09:08] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage
[22:10:10] <bvibber>	 !log a webVideoTranscode job reported 'No space left on device' from a failed ffmpeg run on mw1446 recently
[22:10:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P65432 and previous config saved to /var/cache/conftool/dbconfig/20240625-221212-marostegui.json
[22:27:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P65433 and previous config saved to /var/cache/conftool/dbconfig/20240625-222719-marostegui.json
[22:33:35] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2005-dev.codfw.wmnet with OS bookworm
[22:42:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T364069)', diff saved to https://phabricator.wikimedia.org/P65434 and previous config saved to /var/cache/conftool/dbconfig/20240625-224226-marostegui.json
[22:42:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[22:42:33] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[22:42:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[22:42:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T364069)', diff saved to https://phabricator.wikimedia.org/P65435 and previous config saved to /var/cache/conftool/dbconfig/20240625-224249-marostegui.json
[22:43:04] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet
[22:44:15] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2006-dev.codfw.wmnet with OS bookworm
[22:47:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:50:32] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:51:20] <bd808>	 mutante: ^ gerrit go boom?
[22:51:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] hieradata: Move cloudvirt2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049616 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott)
[22:51:57] <mutante>	 bd808: yea, but when I started looking it was already back
[22:52:10] <bd808>	 ack
[22:52:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:54:14] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:55:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:58:42] <wikibugs>	 (03PS1) 10Dzahn: gerrit: add another IP to misbehaving bots [puppet] - 10https://gerrit.wikimedia.org/r/1049643
[22:59:19] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] gerrit: add another IP to misbehaving bots [puppet] - 10https://gerrit.wikimedia.org/r/1049643 (owner: 10Dzahn)
[23:00:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:02:36] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage
[23:05:04] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage
[23:27:41] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2006-dev.codfw.wmnet with OS bookworm
[23:35:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367856)', diff saved to https://phabricator.wikimedia.org/P65436 and previous config saved to /var/cache/conftool/dbconfig/20240625-233520-marostegui.json
[23:35:26] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[23:38:23] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1049644
[23:38:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1049644 (owner: 10TrainBranchBot)
[23:50:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P65437 and previous config saved to /var/cache/conftool/dbconfig/20240625-235027-marostegui.json