[00:02:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 845.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:02:44] (03PS1) 10Dzahn: ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) [00:03:05] (03CR) 10Dzahn: "or we do https://gerrit.wikimedia.org/r/1020958 and don't need this ever again" [puppet] - 10https://gerrit.wikimedia.org/r/1020955 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [00:03:06] (03CR) 10CI reject: [V:04-1] ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [00:05:53] (03PS2) 10Dzahn: ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) [00:06:15] (03CR) 10CI reject: [V:04-1] ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [00:06:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 854.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:06:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T352010)', diff saved to 
https://phabricator.wikimedia.org/P60832 and previous config saved to /var/cache/conftool/dbconfig/20240418-000616-ladsgroup.json [00:06:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [00:06:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:06:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [00:06:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T352010)', diff saved to https://phabricator.wikimedia.org/P60833 and previous config saved to /var/cache/conftool/dbconfig/20240418-000639-ladsgroup.json [00:06:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:11:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 811.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:46:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:51:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:53:42] 10ops-eqiad, 06SRE: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841 
(10ops-monitoring-bot) 03NEW [01:13:50] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:41:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:46:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:13:50] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 831.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:38:50] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 831.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:03:50] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:50] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:10:23] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:24:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 826.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:39:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 826.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:40:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 
847.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:45:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 801ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:57:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 845.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:02:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 843.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:15:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 854.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:20:15] (MediaWikiLatencyExceeded) 
resolved: p75 latency high: eqiad mw-parsoid (k8s) 831.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:21:33] (03PS1) 10Ilias Sarantopoulos: ml-services: increase replicas for ruwiki damaging and log payload [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021056 (https://phabricator.wikimedia.org/T362503) [04:36:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 896.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:41:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 834.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:50:17] (03PS2) 10Ilias Sarantopoulos: ml-services: increase replicas for ruwiki damaging and log payload [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021056 (https://phabricator.wikimedia.org/T362503) [05:01:56] (03CR) 10Marostegui: [C:03+1] mariadb: Set up dedicated cumin user [puppet] - 10https://gerrit.wikimedia.org/r/1020830 (owner: 10Ladsgroup) [05:11:09] (03CR) 10Kevin Bazira: [C:03+1] ml-services: increase replicas for ruwiki damaging and log payload [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021056 (https://phabricator.wikimedia.org/T362503) 
(owner: 10Ilias Sarantopoulos) [05:11:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2108', diff saved to https://phabricator.wikimedia.org/P60834 and previous config saved to /var/cache/conftool/dbconfig/20240418-051129-root.json [05:12:17] (03PS1) 10Marostegui: db2108: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1021088 [05:13:34] (03CR) 10Marostegui: [C:03+2] db2108: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1021088 (owner: 10Marostegui) [05:13:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2108.codfw.wmnet with OS bookworm [05:16:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s5 T362668 [05:16:12] T362668: Switchover s5 master (db1183 -> db1230) - https://phabricator.wikimedia.org/T362668 [05:16:16] (03PS1) 10Marostegui: Revert "db2108: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020928 [05:16:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s5 T362668 [05:16:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1230 with weight 0 T362668', diff saved to https://phabricator.wikimedia.org/P60835 and previous config saved to /var/cache/conftool/dbconfig/20240418-051639-arnaudb.json [05:19:59] !log dbmaint Upgrade s7 codfw to Bookworm and MariaDB 10.6 T362745 [05:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:05] T362745: Upgrade s7 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362745 [05:26:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:31:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2108.codfw.wmnet with reason: host 
reimage [05:31:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:34:22] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db1230 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1019777 (https://phabricator.wikimedia.org/T362668) (owner: 10Gerrit maintenance bot) [05:34:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2108.codfw.wmnet with reason: host reimage [05:35:39] !log Starting s5 eqiad failover from db1183 to db1230 - T362668 [05:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:49] T362668: Switchover s5 master (db1183 -> db1230) - https://phabricator.wikimedia.org/T362668 [05:36:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T362668', diff saved to https://phabricator.wikimedia.org/P60836 and previous config saved to /var/cache/conftool/dbconfig/20240418-053657-arnaudb.json [05:38:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1230 to s5 primary and set section read-write T362668', diff saved to https://phabricator.wikimedia.org/P60837 and previous config saved to /var/cache/conftool/dbconfig/20240418-053852-arnaudb.json [05:40:18] (03CR) 10Arnaudb: [C:03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1019778 (https://phabricator.wikimedia.org/T362668) (owner: 10Gerrit maintenance bot) [05:40:27] (03PS2) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1019778 (https://phabricator.wikimedia.org/T362668) [05:40:29] (03CR) 10Arnaudb: [V:03+2 C:03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1019778 (https://phabricator.wikimedia.org/T362668) (owner: 10Gerrit maintenance bot) [05:41:40] (KubernetesRsyslogDown) firing: (2) 
rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:42:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1183 T362668', diff saved to https://phabricator.wikimedia.org/P60838 and previous config saved to /var/cache/conftool/dbconfig/20240418-054247-arnaudb.json [05:42:53] T362668: Switchover s5 master (db1183 -> db1230) - https://phabricator.wikimedia.org/T362668 [05:46:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:50:23] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:50:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:50:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:53:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60840 and previous config saved to /var/cache/conftool/dbconfig/20240418-055335-root.json [05:53:41] (03CR) 10Marostegui: [C:03+2] Revert "db2108: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020928 (owner: 10Marostegui) [05:56:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2108.codfw.wmnet with OS bookworm [05:57:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1183.eqiad.wmnet with reason: upgrade db1183 T360116 
[05:57:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: upgrade db1183 T360116 [05:57:32] T360116: Upgrade s5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T360116 [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T0600). [06:00:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2205.codfw.wmnet with reason: Maintenance [06:00:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2205.codfw.wmnet with reason: Maintenance [06:02:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1183.eqiad.wmnet with OS bookworm [06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60841 and previous config saved to /var/cache/conftool/dbconfig/20240418-060841-root.json [06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:11:40] 
(KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:13:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1183.eqiad.wmnet with reason: host reimage [06:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:15:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: host reimage [06:16:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:20:48] (03PS3) 10Slyngshede: Initial documentation for the Bitu API. [software/bitu] - 10https://gerrit.wikimedia.org/r/1020802 [06:20:54] (03CR) 10Slyngshede: Initial documentation for the Bitu API. 
(038 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1020802 (owner: 10Slyngshede) [06:23:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60842 and previous config saved to /var/cache/conftool/dbconfig/20240418-062346-root.json [06:30:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:30:23] (03PS4) 10Slyngshede: Keymanagement, fix parsing and display of FIDO/U2F keys [software/bitu] - 10https://gerrit.wikimedia.org/r/1020836 [06:30:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:32:01] (03CR) 10Slyngshede: Keymanagement, fix parsing and display of FIDO/U2F keys (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1020836 (owner: 10Slyngshede) [06:34:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance [06:35:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance [06:35:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:35:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:35:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T361627)', diff saved to https://phabricator.wikimedia.org/P60843 and previous config saved to /var/cache/conftool/dbconfig/20240418-063536-marostegui.json [06:35:41] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - 
https://phabricator.wikimedia.org/T361627 [06:36:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1183.eqiad.wmnet with OS bookworm [06:37:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T361627)', diff saved to https://phabricator.wikimedia.org/P60844 and previous config saved to /var/cache/conftool/dbconfig/20240418-063746-marostegui.json [06:38:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60845 and previous config saved to /var/cache/conftool/dbconfig/20240418-063852-root.json [06:39:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [06:39:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [06:41:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 1%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60846 and previous config saved to /var/cache/conftool/dbconfig/20240418-064135-arnaudb.json [06:52:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P60847 and previous config saved to /var/cache/conftool/dbconfig/20240418-065254-marostegui.json [06:53:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60848 and previous config saved to /var/cache/conftool/dbconfig/20240418-065358-root.json [06:56:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 2%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60849 and previous config saved to /var/cache/conftool/dbconfig/20240418-065641-arnaudb.json [06:59:37] (03PS4) 10Msz2001: [plwiki] Limit Content Translation 
publishing to mainspace for non-editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020729 (https://phabricator.wikimedia.org/T362756) [07:00:04] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T0700). [07:00:04] Msz2001: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:19] o/ [07:00:39] I can deploy today [07:01:11] I'm ready for having it deployed [07:01:31] (03CR) 10Urbanecm: [C:03+2] [plwiki] Limit Content Translation publishing to mainspace for non-editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020729 (https://phabricator.wikimedia.org/T362756) (owner: 10Msz2001) [07:03:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020729 (https://phabricator.wikimedia.org/T362756) (owner: 10Msz2001) [07:03:16] (03Merged) 10jenkins-bot: [plwiki] Limit Content Translation publishing to mainspace for non-editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020729 (https://phabricator.wikimedia.org/T362756) (owner: 10Msz2001) [07:04:13] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1020729|[plwiki] Limit Content Translation publishing to mainspace for non-editors (T362756)]] [07:04:18] I may have a patch for eventlogging stream config, if I can fix it up in time. 
[07:04:27] T362756: [plwiki] Limit Content Translation publishing to mainspace for non-editors - https://phabricator.wikimedia.org/T362756 [07:04:28] (03CR) 10Muehlenhoff: [C:03+2] Remove now obsolete Hiera host entries for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020850 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:07:23] !log urbanecm@deploy1002 msz2001 and urbanecm: Backport for [[gerrit:1020729|[plwiki] Limit Content Translation publishing to mainspace for non-editors (T362756)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:07:30] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::bastion: install locales-all [puppet] - 10https://gerrit.wikimedia.org/r/1020906 (https://phabricator.wikimedia.org/T362680) (owner: 10Majavah) [07:07:41] Msz2001: can you test the patch at mwdebug1002, please? [07:08:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P60850 and previous config saved to /var/cache/conftool/dbconfig/20240418-070801-marostegui.json [07:08:04] It works as intended [07:08:15] !log urbanecm@deploy1002 msz2001 and urbanecm: Continuing with sync [07:08:43] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389#9725186 (10LSobanski) Additional BGP WARNING alert that showed up today: ` AS38082/IPv6: Active (for 65d14h), AS5398/IPv6: Active (for 118d16h), A... 
[07:09:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60851 and previous config saved to /var/cache/conftool/dbconfig/20240418-070904-root.json [07:09:43] (03PS3) 10Dzahn: ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) [07:10:01] I'm adding a patch to the calendar [07:10:43] (03PS1) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020929 (https://phabricator.wikimedia.org/T354597) [07:10:43] (03CR) 10Muehlenhoff: "Final round of nits/typos, otherwise good to go" [software/bitu] - 10https://gerrit.wikimedia.org/r/1020802 (owner: 10Slyngshede) [07:11:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 5%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60852 and previous config saved to /var/cache/conftool/dbconfig/20240418-071147-arnaudb.json [07:12:11] (03PS2) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020929 (https://phabricator.wikimedia.org/T354597) [07:12:44] (03CR) 10Dzahn: "contint.wikimedia.org is an alias for contint2002.wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [07:13:44] urbanecm: I added my config patch to the calendar [07:13:53] I can sync it myself, if you don't have time. [07:14:07] (03CR) 10Urbanecm: [C:04-1] WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020929 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [07:14:17] kostajh: i was just reviewing it, can you see the comment please? 
[07:15:09] (03PS3) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020929 (https://phabricator.wikimedia.org/T354597) [07:15:10] (03CR) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020929 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [07:15:31] fixed [07:15:41] (03CR) 10Urbanecm: [C:03+1] WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020929 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [07:15:45] ty, lgtm [07:16:00] waiting for scap to finish scaping [07:16:12] (03CR) 10Urbanecm: [C:03+2] WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020929 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [07:16:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:16:56] (03Merged) 10jenkins-bot: WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020929 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [07:18:34] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1020836 (owner: 10Slyngshede) [07:20:11] kostajh: on second thought, you'll also need to list it here https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/wmf-config/ext-EventLogging.php [07:20:34] * kostajh ah [07:20:45] urbanecm: I'll make a second patch, I guess, as the first was merged? 
[07:20:50] yup [07:20:54] and we can sync them together [07:21:09] k, just a second [07:21:29] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1020729|[plwiki] Limit Content Translation publishing to mainspace for non-editors (T362756)]] (duration: 17m 15s) [07:21:34] T362756: [plwiki] Limit Content Translation publishing to mainspace for non-editors - https://phabricator.wikimedia.org/T362756 [07:21:36] waiting for the other patch now [07:21:41] Msz2001: should be live in production now [07:22:00] Indeed, it is. Thanks! [07:22:40] np [07:23:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T361627)', diff saved to https://phabricator.wikimedia.org/P60853 and previous config saved to /var/cache/conftool/dbconfig/20240418-072309-marostegui.json [07:23:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:23:16] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:23:23] (03PS1) 10Kosta Harlan: ext-EventLogging: Add mediawiki.ip_reputation.score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021338 (https://phabricator.wikimedia.org/T354597) [07:23:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:23:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T361627)', diff saved to https://phabricator.wikimedia.org/P60854 and previous config saved to /var/cache/conftool/dbconfig/20240418-072331-marostegui.json [07:23:42] urbanecm: https://gerrit.wikimedia.org/r/1021338 [07:23:46] I'll add that to the calendar as well [07:23:49] ty [07:23:56] (03CR) 10Urbanecm: [C:03+2] ext-EventLogging: Add mediawiki.ip_reputation.score [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1021338 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [07:24:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021338 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [07:24:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60855 and previous config saved to /var/cache/conftool/dbconfig/20240418-072410-root.json [07:24:15] and let's see [07:24:35] cool [07:24:41] (03Merged) 10jenkins-bot: ext-EventLogging: Add mediawiki.ip_reputation.score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021338 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [07:24:46] urbanecm: I can verify it once it's on the mwdebug servers [07:24:52] sounds good [07:25:13] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1020929|WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) (T354597)]], [[gerrit:1021338|ext-EventLogging: Add mediawiki.ip_reputation.score (T354597)]] [07:25:21] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597 [07:25:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T361627)', diff saved to https://phabricator.wikimedia.org/P60856 and previous config saved to /var/cache/conftool/dbconfig/20240418-072542-marostegui.json [07:26:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60857 and previous config saved to /var/cache/conftool/dbconfig/20240418-072653-arnaudb.json [07:28:09] (03CR) 10DCausse: Add Flink alerts for Cirrus Streaming Updater (036 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) 
(owner: 10Bking) [07:28:16] !log urbanecm@deploy1002 kharlan and urbanecm: Backport for [[gerrit:1020929|WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) (T354597)]], [[gerrit:1021338|ext-EventLogging: Add mediawiki.ip_reputation.score (T354597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:28:22] kostajh: can you take a look? :) [07:28:40] yep [07:29:25] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:31:09] !log upgrading PHP security updates on codfw baremetal servers T362511 [07:31:10] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:47] urbanecm: I don't see an error in Logstash, so that's fine. But I also don't see the topic or event via kafkacat on stat1007 [07:33:04] i guess progress? [07:33:06] (03PS2) 10Jcrespo: dbbackups: Setup dbprov1005 as new host to send s3 and s5 backups [puppet] - 10https://gerrit.wikimedia.org/r/1020750 (https://phabricator.wikimedia.org/T362509) [07:33:36] i also don't see a validation error [07:34:36] kostajh: since it does not error out, i'm inclined to deploy and see whether the topic won't appear later. what do you think? [07:34:46] urbanecm: it's been a while since I used kafkacat, so maybe it's not working the way I remember it. 
[07:34:50] yeah I think deployment is fine [07:34:54] ack, proceeding [07:34:57] !log urbanecm@deploy1002 kharlan and urbanecm: Continuing with sync [07:36:31] (03CR) 10Dzahn: [C:04-1] "I would prefer https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020958 instead but if not this is the alternative to keep it simple f" [puppet] - 10https://gerrit.wikimedia.org/r/1020955 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [07:37:13] (03CR) 10Dzahn: "This would be like the last step of the switch-over, new server becomes new source of data." [puppet] - 10https://gerrit.wikimedia.org/r/1020957 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [07:39:32] (03CR) 10Dzahn: [C:04-1] "for the switch-over window, main lever" [puppet] - 10https://gerrit.wikimedia.org/r/1020954 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [07:39:47] (03PS3) 10Cathal Mooney: LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) [07:40:05] (03CR) 10Dzahn: [C:04-2] "the DNS part of the switch-over" [dns] - 10https://gerrit.wikimedia.org/r/1020951 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [07:40:38] (03CR) 10Dzahn: [C:04-2] "first thing to do next week" [puppet] - 10https://gerrit.wikimedia.org/r/1020950 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [07:40:49] urbanecm: specifically, I would expect `kafkacat -L -b kafka-jumbo1007.eqiad.wmnet:9092 | grep ip_reputation` to show something [07:40:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P60858 and previous config saved to /var/cache/conftool/dbconfig/20240418-074049-marostegui.json [07:41:24] (03CR) 10Dzahn: graphite: switch envoy ssl provider to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) 
[07:41:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 15%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60859 and previous config saved to /var/cache/conftool/dbconfig/20240418-074158-arnaudb.json [07:42:07] (03CR) 10Dzahn: [C:04-1] "once both contint hosts are reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [07:42:11] (I still think it's fine to continue deploying) [07:42:21] kostajh: yeah, me too. not sure what is happening. thinking. [07:43:00] (03CR) 10Dzahn: "yea, agreed. good idea to follow some existing standard. amended and left comment on ticket. because they had already agreed to the other " [dns] - 10https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [07:43:02] urbanecm: ah, I see an error [07:43:08] Yes? Where? [07:43:14] urbanecm: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.04.18?id=Rico8I4BX0U9mJhKRjiO [07:43:39] (03PS1) 10Arnaudb: mariadb: removes db2113 [puppet] - 10https://gerrit.wikimedia.org/r/1020722 (https://phabricator.wikimedia.org/T362792) [07:44:14] ahh, a validation error [07:44:37] "schema title must be mediawiki/ip_reputation/score." [07:44:40] and where do I set that? 
[07:45:29] https://gerrit.wikimedia.org/g/schemas/event/secondary/+/bfa34c3c21b5846ae57d49a0fee32c0d21ca17b1/jsonschema/analytics/mediawiki/ip_reputation/score/1.0.0.yaml, but why does it need to not have analytics in it is what i am wondering [07:46:15] ah, not [07:46:56] (03PS1) 10Urbanecm: EventStreamConfig: Fix stream title for mediawiki.ip_reputation.score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021355 (https://phabricator.wikimedia.org/T354597) [07:47:01] kostajh: ^^ this should be it [07:47:25] ugh [07:47:26] yes [07:47:34] sorry about that [07:47:40] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1020929|WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt) (T354597)]], [[gerrit:1021338|ext-EventLogging: Add mediawiki.ip_reputation.score (T354597)]] (duration: 22m 27s) [07:47:42] np, we both missed it [07:47:45] (03CR) 10Kosta Harlan: [C:03+1] EventStreamConfig: Fix stream title for mediawiki.ip_reputation.score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021355 (https://phabricator.wikimedia.org/T354597) (owner: 10Urbanecm) [07:47:45] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597 [07:47:59] kostajh: unfortunately i need to leave to catch my train to Berlin. can you deploy the (hopefully!) last one? [07:48:03] (03CR) 10Kosta Harlan: [C:03+1] "Follows-up Ic5c5d17ce72689396029452450f66dd271c2e575" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021355 (https://phabricator.wikimedia.org/T354597) (owner: 10Urbanecm) [07:48:09] Yeah I'll do it [07:48:13] ack [07:48:14] safe travels [07:48:21] ty! [07:48:28] urbanecm: scap is done? 
[07:48:33] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2113 [puppet] - 10https://gerrit.wikimedia.org/r/1020722 (https://phabricator.wikimedia.org/T362792) (owner: 10Arnaudb) [07:48:36] kostajh: yes [07:48:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021355 (https://phabricator.wikimedia.org/T354597) (owner: 10Urbanecm) [07:50:29] (03CR) 10Jcrespo: [C:03+2] dbbackups: Setup dbprov1005 as new host to send s3 and s5 backups [puppet] - 10https://gerrit.wikimedia.org/r/1020750 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [07:50:44] (03Merged) 10jenkins-bot: EventStreamConfig: Fix stream title for mediawiki.ip_reputation.score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021355 (https://phabricator.wikimedia.org/T354597) (owner: 10Urbanecm) [07:51:14] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1021355|EventStreamConfig: Fix stream title for mediawiki.ip_reputation.score (T354597)]] [07:51:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 depool', diff saved to https://phabricator.wikimedia.org/P60860 and previous config saved to /var/cache/conftool/dbconfig/20240418-075154-arnaudb.json [07:51:58] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389#9725253 (10cmooney) I'll take a look and clear up what I can. 
[07:52:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2113.codfw.wmnet [07:54:16] !log kharlan@deploy1002 urbanecm and kharlan: Backport for [[gerrit:1021355|EventStreamConfig: Fix stream title for mediawiki.ip_reputation.score (T354597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:54:21] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597 [07:55:04] (03CR) 10Marostegui: [C:03+1] mariadb: removes db2113 [puppet] - 10https://gerrit.wikimedia.org/r/1020722 (https://phabricator.wikimedia.org/T362792) (owner: 10Arnaudb) [07:55:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P60861 and previous config saved to /var/cache/conftool/dbconfig/20240418-075557-marostegui.json [07:57:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60862 and previous config saved to /var/cache/conftool/dbconfig/20240418-075704-arnaudb.json [07:57:07] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [07:58:29] !log kharlan@deploy1002 urbanecm and kharlan: Continuing with sync [07:58:43] kostajh: did it work? [08:00:10] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389#9725264 (10cmooney) 05Open→03Resolved This one in particular down for almost a year and IPs are not responding to ARP/ND on the LAN. Peerin... 
[08:00:34] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2113.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:01:29] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: increase replicas for ruwiki damaging and log payload [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021056 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [08:01:32] urbanecm: hard to say, because I don't see how to easily filter out mwdebug triggered events in the eventgate dashboard, and there are already a few hundred events coming in with failures (from the previous patch sync). So I am syncing and looking to see if the errors stop. If they don't, I'll unset the feature flag to disable until I can work out what is happening. [08:02:27] (03Merged) 10jenkins-bot: ml-services: increase replicas for ruwiki damaging and log payload [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021056 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [08:02:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2113.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:02:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:02:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2113.codfw.wmnet [08:02:49] kostajh: Fingers crossed [08:03:15] urbanecm: meh, just had the thought to look at request ID, and see that my attempt via mwdebug failed [08:03:23] so I'll switch off the feature flag for now, I guess [08:03:59] :-( [08:05:04] 10ops-codfw, 10decommission-hardware, 13Patch-For-Review: decommission db2113.codfw.wmnet - https://phabricator.wikimedia.org/T362792#9725269 (10ABran-WMF) 
[08:05:21] (03PS1) 10Arnaudb: mariadb: removes db2112 [puppet] - 10https://gerrit.wikimedia.org/r/1020723 (https://phabricator.wikimedia.org/T362793) [08:07:47] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:07:50] urbanecm: although, I now see the errors are dropping off in the dashboard as helmfile apply is being run [08:08:16] so it is working, but didn't when I tested via mwdebug a few minutes ago 😕 [08:08:31] Not sure why, but good sign! [08:10:51] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1021355|EventStreamConfig: Fix stream title for mediawiki.ip_reputation.score (T354597)]] (duration: 19m 36s) [08:10:56] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597 [08:11:00] !log mforns@deploy1002 Started deploy [analytics/refinery@be07da9]: Regular analytics weekly train [analytics/refinery@be07da9e] [08:11:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T361627)', diff saved to https://phabricator.wikimedia.org/P60863 and previous config saved to /var/cache/conftool/dbconfig/20240418-081104-marostegui.json [08:11:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1180.eqiad.wmnet with reason: Maintenance [08:11:12] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:11:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1180.eqiad.wmnet with reason: Maintenance [08:11:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T361627)', diff saved to https://phabricator.wikimedia.org/P60864 and previous config saved to /var/cache/conftool/dbconfig/20240418-081127-marostegui.json [08:11:42] (03CR) 10Filippo 
Giunchedi: "Thank you for working on this, change LGTM though PCC fails https://puppet-compiler.wmflabs.org/output/1019885/1982/" [puppet] - 10https://gerrit.wikimedia.org/r/1019885 (owner: 10Dzahn) [08:12:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60865 and previous config saved to /var/cache/conftool/dbconfig/20240418-081210-arnaudb.json [08:12:27] 10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9725288 (10Fabfur) I'd like to join the chorus of thanks to Papaul, you resolved us a very nasty and long running issue here! Thanks a... [08:13:02] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2112 [puppet] - 10https://gerrit.wikimedia.org/r/1020723 (https://phabricator.wikimedia.org/T362793) (owner: 10Arnaudb) [08:13:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T361627)', diff saved to https://phabricator.wikimedia.org/P60866 and previous config saved to /var/cache/conftool/dbconfig/20240418-081338-marostegui.json [08:13:42] !log UTC morning deploys done [08:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:03] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [08:14:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 depool', diff saved to https://phabricator.wikimedia.org/P60867 and previous config saved to /var/cache/conftool/dbconfig/20240418-081439-arnaudb.json [08:15:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2112.codfw.wmnet [08:17:50] (03CR) 10Brouberol: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1020295 (https://phabricator.wikimedia.org/T361688) (owner: 10Stevemunene) [08:23:56] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [08:24:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2027.codfw.wmnet [08:25:07] !log mforns@deploy1002 Finished deploy [analytics/refinery@be07da9]: Regular analytics weekly train [analytics/refinery@be07da9e] (duration: 14m 07s) [08:25:50] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2112.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:26:57] (03PS1) 10Muehlenhoff: Switch es2027 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021372 (https://phabricator.wikimedia.org/T349619) [08:26:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2112.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:26:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:26:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2112.codfw.wmnet [08:27:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60869 and previous config saved to /var/cache/conftool/dbconfig/20240418-082717-arnaudb.json [08:28:41] (03CR) 10Muehlenhoff: [C:03+2] Switch es2027 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021372 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:28:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P60870 and previous config saved to 
/var/cache/conftool/dbconfig/20240418-082845-marostegui.json [08:29:01] 10ops-codfw, 10decommission-hardware, 13Patch-For-Review: decommission db2112.codfw.wmnet - https://phabricator.wikimedia.org/T362793#9725344 (10ABran-WMF) [08:29:15] (03PS1) 10Arnaudb: mariadb: removes db2111 [puppet] - 10https://gerrit.wikimedia.org/r/1020724 (https://phabricator.wikimedia.org/T362794) [08:30:47] (03CR) 10Marostegui: [C:03+1] mariadb: removes db2111 [puppet] - 10https://gerrit.wikimedia.org/r/1020724 (https://phabricator.wikimedia.org/T362794) (owner: 10Arnaudb) [08:31:35] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2111 [puppet] - 10https://gerrit.wikimedia.org/r/1020724 (https://phabricator.wikimedia.org/T362794) (owner: 10Arnaudb) [08:32:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 depool', diff saved to https://phabricator.wikimedia.org/P60871 and previous config saved to /var/cache/conftool/dbconfig/20240418-083245-arnaudb.json [08:34:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2027.codfw.wmnet [08:34:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 depool', diff saved to https://phabricator.wikimedia.org/P60872 and previous config saved to /var/cache/conftool/dbconfig/20240418-083422-arnaudb.json [08:34:26] (03PS4) 10Jon Harald Søby: Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) [08:34:30] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2029.codfw.wmnet [08:34:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2111.codfw.wmnet [08:37:11] (03PS5) 10Jon Harald Søby: Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) [08:37:28] (03CR) 10Jon Harald Søby: Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg (031 
comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) (owner: 10Jon Harald Søby) [08:38:16] (03PS1) 10Muehlenhoff: Switch es2029 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021374 (https://phabricator.wikimedia.org/T349619) [08:38:52] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [08:39:04] (03CR) 10Muehlenhoff: [C:03+2] Switch es2029 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021374 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:40:44] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2111.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:41:00] !log mforns@deploy1002 Started deploy [analytics/refinery@be07da9]: Regular analytics weekly train [analytics/refinery@be07da9e] [08:41:15] !log mforns@deploy1002 Finished deploy [analytics/refinery@be07da9]: Regular analytics weekly train [analytics/refinery@be07da9e] (duration: 00m 15s) [08:41:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2111.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:41:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:41:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2111.codfw.wmnet [08:42:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60873 and previous config saved to /var/cache/conftool/dbconfig/20240418-084223-arnaudb.json [08:42:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2029.codfw.wmnet [08:43:53] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P60874 and previous config saved to /var/cache/conftool/dbconfig/20240418-084353-marostegui.json [08:44:18] 10ops-codfw, 10decommission-hardware, 13Patch-For-Review: decommission db2111.codfw.wmnet - https://phabricator.wikimedia.org/T362794#9725400 (10ABran-WMF) [08:44:39] (03PS1) 10Arnaudb: mariadb: removes db2110 [puppet] - 10https://gerrit.wikimedia.org/r/1020725 (https://phabricator.wikimedia.org/T362795) [08:45:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [08:45:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [08:45:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T356166)', diff saved to https://phabricator.wikimedia.org/P60875 and previous config saved to /var/cache/conftool/dbconfig/20240418-084510-marostegui.json [08:45:19] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [08:51:27] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2110 [puppet] - 10https://gerrit.wikimedia.org/r/1020725 (https://phabricator.wikimedia.org/T362795) (owner: 10Arnaudb) [08:52:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 depool', diff saved to https://phabricator.wikimedia.org/P60876 and previous config saved to /var/cache/conftool/dbconfig/20240418-085235-arnaudb.json [08:54:42] (03CR) 10Marostegui: [C:03+1] mariadb: removes db2110 [puppet] - 10https://gerrit.wikimedia.org/r/1020725 (https://phabricator.wikimedia.org/T362795) (owner: 10Arnaudb) [08:56:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2110', diff saved to https://phabricator.wikimedia.org/P60877 and previous config saved to /var/cache/conftool/dbconfig/20240418-085608-arnaudb.json [08:56:19] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1183 (T356166)', diff saved to https://phabricator.wikimedia.org/P60878 and previous config saved to /var/cache/conftool/dbconfig/20240418-085619-marostegui.json [08:56:25] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [08:57:05] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2034.codfw.wmnet [08:57:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2110.codfw.wmnet [08:59:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T361627)', diff saved to https://phabricator.wikimedia.org/P60879 and previous config saved to /var/cache/conftool/dbconfig/20240418-085900-marostegui.json [08:59:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [08:59:05] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:59:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [08:59:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T361627)', diff saved to https://phabricator.wikimedia.org/P60880 and previous config saved to /var/cache/conftool/dbconfig/20240418-085922-marostegui.json [09:00:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T361627)', diff saved to https://phabricator.wikimedia.org/P60881 and previous config saved to /var/cache/conftool/dbconfig/20240418-090032-marostegui.json [09:01:20] (03PS1) 10Muehlenhoff: Switch es2034 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021377 (https://phabricator.wikimedia.org/T349619) [09:01:22] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [09:02:18] 
10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9725479 (10cmooney) >>! In T350179#9725288, @Fabfur wrote: > I'd like to join the chorus of thanks to Papaul, you resolved us a very n... [09:02:51] !log mforns@deploy1002 Started deploy [analytics/refinery@be07da9] (thin): Regular analytics weekly train THIN [analytics/refinery@be07da9e] [09:03:40] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2110.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:03:57] (03CR) 10Muehlenhoff: [C:03+2] Switch es2034 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021377 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:04:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2110.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:04:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:04:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2110.codfw.wmnet [09:06:37] !log mforns@deploy1002 Finished deploy [analytics/refinery@be07da9] (thin): Regular analytics weekly train THIN [analytics/refinery@be07da9e] (duration: 03m 45s) [09:06:47] !log mforns@deploy1002 Started deploy [analytics/refinery@be07da9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@be07da9e] [09:06:56] (03PS1) 10Arnaudb: mariadb: removes db2109 [puppet] - 10https://gerrit.wikimedia.org/r/1021386 (https://phabricator.wikimedia.org/T362796) [09:07:08] 10ops-codfw, 10decommission-hardware, 13Patch-For-Review: decommission db2110.codfw.wmnet - https://phabricator.wikimedia.org/T362795#9725499 
(10ABran-WMF) [09:07:09] (03PS1) 10Brouberol: datasets-config: create public wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1021380 (https://phabricator.wikimedia.org/T357434) [09:07:12] (03PS1) 10Brouberol: datasets-config: create private servcice record [dns] - 10https://gerrit.wikimedia.org/r/1021381 (https://phabricator.wikimedia.org/T357434) [09:07:27] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9725493 (10jcrespo) Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056), it is [[ https://aler... [09:07:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T352010)', diff saved to https://phabricator.wikimedia.org/P60882 and previous config saved to /var/cache/conftool/dbconfig/20240418-090744-ladsgroup.json [09:07:51] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:07:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2034.codfw.wmnet [09:08:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1028.eqiad.wmnet [09:09:34] !log mforns@deploy1002 Finished deploy [analytics/refinery@be07da9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@be07da9e] (duration: 02m 46s) [09:10:04] (03PS1) 10Muehlenhoff: Switch es1028 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021382 (https://phabricator.wikimedia.org/T349619) [09:11:08] (03CR) 10Alexandros Kosiaris: "I am a bit ambivalent too. I find myself clicking on the Grafana dashboard URL in IRC often enough. However I rarely do that for runbooks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019844 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [09:11:12] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2109 [puppet] - 10https://gerrit.wikimedia.org/r/1021386 (https://phabricator.wikimedia.org/T362796) (owner: 10Arnaudb) [09:11:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P60883 and previous config saved to /var/cache/conftool/dbconfig/20240418-091126-marostegui.json [09:11:43] (03CR) 10Muehlenhoff: [C:03+2] Switch es1028 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021382 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:12:44] (03CR) 10Alexandros Kosiaris: [C:03+1] "I like this one, it would suit my workflow better." [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [09:12:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2109', diff saved to https://phabricator.wikimedia.org/P60884 and previous config saved to /var/cache/conftool/dbconfig/20240418-091251-arnaudb.json [09:13:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2109.codfw.wmnet [09:14:14] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9725545 (10cmooney) >>! In T362421#9725493, @jcrespo wrote: > Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a815... 
[09:14:50] (PS1) Brouberol: trafficserver: Add CDN config for datasets-config.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) [09:15:10] (CR) CI reject: [V:-1] trafficserver: Add CDN config for datasets-config.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: Brouberol) [09:15:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P60885 and previous config saved to /var/cache/conftool/dbconfig/20240418-091541-marostegui.json [09:17:49] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [09:18:34] (PS1) Slyngshede: CloudIDM, Install Bitu for labtest [puppet] - https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) [09:19:46] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2109.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:20:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2109.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:20:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:20:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2109.codfw.wmnet [09:21:03] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1983/console" [puppet] - https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede) [09:22:22] (CR) Muehlenhoff: CloudIDM, Install Bitu for labtest (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede) [09:22:46] (PS1) Majavah: hieradata: Update striker_toolsbeta database name [puppet] - https://gerrit.wikimedia.org/r/1021407 (https://phabricator.wikimedia.org/T360149) [09:22:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1028.eqiad.wmnet [09:22:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P60886 and previous config saved to /var/cache/conftool/dbconfig/20240418-092252-ladsgroup.json [09:23:04] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2109.codfw.wmnet - https://phabricator.wikimedia.org/T362796#9725570 (ABran-WMF) [09:23:11] (PS1) Arnaudb: mariadb: removes db2108 [puppet] - https://gerrit.wikimedia.org/r/1021387 (https://phabricator.wikimedia.org/T362797) [09:24:01] (CR) Arnaudb: [C:+2] mariadb: removes db2108 [puppet] - https://gerrit.wikimedia.org/r/1021387 (https://phabricator.wikimedia.org/T362797) (owner: Arnaudb) [09:24:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1034.eqiad.wmnet [09:25:00] (CR) Majavah: CloudIDM, Install Bitu for labtest (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede) [09:25:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2108', diff saved to https://phabricator.wikimedia.org/P60887 and previous config saved to /var/cache/conftool/dbconfig/20240418-092504-arnaudb.json [09:25:18] !log mforns@deploy1002 Started deploy [analytics/refinery@be07da9]: Regular analytics weekly train [analytics/refinery@be07da9e] [09:25:27] (CR) Majavah: [C:+2] hieradata: Update striker_toolsbeta database name [puppet] - https://gerrit.wikimedia.org/r/1021407 (https://phabricator.wikimedia.org/T360149) (owner: Majavah)
[09:25:38] !log mforns@deploy1002 Finished deploy [analytics/refinery@be07da9]: Regular analytics weekly train [analytics/refinery@be07da9e] (duration: 00m 20s) [09:25:42] (CR) Alexandros Kosiaris: Add datasets-config helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [09:25:45] (CR) Muehlenhoff: CloudIDM, Install Bitu for labtest (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede) [09:27:08] (PS1) Muehlenhoff: Switch es1034 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1021408 (https://phabricator.wikimedia.org/T349619) [09:27:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2108', diff saved to https://phabricator.wikimedia.org/P60888 and previous config saved to /var/cache/conftool/dbconfig/20240418-092718-arnaudb.json [09:27:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P60889 and previous config saved to /var/cache/conftool/dbconfig/20240418-092728-marostegui.json [09:27:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2108.codfw.wmnet [09:27:57] (CR) Slyngshede: [V:+1] CloudIDM, Install Bitu for labtest (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede) [09:30:14] (PS4) Slyngshede: Initial documentation for the Bitu API. [software/bitu] - https://gerrit.wikimedia.org/r/1020802 [09:30:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P60890 and previous config saved to /var/cache/conftool/dbconfig/20240418-093049-marostegui.json [09:31:01] (CR) Slyngshede: Initial documentation for the Bitu API.
(4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/1020802 (owner: Slyngshede) [09:31:40] (CR) Alexandros Kosiaris: [C:-1] "LGTM, couple of inline comments that will also fix CI." [deployment-charts] - https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: TChin) [09:32:32] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [09:34:29] (CR) Muehlenhoff: [C:+2] Switch es1034 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1021408 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [09:34:44] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2108.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:35:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2108.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:35:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:35:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2108.codfw.wmnet [09:36:06] (CR) JMeybohm: [C:+2] kubernetes::node: Remove apparmor cleanup code [puppet] - https://gerrit.wikimedia.org/r/1020803 (https://phabricator.wikimedia.org/T326785) (owner: JMeybohm) [09:38:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P60891 and previous config saved to /var/cache/conftool/dbconfig/20240418-093759-ladsgroup.json [09:38:40] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2108.codfw.wmnet - https://phabricator.wikimedia.org/T362797#9725606 (ABran-WMF) [09:38:40] (CR) Urbanecm: [C:+1] "lgtm" [mediawiki-config] -
10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) (owner: 10Jon Harald Søby) [09:39:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1034.eqiad.wmnet [09:40:32] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9725610 (10MoritzMuehlenhoff) [09:42:36] (03PS1) 10Fabfur: haproxy/benthos: uppercase keyx parameter in X-Analytics-TLS hdr [puppet] - 10https://gerrit.wikimedia.org/r/1021412 (https://phabricator.wikimedia.org/T358109) [09:42:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T356166)', diff saved to https://phabricator.wikimedia.org/P60892 and previous config saved to /var/cache/conftool/dbconfig/20240418-094235-marostegui.json [09:42:43] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [09:43:17] (03CR) 10Brouberol: [C:03+1] "Sorry it took a while to review, I was on PTO." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423) (owner: JMeybohm) [09:45:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T361627)', diff saved to https://phabricator.wikimedia.org/P60893 and previous config saved to /var/cache/conftool/dbconfig/20240418-094556-marostegui.json [09:45:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1201.eqiad.wmnet with reason: Maintenance [09:46:04] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:46:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1201.eqiad.wmnet with reason: Maintenance [09:46:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T361627)', diff saved to https://phabricator.wikimedia.org/P60894 and previous config saved to /var/cache/conftool/dbconfig/20240418-094619-marostegui.json [09:46:24] (CR) Brouberol: deployment_server: Change Puppet query for ML Cassandra Clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: Klausman) [09:46:43] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1087.eqiad.wmnet [09:48:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T361627)', diff saved to https://phabricator.wikimedia.org/P60895 and previous config saved to /var/cache/conftool/dbconfig/20240418-094830-marostegui.json [09:48:51] (CR) Gmodena: [C:+1] haproxy/benthos: uppercase keyx parameter in X-Analytics-TLS hdr [puppet] - https://gerrit.wikimedia.org/r/1021412 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [09:50:30] SRE, conftool, Data-Persistence, Infrastructure-Foundations:
Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9725627 (Volans) >>! In T360029#9723385, @CDanis wrote: > @Marostegui As it turns out, plain old `confctl` can be used to do this already.... [09:51:53] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1020802 (owner: Slyngshede) [09:53:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T352010)', diff saved to https://phabricator.wikimedia.org/P60896 and previous config saved to /var/cache/conftool/dbconfig/20240418-095308-ladsgroup.json [09:53:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: Maintenance [09:53:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:53:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: Maintenance [09:53:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T352010)', diff saved to https://phabricator.wikimedia.org/P60897 and previous config saved to /var/cache/conftool/dbconfig/20240418-095331-ladsgroup.json [09:53:55] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:46] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1087.eqiad.wmnet [09:57:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host contint1002.wikimedia.org [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T1000) [10:00:27] (CR) Muehlenhoff: CloudIDM, Install Bitu for labtest (2 comments)
[puppet] - https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede) [10:01:30] (PS1) Muehlenhoff: Switch contint1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1021414 (https://phabricator.wikimedia.org/T349619) [10:02:59] (CR) Muehlenhoff: [C:+2] Switch contint1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1021414 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [10:03:13] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9725692 (dcaro) So far manual tests on the hard drive have been unable to trigger the issue and increase the count... [10:03:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P60898 and previous config saved to /var/cache/conftool/dbconfig/20240418-100338-marostegui.json [10:04:58] (PS1) Clément Goubert: php7.4-fpm: Actually use FPM__request_terminate_timeout [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021415 (https://phabricator.wikimedia.org/T358308) [10:08:02] (PS2) Clément Goubert: php7.4-fpm: Actually use FPM__request_terminate_timeout [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021415 (https://phabricator.wikimedia.org/T358308) [10:08:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host contint1002.wikimedia.org [10:09:28] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for kgraessle - https://phabricator.wikimedia.org/T362812#9725704 (DMburugu) I approve [10:10:10] (PS3) Clément Goubert: php7.4-fpm: Actually use FPM__request_terminate_timeout [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021415 (https://phabricator.wikimedia.org/T358308)
[10:10:55] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9725707 (MoritzMuehlenhoff) [10:12:06] (CR) Klausman: [V:+1] deployment_server: Change Puppet query for ML Cassandra Clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: Klausman) [10:12:07] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1020881 (https://phabricator.wikimedia.org/T362812) (owner: Ssingh) [10:12:56] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1020879 (https://phabricator.wikimedia.org/T362731) (owner: Ssingh) [10:13:40] (KubernetesRsyslogDown) firing: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:17:35] (PS2) Klausman: team-ml: Add alerting rule for high error rate in LW services [alerts] - https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) [10:17:43] (CR) Vgutierrez: [C:+1] haproxy/benthos: uppercase keyx parameter in X-Analytics-TLS hdr [puppet] - https://gerrit.wikimedia.org/r/1021412 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [10:18:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:18:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on
db1183.eqiad.wmnet with reason: Maintenance [10:18:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T355609)', diff saved to https://phabricator.wikimedia.org/P60899 and previous config saved to /var/cache/conftool/dbconfig/20240418-101841-marostegui.json [10:18:46] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:18:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P60900 and previous config saved to /var/cache/conftool/dbconfig/20240418-101852-marostegui.json [10:20:41] (03PS1) 10Clément Goubert: mediawiki: Fix php.timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021418 (https://phabricator.wikimedia.org/T358308) [10:23:13] (03CR) 10Ladsgroup: [C:03+1] "Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [10:24:10] (03CR) 10Fabfur: [C:03+2] haproxy/benthos: uppercase keyx parameter in X-Analytics-TLS hdr [puppet] - 10https://gerrit.wikimedia.org/r/1021412 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:25:38] !log Depooling mw2302.codfw.wmnet,mw2303.codfw.wmnet,mw2304.codfw.wmnet,mw2332.codfw.wmnet,mw2333.codfw.wmnet,mw2334.codfw.wmnet for reimage - T351074 [10:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:46] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [10:25:48] (03CR) 10Kamila Součková: [C:03+1] "that explains things! 
:D" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021415 (https://phabricator.wikimedia.org/T358308) (owner: 10Clément Goubert) [10:26:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T355609)', diff saved to https://phabricator.wikimedia.org/P60901 and previous config saved to /var/cache/conftool/dbconfig/20240418-102609-marostegui.json [10:26:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:27:30] (03PS1) 10Ilias Sarantopoulos: ml-services: fix revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021420 (https://phabricator.wikimedia.org/T362853) [10:28:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2318:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2318 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:29:21] (03CR) 10Clément Goubert: [C:03+2] kubernetes: move 6 appservers from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1020852 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:29:22] (03PS1) 10Ilias Sarantopoulos: ml-services: enable payload logging in revscoring-damaging in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021421 (https://phabricator.wikimedia.org/T362503) [10:30:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2020.codfw.wmnet [10:31:13] (03PS1) 10Muehlenhoff: Switch es2020 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021422 (https://phabricator.wikimedia.org/T349619) [10:34:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T361627)', diff saved to https://phabricator.wikimedia.org/P60902 and previous config saved to /var/cache/conftool/dbconfig/20240418-103359-marostegui.json [10:34:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 
4:00:00 on db1224.eqiad.wmnet with reason: Maintenance [10:34:06] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:34:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1224.eqiad.wmnet with reason: Maintenance [10:34:16] (03CR) 10Muehlenhoff: [C:03+2] Switch es2020 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021422 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:34:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T361627)', diff saved to https://phabricator.wikimedia.org/P60903 and previous config saved to /var/cache/conftool/dbconfig/20240418-103422-marostegui.json [10:35:36] (03CR) 10Kevin Bazira: [C:03+1] ml-services: fix revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021420 (https://phabricator.wikimedia.org/T362853) (owner: 10Ilias Sarantopoulos) [10:36:09] (03CR) 10Kevin Bazira: [C:03+1] ml-services: enable payload logging in revscoring-damaging in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021421 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [10:37:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2302.codfw.wmnet with OS bullseye [10:38:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2303.codfw.wmnet with OS bullseye [10:38:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2304.codfw.wmnet with OS bullseye [10:38:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2332.codfw.wmnet with OS bullseye [10:39:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2333.codfw.wmnet with OS bullseye [10:39:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host 
es2020.codfw.wmnet [10:39:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T361627)', diff saved to https://phabricator.wikimedia.org/P60904 and previous config saved to /var/cache/conftool/dbconfig/20240418-103933-marostegui.json [10:39:40] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:39:48] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2334.codfw.wmnet with OS bullseye [10:40:14] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2021.codfw.wmnet [10:41:14] (03PS1) 10Muehlenhoff: Switch es2021 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021425 (https://phabricator.wikimedia.org/T349619) [10:41:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P60905 and previous config saved to /var/cache/conftool/dbconfig/20240418-104117-marostegui.json [10:42:34] (03PS4) 10Dreamrimmer: Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250) [10:42:40] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: enable payload logging in revscoring-damaging in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021421 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [10:42:58] (03CR) 10Muehlenhoff: [C:03+2] Switch es2021 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021425 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:43:06] (03CR) 10Clément Goubert: [V:03+2 C:03+2] php7.4-fpm: Actually use FPM__request_terminate_timeout [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021415 (https://phabricator.wikimedia.org/T358308) (owner: 10Clément Goubert) [10:43:37] 
(03Merged) 10jenkins-bot: ml-services: enable payload logging in revscoring-damaging in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021421 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [10:44:43] (03PS4) 10Ladsgroup: mariadb: Set up dedicated cumin user [puppet] - 10https://gerrit.wikimedia.org/r/1020830 [10:44:48] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Set up dedicated cumin user [puppet] - 10https://gerrit.wikimedia.org/r/1020830 (owner: 10Ladsgroup) [10:45:15] !log Rebuild php7.4-fpm production images - T358308 [10:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:21] T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes - https://phabricator.wikimedia.org/T358308 [10:46:15] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:47:14] That's me, transient because of reimages [10:47:25] (03CR) 10Kamila Součková: [C:03+1] mediawiki: Fix php.timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021418 (https://phabricator.wikimedia.org/T358308) (owner: 10Clément Goubert) [10:48:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2021.codfw.wmnet [10:48:51] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:23] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Fix php.timeout [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1021418 (https://phabricator.wikimedia.org/T358308) (owner: 10Clément Goubert) [10:50:40] (03Merged) 10jenkins-bot: mediawiki: Fix php.timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021418 (https://phabricator.wikimedia.org/T358308) (owner: 10Clément Goubert) [10:51:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:52:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2022.codfw.wmnet [10:52:15] !log cgoubert@deploy1002 Started scap: Redeploy mw-on-k8s with full rebuild - Fix setting php.timeout - T358308 [10:52:21] T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes - https://phabricator.wikimedia.org/T358308 [10:54:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2303.codfw.wmnet with reason: host reimage [10:54:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2302.codfw.wmnet with reason: host reimage [10:54:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P60906 and previous config saved to /var/cache/conftool/dbconfig/20240418-105441-marostegui.json [10:54:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2332.codfw.wmnet with reason: host reimage [10:55:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2304.codfw.wmnet with reason: host reimage [10:55:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2333.codfw.wmnet with reason: host reimage [10:56:06] !log 
cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2334.codfw.wmnet with reason: host reimage [10:56:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P60907 and previous config saved to /var/cache/conftool/dbconfig/20240418-105624-marostegui.json [10:57:06] (03PS1) 10Muehlenhoff: Switch es2022 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021426 (https://phabricator.wikimedia.org/T349619) [10:57:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2303.codfw.wmnet with reason: host reimage [10:57:44] (03CR) 10Muehlenhoff: [C:03+2] Switch es2022 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021426 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:58:18] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1020798 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [11:00:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2332.codfw.wmnet with reason: host reimage [11:01:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2022.codfw.wmnet [11:02:25] (03PS1) 10Clément Goubert: mw-debug: fix php.timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021427 (https://phabricator.wikimedia.org/T358308) [11:02:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2334.codfw.wmnet with reason: host reimage [11:03:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1020.eqiad.wmnet [11:05:30] (03PS1) 10Muehlenhoff: Switch es1020 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021428 (https://phabricator.wikimedia.org/T349619) [11:05:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2304.codfw.wmnet with reason: host 
reimage [11:08:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2333.codfw.wmnet with reason: host reimage [11:09:25] (03CR) 10Muehlenhoff: [C:03+2] Switch es1020 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021428 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:09:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P60908 and previous config saved to /var/cache/conftool/dbconfig/20240418-110950-marostegui.json [11:10:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2302.codfw.wmnet with reason: host reimage [11:11:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T355609)', diff saved to https://phabricator.wikimedia.org/P60909 and previous config saved to /var/cache/conftool/dbconfig/20240418-111132-marostegui.json [11:11:38] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:13:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1020.eqiad.wmnet [11:15:23] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:16:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2303.codfw.wmnet with OS bullseye [11:17:46] 06SRE, 07LDAP: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650#9725901 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [11:17:52] 06SRE, 07LDAP: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650#9725902 
(10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [11:18:11] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1021.eqiad.wmnet [11:20:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2332.codfw.wmnet with OS bullseye [11:21:03] (03PS1) 10Muehlenhoff: Switch es1021 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021430 (https://phabricator.wikimedia.org/T349619) [11:21:51] (03CR) 10Muehlenhoff: [C:03+2] Switch es1021 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021430 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:23:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2334.codfw.wmnet with OS bullseye [11:25:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T361627)', diff saved to https://phabricator.wikimedia.org/P60910 and previous config saved to /var/cache/conftool/dbconfig/20240418-112459-marostegui.json [11:25:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [11:25:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2304.codfw.wmnet with OS bullseye [11:25:10] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:25:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [11:26:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 857.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:26:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1021.eqiad.wmnet [11:27:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:28:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance [11:28:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:28:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance [11:28:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P60911 and previous config saved to /var/cache/conftool/dbconfig/20240418-112816-ladsgroup.json [11:28:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:28:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T361627)', diff saved to https://phabricator.wikimedia.org/P60912 and previous config saved to /var/cache/conftool/dbconfig/20240418-112827-marostegui.json [11:28:42] (03CR) 10EoghanGaffney: [C:03+2] phabricator: Switch certificate generation to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [11:29:19] !log cgoubert@deploy1002 Finished scap: Redeploy mw-on-k8s with full rebuild - Fix setting php.timeout - T358308 (duration: 37m 04s) [11:29:25] T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes - https://phabricator.wikimedia.org/T358308 [11:30:17] (03CR) 10Clément Goubert: [C:03+2] mw-debug: fix php.timeout [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1021427 (https://phabricator.wikimedia.org/T358308) (owner: 10Clément Goubert) [11:30:28] 10ops-eqiad, 06DC-Ops: hw troubleshooting: disk failure for an-worker1087 - https://phabricator.wikimedia.org/T362871 (10BTullis) 03NEW [11:30:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2302.codfw.wmnet with OS bullseye [11:30:36] 10ops-eqiad, 06DC-Ops: hw troubleshooting: disk failure for an-worker1087 - https://phabricator.wikimedia.org/T362871#9725969 (10BTullis) [11:30:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T361627)', diff saved to https://phabricator.wikimedia.org/P60913 and previous config saved to /var/cache/conftool/dbconfig/20240418-113037-marostegui.json [11:30:43] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:31:11] (03Merged) 10jenkins-bot: mw-debug: fix php.timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021427 (https://phabricator.wikimedia.org/T358308) (owner: 10Clément Goubert) [11:31:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 821.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:31:49] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:32:12] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:32:18] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:33:58] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:34:09] 
(03CR) 10Klausman: [C:03+1] Revert "Set ipv6dualstack for ml-staging-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1020738 (owner: 10Elukey) [11:34:37] (03CR) 10Klausman: [C:03+1] ml-services: fix revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021420 (https://phabricator.wikimedia.org/T362853) (owner: 10Ilias Sarantopoulos) [11:35:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2333.codfw.wmnet with OS bullseye [11:36:20] (03PS5) 10NMW03: Added extendedconfirmed and templateeditor rights to dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019779 (https://phabricator.wikimedia.org/T361461) [11:36:25] 06SRE, 06Traffic, 06Wikimedia Enterprise: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628#9725986 (10Ottomata) 05Resolved→03Declined Hello! I don't think this task is resolved. Perhaps you meant to decline it? Being bold and d... [11:36:27] 10ops-eqiad, 06DC-Ops: hw troubleshooting: disk failure for an-worker1087 - https://phabricator.wikimedia.org/T362871#9726001 (10BTullis) The filesystem on the drive is unmounted and commented out from `/etc/fstab`, so the disk is out of service and can be hot-swapped at any time. 
[11:38:55] (03CR) 10Klausman: [C:03+1] ml-services: add logo-detection isvc to experimental namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [11:39:56] (03CR) 10Btullis: [C:03+2] Update yarn scheduler's queues configuration [puppet] - 10https://gerrit.wikimedia.org/r/1019683 (https://phabricator.wikimedia.org/T361499) (owner: 10Joal) [11:40:15] (03CR) 10Btullis: [C:03+2] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1019683 (https://phabricator.wikimedia.org/T361499) (owner: 10Joal) [11:41:10] (03CR) 10JMeybohm: "As mentioned on IRC, I would very much prefer a solution that works for all cassandra instances and not just one particular one." [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [11:42:04] !log Running homer 'cr*codfw*' commit 'T351074' [11:42:07] (03CR) 10Klausman: [C:03+1] knative-serving: move net_istio configs to a dict [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020808 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [11:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:12] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:44:13] (03CR) 10NMW03: "You can review it now, if you want to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019779 (https://phabricator.wikimedia.org/T361461) (owner: 10NMW03) [11:44:35] (03CR) 10Muehlenhoff: "Looks good in general! 
Two comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [11:45:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P60914 and previous config saved to /var/cache/conftool/dbconfig/20240418-114544-marostegui.json [11:50:17] (03PS2) 10Ilias Sarantopoulos: ml-services: fix revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021420 (https://phabricator.wikimedia.org/T362853) [11:52:03] !log Pooling and uncordoning mw2302.codfw.wmnet,mw2303.codfw.wmnet,mw2304.codfw.wmnet,mw2332.codfw.wmnet,mw2333.codfw.wmnet,mw2334.codfw.wmnet - T351074 [11:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:08] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:52:09] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2302.codfw.wmnet|mw2303.codfw.wmnet|mw2304.codfw.wmnet|mw2332.codfw.wmnet|mw2333.codfw.wmnet|mw2334.codfw.wmnet),cluster=kubernetes,service=kubesvc [11:53:43] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: fix revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021420 (https://phabricator.wikimedia.org/T362853) (owner: 10Ilias Sarantopoulos) [11:54:22] !log upgrading PHP security updates on eqiad baremetal servers T362511 [11:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:04] (03Merged) 10jenkins-bot: ml-services: fix revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021420 (https://phabricator.wikimedia.org/T362853) (owner: 10Ilias Sarantopoulos) [11:56:29] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[11:59:02] (03PS2) 10Brouberol: trafficserver: Add CDN config for datasets-config.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T1200) [12:00:35] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:00:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P60915 and previous config saved to /var/cache/conftool/dbconfig/20240418-120051-marostegui.json [12:02:32] !log installing PHP 8.2 security updates [12:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:16] (03CR) 10Vgutierrez: [C:04-1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [12:04:37] (03PS3) 10Brouberol: trafficserver: Add CDN config for datasets-config.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) [12:05:02] (03CR) 10Brouberol: "I've updated the commit message to make it clear that this change depends on the CR provisioning the DNS record." 
[puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [12:06:17] !log Switching phab1004 to use cfssl issued ssl cert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020190 [12:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:04] (03CR) 10Vgutierrez: [C:03+1] "looking good, just a few typos on the commit msg:" [puppet] - 10https://gerrit.wikimedia.org/r/1020798 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [12:08:03] !log depool ncredir2001 [12:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host matomo1003.eqiad.wmnet [12:10:54] (03CR) 10Brouberol: "Could we use `group by certname` in the PQL query to perform the same operation for all clusters, and then inject that data (maybe massage" [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [12:11:59] (03CR) 10Klausman: [V:03+1] "I've been trying to do something similar, but failed so far." [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [12:13:22] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:14:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1003.eqiad.wmnet [12:14:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host crm2001.codfw.wmnet [12:15:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host crm2001.codfw.wmnet [12:15:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T361627)', diff saved to https://phabricator.wikimedia.org/P60916 and previous config saved to /var/cache/conftool/dbconfig/20240418-121559-marostegui.json [12:16:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [12:16:04] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:16:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [12:17:47] (03PS1) 10Arnaudb: mariadb: removes db2107 [puppet] - 10https://gerrit.wikimedia.org/r/1021388 (https://phabricator.wikimedia.org/T362798) [12:18:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:18:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:18:42] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2107 [puppet] - 10https://gerrit.wikimedia.org/r/1021388 (https://phabricator.wikimedia.org/T362798) (owner: 10Arnaudb) [12:21:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2107 depool', diff saved to https://phabricator.wikimedia.org/P60917 and previous config saved to /var/cache/conftool/dbconfig/20240418-122122-arnaudb.json [12:21:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for 
hosts db2107.codfw.wmnet [12:22:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2124.codfw.wmnet with reason: Maintenance [12:22:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2124.codfw.wmnet with reason: Maintenance [12:22:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T361627)', diff saved to https://phabricator.wikimedia.org/P60918 and previous config saved to /var/cache/conftool/dbconfig/20240418-122227-marostegui.json [12:22:33] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:26:05] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [12:27:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T361627)', diff saved to https://phabricator.wikimedia.org/P60919 and previous config saved to /var/cache/conftool/dbconfig/20240418-122721-marostegui.json [12:27:55] (03CR) 10Elukey: [C:03+2] Revert "Set ipv6dualstack for ml-staging-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1020738 (owner: 10Elukey) [12:28:06] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2107.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [12:28:26] (03PS1) 10Filippo Giunchedi: jaeger: remove es-index-cleaner image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021458 (https://phabricator.wikimedia.org/T344953) [12:28:36] (03PS1) 10Filippo Giunchedi: jaeger: update builder image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021459 (https://phabricator.wikimedia.org/T362719) [12:28:39] (03PS1) 10Filippo Giunchedi: jaeger: update query/collector images [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1021460 (https://phabricator.wikimedia.org/T362719) [12:29:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2107.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [12:29:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:29:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2107.codfw.wmnet [12:31:05] 10ops-codfw, 10decommission-hardware, 13Patch-For-Review: decommission db2107.codfw.wmnet - https://phabricator.wikimedia.org/T362798#9726119 (10ABran-WMF) [12:31:33] (03PS1) 10Arnaudb: mariadb: removes db2106 [puppet] - 10https://gerrit.wikimedia.org/r/1021389 (https://phabricator.wikimedia.org/T362799) [12:31:36] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9726125 (10MoritzMuehlenhoff) [12:32:12] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2106 [puppet] - 10https://gerrit.wikimedia.org/r/1021389 (https://phabricator.wikimedia.org/T362799) (owner: 10Arnaudb) [12:35:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 depool', diff saved to https://phabricator.wikimedia.org/P60920 and previous config saved to /var/cache/conftool/dbconfig/20240418-123502-arnaudb.json [12:35:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2106.codfw.wmnet [12:38:31] (03PS1) 10Slyngshede: Login page, improve styling of the login page. [software/bitu] - 10https://gerrit.wikimedia.org/r/1021463 [12:39:47] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [12:40:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2023.codfw.wmnet [12:40:48] (03CR) 10Slyngshede: [C:03+2] Login page, improve styling of the login page. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1021463 (owner: 10Slyngshede) [12:42:28] (03Merged) 10jenkins-bot: Login page, improve styling of the login page. [software/bitu] - 10https://gerrit.wikimedia.org/r/1021463 (owner: 10Slyngshede) [12:42:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P60921 and previous config saved to /var/cache/conftool/dbconfig/20240418-124230-marostegui.json [12:42:38] (03CR) 10Elukey: [V:03+1 C:03+2] role::aqs: move codfw's instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1020266 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [12:43:17] (03PS1) 10Muehlenhoff: Switch es2023 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021464 (https://phabricator.wikimedia.org/T349619) [12:43:55] (03CR) 10Muehlenhoff: [C:03+2] Switch es2023 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021464 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:44:31] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2106.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [12:45:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2106.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [12:45:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:45:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2106.codfw.wmnet [12:48:06] 10ops-codfw, 10decommission-hardware, 13Patch-For-Review: decommission db2106.codfw.wmnet - https://phabricator.wikimedia.org/T362799#9726195 (10ABran-WMF) [12:49:13] (03PS1) 10Arnaudb: mariadb: removes db2105 [puppet] - 
10https://gerrit.wikimedia.org/r/1021390 (https://phabricator.wikimedia.org/T362800) [12:49:15] !log move aqs codfw cassandra instances to PKI TLS certs - T352647 [12:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:22] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647 [12:50:24] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2105 [puppet] - 10https://gerrit.wikimedia.org/r/1021390 (https://phabricator.wikimedia.org/T362800) (owner: 10Arnaudb) [12:53:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 depool', diff saved to https://phabricator.wikimedia.org/P60922 and previous config saved to /var/cache/conftool/dbconfig/20240418-125338-arnaudb.json [12:54:15] !log sudo cumin -b1 -s600 "A:dnsbox" "systemctl restart ntp.service" to pick up magru /24: T346722 [12:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2105.codfw.wmnet [12:54:25] T346722: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722 [12:55:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2023.codfw.wmnet [12:57:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P60923 and previous config saved to /var/cache/conftool/dbconfig/20240418-125739-marostegui.json [12:58:27] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [12:59:59] (03PS3) 10Volans: Add configuration for the new magru DC [cookbooks] - 10https://gerrit.wikimedia.org/r/1020087 (https://phabricator.wikimedia.org/T362421) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T1300). 
[13:00:04] Jhs and NMW03: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:23] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2105.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [13:00:28] I can't wait for my sticker [13:01:23] * Jhs is present [13:01:37] I can probably deploy later in the window but not quite yet [13:01:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2105.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [13:01:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:01:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2105.codfw.wmnet [13:01:54] (depending on whether anyone else shows up in the meeting I’m in, I’ll either be free in a moment or in 15-30 minutes :P) [13:02:03] hehe [13:02:19] Hello Jhs [13:02:21] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1985/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:02:28] I forgot to change my username here lol [13:04:13] hiya NMW03 :D [13:04:36] 10ops-codfw, 10decommission-hardware, 13Patch-For-Review: decommission db2105.codfw.wmnet - https://phabricator.wikimedia.org/T362800#9726253 (10ABran-WMF) [13:05:35] more people showed up in the meeting, I’ll be back later ^^ [13:05:56] (03PS1) 10Arnaudb: mariadb: removes db2103 [puppet] - 10https://gerrit.wikimedia.org/r/1021391 
(https://phabricator.wikimedia.org/T362801) [13:06:53] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9726265 (10ssingh) Thanks @jcrespo! I should have silenced the alert or restarted the service; both of those are in progress now so we should see this resolve soon. @cmooney:... [13:06:55] !log aqs2001's Cassandra instances moved to PKI TLS certs [13:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:06] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12]*: Deploy new TLS Keystore - PKI - elukey@cumin1002 [13:07:25] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2103 [puppet] - 10https://gerrit.wikimedia.org/r/1021391 (https://phabricator.wikimedia.org/T362801) (owner: 10Arnaudb) [13:10:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2103 depool', diff saved to https://phabricator.wikimedia.org/P60925 and previous config saved to /var/cache/conftool/dbconfig/20240418-131027-arnaudb.json [13:12:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T361627)', diff saved to https://phabricator.wikimedia.org/P60926 and previous config saved to /var/cache/conftool/dbconfig/20240418-131248-marostegui.json [13:12:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2129.codfw.wmnet with reason: Maintenance [13:12:55] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:12:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2103.codfw.wmnet [13:13:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2129.codfw.wmnet with reason: Maintenance [13:13:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2129 
(T361627)', diff saved to https://phabricator.wikimedia.org/P60927 and previous config saved to /var/cache/conftool/dbconfig/20240418-131311-marostegui.json [13:14:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2024.codfw.wmnet [13:15:06] (03PS1) 10Jcrespo: dbbackups: Add dbprov1005 to adding access ddbackups database [puppet] - 10https://gerrit.wikimedia.org/r/1021469 (https://phabricator.wikimedia.org/T362509) [13:17:05] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [13:17:41] (03CR) 10Elukey: [C:03+2] knative-serving: move net_istio configs to a dict [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020808 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [13:18:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T361627)', diff saved to https://phabricator.wikimedia.org/P60928 and previous config saved to /var/cache/conftool/dbconfig/20240418-131836-marostegui.json [13:18:50] (ProbeDown) firing: (36) Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:52] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:19:01] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2103.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [13:20:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2103.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [13:20:28] !log arnaudb@cumin1002 END 
(PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:20:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2103.codfw.wmnet [13:21:39] (03PS12) 10Klausman: deployment_server: Change Puppet query for ML Cassandra Clusters [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) [13:21:49] 10ops-codfw, 10decommission-hardware, 13Patch-For-Review: decommission db2103.codfw.wmnet - https://phabricator.wikimedia.org/T362801#9726341 (10ABran-WMF) [13:22:01] (03CR) 10CI reject: [V:04-1] deployment_server: Change Puppet query for ML Cassandra Clusters [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:22:13] (03PS1) 10Muehlenhoff: Switch es2024 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021471 (https://phabricator.wikimedia.org/T349619) [13:23:22] (03CR) 10Muehlenhoff: [C:03+2] Switch es2024 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021471 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:23:50] (ProbeDown) firing: (36) Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:18] I am going to downtime --^ [13:24:33] (03PS1) 10Slyngshede: Add dummy secrets for idmcloud role. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1021472 (https://phabricator.wikimedia.org/T362128) [13:24:33] these are new prometheus alerts that are running on instances that still need to be restarted [13:25:23] (ProbeDown) firing: (36) Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:11] something weird with the downtime cookbook, it times out when using a query like aqs20[02-12]* mm [13:27:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2024.codfw.wmnet [13:27:53] I’m here now [13:28:04] NMW03, Jhs: still there? ^^ [13:28:05] (03CR) 10Volans: [C:03+2] Add configuration for the new magru DC [cookbooks] - 10https://gerrit.wikimedia.org/r/1020087 (https://phabricator.wikimedia.org/T362421) (owner: 10Volans) [13:28:11] Lucas_WMDE, aye [13:28:11] yes [13:28:15] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es2025.codfw.wmnet [13:28:15] elukey: can we still deploy or are there issues? 
[13:28:47] nono please go ahead :) [13:28:52] ok, thanks :) [13:28:57] !log installing apache2 security updates [13:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:39] elukey: it seems the puppetdb query is very slow [13:29:41] checking [13:29:56] (03PS1) 10Muehlenhoff: Switch es2025 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021473 (https://phabricator.wikimedia.org/T349619) [13:30:21] volans: yep yep, I added a manual silence on alerts.w.o so all good from my side [13:30:38] (meeting sorry can't check more) [13:30:46] (03CR) 10Muehlenhoff: [C:03+2] Switch es2025 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021473 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:30:54] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "I’m fine with deploying this, but isn’t this something MobileFrontend could set globally? (But I’m not exactly sure how $wgForceUIMsgAsCon" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) (owner: 10Jon Harald Søby) [13:30:55] puppetdb-api.discovery.wmnet:8090 times out for cumin [13:31:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) (owner: 10Jon Harald Søby) [13:31:29] elukey: on cumin2002 it works fine [13:31:29] (03PS2) 10Jcrespo: dbbackups: Add ddbackups database grant access for dbprov1005 [puppet] - 10https://gerrit.wikimedia.org/r/1021469 (https://phabricator.wikimedia.org/T362509) [13:31:33] that's only on cumin1002 [13:31:41] so you can run the cookbook from 2002 if you need it [13:31:49] (03Merged) 10jenkins-bot: Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) (owner: 10Jon Harald Søby) [13:31:57] nono all good atm thanks 
[13:32:16] (03Merged) 10jenkins-bot: Add configuration for the new magru DC [cookbooks] - 10https://gerrit.wikimedia.org/r/1020087 (https://phabricator.wikimedia.org/T362421) (owner: 10Volans) [13:32:19] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1015151|Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg (T361171)]] [13:32:24] T361171: Add message 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg by default - https://phabricator.wikimedia.org/T361171 [13:33:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P60930 and previous config saved to /var/cache/conftool/dbconfig/20240418-133344-marostegui.json [13:34:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es2025.codfw.wmnet [13:34:36] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1023.eqiad.wmnet [13:34:45] (03CR) 10Jon Harald Søby: "It's… complicated, because of the way the message is added to the interface. It displays one message, which should probably remain in the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) (owner: 10Jon Harald Søby) [13:35:35] (03PS1) 10Muehlenhoff: Switch es1023 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021474 (https://phabricator.wikimedia.org/T349619) [13:36:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Alright, thanks. 
(I just started looking into it too and noticed that the code I found so far was only in core `Skin.php`.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171) (owner: 10Jon Harald Søby) [13:36:23] (03CR) 10Muehlenhoff: [C:03+2] Switch es1023 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021474 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:37:02] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [13:37:12] elukey: it seems to work fine now, I was about to restart it but didn't; checking logs more in depth [13:37:16] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and jhsoby: Backport for [[gerrit:1015151|Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg (T361171)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:37:24] Jhs: please test! [13:37:25] T361171: Add message 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg by default - https://phabricator.wikimedia.org/T361171 [13:38:12] ok, I can see the difference at https://en.m.wikipedia.org/wiki/Main_Page?uselang=de I think [13:38:37] Lucas_WMDE, confirmed 👍 [13:38:40] the surrounding element even has lang=de already, huh [13:38:42] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and jhsoby: Continuing with sync [13:38:48] so that was previously mistagged then [13:38:55] I can see too :) [13:39:06] !log installing Linux 6.1.85 on Bookworm hosts [13:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1023.eqiad.wmnet [13:40:37] (03PS1) 10EoghanGaffney: phabricator: Remove old crt after switching to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1021477 (https://phabricator.wikimedia.org/T360413) [13:40:53] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host 
es1024.eqiad.wmnet [13:41:17] (03PS1) 10Clément Goubert: kubernetes: move 6 api_appservers from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1021478 (https://phabricator.wikimedia.org/T351074) [13:41:42] (03PS2) 10Clément Goubert: kubernetes: move 5 api_appservers from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1021478 (https://phabricator.wikimedia.org/T351074) [13:42:12] (03PS1) 10Muehlenhoff: Switch es1024 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021479 (https://phabricator.wikimedia.org/T349619) [13:42:22] (03CR) 10Jcrespo: [C:03+2] dbbackups: Add ddbackups database grant access for dbprov1005 [puppet] - 10https://gerrit.wikimedia.org/r/1021469 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [13:43:08] (03CR) 10Muehlenhoff: [C:03+2] Switch es1024 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021479 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:43:44] (03Abandoned) 10Clément Goubert: kubernetes: move 5 api_appservers from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1021478 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [13:43:46] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1991/console" [puppet] - 10https://gerrit.wikimedia.org/r/1021477 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [13:46:58] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "I think this all looks good to me; I’d be happy to see some more commas at the end of lines (so future diffs are nicer), but that’s not a " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019779 (https://phabricator.wikimedia.org/T361461) (owner: 10NMW03) [13:47:43] !log add grants for dbprov1005 at dbbackups (m1) T362509 [13:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:49] T362509: Setup new dbprov hosts and decommission the old ones 
- https://phabricator.wikimedia.org/T362509 [13:48:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1024.eqiad.wmnet [13:48:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P60932 and previous config saved to /var/cache/conftool/dbconfig/20240418-134852-marostegui.json [13:50:20] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9726525 (10MoritzMuehlenhoff) [13:50:47] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:51:14] jouncebot: next [13:51:15] In 2 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T1600) [13:51:32] NMW03: if you can stay a bit longer, we can still deploy your change I think [13:51:35] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [13:51:41] No problem, I am here [13:51:56] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1015151|Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg (T361171)]] (duration: 19m 37s) [13:52:01] T361171: Add message 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg by default - https://phabricator.wikimedia.org/T361171 [13:52:14] (03PS6) 10NMW03: Added extendedconfirmed and templateeditor rights to dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019779 (https://phabricator.wikimedia.org/T361461) [13:52:57] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main [13:53:01] (03PS1) 
10Volans: CHANGELOG: add changelogs for release v1.2.5 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1021481 [13:53:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019779 (https://phabricator.wikimedia.org/T361461) (owner: 10NMW03) [13:53:36] (03PS1) 10Clément Goubert: kubernetes: move 4 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1021482 (https://phabricator.wikimedia.org/T351074) [13:54:11] (03Merged) 10jenkins-bot: Added extendedconfirmed and templateeditor rights to dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019779 (https://phabricator.wikimedia.org/T361461) (owner: 10NMW03) [13:54:40] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1019779|Added extendedconfirmed and templateeditor rights to dawiki (T361461)]] [13:54:46] T361461: Add extendedconfirmed and templateeditor protection to dawiki - https://phabricator.wikimedia.org/T361461 [13:54:50] (03CR) 10Ssingh: [C:03+2] admin: add derenrich to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1020879 (https://phabricator.wikimedia.org/T362731) (owner: 10Ssingh) [13:55:44] (03CR) 10Ssingh: [C:04-1] "Waiting on approval." 
[puppet] - 10https://gerrit.wikimedia.org/r/1020879 (https://phabricator.wikimedia.org/T362731) (owner: 10Ssingh) [13:55:47] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:55:57] (03PS1) 10Majavah: openstack: neutron: Fix firewall driver with openvswitch [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) [13:55:58] (03CR) 10Ssingh: [C:03+2] admin: add kgraessle to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1020881 (https://phabricator.wikimedia.org/T362812) (owner: 10Ssingh) [13:56:53] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:57:20] !log lucaswerkmeister-wmde@deploy1002 nmw03 and lucaswerkmeister-wmde: Backport for [[gerrit:1019779|Added extendedconfirmed and templateeditor rights to dawiki (T361461)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:58:28] !log elukey@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching aqs20[02-12]*: Deploy new TLS Keystore - PKI - elukey@cumin1002 [13:58:37] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 5 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [13:58:53] (03PS2) 10Majavah: openstack: neutron: Fix firewall driver with openvswitch [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) [13:59:00] Lucas_WMDE everything is OK for me [13:59:22] !log lucaswerkmeister-wmde@deploy1002 nmw03 and lucaswerkmeister-wmde: 
Continuing with sync [13:59:28] thanks, https://da.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups|autopromote|restrictions&format=json&formatversion=2 looks okay to me too [13:59:34] (03CR) 10Hashar: [C:04-1] "`@resolve((contint.wikimedia.org), A)` is a DSL for Ferm firewall configuration isn't it? The `gearman_server` hiera value is injected as " [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [13:59:46] yep, also https://da.wikipedia.org/wiki/Speciel:Beskyttede_sider?uselang=en [14:00:15] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [14:01:07] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for kgraessle - https://phabricator.wikimedia.org/T362812#9726549 (10ssingh) 05Open→03Resolved @Kgraessle: The request has been merged; please let us know if there is any issue. Thanks! [14:01:55] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v1.2.5 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1021481 (owner: 10Volans) [14:02:08] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 8 DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [14:02:32] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9726557 (10Eevans) [14:03:20] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9726562 (10Eevans) [14:03:35] (03PS1) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [14:03:55] (03CR) 10CI reject: [V:04-1] ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 
(https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [14:03:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T361627)', diff saved to https://phabricator.wikimedia.org/P60934 and previous config saved to /var/cache/conftool/dbconfig/20240418-140359-marostegui.json [14:04:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance [14:04:09] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:04:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance [14:04:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T361627)', diff saved to https://phabricator.wikimedia.org/P60935 and previous config saved to /var/cache/conftool/dbconfig/20240418-140421-marostegui.json [14:06:28] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [14:07:21] (03CR) 10Kamila Součková: [C:03+1] kubernetes: move 4 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1021482 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:07:49] Lucas_WMDE Is there anything left of me? 
[14:08:00] !log installing postgresql-15 security updates [14:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:06] not really, I’m just waiting for the deployment to finish [14:08:12] maybe 2-3 more minutes [14:08:16] !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [14:08:19] alright [14:08:41] (03PS2) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [14:09:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T361627)', diff saved to https://phabricator.wikimedia.org/P60936 and previous config saved to /var/cache/conftool/dbconfig/20240418-140913-marostegui.json [14:09:19] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:09:29] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[9-12]*: Deploy new TLS Keystore - PKI - elukey@cumin1002 [14:09:46] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.2.5 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1021481 (owner: 10Volans) [14:09:57] Sorry for the time deployments take, I may have to take a look at lowering maxSurge for mw-on-k8s, with the size of the deployments we're hitting the allocatable limits of the k8s cluster and it slows it down quite a bit [14:10:24] https://grafana.wikimedia.org/goto/-283AXaSg?orgId=1 [14:10:40] Just right down to 0 [14:11:32] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1019779|Added extendedconfirmed and templateeditor rights to dawiki (T361461)]] (duration: 16m 51s) [14:11:32] There's an option coming to scap that I want to play with to do the deployments sequentially instead of all the namespaces at the same time, 
I'm curious to see if it's faster than letting the k8s scheduler hit the wall and wait for resources [14:11:35] (03PS3) 10Andrew Bogott: openstack: neutron: Fix firewall driver with openvswitch [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [14:11:37] T361461: Add extendedconfirmed and templateeditor protection to dawiki - https://phabricator.wikimedia.org/T361461 [14:11:53] !log UTC afternoon backport+config window done [14:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:19] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [14:12:30] !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [14:12:36] !log Depooling mw1355.eqiad.wmnet,mw1480.eqiad.wmnet,mw1481.eqiad.wmnet,mw1487.eqiad.wmnet - T351074 [14:12:45] (03CR) 10CDanis: [C:03+1] jaeger: update builder image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021459 (https://phabricator.wikimedia.org/T362719) (owner: 10Filippo Giunchedi) [14:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:47] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [14:12:48] (03CR) 10CDanis: [C:03+1] jaeger: update query/collector images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021460 (https://phabricator.wikimedia.org/T362719) (owner: 10Filippo Giunchedi) [14:12:53] (03PS1) 10Volans: Upstream release v1.2.5 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1021487 [14:13:03] claime: sounds interesting… I guess that would also “solve” the problem of how to report progress for multiple simultaneous deployments (T361747) ^^ [14:13:04] T361747: Provide some feedback in scap whilst waiting for helmfile deploys to complete - 
https://phabricator.wikimedia.org/T361747 [14:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:10] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1996/console" [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [14:14:12] Lucas_WMDE: True x) I'm not sure it would actually be faster though, I need actual testing to determine that [14:14:15] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [14:14:26] yeah, performance testing is a good idea ^^ [14:15:20] (03CR) 10Andrew Bogott: [C:03+1] openstack: neutron: Fix firewall driver with openvswitch [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [14:15:34] good thing is that it's just a switch, and we can also maybe see to do two, or three. 
https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/283 [14:16:22] (03PS3) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [14:16:34] (03CR) 10Clément Goubert: [C:03+2] kubernetes: move 4 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1021482 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:17:02] 10SRE-tools, 10conftool, 10Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893 (10CDanis) 03NEW [14:17:09] 10SRE-tools, 10conftool, 06Infrastructure-Foundations, 10Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#9726654 (10CDanis) p:05Triage→03Low [14:17:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1021477 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [14:17:42] (03PS2) 10Jforrester: wikifunctions: Configure prometheus endpoints on both services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020872 [14:17:42] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-04-17-125039 to 2024-04-17-163312 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020874 [14:17:49] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/1021472 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [14:18:12] (03CR) 10Slyngshede: [C:03+2] Add dummy secrets for idmcloud role. [labs/private] - 10https://gerrit.wikimedia.org/r/1021472 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [14:18:16] (03CR) 10Slyngshede: [V:03+2 C:03+2] Add dummy secrets for idmcloud role. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1021472 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [14:19:16] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9726664 (10CDanis) >>! In T360029#9725627, @Volans wrote: > As for the commit I advocate to add dbctl support in Spicerack but IIRC that requi... [14:19:19] !log installing usrmerge bugfix updates from bookworm point release [14:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:44] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9726677 (10CDanis) 05Open→03Resolved Anyway I think that all that is needed to unblock VLAN migrations has been done or documented on... [14:21:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1355.eqiad.wmnet with OS bullseye [14:21:18] (03CR) 10Volans: [C:03+2] Upstream release v1.2.5 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1021487 (owner: 10Volans) [14:21:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1480.eqiad.wmnet with OS bullseye [14:22:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1481.eqiad.wmnet with OS bullseye [14:22:33] (03CR) 10JMeybohm: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020872 (owner: 10Jforrester) [14:22:51] (03PS1) 10EoghanGaffney: gitlab: Remove old phabricator key after switching to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1021489 [14:22:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1487.eqiad.wmnet with OS bullseye [14:23:39] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9726691 (10MoritzMuehlenhoff) [14:23:51] 
(SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P60937 and previous config saved to /var/cache/conftool/dbconfig/20240418-142420-marostegui.json [14:24:52] (03PS4) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [14:27:04] (03PS1) 10Elukey: admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) [14:28:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/1021489 (owner: 10EoghanGaffney) [14:28:13] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9726717 (10Marostegui) >>! In T360029#9726677, @CDanis wrote: > Anyway I think that all that is needed to unblock VLAN migrations has been... 
[14:28:15] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:28:46] (03Merged) 10jenkins-bot: Upstream release v1.2.5 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1021487 (owner: 10Volans) [14:28:48] !log installing cryptsetup bugfix updates from bookworm point release [14:28:50] !log elukey@cumin1002 END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching aqs20[9-12]*: Deploy new TLS Keystore - PKI - elukey@cumin1002 [14:28:51] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:23] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9726724 (10MoritzMuehlenhoff) [14:31:58] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9726725 (10MoritzMuehlenhoff) [14:33:08] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9726726 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [14:34:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1481.eqiad.wmnet with reason: host reimage [14:34:54] !log elukey@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[09-12]*: Deploy new TLS Keystore - PKI - elukey@cumin2002 [14:34:54] !log cgoubert@cumin1002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on mw1480.eqiad.wmnet with reason: host reimage [14:35:45] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1355.eqiad.wmnet with reason: host reimage [14:36:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1487.eqiad.wmnet with reason: host reimage [14:37:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1481.eqiad.wmnet with reason: host reimage [14:38:51] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:51] (ProbeDown) firing: (16) Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:55] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. 
- https://phabricator.wikimedia.org/T360029#9726741 (10CDanis) 05Resolved→03Open a:05CDanis→03Volans [14:39:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P60938 and previous config saved to /var/cache/conftool/dbconfig/20240418-143928-marostegui.json [14:40:07] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: update builder image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021459 (https://phabricator.wikimedia.org/T362719) (owner: 10Filippo Giunchedi) [14:40:08] (03CR) 10Majavah: [C:03+2] openstack: neutron: Fix firewall driver with openvswitch [puppet] - 10https://gerrit.wikimedia.org/r/1021484 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [14:40:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1487.eqiad.wmnet with reason: host reimage [14:40:23] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:32] (03PS1) 10Santiago Faci: Create the MPIC Kubernetes chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) [14:41:25] (03PS1) 10Santiago Faci: Create the MPIC Kubernetes chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) [14:42:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1355.eqiad.wmnet with reason: host reimage [14:43:32] (03PS2) 10Elukey: admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) [14:43:51] (SystemdUnitFailed) firing: (6) 
docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1480.eqiad.wmnet with reason: host reimage [14:45:23] (SystemdUnitFailed) firing: (7) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:23] (ProbeDown) firing: (16) Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:51] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: update query/collector images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021460 (https://phabricator.wikimedia.org/T362719) (owner: 10Filippo Giunchedi) [14:47:38] (03CR) 10CDanis: [C:03+1] jaeger: remove es-index-cleaner image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021458 (https://phabricator.wikimedia.org/T344953) (owner: 10Filippo Giunchedi) [14:47:55] (03PS3) 10Elukey: admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) [14:48:06] !log installing PHP 7.4 security updates (as packaged in Debian, not the WMF-internal build) [14:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:50] (ProbeDown) firing: (16) Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:04] !log uploaded python3-wmflib_1.2.5 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [14:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:53:51] (ProbeDown) firing: (16) Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:53:51] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:01] Appservers is me, and transient [14:54:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T361627)', diff saved to https://phabricator.wikimedia.org/P60939 and previous config saved to /var/cache/conftool/dbconfig/20240418-145435-marostegui.json [14:54:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:54:41] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id 
columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:54:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:54:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:55:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:55:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T361627)', diff saved to https://phabricator.wikimedia.org/P60940 and previous config saved to /var/cache/conftool/dbconfig/20240418-145512-marostegui.json [14:55:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1481.eqiad.wmnet with OS bullseye [14:56:26] !log elukey@cumin2002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching aqs20[09-12]*: Deploy new TLS Keystore - PKI - elukey@cumin2002 [14:58:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1487.eqiad.wmnet with OS bullseye [14:58:51] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T361627)', diff saved to https://phabricator.wikimedia.org/P60941 and previous config saved to /var/cache/conftool/dbconfig/20240418-150003-marostegui.json [15:00:17] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 
[15:02:05] !log elukey@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs2012.codfw.wmnet*: Deploy new TLS Keystore - PKI - elukey@cumin2002 [15:02:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1355.eqiad.wmnet with OS bullseye [15:03:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1480.eqiad.wmnet with OS bullseye [15:03:51] (ProbeDown) firing: (10) Service aqs2010-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:39] !log Running homer 'cr*eqiad*' commit 'T351074' [15:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:45] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:05:23] (ProbeDown) firing: (9) Service aqs2010-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:50] (ProbeDown) resolved: (8) Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:41] !log elukey@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2012.codfw.wmnet*: Deploy new TLS Keystore - PKI - elukey@cumin2002 [15:12:56] !log Pooling and uncordoning mw1355.eqiad.wmnet,mw1480.eqiad.wmnet,mw1481.eqiad.wmnet,mw1487.eqiad.wmnet - T351074 [15:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:05] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: 
name=(mw1355.eqiad.wmnet|mw1480.eqiad.wmnet|mw1481.eqiad.wmnet|mw1487.eqiad.wmnet),cluster=kubernetes,service=kubesvc [15:13:05] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:14:13] (03PS4) 10Elukey: admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) [15:14:59] (03PS5) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [15:14:59] (03PS1) 10Vgutierrez: profile::benthos: Don't require kafka config [puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) [15:15:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P60942 and previous config saved to /var/cache/conftool/dbconfig/20240418-151510-marostegui.json [15:18:01] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2007/console" [puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:18:35] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@5fb4f99]: (no justification provided) [15:19:07] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@5fb4f99]: (no justification provided) (duration: 00m 32s) [15:19:28] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:22:37] (03PS1) 10Fabfur: benthos/haproxy: using hiera aliases for benthos socket address [puppet] - 
10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) [15:25:01] (03PS2) 10Vgutierrez: profile::benthos: Don't require kafka config [puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) [15:25:01] (03PS6) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [15:25:14] (03PS1) 10Elukey: profile::httpbb: fix liftwing_staging tests [puppet] - 10https://gerrit.wikimedia.org/r/1021506 [15:26:31] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2011/co" [puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:26:43] !log rolling python3-wmflib upgrade to 1.2.5 across the fleet [15:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:06] (03PS3) 10Vgutierrez: profile::benthos: Don't require kafka config [puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) [15:28:07] (03PS7) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [15:29:30] 06SRE, 06Infrastructure-Foundations, 10netops: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902 (10Fabfur) 03NEW [15:29:35] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2013/console" [puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:29:59] (03CR) 10Vgutierrez: "PCC shows NOOP on existing usecases (logstash1023 & cp4037)" [puppet] - 10https://gerrit.wikimedia.org/r/1021502 
(https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:30:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P60943 and previous config saved to /var/cache/conftool/dbconfig/20240418-153017-marostegui.json [15:31:07] !log elukey@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [15:31:46] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:32:19] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362107#9726968 (10MoritzMuehlenhoff) Looks good to me. But better initially set these up with the default DRBD settings and check if the I/O perform... 
[15:32:46] !log installing util-linux security updates on buster [15:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:09] (03PS3) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-04-17-125039 to 2024-04-17-163312 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020874 [15:34:10] (03PS3) 10Jforrester: wikifunctions: Configure prometheus endpoints on both services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020872 [15:34:10] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-04-03-210033 to 2024-04-18-150843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021507 (https://phabricator.wikimedia.org/T347901) [15:40:27] (03PS8) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [15:40:40] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-04-17-125039 to 2024-04-17-163312 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020874 (owner: 10Jforrester) [15:41:31] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-04-17-125039 to 2024-04-17-163312 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020874 (owner: 10Jforrester) [15:41:51] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:42:33] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:42:48] (03PS1) 10Clément Goubert: admin_ng: Bump kube-state-metrics memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021509 [15:43:12] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:43:33] !log jforrester@deploy1002 helmfile [codfw] 
START helmfile.d/services/wikifunctions: apply [15:44:24] (03PS9) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [15:44:43] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:44:47] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:45:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T361627)', diff saved to https://phabricator.wikimedia.org/P60944 and previous config saved to /var/cache/conftool/dbconfig/20240418-154524-marostegui.json [15:45:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:45:32] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:45:36] (03PS1) 10Phuedx: Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021510 [15:45:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:45:46] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:45:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T361627)', diff saved to https://phabricator.wikimedia.org/P60945 and previous config saved to /var/cache/conftool/dbconfig/20240418-154547-marostegui.json [15:45:53] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:46:14] 
(03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-04-03-210033 to 2024-04-18-150843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021507 (https://phabricator.wikimedia.org/T347901) (owner: 10Jforrester) [15:47:06] (03CR) 10Vgutierrez: [V:03+1] "ncredir2002.yaml will be dropped before merging, it's there to get both use cases covered in PCC" [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [15:47:20] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-04-03-210033 to 2024-04-18-150843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021507 (https://phabricator.wikimedia.org/T347901) (owner: 10Jforrester) [15:49:01] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:50:31] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:50:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T361627)', diff saved to https://phabricator.wikimedia.org/P60946 and previous config saved to /var/cache/conftool/dbconfig/20240418-155038-marostegui.json [15:50:45] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:50:54] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:52:32] (03CR) 10Kosta Harlan: "I would go with setting `$wgWikimediaEventsIPoidUrl = null;` as that is a smaller change, but up to you. 
Note that this change-id had some" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021510 (owner: 10Phuedx) [15:52:38] (03CR) 10Vgutierrez: [C:04-1] benthos/haproxy: using hiera aliases for benthos socket address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [15:52:54] (03CR) 10Kosta Harlan: "e.g. Ic5c5d17ce72689396029452450f66dd271c2e575 isn't included in this revert." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021510 (owner: 10Phuedx) [15:53:05] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:53:07] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:53:31] (03PS2) 10Clément Goubert: admin_ng: Bump kube-state-metrics memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021509 [15:54:04] (03CR) 10Kamila Součková: [C:03+1] admin_ng: Bump kube-state-metrics memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021509 (owner: 10Clément Goubert) [15:54:55] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:55:13] (03CR) 10Jforrester: [C:03+2] wikifunctions: Configure prometheus endpoints on both services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020872 (owner: 10Jforrester) [15:56:17] (03Merged) 10jenkins-bot: wikifunctions: Configure prometheus endpoints on both services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020872 (owner: 10Jforrester) [15:57:10] (03CR) 10Clément Goubert: [C:03+2] admin_ng: Bump kube-state-metrics memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021509 (owner: 10Clément Goubert) [15:57:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [15:57:46] (03PS2) 10Fabfur: benthos/haproxy: using hiera aliases for benthos socket address [puppet] - 
10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) [15:58:44] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:59:13] (03CR) 10Fabfur: benthos/haproxy: using hiera aliases for benthos socket address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [15:59:22] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:59:50] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:00:05] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:23] (03Merged) 10jenkins-bot: admin_ng: Bump kube-state-metrics memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021509 (owner: 10Clément Goubert) [16:01:05] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:01:14] 10ops-codfw, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9727075 (10Jhancock.wm) Tried to run a diagnostic from the Lifecycle controller. Haunted because of a DIMM error on B4. It's been replaced. re-running the d... [16:01:24] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:01:35] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:01:40] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:01:45] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:02:03] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:03:05] !log repool ncredir2001 [16:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:32] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:04:48] 06SRE, 06Data-Engineering-Icebox, 10observability, 10Observability-Logging, 10Wikimedia-Logstash: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856#9727092 (10bd808) [16:05:45] (03CR) 10Vgutierrez: [C:04-1] benthos/haproxy: using hiera aliases for benthos socket address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:05:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P60947 and previous config saved to /var/cache/conftool/dbconfig/20240418-160546-marostegui.json [16:19:51] (03PS3) 10Fabfur: benthos/haproxy: using hiera aliases for benthos socket address [puppet] - 10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) [16:20:11] (03CR) 10CI reject: [V:04-1] benthos/haproxy: using hiera aliases for benthos socket address [puppet] - 10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:20:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P60948 and previous config saved to /var/cache/conftool/dbconfig/20240418-162053-marostegui.json [16:23:00] (03PS4) 10Fabfur: benthos/haproxy: using hiera aliases for benthos socket address [puppet] - 10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) [16:27:59] (03CR) 10Fabfur: benthos/haproxy: using hiera aliases for benthos socket address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: 
10Fabfur) [16:30:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 930.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:30:30] (03CR) 10Krinkle: [C:03+1] "Nice!" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1020840 (https://phabricator.wikimedia.org/T358253) (owner: 10Hashar) [16:32:22] (03PS1) 10Fabfur: benthos/haproxy: include haproxy current pid in messages [puppet] - 10https://gerrit.wikimedia.org/r/1021517 (https://phabricator.wikimedia.org/T358109) [16:35:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 930.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:36:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T361627)', diff saved to https://phabricator.wikimedia.org/P60949 and previous config saved to /var/cache/conftool/dbconfig/20240418-163600-marostegui.json [16:36:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:36:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:36:09] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:36:13] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Depooling db2180 (T361627)', diff saved to https://phabricator.wikimedia.org/P60950 and previous config saved to /var/cache/conftool/dbconfig/20240418-163612-marostegui.json [16:37:01] (03PS4) 10TChin: Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [16:37:38] (03CR) 10TChin: Add datasets-config helmfile (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:38:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T361627)', diff saved to https://phabricator.wikimedia.org/P60951 and previous config saved to /var/cache/conftool/dbconfig/20240418-163827-marostegui.json [16:39:34] (03PS1) 10Jdlrobson: Temporarily restore wgMinervaApplyKnownTemplateHacks for cached HTML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021518 [16:44:51] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on matomo1002.eqiad.wmnet with reason: Migrating to new version [16:45:05] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on matomo1002.eqiad.wmnet with reason: Migrating to new version [16:45:40] (03PS2) 10Jdlrobson: Enable limited width on all main pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020886 (https://phabricator.wikimedia.org/T357706) [16:45:43] (03PS2) 10Jdlrobson: Temporarily restore wgMinervaApplyKnownTemplateHacks for cached HTML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021518 [16:45:48] (03PS3) 10Jdlrobson: Temporarily restore wgMinervaApplyKnownTemplateHacks for cached HTML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021518 [16:46:41] (03PS4) 10Jdlrobson: Temporarily restore wgMinervaApplyKnownTemplateHacks for cached HTML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021518 [16:46:59] (03PS5) 10Jdlrobson: Temporarily 
restore wgMinervaApplyKnownTemplateHacks for cached HTML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021518 (https://phabricator.wikimedia.org/T362747) [16:47:09] (03CR) 10Btullis: [C:03+2] Swith matomo/piwik to the new host [puppet] - 10https://gerrit.wikimedia.org/r/1020798 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [16:48:46] (03CR) 10Kevin Bazira: [C:03+2] ml-services: add logo-detection isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [16:49:51] (03Merged) 10jenkins-bot: ml-services: add logo-detection isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020706 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [16:53:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P60952 and previous config saved to /var/cache/conftool/dbconfig/20240418-165334-marostegui.json [16:57:36] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [17:00:05] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T1700) [17:04:48] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9727407 (10ovasileva) p:05Triage→03High [17:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P60954 and previous config saved to /var/cache/conftool/dbconfig/20240418-170842-marostegui.json [17:23:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T361627)', diff saved to https://phabricator.wikimedia.org/P60955 and previous config saved to /var/cache/conftool/dbconfig/20240418-172349-marostegui.json [17:23:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance [17:23:55] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:24:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance [17:24:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T361627)', diff saved to https://phabricator.wikimedia.org/P60956 and previous config saved to /var/cache/conftool/dbconfig/20240418-172412-marostegui.json [17:25:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T361627)', diff saved to https://phabricator.wikimedia.org/P60957 and previous config saved to /var/cache/conftool/dbconfig/20240418-172525-marostegui.json [17:33:02] (03PS38) 10Klausman: deployment_server: Change Puppet query for ML Cassandra Clusters [puppet] - 10https://gerrit.wikimedia.org/r/1020194 
(https://phabricator.wikimedia.org/T360428) [17:33:02] (03CR) 10Klausman: "After a day of banging my head against this and with lots of help from Balthazar and others, I think this is a decent version that should " [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [17:34:49] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2033/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [17:40:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P60958 and previous config saved to /var/cache/conftool/dbconfig/20240418-174033-marostegui.json [17:41:05] !log joal@deploy1002 Started deploy [airflow-dags/analytics@0a13b42]: Deploy of Analytics airflow dags for canary-events job [airflow-dags/analytics@0a13b420] [17:41:34] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@0a13b42]: Deploy of Analytics airflow dags for canary-events job [airflow-dags/analytics@0a13b420] (duration: 00m 28s) [17:47:35] (03PS10) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [17:48:53] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [17:49:34] (03PS3) 10NMW03: Add templateeditor right to sysops in dawiki and fix typo in group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020939 (https://phabricator.wikimedia.org/T361461) [17:51:50] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to 'wmf' ldap group for DErenrich to allow logstash access - https://phabricator.wikimedia.org/T362731#9727595 
(10NBaca-WMF) As Daniel's manager I approve of this request! [17:53:26] (03CR) 10Ssingh: [V:03+2 C:03+2] admin: add derenrich to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1020879 (https://phabricator.wikimedia.org/T362731) (owner: 10Ssingh) [17:54:14] (03PS2) 10Ssingh: admin: add derenrich to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1020879 (https://phabricator.wikimedia.org/T362731) [17:55:38] (03CR) 10Ssingh: [C:03+2] admin: add derenrich to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1020879 (https://phabricator.wikimedia.org/T362731) (owner: 10Ssingh) [17:55:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P60959 and previous config saved to /var/cache/conftool/dbconfig/20240418-175541-marostegui.json [18:00:02] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to 'wmf' ldap group for DErenrich to allow logstash access - https://phabricator.wikimedia.org/T362731#9727646 (10ssingh) 05Open→03Resolved a:03ssingh This request is now merged; please re-open if there are any issues, thanks! 
[18:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T1800) [18:01:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T352010)', diff saved to https://phabricator.wikimedia.org/P60960 and previous config saved to /var/cache/conftool/dbconfig/20240418-180059-ladsgroup.json [18:01:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:07:34] o/ [18:08:42] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021528 (https://phabricator.wikimedia.org/T361395) [18:08:44] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021528 (https://phabricator.wikimedia.org/T361395) (owner: 10TrainBranchBot) [18:09:07] !log joal@deploy1002 Started deploy [airflow-dags/analytics@980dc72]: Deploy of Analytics airflow dags for canary-events job [airflow-dags/analytics@980dc725] [18:09:34] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021528 (https://phabricator.wikimedia.org/T361395) (owner: 10TrainBranchBot) [18:09:38] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@980dc72]: Deploy of Analytics airflow dags for canary-events job [airflow-dags/analytics@980dc725] (duration: 00m 31s) [18:10:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T361627)', diff saved to https://phabricator.wikimedia.org/P60961 and previous config saved to /var/cache/conftool/dbconfig/20240418-181048-marostegui.json [18:10:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance [18:10:54] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF 
wikis - https://phabricator.wikimedia.org/T361627 [18:11:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance [18:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2214.codfw.wmnet with reason: Maintenance [18:14:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2214.codfw.wmnet with reason: Maintenance [18:14:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T361627)', diff saved to https://phabricator.wikimedia.org/P60962 and previous config saved to /var/cache/conftool/dbconfig/20240418-181450-marostegui.json [18:16:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P60963 and previous config saved to /var/cache/conftool/dbconfig/20240418-181606-ladsgroup.json [18:17:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T361627)', diff saved to https://phabricator.wikimedia.org/P60964 and previous config saved to /var/cache/conftool/dbconfig/20240418-181704-marostegui.json [18:17:09] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:22:14] 06SRE, 10Domains: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921 (10EdErhart-WMF) 03NEW [18:24:08] 06SRE, 10DNS: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921#9727778 (10taavi) [18:26:13] 
10ops-codfw, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9727788 (10Jhancock.wm) All tests passed on the diagnostic test, including the pci bus. It's pinging on the idrac and the network ips. @RKemper give it ano... [18:27:33] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.1 refs T361395 [18:27:38] T361395: 1.43.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T361395 [18:27:46] 06SRE, 10ChangeProp, 06collaboration-services, 06Infrastructure-Foundations, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9727771 (10brennen) [18:31:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P60965 and previous config saved to /var/cache/conftool/dbconfig/20240418-183116-ladsgroup.json [18:32:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P60966 and previous config saved to /var/cache/conftool/dbconfig/20240418-183211-marostegui.json [18:34:45] (03PS1) 10Ryan Kemper: move elastic2088 back into production [puppet] - 10https://gerrit.wikimedia.org/r/1020940 [18:35:32] (03CR) 10Bking: [C:03+1] move elastic2088 back into production [puppet] - 10https://gerrit.wikimedia.org/r/1020940 (owner: 10Ryan Kemper) [18:35:42] (03CR) 10Ryan Kemper: [C:03+2] move elastic2088 back into production [puppet] - 10https://gerrit.wikimedia.org/r/1020940 (owner: 10Ryan Kemper) [18:35:55] (03PS1) 10Aklapper: AVA: Decrease new user account score multiplicator [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1021530 [18:36:36] (03CR) 10Aklapper: [V:03+2 C:03+2] AVA: Decrease new user account score multiplicator [phabricator/antivandalism] (wmf/stable) - 
10https://gerrit.wikimedia.org/r/1021530 (owner: 10Aklapper) [18:46:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T352010)', diff saved to https://phabricator.wikimedia.org/P60967 and previous config saved to /var/cache/conftool/dbconfig/20240418-184623-ladsgroup.json [18:46:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [18:46:29] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:46:33] (03Abandoned) 10Stevemunene: configure datahub to wait for upgrade before starting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020295 (https://phabricator.wikimedia.org/T361688) (owner: 10Stevemunene) [18:46:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [18:46:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T352010)', diff saved to https://phabricator.wikimedia.org/P60968 and previous config saved to /var/cache/conftool/dbconfig/20240418-184645-ladsgroup.json [18:47:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P60969 and previous config saved to /var/cache/conftool/dbconfig/20240418-184718-marostegui.json [18:48:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:49:14] (03PS38) 10Ryan Kemper: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [18:50:21] !log dancy@deploy1002 Installing scap version "4.78.0" for 330 hosts [18:50:22] (03CR) 10CI reject: 
[V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [18:50:44] 10ops-codfw, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9727890 (10RKemper) 05Open→03Resolved a:03RKemper Looks good on our end, thanks! [18:51:06] !log dancy@deploy1002 Installation of scap version "4.78.0" completed for 330 hosts [18:52:30] (03CR) 10Ryan Kemper: Add Flink alerts for Cirrus Streaming Updater (037 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [18:53:51] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:57:16] (03PS1) 10Jforrester: Use IResultWrapper::numRows to check for empty IResultWrapper [extensions/GlobalUsage] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020941 (https://phabricator.wikimedia.org/T362901) [18:59:16] (03PS39) 10Ryan Kemper: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [18:59:40] (03CR) 10Ryan Kemper: Add Flink alerts for Cirrus Streaming Updater (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:00:22] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:01:47] (03PS40) 10Ryan Kemper: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 
(https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:02:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T361627)', diff saved to https://phabricator.wikimedia.org/P60970 and previous config saved to /var/cache/conftool/dbconfig/20240418-190226-marostegui.json [19:02:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2217.codfw.wmnet with reason: Maintenance [19:02:41] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:02:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2217.codfw.wmnet with reason: Maintenance [19:02:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T361627)', diff saved to https://phabricator.wikimedia.org/P60971 and previous config saved to /var/cache/conftool/dbconfig/20240418-190249-marostegui.json [19:02:54] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:02:56] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic2088\.codfw\.wmnet [19:05:10] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=codfw [19:07:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T361627)', diff saved to https://phabricator.wikimedia.org/P60972 and previous config saved to /var/cache/conftool/dbconfig/20240418-190722-marostegui.json [19:11:08] 06SRE, 10DNS, 06Traffic: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921#9727972 (10ssingh) Discussed a bit with @EdErhart-WMF on what the goal is here on Slack and will update this task later when there is more clarity. 
[19:21:40] !log dropping wikiadmin user on 10.64.% on RW es sections [19:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P60973 and previous config saved to /var/cache/conftool/dbconfig/20240418-192229-marostegui.json [19:23:27] (03PS1) 10Andrew Bogott: wmcs-backup image pruning: handle a couple of edge cases [puppet] - 10https://gerrit.wikimedia.org/r/1021544 [19:25:52] (03PS2) 10Andrew Bogott: wmcs-backup snapshot pruning: handle a couple of edge cases [puppet] - 10https://gerrit.wikimedia.org/r/1021544 [19:37:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P60974 and previous config saved to /var/cache/conftool/dbconfig/20240418-193737-marostegui.json [19:38:29] (03PS41) 10Ryan Kemper: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:39:35] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:42:48] . 
[19:43:51] (03CR) 10Andrew Bogott: [C:03+2] wmcs-backup snapshot pruning: handle a couple of edge cases [puppet] - 10https://gerrit.wikimedia.org/r/1021544 (owner: 10Andrew Bogott) [19:47:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T352010)', diff saved to https://phabricator.wikimedia.org/P60975 and previous config saved to /var/cache/conftool/dbconfig/20240418-194711-ladsgroup.json [19:47:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:52:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T361627)', diff saved to https://phabricator.wikimedia.org/P60976 and previous config saved to /var/cache/conftool/dbconfig/20240418-195244-marostegui.json [19:52:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:53:39] (03PS1) 10Andrew Bogott: codfw1dev cinder backups: backup less, for not as long [puppet] - 10https://gerrit.wikimedia.org/r/1021550 [19:54:56] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev cinder backups: backup less, for not as long [puppet] - 10https://gerrit.wikimedia.org/r/1021550 (owner: 10Andrew Bogott) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240418T2000). [20:00:05] cjming, Jdlrobson, and NMW03: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:32] o/ [20:00:40] i can deploy since i also have a pacht [20:00:42] patch [20:00:42] cjming: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1021510 had several followups. 
are you certain you want a partial revert only? [20:00:48] present [20:00:52] hey again @cjming :) [20:00:56] (i mean, the patch it reverts had multiple followups) [20:00:59] it's one of those weeks lol [20:01:03] \O/ [20:01:30] urbanecm: i'm not certain - idk much about the issue other than i was asked to deploy it today [20:01:41] asked by Kosta, i presume? [20:01:52] Kosta, Sam, and Mikhail [20:02:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P60977 and previous config saved to /var/cache/conftool/dbconfig/20240418-200218-ladsgroup.json [20:02:54] cjming: ack. In that case, let's add a revert of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1021338 to your patch, and make it a full revert. [20:04:40] urbanecm: that works for me -- seems like the main goal is to decommission that stream [20:04:49] yup [20:05:11] (03PS2) 10Phuedx: Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021510 [20:05:50] (03PS1) 10Clare Ming: Revert "ext-EventLogging: Add mediawiki.ip_reputation.score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020942 [20:06:48] urbanecm: will you give your +1 to ^^ [20:06:56] sure, gimme a sec [20:07:08] (03CR) 10Urbanecm: [C:03+1] Revert "ext-EventLogging: Add mediawiki.ip_reputation.score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020942 (owner: 10Clare Ming) [20:07:11] (03CR) 10Urbanecm: [C:03+1] Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021510 (owner: 10Phuedx) [20:07:12] done [20:07:20] ty! 
[20:07:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021510 (owner: 10Phuedx) [20:07:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020942 (owner: 10Clare Ming) [20:08:36] Jdlrobson: i'll do yours next [20:08:42] (03Merged) 10jenkins-bot: Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021510 (owner: 10Phuedx) [20:08:44] (03Merged) 10jenkins-bot: Revert "ext-EventLogging: Add mediawiki.ip_reputation.score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020942 (owner: 10Clare Ming) [20:09:03] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1021510|Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt)"]], [[gerrit:1020942|Revert "ext-EventLogging: Add mediawiki.ip_reputation.score"]] [20:09:10] NMW03: i'll do your patch thereafter [20:09:24] np [20:11:35] !log cjming@deploy1002 cjming and phuedx: Backport for [[gerrit:1021510|Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt)"]], [[gerrit:1020942|Revert "ext-EventLogging: Add mediawiki.ip_reputation.score"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:40] !log cjming@deploy1002 cjming and phuedx: Continuing with sync [20:15:12] (03CR) 10Dzahn: [C:03+1] phabricator: Remove old crt after switching to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1021477 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [20:15:46] (03CR) 10Dzahn: "not really gitlab-related but lgtm :)" [labs/private] - 10https://gerrit.wikimedia.org/r/1021489 (owner: 10EoghanGaffney) [20:16:26] (03PS2) 10Dzahn: ssl: Remove old phabricator key after switching to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1021489 (owner: 
10EoghanGaffney) [20:16:39] (03CR) 10Dzahn: [C:03+1] ssl: Remove old phabricator key after switching to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1021489 (owner: 10EoghanGaffney) [20:17:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P60978 and previous config saved to /var/cache/conftool/dbconfig/20240418-201727-ladsgroup.json [20:23:59] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1021510|Revert "WikimediaEvents: Set IPoid URL and enable ip_reputation/score (2nd attempt)"]], [[gerrit:1020942|Revert "ext-EventLogging: Add mediawiki.ip_reputation.score"]] (duration: 14m 56s) [20:24:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021518 (https://phabricator.wikimedia.org/T362747) (owner: 10Jdlrobson) [20:25:02] (03Merged) 10jenkins-bot: Temporarily restore wgMinervaApplyKnownTemplateHacks for cached HTML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1021518 (https://phabricator.wikimedia.org/T362747) (owner: 10Jdlrobson) [20:25:17] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1021518|Temporarily restore wgMinervaApplyKnownTemplateHacks for cached HTML (T362747)]] [20:25:25] (SystemdUnitFailed) firing: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:25:37] T362747: [regression] Minerva: Cached HTML are not getting the responsive infobox styles - https://phabricator.wikimedia.org/T362747 [20:25:45] (03CR) 10EoghanGaffney: [C:03+2] ssl: Remove old phabricator key after switching to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1021489 (owner: 10EoghanGaffney) [20:25:49] (03CR) 10EoghanGaffney: [V:03+2 C:03+2] ssl: Remove old 
phabricator key after switching to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1021489 (owner: 10EoghanGaffney) [20:27:45] !log cjming@deploy1002 jdlrobson and cjming: Backport for [[gerrit:1021518|Temporarily restore wgMinervaApplyKnownTemplateHacks for cached HTML (T362747)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:27:51] Jdlrobson: can you test? [20:28:49] looking now :) [20:29:54] (03PS4) 10NMW03: Add templateeditor right to sysops in dawiki and fix typo in group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020939 (https://phabricator.wikimedia.org/T361461) [20:30:09] LGTM cjming please sync [20:30:15] ok! [20:30:19] !log cjming@deploy1002 jdlrobson and cjming: Continuing with sync [20:30:25] (SystemdUnitFailed) resolved: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T352010)', diff saved to https://phabricator.wikimedia.org/P60979 and previous config saved to /var/cache/conftool/dbconfig/20240418-203234-ladsgroup.json [20:32:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [20:32:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:32:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [20:32:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T352010)', diff saved to https://phabricator.wikimedia.org/P60980 and previous config saved to /var/cache/conftool/dbconfig/20240418-203256-ladsgroup.json [20:40:24] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] phabricator: Remove old 
crt after switching to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1021477 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [20:42:32] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1021518|Temporarily restore wgMinervaApplyKnownTemplateHacks for cached HTML (T362747)]] (duration: 17m 14s) [20:42:43] T362747: [regression] Minerva: Cached HTML are not getting the responsive infobox styles - https://phabricator.wikimedia.org/T362747 [20:42:44] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on wdqs2023.codfw.wmnet with reason: T362508 [20:42:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020939 (https://phabricator.wikimedia.org/T361461) (owner: 10NMW03) [20:42:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on wdqs2023.codfw.wmnet with reason: T362508 [20:42:53] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508 [20:42:59] Jdlrobson: should be live! [20:43:17] NMW03: doing your patch now [20:43:24] (y) [20:43:29] cjming: <3 [20:43:34] (03Merged) 10jenkins-bot: Add templateeditor right to sysops in dawiki and fix typo in group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020939 (https://phabricator.wikimedia.org/T361461) (owner: 10NMW03) [20:43:50] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1020939|Add templateeditor right to sysops in dawiki and fix typo in group name (T361461)]] [20:43:55] T361461: Add extendedconfirmed and templateeditor protection to dawiki - https://phabricator.wikimedia.org/T361461 [20:46:20] !log cjming@deploy1002 cjming and nmw03: Backport for [[gerrit:1020939|Add templateeditor right to sysops in dawiki and fix typo in group name (T361461)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:46:32] NMW03: can you test? 
[20:46:46] sure one sec [20:48:00] cjming: LGTM [20:48:22] cool - syncing [20:48:27] !log cjming@deploy1002 cjming and nmw03: Continuing with sync [21:00:15] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1020939|Add templateeditor right to sysops in dawiki and fix typo in group name (T361461)]] (duration: 16m 24s) [21:00:32] NMW03: should be live! [21:00:39] T361461: Add extendedconfirmed and templateeditor protection to dawiki - https://phabricator.wikimedia.org/T361461 [21:01:01] !log end of UTC late backport window [21:01:02] <3 [21:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:30] 06SRE, 06Fundraising-Backlog, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9728207 (10AKanji-WMF) 05Declined→03Open p:05High→03Medium a:03AKanji-WMF [21:11:50] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T362508, excessive lag) xfer wikidata from wdqs2022.codfw.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards [21:11:56] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508 [21:14:10] (03CR) 10Dzahn: [C:03+2] create wikipedia-pl-sysop.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [21:14:13] (03PS4) 10Dzahn: create wikipedia-pl-sysop.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) [21:17:42] (03CR) 10Dzahn: [V:03+1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [21:25:25] (SystemdUnitFailed) firing: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:30:25] (SystemdUnitFailed) resolved: 
community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:43] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9728277 (10Dzahn) [21:47:18] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9728278 (10Dzahn) Phabricator was done by @eoghan (thank you) and this completes the ticket. Our time should be done here now. [21:48:05] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9728280 (10Dzahn) 05In progress→03Resolved [21:48:44] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9728282 (10Dzahn) [21:56:47] (03PS2) 10Dzahn: graphite: avoid including multiple roles, define primary host in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1019885 [21:57:28] (03CR) 10Dzahn: "ah, yea, I should have run this myself before adding you. Forgot renaming the role elasticsearch::alerts role to a profile. amended." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019885 (owner: 10Dzahn) [22:01:18] (03PS1) 10Andrew Bogott: site.pp: add entries for two new sets of cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1021576 (https://phabricator.wikimedia.org/T361366) [22:03:39] (03PS2) 10Andrew Bogott: site.pp: add entries for two new sets of cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1021576 (https://phabricator.wikimedia.org/T361366) [22:05:21] (03CR) 10Andrew Bogott: [C:03+2] site.pp: add entries for two new sets of cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1021576 (https://phabricator.wikimedia.org/T361366) (owner: 10Andrew Bogott) [22:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:04] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1019885/2038/" [puppet] - 10https://gerrit.wikimedia.org/r/1019885 (owner: 10Dzahn) [22:16:01] (03CR) 10Cwhite: [C:03+1] "PCC looks good! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1019885 (owner: 10Dzahn) [22:21:54] (03CR) 10Cwhite: [C:03+1] "This is indeed an incremental improvement. +1 from me as well." [puppet] - 10https://gerrit.wikimedia.org/r/1019829 (https://phabricator.wikimedia.org/T362239) (owner: 10Filippo Giunchedi) [22:22:56] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938 (10ops-monitoring-bot) 03NEW [22:25:42] (03CR) 10Cwhite: "I'm a fan of count in the square brackets." [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [22:31:35] (03CR) 10Cwhite: "The links are pretty long so I agree with the idea. I also use the links from these messages." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019844 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [22:31:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T362508, excessive lag) xfer wikidata from wdqs2022.codfw.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards [22:31:45] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508 [22:33:49] (03CR) 10Dzahn: [C:03+2] "thanks for the review, disabled puppet on primary host, deploying on other host first" [puppet] - 10https://gerrit.wikimedia.org/r/1019885 (owner: 10Dzahn) [22:37:30] (ProbeDown) firing: (2) Service wdqs2024:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2024:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:42:03] (03CR) 10Dzahn: [C:03+2] "confirmed this was complete NOOP in puppet on both graphite2004 and then graphite1005. 
no change but now you can switch-over the graphite " [puppet] - 10https://gerrit.wikimedia.org/r/1019885 (owner: 10Dzahn) [22:43:24] (03PS1) 10Ryan Kemper: wdqs.data-transfer: fix netbox object not callable [cookbooks] - 10https://gerrit.wikimedia.org/r/1021588 [22:47:22] (03CR) 10CI reject: [V:04-1] wdqs.data-transfer: fix netbox object not callable [cookbooks] - 10https://gerrit.wikimedia.org/r/1021588 (owner: 10Ryan Kemper) [22:48:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:52:16] (03PS2) 10Ryan Kemper: wdqs.data-transfer: fix netbox object not callable [cookbooks] - 10https://gerrit.wikimedia.org/r/1021588 (https://phabricator.wikimedia.org/T347624) [22:53:22] (03CR) 10Dzahn: [C:03+2] graphite: switch envoy ssl provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [22:53:51] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:55:50] (03CR) 10CI reject: [V:04-1] wdqs.data-transfer: fix netbox object not callable [cookbooks] - 10https://gerrit.wikimedia.org/r/1021588 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [22:57:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P60982 and previous config saved to /var/cache/conftool/dbconfig/20240418-225702-ladsgroup.json [22:57:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:59:12] (03PS3) 10Ryan Kemper: wdqs.data-transfer: 
fix netbox object not callable [cookbooks] - 10https://gerrit.wikimedia.org/r/1021588 (https://phabricator.wikimedia.org/T347624) [23:02:11] (03CR) 10Dzahn: [C:03+2] "before we had all these names on the cert but almost none used anymore now:" [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [23:03:04] (03CR) 10CI reject: [V:04-1] wdqs.data-transfer: fix netbox object not callable [cookbooks] - 10https://gerrit.wikimedia.org/r/1021588 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [23:06:53] !log graphite - switched SSL cert provider from cergen to cfssl - restarted envoyproxy [23:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:30] (03CR) 10Dzahn: [C:03+2] "also confirmed on primary server, restarted envoyproxy even though puppet already does that. backup of old certs is in /root for right now" [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [23:09:16] (03CR) 10Dzahn: [V:03+2 C:03+2] delete graphite.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1019889 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [23:09:21] (03PS2) 10Dzahn: delete graphite.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1019889 (https://phabricator.wikimedia.org/T360414) [23:11:39] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9728398 (10Dzahn) [23:11:56] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9728399 (10Dzahn) [23:12:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P60983 and previous config saved to 
/var/cache/conftool/dbconfig/20240418-231210-ladsgroup.json [23:12:55] (03CR) 10Dzahn: [V:03+2 C:03+2] delete graphite.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1019889 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [23:13:23] (03CR) 10Dzahn: [C:03+2] ssl: delete graphite.discovery.wmnet certificate [puppet] - 10https://gerrit.wikimedia.org/r/1019888 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [23:13:29] (03PS2) 10Dzahn: ssl: delete graphite.discovery.wmnet certificate [puppet] - 10https://gerrit.wikimedia.org/r/1019888 (https://phabricator.wikimedia.org/T360414) [23:14:07] (03CR) 10Dzahn: [V:03+2 C:03+2] ssl: delete graphite.discovery.wmnet certificate [puppet] - 10https://gerrit.wikimedia.org/r/1019888 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [23:17:05] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9728403 (10Dzahn) https://graphite.wikimedia.org has been switched. The old certs and keys are in /root on graphite1005/graphite2004 in... [23:27:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P60984 and previous config saved to /var/cache/conftool/dbconfig/20240418-232717-ladsgroup.json [23:30:31] 06SRE, 10Wikimedia-Mailing-lists: Wikimedia Community User Group Linguila Mailing list - https://phabricator.wikimedia.org/T362644#9728404 (10Dzahn) p:05Triage→03Medium [23:33:14] 06SRE, 10Wikimedia-Mailing-lists: Wikimedia Community User Group Linguila Mailing list - https://phabricator.wikimedia.org/T362644#9728406 (10Dzahn) a:03CapitainAfrika @CapitainAfrika Can you let us know here when the UG is recognized by AffCom? Once that happens feel free to set this from "stalled" to "ope... 
[23:33:47] 06SRE, 10Wikimedia-Mailing-lists: wikimedia-northern-nigeria@lists.wikimedia.org - https://phabricator.wikimedia.org/T360227#9728408 (10Dzahn) p:05Triage→03Medium a:03Aliyushaba [23:33:56] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9728413 (10Dzahn) [23:36:01] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9728418 (10Dzahn) ` mw2382 is kubernetes::worker mw2382 is a Kubernetes worker node (kubernetes::worker) Bare Metal host on site codfw and rack A3 ` [23:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1021396 [23:38:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1021396 (owner: 10TrainBranchBot) [23:39:36] 06SRE, 10Wikimedia-Mailing-lists: Hang up on daily article lists - https://phabricator.wikimedia.org/T349406#9728420 (10Dzahn) 05Open→03Stalled p:05Triage→03Low a:03MicrobiologyMarcus [23:42:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P60985 and previous config saved to /var/cache/conftool/dbconfig/20240418-234225-ladsgroup.json [23:42:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [23:42:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:42:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [23:42:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T352010)', diff saved to https://phabricator.wikimedia.org/P60986 and previous config saved to /var/cache/conftool/dbconfig/20240418-234247-ladsgroup.json 
[23:45:10] 10ops-esams, 06SRE, 06Traffic: cp3079 bios settings - https://phabricator.wikimedia.org/T349314#9728431 (10Dzahn) [23:46:23] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp3079 bios settings - https://phabricator.wikimedia.org/T349314#9728432 (10Dzahn) [23:49:32] 06SRE, 06Traffic: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743#9728435 (10Dzahn) p:05Triage→03Low latest comment on T345809 and the merge from October 2023 sound like this is basically declined? [23:51:25] 06SRE, 10Wikimedia-Mailing-lists: Cross post to multiple mailling lists is only received once by recipient - https://phabricator.wikimedia.org/T345691#9728439 (10Dzahn) p:05Triage→03Low Based on Ladsgroup's comment above I suggest closing this as declined. [23:56:44] 06SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org pages should have a "who to contact" link - https://phabricator.wikimedia.org/T344000#9728465 (10Dzahn) > who to contact about managing the archives of old lists I would say the answer is that you should contact the list admins of the list in question. If... [23:58:47] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1021396 (owner: 10TrainBranchBot)