[00:00:14] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1015452 (owner: TrainBranchBot)
[01:20:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:24:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[01:25:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:26:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:30:30] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:37:29] (ProbeDown) firing: Service jobrunner:443 has failed probes
(http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:42:29] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:59:15] (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers [02:22:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:37:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:02:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:26:15] 
(MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:31:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:34:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:39:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:44:13] (PS1) Krinkle: php74-sssd,php82-sssd: add php-yaml [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457)
[05:03:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on 9 hosts with reason: Fixing intermediate master
[05:03:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 9 hosts with reason: Fixing intermediate master
[05:12:36] ops-codfw, SRE, Data-Persistence-Backup, database-backups, and 3 others: db2100 crashed (memory error) - https://phabricator.wikimedia.org/T361037#9675467
(Marostegui) Yes, I think we can ignore that host for now as its replacement is ready at {T355422} - @jcrespo you okay with that?
[05:13:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[05:13:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[05:13:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:13:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:14:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T356166)', diff saved to https://phabricator.wikimedia.org/P59041 and previous config saved to /var/cache/conftool/dbconfig/20240401-051402-marostegui.json
[05:14:08] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[05:21:15] (PS1) Marostegui: installserver: Do not reimage db2220 [puppet] - https://gerrit.wikimedia.org/r/1015692
[05:28:33] (CR) Marostegui: [C:+2] installserver: Do not reimage db2220 [puppet] - https://gerrit.wikimedia.org/r/1015692 (owner: Marostegui)
[05:42:47] (PS1) Marostegui: db2215: Clarify status [puppet] - https://gerrit.wikimedia.org/r/1015694 (https://phabricator.wikimedia.org/T355422)
[05:43:50] (CR) Marostegui: [C:+2] db2215: Clarify status [puppet] - https://gerrit.wikimedia.org/r/1015694 (https://phabricator.wikimedia.org/T355422) (owner: Marostegui)
[05:44:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T356166)', diff saved to https://phabricator.wikimedia.org/P59042 and previous config saved to /var/cache/conftool/dbconfig/20240401-054408-marostegui.json
[05:44:13] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [05:59:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P59043 and previous config saved to /var/cache/conftool/dbconfig/20240401-055915-marostegui.json [06:14:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P59044 and previous config saved to /var/cache/conftool/dbconfig/20240401-061423-marostegui.json [06:22:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:29:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T356166)', diff saved to https://phabricator.wikimedia.org/P59045 and previous config saved to /var/cache/conftool/dbconfig/20240401-062932-marostegui.json [06:29:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [06:29:35] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [06:29:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [06:29:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T356166)', diff saved to https://phabricator.wikimedia.org/P59046 and previous config saved to /var/cache/conftool/dbconfig/20240401-062954-marostegui.json [06:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:56:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T356166)', diff saved to https://phabricator.wikimedia.org/P59047 and previous config saved to /var/cache/conftool/dbconfig/20240401-065635-marostegui.json [06:56:39] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [07:00:04] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240401T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:11:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P59048 and previous config saved to /var/cache/conftool/dbconfig/20240401-071143-marostegui.json [07:26:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P59049 and previous config saved to /var/cache/conftool/dbconfig/20240401-072650-marostegui.json [07:30:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 851.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:35:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 832.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded 
[07:41:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T356166)', diff saved to https://phabricator.wikimedia.org/P59050 and previous config saved to /var/cache/conftool/dbconfig/20240401-074158-marostegui.json [07:42:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [07:42:02] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [07:42:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [07:42:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T356166)', diff saved to https://phabricator.wikimedia.org/P59051 and previous config saved to /var/cache/conftool/dbconfig/20240401-074221-marostegui.json [08:04:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 926ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:09:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 892.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:14:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 985.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:15:40] SRE, ChangeProp, collaboration-services, GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9675600 (Aklapper) As API Gateway is nowadays owned by #ServiceOps, adding the #serviceops project tag to open API Gate...
[08:19:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T356166)', diff saved to https://phabricator.wikimedia.org/P59052 and previous config saved to /var/cache/conftool/dbconfig/20240401-081941-marostegui.json
[08:24:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 854.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:25:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.061s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:34:30] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 845.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:34:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P59053 and previous config saved to /var/cache/conftool/dbconfig/20240401-083449-marostegui.json [08:49:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P59054 and previous config saved to /var/cache/conftool/dbconfig/20240401-084956-marostegui.json [09:05:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T356166)', diff saved to https://phabricator.wikimedia.org/P59055 and previous config saved to /var/cache/conftool/dbconfig/20240401-090503-marostegui.json [09:05:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [09:05:11] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [09:05:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [09:05:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T356166)', diff saved to https://phabricator.wikimedia.org/P59056 and previous config saved to /var/cache/conftool/dbconfig/20240401-090527-marostegui.json [09:37:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T356166)', diff saved to https://phabricator.wikimedia.org/P59057 and previous config saved to /var/cache/conftool/dbconfig/20240401-093744-marostegui.json [09:37:48] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [09:42:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:52:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P59058 and previous config saved to /var/cache/conftool/dbconfig/20240401-095251-marostegui.json [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240401T1000) [10:07:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P59059 and previous config saved to /var/cache/conftool/dbconfig/20240401-100758-marostegui.json [10:23:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T356166)', diff saved to https://phabricator.wikimedia.org/P59060 and previous config saved to /var/cache/conftool/dbconfig/20240401-102306-marostegui.json [10:23:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [10:23:09] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [10:23:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [10:23:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T356166)', diff saved to https://phabricator.wikimedia.org/P59061 and previous config saved to /var/cache/conftool/dbconfig/20240401-102328-marostegui.json [10:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in 
eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:55:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T356166)', diff saved to https://phabricator.wikimedia.org/P59062 and previous config saved to /var/cache/conftool/dbconfig/20240401-105521-marostegui.json [10:55:26] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [11:10:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P59063 and previous config saved to /var/cache/conftool/dbconfig/20240401-111028-marostegui.json [11:25:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P59064 and previous config saved to /var/cache/conftool/dbconfig/20240401-112536-marostegui.json [11:40:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T356166)', diff saved to https://phabricator.wikimedia.org/P59065 and previous config saved to /var/cache/conftool/dbconfig/20240401-114043-marostegui.json [11:40:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1224.eqiad.wmnet with reason: Maintenance [11:40:46] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [11:40:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1224.eqiad.wmnet with reason: Maintenance [11:41:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T356166)', diff saved to https://phabricator.wikimedia.org/P59066 and previous config saved to /var/cache/conftool/dbconfig/20240401-114105-marostegui.json [12:10:03] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1224 (T356166)', diff saved to https://phabricator.wikimedia.org/P59067 and previous config saved to /var/cache/conftool/dbconfig/20240401-121002-marostegui.json
[12:10:07] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[12:21:23] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1015960 (owner: L10n-bot)
[12:25:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P59068 and previous config saved to /var/cache/conftool/dbconfig/20240401-122510-marostegui.json
[12:30:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 802.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:35:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 802.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:40:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P59069 and previous config saved to /var/cache/conftool/dbconfig/20240401-124017-marostegui.json
[12:42:35] (CR) Anzx: component: Add SandboxLink to Portuguese Wikiquote (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1015649
(https://phabricator.wikimedia.org/T361447) (owner: Ederporto)
[12:55:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T356166)', diff saved to https://phabricator.wikimedia.org/P59070 and previous config saved to /var/cache/conftool/dbconfig/20240401-125524-marostegui.json
[12:55:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[12:55:29] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[12:55:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[12:57:54] SRE, ChangeProp, collaboration-services, GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9676049 (akosiaris) LWN has an article titled "The race to replace Redis". I am not going to link directly as it is LWN...
[13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240401T1300).
[13:00:04] No Gerrit patches in the queue for this window AFAICS.
[13:05:41] SRE, ChangeProp, collaboration-services, GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9676057 (aborrero)
[13:11:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1005.eqiad.wmnet with OS bullseye
[13:12:04] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9676059 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye
[13:20:16] (PS1) Ssingh: cp3066: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015968 (https://phabricator.wikimedia.org/T360430)
[13:20:18] (PS1) Ssingh: cp3067: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015969 (https://phabricator.wikimedia.org/T360430)
[13:20:19] (PS1) Ssingh: cp3068: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015970 (https://phabricator.wikimedia.org/T360430)
[13:20:21] (PS1) Ssingh: cp3069: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015971 (https://phabricator.wikimedia.org/T360430)
[13:20:22] (PS1) Ssingh: cp3070: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015972 (https://phabricator.wikimedia.org/T360430)
[13:20:24] (PS1) Ssingh: cp3071: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015973 (https://phabricator.wikimedia.org/T360430)
[13:20:25] (PS1) Ssingh: cp3072: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015974 (https://phabricator.wikimedia.org/T360430)
[13:20:27] (PS1) Ssingh: cp3073: update hieradata for dual
NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015975 (https://phabricator.wikimedia.org/T360430)
[13:27:46] (PS4) Andrew Bogott: ldap: Pass typed data to sssd class [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[13:27:52] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[13:31:11] (CR) CI reject: [V:-1] ldap: Pass typed data to sssd class [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[13:31:28] (CR) Ssingh: "I am guessing we should still merge this? It's been sitting for a while! (that's fully on Traffic to be clear)" [puppet] - https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[13:32:00] (PS5) Andrew Bogott: ldap: Pass typed data to sssd class [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[13:32:17] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[13:32:40] (CR) Ssingh: "This as well -- let's plan on merging it this week, thanks. (And again, sorry for ignoring this for a while)." [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[13:34:59] (CR) Ssingh: "Let's merge this today."
[puppet] - https://gerrit.wikimedia.org/r/1014589 (owner: CDobbins)
[13:46:02] (PS6) Andrew Bogott: ldap: Pass typed data to sssd class [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[13:46:07] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[13:51:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[13:51:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[13:52:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T356166)', diff saved to https://phabricator.wikimedia.org/P59071 and previous config saved to /var/cache/conftool/dbconfig/20240401-135204-marostegui.json
[13:52:07] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[14:01:18] (CR) Andrew Bogott: [C:+1] ldap: Pass typed data to sssd class [puppet] - https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[14:06:19] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM."
[puppet] - https://gerrit.wikimedia.org/r/1015363 (https://phabricator.wikimedia.org/T349207) (owner: Andrew Bogott)
[14:06:45] !log reimage cp4052 back to bullseye
[14:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:49] (CR) Andrew Bogott: [C:+2] etcd:v3: Don't rely on 'etcd' group or user before installing etcd package [puppet] - https://gerrit.wikimedia.org/r/1015363 (https://phabricator.wikimedia.org/T349207) (owner: Andrew Bogott)
[14:06:54] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye
[14:25:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T356166)', diff saved to https://phabricator.wikimedia.org/P59072 and previous config saved to /var/cache/conftool/dbconfig/20240401-142552-marostegui.json
[14:25:55] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[14:26:40] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye
[14:27:21] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye
[14:28:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1005.eqiad.wmnet with OS bullseye
[14:28:51] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9676205 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye executed w...
[14:32:08] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088'] [14:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:32:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic2088'] [14:37:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P59073 and previous config saved to /var/cache/conftool/dbconfig/20240401-144059-marostegui.json [14:49:23] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [14:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:51:41] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088'] [14:52:11] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088'] [14:52:34] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [14:56:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to 
https://phabricator.wikimedia.org/P59074 and previous config saved to /var/cache/conftool/dbconfig/20240401-145606-marostegui.json [14:57:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1005.eqiad.wmnet with OS bullseye [14:58:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9676264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye [14:59:36] (03CR) 10Ssingh: "Idea is to merge and reimage these, one by one, and hence the per-host overrides." [puppet] - 10https://gerrit.wikimedia.org/r/1015968 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh) [15:03:32] (03PS3) 10Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) [15:05:09] (03PS4) 10Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) [15:11:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T356166)', diff saved to https://phabricator.wikimedia.org/P59075 and previous config saved to /var/cache/conftool/dbconfig/20240401-151114-marostegui.json [15:11:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 953.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:11:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [15:11:17] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [15:11:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1005.eqiad.wmnet with reason: host reimage [15:11:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [15:13:19] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye [15:14:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1005.eqiad.wmnet with reason: host reimage [15:16:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 887.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:18:51] (03PS4) 10Andrew Bogott: cloud puppetservers: remove hooks preventing local commit/merge/rebase [puppet] - 10https://gerrit.wikimedia.org/r/1015625 [15:18:52] (03PS18) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [15:18:52] (03PS1) 10Andrew Bogott: profile::wmcs::kubeadm::etcd: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015986 [15:19:30] (03PS2) 10Andrew Bogott: 
profile::wmcs::kubeadm::etcd: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015986 [15:19:30] (03PS5) 10Andrew Bogott: cloud puppetservers: remove hooks preventing local commit/merge/rebase [puppet] - 10https://gerrit.wikimedia.org/r/1015625 [15:19:30] (03PS19) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [15:19:37] (03CR) 10CI reject: [V:04-1] Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 (owner: 10Andrew Bogott) [15:19:59] (03CR) 10CI reject: [V:04-1] profile::wmcs::kubeadm::etcd: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015986 (owner: 10Andrew Bogott) [15:21:06] (03PS3) 10Andrew Bogott: profile::wmcs::kubeadm::etcd: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015986 [15:21:06] (03PS6) 10Andrew Bogott: cloud puppetservers: remove hooks preventing local commit/merge/rebase [puppet] - 10https://gerrit.wikimedia.org/r/1015625 [15:21:06] (03PS20) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [15:21:31] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1015986 (owner: 10Andrew Bogott) [15:22:20] (03CR) 10Ilias Sarantopoulos: [C:03+1] "I've tested this as well in the context of pytorch 2.1.2 base image where I copied the work done on this patch to create the huggingfacese" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [15:24:24] (03CR) 10CI reject: [V:04-1] profile::wmcs::kubeadm::etcd: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015986 (owner: 
10Andrew Bogott) [15:30:04] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240401T1530). [15:30:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=(cdn|ats-be) [15:30:31] (03CR) 10BryanDavis: [C:03+1] "If we could only trust the jerk who maintains this PHP extension..." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle) [15:30:50] (03PS4) 10Andrew Bogott: profile::wmcs::kubeadm::etcd: install etcd package before referencing uid [puppet] - 10https://gerrit.wikimedia.org/r/1015986 [15:30:51] (03PS7) 10Andrew Bogott: cloud puppetservers: remove hooks preventing local commit/merge/rebase [puppet] - 10https://gerrit.wikimedia.org/r/1015625 [15:30:51] (03PS21) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [15:32:54] 10ops-codfw, 06SRE, 06Data-Platform-SRE: Fatal error detected on elastic2088 - https://phabricator.wikimedia.org/T361286#9676321 (10Jhancock.wm) a:05Papaul→03Jhancock.wm we actually have two devices with errors. component at bus 100 device 4 function 0 component at bus 101 device 0 function 0 I'm not s... 
[15:33:43] !log cassandra (restbase): re-enable blocking read-repair — T354561 [15:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:49] T354561: Hardware refresh: Decommission restbase10[19-27] - https://phabricator.wikimedia.org/T354561 [15:34:44] (03CR) 10RLazarus: [C:03+2] MachineVision being sunsetted - remove dumps scripts [puppet] - 10https://gerrit.wikimedia.org/r/1015009 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [15:34:57] (03CR) 10RLazarus: [C:03+2] MachineVision extension is sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [15:35:12] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:35:50] (03CR) 10Andrew Bogott: [C:03+2] profile::wmcs::kubeadm::etcd: install etcd package before referencing uid [puppet] - 10https://gerrit.wikimedia.org/r/1015986 (owner: 10Andrew Bogott) [15:36:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:36:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1005.eqiad.wmnet with OS bullseye [15:36:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9676350 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye completed:... [15:37:32] (03CR) 10RLazarus: "This patch can delete modules/snapshot/manifests/systemdjobs/dump_machine_vision.pp too, since it's now unused." 
[puppet] - 10https://gerrit.wikimedia.org/r/1015010 (owner: 10Cparle) [15:37:38] (03CR) 10RLazarus: [C:04-1] "Continued at https://gerrit.wikimedia.org/r/c/1015009, https://gerrit.wikimedia.org/r/c/1015010." [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [15:38:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9676355 (10Jclark-ctr) [15:44:02] 10ops-codfw, 06SRE, 10decommission-hardware: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305#9676376 (10Jhancock.wm) elastic2049 was already decommissioned under https://phabricator.wikimedia.org/T313842 [15:50:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host elastic2088.codfw.wmnet [15:53:59] (03PS1) 10Andrew Bogott: Revert "profile::wmcs::kubeadm::etcd: install etcd package before referencing uid" [puppet] - 10https://gerrit.wikimedia.org/r/1015682 [15:56:50] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015457 [16:05:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host elastic2088.codfw.wmnet [16:07:19] 10ops-codfw, 06SRE, 06Data-Platform-SRE: 14Fatal error detected on elastic2088 - 14https://phabricator.wikimedia.org/T361286#9676419 (10Papaul) 05Open→03Resolved 14@bking the pxe boot issue was that both 10G and 1G nic were set to pxe boot so that is why it was failing. i disable pxe boot on the 1G... [16:07:57] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [16:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:17] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9676421 (10jhathaway) >>! 
In T356920#9672323, @Aklapper wrote: > @jhathaway: Another question is why the task is in S4 (Hardware Procurement) while it seems to have nothing to do with hardware procurem... [16:13:22] (03CR) 10Andrew Bogott: [C:03+2] Revert "profile::wmcs::kubeadm::etcd: install etcd package before referencing uid" [puppet] - 10https://gerrit.wikimedia.org/r/1015682 (owner: 10Andrew Bogott) [16:15:55] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9676451 (10jhathaway) @DBu-WMF the current dmarc monitoring is still a work in progress. ITS has purchased a subscription to dmarcdigests via the security budget, which is currently active, but they ar... [16:22:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 819.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:25:33] (03CR) 10Dreamy Jazz: Schedule weekly purge of global_block_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders) [16:26:38] (03CR) 10Dreamy Jazz: "I'd also like I95cee19d0e10ac58d2b6838a1989706ee06558aa to be merged before this is applied." 
[puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders) [16:26:52] (03PS3) 10Ladsgroup: lists: Allow images from upload.wikimedia.org in CSP [puppet] - 10https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755) (owner: 10Legoktm) [16:26:57] (03CR) 10Ladsgroup: [V:03+2 C:03+2] lists: Allow images from upload.wikimedia.org in CSP [puppet] - 10https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755) (owner: 10Legoktm) [16:27:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 825.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:29:28] (03PS2) 10Dreamy Jazz: Deploy partial action blocks everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015373 (https://phabricator.wikimedia.org/T353496) (owner: 10Tchanders) [16:29:49] (03CR) 10Dreamy Jazz: [C:03+1] Deploy partial action blocks everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015373 (https://phabricator.wikimedia.org/T353496) (owner: 10Tchanders) [16:37:43] (03PS3) 10Majavah: libraryupgrader: use system docker on newer Debian versions [puppet] - 10https://gerrit.wikimedia.org/r/997548 [16:37:43] (03PS3) 10Majavah: libraryupgrader: base_dir is not optional [puppet] - 10https://gerrit.wikimedia.org/r/997549 [16:37:43] (03PS3) 10Majavah: libraryupgrader: remove libup-web config [puppet] - 10https://gerrit.wikimedia.org/r/997550 [16:37:44] (03PS3) 10Majavah: libraryupgrader: add toggle for worker services [puppet] - 10https://gerrit.wikimedia.org/r/997551 [16:41:34] (03CR) 10Eevans: [V:03+2 C:03+2] targets: Remove decommissioned hosts [software/logstash-logback-encoder] - 
10https://gerrit.wikimedia.org/r/1015538 (https://phabricator.wikimedia.org/T354561) (owner: 10Eevans) [16:41:59] (03CR) 10Majavah: [C:03+2] libraryupgrader: use system docker on newer Debian versions [puppet] - 10https://gerrit.wikimedia.org/r/997548 (owner: 10Majavah) [16:42:11] (03CR) 10Majavah: [C:03+2] libraryupgrader: base_dir is not optional [puppet] - 10https://gerrit.wikimedia.org/r/997549 (owner: 10Majavah) [16:42:30] (03CR) 10Majavah: [C:03+2] libraryupgrader: remove libup-web config [puppet] - 10https://gerrit.wikimedia.org/r/997550 (owner: 10Majavah) [16:42:43] (03CR) 10Majavah: [C:03+2] libraryupgrader: add toggle for worker services [puppet] - 10https://gerrit.wikimedia.org/r/997551 (owner: 10Majavah) [16:42:43] !log eevans@deploy1002 Started deploy [cassandra/logstash-logback-encoder@42653e6]: (no justification provided) [16:43:16] !log eevans@deploy1002 Finished deploy [cassandra/logstash-logback-encoder@42653e6]: (no justification provided) (duration: 00m 33s) [16:46:15] (03CR) 10Dzahn: [C:03+2] phabricator: Fix SafeConfigParser Python DeprecationWarning [puppet] - 10https://gerrit.wikimedia.org/r/1015474 (owner: 10Aklapper) [16:50:46] (03CR) 10Dzahn: [C:03+2] prometheus: add config for scraping apache data on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/1015590 (owner: 10Dzahn) [16:55:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:55:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:55:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:55:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:56:00] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T356166)', diff saved to https://phabricator.wikimedia.org/P59076 and previous config saved to /var/cache/conftool/dbconfig/20240401-165559-marostegui.json [16:56:02] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [16:58:00] 10ops-codfw, 06SRE, 10decommission-hardware: 14decommission elastic20[37-54].codfw.wmnet - 14https://phabricator.wikimedia.org/T361305#9676698 (10Jhancock.wm) 05Open→03Resolved a:05Papaul→03Jhancock.wm [16:59:43] (03PS2) 10Dzahn: prometheus: add config for scraping apache data on miscweb and rt [puppet] - 10https://gerrit.wikimedia.org/r/1015592 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240401T1700) [17:00:05] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240401T1700). 
[17:05:20] (03PS1) 10Majavah: libraryupgrader: Use python3-venv instead [puppet] - 10https://gerrit.wikimedia.org/r/1015993 [17:06:20] (03CR) 10Majavah: [C:03+2] libraryupgrader: Use python3-venv instead [puppet] - 10https://gerrit.wikimedia.org/r/1015993 (owner: 10Majavah) [17:07:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T356166)', diff saved to https://phabricator.wikimedia.org/P59077 and previous config saved to /var/cache/conftool/dbconfig/20240401-170713-marostegui.json [17:07:29] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [17:22:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P59078 and previous config saved to /var/cache/conftool/dbconfig/20240401-172221-marostegui.json [17:30:06] (03CR) 10Dzahn: "please talk with serviceops about this one" [puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [17:30:45] (03CR) 10Ahmon Dancy: "OK. I added Alexandros as a reviewer today." 
[puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [17:33:32] (03CR) 10Dzahn: [C:03+2] prometheus: add config for scraping apache data on miscweb and rt [puppet] - 10https://gerrit.wikimedia.org/r/1015592 (owner: 10Dzahn) [17:37:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P59079 and previous config saved to /var/cache/conftool/dbconfig/20240401-173729-marostegui.json [17:52:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T356166)', diff saved to https://phabricator.wikimedia.org/P59080 and previous config saved to /var/cache/conftool/dbconfig/20240401-175237-marostegui.json [17:52:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [17:52:41] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [17:52:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [17:53:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T356166)', diff saved to https://phabricator.wikimedia.org/P59081 and previous config saved to /var/cache/conftool/dbconfig/20240401-175300-marostegui.json [17:58:42] !log LDAP - removed uid migr from groups nda and wmde (T361266) [17:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:45] T361266: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266 [17:59:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T356166)', diff saved to https://phabricator.wikimedia.org/P59082 and previous config saved to /var/cache/conftool/dbconfig/20240401-175910-marostegui.json [17:59:13] T356166: Drop cl_collation_ext index 
from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:04:56] 06SRE, 10LDAP-Access-Requests: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266#9677090 (10Aklapper) [18:08:00] 06SRE, 10LDAP-Access-Requests: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266#9677104 (10Dzahn) [18:12:14] 06SRE, 10LDAP-Access-Requests: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266#9677106 (10Aklapper) FYI I have disabled the Phabricator account @Michael as it is linked to the WMDE staff account https://www.mediawiki.org/wiki/User:Michael_Gro%C3%9Fe_(WMDE) [18:14:15] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9677129 (10Tgr) [18:14:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P59083 and previous config saved to /var/cache/conftool/dbconfig/20240401-181417-marostegui.json [18:25:05] (03PS1) 10Dzahn: admin: disable shell user migr [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) [18:27:46] (03PS2) 10Dzahn: admin: disable shell user migr [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) [18:28:41] (03PS3) 10Dzahn: admin: disable shell user migr [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) [18:28:42] (03CR) 10CI reject: [V:04-1] admin: disable shell user migr [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) (owner: 10Dzahn) [18:29:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P59084 and previous config saved to 
/var/cache/conftool/dbconfig/20240401-182924-marostegui.json [18:29:40] (03CR) 10CI reject: [V:04-1] admin: disable shell user migr [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) (owner: 10Dzahn) [18:30:22] (03PS4) 10Dzahn: admin: disable shell user migr [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) [18:30:54] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9677197 (10Tgr) [18:31:18] (03CR) 10CI reject: [V:04-1] admin: disable shell user migr [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) (owner: 10Dzahn) [18:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:32:46] (03PS5) 10Dzahn: admin: disable shell user migr [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) [18:39:38] (03PS1) 10Majavah: libraryupgrader: Use Keyholder to hold the key [puppet] - 10https://gerrit.wikimedia.org/r/1015997 [18:40:33] (03CR) 10Majavah: [C:03+2] libraryupgrader: Use Keyholder to hold the key [puppet] - 10https://gerrit.wikimedia.org/r/1015997 (owner: 10Majavah) [18:44:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T356166)', diff saved to https://phabricator.wikimedia.org/P59085 and previous config saved to /var/cache/conftool/dbconfig/20240401-184432-marostegui.json [18:44:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:44:36] T356166: Drop cl_collation_ext index from categorylinks in production - 
https://phabricator.wikimedia.org/T356166 [18:44:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:44:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T356166)', diff saved to https://phabricator.wikimedia.org/P59086 and previous config saved to /var/cache/conftool/dbconfig/20240401-184455-marostegui.json [18:48:29] (03CR) 10Dzahn: "r" [puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) (owner: 10Dzahn) [18:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:51:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T356166)', diff saved to https://phabricator.wikimedia.org/P59087 and previous config saved to /var/cache/conftool/dbconfig/20240401-185128-marostegui.json [18:51:32] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:58:09] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202#9677403 (10Dzahn) @Urbanecm What do you think? Would it be reasonable to create a text file per list on the steward... 
[18:59:34] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9677404 (10Dzahn) [19:06:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P59088 and previous config saved to /var/cache/conftool/dbconfig/20240401-190635-marostegui.json [19:11:50] (03PS1) 10Majavah: libraryupgrader: Create /srv/git [puppet] - 10https://gerrit.wikimedia.org/r/1016002 [19:12:28] (03CR) 10Majavah: [C:03+2] libraryupgrader: Create /srv/git [puppet] - 10https://gerrit.wikimedia.org/r/1016002 (owner: 10Majavah) [19:17:13] (03PS1) 10Eevans: restbase: remove decommissioned hosts restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/1016003 (https://phabricator.wikimedia.org/T354561) [19:21:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P59089 and previous config saved to /var/cache/conftool/dbconfig/20240401-192143-marostegui.json [19:27:29] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9677510 (10Urbanecm) That sounds perfect to me @dzahn! Tha... [19:28:29] (03CR) 10RLazarus: [C:03+2] "Thanks Daniel!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1015995 (https://phabricator.wikimedia.org/T361266) (owner: 10Dzahn) [19:31:47] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266#9677514 (10RLazarus) [19:35:02] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266#9677524 (10RLazarus) Clinic duty SRE here, thanks @karapayneWMDE for the ticket. I merged https://gerrit.wikimedia.org/r/1015995 (thanks @Dzahn!) and followed up with ` rzl... [19:36:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T356166)', diff saved to https://phabricator.wikimedia.org/P59090 and previous config saved to /var/cache/conftool/dbconfig/20240401-193650-marostegui.json [19:36:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance [19:36:53] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [19:37:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance [19:37:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T356166)', diff saved to https://phabricator.wikimedia.org/P59091 and previous config saved to /var/cache/conftool/dbconfig/20240401-193713-marostegui.json [19:42:13] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye [19:42:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9677549 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [19:47:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1210 (T356166)', diff saved to https://phabricator.wikimedia.org/P59092 and previous config saved to /var/cache/conftool/dbconfig/20240401-194709-marostegui.json [19:47:13] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [19:50:09] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9677567 (10DBu-WMF) @jhathaway Thanks for the info. I started this process with ITS and they asked me to open a Phab but happy to go back and ask for access to dmarcdigests. It is fine if we don't ke... [19:58:58] (03Abandoned) 10Dzahn: create sysop-pl.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1014598 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240401T2000). nyaa~ [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:53] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [20:02:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P59093 and previous config saved to /var/cache/conftool/dbconfig/20240401-200217-marostegui.json [20:13:58] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Platform-SRE (2024.03.25 - 2024.04.14): create and deploy new Elastic Curator deb package - https://phabricator.wikimedia.org/T361105#9677642 (10bking) [20:17:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2088.codfw.wmnet with reason: host reimage [20:17:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P59094 and previous config saved to /var/cache/conftool/dbconfig/20240401-201725-marostegui.json [20:19:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2088.codfw.wmnet with reason: host reimage [20:32:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T356166)', diff saved to https://phabricator.wikimedia.org/P59095 and previous config saved to /var/cache/conftool/dbconfig/20240401-203232-marostegui.json [20:32:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [20:32:36] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [20:32:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [20:32:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T356166)', diff saved to https://phabricator.wikimedia.org/P59096 and previous config saved to 
/var/cache/conftool/dbconfig/20240401-203254-marostegui.json [20:36:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2088.codfw.wmnet with OS bullseye [20:42:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T356166)', diff saved to https://phabricator.wikimedia.org/P59097 and previous config saved to /var/cache/conftool/dbconfig/20240401-204229-marostegui.json [20:42:33] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [20:52:42] (03PS1) 10Bking: elastic: Add elastic2088 back to production [puppet] - 10https://gerrit.wikimedia.org/r/1016009 (https://phabricator.wikimedia.org/T361286) [20:57:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P59098 and previous config saved to /var/cache/conftool/dbconfig/20240401-205736-marostegui.json [21:00:05] Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240401T2100). [21:03:38] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1006.eqiad.wmnet with OS bullseye [21:04:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9677766 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed w... 
[21:06:43] (03CR) 10Ryan Kemper: [C:03+1] elastic: Add elastic2088 back to production [puppet] - 10https://gerrit.wikimedia.org/r/1016009 (https://phabricator.wikimedia.org/T361286) (owner: 10Bking) [21:07:09] (03CR) 10Bking: [C:03+2] elastic: Add elastic2088 back to production [puppet] - 10https://gerrit.wikimedia.org/r/1016009 (https://phabricator.wikimedia.org/T361286) (owner: 10Bking) [21:12:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P59099 and previous config saved to /var/cache/conftool/dbconfig/20240401-211244-marostegui.json [21:27:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T356166)', diff saved to https://phabricator.wikimedia.org/P59100 and previous config saved to /var/cache/conftool/dbconfig/20240401-212751-marostegui.json [21:27:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [21:27:55] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [21:28:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [21:38:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1230.eqiad.wmnet with reason: Maintenance [21:38:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1230.eqiad.wmnet with reason: Maintenance [21:38:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T356166)', diff saved to https://phabricator.wikimedia.org/P59101 and previous config saved to /var/cache/conftool/dbconfig/20240401-213834-marostegui.json [21:38:37] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [21:45:33] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1230 (T356166)', diff saved to https://phabricator.wikimedia.org/P59102 and previous config saved to /var/cache/conftool/dbconfig/20240401-214532-marostegui.json [21:45:41] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [21:46:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.197s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:48:15] 10ops-codfw, 06SRE: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525 (10ops-monitoring-bot) 03NEW [21:53:42] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye [21:53:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9677896 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [21:56:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 981.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:59:40] (03PS1) 10Dwisehaupt: Add cv and drush bin dirs to PATH on community crm [puppet] - 10https://gerrit.wikimedia.org/r/1016013 (https://phabricator.wikimedia.org/T343486) [22:00:09] (03CR) 10CI reject: [V:04-1] Add cv and drush bin dirs to PATH on community crm [puppet] - 
10https://gerrit.wikimedia.org/r/1016013 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:00:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P59103 and previous config saved to /var/cache/conftool/dbconfig/20240401-220040-marostegui.json [22:01:21] 10ops-codfw, 06SRE: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9677903 (10bking) Hello DC Ops, this host was acting flaky before (see T361286 ). I'm not sure what the next steps should be, but just wanted to provide that context. [22:01:30] (03CR) 10JHathaway: [C:03+1] "Puppet doesn't have a great pattern for this type of conflict, seems like a fine workaround." [puppet] - 10https://gerrit.wikimedia.org/r/1015541 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [22:01:40] (03PS2) 10Dwisehaupt: Add cv and drush bin dirs to PATH on community crm [puppet] - 10https://gerrit.wikimedia.org/r/1016013 (https://phabricator.wikimedia.org/T343486) [22:04:01] (03PS1) 10Dwisehaupt: Force CIVICRM_TEMPLATE_COMPILE_CHECK to false [puppet] - 10https://gerrit.wikimedia.org/r/1016014 (https://phabricator.wikimedia.org/T343486) [22:04:46] (03CR) 10JHathaway: [C:03+1] Add cv and drush bin dirs to PATH on community crm [puppet] - 10https://gerrit.wikimedia.org/r/1016013 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:06:13] (03PS1) 10Dwisehaupt: Enable the mariadb slow query log for civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016016 (https://phabricator.wikimedia.org/T343486) [22:06:49] (03CR) 10JHathaway: [C:03+1] Force CIVICRM_TEMPLATE_COMPILE_CHECK to false [puppet] - 10https://gerrit.wikimedia.org/r/1016014 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:15:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P59104 and previous config saved to 
/var/cache/conftool/dbconfig/20240401-221548-marostegui.json [22:21:53] (03PS1) 10Dwisehaupt: Enable https with apache for community civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) [22:30:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T356166)', diff saved to https://phabricator.wikimedia.org/P59105 and previous config saved to /var/cache/conftool/dbconfig/20240401-223055-marostegui.json [22:30:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [22:31:07] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [22:31:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [22:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:41:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [22:41:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [22:43:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2088-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:48:39] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2088-production-search-omega-codfw is showing 
memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:56:08] (03PS1) 10Jdlrobson: Enable desktop watchlist on beta cluster, clean up old references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016022 (https://phabricator.wikimedia.org/T266065) [22:57:50] (03PS2) 10Jdlrobson: Enable desktop watchlist on beta cluster, clean up old references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016022 (https://phabricator.wikimedia.org/T109277) [23:00:33] (03PS1) 10Andrew Bogott: role::wmcs::openstack::eqiad1::cinder_backups: include envscripts [puppet] - 10https://gerrit.wikimedia.org/r/1016023 [23:01:22] (03PS2) 10Andrew Bogott: role::wmcs::openstack::eqiad1::cinder_backups: include envscripts [puppet] - 10https://gerrit.wikimedia.org/r/1016023 [23:02:57] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016023 (owner: 10Andrew Bogott) [23:07:15] (03PS3) 10Andrew Bogott: role::wmcs::openstack::eqiad1::cinder_backups: include envscripts [puppet] - 10https://gerrit.wikimedia.org/r/1016023 [23:07:26] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016023 (owner: 10Andrew Bogott) [23:14:00] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1006.eqiad.wmnet with OS bullseye [23:14:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9678027 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed w... [23:38:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1015459 [23:38:05] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1015459 (owner: 10TrainBranchBot)