[00:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:01] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.1 [core] (wmf/1.43.0-wmf.1) - https://gerrit.wikimedia.org/r/1019769 (https://phabricator.wikimedia.org/T361395)
[01:08:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.1 [core] (wmf/1.43.0-wmf.1) - https://gerrit.wikimedia.org/r/1019769 (https://phabricator.wikimedia.org/T361395) (owner: TrainBranchBot)
[01:08:41] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T362596 (phaultfinder) NEW
[01:21:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:26:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 883.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:29:40] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.1 [core] (wmf/1.43.0-wmf.1) - https://gerrit.wikimedia.org/r/1019769 (https://phabricator.wikimedia.org/T361395) (owner: TrainBranchBot)
[01:31:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 835ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:36:30] ops-codfw, SRE: Inbound interface errors - https://phabricator.wikimedia.org/T362596#9716179 (Jhancock.wm) Open→Resolved a:Jhancock.wm known issue, no impact
[01:41:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 869.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:46:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 884.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T0200)
[02:24:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 880.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:29:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 830.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:38:29] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:35] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:38:40] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:42:49] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:42:54] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:51:45] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:51:52] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T0300)
[03:03:14] !log mwpresync@deploy1002 Pruned MediaWiki: 1.42.0-wmf.24 (duration: 03m 11s)
[03:03:29] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:40] (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1019946 (https://phabricator.wikimedia.org/T361395)
[03:04:41] (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1019946 (https://phabricator.wikimedia.org/T361395) (owner: TrainBranchBot)
[03:05:31] (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1019946 (https://phabricator.wikimedia.org/T361395) (owner: TrainBranchBot)
[03:05:58] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.1 refs T361395
[03:06:03] T361395: 1.43.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T361395
[03:06:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:15:56] (PS2) Dreamrimmer: Enable 'flood' user group at en.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250)
[03:31:31] (Traffic bill over quota) firing: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[03:36:31] (Traffic bill over quota) firing: (3) Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[03:42:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.007s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:47:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.059s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:51:31] (Traffic bill over quota) firing: (3) Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[03:56:31] (Traffic bill over quota) resolved: (2) Alert for device cr2-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[04:03:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.223s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:03:29] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.1 refs T361395 (duration: 57m 31s)
[04:03:40] T361395: 1.43.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T361395
[04:08:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.19s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:13:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 984.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:28:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 920.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:02:44] (PS1) Marostegui: db2156: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1019966
[05:03:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2156', diff saved to https://phabricator.wikimedia.org/P60545 and previous config saved to /var/cache/conftool/dbconfig/20240416-050315-root.json
[05:03:37] (CR) Marostegui: [C:+2] db2156: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1019966 (owner: Marostegui)
[05:04:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2156.codfw.wmnet with OS bookworm
[05:06:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T352010)', diff saved to https://phabricator.wikimedia.org/P60546 and previous config saved to /var/cache/conftool/dbconfig/20240416-050651-ladsgroup.json
[05:06:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:11:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[05:11:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[05:16:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[05:16:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[05:16:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2107 (T361627)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240416-051623-marostegui.json
[05:17:49] (ProbeDown) firing: (3) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:18:03] 👀
[05:18:11] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:18:29] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:21:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P60547 and previous config saved to /var/cache/conftool/dbconfig/20240416-052158-ladsgroup.json
[05:22:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T361627)', diff saved to https://phabricator.wikimedia.org/P60548 and previous config saved to /var/cache/conftool/dbconfig/20240416-052241-marostegui.json
[05:22:49] (ProbeDown) resolved: (3) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:22:53] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:23:11] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:24:12] (PS1) Marostegui: Revert "db2156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1019914
[05:24:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2156.codfw.wmnet with reason: host reimage
[05:26:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: host reimage
[05:30:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.345s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:35:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.345s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:37:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P60549 and previous config saved to /var/cache/conftool/dbconfig/20240416-053706-ladsgroup.json
[05:37:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P60550 and previous config saved to /var/cache/conftool/dbconfig/20240416-053749-marostegui.json
[05:43:38] ops-eqiad, SRE: Inbound interface errors - https://phabricator.wikimedia.org/T362366#9716296 (phaultfinder)
[05:45:12] (CR) Marostegui: [C:+2] Revert "db2156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1019914 (owner: Marostegui)
[05:45:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60551 and previous config saved to /var/cache/conftool/dbconfig/20240416-054528-root.json
[05:49:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2156.codfw.wmnet with OS bookworm
[05:52:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T352010)', diff saved to https://phabricator.wikimedia.org/P60552 and previous config saved to /var/cache/conftool/dbconfig/20240416-055215-ladsgroup.json
[05:52:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[05:52:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:52:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[05:52:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T352010)', diff saved to https://phabricator.wikimedia.org/P60553 and previous config saved to /var/cache/conftool/dbconfig/20240416-055237-ladsgroup.json
[05:52:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P60554 and previous config saved to /var/cache/conftool/dbconfig/20240416-055256-marostegui.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T0600)
[06:00:05] kormat, marostegui, Amir1, and arnaudb: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T0600).
[06:00:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60555 and previous config saved to /var/cache/conftool/dbconfig/20240416-060034-root.json
[06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:08:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T361627)', diff saved to https://phabricator.wikimedia.org/P60556 and previous config saved to /var/cache/conftool/dbconfig/20240416-060803-marostegui.json
[06:08:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[06:08:09] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:08:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[06:08:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T361627)', diff saved to https://phabricator.wikimedia.org/P60557 and previous config saved to /var/cache/conftool/dbconfig/20240416-060826-marostegui.json
[06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:15:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T361627)', diff saved to https://phabricator.wikimedia.org/P60558 and previous config saved to /var/cache/conftool/dbconfig/20240416-061536-marostegui.json
[06:15:45] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:15:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60559 and previous config saved to /var/cache/conftool/dbconfig/20240416-061546-root.json
[06:16:27] top
[06:16:35] hehe wrong window
[06:30:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P60560 and previous config saved to /var/cache/conftool/dbconfig/20240416-063045-marostegui.json
[06:30:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60561 and previous config saved to /var/cache/conftool/dbconfig/20240416-063053-root.json
[06:36:30] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v8.5.0
[06:36:45] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v8.5.0
[06:37:31] !log upgraded spicerack to v8.5.0 on cumin1002
[06:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:08] (CR) Muehlenhoff: [C:+2] beta::mediawiki_packages: Install lilypond from component [puppet] - https://gerrit.wikimedia.org/r/1019730 (https://phabricator.wikimedia.org/T362518) (owner: Muehlenhoff)
[06:45:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P60562 and previous config saved to /var/cache/conftool/dbconfig/20240416-064552-marostegui.json
[06:46:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60563 and previous config saved to /var/cache/conftool/dbconfig/20240416-064559-root.json
[06:51:40] SRE, SRE-tools, Cassandra: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694#9716407 (LSobanski) @Eevans Tagging with #cassandra in case this may be of interest.
[06:53:27] (CR) Muehlenhoff: [C:+2] Remove obsolete apt::pin for buster-backports [puppet] - https://gerrit.wikimedia.org/r/1019721 (https://phabricator.wikimedia.org/T362518) (owner: Muehlenhoff)
[06:54:25] (PS2) Slyngshede: New SSH key validator - Block duplicate keys. [software/bitu] - https://gerrit.wikimedia.org/r/1019271 (https://phabricator.wikimedia.org/T359532)
[06:55:38] (CR) Slyngshede: New SSH key validator - Block duplicate keys. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1019271 (https://phabricator.wikimedia.org/T359532) (owner: Slyngshede)
[07:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T361627)', diff saved to https://phabricator.wikimedia.org/P60564 and previous config saved to /var/cache/conftool/dbconfig/20240416-070100-marostegui.json
[07:01:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:01:06] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[07:01:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60565 and previous config saved to /var/cache/conftool/dbconfig/20240416-070105-root.json
[07:01:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:01:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:01:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:01:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T361627)', diff saved to https://phabricator.wikimedia.org/P60566 and previous config saved to /var/cache/conftool/dbconfig/20240416-070139-marostegui.json
[07:04:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T361627)', diff saved to https://phabricator.wikimedia.org/P60567 and previous config saved to /var/cache/conftool/dbconfig/20240416-070405-marostegui.json
[07:06:05] Hi Amir1, Urbanecm: Can I add this patch to this deployment window please? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1014648
[07:08:09] (CR) Aklapper: "Yes, output expected to be empty. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1019715 (https://phabricator.wikimedia.org/T197699) (owner: Aklapper)
[07:08:10] (PS1) Slyngshede: R:idm Prepare for Bitu installation for labtestwikitech. [puppet] - https://gerrit.wikimedia.org/r/1020085 (https://phabricator.wikimedia.org/T362128)
[07:09:59] (PS1) Slyngshede: site.pp, remove redundant idp-test definition [puppet] - https://gerrit.wikimedia.org/r/1020086
[07:12:06] (CR) Filippo Giunchedi: [C:+1] Puppet: add magru [puppet] - https://gerrit.wikimedia.org/r/1019810 (owner: Ayounsi)
[07:12:12] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1020086 (owner: Slyngshede)
[07:13:08] (PS2) Slyngshede: site.pp, prepare for Bitu installation for labtestwikitech. [puppet] - https://gerrit.wikimedia.org/r/1020085 (https://phabricator.wikimedia.org/T362128)
[07:13:42] (CR) Slyngshede: [C:+2] site.pp, remove redundant idp-test definition [puppet] - https://gerrit.wikimedia.org/r/1020086 (owner: Slyngshede)
[07:14:10] (CR) Muehlenhoff: "You also need to add a globbing pattern for cloudidm* in modules/profile/data/profile/installserver/preseed.yaml, otherwise the installati" [puppet] - https://gerrit.wikimedia.org/r/1020085 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[07:16:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60568 and previous config saved to /var/cache/conftool/dbconfig/20240416-071611-root.json
[07:17:29] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1019810 (owner: Ayounsi)
[07:19:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P60569 and previous config saved to /var/cache/conftool/dbconfig/20240416-071913-marostegui.json
[07:19:21] (PS3) Slyngshede: R:idm, prepare for Bitu installation for labtestwikitech. [puppet] - https://gerrit.wikimedia.org/r/1020085 (https://phabricator.wikimedia.org/T362128)
[07:22:54] sre-alert-triage, DBA: Alert in need of triage: SystemdUnitFailed (instance db2200:9100) - https://phabricator.wikimedia.org/T362611 (LSobanski) NEW
[07:23:08] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1020085 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[07:23:32] (CR) Slyngshede: [C:+2] R:idm, prepare for Bitu installation for labtestwikitech. [puppet] - https://gerrit.wikimedia.org/r/1020085 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[07:23:54] (PS1) Volans: Add configuration for the new magry DC [cookbooks] - https://gerrit.wikimedia.org/r/1020087 (https://phabricator.wikimedia.org/T362421)
[07:24:33] (PS2) Volans: Add configuration for the new magru DC [cookbooks] - https://gerrit.wikimedia.org/r/1020087 (https://phabricator.wikimedia.org/T362421)
[07:24:39] sre-alert-triage, DBA: Alert in need of triage: SystemdUnitFailed (instance db2200:9100) - https://phabricator.wikimedia.org/T362611#9716540 (ABran-WMF) p:Triage→Medium a:jcrespo
[07:26:16] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1019271 (https://phabricator.wikimedia.org/T359532) (owner: Slyngshede)
[07:27:06] (CR) Volans: [C:+2] "LGTM, I'm taking care of merging and deploying as you're out" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1019927 (https://phabricator.wikimedia.org/T362421) (owner: Ayounsi)
[07:27:39] (Merged) jenkins-bot: Netbox validators: add magru [software/netbox-extras] - https://gerrit.wikimedia.org/r/1019927 (https://phabricator.wikimedia.org/T362421) (owner: Ayounsi)
[07:34:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P60570 and previous config saved to /var/cache/conftool/dbconfig/20240416-073420-marostegui.json
[07:35:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 depool T360116', diff saved to https://phabricator.wikimedia.org/P60571 and previous config saved to /var/cache/conftool/dbconfig/20240416-073521-arnaudb.json
[07:35:29] T360116: Upgrade s5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T360116
[07:38:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1161.eqiad.wmnet with reason: T360116
[07:38:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1161.eqiad.wmnet with reason: T360116
[07:38:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db[1154,1161].eqiad.wmnet with reason: T360116
[07:39:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db[1154,1161].eqiad.wmnet with reason: T360116
[07:39:34] (CR) Volans: "A question and an improvement to avoid one hardcoded list" [puppet] - https://gerrit.wikimedia.org/r/1019810 (owner: Ayounsi)
[07:40:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1161.eqiad.wmnet with OS bookworm
[07:40:27] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[07:43:14] !log volans@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox-canary
[07:43:22] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[07:49:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T361627)', diff saved to https://phabricator.wikimedia.org/P60572 and previous config saved to /var/cache/conftool/dbconfig/20240416-074928-marostegui.json
[07:49:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance
[07:49:35] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[07:49:38] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[07:49:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance
[07:49:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T361627)', diff saved to https://phabricator.wikimedia.org/P60573 and previous config saved to /var/cache/conftool/dbconfig/20240416-074952-marostegui.json
[07:50:44] (PS1) Aklapper: Replace a strlen(null) call for PHP 8.1 [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1020170 (https://phabricator.wikimedia.org/T342244)
[07:50:55] (PS1) Marostegui: db2105: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1020171
[07:50:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2105', diff saved to https://phabricator.wikimedia.org/P60574 and previous config saved to /var/cache/conftool/dbconfig/20240416-075056-root.json
[07:51:36] (CR) Marostegui: [C:+2] db2105: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1020171 (owner: Marostegui)
[07:52:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2105.codfw.wmnet with OS bookworm
[07:52:50] (PS2) Aklapper: Replace a strlen(null) call for PHP 8.1 [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1020170 (https://phabricator.wikimedia.org/T342244)
[07:54:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1161.eqiad.wmnet with reason: host reimage
[07:55:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T361627)', diff saved to https://phabricator.wikimedia.org/P60575 and previous config saved to /var/cache/conftool/dbconfig/20240416-075533-marostegui.json
[07:55:39] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[07:56:06] (PS1) Marostegui: Revert "db2105: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1019922
[07:56:27] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[07:56:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1161.eqiad.wmnet with reason: host reimage
[07:56:47] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[07:59:38] sre-alert-triage, DBA: Alert in need of triage: SystemdUnitFailed (instance db2200:9100) - https://phabricator.wikimedia.org/T362611#9716617 (jcrespo) Open→Resolved ` [09:44] (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2200:9100 - https://wikitech.wikimed...
[08:01:48] SRE, SRE-Access-Requests, Data-Engineering: Requesting kerberos identity for Surbhi Gupta - https://phabricator.wikimedia.org/T362602#9716624 (SGupta-WMF)
[08:05:06] (CR) Jcrespo: [C:+1] "This is now ready and right from my side. Waiting for an ok from the DBAs." [puppet] - https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[08:09:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2105.codfw.wmnet with reason: host reimage
[08:10:28] (CR) Slyngshede: [C:+2] New SSH key validator - Block duplicate keys. [software/bitu] - https://gerrit.wikimedia.org/r/1019271 (https://phabricator.wikimedia.org/T359532) (owner: Slyngshede)
[08:10:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P60576 and previous config saved to /var/cache/conftool/dbconfig/20240416-081040-marostegui.json
[08:11:44] (Merged) jenkins-bot: New SSH key validator - Block duplicate keys.
[software/bitu] - 10https://gerrit.wikimedia.org/r/1019271 (https://phabricator.wikimedia.org/T359532) (owner: 10Slyngshede) [08:13:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2105.codfw.wmnet with reason: host reimage [08:19:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1161.eqiad.wmnet with OS bookworm [08:20:45] (03PS1) 10Muehlenhoff: Remove SSH key for Lukasz [puppet] - 10https://gerrit.wikimedia.org/r/1020175 [08:21:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60577 and previous config saved to /var/cache/conftool/dbconfig/20240416-082108-arnaudb.json [08:21:41] (03CR) 10CI reject: [V:04-1] Remove SSH key for Lukasz [puppet] - 10https://gerrit.wikimedia.org/r/1020175 (owner: 10Muehlenhoff) [08:25:04] (03CR) 10Muehlenhoff: graphite: switch envoy ssl provider to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [08:25:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P60578 and previous config saved to /var/cache/conftool/dbconfig/20240416-082548-marostegui.json [08:28:24] (03PS2) 10Muehlenhoff: Remove SSH key for Lukasz [puppet] - 10https://gerrit.wikimedia.org/r/1020175 [08:28:30] (03CR) 10Muehlenhoff: [C:03+1] ssl: delete graphite.discovery.wmnet certificate [puppet] - 10https://gerrit.wikimedia.org/r/1019888 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [08:28:36] (03CR) 10Muehlenhoff: [C:03+1] delete graphite.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1019889 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [08:28:41] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting kerberos identity for Surbhi Gupta - 
https://phabricator.wikimedia.org/T362602#9716689 (10BTullis) a:03BTullis [08:29:02] (03CR) 10Muehlenhoff: graphite: switch envoy ssl provider to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [08:29:34] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1019770 (https://phabricator.wikimedia.org/T362614) [08:30:11] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9716703 (10MoritzMuehlenhoff) [08:31:55] (03PS1) 10Jcrespo: mariadb: Reenable backups and delete puppet 7 host config [puppet] - 10https://gerrit.wikimedia.org/r/1020176 (https://phabricator.wikimedia.org/T318062) [08:34:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2105.codfw.wmnet with OS bookworm [08:35:28] (03CR) 10Jcrespo: "Sanity check for puppet 7 per-host config removal: https://puppet-compiler.wmflabs.org/output/1020176/1920/db2098.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1020176 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [08:36:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60579 and previous config saved to /var/cache/conftool/dbconfig/20240416-083614-arnaudb.json [08:37:47] (03CR) 10LSobanski: [C:03+1] Remove SSH key for Lukasz [puppet] - 10https://gerrit.wikimedia.org/r/1020175 (owner: 10Muehlenhoff) [08:38:36] (03CR) 10Muehlenhoff: [C:03+1] "Confirmed, this is configured via the role (I'll do a big cleanup patch when all mariadb roles are done, but perfectly fine to remove this" [puppet] - 10https://gerrit.wikimedia.org/r/1020176 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [08:38:47] (03CR) 10Muehlenhoff: [C:03+2] 
Remove SSH key for Lukasz [puppet] - 10https://gerrit.wikimedia.org/r/1020175 (owner: 10Muehlenhoff) [08:40:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T361627)', diff saved to https://phabricator.wikimedia.org/P60580 and previous config saved to /var/cache/conftool/dbconfig/20240416-084055-marostegui.json [08:40:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [08:41:01] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:41:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [08:41:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T361627)', diff saved to https://phabricator.wikimedia.org/P60581 and previous config saved to /var/cache/conftool/dbconfig/20240416-084118-marostegui.json [08:42:20] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host cloudidm2001-dev.codfw.wmnet [08:42:22] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [08:44:32] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudidm2001-dev.codfw.wmnet - slyngshede@cumin1002" [08:45:18] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudidm2001-dev.codfw.wmnet - slyngshede@cumin1002" [08:45:18] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:45:18] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache cloudidm2001-dev.codfw.wmnet on all recursors [08:45:21] !log slyngshede@cumin1002 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) cloudidm2001-dev.codfw.wmnet on all recursors [08:45:50] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidm2001-dev.codfw.wmnet - slyngshede@cumin1002" [08:46:35] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidm2001-dev.codfw.wmnet - slyngshede@cumin1002" [08:47:20] !log updated rsyslog to 8.2404.0-1~bpo11+1 on wikikube codfw - T357616 [08:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:24] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 [08:47:25] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T361627)', diff saved to https://phabricator.wikimedia.org/P60582 and previous config saved to /var/cache/conftool/dbconfig/20240416-084733-marostegui.json [08:47:38] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:48:19] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host cloudidm2001-dev.codfw.wmnet with OS bookworm [08:48:30] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9716756 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host 
cloudidm2001... [08:48:42] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1019771 (https://phabricator.wikimedia.org/T362616) [08:51:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60583 and previous config saved to /var/cache/conftool/dbconfig/20240416-085120-arnaudb.json [08:55:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60584 and previous config saved to /var/cache/conftool/dbconfig/20240416-085503-root.json [08:55:07] (03CR) 10Marostegui: [C:03+2] Revert "db2105: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1019922 (owner: 10Marostegui) [08:58:42] 06SRE, 10LDAP-Access-Requests: Grant Access to 'nda' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618 (10Michael) 03NEW [08:59:18] !log updated rsyslog to 8.2404.0-1~bpo11+1 on wikikube eqiad - T357616 [08:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:25] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 [09:00:52] 06SRE, 10LDAP-Access-Requests: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9716833 (10taavi) [09:02:29] 06SRE, 10Bitu, 06DBA, 06Infrastructure-Foundations: Database request for Bitu Cloud DEV installation - https://phabricator.wikimedia.org/T362619 (10SLyngshede-WMF) 03NEW [09:02:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P60585 and previous config saved to /var/cache/conftool/dbconfig/20240416-090240-marostegui.json [09:02:51] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cloudidm2001-dev.codfw.wmnet with reason: host reimage [09:03:52] 06SRE, 10Cumin, 06Infrastructure-Foundations: cumin could use randomization/splay options - https://phabricator.wikimedia.org/T164587#9716858 (10Volans) [09:05:44] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266#9716865 (10taavi) Also removed from `wmde-mediawiki` Gerrit group. [09:05:53] 06SRE, 10Cumin, 06Infrastructure-Foundations: cumin could use randomization/splay options - https://phabricator.wikimedia.org/T164587#9716866 (10Volans) 05Open→03Declined See also T224097 for a similar use case. Given the lack of interest in the last few years I'm resolving this as declined. With t... [09:05:58] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudidm2001-dev.codfw.wmnet with reason: host reimage [09:06:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P60586 and previous config saved to /var/cache/conftool/dbconfig/20240416-090625-arnaudb.json [09:07:00] (03CR) 10Muehlenhoff: [C:03+2] Pass the Ceph cluster address as an array [puppet] - 10https://gerrit.wikimedia.org/r/1019063 (owner: 10Muehlenhoff) [09:07:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s5 T362614 [09:07:28] T362614: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T362614 [09:07:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s5 T362614 [09:07:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2213 with weight 0 T362614', diff saved to https://phabricator.wikimedia.org/P60587 and previous config saved to /var/cache/conftool/dbconfig/20240416-090755-arnaudb.json [09:10:12] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db2105 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60588 and previous config saved to /var/cache/conftool/dbconfig/20240416-091009-root.json [09:12:05] (03PS2) 10Muehlenhoff: Move cloudcephosd2001-dev to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1017248 (https://phabricator.wikimedia.org/T361913) [09:12:07] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266#9716925 (10Urbanecm_WMF) >>! In T361266#9677105, @Aklapper wrote: > FYI I have disabled the Phabricator account @Michael as it is linked to the WMDE staff account https:... [09:13:29] 06SRE, 10LDAP-Access-Requests: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9716928 (10Urbanecm_WMF) [09:14:32] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Michael Grosse (WMDE) from WMF systems - https://phabricator.wikimedia.org/T361266#9716934 (10Aklapper) [09:17:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P60590 and previous config saved to /var/cache/conftool/dbconfig/20240416-091747-marostegui.json [09:18:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1017248 (https://phabricator.wikimedia.org/T361913) (owner: 10Muehlenhoff) [09:19:12] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9716944 (10Urbanecm_WMF) >>! In T362533#9714116, @Aitolkyn wrote: >>>! In T362533#9713681, @ssingh wrote: >>>>! In T362533#9713602, @Aitolkyn wrote: >>> @ssingh Thank you for c... 
[09:20:38] 06SRE, 10Cumin, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900#9716946 (10Volans) [09:20:50] (03CR) 10Jcrespo: "Thank you, will merge when back, in a few minutes." [puppet] - 10https://gerrit.wikimedia.org/r/1020176 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [09:20:58] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudidm2001-dev.codfw.wmnet with OS bookworm [09:20:59] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cloudidm2001-dev.codfw.wmnet [09:21:16] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9716947 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host cloudidm2001-dev... [09:25:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60591 and previous config saved to /var/cache/conftool/dbconfig/20240416-092517-root.json [09:26:05] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1019770 (https://phabricator.wikimedia.org/T362614) (owner: 10Gerrit maintenance bot) [09:28:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2213 to s5 primary T362614', diff saved to https://phabricator.wikimedia.org/P60592 and previous config saved to /var/cache/conftool/dbconfig/20240416-092800-arnaudb.json [09:28:08] T362614: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T362614 [09:29:07] 06SRE, 10Cumin, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900#9716976 (10Volans) [09:30:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 
'Depool db2123 T362614', diff saved to https://phabricator.wikimedia.org/P60593 and previous config saved to /var/cache/conftool/dbconfig/20240416-093041-arnaudb.json [09:31:45] !log Starting s5 codfw failover from db2123 to db2213 - T362614 (forgot to send it) [09:31:46] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9717001 (10jijiki) [09:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T361627)', diff saved to https://phabricator.wikimedia.org/P60594 and previous config saved to /var/cache/conftool/dbconfig/20240416-093255-marostegui.json [09:32:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [09:33:01] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:33:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [09:33:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T361627)', diff saved to https://phabricator.wikimedia.org/P60595 and previous config saved to /var/cache/conftool/dbconfig/20240416-093318-marostegui.json [09:35:24] (03CR) 10Filippo Giunchedi: [C:03+1] Puppet: add magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [09:35:43] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, 06Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9717014 (10jijiki) [09:39:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T361627)', 
diff saved to https://phabricator.wikimedia.org/P60596 and previous config saved to /var/cache/conftool/dbconfig/20240416-093924-marostegui.json [09:39:30] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:39:49] (03PS1) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 [09:40:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60597 and previous config saved to /var/cache/conftool/dbconfig/20240416-094023-root.json [09:41:36] (03CR) 10Hnowlan: [C:03+2] mw-jobrunner: set more php-specific settings to match metal instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019823 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [09:41:53] (03PS2) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 [09:42:32] (03Merged) 10jenkins-bot: mw-jobrunner: set more php-specific settings to match metal instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019823 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [09:43:15] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Evaluate options for non-root operations with cumin and spicerack cookbooks - https://phabricator.wikimedia.org/T244840#9717060 (10Volans) Cumin is currently working with the running user from the `cuminunpriv1001` host (after a kinit) towards... 
[09:46:50] (03PS1) 10JMeybohm: k8s: Enable audit logging for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/1020186 (https://phabricator.wikimedia.org/T273507) [09:48:05] !log hnowlan@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [09:48:05] !log hnowlan@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [09:48:33] (03CR) 10Filippo Giunchedi: "I don't think we should be removing these links, I find it quite useful to have links straight from the alert" [puppet] - 10https://gerrit.wikimedia.org/r/1019844 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [09:48:57] !log hnowlan@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [09:49:02] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1020186 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:49:34] (03PS3) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 [09:49:58] !log hnowlan@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [09:53:17] (03PS1) 10JMeybohm: admin_ng: Enable restriced PSS profile in audit mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020187 (https://phabricator.wikimedia.org/T273507) [09:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P60598 and previous config saved to /var/cache/conftool/dbconfig/20240416-095432-marostegui.json [09:54:52] !log hnowlan@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [09:54:52] !log hnowlan@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [09:55:29] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'db2105 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60599 and previous config saved to /var/cache/conftool/dbconfig/20240416-095528-root.json [09:55:38] !log hnowlan@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [09:55:47] (03CR) 10Jcrespo: [C:03+2] mariadb: Reenable backups and delete puppet 7 host config [puppet] - 10https://gerrit.wikimedia.org/r/1020176 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [09:56:39] !log hnowlan@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [09:58:51] (03CR) 10Clément Goubert: [C:03+1] k8s: Enable audit logging for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/1020186 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:59:19] (03CR) 10Clément Goubert: [C:03+1] admin_ng: Enable restriced PSS profile in audit mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020187 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:59:25] (03CR) 10Cathal Mooney: "LGTM overall. One typo I'll fix and upload a new patch set then merge." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1000) [10:00:10] (03CR) 10JMeybohm: [V:03+1 C:03+2] k8s: Enable audit logging for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/1020186 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [10:00:51] (03CR) 10JMeybohm: [C:03+2] admin_ng: Enable restriced PSS profile in audit mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020187 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [10:03:49] (03Merged) 10jenkins-bot: admin_ng: Enable restriced PSS profile in audit mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020187 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [10:04:10] (03PS9) 10JMeybohm: kubernetes::node: Add support for the SeccompDefault feature gate [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) [10:06:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:07:20] !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:07:25] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:06] !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:08:07] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[10:08:18] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:08:19] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:08:42] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:08:43] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:09:27] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:09:28] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:09:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P60600 and previous config saved to /var/cache/conftool/dbconfig/20240416-100939-marostegui.json [10:10:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60601 and previous config saved to /var/cache/conftool/dbconfig/20240416-101034-root.json [10:10:44] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:10:46] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:12:46] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:12:48] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:13:16] !log uploaded PHP 7.4 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2+icu67u2 to buster-wikimedia/component/icu67 T362511 [10:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:39] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[10:15:38] !log updated rsyslog to 8.2404.0-1~bpo11+1 on all k8s nodes - T357616 [10:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:45] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 [10:16:36] (03CR) 10Volans: Puppet: add magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [10:17:51] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [10:19:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:20:01] !log upgrading PHP on remaining mwdebug servers T362511 [10:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:39] (03CR) 10Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:24:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T361627)', diff saved to https://phabricator.wikimedia.org/P60602 and previous config saved to /var/cache/conftool/dbconfig/20240416-102447-marostegui.json [10:24:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2189.codfw.wmnet with reason: Maintenance [10:24:50] 06SRE, 10Cumin, 06Infrastructure-Foundations: Feature request: When cumin is running with -b (and -s), it should display the current host being affected - https://phabricator.wikimedia.org/T355811#9717278 (10Volans) I see only one case where the implementation is straightforward and clean on the UI side, the... 
[10:24:52] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:25:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2189.codfw.wmnet with reason: Maintenance [10:25:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T361627)', diff saved to https://phabricator.wikimedia.org/P60603 and previous config saved to /var/cache/conftool/dbconfig/20240416-102510-marostegui.json [10:25:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60604 and previous config saved to /var/cache/conftool/dbconfig/20240416-102540-root.json [10:27:47] (03PS1) 10Filippo Giunchedi: alertmanager: avoid ferm-specific syntax for irc webhook [puppet] - 10https://gerrit.wikimedia.org/r/1020189 [10:27:57] (03PS9) 10Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) [10:30:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T361627)', diff saved to https://phabricator.wikimedia.org/P60605 and previous config saved to /var/cache/conftool/dbconfig/20240416-103040-marostegui.json [10:30:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:33:18] (03CR) 10Hnowlan: [C:03+1] mw-debug: set MCROUTER_SERVER variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:34:12] (ProbeDown) firing: (2) Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:34:35] kubemaster issues? [10:34:40] jayme: ^ ? [10:34:44] the auditlog thing? [10:34:44] acking [10:34:57] very plausible [10:35:12] no immediate impact, right? [10:35:16] nope [10:35:20] thanks [10:35:28] * jayme double checking [10:35:55] (03PS2) 10Hnowlan: restbase: migrate to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1019290 (https://phabricator.wikimedia.org/T360636) [10:36:55] (03PS4) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 [10:37:26] there are lots of connections to it on port 6443 [10:37:49] topranks: what source? [10:38:29] jynus: let me paste the output - many and varied [10:38:50] yeah, I can see them, seems its regular traffic [10:39:22] yeah to my untrained eyes looks very normal [10:39:22] https://phabricator.wikimedia.org/P60606 [10:39:57] (03CR) 10Effie Mouzeli: [C:03+2] mw-debug: set MCROUTER_SERVER variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:40:37] yeah, all should be fine - sorry for the page [10:40:53] apiserver is sometimes taking too long on restarts (https://phabricator.wikimedia.org/T358936) [10:41:00] topranks: I think we should worry more about the cpu usage [10:41:13] https://grafana.wikimedia.org/goto/nTO4RE-Sg?orgId=1 [10:41:32] yeah that is likely the cause [10:41:54] kube-apiserver maxing out all cores [10:41:59] yep [10:42:21] (03Merged) 10jenkins-bot: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:42:27] 06SRE, 10Bitu, 06DBA, 06Infrastructure-Foundations: Database request for 
Bitu Cloud DEV installation - https://phabricator.wikimedia.org/T362619#9717315 (10SLyngshede-WMF) [10:43:11] (ProbeDown) firing: (3) Service miscweb1003:30443 has failed probes (http_security_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:43:50] journal for that service looks relatively normal - nothing that jumps out as being errors [10:43:54] https://www.irccloud.com/pastebin/QJZTTtKm/ [10:44:12] (ProbeDown) resolved: (3) Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:44:14] a lot of those repeated [10:44:46] resolved now, but the miscweb thing is worrying [10:45:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:45:39] looking ^ [10:45:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P60607 and previous config saved to /var/cache/conftool/dbconfig/20240416-104547-marostegui.json [10:45:52] cpu use now dropped to ~50%, but that was higher than before it skyrocketed [10:45:54] so I believe those could be fallout [10:46:08] miscweb, jobrunner [10:46:10] yeah could be, errors already dropped back to baseline [10:46:37] https://grafana-rw.wikimedia.org/d/VLFehqB4z/node-detail-cathal?forceLogin=&orgId=1&refresh=5m&var-instance=kubemaster1002%3A9100&var-Mountpoint=All&var-netdev=All&var-num_cpus=4&var-chip=All&from=1713242775415&to=1713264375415&viewPanel=31 [10:46:45] 
seems "normal" usage level is about 25% cpu [10:47:41] I don't see calico-node failures/restarts [10:48:11] (ProbeDown) resolved: (8) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:48:15] for some reason here it dropped to like 10% at ~10:15, then maxed itself at ~10:30 [10:48:23] so there was a bump at the end [10:48:41] dropping is shutting down apiserver I suppose, the bump after is the startup [10:48:44] could you share what you did jayme- I would like to know in case it happens again? [10:49:04] I basically restarted apiservers [10:49:08] jayme: yep that makes perfect sense [10:49:34] jayme: and yep calico seems healthy - at least on network side all the BGP sessions to core routers are established 20+ hours [10:49:45] but just the plain systemd? [10:49:48] I can see calico kube-controllers being restarted, but that is not an issue for BGP sessions [10:50:15] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:50:20] jynus: kube-apiserver-safe-restart.service was triggered because of a config change rolled out via puppet [10:50:45] ah ok. so this was simply caused by the service restart? [10:50:47] sorry, I meant for solving it :-D [10:50:51] nothing [10:51:02] it's the restart on its own [10:51:18] I see, so the restart was not as smooth as normally, gotcha [10:51:20] as said, it takes too long sometimes. Leading to the alert being fired [10:51:23] is it the case it takes too long - i.e. stops the service for too long, and then when it restarts there is a backlog of requests which maxes the CPU / causes the probe failures etc? 
[10:51:56] let me get the ticket so we can note to the other people on call so they are aware [10:51:59] before returning to normal once the backlog is processed? [10:52:01] no, I think it's all the connection handling after coming back topranks [10:52:04] thank you for being around, jayme [10:52:08] (03PS5) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 [10:52:21] jayme: ok right. so not a backlog of requests [10:52:31] but yeah, all the hosts reconnecting creates a spike in usage [10:52:47] I'd like to roll out the cfssl change for restbase - should I hold off or are things calm enough now? [10:53:03] maybe a backlog as well (the ones clients might hold) - but there is nothing like a queue or so in the apiserver itself [10:53:10] gotcha [10:53:12] hnowlan: things should be fine now, unless someone else objects [10:53:18] hnowlan: go ahead [10:53:29] jynus: https://phabricator.wikimedia.org/T358936 is the ticket [10:53:33] thank you also for asking, hnowlan! [10:53:45] but anyway, it sort of makes sense here given what happened, no real "fault" as such just a spike in activity maxing cpu creating temporary resource crunch/probe alerts [10:54:18] yeah...we're planning on changing the apiserver design in https://phabricator.wikimedia.org/T353464 [10:54:24] I'll prioritize that [10:54:40] jouncebot: now [10:54:40] For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1000) [10:54:48] jouncebot: next [10:54:48] In 1 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1200) [10:54:57] cool. 
being on call is never less than educational :) [10:55:13] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:55:19] * topranks wonders with beefier hardware for ganeti if we could up that 12GB limit [10:55:40] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:55:54] jynus: are you going to add something to that task to report today's blip? I can do it also [10:56:02] !log disabling puppet on A:restbase before switching to cfssl [10:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:08] topranks: thanks, I will [10:56:19] jynus: cheers! [10:56:26] I was more thinking on making a mental note for the handover [10:56:27] (03CR) 10Hnowlan: [C:03+2] restbase: migrate to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1019290 (https://phabricator.wikimedia.org/T360636) (owner: 10Hnowlan) [10:58:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1042.eqiad.wmnet [10:58:56] (03PS1) 10EoghanGaffney: phabricator: Switch certificate generation to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) [11:00:32] BTW, the fact that you also have your own custom node panel, topranks, tells me we need to make a better one for everybody (I also lack lots of important io metrics)! 
[11:00:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P60608 and previous config saved to /var/cache/conftool/dbconfig/20240416-110055-marostegui.json [11:01:08] there isn't much good on it, I should get rid of it, I made it to play with prometheus / learn grafana [11:01:28] but then I end up using it - I think it's covered in the normal node panel anyway [11:01:39] (03CR) 10Muehlenhoff: phabricator: Switch certificate generation to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [11:02:18] (03PS2) 10EoghanGaffney: phabricator: Switch certificate generation to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) [11:02:31] (03CR) 10EoghanGaffney: phabricator: Switch certificate generation to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [11:03:36] 06SRE, 10Prod-Kubernetes, 06serviceops: Kubernetes apiserver probe failures on restart - https://phabricator.wikimedia.org/T358936#9717390 (10JMeybohm) We had this happening again in eqiad today because of a (planned) apiserver safe restart. We'll prioritize {T353464} to give more resources to wikikube apise... [11:04:10] https://phabricator.wikimedia.org/T358936#9717389 [11:04:45] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [11:04:53] 06SRE, 10Prod-Kubernetes, 06serviceops: Kubernetes apiserver probe failures on restart - https://phabricator.wikimedia.org/T358936#9717389 (10jcrespo) Hi, today we had another occurrence of this. We didn't consider it a full-blown incident due to the no direct (or almost no) impact on users during the servic... 
[11:05:59] jynus: ah, thanks for the much more sophisticated comment :D [11:06:00] (03PS6) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 [11:07:20] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1042.eqiad.wmnet [11:08:50] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [11:09:35] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1927/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020185 (owner: 10Jelto) [11:14:00] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting kerberos identity for Surbhi Gupta - https://phabricator.wikimedia.org/T362602#9717418 (10WDoranWMF) Approved [11:14:31] restbase migration is complete [11:14:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1020189 (owner: 10Filippo Giunchedi) [11:16:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T361627)', diff saved to https://phabricator.wikimedia.org/P60609 and previous config saved to /var/cache/conftool/dbconfig/20240416-111602-marostegui.json [11:16:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [11:16:09] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:16:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [11:16:29] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:16:37] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:17:57] 06SRE, 
06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9717440 (10hnowlan) [11:18:09] (03PS1) 10Effie Mouzeli: memcached/mcrouter: remove onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) [11:18:18] hnowlan: gg :D [11:19:52] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9717443 (10Aitolkyn) >>! In T362533#9716944, @Urbanecm_WMF wrote: >>>! In T362533#9714116, @Aitolkyn wrote: >>>>! In T362533#9713681, @ssingh wrote: >>>>>! In T362533#9713602,... [11:19:57] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2169.codfw.wmnet [11:21:10] (03CR) 10CI reject: [V:04-1] memcached/mcrouter: remove onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) (owner: 10Effie Mouzeli) [11:21:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2207.codfw.wmnet with reason: Maintenance [11:21:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2207.codfw.wmnet with reason: Maintenance [11:21:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T361627)', diff saved to https://phabricator.wikimedia.org/P60610 and previous config saved to /var/cache/conftool/dbconfig/20240416-112134-marostegui.json [11:21:39] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:21:41] (03PS1) 10Muehlenhoff: Switch db2169 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020192 (https://phabricator.wikimedia.org/T349619) [11:21:47] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s@codfw in state pending-upgrade - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:23:13] hnowlan: this sounded a lot like: https://www.101soundboards.com/sounds/182026-upgrade-complete (starcraft sounds) [11:24:26] <3 [11:26:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T361627)', diff saved to https://phabricator.wikimedia.org/P60611 and previous config saved to /var/cache/conftool/dbconfig/20240416-112648-marostegui.json [11:26:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:30:49] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [11:31:47] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:41:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P60612 and previous config saved to /var/cache/conftool/dbconfig/20240416-114155-marostegui.json [11:43:24] (03CR) 10Muehlenhoff: [C:03+2] Switch db2169 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020192 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:48:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2169.codfw.wmnet [11:50:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2151.codfw.wmnet [11:51:46] (03PS1) 10Muehlenhoff: 
Switch db2151 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020195 (https://phabricator.wikimedia.org/T349619) [11:52:07] (03PS1) 10Cathal Mooney: Reverse DNS changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [11:53:00] (03CR) 10CI reject: [V:04-1] Reverse DNS changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [11:53:32] (03CR) 10Muehlenhoff: [C:03+2] Switch db2151 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020195 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:55:53] (03CR) 10Filippo Giunchedi: [C:03+2] alertmanager: avoid ferm-specific syntax for irc webhook [puppet] - 10https://gerrit.wikimedia.org/r/1020189 (owner: 10Filippo Giunchedi) [11:56:26] (03PS2) 10Cathal Mooney: Reverse DNS changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [11:57:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P60613 and previous config saved to /var/cache/conftool/dbconfig/20240416-115703-marostegui.json [11:57:15] (03CR) 10CI reject: [V:04-1] Reverse DNS changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [11:57:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2151.codfw.wmnet [11:58:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2180.codfw.wmnet [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1200) [12:03:07] (03PS1) 10Muehlenhoff: alertmanager: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1020198 [12:04:04] (03PS1) 10Muehlenhoff: 
Switch db2180 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020199 (https://phabricator.wikimedia.org/T349619) [12:04:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1020198 (owner: 10Muehlenhoff) [12:05:36] (03CR) 10Muehlenhoff: [C:03+2] Switch db2180 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020199 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:06:39] (03PS7) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 [12:09:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2180.codfw.wmnet [12:09:35] (03PS1) 10Filippo Giunchedi: alertmanager: fix srange for irc webhook [puppet] - 10https://gerrit.wikimedia.org/r/1020200 [12:10:26] (03CR) 10Ssingh: [C:03+1] Puppet: add magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [12:11:12] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1020200 (owner: 10Filippo Giunchedi) [12:11:20] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1932/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020185 (owner: 10Jelto) [12:12:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T361627)', diff saved to https://phabricator.wikimedia.org/P60615 and previous config saved to /var/cache/conftool/dbconfig/20240416-121211-marostegui.json [12:12:16] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:12:25] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1934/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020200 (owner: 10Filippo Giunchedi) [12:13:33] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2124.codfw.wmnet [12:13:53] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Superset for aitolkyn - https://phabricator.wikimedia.org/T362533#9717676 (10ssingh) Thanks indeed @Urbanecm_WMF! Nice catch. @Aitolkyn: the contract expiry and date have been updated. If this has been resolved for you, please feel free to th... [12:15:06] (03PS1) 10Muehlenhoff: Switch db2124 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020201 (https://phabricator.wikimedia.org/T349619) [12:15:22] (03CR) 10Filippo Giunchedi: alertmanager: Avoid Ferm-specific syntax (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1020198 (owner: 10Muehlenhoff) [12:15:53] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] alertmanager: fix srange for irc webhook [puppet] - 10https://gerrit.wikimedia.org/r/1020200 (owner: 10Filippo Giunchedi) [12:16:34] (03PS2) 10Muehlenhoff: alertmanager: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1020198 [12:16:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1020198 (owner: 10Muehlenhoff) [12:17:07] (03PS1) 10Ssingh: reverse-proxy: use larger subnets for eqiad/codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020202 [12:17:12] (03CR) 10Muehlenhoff: [C:03+2] Switch db2124 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020201 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:21:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2124.codfw.wmnet [12:22:25] (03PS1) 10Vgutierrez: prometheus: make lvs-realserver-mss work on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020203 
(https://phabricator.wikimedia.org/T357258) [12:23:31] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1020198 (owner: 10Muehlenhoff) [12:25:36] (03PS2) 10Vgutierrez: prometheus: make lvs-realserver-mss work on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020203 (https://phabricator.wikimedia.org/T357258) [12:28:56] (03CR) 10Vgutierrez: "manually tested on ncredir1001:" [puppet] - 10https://gerrit.wikimedia.org/r/1020203 (https://phabricator.wikimedia.org/T357258) (owner: 10Vgutierrez) [12:31:14] (03CR) 10Ssingh: [C:03+1] prometheus: make lvs-realserver-mss work on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020203 (https://phabricator.wikimedia.org/T357258) (owner: 10Vgutierrez) [12:32:04] (03PS1) 10Filippo Giunchedi: icinga: fix and simplify rsync firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1020205 [12:32:33] (03CR) 10CI reject: [V:04-1] icinga: fix and simplify rsync firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1020205 (owner: 10Filippo Giunchedi) [12:32:55] (03CR) 10Vgutierrez: [C:03+2] prometheus: make lvs-realserver-mss work on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020203 (https://phabricator.wikimedia.org/T357258) (owner: 10Vgutierrez) [12:33:17] jouncebot: next [12:33:17] In 0 hour(s) and 26 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1300) [12:34:30] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1936/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020205 (owner: 10Filippo Giunchedi) [12:40:59] (03PS1) 10Effie Mouzeli: mw-api-int: use mcrouter daemonset on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020207 (https://phabricator.wikimedia.org/T346690) [12:43:37] !log repool ncredir1002 [12:45:17] (03PS2) 10Filippo Giunchedi: icinga: fix and 
simplify rsync firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1020205 [12:45:42] (03CR) 10Filippo Giunchedi: icinga: fix and simplify rsync firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1020205 (owner: 10Filippo Giunchedi) [12:51:38] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-api-int: use mcrouter daemonset on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020207 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:51:55] 06SRE, 06Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411#9717800 (10cmooney) >>! In T347411#9203210, @cmooney wrote: > Some things that may be possible, if still trying to predict the names from redfish data: > # Change... [12:52:15] (03CR) 10Effie Mouzeli: [C:03+2] mw-api-int: use mcrouter daemonset on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020207 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:53:58] (03Merged) 10jenkins-bot: mw-api-int: use mcrouter daemonset on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020207 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:55:41] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:56:55] (03PS5) 10Cathal Mooney: Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [12:56:59] (03CR) 10Elukey: "Hellooo! 
IIUC Aiko and Fabian are trying to test the FIFO queue to run Hadoop jobs on GPU hardware (we still have two workers with a GPU e" [puppet] - 10https://gerrit.wikimedia.org/r/1019683 (https://phabricator.wikimedia.org/T361499) (owner: 10Joal) [12:57:55] (03PS6) 10Cathal Mooney: Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [12:57:55] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:58:59] (03CR) 10Cathal Mooney: Add magru to homer-public (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [12:59:57] (ProbeDown) firing: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1300). [13:00:05] Msz2001 and NMW03: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] checking [13:00:15] (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:00:17] I did that [13:00:23] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting kerberos identity for Surbhi Gupta - https://phabricator.wikimedia.org/T362602#9717857 (10ssingh) [13:00:31] o/ [13:00:32] this is me [13:00:36] I was about to ask, anything to revert? [13:00:36] please hangon [13:00:39] oook [13:00:43] standing by [13:00:49] I can deploy, but standing by [13:01:04] Lucas_WMDE: wait until incident over, please [13:01:08] ack [13:01:10] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [13:01:19] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [13:01:45] This is my first time I submitted a config patch to be deployed, so sorry for any obvious questions :) [13:02:05] Msz2001: we have an ongoing incident, blocking mw deploys until things are stable [13:02:09] ok it should come back in a wee bit [13:02:21] effie: thanks! [13:02:30] memcached errors are down to zero afaics [13:02:44] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9717872 (10Papaul) 05Open→03Resolved Since Monday I setup in rack D1 and D2 the juniper switch as management switch and... [13:02:55] ah no too soon [13:03:03] effie: capacity issue or something else? 
[13:03:11] cdanis: no, pebcak [13:03:32] give me a few secs, and I will revert [13:03:37] ok :) [13:03:55] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9717882 (10Papaul) @ssingh unfortunately using the fs DAC didn't fix the issue. So we are back to zero. I am still working on it [13:04:57] (ProbeDown) resolved: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2204.codfw.wmnet with reason: Maintenance [13:05:10] alright, thank you for your patience [13:05:15] (MediaWikiMemcachedHighErrorRate) resolved: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:05:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2204.codfw.wmnet with reason: Maintenance [13:06:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting kerberos identity for Surbhi Gupta - https://phabricator.wikimedia.org/T362602#9717899 (10ssingh) [13:06:34] !log depool ncredir1001 [13:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:44] Lucas_WMDE: please go ahead [13:06:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:06:51] alright, thanks for the ping! 
[13:06:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:06:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:07:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:07:04] Msz2001: we can go ahead now [13:07:09] thank you, effie [13:07:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T361627)', diff saved to https://phabricator.wikimedia.org/P60616 and previous config saved to /var/cache/conftool/dbconfig/20240416-130710-marostegui.json [13:07:15] okay [13:07:15] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:07:47] Msz2001: do you know how you can test the config change? 
[13:08:14] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9717912 (10Papaul) a:05Jhancock.wm→03Papaul [13:08:19] Yes, I have WikimediaDebug and I have prepared setting in preferences to toggle just to be changed to see the effect [13:08:32] great, that answers both the things I had in mind :) [13:08:43] (03PS4) 10Msz2001: Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019176 (https://phabricator.wikimedia.org/T362414) [13:09:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019176 (https://phabricator.wikimedia.org/T362414) (owner: 10Msz2001) [13:09:30] (03PS1) 10Ssingh: admin: set kerberos for sg912 [puppet] - 10https://gerrit.wikimedia.org/r/1020211 (https://phabricator.wikimedia.org/T362602) [13:09:44] (03PS2) 10Ilias Sarantopoulos: ml-services: fix indentation in mistral model resources and increase memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018646 (https://phabricator.wikimedia.org/T357986) [13:10:22] (03Merged) 10jenkins-bot: Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019176 (https://phabricator.wikimedia.org/T362414) (owner: 10Msz2001) [13:11:11] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1019176|Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia (T362414)]] [13:11:14] !log pool ncredir1001 [13:11:16] T362414: Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia - https://phabricator.wikimedia.org/T362414 [13:11:19] !log depool ncredir2001 [13:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:27] I can also see the difference in behavior between enwiki and wikidatawiki [13:11:33] (only the former complains about in a signature) [13:11:47] Yes, I've checked it there before making the patch [13:12:04] My prepared one uses but it's the same case [13:13:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T361627)', diff saved to https://phabricator.wikimedia.org/P60617 and previous config saved to /var/cache/conftool/dbconfig/20240416-131344-marostegui.json [13:13:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting kerberos identity for Surbhi Gupta - https://phabricator.wikimedia.org/T362602#9717931 (10BTullis) [13:13:54] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:14:24] !log lucaswerkmeister-wmde@deploy1002 msz2001 and lucaswerkmeister-wmde: Backport for [[gerrit:1019176|Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia (T362414)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:47] (03CR) 10Ssingh: [C:03+2] admin: set kerberos for sg912 [puppet] - 10https://gerrit.wikimedia.org/r/1020211 (https://phabricator.wikimedia.org/T362602) (owner: 10Ssingh) [13:15:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting kerberos identity for Surbhi Gupta - https://phabricator.wikimedia.org/T362602#9717940 (10BTullis) I have created the principal for Surbhi. ` btullis@krb1001:~$ sudo sudo manage_principals.py get sg912 get_principal: Principal... [13:15:45] I've just checked and can confirm that the patch has intended effect [13:16:06] Lucas_WMDE what happened yesterday's deployment by the way. I had to leave IRC, so I couldn't follow up [13:16:06] great, thanks! 
[13:16:13] !log lucaswerkmeister-wmde@deploy1002 msz2001 and lucaswerkmeister-wmde: Continuing with sync [13:16:24] NMW03: I don’t remember, tbh… I’d have to check the IRC log [13:16:54] ah, right, all deployments were blocked [13:17:09] !log pool ncredir2001 [13:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:26] but that should be resolved now [13:17:45] I moved my patch to today's deployment [13:18:26] !log depool ncredir1001 [13:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:07] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting kerberos identity for Surbhi Gupta - https://phabricator.wikimedia.org/T362602#9717943 (10ssingh) >>! In T362602#9717940, @BTullis wrote: > I have created the principal for Surbhi. > ` > btullis@krb1001:~$ sudo sudo manage_pri... [13:19:27] (03CR) 10Lucas Werkmeister (WMDE): Restrict local uploads to uploader user group in azwiki (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [13:19:43] NMW03: I left some minor comments on the change, but apart from that it should be okay to deploy once the current deployment is done [13:20:22] !log pool ncredir1001 [13:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:26] !log depool ncredir2001 [13:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:27] (03CR) 10Arnaudb: [C:03+1] "you have it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [13:21:54] Lucas_WMDE thanks, let me fix that [13:23:23] !log pool ncredir2001 [13:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:37] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 10database-backups, and 4 others: Decommission db2101 (was: db2101 crashed) - https://phabricator.wikimedia.org/T362311#9717974 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:24:47] 06SRE, 10LDAP-Access-Requests: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9718011 (10Michael) [13:25:15] (03PS7) 10NMW03: Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) [13:27:30] Lucas_WMDE fixed now [13:27:35] thanks! [13:27:47] (03CR) 10NMW03: Restrict local uploads to uploader user group in azwiki (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [13:27:54] (03PS8) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 (https://phabricator.wikimedia.org/T361090) [13:28:23] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [13:28:26] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T362465#9718026 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm alert cleared. 
being decommed in T362438 [13:28:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P60618 and previous config saved to /var/cache/conftool/dbconfig/20240416-132851-marostegui.json [13:29:51] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1019176|Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia (T362414)]] (duration: 18m 39s) [13:29:56] T362414: Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia - https://phabricator.wikimedia.org/T362414 [13:30:05] (03PS8) 10NMW03: Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) [13:30:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [13:30:22] Msz2001: should be live everywhere now [13:30:57] (03Merged) 10jenkins-bot: Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [13:30:58] Thanks for deploying! 
It appears to work without WikimediaDebug, so let's assume it's okay :) [13:31:06] \o/ [13:31:29] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1014648|Restrict local uploads to uploader user group in azwiki (T360847)]] [13:31:37] T360847: Add uploader user group to az.wiki - https://phabricator.wikimedia.org/T360847 [13:33:25] !log depool ncredir2001 [13:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:33] !log lucaswerkmeister-wmde@deploy1002 nmw03 and lucaswerkmeister-wmde: Backport for [[gerrit:1014648|Restrict local uploads to uploader user group in azwiki (T360847)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:34:50] NMW03: please test :) [13:34:51] I get an error on https://az.wikipedia.org/wiki/X%C3%BCsusi:Y%C3%BCkl%C9%99, at least [13:34:52] (03PS3) 10Jcrespo: installserver: Setup db and dbprov hosts back to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) [13:36:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: T360116 [13:36:08] T360116: Upgrade s5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T360116 [13:36:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: T360116 [13:37:32] Lucas_WMDE sorry, what do you mean by test? 
This is my first time lol [13:38:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2123.codfw.wmnet with OS bookworm [13:38:45] ah, good to know ^^ [13:39:03] NMW03: see https://wikitech.wikimedia.org/wiki/WikimediaDebug – you can install the browser extension linked there (Firefox or Chrome) [13:39:13] I already installed that [13:39:24] okay, then enable it and select one of the debug servers (doesn’t matter which one) [13:39:32] and then you should be able to see the change in effect already [13:39:41] and try out whether it behaves as expected or not [13:40:22] do I have to select one of the options? Excimer UI etc. [13:40:48] nope [13:40:59] only set the big toggle to “on” and select one server from the dropdown [13:41:16] probably one of the first two if you’re planning to make any changes (e.g. add yourself to the group) [13:41:30] or k8s-experimental [13:41:42] true [13:41:46] Which we really should rename since it's now 70% of external traffic and not experimental at all x) [13:41:51] I was about to say :P [13:41:58] k8s-futuristic [13:42:12] Lucas_WMDE: is there anything else to deploy for this window? [13:42:15] lol [13:42:19] (03CR) 10Jelto: "wmflib::service::probe::http_module_options and wmflib::service::probe::module_options also need refactoring to accept multiple probes. I'" [puppet] - 10https://gerrit.wikimedia.org/r/1020185 (https://phabricator.wikimedia.org/T361090) (owner: 10Jelto) [13:42:24] it logs "Find in XHGui" messages [13:42:27] effie: not as far as I’m aware [13:42:33] great, tx [13:42:39] should I ping you when I’m done? 
[13:43:01] (03Abandoned) 10Jelto: prometheus::blackbox::modules::service_catalog: support multiple probes [puppet] - 10https://gerrit.wikimedia.org/r/1020185 (https://phabricator.wikimedia.org/T361090) (owner: 10Jelto) [13:43:02] NMW03: you can ignore those, I think [13:43:43] (03Abandoned) 10Jelto: miscweb/service::catalog: move blackbox checks to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1019039 (https://phabricator.wikimedia.org/T361090) (owner: 10Jelto) [13:43:44] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [13:43:47] (03PS4) 10Ssingh: Puppet: add magru [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [13:43:59] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [13:43:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P60619 and previous config saved to /var/cache/conftool/dbconfig/20240416-134358-marostegui.json [13:44:07] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [13:44:17] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [13:45:17] (03CR) 10CI reject: [V:04-1] Puppet: add magru [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [13:46:11] (03CR) 10Filippo Giunchedi: Puppet: add magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [13:46:43] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [13:48:05] Lucas_WMDE there is nothing else except "Find in XHGui" ¯\_(ツ)_/¯ [13:48:19] NMW03: I’m not sure what you mean [13:48:43] you should see something like https://wikitech.wikimedia.org/wiki/File:WikimediaDebug_v2_on.png [13:48:52] (note, you’ll need to be on a Wikimedia site for this, e.g. 
az.wikipedia.org) [13:49:21] and after you enable that, you can just browse the wiki normally, and all the requests will go to the debug server [13:49:36] (at least until the extension turns itself off again after 15 minutes) [13:49:41] I am talking about this https://i.imgur.com/WnxZCQS.png [13:49:48] Oh, I didn't know that [13:50:12] okay, that looks right [13:50:17] you can just close that and use the wiki then [13:50:52] so this extension just move requests to testing server [13:50:58] to check new change, right? [13:51:10] (instead of production server) [13:51:18] 10ops-codfw, 06SRE, 06cloud-services-team, 10decommission-hardware, 13Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9718113 (10Jhancock.wm) @Papaul @Andrew what are we doing with cloudbackup2001-array1 and cloudbackup2002-array1? [13:51:21] yes, exactly [13:51:43] hm, I guess you might not be able to easily test that normal users can’t upload anymore, since sysops still have the upload right [13:51:49] unless you have a secondary account or something [13:51:59] I can test it for him. [13:52:02] Give me a moment. [13:52:18] (03CR) 10Ssingh: "volans: I have updated the CR with your comments about check_cumin_aliases." [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [13:52:41] Don't forget that "Upload" in sidebar redirects you to commons [13:52:50] you can go to Xüsusi:Yüklə [13:52:57] Looks good Lucas_WMDE, you can backport patch. [13:52:57] (thanks by the way) [13:53:06] sounds good, thank you both! [13:53:08] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting kerberos identity for Surbhi Gupta - https://phabricator.wikimedia.org/T362602#9718122 (10ssingh) 05Open→03Resolved Marking this as resolved; if `kinit` doesn't work for you or if there are any issues, please re-open t... 
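Editor's note on the exchange above: the WikimediaDebug extension routes your browsing to a debug server by attaching a request header. A minimal sketch of the same idea outside the browser follows. It assumes the header is named `X-Wikimedia-Debug` and takes a `backend=<host>` value, as described on the wikitech WikimediaDebug page. The host name used here is a placeholder, not one taken from this log, and the request is only constructed, never sent.

```python
from urllib.request import Request

# Placeholder backend name (an assumption for illustration); pick a real one
# from the dropdown in the WikimediaDebug extension.
ASSUMED_BACKEND = "mwdebug-placeholder.example"


def debug_headers(backend: str) -> dict[str, str]:
    """Build the header that asks for a specific debug backend.

    The 'backend=<host>' value format is an assumption based on the
    wikitech documentation, not something stated in this log.
    """
    return {"X-Wikimedia-Debug": f"backend={backend}"}


def build_debug_request(url: str, backend: str = ASSUMED_BACKEND) -> Request:
    """Return a GET request carrying the debug header (nothing is sent here)."""
    return Request(url, headers=debug_headers(backend), method="GET")


# Example: the azwiki upload page discussed in the conversation.
request = build_debug_request("https://az.wikipedia.org/wiki/Special:Upload")
```

As noted in the chat, the browser extension switches itself off again after 15 minutes; a script like this has no such safety net, so it sends the header only for as long as you choose to attach it.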
[13:53:09] !log lucaswerkmeister-wmde@deploy1002 nmw03 and lucaswerkmeister-wmde: Continuing with sync [13:53:10] I went directly to Special:Upload NMW03, don't worry. [13:53:36] Great Lucas_WMDE, thank you too. [13:53:51] (btw, I don't understand why link in sidebar redirects to Commons, when local upload is possible) [13:54:22] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Rename X-Wikimedia-Debug k8s-experimental option - https://phabricator.wikimedia.org/T362662 (10Clement_Goubert) 03NEW [13:54:29] I don't understand that either. It was a weird consensus of local community [13:54:30] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Rename X-Wikimedia-Debug k8s-experimental option - https://phabricator.wikimedia.org/T362662#9718155 (10Clement_Goubert) p:05Triage→03Low [13:55:32] (03PS11) 10Klausman: deployment_server: Change Puppet query for ML Cassandra Clusters [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) [13:56:01] (03CR) 10Jcrespo: [C:03+2] installserver: Setup db and dbprov hosts back to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [13:56:08] (03PS4) 10Jcrespo: installserver: Setup db and dbprov hosts back to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) [13:56:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: host reimage [13:56:12] (03CR) 10Muehlenhoff: [C:03+1] "Fix is correct and good to merge, but see inline comment for an alternative solution" [puppet] - 10https://gerrit.wikimedia.org/r/1020205 (owner: 10Filippo Giunchedi) [13:57:36] (03PS2) 10Samtar: IS: Set Phonos to Inline Audio Player mode on test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980446 [13:57:48] jouncebot: nowandnext [13:57:49] For the next 0 hour(s) and 
2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1300) [13:57:49] In 1 hour(s) and 2 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1500) [13:58:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: host reimage [13:59:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T361627)', diff saved to https://phabricator.wikimedia.org/P60620 and previous config saved to /var/cache/conftool/dbconfig/20240416-135906-marostegui.json [13:59:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:59:12] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:59:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:59:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T361627)', diff saved to https://phabricator.wikimedia.org/P60621 and previous config saved to /var/cache/conftool/dbconfig/20240416-135928-marostegui.json [14:00:08] Lucas_WMDE: can you ping me when you're done with the deploy? 
Just want to quickly get https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/980446 out :) [14:01:26] TheresNoTime: effie was also interested in when the deploy would be done [14:01:40] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [14:01:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T361627)', diff saved to https://phabricator.wikimedia.org/P60622 and previous config saved to /var/cache/conftool/dbconfig/20240416-140142-marostegui.json [14:01:57] (though she didn’t say why, unless I missed it :P) [14:01:59] (03PS1) 10CDanis: force enable etcd v2 proto [software/conftool] - 10https://gerrit.wikimedia.org/r/1020224 [14:02:20] Lucas_WMDE: ack, will let them go first if needed :) [14:02:39] TheresNoTime: thank you, I will make a quick change and deploy [14:02:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2114.codfw.wmnet [14:04:00] 10ops-codfw, 06SRE, 06cloud-services-team, 10decommission-hardware, 13Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9718218 (10Andrew) >>! In T362438#9718112, @Jhancock.wm wrote: > what are we doing with cloudbackup2001-array1 and cloudbackup200... 
[14:04:08] TheresNoTime: let me join the queue :-) [14:04:29] (03PS1) 10Ssingh: admin: add migr to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1020225 (https://phabricator.wikimedia.org/T362618) [14:04:42] taavi: :p mine is only a very small config patch, so feel free to do yours after ef/fie [14:06:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:33] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1014648|Restrict local uploads to uploader user group in azwiki (T360847)]] (duration: 35m 04s) [14:06:38] effie: over to you :) [14:06:43] T360847: Add uploader user group to az.wiki - https://phabricator.wikimedia.org/T360847 [14:06:45] jouncebot: nowandnext [14:06:45] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [14:06:45] In 0 hour(s) and 53 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1500) [14:06:47] ok [14:06:57] Lucas_WMDE: TheresNoTime tx tx [14:07:09] I will let you lot know [14:07:25] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:14] (03CR) 10Ssingh: [C:03+2] admin: add migr to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1020225 (https://phabricator.wikimedia.org/T362618) (owner: 10Ssingh) [14:10:23] (03PS3) 10Filippo Giunchedi: icinga: use rsync::server::module auto_firewall [puppet] - 10https://gerrit.wikimedia.org/r/1020205 [14:10:34] (03CR) 10Jcrespo: [V:03+2 C:03+2] installserver: Setup db and dbprov hosts back to reuse recipe [puppet] 
- 10https://gerrit.wikimedia.org/r/1019816 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [14:11:03] (03CR) 10CI reject: [V:04-1] icinga: use rsync::server::module auto_firewall [puppet] - 10https://gerrit.wikimedia.org/r/1020205 (owner: 10Filippo Giunchedi) [14:12:31] (03PS4) 10Filippo Giunchedi: icinga: use rsync::server::module auto_firewall [puppet] - 10https://gerrit.wikimedia.org/r/1020205 [14:13:19] (03PS1) 10Muehlenhoff: Switch db2114 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020248 (https://phabricator.wikimedia.org/T349619) [14:15:03] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1948/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020205 (owner: 10Filippo Giunchedi) [14:16:13] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] "Went with your suggestion in PS4" [puppet] - 10https://gerrit.wikimedia.org/r/1020205 (owner: 10Filippo Giunchedi) [14:16:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P60623 and previous config saved to /var/cache/conftool/dbconfig/20240416-141649-marostegui.json [14:17:19] (03CR) 10Muehlenhoff: [C:03+2] Switch db2114 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020248 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:19:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2123.codfw.wmnet with OS bookworm [14:20:05] (ProbeDown) firing: (17) Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:24] 17 [14:20:33] (03PS1) 10Effie Mouzeli: mw-api-int: use mcrouter daemonset for both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020251 
[14:21:03] aqs? the old aqs? [14:21:14] akosiaris: literally everything is failing probes rn [14:21:16] according to alertmanager [14:21:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2114.codfw.wmnet [14:21:32] so i'm inclined to believe something is wrong with the prober itself [14:21:37] ah, ouch, thanks [14:21:53] ... ofc the grafana dashboard does not agree [14:22:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:22:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:22:29] ugh, also the alerts are two months old? I'll take a look [14:22:43] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9718330 (10ssingh) 05Open→03Resolved a:03ssingh Added to `wmf` LDAP group (as well as Phabricator). Please try to access Logstash and... [14:22:45] uh yeah lol just noticed that too [14:22:46] my internet went down, what is going on? [14:22:47] thanks godog [14:22:56] jynus: we think monitoring error, not a real outage [14:23:10] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1187.eqiad.wmnet [14:23:22] jynus: you made a trip to the past. 
[14:23:26] 2 months ago past [14:23:26] :P [14:23:33] Lucas_WMDE: TheresNoTime taavi go ahead [14:23:45] (JobUnavailable) firing: (4) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:23:52] taavi: after you :) [14:23:52] I am super confused [14:24:20] ah, a.w.o just refreshed for me and all pages are gone [14:24:23] TheresNoTime: thanks :-) [14:24:44] godog: did you do anything to fix it or did it just happen on its own? [14:24:59] (03PS1) 10Muehlenhoff: Switch db1187 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020252 (https://phabricator.wikimedia.org/T349619) [14:25:05] (ProbeDown) resolved: (17) Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:13] 🤔 [14:25:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013976 (https://phabricator.wikimedia.org/T360883) (owner: 10Majavah) [14:25:40] akosiaris: no active doing on my part, though I was changing alert* firewall and I think that was the reason, confirming [14:26:25] (03Merged) 10jenkins-bot: Disallow changing email on Wikitech directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013976 (https://phabricator.wikimedia.org/T360883) (owner: 10Majavah) [14:26:33] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [14:26:43] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [14:26:54] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1013976|Disallow changing email on Wikitech directly (T360883)]] [14:26:58] T360883: Disable email address changes in Wikitech - 
https://phabricator.wikimedia.org/T360883 [14:27:04] godog: it was only ipv4 alerts, not ipv6, so I did suspect something like firewall/network [14:27:08] (03CR) 10Muehlenhoff: [C:03+2] Switch db1187 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020252 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:27:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [14:28:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [14:28:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T360332)', diff saved to https://phabricator.wikimedia.org/P60624 and previous config saved to /var/cache/conftool/dbconfig/20240416-142808-arnaudb.json [14:28:13] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:28:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 1%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60625 and previous config saved to /var/cache/conftool/dbconfig/20240416-142840-arnaudb.json [14:28:45] (JobUnavailable) resolved: (4) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:20] cdanis: indeed [14:29:43] (03PS2) 10CDanis: force enable etcd v2 proto [software/conftool] - 10https://gerrit.wikimedia.org/r/1020224 [14:29:50] 10ops-codfw, 06SRE, 06cloud-services-team, 10decommission-hardware, 13Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9718387 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:29:56] !log taavi@deploy1002 taavi: Backport for [[gerrit:1013976|Disallow changing email on Wikitech directly 
(T360883)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:30:12] 10ops-codfw, 06SRE, 06cloud-services-team, 10decommission-hardware, 13Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9718400 (10Jhancock.wm) ty! [14:30:18] !log taavi@deploy1002 taavi: Continuing with sync [14:31:26] so what I think happened is that ferm was effectively and silently stuck on some broken configuration/rules, once I fixed that the alertmanager cluster might have been getting confused and old notifications got out [14:31:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T360332)', diff saved to https://phabricator.wikimedia.org/P60626 and previous config saved to /var/cache/conftool/dbconfig/20240416-143126-arnaudb.json [14:31:33] (03CR) 10Alexandros Kosiaris: [C:03+1] "Adding Scott too. LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/1020224 (owner: 10CDanis) [14:31:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P60627 and previous config saved to /var/cache/conftool/dbconfig/20240416-143157-marostegui.json [14:32:05] (03PS2) 10Effie Mouzeli: mediawiki deployments: use mcrouter daemonset for both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020251 [14:32:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1187.eqiad.wmnet [14:32:56] !log pool ncredir2001 [14:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:14] !log depool ncredir2002 [14:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:23] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1230 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1019777 (https://phabricator.wikimedia.org/T362668) [14:34:27] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master 
alias [dns] - 10https://gerrit.wikimedia.org/r/1019778 (https://phabricator.wikimedia.org/T362668) [14:36:14] !log pool ncredir2002 [14:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2193.codfw.wmnet [14:37:56] (03PS1) 10Muehlenhoff: Switch db2193 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020257 (https://phabricator.wikimedia.org/T349619) [14:38:16] (03PS1) 10NMW03: Added extendedconfirmed and templateeditor rights to dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019779 (https://phabricator.wikimedia.org/T281860) [14:38:29] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:34] (03CR) 10Muehlenhoff: [C:03+2] Switch db2193 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020257 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:43:04] (03PS1) 10Muehlenhoff: Remove obsolete restbase discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/1020258 (https://phabricator.wikimedia.org/T360636) [14:43:18] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1013976|Disallow changing email on Wikitech directly (T360883)]] (duration: 16m 24s) [14:43:23] TheresNoTime: over to you [14:43:32] taavi: thank you [14:43:40] (03PS3) 10Samtar: IS: Set Phonos to Inline Audio Player mode on test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980446 [14:43:41] T360883: Disable email address changes in Wikitech - https://phabricator.wikimedia.org/T360883 [14:43:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 2%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60628 and previous config saved to 
/var/cache/conftool/dbconfig/20240416-144346-arnaudb.json [14:44:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980446 (owner: 10Samtar) [14:45:10] (03CR) 10Scott French: [C:03+1] "LGTM. Thanks, Chris." [software/conftool] - 10https://gerrit.wikimedia.org/r/1020224 (owner: 10CDanis) [14:45:34] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019780 [14:45:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2193.codfw.wmnet [14:46:03] (03Merged) 10jenkins-bot: IS: Set Phonos to Inline Audio Player mode on test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980446 (owner: 10Samtar) [14:46:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P60629 and previous config saved to /var/cache/conftool/dbconfig/20240416-144634-arnaudb.json [14:46:36] !log samtar@deploy1002 Started scap: Backport for [[gerrit:980446|IS: Set Phonos to Inline Audio Player mode on test.wiki]] [14:46:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1020258 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [14:47:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T361627)', diff saved to https://phabricator.wikimedia.org/P60630 and previous config saved to /var/cache/conftool/dbconfig/20240416-144704-marostegui.json [14:47:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:47:11] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:47:20] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:47:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T361627)', diff saved to https://phabricator.wikimedia.org/P60631 and previous config saved to /var/cache/conftool/dbconfig/20240416-144727-marostegui.json [14:47:48] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1201.eqiad.wmnet [14:48:38] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [14:48:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Post clone', diff saved to https://phabricator.wikimedia.org/P60632 and previous config saved to /var/cache/conftool/dbconfig/20240416-144850-arnaudb.json [14:49:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s3 T362616 [14:49:23] T362616: Switchover s3 master (db2127 -> db2205) - https://phabricator.wikimedia.org/T362616 [14:49:40] !log samtar@deploy1002 samtar: Backport for [[gerrit:980446|IS: Set Phonos to Inline Audio Player mode on test.wiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:49:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s3 T362616 [14:49:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2205 with weight 0 T362616', diff saved to https://phabricator.wikimedia.org/P60633 and previous config saved to /var/cache/conftool/dbconfig/20240416-144957-root.json [14:50:25] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1019771 (https://phabricator.wikimedia.org/T362616) (owner: 10Gerrit maintenance bot) [14:50:58] !log samtar@deploy1002 samtar: Continuing with sync [14:52:58] (03PS1) 10Marostegui: db2127: 
Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1020260 [14:53:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T361627)', diff saved to https://phabricator.wikimedia.org/P60634 and previous config saved to /var/cache/conftool/dbconfig/20240416-145316-marostegui.json [14:53:22] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:53:40] (03PS1) 10Muehlenhoff: Switch db1201 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020261 (https://phabricator.wikimedia.org/T349619) [14:56:19] (03CR) 10Muehlenhoff: [C:03+2] Switch db1201 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020261 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:56:40] (03CR) 10Marostegui: [C:03+2] db2127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1020260 (owner: 10Marostegui) [14:58:29] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 5%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60635 and previous config saved to /var/cache/conftool/dbconfig/20240416-145851-arnaudb.json [14:59:38] (03CR) 10CDanis: [C:03+2] force enable etcd v2 proto [software/conftool] - 10https://gerrit.wikimedia.org/r/1020224 (owner: 10CDanis) [15:00:05] eoghan, jelto, arnoldokoth, and mutante: That opportune time for a SRE Collaboration Services office hours deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1500). 
[15:00:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1201.eqiad.wmnet [15:01:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P60636 and previous config saved to /var/cache/conftool/dbconfig/20240416-150141-arnaudb.json [15:03:15] (03Merged) 10jenkins-bot: force enable etcd v2 proto [software/conftool] - 10https://gerrit.wikimedia.org/r/1020224 (owner: 10CDanis) [15:03:37] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:03:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:03:54] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:980446|IS: Set Phonos to Inline Audio Player mode on test.wiki]] (duration: 17m 17s) [15:03:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 20%: Post clone', diff saved to https://phabricator.wikimedia.org/P60637 and previous config saved to /var/cache/conftool/dbconfig/20240416-150356-arnaudb.json [15:05:43] !log brennen@deploy1002 Started deploy [phabricator/deployment@7773191]: test deploy phab2002 for T362666 [15:05:48] T362666: Deploy Phabricator/Phorge 2024-04-16 - https://phabricator.wikimedia.org/T362666 [15:05:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1224.eqiad.wmnet [15:06:15] !log brennen@deploy1002 Finished deploy [phabricator/deployment@7773191]: test deploy phab2002 for T362666 (duration: 00m 32s) [15:06:46] !log brennen@deploy1002 Started deploy [phabricator/deployment@7773191]: deploy phab1004 for T362666 [15:07:17] !log brennen@deploy1002 Finished deploy [phabricator/deployment@7773191]: deploy phab1004 for T362666 (duration: 00m 30s) [15:08:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', 
diff saved to https://phabricator.wikimedia.org/P60638 and previous config saved to /var/cache/conftool/dbconfig/20240416-150824-marostegui.json [15:08:47] !log Starting s3 codfw failover from db2127 to db2205 - T362616 [15:08:50] (03PS1) 10Muehlenhoff: Switch db1224 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020262 (https://phabricator.wikimedia.org/T349619) [15:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:53] T362616: Switchover s3 master (db2127 -> db2205) - https://phabricator.wikimedia.org/T362616 [15:09:08] (03PS1) 10DCausse: rdf-streaming-updater: increase s3 socket-timeout to 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020263 (https://phabricator.wikimedia.org/T362508) [15:09:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2205 to s3 primary T362616', diff saved to https://phabricator.wikimedia.org/P60639 and previous config saved to /var/cache/conftool/dbconfig/20240416-150933-root.json [15:10:13] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9718667 (10Michael) I now have Gerrit +2 rights again, but sadly, I still cannot access Logstash or log in to Grafana: {F46993689} Do I ha... 
[15:10:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2127 T362616', diff saved to https://phabricator.wikimedia.org/P60640 and previous config saved to /var/cache/conftool/dbconfig/20240416-151032-root.json [15:10:40] (03CR) 10Muehlenhoff: [C:03+2] Switch db1224 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020262 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:13:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2127.codfw.wmnet with OS bookworm [15:13:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60641 and previous config saved to /var/cache/conftool/dbconfig/20240416-151357-arnaudb.json [15:14:02] (03PS1) 10Marostegui: Revert "db2127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020228 [15:15:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1224.eqiad.wmnet [15:16:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T360332)', diff saved to https://phabricator.wikimedia.org/P60642 and previous config saved to /var/cache/conftool/dbconfig/20240416-151649-arnaudb.json [15:16:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance [15:17:04] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:17:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance [15:17:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2129.codfw.wmnet [15:19:02] (03PS1) 10Muehlenhoff: Switch db2129 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020265 (https://phabricator.wikimedia.org/T349619) [15:19:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 50%: 
Post clone', diff saved to https://phabricator.wikimedia.org/P60643 and previous config saved to /var/cache/conftool/dbconfig/20240416-151902-arnaudb.json [15:20:01] 06SRE, 10SRE-tools, 10Cassandra: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694#9718771 (10Eevans) →14Duplicate dup:03T297944 [15:22:45] (03PS1) 10Elukey: role::aqs: move codfw's instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1020266 (https://phabricator.wikimedia.org/T352647) [15:23:12] (03PS5) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [15:23:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P60644 and previous config saved to /var/cache/conftool/dbconfig/20240416-152331-marostegui.json [15:24:44] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1020266 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:25:41] (03CR) 10Muehlenhoff: [C:03+2] Switch db2129 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020265 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:26:24] (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [15:27:24] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9718814 (10MoritzMuehlenhoff) Can you try accessing https://idp.wikimedia.org/logout and then retrying a login to https://idp.wikimedia.org/... 
[15:29:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 15%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60645 and previous config saved to /var/cache/conftool/dbconfig/20240416-152902-arnaudb.json [15:31:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2129.codfw.wmnet [15:31:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2127.codfw.wmnet with reason: host reimage [15:32:13] (03PS1) 10Elukey: role::aqs: complete the move of Cassandra instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1020267 (https://phabricator.wikimedia.org/T352647) [15:32:31] (Traffic bill over quota) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:32:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1168.eqiad.wmnet [15:34:04] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1020267 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:34:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 75%: Post clone', diff saved to https://phabricator.wikimedia.org/P60646 and previous config saved to /var/cache/conftool/dbconfig/20240416-153408-arnaudb.json [15:34:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2127.codfw.wmnet with reason: host reimage [15:37:32] (03CR) 10Hnowlan: [C:03+1] Remove obsolete restbase discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/1020258 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [15:38:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T361627)', diff saved to 
https://phabricator.wikimedia.org/P60647 and previous config saved to /var/cache/conftool/dbconfig/20240416-153839-marostegui.json [15:38:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [15:38:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:38:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [15:38:59] (03CR) 10JHathaway: "The CI check is a false positive, so the patch can still be reviewed, I'll create a phab task for wmf_styleguide-check for the CI issue." [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [15:39:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T361627)', diff saved to https://phabricator.wikimedia.org/P60648 and previous config saved to /var/cache/conftool/dbconfig/20240416-153902-marostegui.json [15:39:07] (03PS1) 10Muehlenhoff: Switch db1168 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020269 (https://phabricator.wikimedia.org/T349619) [15:40:46] (03CR) 10Muehlenhoff: [C:03+2] Switch db1168 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020269 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:41:25] (03CR) 10Eevans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1020266 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:41:44] (03CR) 10Eevans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1020267 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:43:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T361627)', diff saved to https://phabricator.wikimedia.org/P60649 and previous config saved to 
/var/cache/conftool/dbconfig/20240416-154316-marostegui.json [15:44:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60650 and previous config saved to /var/cache/conftool/dbconfig/20240416-154408-arnaudb.json [15:44:10] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9718964 (10Michael) >>! In T362618#9718814, @MoritzMuehlenhoff wrote: > Can you try accessing https://idp.wikimedia.org/logout and then retr... [15:44:58] so... the 1.43.0-wmf.1 blockers phab task is https://phabricator.wikimedia.org/T361395 .. but I'm getting an "Unhandled Exception ("RuntimeException")" from that URL. [15:45:17] I guess "phab is broken" is a blocker for 1.43.0-wmf.1 ? [15:45:30] hehe. I would say so [15:46:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1168.eqiad.wmnet [15:46:10] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9718986 (10ssingh) 05Open→03Resolved [15:46:39] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1231.eqiad.wmnet [15:47:09] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to 'wmf' ldap group for Michael to allow logstash access - https://phabricator.wikimedia.org/T362618#9718968 (10ssingh) 05Resolved→03Open Thanks @Muehlenhoff! And good to know @Michael that this is resolved; closing this task. 
[15:47:53] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:48:11] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:49:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 100%: Post clone', diff saved to https://phabricator.wikimedia.org/P60651 and previous config saved to /var/cache/conftool/dbconfig/20240416-154915-arnaudb.json [15:52:31] (Traffic bill over quota) resolved: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:54:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60652 and previous config saved to /var/cache/conftool/dbconfig/20240416-155440-root.json [15:54:48] (03CR) 10Marostegui: [C:03+2] Revert "db2127: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1020228 (owner: 10Marostegui) [15:55:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1231.eqiad.wmnet [15:55:20] I wanted to mention that https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020249 probably needs to be backported to 1.43.0-wmf.1 to prevent logspam from ProofreadPages when we deploy to wikisource... but I can't comment at https://phabricator.wikimedia.org/T361395 because phab is dead. 
[15:56:53] (03PS1) 10Hnowlan: CommonSettings: change jobrunner xff to mw-jobrunner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 [15:57:22] cscott: https://phabricator.wikimedia.org/T362666 sounds like there was a phabricator deployment today… [15:57:38] (and I *think* I looked at the train blockers task successfully earlier today, so that would match) [15:57:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2127.codfw.wmnet with OS bookworm [15:58:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P60653 and previous config saved to /var/cache/conftool/dbconfig/20240416-155823-marostegui.json [15:58:32] cscott: I'm about to file a ticket about the phab problem [15:58:44] (03CR) 10Effie Mouzeli: [C:03+1] CommonSettings: change jobrunner xff to mw-jobrunner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan) [15:58:58] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2217.codfw.wmnet [15:59:01] (03PS2) 10Hnowlan: CommonSettings: change jobrunner xff to mw-jobrunner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 [15:59:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60654 and previous config saved to /var/cache/conftool/dbconfig/20240416-155914-arnaudb.json [16:00:04] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1600). [16:00:05] Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:33] Dreamy_Jazz: hi! 
lgtm, would you like me to merge it and also kick off a test run right away? or do you want to wait until Sunday? [16:00:51] cscott: https://phabricator.wikimedia.org/T362689 filed. [16:00:55] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on cp1115.eqiad.wmnet with reason: testing PXE boot issues [16:01:08] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on cp1115.eqiad.wmnet with reason: testing PXE boot issues [16:01:08] (03PS1) 10Muehlenhoff: Switch db2217 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020278 (https://phabricator.wikimedia.org/T349619) [16:01:46] (03CR) 10Dzahn: "the "svc" records are not in use so far and don't exist in DNS. they would be created if a service was behind the full LVS and geodns setu" [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) (owner: 10EoghanGaffney) [16:03:50] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1951/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders) [16:04:29] (03PS1) 10Clément Goubert: ClusterConfigTest: Add mw-on-k8s specific tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020280 [16:04:52] rzl: Hi. We can wait till Sunday or do a run now [16:04:58] Happy either way [16:05:19] (03CR) 10CI reject: [V:04-1] ClusterConfigTest: Add mw-on-k8s specific tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020280 (owner: 10Clément Goubert) [16:05:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[16:05:50] Dreamy_Jazz: fully up to you, doesn't make a difference to me :) recommend starting a run now if you'd like to be able to stop it in case of any unexpected trouble, roll it back promptly etc [16:05:53] (03CR) 10RLazarus: [V:03+1 C:03+2] Schedule weekly purge of global_block_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders) [16:06:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:06:17] (03PS2) 10Clément Goubert: ClusterConfigTest: Add mw-on-k8s specific tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020280 [16:06:18] (03CR) 10Muehlenhoff: [C:03+2] Switch db2217 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1020278 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:06:37] Dreamy_Jazz, rzl: mind letting me know when things are clear for a phab deploy for an UBN? (T362689) [16:06:38] T362689: "Undefined offset: 5" error when visiting https://phabricator.wikimedia.org/T361395 - https://phabricator.wikimedia.org/T362689 [16:06:47] dancy: luckily (?) i can make a new task be a blocker to the 1.43-wmf.1-blockers task, even though I can't access the blockers task itself. [16:06:49] brennen: no conflict, you can go ahead [16:06:55] rzl: ty [16:07:04] Sure. I can check the database tables after it is run.
[16:07:25] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:40] !log starting phabricator deploy for T362689 [16:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:28] Dreamy_Jazz: sounds good, Puppet's running now and I'll let you know when the job is kicked off [16:08:34] Thanks [16:09:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60655 and previous config saved to /var/cache/conftool/dbconfig/20240416-160946-root.json [16:12:10] !log brennen@deploy1002 Started deploy [phabricator/deployment@098b9c2]: test deploy phab2002 for T362689 [16:12:15] T362689: "Undefined offset: 5" error when visiting https://phabricator.wikimedia.org/T361395 - https://phabricator.wikimedia.org/T362689 [16:12:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2217.codfw.wmnet [16:12:43] !log brennen@deploy1002 Finished deploy [phabricator/deployment@098b9c2]: test deploy phab2002 for T362689 (duration: 00m 32s) [16:13:07] !log rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_globalblocking-fixGlobalBlockWhitelist.service # T360516 [16:13:08] !log brennen@deploy1002 Started deploy [phabricator/deployment@098b9c2]: deploy phab1004 for T362689 [16:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:12] T360516: Periodically remove orphaned global_block_whitelist entries - https://phabricator.wikimedia.org/T360516 [16:13:25] Dreamy_Jazz: you can see logs on mwmaint1002 with `journalctl -u mediawiki_job_globalblocking-fixGlobalBlockWhitelist.service` [16:13:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff 
saved to https://phabricator.wikimedia.org/P60656 and previous config saved to /var/cache/conftool/dbconfig/20240416-161330-marostegui.json [16:13:33] Thanks! [16:13:36] (03PS2) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [16:13:37] or add `-f` to tail [16:13:50] !log brennen@deploy1002 Finished deploy [phabricator/deployment@098b9c2]: deploy phab1004 for T362689 (duration: 00m 42s) [16:13:51] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:14:11] I'll be around if you need any followup, otherwise you're all set [16:14:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:14:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60657 and previous config saved to /var/cache/conftool/dbconfig/20240416-161420-arnaudb.json [16:14:24] cscott: We're back! [16:14:25] Apparently I don't have permission to run that command [16:14:41] It says "No journal files were opened due to insufficient permissions." [16:14:52] oh sorry, I thought you were in the group for that! no worries, I'll dump it to a text file for you when the job finishes [16:15:03] Thanks. [16:15:33] remind me your shell username? 
[16:15:42] dreamyjazz [16:15:48] 👍 [16:16:01] !log finished phabricator deploy for T362689 - believe things are currently stable [16:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:25] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:53] (03PS3) 10Cathal Mooney: Reverse DNS changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [16:18:45] (03CR) 10CI reject: [V:04-1] Reverse DNS changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [16:18:50] Dreamy_Jazz: /home/dreamyjazz/fixGlobalBlockWhitelist.txt [16:18:59] Thanks. [16:20:29] (03PS4) 10Cathal Mooney: Reverse DNS changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) [16:20:53] Reviewing the logs. [16:21:20] (03CR) 10CI reject: [V:04-1] Reverse DNS changes for new Magru prefixes [dns] - 10https://gerrit.wikimedia.org/r/1020196 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [16:22:00] (03PS2) 10Dzahn: graphite: switch envoy ssl provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1019887 (https://phabricator.wikimedia.org/T360414) [16:23:17] Looks good for the first few bits. Still looking at the logs. 
[16:24:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T352010)', diff saved to https://phabricator.wikimedia.org/P60659 and previous config saved to /var/cache/conftool/dbconfig/20240416-162443-ladsgroup.json [16:24:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:24:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60660 and previous config saved to /var/cache/conftool/dbconfig/20240416-162452-root.json [16:28:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T361627)', diff saved to https://phabricator.wikimedia.org/P60661 and previous config saved to /var/cache/conftool/dbconfig/20240416-162838-marostegui.json [16:28:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [16:28:45] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:28:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [16:29:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T361627)', diff saved to https://phabricator.wikimedia.org/P60662 and previous config saved to /var/cache/conftool/dbconfig/20240416-162900-marostegui.json [16:29:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P60663 and previous config saved to /var/cache/conftool/dbconfig/20240416-162926-arnaudb.json [16:31:55] rzl: The logs look all good. Thanks. 
[16:32:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T361627)', diff saved to https://phabricator.wikimedia.org/P60664 and previous config saved to /var/cache/conftool/dbconfig/20240416-163215-marostegui.json [16:34:40] Dreamy_Jazz: fwiw, even if you can't use `journalctl` directly, you should be able to read the logs in /var/log/mediawiki/mediawiki_job_globalblocking-fixGlobalBlockWhitelist/syslog.log (cc rzl) [16:37:44] (03CR) 10Cathal Mooney: [C:04-1] "I wholeheartedly approve of the idea here, but I think there is a snag. The are cloud hosts in (i.e. https://netbox.wikimedia.org/ipam/pr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020202 (owner: 10Ssingh) [16:38:43] Thanks! [16:39:02] (03PS3) 10Dreamrimmer: Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250) [16:39:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P60665 and previous config saved to /var/cache/conftool/dbconfig/20240416-163951-ladsgroup.json [16:39:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60666 and previous config saved to /var/cache/conftool/dbconfig/20240416-163958-root.json [16:42:21] (03CR) 10Ssingh: "Thanks, fair point and I think this might be a concern. I will let this rest for a while and then maybe see if we can just skip over 10.64" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020202 (owner: 10Ssingh) [16:43:05] (03CR) 10Ssingh: "The ideal and the correct way of fixing this of course is to generate this automatically. So I guess let's maybe spend time there." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020202 (owner: 10Ssingh) [16:45:57] 10ops-codfw, 06SRE, 10decommission-hardware: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305#9719496 (10Papaul) @blink is there anything left for DC-ops to do on this task? Thanks [16:47:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P60668 and previous config saved to /var/cache/conftool/dbconfig/20240416-164722-marostegui.json [16:47:34] 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9719504 (10Jhancock.wm) 05Open→03Resolved [16:48:46] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! I mean the change looks good. The fact we have so much duplicate data makes me cry 😞" [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [16:50:48] (03CR) 10Cathal Mooney: [C:03+1] Puppet: add magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [16:51:35] (03CR) 10Ssingh: "Yeah it's a bit sad. Every single time we do this, we say we will fix it next time. Maybe we should fix it this time. Let's see -- once th" [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [16:51:41] 10ops-codfw, 06SRE, 10decommission-hardware: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305#9719527 (10bking) 05Open→03Resolved @Papaul Sorry for the misleading message above. This task is finished from the Search Platform SRE standpoint as the above alerts have... 
[16:52:14] (03CR) 10Ssingh: "We will merge this on Wednesday April 17" [puppet] - 10https://gerrit.wikimedia.org/r/1019810 (owner: 10Ayounsi) [16:53:52] 10ops-codfw, 06SRE: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9719570 (10Jhancock.wm) I updated the sheet with the needed information but spaced submitting that to this task. Please let me know if there's anything else I can do to help out with the tasks. Thanks! [16:54:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P60669 and previous config saved to /var/cache/conftool/dbconfig/20240416-165458-ladsgroup.json [16:55:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60670 and previous config saved to /var/cache/conftool/dbconfig/20240416-165504-root.json [16:59:04] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:00:02] (03PS3) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1700) [17:00:47] (03PS2) 10Joal: Update yarn scheduler's queues configuration [puppet] - 10https://gerrit.wikimedia.org/r/1019683 (https://phabricator.wikimedia.org/T361499) [17:01:13] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2009 DNS add - pt1979@cumin2002" [17:01:38] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1955/console" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [17:01:49] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install 
cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9719616 (10Papaul) [17:02:13] (03CR) 10Joal: "You're right! I got a message from Fabian about this. I renamed the fifo queue gpus. I wonder if using queues for this is better than usin" [puppet] - 10https://gerrit.wikimedia.org/r/1019683 (https://phabricator.wikimedia.org/T361499) (owner: 10Joal) [17:02:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P60671 and previous config saved to /var/cache/conftool/dbconfig/20240416-170231-marostegui.json [17:02:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2009 DNS add - pt1979@cumin2002" [17:02:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:03:12] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [17:04:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudcontrol2009-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:06:51] (03PS1) 10Jdlrobson: [phase 4] Vector-2022.js should no longer load legacy Vector site and user scripts/styles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020292 (https://phabricator.wikimedia.org/T301212) [17:07:50] (03PS4) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [17:10:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T352010)', diff saved to https://phabricator.wikimedia.org/P60672 and previous config saved to /var/cache/conftool/dbconfig/20240416-171006-ladsgroup.json [17:10:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [17:10:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 
(re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60673 and previous config saved to /var/cache/conftool/dbconfig/20240416-171010-root.json [17:10:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:10:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [17:10:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:10:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:10:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T352010)', diff saved to https://phabricator.wikimedia.org/P60674 and previous config saved to /var/cache/conftool/dbconfig/20240416-171047-ladsgroup.json [17:14:42] (03PS3) 10Jdlrobson: Use WikimediaMessages for template overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019941 (https://phabricator.wikimedia.org/T361589) [17:16:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. 
[17:17:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T361627)', diff saved to https://phabricator.wikimedia.org/P60675 and previous config saved to /var/cache/conftool/dbconfig/20240416-171738-marostegui.json [17:17:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [17:17:44] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:17:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [17:21:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [17:21:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [17:22:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T361627)', diff saved to https://phabricator.wikimedia.org/P60676 and previous config saved to /var/cache/conftool/dbconfig/20240416-172201-marostegui.json [17:23:00] (03PS1) 10Stevemunene: configure datahub to wait for upgrade before starting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020295 (https://phabricator.wikimedia.org/T361688) [17:24:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T361627)', diff saved to https://phabricator.wikimedia.org/P60677 and previous config saved to /var/cache/conftool/dbconfig/20240416-172415-marostegui.json [17:24:30] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:24:37] (03PS3) 10MusikAnimal: [mediawikiwiki] enable CodeMirror V6 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019893 (https://phabricator.wikimedia.org/T357795) [17:25:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60678 and previous config saved to /var/cache/conftool/dbconfig/20240416-172515-root.json [17:27:03] (03PS1) 10C. Scott Ananian: [Parser] Temporarily disable deprecation warnings for dynamic properties [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020231 (https://phabricator.wikimedia.org/T362692) [17:27:38] (03CR) 10Herron: [C:03+2] mailman: switch HELO checks from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/1019861 (https://phabricator.wikimedia.org/T173338) (owner: 10Herron) [17:28:22] (03CR) 10C. Scott Ananian: "If the train team is on-board, we don't need to merge this to master, just merge the cherry-pick to 1.43-wmf.1, which avoids having to rev" [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020231 (https://phabricator.wikimedia.org/T362692) (owner: 10C. Scott Ananian) [17:28:51] (03CR) 10C. Scott Ananian: [C:03+1] [Parser] Temporarily disable deprecation warnings for dynamic properties [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020231 (https://phabricator.wikimedia.org/T362692) (owner: 10C. 
Scott Ananian) [17:35:25] (03PS1) 10Dzahn: contint: disable zuul merger on contint1002, preparing for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1020296 (https://phabricator.wikimedia.org/T334517) [17:35:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2009-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:36:41] (03CR) 10Dzahn: [C:03+2] contint: disable zuul merger on contint1002, preparing for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1020296 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [17:36:56] jouncebot: now [17:36:56] For the next 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1700) [17:37:34] !log CI - disabling zuul-merger on contint1002 - there is another on contint2002 [17:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P60679 and previous config saved to /var/cache/conftool/dbconfig/20240416-173923-marostegui.json [17:48:44] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint1002.wikimedia.org with reason: reimage https://phabricator.wikmedia.org/T334517 [17:48:59] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint1002.wikimedia.org with reason: reimage https://phabricator.wikmedia.org/T334517 [17:50:23] !log bearloga@deploy1002 Started deploy [airflow-dags/analytics_product@bb33843]: (no justification provided) [17:50:30] !log bearloga@deploy1002 Finished deploy [airflow-dags/analytics_product@bb33843]: (no justification provided) (duration: 00m 06s) [17:51:22] !log CI - jenkins on contint1002 disabled - reimaging in progress [17:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:01] 
!log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint1002.wikimedia.org with OS bullseye [17:54:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P60680 and previous config saved to /var/cache/conftool/dbconfig/20240416-175431-marostegui.json [17:56:04] (03CR) 10Dzahn: scap: introduce bootstrapping mechanism specific to deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [17:56:34] (03PS2) 10Cathal Mooney: Netbox custom script to add additional IPv4 addresses to host [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) [17:56:51] (03CR) 10Cathal Mooney: "Thanks for the feedback! Fixed up those bits now." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) (owner: 10Cathal Mooney) [17:57:29] (03CR) 10CI reject: [V:04-1] Netbox custom script to add additional IPv4 addresses to host [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) (owner: 10Cathal Mooney) [18:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T1800) [18:00:11] o/ [18:00:46] currently we have only one zuul-merger and jenkins. but that _should_ not be a problem. [18:01:14] thx [18:02:35] cscott: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020231 Looks good to me. OK for me to merge? 
[18:02:56] dancy: yup i was about to tell you that :) [18:03:02] Awesome [18:03:17] i scheduled it for backport and didn't realize i'd gotten the ordering of backport and train backwards [18:04:20] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint1002.wikimedia.org with reason: host reimage [18:04:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020231 (https://phabricator.wikimedia.org/T362692) (owner: 10C. Scott Ananian) [18:04:40] https://www.traingeek.ca/wp/faq/can-trains-run-backwards/ [18:06:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:07:02] "It is possible to turn an entire train around." [18:07:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint1002.wikimedia.org with reason: host reimage [18:07:38] the analogies seem to fit :) [18:07:56] well, i'm hoping i'm not the reason for the train to turn around today :) [18:08:51] I think you have "someone has to be on the “leading” end of the train (the back) to watch for any obstacles and to tell the train engineer when it is time to slow down and to stop." 
:) [18:09:31] fair enough [18:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T361627)', diff saved to https://phabricator.wikimedia.org/P60681 and previous config saved to /var/cache/conftool/dbconfig/20240416-180938-marostegui.json [18:09:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [18:09:45] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:09:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [18:10:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T361627)', diff saved to https://phabricator.wikimedia.org/P60682 and previous config saved to /var/cache/conftool/dbconfig/20240416-181001-marostegui.json [18:15:21] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9719964 (10Papaul) [18:18:23] (03PS1) 10Dzahn: create ae.wikimedia.org for United Arab Emirates User Group [dns] - 10https://gerrit.wikimedia.org/r/1020311 (https://phabricator.wikimedia.org/T362529) [18:19:19] (03PS2) 10Dzahn: create ae.wikimedia.org for United Arab Emirates User Group [dns] - 10https://gerrit.wikimedia.org/r/1020311 (https://phabricator.wikimedia.org/T362529) [18:23:48] (03PS1) 10JMeybohm: helmfile_psp: Remove seccomp/apparmor mutations from PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) [18:25:30] (03Merged) 10jenkins-bot: [Parser] Temporarily disable deprecation warnings for dynamic properties [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1020231 (https://phabricator.wikimedia.org/T362692) (owner: 10C. 
Scott Ananian) [18:26:00] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1020231|[Parser] Temporarily disable deprecation warnings for dynamic properties (T362692)]] [18:26:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T361627)', diff saved to https://phabricator.wikimedia.org/P60683 and previous config saved to /var/cache/conftool/dbconfig/20240416-182606-marostegui.json [18:26:09] T362692: Expected logspam from ProofreadPage - https://phabricator.wikimedia.org/T362692 [18:26:16] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:26:26] cscott: Do you have a way to exercise the change? [18:26:39] i was just checking which wikis are in group 0 [18:27:08] i think that any page which has ProofreadPage on it should trigger a deprecation warning pre-patch, but I only found htwikisource in group0 [18:27:38] I think pages using LabeledSectionTransclusion would have triggered the warning too [18:27:53] i just need to find them on a group0 wiki [18:28:33] Note that the train is only at testwikis at the moment. 
[18:29:05] !log dancy@deploy1002 cscott and dancy: Backport for [[gerrit:1020231|[Parser] Temporarily disable deprecation warnings for dynamic properties (T362692)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:29:28] !log bearloga@deploy1002 Started deploy [airflow-dags/analytics_product@77af7cb]: (no justification provided) [18:29:36] !log bearloga@deploy1002 Finished deploy [airflow-dags/analytics_product@77af7cb]: (no justification provided) (duration: 00m 07s) [18:29:42] test.wikipedia.org has LabeledSectionTransclusion installed, i should be able to make a test page [18:32:23] ok, https://test.wikipedia.org/wiki/User:Cscott/LST triggers LST; if no deprecation warning showed up in logstash then the backport patch worked [18:33:38] (03PS1) 10Dzahn: contint: set bullseye docker version just for host contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/1020316 (https://phabricator.wikimedia.org/T334517) [18:34:13] Nothing notable! Let's roll [18:34:28] \o/ [18:35:10] !log dancy@deploy1002 cscott and dancy: Continuing with sync [18:36:05] (03PS2) 10Dzahn: contint: set bullseye docker version just for host contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/1020316 (https://phabricator.wikimedia.org/T334517) [18:36:39] (03CR) 10Dzahn: [V:03+2 C:03+2] contint: set bullseye docker version just for host contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/1020316 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:36:39] (03PS5) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [18:38:37] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [18:40:01] !log contint1002 - sudo a2dismod mpm_event to work around known race condition and fix failed initial puppet run - T334517 [18:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:09] T334517: upgrade contint 
servers to bullseye - https://phabricator.wikimedia.org/T334517 [18:40:17] (03PS16) 10Bking: elasticsearch: prevent cross cluster seed config drift [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) [18:40:31] (03PS6) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [18:41:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P60684 and previous config saved to /var/cache/conftool/dbconfig/20240416-184113-marostegui.json [18:43:40] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [18:44:21] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint1002.wikimedia.org with OS bullseye [18:46:07] (03PS3) 10Cathal Mooney: Netbox custom script to add additional IPv4 addresses to host [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) [18:48:57] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1020231|[Parser] Temporarily disable deprecation warnings for dynamic properties (T362692)]] (duration: 22m 56s) [18:49:04] T362692: Expected logspam from ProofreadPage - https://phabricator.wikimedia.org/T362692 [18:49:25] !log dancy@deploy1002 Installing scap version "4.77.0" for 340 hosts [18:50:06] mutante: that is fast :) [18:50:11] !log dancy@deploy1002 Installation of scap version "4.77.0" completed for 340 hosts [18:50:53] hashar: needed fixes but I was able to apply them while the cookbook was still running and had not given up yet to detect a successful puppet run :) [18:50:54] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020318 (https://phabricator.wikimedia.org/T361395) [18:50:55] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.1 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020318 (https://phabricator.wikimedia.org/T361395) (owner: 10TrainBranchBot) [18:51:31] mutante: hopefully nothing too bad :) [18:51:44] (03PS17) 10Bking: elasticsearch: prevent cross cluster seed config drift [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) [18:51:49] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020318 (https://phabricator.wikimedia.org/T361395) (owner: 10TrainBranchBot) [18:51:52] hashar: fix 1: we had to set the right docker version in Hiera for just this host, not changing the default. fix 2: race condition with apache modules on first run, fixable by manual "a2dismod" [18:51:58] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [18:52:09] (03PS7) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 [18:52:26] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@9208108]: Regular analytics weekly train [airflow-dags/analytics@9208108e] [18:52:52] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@9208108]: Regular analytics weekly train [airflow-dags/analytics@9208108e] (duration: 00m 26s) [18:53:35] (03PS18) 10Bking: elasticsearch: prevent cross cluster seed config drift [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) [18:54:26] (03PS19) 10Bking: elasticsearch: prevent cross cluster seed config drift [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) [18:54:42] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [18:55:44] (03CR) 10Bking: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [18:56:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P60685 and previous config saved to /var/cache/conftool/dbconfig/20240416-185621-marostegui.json [18:57:07] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Disable /srv/mediawiki-staging/php symlink management [puppet] - 10https://gerrit.wikimedia.org/r/1020321 (https://phabricator.wikimedia.org/T359643) [18:58:56] (03PS20) 10Bking: elasticsearch: prevent cross cluster seed config drift [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) [18:59:02] (03CR) 10Andrew Bogott: [C:03+2] New files/templates for OpenStack Bobcat (2023.2) [puppet] - 10https://gerrit.wikimedia.org/r/1019879 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [18:59:04] (03CR) 10Andrew Bogott: [C:03+2] neutron/bobcat: remove an l3 conf override hack [puppet] - 10https://gerrit.wikimedia.org/r/1019898 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [18:59:05] (03CR) 10Andrew Bogott: [C:03+2] cinder/bobcat: remove volume_type_access hack [puppet] - 10https://gerrit.wikimedia.org/r/1019894 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [18:59:06] (03CR) 10Andrew Bogott: [C:03+2] bobcat cinder: remove backup scheduler hack [puppet] - 10https://gerrit.wikimedia.org/r/1019895 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [18:59:08] (03CR) 10Andrew Bogott: [C:03+2] cinder/bobcat: removing chunkeddriver.py.patch [puppet] - 10https://gerrit.wikimedia.org/r/1019896 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [18:59:09] (03CR) 10Andrew Bogott: [C:03+2] openstacksdk/bobcat: remove sdk hack about clouds.yaml load ordering [puppet] - 10https://gerrit.wikimedia.org/r/1019897 
(https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [18:59:29] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [18:59:35] hashar: zuul has not been deployed by puppet [19:00:02] it should have a scap::target for zuul though [19:00:04] hashar: but: /srv is own partition, docker is running and git-daemon is up [19:00:47] can I "scap pull" but for zuul instead of mediawiki? [19:01:18] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [19:02:15] [contint1002:/srv/deployment/zuul] $ du -hs . [19:02:15] 115M . [19:02:27] /srv/deployment/zuul/venv/bin/zuul-merger' (No such file [19:02:32] well puppet should have cloned and deployed it [19:02:37] but I can look at it tomorrow [19:02:58] (03CR) 10Ryan Kemper: [C:03+1] elasticsearch: prevent cross cluster seed config drift [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [19:03:03] there are 115M in /srv/deployment/zuul but there is no ./venv/ dir [19:03:04] (03CR) 10Bking: [C:03+2] elasticsearch: prevent cross cluster seed config drift [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [19:03:25] ack,ty [19:04:43] (03PS3) 10Andrew Bogott: neutron/bobcat: remove an l3 conf override hack [puppet] - 10https://gerrit.wikimedia.org/r/1019898 (https://phabricator.wikimedia.org/T356287) [19:04:43] (03PS1) 10Andrew Bogott: codfw1dev openstack to version 'bobcat' [puppet] - 10https://gerrit.wikimedia.org/r/1020323 (https://phabricator.wikimedia.org/T356287) [19:04:49] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] neutron/bobcat: remove an l3 conf override hack [puppet] - 10https://gerrit.wikimedia.org/r/1019898 
(https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [19:05:51] mutante: or maybe I misunderstand what scap does :D [19:06:01] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev openstack to version 'bobcat' [puppet] - 10https://gerrit.wikimedia.org/r/1020323 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [19:06:08] I think it is puppet calling `scap deploy --init` or something [19:06:17] which might just do the repo cloning and basic deploy [19:06:18] hashar: it's kind of common we have an issue on the very first time it runs [19:06:20] but not all the later stages [19:06:35] yea [19:07:33] yea, it did some things but not the deploy, /usr/local/bin/zuul: broken symbolic link to /srv/deployment/zuul/venv/bin/zuul [19:07:57] maybe I can find out, otherwise tomorrow [19:08:23] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.1 refs T361395 [19:08:28] T361395: 1.43.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T361395 [19:11:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T361627)', diff saved to https://phabricator.wikimedia.org/P60686 and previous config saved to /var/cache/conftool/dbconfig/20240416-191128-marostegui.json [19:11:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:11:34] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:11:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:12:27] !log hashar@deploy1002 Started deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 [19:12:31] !log hashar@deploy1002 Finished deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - 
T334517 (duration: 00m 03s) [19:12:32] T334517: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517 [19:13:43] W T F [19:14:05] (03PS1) 10Bking: Elastic: unique systemd timer name per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1020326 (https://phabricator.wikimedia.org/T358389) [19:14:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1020326 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [19:14:35] !log hashar@deploy1002 Started deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 [19:14:49] !log hashar@deploy1002 Finished deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 (duration: 00m 13s) [19:15:03] RuntimeError: failed to find interpreter for Builtin discover of python_spec='python2.7' [19:15:07] so yeah I don't know [19:15:13] scap hasn't done the first install [19:15:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1246.eqiad.wmnet with reason: Maintenance [19:15:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1246.eqiad.wmnet with reason: Maintenance [19:15:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T361627)', diff saved to https://phabricator.wikimedia.org/P60687 and previous config saved to /var/cache/conftool/dbconfig/20240416-191522-marostegui.json [19:17:02] !log Deployment train for analytics/refinery [19:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:58] hashar: I tried this: sudo -u deploy-zuul scap deploy-local -r zuul [19:18:09] but nope:) [19:18:15] yeah I don't know [19:18:22] my guess is it only does the first stage [19:18:35] !log aqu@deploy1002 Started deploy [analytics/refinery@59f7d09]: Regular analytics weekly train [analytics/refinery@59f7d091] [19:18:39] I ran a regular deployment from the deployment server but then it fails because python2.7 is 
missing [19:18:39] yea, I think the same [19:18:52] and I don't know what installed it on contint1003 [19:19:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1020326 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [19:19:03] given we have: hieradata/role/common/ci.yaml:profile::base::remove_python2_on_bullseye: false [19:20:09] (03CR) 10Ryan Kemper: [C:03+1] Elastic: unique systemd timer name per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1020326 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [19:20:16] (03CR) 10Bking: [C:03+2] Elastic: unique systemd timer name per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1020326 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [19:21:38] hashar: jnuche did that on contint1003 [19:21:55] apt installed it? [19:23:57] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9208108]: Regular analytics weekly train [airflow-dags/analytics_test@9208108e] [19:24:07] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9208108]: Regular analytics weekly train [airflow-dags/analytics_test@9208108e] (duration: 00m 10s) [19:24:36] hashar: https://phabricator.wikimedia.org/T358237#9599861 [19:24:42] those things after this comment [19:25:10] but as you say we have "profile::base::remove_python2_on_bullseye: false" as default for ci role [19:25:32] python2.7 is installed [19:27:11] (03PS1) 10Hashar: zuul: require python2.7 [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) [19:27:51] I KNOW [19:27:52] so [19:27:58] I imagine the host is reimaged [19:28:04] but all those packages are installed [19:28:05] Puppet runs and thus removes the python2.7 packages [19:28:28] THEN the role is applied which brings in the "remove_python2_on_bullseye: false" but that does not magically reinstall it [19:28:35] end result: host lacks python2.7 :D [19:28:41] ii libpython2.7:amd64 2.7.18-8+deb11u1 amd64 
Shared Python runtime library (version 2.7) [19:28:58] because that package is not in the list of packages to remove [19:29:00] arrgg. yes. you are right [19:29:04] modules/base/manifests/standard_packages.pp is wrong [19:29:05] this was from 1003, arr [19:29:08] PUPPET IS WRONG [19:29:13] THE WORLD IS WORONNG [19:29:16] aezhgrghgh [19:29:17] lol [19:30:21] (03CR) 10Dzahn: [C:03+2] zuul: require python2.7 [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [19:30:22] (03Abandoned) 10Hashar: zuul: require python2.7 [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [19:30:38] mutante: I will ask M.oritz tomorrow I guess [19:31:06] I have no issue merging that, I already checked it's installed on the other hosts [19:31:43] !log aqu@deploy1002 Finished deploy [analytics/refinery@59f7d09]: Regular analytics weekly train [analytics/refinery@59f7d091] (duration: 13m 08s) [19:31:56] but yea, it's good for standard_packages [19:32:20] also we can be happy jnuche already fixed this: https://gerrit.wikimedia.org/r/c/integration/zuul/deploy/+/1008866 [19:33:06] hashar: also see https://gerrit.wikimedia.org/r/1008849 where he suggested the same thing but then manually installed them :) well, have a good night [19:33:34] hmm [19:33:51] !log aqu@deploy1002 Started deploy [analytics/refinery@59f7d09] (thin): Regular analytics weekly train THIN [analytics/refinery@59f7d091] [19:34:32] yeah maybe we should just install it [19:35:03] (03Restored) 10Hashar: zuul: require python2.7 [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [19:35:31] mutante: so we can do that ^ or I can ask m.oritz tomorrow afternoon :) [19:36:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 829.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:36:27] hashar: BOTH :) [19:36:36] +1 :) [19:36:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T361627)', diff saved to https://phabricator.wikimedia.org/P60689 and previous config saved to /var/cache/conftool/dbconfig/20240416-193643-marostegui.json [19:36:55] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:37:10] and I will look at adding the zuul-merger tomorrow afternoon and validating it is working (it is too late for me to do it now) [19:37:54] (03PS2) 10Dzahn: zuul: require python2.7 [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [19:38:01] !log aqu@deploy1002 Finished deploy [analytics/refinery@59f7d09] (thin): Regular analytics weekly train THIN [analytics/refinery@59f7d091] (duration: 04m 10s) [19:39:10] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1020329/1963/" [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [19:39:50] hashar: thanks, yeah, time to leave IRC for you:) [19:40:05] !log aqu@deploy1002 Started deploy [analytics/refinery@59f7d09] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@59f7d091] [19:41:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 829.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:41:24] (03CR) 10Dzahn: [C:03+2] "This should fix zuul deployments on contint servers on bullseye - same issue happened on the contint1003 test host but packages were insta" [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [19:42:30] !log aqu@deploy1002 Finished deploy [analytics/refinery@59f7d09] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@59f7d091] (duration: 02m 24s) [19:46:21] (03CR) 10Dzahn: [V:03+2 C:03+2] zuul: require python2.7 [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [19:47:44] !log hashar@deploy1002 Started deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 [19:47:50] T334517: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517 [19:47:52] !log hashar@deploy1002 Finished deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 (duration: 00m 08s) [19:50:11] (03CR) 10Dzahn: [V:03+2 C:03+2] "contint2002/contint1003; noop" [puppet] - 10https://gerrit.wikimedia.org/r/1020329 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [19:51:45] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 10vm-requests, 13Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9720355 (10Dzahn) [19:51:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P60690 and previous config saved to /var/cache/conftool/dbconfig/20240416-195151-marostegui.json [19:52:06] hashar@contint1002:~$ /usr/local/bin/zuul --version [19:52:06] Zuul version: 2.5.2.dev30 [19:52:20] mutante: that worked! 
(after I ran a `scap deploy` from the deployment server) [19:53:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2009-dev'] [19:55:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2009-dev'] [19:55:20] hashar: I just wanted to try that myself:) ok! cool [19:55:53] guess then I can enable services:) [19:59:51] (03PS1) 10Dzahn: Revert "contint: disable zuul merger on contint1002, preparing for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1020234 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240416T2000). [20:00:05] Jdlrobson and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] (03CR) 10CI reject: [V:04-1] Revert "contint: disable zuul merger on contint1002, preparing for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1020234 (owner: 10Dzahn) [20:02:57] (03PS2) 10Dzahn: Revert "contint: disable zuul merger on contint1002, preparing for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1020234 [20:03:38] present [20:03:50] o/ [20:03:53] i can deploy [20:04:13] Jdlrobson: i'll start with yours [20:04:21] cjming: thanks Clare! 
[20:04:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [core] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1019910 (https://phabricator.wikimedia.org/T360388) (owner: 10Jdlrobson) [20:06:01] (03PS1) 10Papaul: Add cloudcontrol2009 to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1020343 (https://phabricator.wikimedia.org/T354896) [20:06:34] (03PS1) 10Dzahn: contint: set new default docker version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) [20:06:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P60691 and previous config saved to /var/cache/conftool/dbconfig/20240416-200659-marostegui.json [20:08:20] !log Weekly deploy of refinery using scap, then deployed onto hdfs [20:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:55] (03CR) 10Papaul: [C:03+2] Add cloudcontrol2009 to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1020343 (https://phabricator.wikimedia.org/T354896) (owner: 10Papaul) [20:17:25] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:20:42] (03CR) 10Dzahn: [C:03+2] Revert "contint: disable zuul merger on contint1002, preparing for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1020234 (owner: 10Dzahn) [20:22:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T361627)', diff saved to https://phabricator.wikimedia.org/P60693 and previous config saved to /var/cache/conftool/dbconfig/20240416-202206-marostegui.json [20:22:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on 
dbstore1007.eqiad.wmnet with reason: Maintenance [20:22:12] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [20:22:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:22:48] !log CI - re-enabled jenkins and zuul-merged on contint1002 after distro upgrade - T360964 [20:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:53] T360964: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964 [20:23:04] I literally can't type "merger" without making it "merged" [20:26:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2009-dev.codfw.wmnet with OS bookworm [20:27:09] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2... 
[20:30:21] !log CI - jenkins and zuul-merger are re-enabled on contint1002 after distro upgrade to bullseye - T334517 [20:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:30] T334517: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517 [20:33:01] (03Merged) 10jenkins-bot: Thumbnail styles generalized and moved to core [core] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1019910 (https://phabricator.wikimedia.org/T360388) (owner: 10Jdlrobson) [20:33:30] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1019910|Thumbnail styles generalized and moved to core (T360388)]] [20:33:35] T360388: Upstream thumbnail and table rules from Minerva to ResourceLoader/SkinModule - https://phabricator.wikimedia.org/T360388 [20:36:37] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:1019910|Thumbnail styles generalized and moved to core (T360388)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:36:42] Jdlrobson: can you test 1st patch? [20:38:07] (03PS1) 10Andrew Bogott: Trove/bobcat: update the patch for a puppet-applied hack [puppet] - 10https://gerrit.wikimedia.org/r/1020355 (https://phabricator.wikimedia.org/T356287) [20:38:16] cjming: yep [20:38:18] looking now [20:39:11] (03CR) 10Andrew Bogott: [C:03+2] Trove/bobcat: update the patch for a puppet-applied hack [puppet] - 10https://gerrit.wikimedia.org/r/1020355 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [20:42:20] cjming: looks good please merge! 
[20:42:22] (03PS1) 10Eevans: {echo,session}store (staging): use wmf-ca-certificates.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020356 (https://phabricator.wikimedia.org/T352647) [20:42:30] great - syncing [20:42:36] !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync [20:45:13] (03PS2) 10Eevans: {echo,session}store (staging): use wmf-ca-certificates.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020356 (https://phabricator.wikimedia.org/T352647) [20:47:40] hopefully the two config patches will be much quicker [20:48:09] ya - CI seems to take longer every time [20:56:18] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1019910|Thumbnail styles generalized and moved to core (T360388)]] (duration: 22m 48s) [20:56:24] T360388: Upstream thumbnail and table rules from Minerva to ResourceLoader/SkinModule - https://phabricator.wikimedia.org/T360388 [20:56:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020292 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:57:22] (03Merged) 10jenkins-bot: [phase 4] Vector-2022.js should no longer load legacy Vector site and user scripts/styles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020292 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [20:57:43] (03PS4) 10Jdlrobson: Use WikimediaMessages for template overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019941 (https://phabricator.wikimedia.org/T361589) [20:57:52] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1020292|[phase 4] Vector-2022.js should no longer load legacy Vector site and user scripts/styles (T301212)]] [20:57:57] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [20:58:01] w00t [21:01:06] !log cjming@deploy1002 cjming and jdlrobson: Backport for 
[[gerrit:1020292|[phase 4] Vector-2022.js should no longer load legacy Vector site and user scripts/styles (T301212)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:01:09] Jdlrobson: 2nd patch ok to sync? [21:01:16] cjming: looking now [21:02:35] cjming: please sync [21:02:41] ok! [21:02:43] !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync [21:06:15] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Michael Grosse (WMDE) from WMF systems - https://phabricator.wikimedia.org/T361266#9720548 (10Dzahn) >>! In T361266#9716925, @Urbanecm_WMF wrote: >>> Since Michael was hired by WMF Do we expect a new access request to add him to "wmf" LDAP grou... [21:10:19] cjming: ready for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1019941 ? [21:10:24] i know we are over time but..... [21:10:34] yes - just waiting for last scap to finish [21:11:54] seems to be hanging [21:12:24] (03CR) 10Dzahn: [C:04-1] "to be merged next week once contint2002 is also reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [21:13:30] (03PS1) 10Bking: site.pp: move elastic2088 back into production [puppet] - 10https://gerrit.wikimedia.org/r/1020375 (https://phabricator.wikimedia.org/T361525) [21:15:27] k [21:16:19] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1020292|[phase 4] Vector-2022.js should no longer load legacy Vector site and user scripts/styles (T301212)]] (duration: 18m 26s) [21:16:24] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [21:17:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019941 (https://phabricator.wikimedia.org/T361589) (owner: 10Jdlrobson) [21:18:05] (03Merged) 10jenkins-bot: Use WikimediaMessages for template overrides 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019941 (https://phabricator.wikimedia.org/T361589) (owner: 10Jdlrobson) [21:18:34] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1019941|Use WikimediaMessages for template overrides (T361589)]] [21:18:41] T361589: [Config] Enable the WikimediaMessage module on all wikis - https://phabricator.wikimedia.org/T361589 [21:21:37] !log cjming@deploy1002 jdlrobson and cjming: Backport for [[gerrit:1019941|Use WikimediaMessages for template overrides (T361589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:21:52] Jdlrobson: ok to sync 3rd patch? [21:23:42] cjming: checking now [21:25:10] cjming: please sync! [21:25:14] !log cjming@deploy1002 jdlrobson and cjming: Continuing with sync [21:29:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 849.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:29:59] (03PS3) 10EoghanGaffney: phabricator: Switch certificate generation to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1020190 (https://phabricator.wikimedia.org/T360413) [21:34:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 843.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:34:25] did it go through cjming ? 
[21:34:32] almost [21:36:30] even syncing takes longer than i remember [21:38:05] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1019941|Use WikimediaMessages for template overrides (T361589)]] (duration: 19m 30s) [21:38:10] T361589: [Config] Enable the WikimediaMessage module on all wikis - https://phabricator.wikimedia.org/T361589 [21:38:17] Jdlrobson: all 3 patches should be live! [21:38:49] !log end of UTC late backport window [21:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:54] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Michael Grosse (WMDE) from WMF systems - https://phabricator.wikimedia.org/T361266#9720643 (10MoritzMuehlenhoff) > Do we expect a new access request to add him to "wmf" LDAP group, WMF-NDA in Phabricator etc? Already happened in https://phabric... [21:42:24] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:44:16] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:45:04] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:45:14] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:45:45] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:46:08] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:46:29] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:46:31] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:47:16] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:47:44] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:47:53] !log pfischer@deploy1002 
helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:48:03] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:54:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2009-dev.codfw.wmnet with OS bookworm [21:54:29] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720683 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2009-... [21:59:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 801.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:04:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 801.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:06:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.038s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:06:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:19] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9720707 (10VRiley-WMF) We have been able to get dell support on this unit. After sending over the logs for and they have reviewed it they suggested to update the BIOS a... [22:11:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 806.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:12:25] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:13:37] 10ops-eqiad, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T362366#9720733 (10phaultfinder) [22:15:29] (03PS1) 10Papaul: Fix entry for cloudcontrol2009-dev in preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1020378 (https://phabricator.wikimedia.org/T354896) [22:15:52] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720737 (10Papaul) [22:18:40] (03CR) 10Papaul: [C:03+2] Fix entry for cloudcontrol2009-dev in preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1020378 (https://phabricator.wikimedia.org/T354896) (owner: 10Papaul) [22:25:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudcontrol2009-dev.codfw.wmnet with OS bookworm [22:25:15] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720760 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2... [22:29:36] (03PS1) 10Andrew Bogott: designate/bobcat: update nova_fixed_multi to catch up with upstream refactor [puppet] - 10https://gerrit.wikimedia.org/r/1020384 (https://phabricator.wikimedia.org/T356287) [22:30:01] (03CR) 10CI reject: [V:04-1] designate/bobcat: update nova_fixed_multi to catch up with upstream refactor [puppet] - 10https://gerrit.wikimedia.org/r/1020384 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [22:31:11] (03PS2) 10Andrew Bogott: designate/bobcat: update nova_fixed_multi to catch up with upstream refactor [puppet] - 10https://gerrit.wikimedia.org/r/1020384 (https://phabricator.wikimedia.org/T356287) [22:33:27] (03PS3) 10Andrew Bogott: designate/bobcat: update nova_fixed_multi to catch up with upstream refactor [puppet] - 10https://gerrit.wikimedia.org/r/1020384 (https://phabricator.wikimedia.org/T356287) [22:34:00] (03CR) 10Andrew Bogott: [C:03+2] designate/bobcat: update nova_fixed_multi to catch up with upstream refactor [puppet] - 10https://gerrit.wikimedia.org/r/1020384 (https://phabricator.wikimedia.org/T356287) (owner: 10Andrew Bogott) [22:36:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 875.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:38:28] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - 
https://phabricator.wikimedia.org/T362729 (10RobH) 03NEW [22:38:47] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9720811 (10RobH) [22:41:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 817.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:42:39] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730 (10RobH) 03NEW [22:42:42] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9720829 (10RobH) [22:43:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2009-dev.codfw.wmnet with reason: host reimage [22:43:36] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9720831 (10RobH) [22:46:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2009-dev.codfw.wmnet with reason: host reimage [22:47:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 872.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:52:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 872.7ms - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:53:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 863.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:58:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 843.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:00:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 841.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:03:10] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:04:10] (03PS4) 10MusikAnimal: [mediawikiwiki] enable CodeMirror V6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019893 (https://phabricator.wikimedia.org/T357795) [23:05:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 881.3ms - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:06:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:06:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2009-dev.codfw.wmnet with OS bookworm [23:07:01] (CR) HMonroy: [C:+2] [mediawikiwiki] enable CodeMirror V6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1019893 (https://phabricator.wikimedia.org/T357795) (owner: MusikAnimal) [23:07:12] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720868 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2009-... 
[23:07:27] (CR) TrainBranchBot: [C:+2] "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1019893 (https://phabricator.wikimedia.org/T357795) (owner: MusikAnimal) [23:07:47] (Merged) jenkins-bot: [mediawikiwiki] enable CodeMirror V6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1019893 (https://phabricator.wikimedia.org/T357795) (owner: MusikAnimal) [23:08:14] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:1019893|[mediawikiwiki] enable CodeMirror V6 (T357795)]] [23:08:19] T357795: CodeMirror 6 deployment - https://phabricator.wikimedia.org/T357795 [23:11:16] !log hmonroy@deploy1002 musikanimal and hmonroy: Backport for [[gerrit:1019893|[mediawikiwiki] enable CodeMirror V6 (T357795)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:11:47] SRE, LDAP-Access-Requests: Grant Access to 'wmf' ldap group for DErenrich to allow logstash access - https://phabricator.wikimedia.org/T362731 (derenrich) NEW [23:12:21] !log hmonroy@deploy1002 musikanimal and hmonroy: Continuing with sync [23:15:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 953.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:25:43] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:1019893|[mediawikiwiki] enable CodeMirror V6 (T357795)]] (duration: 17m 29s) [23:25:49] T357795: CodeMirror 6 deployment - https://phabricator.wikimedia.org/T357795 [23:37:46] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019784 [23:37:46] (CR) TrainBranchBot: [C:+2] Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1019784 (owner: TrainBranchBot) [23:50:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 850.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:58:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 871.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded