[00:10:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P59947 and previous config saved to /var/cache/conftool/dbconfig/20240409-001037-arnaudb.json [00:25:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P59948 and previous config saved to /var/cache/conftool/dbconfig/20240409-002545-arnaudb.json [00:40:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T360332)', diff saved to https://phabricator.wikimedia.org/P59949 and previous config saved to /var/cache/conftool/dbconfig/20240409-004052-arnaudb.json [00:40:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [00:41:03] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [00:41:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [00:41:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T360332)', diff saved to https://phabricator.wikimedia.org/P59950 and previous config saved to /var/cache/conftool/dbconfig/20240409-004115-arnaudb.json [00:41:57] (ProbeDown) firing: (4) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:43:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T360332)', diff saved to https://phabricator.wikimedia.org/P59951 and previous config saved to /var/cache/conftool/dbconfig/20240409-004336-arnaudb.json [00:46:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1160 (T356166)', diff saved to https://phabricator.wikimedia.org/P59952 and previous config saved to /var/cache/conftool/dbconfig/20240409-004645-marostegui.json [00:46:49] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [00:46:57] (ProbeDown) resolved: (4) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:58:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P59953 and previous config saved to /var/cache/conftool/dbconfig/20240409-005843-arnaudb.json [01:01:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P59954 and previous config saved to /var/cache/conftool/dbconfig/20240409-010152-marostegui.json [01:03:39] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T362126 (phaultfinder) NEW [01:07:47] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.26 [core] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1017460 (https://phabricator.wikimedia.org/T360158) [01:07:49] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.42.0-wmf.26 [core] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1017460 (https://phabricator.wikimedia.org/T360158) (owner: TrainBranchBot) [01:13:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P59955 and previous config saved to /var/cache/conftool/dbconfig/20240409-011351-arnaudb.json [01:17:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to 
https://phabricator.wikimedia.org/P59956 and previous config saved to /var/cache/conftool/dbconfig/20240409-011700-marostegui.json [01:28:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T360332)', diff saved to https://phabricator.wikimedia.org/P59957 and previous config saved to /var/cache/conftool/dbconfig/20240409-012858-arnaudb.json [01:29:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [01:29:02] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [01:29:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [01:29:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [01:29:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [01:29:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2140 (T360332)', diff saved to https://phabricator.wikimedia.org/P59958 and previous config saved to /var/cache/conftool/dbconfig/20240409-012949-arnaudb.json [01:29:55] (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.26 [core] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1017460 (https://phabricator.wikimedia.org/T360158) (owner: TrainBranchBot) [01:32:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T356166)', diff saved to https://phabricator.wikimedia.org/P59959 and previous config saved to /var/cache/conftool/dbconfig/20240409-013208-marostegui.json [01:32:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T360332)', diff saved to https://phabricator.wikimedia.org/P59960 and previous config saved to /var/cache/conftool/dbconfig/20240409-013208-arnaudb.json 
[01:32:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [01:32:21] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [01:32:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [01:32:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T356166)', diff saved to https://phabricator.wikimedia.org/P59961 and previous config saved to /var/cache/conftool/dbconfig/20240409-013231-marostegui.json [01:41:33] SRE, cloud-services-team, Infrastructure-Foundations, vm-requests: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128 (Andrew) NEW [01:41:48] SRE, cloud-services-team, Infrastructure-Foundations, vm-requests: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9699485 (Andrew) [01:43:43] SRE, cloud-services-team, Infrastructure-Foundations, vm-requests: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9699486 (Andrew) [01:46:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:47:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P59962 and previous config saved to /var/cache/conftool/dbconfig/20240409-014716-arnaudb.json [01:49:14] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@6bb821b]: (no justification provided) [01:49:46] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@6bb821b]: (no justification provided) (duration: 00m 31s) [01:55:38] 
(PS1) Jforrester: ExtensionDistributor: Add REL1_42 as a beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1017952 (https://phabricator.wikimedia.org/T359844) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T0200) [02:02:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P59963 and previous config saved to /var/cache/conftool/dbconfig/20240409-020223-arnaudb.json [02:05:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 874.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:10:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 874.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:11:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 881ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:16:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 890.2ms - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:17:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T360332)', diff saved to https://phabricator.wikimedia.org/P59964 and previous config saved to /var/cache/conftool/dbconfig/20240409-021731-arnaudb.json [02:17:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [02:17:35] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [02:17:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [02:17:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T360332)', diff saved to https://phabricator.wikimedia.org/P59965 and previous config saved to /var/cache/conftool/dbconfig/20240409-021755-arnaudb.json [02:20:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T360332)', diff saved to https://phabricator.wikimedia.org/P59966 and previous config saved to /var/cache/conftool/dbconfig/20240409-022015-arnaudb.json [02:31:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 892.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:35:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P59967 and previous 
config saved to /var/cache/conftool/dbconfig/20240409-023522-arnaudb.json [02:38:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:41:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:43:08] (PS1) Andrew Bogott: vendordata/cloud-init: remove disk and fs entries [puppet] - https://gerrit.wikimedia.org/r/1017955 (https://phabricator.wikimedia.org/T355963) [02:43:09] (PS1) Andrew Bogott: vendordata/cloud-init: remove ruby-sorted-set package request [puppet] - https://gerrit.wikimedia.org/r/1017956 (https://phabricator.wikimedia.org/T355963) [02:44:33] (CR) Andrew Bogott: [C:+2] vendordata/cloud-init: remove disk and fs entries [puppet] - https://gerrit.wikimedia.org/r/1017955 (https://phabricator.wikimedia.org/T355963) (owner: Andrew Bogott) [02:44:45] (CR) Andrew Bogott: [C:+2] vendordata/cloud-init: remove ruby-sorted-set package request [puppet] - https://gerrit.wikimedia.org/r/1017956 (https://phabricator.wikimedia.org/T355963) (owner: Andrew Bogott) [02:46:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 844.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:50:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P59968 and previous config saved to /var/cache/conftool/dbconfig/20240409-025030-arnaudb.json [02:54:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 926.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:58:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 926.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T0300) [03:05:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T360332)', diff saved to https://phabricator.wikimedia.org/P59969 and previous config saved to 
/var/cache/conftool/dbconfig/20240409-030537-arnaudb.json [03:05:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [03:05:46] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [03:05:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [03:05:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [03:06:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [03:06:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T360332)', diff saved to https://phabricator.wikimedia.org/P59970 and previous config saved to /var/cache/conftool/dbconfig/20240409-030617-arnaudb.json [03:08:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 985.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:08:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T360332)', diff saved to https://phabricator.wikimedia.org/P59971 and previous config saved to /var/cache/conftool/dbconfig/20240409-030836-arnaudb.json [03:18:32] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 954.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:23:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P59972 and previous config saved to /var/cache/conftool/dbconfig/20240409-032344-arnaudb.json [03:25:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.036s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:38:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P59973 and previous config saved to /var/cache/conftool/dbconfig/20240409-033851-arnaudb.json [03:53:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T360332)', diff saved to https://phabricator.wikimedia.org/P59974 and previous config saved to /var/cache/conftool/dbconfig/20240409-035359-arnaudb.json [03:54:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [03:54:03] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [03:54:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [03:54:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59975 and previous config saved to /var/cache/conftool/dbconfig/20240409-035422-arnaudb.json [03:56:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59976 and previous config 
saved to /var/cache/conftool/dbconfig/20240409-035641-arnaudb.json [04:11:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P59977 and previous config saved to /var/cache/conftool/dbconfig/20240409-041149-arnaudb.json [04:25:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 870.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:26:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P59978 and previous config saved to /var/cache/conftool/dbconfig/20240409-042657-arnaudb.json [04:28:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 851.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:33:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 843.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:35:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 851.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:40:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 847ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:42:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T360332)', diff saved to https://phabricator.wikimedia.org/P59979 and previous config saved to /var/cache/conftool/dbconfig/20240409-044204-arnaudb.json [04:42:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance [04:42:08] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [04:42:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance [04:42:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T360332)', diff saved to https://phabricator.wikimedia.org/P59980 and previous config saved to /var/cache/conftool/dbconfig/20240409-044216-arnaudb.json [04:43:33] (PS1) David Martin: Add a stream for tracking the API of WikiLambda [mediawiki-config] - https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) [04:44:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T360332)', diff saved to https://phabricator.wikimedia.org/P59981 and previous config saved to /var/cache/conftool/dbconfig/20240409-044438-arnaudb.json [04:48:15] (MediaWikiLatencyExceeded) firing: p75 latency high: 
eqiad mw-parsoid (k8s) 1.154s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:53:29] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.154s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:59:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P59982 and previous config saved to /var/cache/conftool/dbconfig/20240409-045946-arnaudb.json [05:07:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 976.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:09:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s2 T362036 [05:10:01] T362036: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T362036 [05:10:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s2 T362036 [05:10:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1222 with weight 0 T362036', diff saved to https://phabricator.wikimedia.org/P59983 and previous config saved to 
/var/cache/conftool/dbconfig/20240409-051027-marostegui.json [05:11:47] (PS1) Marostegui: installserver: Format db1246 [puppet] - https://gerrit.wikimedia.org/r/1017964 (https://phabricator.wikimedia.org/T361968) [05:12:18] (CR) Marostegui: [C:+2] mariadb: Promote db1222 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1017451 (https://phabricator.wikimedia.org/T362036) (owner: Gerrit maintenance bot) [05:14:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P59984 and previous config saved to /var/cache/conftool/dbconfig/20240409-051454-arnaudb.json [05:15:04] (CR) Marostegui: [C:+2] installserver: Format db1246 [puppet] - https://gerrit.wikimedia.org/r/1017964 (https://phabricator.wikimedia.org/T361968) (owner: Marostegui) [05:28:14] !log Starting s2 eqiad failover from db1162 to db1222 - T362036 [05:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:18] T362036: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T362036 [05:28:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T362036', diff saved to https://phabricator.wikimedia.org/P59985 and previous config saved to /var/cache/conftool/dbconfig/20240409-052827-marostegui.json [05:28:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1222 to s2 primary and set section read-write T362036', diff saved to https://phabricator.wikimedia.org/P59986 and previous config saved to /var/cache/conftool/dbconfig/20240409-052855-marostegui.json [05:29:43] (CR) Marostegui: [C:+2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1017452 (https://phabricator.wikimedia.org/T362036) (owner: Gerrit maintenance bot) [05:30:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P59987 and previous config saved to /var/cache/conftool/dbconfig/20240409-053001-arnaudb.json [05:30:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2210.codfw.wmnet with reason: Maintenance [05:30:07] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [05:30:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1162 T362036', diff saved to https://phabricator.wikimedia.org/P59988 and previous config saved to /var/cache/conftool/dbconfig/20240409-053005-root.json [05:30:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2210.codfw.wmnet with reason: Maintenance [05:30:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T360332)', diff saved to https://phabricator.wikimedia.org/P59989 and previous config saved to /var/cache/conftool/dbconfig/20240409-053024-arnaudb.json [05:32:18] (PS1) Marostegui: db1162: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1017965 (https://phabricator.wikimedia.org/T361543) [05:32:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T360332)', diff saved to https://phabricator.wikimedia.org/P59990 and previous config saved to /var/cache/conftool/dbconfig/20240409-053245-arnaudb.json [05:33:39] (CR) Marostegui: [C:+2] db1162: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1017965 (https://phabricator.wikimedia.org/T361543) (owner: Marostegui) [05:33:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1162.eqiad.wmnet with OS bookworm [05:36:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:37:15] 
(MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.263s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:39:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.027s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:39:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [05:39:39] ops-eqiad, SRE, DBA, Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9699705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm [05:44:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.006s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:45:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: host reimage [05:47:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P59991 and previous config saved to /var/cache/conftool/dbconfig/20240409-054752-arnaudb.json [05:48:35] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: host reimage [05:51:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [05:53:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.079s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:55:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [05:58:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.086s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:58:45] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 955.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T0600). [06:03:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P59992 and previous config saved to /var/cache/conftool/dbconfig/20240409-060300-arnaudb.json [06:08:45] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.086s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:09:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1162.eqiad.wmnet with OS bookworm [06:15:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm [06:16:24] ops-eqiad, SRE, DBA, Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9699735 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm completed: - db1246 (**WARN**) - Downtimed on Icinga/A...
[06:18:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T360332)', diff saved to https://phabricator.wikimedia.org/P59993 and previous config saved to /var/cache/conftool/dbconfig/20240409-061807-arnaudb.json [06:18:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2219.codfw.wmnet with reason: Maintenance [06:18:11] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [06:18:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2219.codfw.wmnet with reason: Maintenance [06:18:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T360332)', diff saved to https://phabricator.wikimedia.org/P59994 and previous config saved to /var/cache/conftool/dbconfig/20240409-061830-arnaudb.json [06:20:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T360332)', diff saved to https://phabricator.wikimedia.org/P59995 and previous config saved to /var/cache/conftool/dbconfig/20240409-062050-arnaudb.json [06:29:21] (03PS1) 10KartikMistry: ContentTranslation: Limit publishing in zhwiki for extendedconfirmed users only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018135 (https://phabricator.wikimedia.org/T349959) [06:31:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P59996 and previous config saved to /var/cache/conftool/dbconfig/20240409-063558-arnaudb.json [06:38:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:40:57] (03PS1) 10Marostegui: Revert "installserver: Format db1246" [puppet] - 10https://gerrit.wikimedia.org/r/1017997 [06:41:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:49:57] 06SRE, 10SRE-Access-Requests: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9699785 (10MoritzMuehlenhoff) @odimitrijevic @Ahoelzl @WDoranWMF This needs your approval. [06:51:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P59997 and previous config saved to /var/cache/conftool/dbconfig/20240409-065105-arnaudb.json [06:51:49] (03CR) 10Marostegui: [C:03+2] Revert "installserver: Format db1246" [puppet] - 10https://gerrit.wikimedia.org/r/1017997 (owner: 10Marostegui) [06:57:35] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [06:59:48] (03PS1) 10Muehlenhoff: Re-add LDAP access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1018136 (https://phabricator.wikimedia.org/T361665) [06:59:59] (03CR) 10Anzx: [C:03+1] "you can schedule this patch for deployment through backport windows https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [07:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T0700). 
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:23] OK. I'm here. [07:01:59] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1162.eqiad.wmnet onto db1246.eqiad.wmnet [07:02:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018135 (https://phabricator.wikimedia.org/T349959) (owner: 10KartikMistry) [07:03:08] 06SRE, 06collaboration-services, 10Znuny: OTRS spam classification methods and systems - https://phabricator.wikimedia.org/T146968#9699793 (10MoritzMuehlenhoff) [07:03:39] (03Merged) 10jenkins-bot: ContentTranslation: Limit publishing in zhwiki for extendedconfirmed users only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018135 (https://phabricator.wikimedia.org/T349959) (owner: 10KartikMistry) [07:03:45] 06SRE, 10SRE-tools, 06Data-Platform-SRE, 06Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#9699794 (10MoritzMuehlenhoff) [07:04:11] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1018135|ContentTranslation: Limit publishing in zhwiki for extendedconfirmed users only (T349959)]] [07:04:15] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9699796 (10MoritzMuehlenhoff) p:05Triage→03Medium [07:04:17] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959 [07:04:38] 06SRE: Phase out cergen for Fundraising services - https://phabricator.wikimedia.org/T360779#9699798 (10MoritzMuehlenhoff) p:05Triage→03Medium [07:05:01] (03CR) 10Muehlenhoff: [C:03+2] Re-add LDAP access for andyrussg [puppet] - 
10https://gerrit.wikimedia.org/r/1018136 (https://phabricator.wikimedia.org/T361665) (owner: 10Muehlenhoff) [07:05:39] (03PS1) 10Marostegui: Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1018155 [07:06:02] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9699800 (10Marostegui) Host reimaged. Now recloning. [07:06:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T360332)', diff saved to https://phabricator.wikimedia.org/P59998 and previous config saved to /var/cache/conftool/dbconfig/20240409-070613-arnaudb.json [07:06:17] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [07:08:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:08:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:08:50] 06SRE, 10SRE-Access-Requests: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9699806 (10MoritzMuehlenhoff) a:05RLazarus→03MoritzMuehlenhoff [07:09:14] !log kartik@deploy1002 kartik: Backport for [[gerrit:1018135|ContentTranslation: Limit publishing in zhwiki for extendedconfirmed users only (T349959)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:09:20] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959 [07:10:23] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: 14Grant Access to wmf for AndyRussG - 14https://phabricator.wikimedia.org/T361665#9699804 (10MoritzMuehlenhoff) 05Open→03Resolved 14@AndyRussG : I've enabled your LDAP access with the wmf group, you should now be able to access the services listed u... 
[07:16:04] !log kartik@deploy1002 kartik: Continuing with sync [07:21:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9699812 (10MoritzMuehlenhoff) @SToyofuku-WMF @NBaca-WMF Can you please clarify which specific type of access you want/need: You mentioned access to analytics-privateda... [07:22:12] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1016312 (https://phabricator.wikimedia.org/T360412) (owner: 10Muehlenhoff) [07:26:05] (03CR) 10Filippo Giunchedi: [C:03+2] node-exporter: ignore run/credentials mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/1017784 (owner: 10Filippo Giunchedi) [07:29:42] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1018135|ContentTranslation: Limit publishing in zhwiki for extendedconfirmed users only (T349959)]] (duration: 25m 30s) [07:29:46] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959 [07:29:58] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9699824 (10Xover) But now I haven't gotten any more copies since April 5, so whatever it was seems to have cleared for now. It's probably still a good... [07:30:46] No more patches in backport/config window. 
[07:33:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2213.codfw.wmnet with reason: Silence for reimage [07:33:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2213.codfw.wmnet with reason: Silence for reimage [07:34:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 depool for reimage T360116', diff saved to https://phabricator.wikimedia.org/P59999 and previous config saved to /var/cache/conftool/dbconfig/20240409-073406-arnaudb.json [07:34:11] T360116: Upgrade s5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T360116 [07:37:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2113.codfw.wmnet with OS bookworm [07:46:10] SRE-tools, Infrastructure-Foundations: sre.hosts.decommission: don't FAIL when unable to set icinga downtime - https://phabricator.wikimedia.org/T282019#9699847 (Volans) Open→Declined This specific failure is due to the special nature of the secondary Icinga host that is not monitored... [07:50:27] (CR) Filippo Giunchedi: [C:+2] titan: trim 5m retention to 4y + 1w [puppet] - https://gerrit.wikimedia.org/r/1017806 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi) [07:53:50] SRE, CAS-SSO, Infrastructure-Foundations: Fine-tune CAS logging - https://phabricator.wikimedia.org/T233949#9699871 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff We're happy with the current logging for CAS.
[07:53:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2113.codfw.wmnet with reason: host reimage [07:54:02] !log puppet cert clean swift_codfw T361844 [07:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:06] T361844: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844 [07:56:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2113.codfw.wmnet with reason: host reimage [07:58:20] SRE, Infrastructure-Foundations, Mail, MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9699892 (Jan.Kamenicek) Confirm, neither have I. Hope it will not recur. [07:58:47] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate swift_codfw is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:00:04] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T0800) [08:00:14] :) [08:00:28] why is it late by 4 seconds?! [08:00:56] !sal [08:00:56] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need.
[08:02:58] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018187 (https://phabricator.wikimedia.org/T360158) [08:03:00] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018187 (https://phabricator.wikimedia.org/T360158) (owner: 10TrainBranchBot) [08:03:17] (03CR) 10Fabfur: [V:03+1] haproxy: remove timestamp from unique-id-format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1017913 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [08:03:22] (03CR) 10Fabfur: [V:03+1 C:03+2] haproxy: remove timestamp from unique-id-format [puppet] - 10https://gerrit.wikimedia.org/r/1017913 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [08:03:44] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018187 (https://phabricator.wikimedia.org/T360158) (owner: 10TrainBranchBot) [08:09:01] (03PS1) 10Fabfur: haproxy: increase capture header length for UA [puppet] - 10https://gerrit.wikimedia.org/r/1018188 (https://phabricator.wikimedia.org/T351117) [08:11:43] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362107#9699902 (10Reedy) [08:13:23] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018188 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [08:13:28] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:17:14] !log arnaudb@cumin1002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2113.codfw.wmnet with OS bookworm [08:21:51] (03CR) 10Gmodena: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1018188 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [08:22:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 1%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60000 and previous config saved to /var/cache/conftool/dbconfig/20240409-082227-arnaudb.json [08:23:58] (03CR) 10Fabfur: [V:03+1 C:03+2] haproxy: increase capture header length for UA [puppet] - 10https://gerrit.wikimedia.org/r/1018188 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [08:24:53] (03PS1) 10MVernon: SSL: update swift_codfw TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1018190 (https://phabricator.wikimedia.org/T361844) [08:26:27] (03PS6) 10Effie Mouzeli: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) [08:28:15] !log hashar@deploy1002 Started scap: testwikis wikis to 1.42.0-wmf.26 refs T360158 [08:28:18] T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158 [08:32:45] (03PS7) 10Effie Mouzeli: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) [08:35:03] (03PS4) 10Effie Mouzeli: WIP: mcrouter: update comments in mcrouter image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692371 [08:35:18] that train is gonna take a long time [08:35:19] :D [08:35:36] (03PS5) 10Effie Mouzeli: mcrouter: update comments in mcrouter image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692371 [08:36:34] (03CR) 10Effie Mouzeli: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php (031 comment) 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [08:37:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 2%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60001 and previous config saved to /var/cache/conftool/dbconfig/20240409-083733-arnaudb.json [08:37:54] (03CR) 10Marostegui: [C:03+1] SSL: update swift_codfw TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1018190 (https://phabricator.wikimedia.org/T361844) (owner: 10MVernon) [08:39:32] (03CR) 10MVernon: [C:03+2] SSL: update swift_codfw TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1018190 (https://phabricator.wikimedia.org/T361844) (owner: 10MVernon) [08:41:46] (03CR) 10JMeybohm: [C:03+1] php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [08:45:03] (03CR) 10Brouberol: [WIP] Add datasets-config helm chart and helmfile (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [08:46:31] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9699983 (10Marostegui) Host recloned. I am going to leave it running for 24h before repooling it back into production, just in case. Thanks for the help @VRiley-WMF! [08:46:58] (CertAlmostExpired) firing: (2) Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:47:25] 10SRE-swift-storage, 13Patch-For-Review: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844#9699986 (10MatthewVernon) codfw done OK, cert now says `Not After : Apr 8 08:00:23 2029 GMT`. 
[08:47:40] (03CR) 10JMeybohm: mediawiki: add MW__MCROUTER_SERVER variable in chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [08:52:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 4%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60002 and previous config saved to /var/cache/conftool/dbconfig/20240409-085238-arnaudb.json [08:53:32] (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:54:44] 10SRE-swift-storage, 10Phabricator, 07git-lfs, 10Release-Engineering-Team (Seen): 14Connect Phabricator to swift for storage of git-lfs and file uploads. - 14https://phabricator.wikimedia.org/T182085#9700011 (10hashar) 14> There was maybe a suggestion of using it for files uploaded to phab? Ideally m... 
[08:56:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1162.eqiad.wmnet onto db1246.eqiad.wmnet [08:59:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:59:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:00:23] (03CR) 10Muehlenhoff: [C:03+2] schema: Remove obsolete certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016315 (https://phabricator.wikimedia.org/T360412) (owner: 10Muehlenhoff) [09:01:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:01:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:01:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:02:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:02:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T360332)', diff saved to https://phabricator.wikimedia.org/P60003 and previous config saved to /var/cache/conftool/dbconfig/20240409-090210-arnaudb.json [09:02:15] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:04:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:04:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:04:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P60004 and previous config saved to /var/cache/conftool/dbconfig/20240409-090435-arnaudb.json [09:04:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T360332)', diff saved to https://phabricator.wikimedia.org/P60005 and previous config saved to /var/cache/conftool/dbconfig/20240409-090442-arnaudb.json [09:07:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 8%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60006 and previous config saved to /var/cache/conftool/dbconfig/20240409-090744-arnaudb.json [09:07:50] (03PS1) 10Muehlenhoff: Switch testreduce to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1018199 (https://phabricator.wikimedia.org/T360636) [09:08:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1018199 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [09:08:38] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9700038 (10MoritzMuehlenhoff) >>! In T360636#9698325, @akosiaris wrote: > I 'll finish parsoid and testreduce in T359387 If I'm not mistaken testreduce is still unrelated,... 
[09:10:27] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Add shell script to replace rsync bare commands [puppet] - 10https://gerrit.wikimedia.org/r/1017823 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [09:10:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T360332)', diff saved to https://phabricator.wikimedia.org/P60007 and previous config saved to /var/cache/conftool/dbconfig/20240409-091057-arnaudb.json [09:11:02] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:11:12] !log installing postgresql-13 security updates [09:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:08] (03PS1) 10Peter Fischer: cirrus: Increase taskManager.resource.memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018200 [09:13:19] (03CR) 10Peter Fischer: [C:03+2] cirrus: Increase taskManager.resource.memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018200 (owner: 10Peter Fischer) [09:15:40] (03Merged) 10jenkins-bot: cirrus: Increase taskManager.resource.memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018200 (owner: 10Peter Fischer) [09:18:26] !log hashar@deploy1002 Finished scap: testwikis wikis to 1.42.0-wmf.26 refs T360158 (duration: 50m 11s) [09:18:29] T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158 [09:19:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P60008 and previous config saved to /var/cache/conftool/dbconfig/20240409-091949-arnaudb.json [09:20:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T356166)', diff saved to https://phabricator.wikimedia.org/P60009 and previous config saved to /var/cache/conftool/dbconfig/20240409-092043-marostegui.json [09:20:48] T356166: Drop cl_collation_ext index from categorylinks in 
production - https://phabricator.wikimedia.org/T356166 [09:22:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 16%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60010 and previous config saved to /var/cache/conftool/dbconfig/20240409-092249-arnaudb.json [09:26:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P60011 and previous config saved to /var/cache/conftool/dbconfig/20240409-092605-arnaudb.json [09:26:08] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9700070 (10MoritzMuehlenhoff) grafana-labs.wikimedia.org is just a redirect to https://wikitech.wikimedia.org/wiki/News/2023_Cloud_VPS_me... [09:26:52] doing group0 now [09:27:03] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018202 (https://phabricator.wikimedia.org/T360158) [09:27:05] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018202 (https://phabricator.wikimedia.org/T360158) (owner: 10TrainBranchBot) [09:27:08] (03CR) 10Marostegui: [C:03+2] Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1018155 (owner: 10Marostegui) [09:27:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60012 and previous config saved to /var/cache/conftool/dbconfig/20240409-092740-root.json [09:27:54] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018202 (https://phabricator.wikimedia.org/T360158) (owner: 10TrainBranchBot) [09:29:26] (03CR) 10LSobanski: "Does this change need a new reviewer or can it be abandoned?" 
[puppet] - 10https://gerrit.wikimedia.org/r/842950 (owner: 10Majavah) [09:33:49] (03Abandoned) 10Majavah: rsync: drop support for auto_ferm_ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/842950 (owner: 10Majavah) [09:34:12] (03CR) 10Clément Goubert: [C:03+1] php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [09:34:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P60013 and previous config saved to /var/cache/conftool/dbconfig/20240409-093457-arnaudb.json [09:35:21] (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [09:35:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P60014 and previous config saved to /var/cache/conftool/dbconfig/20240409-093551-marostegui.json [09:37:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 25%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60015 and previous config saved to /var/cache/conftool/dbconfig/20240409-093755-arnaudb.json [09:39:22] (03PS1) 10JMeybohm: Revert "Revert "Remove flink RBAC snowflakes"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018214 (https://phabricator.wikimedia.org/T350784) [09:39:59] (03PS2) 10JMeybohm: Revert "Revert "Remove flink RBAC snowflakes"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018214 (https://phabricator.wikimedia.org/T326409) [09:40:36] (03PS3) 10Majavah: P:wmcs::metricsinfra::alertmanager: add basic auth support [puppet] - 10https://gerrit.wikimedia.org/r/1017854 
(https://phabricator.wikimedia.org/T362061) [09:40:36] (03PS3) 10Majavah: P:wmcs::metricsinfra::haproxy: add proxy to alertmanager rw endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1017853 (https://phabricator.wikimedia.org/T362061) [09:40:56] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.26 refs T360158 [09:40:59] T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158 [09:41:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P60016 and previous config saved to /var/cache/conftool/dbconfig/20240409-094113-arnaudb.json [09:41:35] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1812/co" [puppet] - 10https://gerrit.wikimedia.org/r/1017853 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [09:41:53] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017464 [09:42:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60017 and previous config saved to /var/cache/conftool/dbconfig/20240409-094246-root.json [09:45:47] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:46:13] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:46:33] (03CR) 10JMeybohm: [C:03+2] Revert "Revert "Remove flink RBAC snowflakes"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018214 (https://phabricator.wikimedia.org/T326409) (owner: 10JMeybohm) [09:47:54] (03PS4) 10Majavah: P:wmcs::metricsinfra::alertmanager: add basic auth support [puppet] - 10https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) [09:47:54] (03PS4) 
10Majavah: P:wmcs::metricsinfra::haproxy: add proxy to alertmanager rw endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1017853 (https://phabricator.wikimedia.org/T362061) [09:49:03] (03Merged) 10jenkins-bot: Revert "Revert "Remove flink RBAC snowflakes"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018214 (https://phabricator.wikimedia.org/T326409) (owner: 10JMeybohm) [09:49:12] (03CR) 10DCausse: Add Flink alerts for Cirrus Streaming Updater (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [09:49:56] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9700094 (10akosiaris) >>! In T360636#9700038, @MoritzMuehlenhoff wrote: >>>! In T360636#9698325, @akosiaris wrote: >> I 'll finish parsoid and testreduce in T359387 > > If... [09:50:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T360332)', diff saved to https://phabricator.wikimedia.org/P60018 and previous config saved to /var/cache/conftool/dbconfig/20240409-095004-arnaudb.json [09:50:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:50:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:50:10] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:50:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:50:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:50:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T360332)', diff saved to https://phabricator.wikimedia.org/P60019 and previous config saved to 
/var/cache/conftool/dbconfig/20240409-095054-arnaudb.json [09:51:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P60020 and previous config saved to /var/cache/conftool/dbconfig/20240409-095105-marostegui.json [09:51:44] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9700096 (10fgiunchedi) Indeed, I think `grafana_labs.certs.yaml` as a whole can be ditched [09:53:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 50%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60021 and previous config saved to /var/cache/conftool/dbconfig/20240409-095300-arnaudb.json [09:53:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T360332)', diff saved to https://phabricator.wikimedia.org/P60022 and previous config saved to /var/cache/conftool/dbconfig/20240409-095323-arnaudb.json [09:53:37] (03CR) 10Gmodena: [WIP] Add datasets-config helm chart and helmfile (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [09:55:02] (03PS5) 10Majavah: P:wmcs::metricsinfra::alertmanager: add basic auth support [puppet] - 10https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) [09:55:02] (03PS5) 10Majavah: P:wmcs::metricsinfra::haproxy: add proxy to alertmanager rw endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1017853 (https://phabricator.wikimedia.org/T362061) [09:56:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T360332)', diff saved to https://phabricator.wikimedia.org/P60023 and previous config saved to /var/cache/conftool/dbconfig/20240409-095620-arnaudb.json [09:56:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on 
db1175.eqiad.wmnet with reason: Maintenance [09:56:24] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:56:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:56:36] (03PS5) 10Effie Mouzeli: mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) [09:56:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T360332)', diff saved to https://phabricator.wikimedia.org/P60024 and previous config saved to /var/cache/conftool/dbconfig/20240409-095642-arnaudb.json [09:57:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60025 and previous config saved to /var/cache/conftool/dbconfig/20240409-095751-root.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T1000) [10:02:58] !log puppet cert clean swift_eqiad T361844 [10:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:01] T361844: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844 [10:03:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T360332)', diff saved to https://phabricator.wikimedia.org/P60026 and previous config saved to /var/cache/conftool/dbconfig/20240409-100308-arnaudb.json [10:03:11] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:06:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T356166)', diff saved to https://phabricator.wikimedia.org/P60027 and previous config saved to /var/cache/conftool/dbconfig/20240409-100612-marostegui.json [10:06:15] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [10:06:16] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [10:06:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [10:06:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T356166)', diff saved to https://phabricator.wikimedia.org/P60028 and previous config saved to /var/cache/conftool/dbconfig/20240409-100635-marostegui.json [10:08:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 75%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60029 and previous config saved to /var/cache/conftool/dbconfig/20240409-100806-arnaudb.json [10:08:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P60030 and previous config saved to /var/cache/conftool/dbconfig/20240409-100830-arnaudb.json [10:08:47] (PuppetCertificateAboutToExpire) resolved: Puppet CA certificate swift_eqiad is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:09:25] (03PS1) 10MVernon: SSL: update swift_eqiad TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1018227 (https://phabricator.wikimedia.org/T361844) [10:12:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60031 and previous config saved to /var/cache/conftool/dbconfig/20240409-101257-root.json [10:14:03] (03PS1) 10Majavah: utils: format-code: Remove --apply from isort [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018229 [10:14:03] (03PS1) 10Majavah: alertmanager: Add support for per-instance HTTP proxy 
configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018230 (https://phabricator.wikimedia.org/T360932) [10:18:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P60032 and previous config saved to /var/cache/conftool/dbconfig/20240409-101815-arnaudb.json [10:19:37] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host snapshot1015.eqiad.wmnet with OS bullseye [10:19:51] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9700186 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapsh... [10:20:17] (03CR) 10Marostegui: [C:03+1] SSL: update swift_eqiad TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1018227 (https://phabricator.wikimedia.org/T361844) (owner: 10MVernon) [10:20:38] (03CR) 10MVernon: [C:03+2] SSL: update swift_eqiad TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1018227 (https://phabricator.wikimedia.org/T361844) (owner: 10MVernon) [10:20:59] (03CR) 10Alexandros Kosiaris: [WIP] Add datasets-config helm chart and helmfile (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [10:23:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 100%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P60033 and previous config saved to /var/cache/conftool/dbconfig/20240409-102312-arnaudb.json [10:23:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P60034 and previous config saved to /var/cache/conftool/dbconfig/20240409-102337-arnaudb.json [10:25:20] (03CR) 10Volans: [C:03+1] "LGTM, thanks" [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1018229 (owner: 10Majavah) [10:25:53] (03CR) 10Majavah: [C:03+2] utils: format-code: Remove --apply from isort [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018229 (owner: 10Majavah) [10:26:58] (CertAlmostExpired) resolved: Certificate for service swift-https:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:28:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60035 and previous config saved to /var/cache/conftool/dbconfig/20240409-102803-root.json [10:28:11] (03PS3) 10Muehlenhoff: chartmuseum: Migrate to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1018228 (https://phabricator.wikimedia.org/T360636) (owner: 10Clément Goubert) [10:28:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1018228 (https://phabricator.wikimedia.org/T360636) (owner: 10Clément Goubert) [10:28:49] (03CR) 10FNegri: "LGTM! 
Two questions:" [puppet] - 10https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [10:29:12] moritzm not even letting me putting it up for review before actually reviewing it x) [10:30:08] 10SRE-swift-storage, 13Patch-For-Review: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844#9700266 (10MatthewVernon) eqiad done, `Not After : Apr 8 10:04:14 2029 GMT` [10:30:39] hehe [10:31:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:29] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1015.eqiad.wmnet with reason: host reimage [10:31:59] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on db2097.codfw.wmnet with reason: host weirdness and possible decom [10:32:12] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on db2097.codfw.wmnet with reason: host weirdness and possible decom [10:33:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P60036 and previous config saved to /var/cache/conftool/dbconfig/20240409-103323-arnaudb.json [10:34:09] (03Merged) 10jenkins-bot: utils: format-code: Remove --apply from isort [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018229 (owner: 10Majavah) [10:34:43] (03CR) 10Clément Goubert: [V:03+1 C:03+2] chartmuseum: Migrate to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1018228 (https://phabricator.wikimedia.org/T360636) (owner: 10Clément Goubert) [10:34:47] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1015.eqiad.wmnet with reason: host reimage 
[10:35:58] (03PS2) 10Majavah: alertmanager: Add support for per-instance HTTP proxy configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018230 (https://phabricator.wikimedia.org/T360932) [10:38:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T360332)', diff saved to https://phabricator.wikimedia.org/P60037 and previous config saved to /var/cache/conftool/dbconfig/20240409-103845-arnaudb.json [10:38:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:38:49] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:38:57] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9700282 (10Clement_Goubert) [10:39:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:39:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60038 and previous config saved to /var/cache/conftool/dbconfig/20240409-103908-arnaudb.json [10:41:14] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:41:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60039 and previous config saved to /var/cache/conftool/dbconfig/20240409-104143-arnaudb.json [10:41:53] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [10:42:39] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[10:42:43] SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9700299 (MoritzMuehlenhoff) [10:43:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60040 and previous config saved to /var/cache/conftool/dbconfig/20240409-104308-root.json [10:43:12] (CR) David Caro: [C:+1] P:wmcs::metricsinfra::alertmanager: add basic auth support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) (owner: Majavah) [10:43:55] (CR) CI reject: [V:-1] alertmanager: Add support for per-instance HTTP proxy configuration [software/spicerack] - https://gerrit.wikimedia.org/r/1018230 (https://phabricator.wikimedia.org/T360932) (owner: Majavah) [10:44:11] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:45:01] SRE-swift-storage, Patch-For-Review: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844#9700303 (MatthewVernon) Open→Resolved a:MatthewVernon I've added [[https://wikitech.wikimedia.org/wiki/Swift/How_To#Update_internal_TLS_certificates|a... [10:45:03] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:45:05] (PS3) Majavah: alertmanager: Add support for per-instance HTTP proxy configuration [software/spicerack] - https://gerrit.wikimedia.org/r/1018230 (https://phabricator.wikimedia.org/T360932) [10:45:38] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:47:43] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9700307 (10BTullis) [10:48:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T360332)', diff saved to https://phabricator.wikimedia.org/P60041 and previous config saved to /var/cache/conftool/dbconfig/20240409-104830-arnaudb.json [10:48:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:48:34] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:48:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:48:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T360332)', diff saved to https://phabricator.wikimedia.org/P60042 and previous config saved to /var/cache/conftool/dbconfig/20240409-104853-arnaudb.json [10:56:26] (03CR) 10Alexandros Kosiaris: [C:03+2] Route /w/docs/ to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/1013389 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [10:56:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P60043 and previous config saved to /var/cache/conftool/dbconfig/20240409-105650-arnaudb.json [10:58:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60044 and previous config saved to /var/cache/conftool/dbconfig/20240409-105814-root.json [10:58:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T360332)', diff saved to https://phabricator.wikimedia.org/P60045 and previous config saved to 
/var/cache/conftool/dbconfig/20240409-105827-arnaudb.json [10:58:31] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:58:49] (03CR) 10Majavah: "> Should we have a dummy entry in labs/private? Or do we add dummy entries only when puppet is failing without them?" [puppet] - 10https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [11:01:01] (03CR) 10Jcrespo: [C:03+2] mariadb: Reenable notifications for db2197 [puppet] - 10https://gerrit.wikimedia.org/r/1017779 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [11:01:03] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018230 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [11:01:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1015.eqiad.wmnet with OS bullseye [11:01:19] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9700341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot10... 
[11:02:19] (PS4) Majavah: alertmanager: Add support for per-instance HTTP proxy configuration [software/spicerack] - https://gerrit.wikimedia.org/r/1018230 (https://phabricator.wikimedia.org/T360932) [11:02:42] (CR) Majavah: [C:+2] alertmanager: Add support for per-instance HTTP proxy configuration (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1018230 (https://phabricator.wikimedia.org/T360932) (owner: Majavah) [11:04:22] (CR) FNegri: [C:+1] P:wmcs::metricsinfra::alertmanager: add basic auth support [puppet] - https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) (owner: Majavah) [11:05:36] SRE, Infrastructure-Foundations, vm-requests: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146 (BTullis) NEW [11:08:13] SRE, Infrastructure-Foundations, vm-requests, Data-Platform-SRE (2024.03.25 - 2024.04.14): Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9700375 (BTullis) a:BTullis I'll add the second disk after the initial creation by the cookbook. This will be useful to allow us t...
[11:10:01] (Merged) jenkins-bot: alertmanager: Add support for per-instance HTTP proxy configuration [software/spicerack] - https://gerrit.wikimedia.org/r/1018230 (https://phabricator.wikimedia.org/T360932) (owner: Majavah) [11:10:04] (PS1) Hnowlan: mw-jobrunner: use same request_terminate_timeout as metal [deployment-charts] - https://gerrit.wikimedia.org/r/1018231 (https://phabricator.wikimedia.org/T358308) [11:10:12] (CR) CI reject: [V:-1] mw-jobrunner: use same request_terminate_timeout as metal [deployment-charts] - https://gerrit.wikimedia.org/r/1018231 (https://phabricator.wikimedia.org/T358308) (owner: Hnowlan) [11:10:35] (PS2) Hnowlan: mw-jobrunner: use same request_terminate_timeout as metal [deployment-charts] - https://gerrit.wikimedia.org/r/1018231 (https://phabricator.wikimedia.org/T358308) [11:11:12] SRE, Infrastructure-Foundations, vm-requests, Data-Platform-SRE (2024.03.25 - 2024.04.14): Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9700395 (BTullis) The Ganeti cluster report looks like it's fairly evenly balanced at the moment. ` DRY-RUN: START - Cookbook sre.gane...
[11:11:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P60046 and previous config saved to /var/cache/conftool/dbconfig/20240409-111157-arnaudb.json [11:13:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P60047 and previous config saved to /var/cache/conftool/dbconfig/20240409-111334-arnaudb.json [11:15:42] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dumpsdata1004.eqiad.wmnet with OS bullseye [11:16:16] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9700413 (10MoritzMuehlenhoff) LGTM [11:16:36] (03PS1) 10Majavah: alertmanager: Add support for per-instance HTTP basic authentication [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018233 (https://phabricator.wikimedia.org/T360932) [11:17:04] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::alertmanager: add basic auth support [puppet] - 10https://gerrit.wikimedia.org/r/1017854 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [11:23:31] (03PS1) 10EoghanGaffney: gitlab: Switch rsync command in timer to run script [puppet] - 10https://gerrit.wikimedia.org/r/1018234 (https://phabricator.wikimedia.org/T361219) [11:23:39] (03PS1) 10AikoChou: ml-services: update RR ML/Wikidata's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018235 [11:24:00] (03CR) 10CI reject: [V:04-1] gitlab: Switch rsync command in timer to run script [puppet] - 10https://gerrit.wikimedia.org/r/1018234 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [11:24:53] (03PS2) 10EoghanGaffney: gitlab: Switch rsync command in timer to run script [puppet] - 10https://gerrit.wikimedia.org/r/1018234 (https://phabricator.wikimedia.org/T361219) [11:26:19] (03CR) 10Ilias Sarantopoulos: 
[C:03+1] ml-services: update RR ML/Wikidata's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018235 (owner: 10AikoChou) [11:27:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T360332)', diff saved to https://phabricator.wikimedia.org/P60048 and previous config saved to /var/cache/conftool/dbconfig/20240409-112705-arnaudb.json [11:27:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [11:27:09] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [11:27:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [11:27:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T360332)', diff saved to https://phabricator.wikimedia.org/P60049 and previous config saved to /var/cache/conftool/dbconfig/20240409-112728-arnaudb.json [11:27:46] !log btullis@cumin1002 START - Cookbook sre.ganeti.makevm for new host matomo1003.eqiad.wmnet [11:27:47] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [11:28:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P60050 and previous config saved to /var/cache/conftool/dbconfig/20240409-112841-arnaudb.json [11:29:42] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM matomo1003.eqiad.wmnet - btullis@cumin1002" [11:29:46] (03PS1) 10Majavah: hieradata: cloudcumin: Configure metricsinfra alertmanager instance [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) [11:30:01] (03PS1) 10Muehlenhoff: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1018237 (https://phabricator.wikimedia.org/T360636) [11:31:01] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T360332)', diff saved to https://phabricator.wikimedia.org/P60051 and previous config saved to /var/cache/conftool/dbconfig/20240409-113100-arnaudb.json [11:31:09] (03PS1) 10Muehlenhoff: Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1018238 (https://phabricator.wikimedia.org/T360636) [11:31:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1018237 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [11:31:53] (03CR) 10Muehlenhoff: [C:03+2] installserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1017279 (owner: 10Muehlenhoff) [11:32:25] (03PS2) 10Majavah: hieradata: cloudcumin: Configure metricsinfra alertmanager instance [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) [11:33:38] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1818/co" [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [11:36:22] (03CR) 10Clément Goubert: [C:03+1] mw-jobrunner: use same request_terminate_timeout as metal [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018231 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [11:37:02] (03CR) 10Clément Goubert: [C:03+1] Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1018237 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [11:37:22] (03CR) 10Clément Goubert: [C:03+1] Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1018238 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [11:38:08] (03CR) 10Clément Goubert: [C:03+1] Switch testreduce to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1018199 
(https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [11:39:07] (03CR) 10Hnowlan: [C:03+2] mw-jobrunner: use same request_terminate_timeout as metal [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018231 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [11:40:02] (03Merged) 10jenkins-bot: mw-jobrunner: use same request_terminate_timeout as metal [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018231 (https://phabricator.wikimedia.org/T358308) (owner: 10Hnowlan) [11:40:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: provisionning db2212.codfw.wmnet - T355422 [11:40:50] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [11:41:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: provisionning db2212.codfw.wmnet - T355422 [11:41:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: provisionning db2212.codfw.wmnet - T355422 [11:41:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: provisionning db2212.codfw.wmnet - T355422 [11:41:46] (03CR) 10Clément Goubert: [C:03+1] mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:42:10] !log hnowlan@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [11:42:10] !log hnowlan@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [11:42:52] !log hnowlan@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:43:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2112 in db2212 for T355422', diff saved to 
https://phabricator.wikimedia.org/P60053 and previous config saved to /var/cache/conftool/dbconfig/20240409-114302-arnaudb.json [11:43:07] (03CR) 10Ayounsi: [C:03+2] Add support for routed Ganeti in D-I early_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [11:43:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T360332)', diff saved to https://phabricator.wikimedia.org/P60054 and previous config saved to /var/cache/conftool/dbconfig/20240409-114349-arnaudb.json [11:43:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [11:43:52] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [11:43:55] !log hnowlan@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:44:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [11:44:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T360332)', diff saved to https://phabricator.wikimedia.org/P60055 and previous config saved to /var/cache/conftool/dbconfig/20240409-114411-arnaudb.json [11:44:37] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2112.codfw.wmnet onto db2212.codfw.wmnet [11:46:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P60056 and previous config saved to /var/cache/conftool/dbconfig/20240409-114607-arnaudb.json [11:47:51] !log hnowlan@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [11:47:52] !log hnowlan@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [11:48:44] !log hnowlan@deploy1002 helmfile [codfw] [canary] DONE 
helmfile.d/services/mw-jobrunner : sync [11:49:40] !log hnowlan@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:50:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T360332)', diff saved to https://phabricator.wikimedia.org/P60057 and previous config saved to /var/cache/conftool/dbconfig/20240409-115035-arnaudb.json [11:50:39] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [11:53:58] (03CR) 10AikoChou: [C:03+2] ml-services: update RR ML/Wikidata's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018235 (owner: 10AikoChou) [11:54:37] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM matomo1003.eqiad.wmnet - btullis@cumin1002" [11:54:37] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:54:37] !log btullis@cumin1002 START - Cookbook sre.dns.wipe-cache matomo1003.eqiad.wmnet on all recursors [11:54:40] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) matomo1003.eqiad.wmnet on all recursors [11:54:53] (03Merged) 10jenkins-bot: ml-services: update RR ML/Wikidata's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018235 (owner: 10AikoChou) [11:55:13] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM matomo1003.eqiad.wmnet - btullis@cumin1002" [11:55:22] (03CR) 10FNegri: hieradata: cloudcumin: Configure metricsinfra alertmanager instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [11:55:49] (03PS3) 10Majavah: hieradata: cloudcumin: Configure metricsinfra alertmanager instance [puppet] - 10https://gerrit.wikimedia.org/r/1018236 
(https://phabricator.wikimedia.org/T360932) [11:56:05] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM matomo1003.eqiad.wmnet - btullis@cumin1002" [11:56:15] (03CR) 10Majavah: hieradata: cloudcumin: Configure metricsinfra alertmanager instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [11:57:00] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1017853 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [11:57:03] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1819/co" [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [11:58:52] (03PS1) 10Jelto: gitlab_runner: temporary allow dockerfile frontend on gitlab-runner2004 [puppet] - 10https://gerrit.wikimedia.org/r/1018245 (https://phabricator.wikimedia.org/T357612) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T1200) [12:01:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P60058 and previous config saved to /var/cache/conftool/dbconfig/20240409-120115-arnaudb.json [12:02:07] (03CR) 10FNegri: [C:03+1] "LGTM, but I'd check with Volans before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:02:16] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1004.eqiad.wmnet with OS bullseye [12:02:40] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::haproxy: add proxy to alertmanager rw endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1017853 (https://phabricator.wikimedia.org/T362061) (owner: 10Majavah) [12:05:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P60059 and previous config saved to /var/cache/conftool/dbconfig/20240409-120542-arnaudb.json [12:08:33] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1018238 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [12:09:38] !log gmodena@deploy1002 Started deploy [analytics/refinery@d45a15b]: Regular analytics weekly train [analytics/refinery@d45a15b6] [12:11:50] (03CR) 10Muehlenhoff: [C:03+2] Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1018237 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [12:16:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T360332)', diff saved to https://phabricator.wikimedia.org/P60060 and previous config saved to /var/cache/conftool/dbconfig/20240409-121622-arnaudb.json [12:16:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [12:16:29] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:16:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [12:16:47] (03CR) 10Jelto: [C:03+2] gitlab_runner: temporary allow dockerfile frontend on gitlab-runner2004 [puppet] 
- 10https://gerrit.wikimedia.org/r/1018245 (https://phabricator.wikimedia.org/T357612) (owner: 10Jelto) [12:16:57] (03PS1) 10Jcrespo: mariadb: Migrate db2097 backups to db2197 [puppet] - 10https://gerrit.wikimedia.org/r/1018247 (https://phabricator.wikimedia.org/T360751) [12:17:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [12:17:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [12:17:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T360332)', diff saved to https://phabricator.wikimedia.org/P60061 and previous config saved to /var/cache/conftool/dbconfig/20240409-121722-arnaudb.json [12:17:43] (03PS1) 10Majavah: hieradata: alerting_host: add fake metricsinfra password [labs/private] - 10https://gerrit.wikimedia.org/r/1018248 (https://phabricator.wikimedia.org/T320973) [12:18:32] (03CR) 10Majavah: [V:03+2 C:03+2] hieradata: alerting_host: add fake metricsinfra password [labs/private] - 10https://gerrit.wikimedia.org/r/1018248 (https://phabricator.wikimedia.org/T320973) (owner: 10Majavah) [12:19:45] (03PS1) 10Majavah: alertmanager: karma: Add metricsinfra write credentials [puppet] - 10https://gerrit.wikimedia.org/r/1018252 (https://phabricator.wikimedia.org/T320973) [12:19:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T360332)', diff saved to https://phabricator.wikimedia.org/P60062 and previous config saved to /var/cache/conftool/dbconfig/20240409-121958-arnaudb.json [12:20:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P60063 and previous config saved to /var/cache/conftool/dbconfig/20240409-122050-arnaudb.json [12:21:53] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudbackup1001-dev.eqiad.wmnet [12:22:04] (03CR) 10Gmodena: 
[WIP] Add datasets-config helm chart and helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:22:17] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1821/co" [puppet] - 10https://gerrit.wikimedia.org/r/1018252 (https://phabricator.wikimedia.org/T320973) (owner: 10Majavah) [12:24:24] (03CR) 10Volans: [C:03+1] "LGTM, thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018233 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:24:32] (03CR) 10Majavah: [C:03+2] alertmanager: Add support for per-instance HTTP basic authentication [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018233 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:25:19] !log gmodena@deploy1002 Finished deploy [analytics/refinery@d45a15b]: Regular analytics weekly train [analytics/refinery@d45a15b6] (duration: 15m 41s) [12:25:33] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1018252 (https://phabricator.wikimedia.org/T320973) (owner: 10Majavah) [12:27:03] (03CR) 10Volans: "question inline, LGTM if that works :)" [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:27:46] (03PS1) 10Muehlenhoff: Switch cloudbackup1001-dev to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018254 (https://phabricator.wikimedia.org/T349619) [12:28:45] (03CR) 10Majavah: [V:03+1 C:03+2] alertmanager: karma: Add metricsinfra write credentials (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018252 (https://phabricator.wikimedia.org/T320973) (owner: 10Majavah) [12:29:15] (03PS1) 10Fabfur: prometheus: add calculated queries for benthos-haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) [12:31:29] 
(03Merged) 10jenkins-bot: alertmanager: Add support for per-instance HTTP basic authentication [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018233 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:31:50] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudbackup1001-dev to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018254 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:32:53] (03PS1) 10Slyngshede: IP blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1018256 (https://phabricator.wikimedia.org/T361066) [12:33:11] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9700649 (10MoritzMuehlenhoff) [12:33:12] (03PS1) 10Majavah: hieradata: cloudcumin: Add fake metricsinfra password [labs/private] - 10https://gerrit.wikimedia.org/r/1018257 (https://phabricator.wikimedia.org/T360932) [12:33:28] !log gmodena@deploy1002 Started deploy [analytics/refinery@d45a15b] (thin): Regular analytics weekly train THIN [analytics/refinery@d45a15b6] [12:33:48] (03PS4) 10Majavah: hieradata: cloudcumin: Configure metricsinfra alertmanager instance [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) [12:33:59] (03CR) 10Slyngshede: "I want to add more documentation, swagger interface or something like that, but let's first get the API correct." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1017244 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [12:35:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P60064 and previous config saved to /var/cache/conftool/dbconfig/20240409-123505-arnaudb.json [12:35:17] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1825/co" [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:35:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T360332)', diff saved to https://phabricator.wikimedia.org/P60065 and previous config saved to /var/cache/conftool/dbconfig/20240409-123558-arnaudb.json [12:36:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [12:36:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [12:36:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:36:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:36:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60066 and previous config saved to /var/cache/conftool/dbconfig/20240409-123628-arnaudb.json [12:36:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet [12:36:58] (03CR) 10Elukey: [C:03+2] profile::prometheus::analytics: remove old metric 
relabeling [puppet] - 10https://gerrit.wikimedia.org/r/1017882 (owner: 10Elukey) [12:36:58] !log gmodena@deploy1002 Finished deploy [analytics/refinery@d45a15b] (thin): Regular analytics weekly train THIN [analytics/refinery@d45a15b6] (duration: 03m 30s) [12:38:09] (03PS1) 10EoghanGaffney: gitlab: Switch gitlab-replica and gitlab-replica-old [dns] - 10https://gerrit.wikimedia.org/r/1018258 [12:38:18] (03CR) 10Elukey: [C:03+1] sessionstore configure TLS verification in staging for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017935 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [12:38:55] !log gmodena@deploy1002 Started deploy [analytics/refinery@d45a15b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d45a15b6] [12:39:18] (03PS1) 10EoghanGaffney: gitlab: Switch hierdata for gitlab-replica and gitlab-replica-old [puppet] - 10https://gerrit.wikimedia.org/r/1018259 [12:40:23] (03PS1) 10Jelto: Revert "gitlab_runner: temporary allow dockerfile frontend on gitlab-runner2004" [puppet] - 10https://gerrit.wikimedia.org/r/1018219 (https://phabricator.wikimedia.org/T357612) [12:41:02] (03CR) 10Majavah: [V:03+1] hieradata: cloudcumin: Configure metricsinfra alertmanager instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:41:09] btullis: could you add your new VM naming to https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Servers ? 
[12:41:37] !log gmodena@deploy1002 Finished deploy [analytics/refinery@d45a15b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d45a15b6] (duration: 02m 42s) [12:41:40] (03CR) 10Jelto: [C:03+2] Revert "gitlab_runner: temporary allow dockerfile frontend on gitlab-runner2004" [puppet] - 10https://gerrit.wikimedia.org/r/1018219 (https://phabricator.wikimedia.org/T357612) (owner: 10Jelto) [12:43:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::cinder_backups [12:44:32] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Deploy new Truststore - elukey@cumin1002 [12:45:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:45:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60067 and previous config saved to /var/cache/conftool/dbconfig/20240409-124550-arnaudb.json [12:45:53] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:46:10] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9700693 (10WDoranWMF) > The idea would be to get your service up and running as early as possible s... 
[12:46:11] (03PS1) 10Muehlenhoff: Switch wmcs::openstack::codfw1dev::cinder_backups to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018261 (https://phabricator.wikimedia.org/T349619) [12:47:00] (03CR) 10Muehlenhoff: [C:03+2] Switch wmcs::openstack::codfw1dev::cinder_backups to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018261 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:47:30] (03CR) 10DCausse: [C:03+1] search: Wait for young pool alert to fail for 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/1013575 (owner: 10Ebernhardson) [12:50:06] (03PS1) 10Filippo Giunchedi: thanos: parse more queries from thanos access logs [puppet] - 10https://gerrit.wikimedia.org/r/1018263 (https://phabricator.wikimedia.org/T356788) [12:50:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P60068 and previous config saved to /var/cache/conftool/dbconfig/20240409-125012-arnaudb.json [12:50:18] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:50:44] (03CR) 10CI reject: [V:04-1] thanos: parse more queries from thanos access logs [puppet] - 10https://gerrit.wikimedia.org/r/1018263 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [12:51:40] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:52:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::codfw1dev::cinder_backups [12:52:42] (03PS2) 10Filippo Giunchedi: thanos: parse more queries from thanos access logs [puppet] - 10https://gerrit.wikimedia.org/r/1018263 (https://phabricator.wikimedia.org/T356788) [12:53:11] (03CR) 10CI reject: [V:04-1] thanos: parse more queries from thanos access logs [puppet] - 10https://gerrit.wikimedia.org/r/1018263 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [12:53:20] !log uploaded golang-github-gopacket-gopacket_1.2.0-2~wmf1 to apt.wm.o (bookworm) [12:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:04] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9700736 (10MoritzMuehlenhoff) [12:54:05] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: cloudcumin: Configure metricsinfra alertmanager instance [puppet] - 10https://gerrit.wikimedia.org/r/1018236 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:54:26] (03PS3) 10Filippo Giunchedi: thanos: parse more queries from thanos access logs [puppet] - 10https://gerrit.wikimedia.org/r/1018263 (https://phabricator.wikimedia.org/T356788) [12:54:47] (03CR) 10Majavah: [V:03+2 C:03+2] hieradata: cloudcumin: Add fake metricsinfra password [labs/private] - 10https://gerrit.wikimedia.org/r/1018257 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [12:59:44] * James_F waves in advance. 
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T1300). [13:00:05] James_F and SD_hehua: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] I can deploy. [13:00:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017952 (https://phabricator.wikimedia.org/T359844) (owner: 10Jforrester) [13:00:56]           I can also. [13:00:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P60069 and previous config saved to /var/cache/conftool/dbconfig/20240409-130057-arnaudb.json [13:01:10] 10ops-codfw, 06SRE: 14Inbound interface errors - 14https://phabricator.wikimedia.org/T362126#9700749 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:01:14] (03Merged) 10jenkins-bot: ExtensionDistributor: Add REL1_42 as a beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017952 (https://phabricator.wikimedia.org/T359844) (owner: 10Jforrester) [13:01:52] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1017952|ExtensionDistributor: Add REL1_42 as a beta (T359844)]] [13:02:14] T359844: Add REL1_42 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T359844 [13:03:25] 06SRE: Phase out cergen for Fundraising services - https://phabricator.wikimedia.org/T360779#9700772 (10Jgreen) >>! In T360779#9688343, @elukey wrote: > Hi Jeff! > > After a chat with Moritz we agreed that the simplest solution would be to create the cert via puppet in production on some host (we need to figur... 
[13:04:25] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1017952|ExtensionDistributor: Add REL1_42 as a beta (T359844)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:04:55] !log jforrester@deploy1002 jforrester: Continuing with sync [13:05:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T360332)', diff saved to https://phabricator.wikimedia.org/P60070 and previous config saved to /var/cache/conftool/dbconfig/20240409-130520-arnaudb.json [13:05:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [13:05:35] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:05:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [13:05:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T360332)', diff saved to https://phabricator.wikimedia.org/P60071 and previous config saved to /var/cache/conftool/dbconfig/20240409-130543-arnaudb.json [13:07:54] (03PS1) 10Arnaudb: dotfiles: add mysql query for replag to bashrc [puppet] - 10https://gerrit.wikimedia.org/r/1018286 [13:08:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T360332)', diff saved to https://phabricator.wikimedia.org/P60072 and previous config saved to /var/cache/conftool/dbconfig/20240409-130812-arnaudb.json [13:11:25] (03PS1) 10Elukey: cassandra: move cqlshrc's template to wmf-ca-certificates.crt [puppet] - 10https://gerrit.wikimedia.org/r/1018267 (https://phabricator.wikimedia.org/T352647) [13:11:45] (03CR) 10Marostegui: dotfiles: add mysql query for replag to bashrc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018286 (owner: 10Arnaudb) [13:11:54] (03CR) 10CI reject: [V:04-1] cassandra: move cqlshrc's template to wmf-ca-certificates.crt [puppet] - 
10https://gerrit.wikimedia.org/r/1018267 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:12:41] (03PS2) 10Elukey: cassandra: move cqlshrc's template to wmf-ca-certificates.crt [puppet] - 10https://gerrit.wikimedia.org/r/1018267 (https://phabricator.wikimedia.org/T352647) [13:12:52] (03PS2) 10Arnaudb: dotfiles: add mysql query for replag to bashrc [puppet] - 10https://gerrit.wikimedia.org/r/1018286 [13:13:15] (03CR) 10Arnaudb: dotfiles: add mysql query for replag to bashrc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018286 (owner: 10Arnaudb) [13:14:01] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018267 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:14:18] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on stat1010.eqiad.wmnet with reason: Connecting GPU power cable [13:14:32] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on stat1010.eqiad.wmnet with reason: Connecting GPU power cable [13:15:04] I should have merged the other patch at the same time. Long gone are the 45 second deploy days. 
:-( [13:15:13] (03CR) 10Marostegui: [C:03+1] dotfiles: add mysql query for replag to bashrc [puppet] - 10https://gerrit.wikimedia.org/r/1018286 (owner: 10Arnaudb) [13:15:44] (03CR) 10Arnaudb: [C:03+2] dotfiles: add mysql query for replag to bashrc [puppet] - 10https://gerrit.wikimedia.org/r/1018286 (owner: 10Arnaudb) [13:16:02] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1017952|ExtensionDistributor: Add REL1_42 as a beta (T359844)]] (duration: 14m 09s) [13:16:04] !log depool cp4052 for firmware upgrade [13:16:05] T359844: Add REL1_42 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T359844 [13:16:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P60073 and previous config saved to /var/cache/conftool/dbconfig/20240409-131605-arnaudb.json [13:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:16] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4052.ulsfo.wmnet [13:16:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [13:16:42] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [software/bitu] - 10https://gerrit.wikimedia.org/r/1017244 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [13:17:02] (03Merged) 10jenkins-bot: zhwiki:Add centralauth-createlocal to ipblock exempt granter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [13:17:35] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1015187|zhwiki:Add centralauth-createlocal to ipblock exempt granter (T361184)]] [13:17:38] T361184: Add "centralauth-createlocal" right for ipblock exempt granter group on zhwiki - 
https://phabricator.wikimedia.org/T361184 [13:17:39] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: BIOS firmware upgrade [13:17:52] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: BIOS firmware upgrade [13:18:02] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4052.ulsfo.wmnet [13:18:02] (03PS1) 10Fabfur: benthos: better metric naming [puppet] - 10https://gerrit.wikimedia.org/r/1018268 (https://phabricator.wikimedia.org/T361845) [13:20:05] !log jforrester@deploy1002 sdhehua and jforrester: Backport for [[gerrit:1015187|zhwiki:Add centralauth-createlocal to ipblock exempt granter (T361184)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:20:13] (03PS1) 10Btullis: Add puppet7 data for new host matomo1003. [puppet] - 10https://gerrit.wikimedia.org/r/1018270 (https://phabricator.wikimedia.org/T349397) [13:20:48] seems ok [13:21:12] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [13:21:17] Yup, shows up on IP封禁豁免權授予者 on debug. [13:21:18] !log jforrester@deploy1002 sdhehua and jforrester: Continuing with sync [13:21:22] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9700846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host mato... [13:21:22] (03CR) 10Btullis: [C:03+2] Add puppet7 data for new host matomo1003. 
[puppet] - 10https://gerrit.wikimedia.org/r/1018270 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [13:22:52] (03PS5) 10Clément Goubert: docker_registry_ha: Migrate to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1018251 (https://phabricator.wikimedia.org/T360636) [13:23:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P60074 and previous config saved to /var/cache/conftool/dbconfig/20240409-132320-arnaudb.json [13:24:59] (03CR) 10Jcrespo: "I wonder what's the rule that adds you automatically as reviewers, and if it is really useful for you to keep it?" [puppet] - 10https://gerrit.wikimedia.org/r/1018247 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [13:25:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dumpsdata1004.eqiad.wmnet with OS bullseye [13:26:11] (03CR) 10Fabfur: [C:04-1] "This needs to be reworked with new metric names" [puppet] - 10https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [13:29:20] 06SRE, 06Traffic, 06Wikimedia Enterprise: 14Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - 14https://phabricator.wikimedia.org/T280628#9700886 (10JArguello-WMF) 05Open→03Resolved a:03JArguello-WMF [13:29:41] OK, all done bar the PHP restarts. 
[13:31:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T360332)', diff saved to https://phabricator.wikimedia.org/P60075 and previous config saved to /var/cache/conftool/dbconfig/20240409-133112-arnaudb.json [13:31:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [13:31:17] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:31:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [13:31:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T360332)', diff saved to https://phabricator.wikimedia.org/P60076 and previous config saved to /var/cache/conftool/dbconfig/20240409-133135-arnaudb.json [13:32:07] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1015187|zhwiki:Add centralauth-createlocal to ipblock exempt granter (T361184)]] (duration: 14m 31s) [13:32:13] T361184: Add "centralauth-createlocal" right for ipblock exempt granter group on zhwiki - https://phabricator.wikimedia.org/T361184 [13:33:09] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4052.ulsfo.wmnet [13:33:11] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4052.ulsfo.wmnet [13:35:20] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [13:38:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P60077 and previous config saved to /var/cache/conftool/dbconfig/20240409-133827-arnaudb.json [13:38:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T360332)', diff saved to https://phabricator.wikimedia.org/P60078 and previous config saved to 
/var/cache/conftool/dbconfig/20240409-133852-arnaudb.json [13:38:53] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1004.eqiad.wmnet with reason: host reimage [13:38:55] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:40:01] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9700913 (10Jclark-ctr) [13:42:15] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1004.eqiad.wmnet with reason: host reimage [13:42:16] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye [13:42:33] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [13:43:40] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9700926 (10ssingh) `cp4052` BIOS version `1.9.2` also didn't work; no PXE boot. I am going to focus on the install server now and see if... [13:44:51] (03CR) 10Eevans: [C:03+1] cassandra: move cqlshrc's template to wmf-ca-certificates.crt [puppet] - 10https://gerrit.wikimedia.org/r/1018267 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:47:01] 06SRE: Phase out cergen for Fundraising services - https://phabricator.wikimedia.org/T360779#9700940 (10MoritzMuehlenhoff) >>! In T360779#9700772, @Jgreen wrote: > Sounds fine to me. I looked at the puppet code and if I understand correctly, cfssl::cert will automatically generate a new certificate 10 (default)... 
[13:47:16] (03PS2) 10Fabfur: benthos: better metric naming [puppet] - 10https://gerrit.wikimedia.org/r/1018268 (https://phabricator.wikimedia.org/T361845) [13:47:18] (03CR) 10Elukey: [V:03+1 C:03+2] cassandra: move cqlshrc's template to wmf-ca-certificates.crt [puppet] - 10https://gerrit.wikimedia.org/r/1018267 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:48:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 26.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:48:18] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy bert model on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018274 (https://phabricator.wikimedia.org/T357986) [13:49:53] (03CR) 10Filippo Giunchedi: [C:03+1] benthos: better metric naming [puppet] - 10https://gerrit.wikimedia.org/r/1018268 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [13:50:30] 10ops-eqiad, 06SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9700951 (10Jclark-ctr) @Eevans Hey looks like same drive as T354499 is failed again let me know if i can replace it again [13:51:38] (03CR) 10Elukey: [C:03+1] ml-services: deploy bert model on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018274 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [13:53:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T360332)', diff saved to https://phabricator.wikimedia.org/P60079 and previous config saved to /var/cache/conftool/dbconfig/20240409-135335-arnaudb.json [13:53:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [13:53:39] T360332: Make the cupe_actor 
column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:53:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [13:53:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P60080 and previous config saved to /var/cache/conftool/dbconfig/20240409-135359-arnaudb.json [13:54:02] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:54:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1246.eqiad.wmnet with reason: Maintenance [13:54:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1246.eqiad.wmnet with reason: Maintenance [13:54:24] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:54:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:54:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:55:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:55:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:56:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [13:56:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [13:56:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2107 (T360332)', diff saved to https://phabricator.wikimedia.org/P60081 and previous config saved to 
/var/cache/conftool/dbconfig/20240409-135621-arnaudb.json [13:57:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2112.codfw.wmnet onto db2212.codfw.wmnet [13:57:51] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dumpsdata1004.eqiad.wmnet with OS bullseye [13:58:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 28.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:59:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T360332)', diff saved to https://phabricator.wikimedia.org/P60082 and previous config saved to /var/cache/conftool/dbconfig/20240409-135902-arnaudb.json [13:59:09] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:59:35] (03PS1) 10Jcrespo: mariadb: Migrate db2098 backups to db2198 and upgrade dbprov2002 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1018276 (https://phabricator.wikimedia.org/T360751) [13:59:41] (03CR) 10Muehlenhoff: "Looks good in general, one comment/suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/1018251 (https://phabricator.wikimedia.org/T360636) (owner: 10Clément Goubert) [14:00:07] (03CR) 10Jcrespo: [C:04-1] "Blocked by 10.6 upgrade of s7 and s8." 
[puppet] - 10https://gerrit.wikimedia.org/r/1018276 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [14:01:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:02:59] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:03:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:04:28] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [14:06:41] (03CR) 10Clément Goubert: [V:03+1] docker_registry_ha: Migrate to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018251 (https://phabricator.wikimedia.org/T360636) (owner: 10Clément Goubert) [14:06:45] (03PS6) 10Clément Goubert: docker_registry_ha: Migrate to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1018251 (https://phabricator.wikimedia.org/T360636) [14:06:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:07:07] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dumpsdata1005.eqiad.wmnet with OS bullseye [14:07:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [14:08:30] (03CR) 10Clément Goubert: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1830/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1018251 (https://phabricator.wikimedia.org/T360636) (owner: 10Clément Goubert) [14:09:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P60083 and previous config saved to /var/cache/conftool/dbconfig/20240409-140906-arnaudb.json [14:10:37] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1018251 (https://phabricator.wikimedia.org/T360636) (owner: 10Clément Goubert) [14:14:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P60084 and previous config saved to /var/cache/conftool/dbconfig/20240409-141410-arnaudb.json [14:16:26] (03CR) 10Jelto: "on typo in line, otherwise it looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1018234 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [14:16:56] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1018258 (owner: 10EoghanGaffney) [14:17:13] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1018259 (owner: 10EoghanGaffney) [14:18:11] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Deploy new Truststore - elukey@cumin1002 [14:19:40] (03PS3) 10EoghanGaffney: gitlab: Switch rsync command in timer to run script [puppet] - 10https://gerrit.wikimedia.org/r/1018234 (https://phabricator.wikimedia.org/T361219) [14:20:09] (03CR) 10Muehlenhoff: "Looks good in general, one question inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1018256 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [14:20:24] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1018234 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [14:23:53] (03CR) 10Herron: [C:03+1] thanos: parse more queries from thanos 
access logs [puppet] - 10https://gerrit.wikimedia.org/r/1018263 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [14:24:04] (03CR) 10EoghanGaffney: [C:03+2] gitlab: Switch rsync command in timer to run script [puppet] - 10https://gerrit.wikimedia.org/r/1018234 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [14:24:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T360332)', diff saved to https://phabricator.wikimedia.org/P60085 and previous config saved to /var/cache/conftool/dbconfig/20240409-142414-arnaudb.json [14:24:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [14:24:18] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:24:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [14:25:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1017047 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [14:28:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye [14:29:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P60086 and previous config saved to /var/cache/conftool/dbconfig/20240409-142916-arnaudb.json [14:29:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:29:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:30:39] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4052.ulsfo.wmnet [14:30:41] !log sukhe@cumin1002 END (ERROR) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp4052.ulsfo.wmnet [14:31:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 16.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:31:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:31:36] 06SRE, 06Commons, 06Data-Persistence (work done), 10MediaWiki-extensions-WikibaseClient, and 6 others: [C-DIS][SW] Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730#9701099 (10Lydia_Pintscher) Removing it from WD dev team board as this will need to be handled by... [14:33:07] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host matomo1003.eqiad.wmnet with OS bookworm [14:33:07] !log btullis@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host matomo1003.eqiad.wmnet [14:33:19] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9701133 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host matomo10... 
[14:33:37] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4052.ulsfo.wmnet [14:34:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [14:34:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [14:34:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2105 (T360332)', diff saved to https://phabricator.wikimedia.org/P60087 and previous config saved to /var/cache/conftool/dbconfig/20240409-143445-arnaudb.json [14:34:57] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:35:32] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4052.ulsfo.wmnet [14:38:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:35] (03CR) 10Fabfur: [C:03+2] benthos: better metric naming [puppet] - 10https://gerrit.wikimedia.org/r/1018268 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [14:44:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T360332)', diff saved to https://phabricator.wikimedia.org/P60088 and previous config saved to /var/cache/conftool/dbconfig/20240409-144424-arnaudb.json [14:44:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [14:44:35] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:44:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance 
[14:44:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T360332)', diff saved to https://phabricator.wikimedia.org/P60089 and previous config saved to /var/cache/conftool/dbconfig/20240409-144447-arnaudb.json [14:45:29] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9701194 (10Manuel) 05Stalled→03Open [14:45:52] (03PS1) 10Fabfur: Revert "benthos: better metric naming" [puppet] - 10https://gerrit.wikimedia.org/r/1018221 [14:46:13] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission wdqs1025.eqiad.wmnet - https://phabricator.wikimedia.org/T362122#9701196 (10bking) a:05Papaul→03None [14:46:33] (03PS5) 10Hnowlan: shellbox: add PHP + Apache timeout settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [14:47:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T360332)', diff saved to https://phabricator.wikimedia.org/P60090 and previous config saved to /var/cache/conftool/dbconfig/20240409-144735-arnaudb.json [14:49:11] mw-api-int is getting clobbered [14:49:32] (03CR) 10Fabfur: [C:03+2] Revert "benthos: better metric naming" [puppet] - 10https://gerrit.wikimedia.org/r/1018221 (owner: 10Fabfur) [14:49:48] (03PS7) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [14:49:48] (03PS1) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1018282 [14:49:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T360332)', diff saved to https://phabricator.wikimedia.org/P60091 and previous config saved to /var/cache/conftool/dbconfig/20240409-144950-arnaudb.json [14:49:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:53:35] Looks like a lot of transcludes 
resource changes from changeprop [14:58:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T1500). [15:01:16] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-ulsfo and not P{cp[4037,4041,4045,4049,4052].ulsfo.wmnet} and A:cp [15:01:25] jouncebot nowandnext [15:01:26] For the next 0 hour(s) and 58 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T1500) [15:01:26] In 0 hour(s) and 58 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T1600) [15:02:07] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [15:02:24] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: deploy bert model on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018274 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [15:02:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P60092 and previous config saved to /var/cache/conftool/dbconfig/20240409-150242-arnaudb.json [15:03:16] (03Merged) 10jenkins-bot: ml-services: deploy bert model on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018274 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [15:04:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2105', diff saved to https://phabricator.wikimedia.org/P60093 and previous config saved to /var/cache/conftool/dbconfig/20240409-150458-arnaudb.json [15:06:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:06:30] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:06:41] (03PS1) 10Ayounsi: Netbox: test test test test test test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018283 [15:09:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 23.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:12:01] (03PS2) 10Ayounsi: Netbox: add missing tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018283 [15:12:43] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you! And thanks for adding unit tests." 
[puppet] - 10https://gerrit.wikimedia.org/r/1018263 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [15:13:12] (03PS1) 10Hnowlan: mw-api-int: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018284 [15:16:00] (03CR) 10JMeybohm: [C:03+1] mw-api-int: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018284 (owner: 10Hnowlan) [15:16:44] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1003.wikimedia.org to gitlab1004.wikimedia.org [15:17:20] (03CR) 10Hnowlan: [C:03+2] mw-api-int: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018284 (owner: 10Hnowlan) [15:17:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P60094 and previous config saved to /var/cache/conftool/dbconfig/20240409-151750-arnaudb.json [15:18:13] (03Merged) 10jenkins-bot: mw-api-int: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018284 (owner: 10Hnowlan) [15:20:02] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9701473 (10andrea.denisse) @MoritzMuehlenhoff @fgiunchedi Thank you both, I'll make sure to remove it. 
[15:20:03] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the addition" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018283 (owner: 10Ayounsi) [15:20:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P60095 and previous config saved to /var/cache/conftool/dbconfig/20240409-152005-arnaudb.json [15:20:12] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:20:24] (03PS2) 10Mmartorana: Implementing security.txt standard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) [15:20:32] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:22:16] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:22:40] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9701496 (10BTullis) [15:22:40] (03PS3) 10Ayounsi: Netbox: add missing tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018283 [15:22:45] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [15:23:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:04] (03PS1) 10Btullis: Add a role for matomo1003 [puppet] - 10https://gerrit.wikimedia.org/r/1018307 (https://phabricator.wikimedia.org/T349397) [15:24:34] (03CR) 10CI reject: [V:04-1] Add a role for matomo1003 [puppet] - 10https://gerrit.wikimedia.org/r/1018307 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [15:24:46] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for 
host dumpsdata1005.eqiad.wmnet with OS bullseye [15:25:06] (03PS2) 10Btullis: Add a role for matomo1003 [puppet] - 10https://gerrit.wikimedia.org/r/1018307 (https://phabricator.wikimedia.org/T349397) [15:26:52] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [15:29:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.27% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:30:16] (03CR) 10Btullis: [C:03+2] Add a role for matomo1003 [puppet] - 10https://gerrit.wikimedia.org/r/1018307 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [15:31:14] (03CR) 10Ayounsi: [C:03+2] Netbox: add missing tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018283 (owner: 10Ayounsi) [15:31:57] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-ulsfo and not P{cp[4037,4041,4045,4049,4052].ulsfo.wmnet} and A:cp [15:32:33] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [15:32:47] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9701558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host mato... 
[15:32:57] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [15:32:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T360332)', diff saved to https://phabricator.wikimedia.org/P60096 and previous config saved to /var/cache/conftool/dbconfig/20240409-153257-arnaudb.json [15:33:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:33:02] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:33:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:33:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:33:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:33:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T360332)', diff saved to https://phabricator.wikimedia.org/P60097 and previous config saved to /var/cache/conftool/dbconfig/20240409-153315-arnaudb.json [15:35:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T360332)', diff saved to https://phabricator.wikimedia.org/P60098 and previous config saved to /var/cache/conftool/dbconfig/20240409-153512-arnaudb.json [15:35:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [15:35:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [15:35:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T360332)', diff saved to https://phabricator.wikimedia.org/P60099 and 
previous config saved to /var/cache/conftool/dbconfig/20240409-153557-arnaudb.json [15:36:33] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: parse more queries from thanos access logs [puppet] - 10https://gerrit.wikimedia.org/r/1018263 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [15:39:10] !log installing python2.7 security updates [15:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:38] (03PS1) 10Fabfur: benthos: better metric naming, again [puppet] - 10https://gerrit.wikimedia.org/r/1018308 (https://phabricator.wikimedia.org/T361845) [15:40:59] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dumpsdata1005.eqiad.wmnet with OS bullseye [15:41:14] (03PS1) 10Elukey: Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) [15:42:43] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage [15:44:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:45:44] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage [15:47:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 21.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:48:30] !log arnaudb@cumin1002 START - 
Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:48:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:51:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P60100 and previous config saved to /var/cache/conftool/dbconfig/20240409-155104-arnaudb.json [15:52:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 22.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:52:37] (03CR) 10Muehlenhoff: [C:03+2] Remove LDAP access for dmantena [puppet] - 10https://gerrit.wikimedia.org/r/1018310 (owner: 10Muehlenhoff) [15:52:45] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 25.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:54:13] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1005.eqiad.wmnet with reason: host reimage [15:54:20] (03Merged) 10jenkins-bot: Netbox: add missing tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1018283 (owner: 10Ayounsi) [15:54:33] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye [15:56:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1005.eqiad.wmnet with reason: host reimage [15:58:00] (PHPFPMTooBusy) resolved: Not enough 
idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 22.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:00:05] jhathaway: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T1600). [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:19] o/ [16:00:59] o/ [16:01:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host matomo1003.eqiad.wmnet with OS bookworm [16:01:21] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#9701724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host matomo10... 
[16:02:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:02:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:02:22] dancy: merging in [16:02:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T360332)', diff saved to https://phabricator.wikimedia.org/P60101 and previous config saved to /var/cache/conftool/dbconfig/20240409-160225-arnaudb.json [16:02:27] thx [16:02:34] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:02:37] (03CR) 10JHathaway: [C:03+2] mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl [puppet] - 10https://gerrit.wikimedia.org/r/1013148 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [16:03:56] dancy: merged [16:04:18] I'll test it out in 10 minutes after puppet has run in all the places. 
[16:05:25] although running puppet manually on mwdebug1001.eqiad.wmnet might be sufficient for testing [16:06:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P60102 and previous config saved to /var/cache/conftool/dbconfig/20240409-160612-arnaudb.json [16:06:22] !log pool cp4052 after reimaging and new NIC firmware [16:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:49] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=(cdn|ats-be) [16:08:16] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: 14Site: eqiad 1 VM for Matomo - 14https://phabricator.wikimedia.org/T362146#9701744 (10BTullis) 05Open→03Resolved [16:09:04] (03CR) 10Ssingh: [C:03+1] benthos: better metric naming, again [puppet] - 10https://gerrit.wikimedia.org/r/1018308 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [16:09:41] (03CR) 10Fabfur: [C:03+2] benthos: better metric naming, again [puppet] - 10https://gerrit.wikimedia.org/r/1018308 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [16:13:31] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dumpsdata1005.eqiad.wmnet with OS bullseye [16:13:36] !log Deleting unused webperf TLS certificates - T360414 [16:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:45] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [16:15:26] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: 14Site: eqiad, codfw 2 VM request for postfix mx-out - 14https://phabricator.wikimedia.org/T361750#9701764 (10jhathaway) 05Open→03Resolved [16:16:23] (03PS2) 10Elukey: Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1018309 
(https://phabricator.wikimedia.org/T352647) [16:16:23] (03PS1) 10Elukey: cassandra::instance: fix PKI keystore for each instance [puppet] - 10https://gerrit.wikimedia.org/r/1018311 (https://phabricator.wikimedia.org/T352647) [16:16:25] (SystemdUnitFailed) firing: (2) httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:29] (03CR) 10JHathaway: [C:03+1] puppetdb::microservice: Use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1017769 (owner: 10Muehlenhoff) [16:16:31] (03PS1) 10Majavah: Drop grafana/graphite-labs [dns] - 10https://gerrit.wikimedia.org/r/1018312 [16:16:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T360332)', diff saved to https://phabricator.wikimedia.org/P60103 and previous config saved to /var/cache/conftool/dbconfig/20240409-161632-arnaudb.json [16:16:47] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:16:47] !log depool cp1113 for PXE boot issue related testing T350179 [16:16:48] (03PS1) 10Fabfur: benthos: temporary disable haproxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/1018313 (https://phabricator.wikimedia.org/T361845) [16:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:56] (03PS2) 10Majavah: Drop grafana/graphite-labs [dns] - 10https://gerrit.wikimedia.org/r/1018312 [16:17:03] T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 [16:17:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:17:42] (03CR) 10Jforrester: Drop grafana/graphite-labs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1018312 (owner: 10Majavah) [16:17:42] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:19:14] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1113.eqiad.wmnet,service=(cdn|ats-be) [16:19:50] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1113.eqiad.wmnet with reason: NIC firmware upgrade and reimage [16:20:03] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1113.eqiad.wmnet with reason: NIC firmware upgrade and reimage [16:20:22] (03CR) 10Fabfur: [C:03+2] benthos: temporary disable haproxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/1018313 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [16:20:26] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018311 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:20:44] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1113.eqiad.wmnet [16:21:13] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp1113.eqiad.wmnet [16:21:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P60104 and previous config saved to /var/cache/conftool/dbconfig/20240409-162119-arnaudb.json [16:21:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:21:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:21:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T360332)', diff saved to https://phabricator.wikimedia.org/P60105 and previous config saved to /var/cache/conftool/dbconfig/20240409-162142-arnaudb.json [16:22:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:22:45] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 18.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:24:20] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS bullseye [16:24:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T360332)', diff saved to https://phabricator.wikimedia.org/P60106 and previous config saved to /var/cache/conftool/dbconfig/20240409-162423-arnaudb.json [16:24:29] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9701810 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
was started by sukhe@cumin1002 for host cp1113.eqiad.wmnet with OS b... [16:24:43] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:26:46] (03CR) 10Andrea Denisse: [C:03+2] ssl: delete webperf.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/1017945 (https://phabricator.wikimedia.org/T360414) (owner: 10Dzahn) [16:28:04] (03PS3) 10Majavah: Remove names for old cloudmetrics redirects [dns] - 10https://gerrit.wikimedia.org/r/1018312 [16:31:30] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9701827 (10andrea.denisse) [16:31:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P60107 and previous config saved to /var/cache/conftool/dbconfig/20240409-163140-arnaudb.json [16:31:55] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9701828 (10andrea.denisse) [16:37:53] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1113.eqiad.wmnet with OS bullseye [16:38:02] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9701838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1113.eqiad.wmnet with OS bulls... 
[16:38:14] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS bullseye [16:38:23] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9701839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1113.eqiad.wmnet with OS b... [16:39:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P60108 and previous config saved to /var/cache/conftool/dbconfig/20240409-163931-arnaudb.json [16:40:34] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1835/console" [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [16:46:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P60109 and previous config saved to /var/cache/conftool/dbconfig/20240409-164647-arnaudb.json [16:49:58] (03PS22) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [16:52:45] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 25.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:54:29] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [16:54:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P60110 and previous config saved to 
/var/cache/conftool/dbconfig/20240409-165438-arnaudb.json [16:56:51] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [16:57:45] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:58:36] (03PS23) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [16:58:36] (03PS1) 10Andrew Bogott: cloud: switch (almost) everything to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018316 (https://phabricator.wikimedia.org/T351450) [16:58:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1018316 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [16:59:19] (03CR) 10CI reject: [V:04-1] cloud: switch (almost) everything to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018316 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T1700) [17:01:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T360332)', diff saved to https://phabricator.wikimedia.org/P60111 and previous config saved to /var/cache/conftool/dbconfig/20240409-170155-arnaudb.json [17:01:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:01:59] (03PS1) 10Phuedx: ext-EventLogging: Add mediawiki.product_metrics.wikifunctions_ui to $wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 [17:02:00] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:02:12] !log arnaudb@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:02:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:02:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:02:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T360332)', diff saved to https://phabricator.wikimedia.org/P60112 and previous config saved to /var/cache/conftool/dbconfig/20240409-170234-arnaudb.json [17:02:45] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.74% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:05:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:05:23] (03PS2) 10Andrew Bogott: cloud: switch (almost) everything to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018316 (https://phabricator.wikimedia.org/T351450) [17:07:56] dancy: the ontology patch broke httpbb [17:08:21] hmm. Error message? [17:08:31] Ah gimme a sec [17:08:46] 404 on mwdebug1001 [17:09:05] Good on mw-on-k8s [17:09:14] puppet not run everywhere yet maybe? [17:09:20] huh.. 
I tested against mwdebug1001 [17:09:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T360332)', diff saved to https://phabricator.wikimedia.org/P60113 and previous config saved to /var/cache/conftool/dbconfig/20240409-170946-arnaudb.json [17:09:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [17:09:51] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:09:56] But the test results seem to have changed since then. [17:09:58] poking at it... [17:10:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [17:10:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T360332)', diff saved to https://phabricator.wikimedia.org/P60114 and previous config saved to /var/cache/conftool/dbconfig/20240409-171009-arnaudb.json [17:10:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.78% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:11:25] (SystemdUnitFailed) firing: (3) httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:12:04] (03PS1) 10Andrea Denisse: wmcs: Remove redundant grafana-labs.discovery.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/1018318 (https://phabricator.wikimedia.org/T360414) [17:12:30] (03PS1) 10Ahmon Dancy: Revert "mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl" [puppet] - 10https://gerrit.wikimedia.org/r/1018222 [17:12:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P60115 and previous config saved to /var/cache/conftool/dbconfig/20240409-171251-arnaudb.json [17:12:52] dancy: once it works it also needs to be changed in deployment-charts, charts/mediawiki/templates/lamp/_site_helpers.tpl [17:13:03] (so it works for mw-on-k8s as well) [17:13:20] (03CR) 10Clément Goubert: [C:03+2] Revert "mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl" [puppet] - 10https://gerrit.wikimedia.org/r/1018222 (owner: 10Ahmon Dancy) [17:13:52] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: 14Grant Access to wmf for AndyRussG - 14https://phabricator.wikimedia.org/T361665#9701898 (10AndyRussG) 14Thanks so much, @MoritzMuehlenhoff , @RLazarus! :) :) [17:15:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1113.eqiad.wmnet with OS bullseye [17:15:48] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9701905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1113.eqiad.wmnet with OS bulls... [17:16:03] dancy: merged, puppet run on mwdebug1001 done, XWD check good [17:16:07] claime: Do you see any apparent mistakes in how I configured things? 
[17:16:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:16:21] (03PS1) 10Andrea Denisse: wmcs: Remove redundant grafana-labs.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1018320 (https://phabricator.wikimedia.org/T360414) [17:16:40] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1113.eqiad.wmnet,service=(cdn|ats-be) [17:17:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T360332)', diff saved to https://phabricator.wikimedia.org/P60116 and previous config saved to /var/cache/conftool/dbconfig/20240409-171719-arnaudb.json [17:17:23] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:18:28] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T361968#9701920 (10VRiley-WMF) Thank you! I will be closing this ticket. 
[17:18:37] 10ops-eqiad, 06SRE, 06DBA, 13Patch-For-Review: 14db1246 crashed - 14https://phabricator.wikimedia.org/T361968#9701921 (10VRiley-WMF) 05Open→03Resolved [17:19:01] (03CR) 10SBassett: [C:03+1] "+1 mainly for the structure here and the first few targets to link to the standard security.txt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [17:19:07] dancy: Hmm it should be rewriting it correctly but it may be getting caught by another rule before it gets rewritten [17:19:23] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9701922 (10VRiley-WMF) a:03VRiley-WMF [17:19:26] You want the PT because if you don't PT, you don't get /w/docs/ rewrite applied [17:21:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 29.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:23:16] dancy: what is strange is that the rewriterule causes a 404 for the rewritten url [17:23:26] "Handle/Status": "-/404", [17:23:28] "ResponseSize": "1269", [17:23:30] "Method": "GET", [17:23:32] "Url": "http://www.mediawiki.org/w/docs/ontology.owl", [17:25:19] Is it possible to generate the Apache configs from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013148/4/hieradata/common/mediawiki.yaml ? 
[17:25:45] !log Delete unused grafana-labs TLS certificates - T360414 [17:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:48] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [17:26:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:26:23] (03CR) 10Jforrester: [C:03+1] "Oops." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018317 (owner: 10Phuedx) [17:27:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P60117 and previous config saved to /var/cache/conftool/dbconfig/20240409-172758-arnaudb.json [17:28:02] dancy: They are readable on mwdebug [17:30:05] (03PS3) 10Mmartorana: Implementing security.txt standard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) [17:30:24] claime: Proposal, I revive https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013148, you merge and run the puppet agent on mwdebug1001, I copy /etc/apache2/sites-available/ for examination, you revert the change again and run-puppet-agent. 
[17:31:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:32:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P60118 and previous config saved to /var/cache/conftool/dbconfig/20240409-173226-arnaudb.json [17:32:36] dancy: no need, you can check the puppet run here https://puppetboard.wikimedia.org/report/mwdebug1001.eqiad.wmnet/d9f74302e2d7511e75840f37d29c4b8a7d0f334d [17:32:49] Well that's the revert puppet run [17:33:05] `Service access denied due to missing privileges.` when I try to log into that page [17:33:10] Ooh [17:33:18] Sorry I thought you'd have access [17:35:26] I'll check something gimme a sec [17:37:03] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9701952 (10VRiley-WMF) [17:41:14] (03CR) 10Dzahn: [V:03+1 C:03+1] wmcs: Remove redundant grafana-labs.discovery.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/1018318 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:42:16] (03CR) 10Dzahn: [V:03+1 C:03+1] "We checked. All the names on this are either not in DNS anymore or, as the 2 remaining ones, are redirects at the appserver layer. 
So noth" [puppet] - 10https://gerrit.wikimedia.org/r/1018320 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:43:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P60119 and previous config saved to /var/cache/conftool/dbconfig/20240409-174306-arnaudb.json [17:44:51] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9701963 (10VRiley-WMF) We currently have all these parts needed for this upgrade. @fgiunchedi do you have an estimated time for these upgrades to... [17:46:31] (03PS1) 10Dzahn: test symlink [puppet] - 10https://gerrit.wikimedia.org/r/1018324 [17:47:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P60120 and previous config saved to /var/cache/conftool/dbconfig/20240409-174734-arnaudb.json [17:47:43] (03Abandoned) 10Dzahn: test symlink [puppet] - 10https://gerrit.wikimedia.org/r/1018324 (owner: 10Dzahn) [17:49:05] (03CR) 10EoghanGaffney: [C:03+2] gitlab: Switch hierdata for gitlab-replica and gitlab-replica-old [puppet] - 10https://gerrit.wikimedia.org/r/1018259 (owner: 10EoghanGaffney) [17:58:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T360332)', diff saved to https://phabricator.wikimedia.org/P60121 and previous config saved to /var/cache/conftool/dbconfig/20240409-175813-arnaudb.json [17:58:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [17:58:21] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:58:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [17:58:38] !log arnaudb@cumin1002 
dbctl commit (dc=all): 'Depooling db2175 (T360332)', diff saved to https://phabricator.wikimedia.org/P60122 and previous config saved to /var/cache/conftool/dbconfig/20240409-175837-arnaudb.json [18:01:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T360332)', diff saved to https://phabricator.wikimedia.org/P60123 and previous config saved to /var/cache/conftool/dbconfig/20240409-180117-arnaudb.json [18:02:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T360332)', diff saved to https://phabricator.wikimedia.org/P60124 and previous config saved to /var/cache/conftool/dbconfig/20240409-180242-arnaudb.json [18:02:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:02:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:03:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T360332)', diff saved to https://phabricator.wikimedia.org/P60125 and previous config saved to /var/cache/conftool/dbconfig/20240409-180306-arnaudb.json [18:03:36] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:05:35] (03PS5) 10Dwisehaupt: Add discovery records for community-crm [dns] - 10https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995) [18:06:25] (SystemdUnitFailed) firing: (3) httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:20] (03PS6) 10Dwisehaupt: community-crm: Add dyna and discovery records [dns] - 10https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995) [18:11:25] (SystemdUnitFailed) firing: (3) 
httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T356166)', diff saved to https://phabricator.wikimedia.org/P60126 and previous config saved to /var/cache/conftool/dbconfig/20240409-181213-marostegui.json [18:12:17] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:16:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P60127 and previous config saved to /var/cache/conftool/dbconfig/20240409-181625-arnaudb.json [18:17:43] (03PS24) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [18:17:43] (03PS3) 10Andrew Bogott: cloud: switch (almost) everything to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018316 (https://phabricator.wikimedia.org/T351450) [18:19:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T360332)', diff saved to https://phabricator.wikimedia.org/P60128 and previous config saved to /var/cache/conftool/dbconfig/20240409-181914-arnaudb.json [18:19:15] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1012765 (owner: 10Andrew Bogott) [18:19:25] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:21:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:23:54] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9702025 (10ssingh) Some updates with the TL;DR that it is still failing for hosts in eqiad and ulsfo: I was talking with @ayounsi and h... [18:26:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:27:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P60129 and previous config saved to /var/cache/conftool/dbconfig/20240409-182721-marostegui.json [18:28:36] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9702027 (10ssingh) One more thing I will try to do is to successively try all NIC firmwares in `22.x` instead of picking the highest sup... 
[18:31:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P60130 and previous config saved to /var/cache/conftool/dbconfig/20240409-183132-arnaudb.json [18:33:11] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9702043 (10andrea.denisse) [18:34:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P60131 and previous config saved to /var/cache/conftool/dbconfig/20240409-183421-arnaudb.json [18:38:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:39:13] (03PS25) 10Andrew Bogott: Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 [18:39:13] (03PS4) 10Andrew Bogott: cloud: switch (almost) everything to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018316 (https://phabricator.wikimedia.org/T351450) [18:39:51] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1012765 (owner: 10Andrew Bogott) [18:41:04] (03CR) 10Andrea Denisse: [C:03+2] wmcs: Remove redundant grafana-labs.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1018320 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:41:15] !log deploying airflow dag to fix mediawiki_history_metrics_monthly dag [18:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:39] (03CR) 10Andrea Denisse: [C:03+2] wmcs: Remove redundant 
grafana-labs.discovery.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/1018318 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:41:41] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] wmcs: Remove redundant grafana-labs.discovery.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/1018318 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:42:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P60132 and previous config saved to /var/cache/conftool/dbconfig/20240409-184228-marostegui.json [18:42:35] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9702056 (10herron) Hey @VRiley-WMF, I'll help out with this one for the o11y side. Schedule-wise would this Thursday (4/11) sometime after 11a Ea... [18:42:36] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1018316 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [18:43:20] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@875e0d2]: (no justification provided) [18:43:46] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@875e0d2]: (no justification provided) (duration: 00m 26s) [18:44:18] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9702057 (10andrea.denisse) [18:44:35] (03CR) 10Andrew Bogott: [C:03+2] Remove profile::puppetserver::enable_ca from hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1012765 (owner: 10Andrew Bogott) [18:45:28] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9702058 (10andrea.denisse) [18:45:51] (03CR) 
10Andrew Bogott: [C:03+2] cloud: switch (almost) everything to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1018316 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [18:46:12] (03CR) 10EoghanGaffney: [C:03+2] gitlab: Switch gitlab-replica and gitlab-replica-old [dns] - 10https://gerrit.wikimedia.org/r/1018258 (owner: 10EoghanGaffney) [18:46:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T360332)', diff saved to https://phabricator.wikimedia.org/P60133 and previous config saved to /var/cache/conftool/dbconfig/20240409-184640-arnaudb.json [18:46:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [18:46:44] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:46:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [18:47:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T360332)', diff saved to https://phabricator.wikimedia.org/P60134 and previous config saved to /var/cache/conftool/dbconfig/20240409-184702-arnaudb.json [18:48:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:49:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P60135 and previous config saved to /var/cache/conftool/dbconfig/20240409-184929-arnaudb.json [18:49:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P60136 and previous config saved to /var/cache/conftool/dbconfig/20240409-184944-arnaudb.json [18:51:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:54:25] !log eoghan@cumin1002 START - Cookbook sre.dns.wipe-cache 'https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/' on all recursors [18:54:28] !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/' on all recursors [18:54:33] (03CR) 10Andrew Bogott: [C:03+2] wmcs puppetservers: stop pulling hiera from /etc/puppet/secrets [puppet] - 10https://gerrit.wikimedia.org/r/1015392 (owner: 10Andrew Bogott) [18:56:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:56:54] !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab1003.wikimedia.org to gitlab1004.wikimedia.org [18:57:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T356166)', diff saved to https://phabricator.wikimedia.org/P60137 and previous config saved to /var/cache/conftool/dbconfig/20240409-185736-marostegui.json [18:57:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [18:57:50] T356166: 
Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:57:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [18:57:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:57:55] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1018312 (owner: 10Majavah) [18:58:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:58:16] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9702093 (10SToyofuku-WMF) @MoritzMuehlenhoff apologies, I thought `analytics-privatedata-users` was the specific level, but I can see I was looking at the wrong table.... [18:58:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T356166)', diff saved to https://phabricator.wikimedia.org/P60138 and previous config saved to /var/cache/conftool/dbconfig/20240409-185817-marostegui.json [18:58:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:59:58] (03PS1) 10Muehlenhoff: Remove now obsolete redirects [puppet] - 10https://gerrit.wikimedia.org/r/1018330 [19:00:58] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9702101 (10MoritzMuehlenhoff) >>! 
In T362113#9702093, @SToyofuku-WMF wrote: > @MoritzMuehlenhoff apologies, I thought `analytics-privatedata-users` was the specific leve... [19:03:03] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9702105 (10SToyofuku-WMF) Thank you! [19:04:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:04:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T360332)', diff saved to https://phabricator.wikimedia.org/P60139 and previous config saved to /var/cache/conftool/dbconfig/20240409-190436-arnaudb.json [19:04:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance [19:04:51] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:04:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance [19:04:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P60140 and previous config saved to /var/cache/conftool/dbconfig/20240409-190452-arnaudb.json [19:05:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T360332)', diff saved to https://phabricator.wikimedia.org/P60141 and previous config saved to /var/cache/conftool/dbconfig/20240409-190459-arnaudb.json [19:09:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.15% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:11:19] (03PS2) 10CDanis: jaeger ui: two week lookback [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017314 [19:20:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P60142 and previous config saved to /var/cache/conftool/dbconfig/20240409-192000-arnaudb.json [19:20:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T360332)', diff saved to https://phabricator.wikimedia.org/P60143 and previous config saved to /var/cache/conftool/dbconfig/20240409-192010-arnaudb.json [19:20:16] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:31:53] (03PS1) 10Ahmon Dancy: mediawiki: Route /w/docs/ to /w/static.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018332 (https://phabricator.wikimedia.org/T171807) [19:34:09] (03PS1) 10JJMC89: extension.json: add pagetriage-copyvio right to the highvolume grant [extensions/PageTriage] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1018223 (https://phabricator.wikimedia.org/T362188) [19:34:23] (03PS1) 10JJMC89: extension.json: add pagetriage-copyvio right to the highvolume grant [extensions/PageTriage] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018224 (https://phabricator.wikimedia.org/T362188) [19:35:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T360332)', diff saved to https://phabricator.wikimedia.org/P60145 and previous config saved to /var/cache/conftool/dbconfig/20240409-193507-arnaudb.json [19:35:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [19:35:12] !log 
arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [19:35:14] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:35:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P60146 and previous config saved to /var/cache/conftool/dbconfig/20240409-193517-arnaudb.json [19:35:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2207.codfw.wmnet with reason: Maintenance [19:36:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2207.codfw.wmnet with reason: Maintenance [19:36:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60147 and previous config saved to /var/cache/conftool/dbconfig/20240409-193611-arnaudb.json [19:38:45] (03CR) 10Dzahn: "checked if those are symlinks and looks like they are, so that part seems right to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [19:38:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60148 and previous config saved to /var/cache/conftool/dbconfig/20240409-193848-arnaudb.json [19:42:32] (03PS1) 10Ahmon Dancy: mediawiki.yaml: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) [19:45:37] (03PS2) 10Ahmon Dancy: Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) [19:48:13] (03PS2) 10Ahmon Dancy: mediawiki: Route /w/docs/ to /w/static.php [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1018332 (https://phabricator.wikimedia.org/T171807) [19:48:20] (03CR) 10CI reject: [V:04-1] Serve mw.org/ontology/ontology.owl via /w/docs/ontology.owl (v2) [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [19:50:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P60149 and previous config saved to /var/cache/conftool/dbconfig/20240409-195024-arnaudb.json [19:53:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P60150 and previous config saved to /var/cache/conftool/dbconfig/20240409-195355-arnaudb.json [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240409T2000) [20:00:05] musikanimal: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] (03PS1) 10CDanis: jaeger: add idp2003 as egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018336 [20:00:22] I'm here :) [20:01:26] (03CR) 10CDanis: [C:03+2] jaeger: add idp2003 as egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018336 (owner: 10CDanis) [20:02:17] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:02:53] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:05:03] anyone around to do the deploys? 
[20:05:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T360332)', diff saved to https://phabricator.wikimedia.org/P60151 and previous config saved to /var/cache/conftool/dbconfig/20240409-200533-arnaudb.json [20:05:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance [20:05:37] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [20:05:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance [20:05:51] hi - i can deploy [20:05:54] 1 sec [20:05:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T360332)', diff saved to https://phabricator.wikimedia.org/P60152 and previous config saved to /var/cache/conftool/dbconfig/20240409-200556-arnaudb.json [20:07:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1018223 (https://phabricator.wikimedia.org/T362188) (owner: 10JJMC89) [20:09:02] (03CR) 10Clare Ming: [C:03+2] extension.json: add pagetriage-copyvio right to the highvolume grant [extensions/PageTriage] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018224 (https://phabricator.wikimedia.org/T362188) (owner: 10JJMC89) [20:09:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P60153 and previous config saved to /var/cache/conftool/dbconfig/20240409-200903-arnaudb.json [20:16:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:10] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2194 (T360332)', diff saved to https://phabricator.wikimedia.org/P60154 and previous config saved to /var/cache/conftool/dbconfig/20240409-202110-arnaudb.json [20:21:14] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [20:24:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60155 and previous config saved to /var/cache/conftool/dbconfig/20240409-202410-arnaudb.json [20:26:26] (03Abandoned) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1018282 (owner: 10CDobbins) [20:27:27] (03PS1) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1018340 [20:27:35] (03Merged) 10jenkins-bot: extension.json: add pagetriage-copyvio right to the highvolume grant [extensions/PageTriage] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1018223 (https://phabricator.wikimedia.org/T362188) (owner: 10JJMC89) [20:28:03] (03Abandoned) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1018340 (owner: 10CDobbins) [20:28:06] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1018223|extension.json: add pagetriage-copyvio right to the highvolume grant (T362188)]] [20:28:09] T362188: pagetriage-copyvio should be included in a grant - https://phabricator.wikimedia.org/T362188 [20:29:20] (03Merged) 10jenkins-bot: extension.json: add pagetriage-copyvio right to the highvolume grant [extensions/PageTriage] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018224 (https://phabricator.wikimedia.org/T362188) (owner: 10JJMC89) [20:30:44] !log cjming@deploy1002 jjmc89 and cjming: Backport for [[gerrit:1018223|extension.json: add pagetriage-copyvio right to the highvolume grant (T362188)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:30:46] musikanimal: can 
your patch be tested? wmf25 on test servers [20:30:58] doing [20:31:44] (03CR) 10Ahmon Dancy: [C:04-1] "holding" [puppet] - 10https://gerrit.wikimedia.org/r/1018335 (https://phabricator.wikimedia.org/T171807) (owner: 10Ahmon Dancy) [20:32:13] cjming: looks good! [20:32:17] cool syncing [20:32:20] !log cjming@deploy1002 jjmc89 and cjming: Continuing with sync [20:33:05] and sorry - i should have scap'd them together -- i'll do the wmf26 one here as soon as this one finishes [20:33:21] no worries [20:33:31] (03PS8) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [20:34:00] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [20:36:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P60156 and previous config saved to /var/cache/conftool/dbconfig/20240409-203617-arnaudb.json [20:36:52] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9702405 (10andrea.denisse) [20:37:47] (03PS1) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1018343 [20:39:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:40:31] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:40:56] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9702412 (10andrea.denisse) [20:44:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:44:22] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1018223|extension.json: add pagetriage-copyvio right to the highvolume grant (T362188)]] (duration: 16m 16s) [20:44:26] T362188: pagetriage-copyvio should be included in a grant - https://phabricator.wikimedia.org/T362188 [20:44:49] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:45:05] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1018224|extension.json: add pagetriage-copyvio right to the highvolume grant (T362188)]] [20:45:31] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:42] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:47:15] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9702416 (10andrea.denisse) [20:47:34] !log cjming@deploy1002 jjmc89 and cjming: Backport for [[gerrit:1018224|extension.json: add pagetriage-copyvio right to the highvolume grant 
(T362188)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:40] musikanimal: should I sync wmf26? [20:48:19] yes I think so, because otherwise the commit will be lost when wmf26 hits right? [20:48:31] !log cjming@deploy1002 jjmc89 and cjming: Continuing with sync [20:48:34] we only need it on group2 at the moment [20:51:14] or I guess it doesn't need to be synced to wmf26, because the commit is in the branch, only it isn't live where wmf26 is live already (and in this case that's okay) [20:51:18] sorry I didn't think that through [20:51:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P60157 and previous config saved to /var/cache/conftool/dbconfig/20240409-205125-arnaudb.json [20:51:34] but syncing won't hurt anything either :) [20:52:15] i went ahead and synced -- since group 0 is at 26, might as well [20:53:06] oh got it - oh well - no worries [20:53:39] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9702450 (10andrea.denisse) [21:00:33] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1018224|extension.json: add pagetriage-copyvio right to the highvolume grant (T362188)]] (duration: 15m 27s) [21:00:54] musikanimal: both of your patches should be live! [21:00:55] T362188: pagetriage-copyvio should be included in a grant - https://phabricator.wikimedia.org/T362188 [21:01:11] cjming: thank you!! [21:01:25] yw! 
[21:01:29] !log end of UTC late backport window [21:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T360332)', diff saved to https://phabricator.wikimedia.org/P60158 and previous config saved to /var/cache/conftool/dbconfig/20240409-210633-arnaudb.json [21:06:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [21:06:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [21:06:50] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [21:06:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T360332)', diff saved to https://phabricator.wikimedia.org/P60159 and previous config saved to /var/cache/conftool/dbconfig/20240409-210656-arnaudb.json [21:10:04] 06SRE, 10Observability-Logging: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998#9702468 (10colewhite) [21:11:25] 06SRE, 10Observability-Logging: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998#9702470 (10colewhite) Current SSO work being investigated in T337818 [21:16:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:22:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T360332)', diff saved to https://phabricator.wikimedia.org/P60160 and previous config saved to /var/cache/conftool/dbconfig/20240409-212210-arnaudb.json [21:22:15] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [21:22:25] 06SRE, 
10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9702496 (10colewhite) [21:37:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P60161 and previous config saved to /var/cache/conftool/dbconfig/20240409-213717-arnaudb.json [21:52:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P60162 and previous config saved to /var/cache/conftool/dbconfig/20240409-215225-arnaudb.json [22:07:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T360332)', diff saved to https://phabricator.wikimedia.org/P60163 and previous config saved to /var/cache/conftool/dbconfig/20240409-220732-arnaudb.json [22:07:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [22:07:37] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [22:07:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [22:07:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T360332)', diff saved to https://phabricator.wikimedia.org/P60164 and previous config saved to /var/cache/conftool/dbconfig/20240409-220755-arnaudb.json [22:16:27] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9702570 (10andrea.denisse) [22:17:06] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9702571 (10andrea.denisse) I've documented the migration process on Wikitech: 
https://wikitech.wikimedia.org/wiki/Cergen#Migrating_to_CFS... [22:23:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T360332)', diff saved to https://phabricator.wikimedia.org/P60165 and previous config saved to /var/cache/conftool/dbconfig/20240409-222306-arnaudb.json [22:23:10] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [22:38:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P60166 and previous config saved to /var/cache/conftool/dbconfig/20240409-223813-arnaudb.json [22:53:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P60167 and previous config saved to /var/cache/conftool/dbconfig/20240409-225321-arnaudb.json [23:08:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T360332)', diff saved to https://phabricator.wikimedia.org/P60168 and previous config saved to /var/cache/conftool/dbconfig/20240409-230828-arnaudb.json [23:08:32] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332