[00:01:41] (03PS1) 10Tim Starling: WMCS: Fix type of ipb_range_start and ipb_range_end in the b/c view [puppet] - 10https://gerrit.wikimedia.org/r/1016892 (https://phabricator.wikimedia.org/T355034) [00:02:48] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1016376 (owner: 10TrainBranchBot) [00:07:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:39] (03CR) 10Urbanecm: [C:04-1] stewards: let puppet create /srv/exports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [00:24:20] (03CR) 10Samwilson: [C:03+1] WMCS: Fix type of ipb_range_start and ipb_range_end in the b/c view [puppet] - 10https://gerrit.wikimedia.org/r/1016892 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [00:25:34] (03CR) 10Tim Starling: [C:03+2] WMCS: Fix type of ipb_range_start and ipb_range_end in the b/c view [puppet] - 10https://gerrit.wikimedia.org/r/1016892 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [00:30:37] (03PS3) 10Scott French: Add support for an optional ignored-keys pattern [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1008944 (https://phabricator.wikimedia.org/T358636) [00:34:15] (03PS1) 10Eevans: (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [00:35:15] (03PS2) 10Eevans: (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [00:35:41] (ProbeDown) firing: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:38:20] (03CR) 10CI reject: [V:04-1] (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [00:40:41] (ProbeDown) resolved: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:41:47] a bit late since the resolve came in but I think this is a repeat of https://phabricator.wikimedia.org/T358936 [00:42:17] https://puppetboard.wikimedia.org/report/kubemaster2002.codfw.wmnet/4482322a5e7749a4be0cf944ad70568b070491b1 matches [00:42:23] and so does the service restart time [00:42:26] I will update the task [00:44:39] ah, that makes sense - I saw what looked like leader elections correlated with the bursts of probe failure. thanks for spotting that! [00:49:12] thanks, updated task! [00:49:46] 06SRE, 10Prod-Kubernetes, 06serviceops: Kubernetes apiserver probe failures on restart - https://phabricator.wikimedia.org/T358936#9686689 (10ssingh) This happened today as well, at 00:35 UTC, when we were paged for this: ` 00:35:41 <+jinxer-wm> (ProbeDown) firing: Service kubemaster2001:6443 has failed pr... 
[00:56:02] !log on clouddb1021 ran maintain-views for all databases [00:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:41] !log on clouddb1020 running maintain-views --all-databases --replace-all --auto-depool (T355034) [01:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:44] T355034: Deploy new block_target schema - https://phabricator.wikimedia.org/T355034 [01:12:17] 06SRE, 10SRE-Access-Requests: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9686707 (10RLazarus) [01:12:37] 06SRE, 10SRE-Access-Requests: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9686710 (10RLazarus) p:05Triage→03Medium a:03RLazarus [01:16:13] (03PS3) 10Eevans: (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [01:16:43] (03CR) 10CI reject: [V:04-1] (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [01:20:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:24:08] (03PS4) 10Eevans: (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [01:24:36] (03CR) 10CI reject: [V:04-1] (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [01:26:52] (03PS5) 10Eevans: (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 
10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [01:31:03] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [01:45:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:17:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 841.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:22:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.047s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:38:42] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:45:45] (SwiftTooManyMediaUploads) firing: (2) Too 
many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:48:57] !log ran maintain-views on clouddb1013-1019 (T355034) [02:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:02] T355034: Deploy new block_target schema - https://phabricator.wikimedia.org/T355034 [02:58:42] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:29:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.341s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:39:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 874.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:41:57] (03PS1) 10Tim Starling: WMCS: Add --quiet option to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1016912 [03:44:53] (03CR) 10CI reject: [V:04-1] WMCS: Add --quiet option to maintain-views 
[puppet] - 10https://gerrit.wikimedia.org/r/1016912 (owner: 10Tim Starling) [04:07:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:11:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 812.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:16:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 812.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:42:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 857.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:47:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 871.7ms - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:01:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 980.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:06:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 810ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:16:40] (03PS1) 10Marostegui: db2126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1016916 [05:17:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [05:17:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [05:17:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T356166)', diff saved to https://phabricator.wikimedia.org/P59406 and previous config saved to /var/cache/conftool/dbconfig/20240404-051728-marostegui.json [05:17:33] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [05:17:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2126 T361543', diff saved to 
https://phabricator.wikimedia.org/P59407 and previous config saved to /var/cache/conftool/dbconfig/20240404-051758-root.json [05:18:01] T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543 [05:18:32] (03CR) 10Marostegui: [C:03+2] db2126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1016916 (owner: 10Marostegui) [05:19:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2126.codfw.wmnet with OS bookworm [05:19:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T356166)', diff saved to https://phabricator.wikimedia.org/P59408 and previous config saved to /var/cache/conftool/dbconfig/20240404-051938-marostegui.json [05:23:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [05:23:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [05:23:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:23:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:23:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T355609)', diff saved to https://phabricator.wikimedia.org/P59409 and previous config saved to /var/cache/conftool/dbconfig/20240404-052338-marostegui.json [05:23:41] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [05:28:34] (03PS1) 10Marostegui: Revert "db2126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1016867 [05:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - 
https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:34:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P59410 and previous config saved to /var/cache/conftool/dbconfig/20240404-053446-marostegui.json [05:36:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2126.codfw.wmnet with reason: host reimage [05:39:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2126.codfw.wmnet with reason: host reimage [05:49:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P59411 and previous config saved to /var/cache/conftool/dbconfig/20240404-054953-marostegui.json [05:58:27] (03CR) 10Marostegui: [C:03+2] Revert "db2126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1016867 (owner: 10Marostegui) [05:58:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59412 and previous config saved to /var/cache/conftool/dbconfig/20240404-055854-root.json [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T0600). 
[06:00:50] (03PS1) 10Marostegui: installserver: Do not format es2038 [puppet] - 10https://gerrit.wikimedia.org/r/1016918 [06:01:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2126.codfw.wmnet with OS bookworm [06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T356166)', diff saved to https://phabricator.wikimedia.org/P59413 and previous config saved to /var/cache/conftool/dbconfig/20240404-060501-marostegui.json [06:05:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [06:05:05] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [06:05:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [06:05:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T356166)', diff saved to https://phabricator.wikimedia.org/P59414 and previous config saved to /var/cache/conftool/dbconfig/20240404-060524-marostegui.json [06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:12:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T355609)', diff saved to 
https://phabricator.wikimedia.org/P59415 and previous config saved to /var/cache/conftool/dbconfig/20240404-061234-marostegui.json [06:12:39] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:14:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59416 and previous config saved to /var/cache/conftool/dbconfig/20240404-061400-root.json [06:20:01] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es2038 [puppet] - 10https://gerrit.wikimedia.org/r/1016918 (owner: 10Marostegui) [06:27:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P59417 and previous config saved to /var/cache/conftool/dbconfig/20240404-062743-marostegui.json [06:29:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59418 and previous config saved to /var/cache/conftool/dbconfig/20240404-062905-root.json [06:42:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P59419 and previous config saved to /var/cache/conftool/dbconfig/20240404-064250-marostegui.json [06:43:40] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster1002 from puppetdb ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1016716 (https://phabricator.wikimedia.org/T357093) (owner: 10Muehlenhoff) [06:44:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59420 and previous config saved to /var/cache/conftool/dbconfig/20240404-064411-root.json [06:46:09] (03PS4) 10Cyndywikime: Add account_conversion event streams. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 [06:47:32] (03PS3) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [06:50:49] (03CR) 10Ryan Kemper: "@Volans how's this implementation look? we're just using the put settings api directly instead of using curator." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [06:53:53] (03CR) 10CI reject: [V:04-1] elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [06:57:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T355609)', diff saved to https://phabricator.wikimedia.org/P59421 and previous config saved to /var/cache/conftool/dbconfig/20240404-065758-marostegui.json [06:58:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:58:02] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:58:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:59:02] !log installing util-linux security updates [06:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59422 and previous config saved to /var/cache/conftool/dbconfig/20240404-065917-root.json [07:00:04] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T0700). 
[07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:14:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59423 and previous config saved to /var/cache/conftool/dbconfig/20240404-071423-root.json [07:19:05] (03PS1) 10Brouberol: wikimedia.org: provision public mpic subdomain [dns] - 10https://gerrit.wikimedia.org/r/1016926 (https://phabricator.wikimedia.org/T361338) [07:19:06] (03PS1) 10Brouberol: mpic: provision private service records [dns] - 10https://gerrit.wikimedia.org/r/1016927 (https://phabricator.wikimedia.org/T361339) [07:25:47] (03PS1) 10Brouberol: deployment_server: provisiom mpic(-next) view/deploy users [puppet] - 10https://gerrit.wikimedia.org/r/1016929 (https://phabricator.wikimedia.org/T361336) [07:28:45] (03PS1) 10Marostegui: db2104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1016930 [07:29:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59424 and previous config saved to /var/cache/conftool/dbconfig/20240404-072928-root.json [07:31:47] (03CR) 10Marostegui: [C:03+2] db2104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1016930 (owner: 10Marostegui) [07:34:05] (03PS1) 10Marostegui: instances.yaml: Remove db2104 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1016931 (https://phabricator.wikimedia.org/T361779) [07:34:49] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2104 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1016931 (https://phabricator.wikimedia.org/T361779) (owner: 10Marostegui) [07:35:18] (03PS1) 10Brouberol: mpic: provision staging and production namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016932 (https://phabricator.wikimedia.org/T361337) [07:36:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2104 from dbctl T361779', diff saved to 
https://phabricator.wikimedia.org/P59425 and previous config saved to /var/cache/conftool/dbconfig/20240404-073600-root.json [07:36:04] T361779: decommission db2104.codfw.wmnet - https://phabricator.wikimedia.org/T361779 [07:37:19] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1016377 (https://phabricator.wikimedia.org/T361780) [07:39:02] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1015392 (owner: 10Andrew Bogott) [07:40:53] (03PS1) 10Brouberol: trafficserver: add redirection config for mpic(-test) [puppet] - 10https://gerrit.wikimedia.org/r/1017005 (https://phabricator.wikimedia.org/T361340) [07:42:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [07:43:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [07:43:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T355609)', diff saved to https://phabricator.wikimedia.org/P59426 and previous config saved to /var/cache/conftool/dbconfig/20240404-074313-marostegui.json [07:43:16] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:45:13] (03PS2) 10Brouberol: trafficserver: add redirection config for mpic(-test) [puppet] - 10https://gerrit.wikimedia.org/r/1017005 (https://phabricator.wikimedia.org/T361340) [07:45:13] (03PS1) 10Brouberol: idp: add mpic(_next) clients [puppet] - 10https://gerrit.wikimedia.org/r/1017014 (https://phabricator.wikimedia.org/T361341) [07:48:01] (03CR) 10Muehlenhoff: [C:03+1] Decommission an-coord100[12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [07:51:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1016301 (https://phabricator.wikimedia.org/T337818) (owner: 
10Filippo Giunchedi) [07:56:13] (03CR) 10Gehel: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1016926 (https://phabricator.wikimedia.org/T361338) (owner: 10Brouberol) [07:56:37] (03CR) 10Gehel: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1016927 (https://phabricator.wikimedia.org/T361339) (owner: 10Brouberol) [07:57:48] (03PS2) 10Brouberol: deployment_server: provisiom mpic(-next) view/deploy users [puppet] - 10https://gerrit.wikimedia.org/r/1016929 (https://phabricator.wikimedia.org/T361336) [07:57:48] (03PS3) 10Brouberol: trafficserver: add redirection config for mpic(-test) [puppet] - 10https://gerrit.wikimedia.org/r/1017005 (https://phabricator.wikimedia.org/T361340) [07:57:48] (03PS2) 10Brouberol: idp: add mpic(_next) clients [puppet] - 10https://gerrit.wikimedia.org/r/1017014 (https://phabricator.wikimedia.org/T361341) [07:58:01] (03CR) 10Santiago Faci: [C:03+1] "Looks good!" [dns] - 10https://gerrit.wikimedia.org/r/1016926 (https://phabricator.wikimedia.org/T361338) (owner: 10Brouberol) [08:00:05] jnuche and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T0800). 
[08:00:20] hi, rolling out train in a few minutes [08:01:21] (03CR) 10Santiago Faci: [C:03+1] deployment_server: provisiom mpic(-next) view/deploy users [puppet] - 10https://gerrit.wikimedia.org/r/1016929 (https://phabricator.wikimedia.org/T361336) (owner: 10Brouberol) [08:01:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 29 hosts with reason: Primary switchover s2 T361682 [08:01:56] T361682: Switchover s2 master (db2107 -> db2204) - https://phabricator.wikimedia.org/T361682 [08:02:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 29 hosts with reason: Primary switchover s2 T361682 [08:03:07] (03CR) 10Santiago Faci: [C:03+1] mpic: provision staging and production namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016932 (https://phabricator.wikimedia.org/T361337) (owner: 10Brouberol) [08:04:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 0 T361682', diff saved to https://phabricator.wikimedia.org/P59427 and previous config saved to /var/cache/conftool/dbconfig/20240404-080408-arnaudb.json [08:04:31] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017015 (https://phabricator.wikimedia.org/T360157) [08:04:32] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.42.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017015 (https://phabricator.wikimedia.org/T360157) (owner: 10TrainBranchBot) [08:05:15] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017015 (https://phabricator.wikimedia.org/T360157) (owner: 10TrainBranchBot) [08:07:03] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9687143 (10MoritzMuehlenhoff) LGTM (we probably don't need as much CPU capacity, but also fine to overcommit a little, we can easily 
adjust later) [08:07:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T356166)', diff saved to https://phabricator.wikimedia.org/P59429 and previous config saved to /var/cache/conftool/dbconfig/20240404-081312-marostegui.json [08:13:38] (03CR) 10Volans: [C:04-1] "The approach looks ok, including the dry-run handling." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [08:17:50] (03CR) 10Volans: [C:03+1] "LGTM, thanks" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762) (owner: 10Scott French) [08:18:02] (03PS1) 10Fabfur: benthos: add two new hosts (upload and text) [puppet] - 10https://gerrit.wikimedia.org/r/1017018 (https://phabricator.wikimedia.org/T358109) [08:20:07] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1016372 (https://phabricator.wikimedia.org/T361682) (owner: 10Gerrit maintenance bot) [08:20:17] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.25 refs T360157 [08:21:28] !log Starting s2 codfw failover from db2107 to db2204 - T361682 [08:21:41] (03CR) 10Fabfur: [C:03+2] benthos: add two new hosts (upload and text) [puppet] - 10https://gerrit.wikimedia.org/r/1017018 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [08:22:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2204 to s2 primary T361682', diff saved to https://phabricator.wikimedia.org/P59430 and previous config saved to /var/cache/conftool/dbconfig/20240404-082200-arnaudb.json [08:22:04] T361682: Switchover s2 master 
(db2107 -> db2204) - https://phabricator.wikimedia.org/T361682 [08:22:29] (03CR) 10Brouberol: [C:03+2] mpic: provision staging and production namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016932 (https://phabricator.wikimedia.org/T361337) (owner: 10Brouberol) [08:23:49] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:24:12] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:24:54] (03CR) 10Brouberol: [C:03+2] wikimedia.org: provision public mpic subdomain [dns] - 10https://gerrit.wikimedia.org/r/1016926 (https://phabricator.wikimedia.org/T361338) (owner: 10Brouberol) [08:25:03] (03CR) 10Brouberol: [C:03+2] mpic: provision private service records [dns] - 10https://gerrit.wikimedia.org/r/1016927 (https://phabricator.wikimedia.org/T361339) (owner: 10Brouberol) [08:25:47] 07Puppet, 10ORES, 07git-lfs: 14Require git-lfs in ORES hosts - 14https://phabricator.wikimedia.org/T232494#9687224 (10hashar) [08:25:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'bump db2107 weight', diff saved to https://phabricator.wikimedia.org/P59431 and previous config saved to /var/cache/conftool/dbconfig/20240404-082547-root.json [08:26:36] 06SRE, 10Gerrit, 07git-lfs: 14Initial backup run for Gerrit LFS data - 14https://phabricator.wikimedia.org/T254162#9687236 (10hashar) [08:26:42] 06SRE, 06Research, 07git-lfs: 14Add Git LFS support for research/wikiworkshop - 14https://phabricator.wikimedia.org/T252956#9687237 (10hashar) [08:26:53] 06SRE, 10Gerrit, 06Release-Engineering-Team, 07git-lfs: 14Automatic pickup of Gerrit clone master doesn't happen due to missing git-lfs – new deployment env - 14https://phabricator.wikimedia.org/T235677#9687239 (10hashar) [08:27:23] 10SRE-swift-storage, 10Phabricator, 07git-lfs, 10Release-Engineering-Team (Seen): 14Connect Phabricator to swift for storage of git-lfs and file uploads. 
- 14https://phabricator.wikimedia.org/T182085#9687244 (10hashar) [08:27:31] 06SRE, 06Machine-Learning-Team, 10ORES, 10Scap, 07git-lfs: 14scap support for git-lfs - 14https://phabricator.wikimedia.org/T181855#9687245 (10hashar) [08:27:47] 06SRE, 10Gerrit, 10ORES, 07git-lfs, 13Patch-For-Review: 14Plan migration of ORES repos to git-lfs - 14https://phabricator.wikimedia.org/T181678#9687247 (10hashar) [08:28:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P59432 and previous config saved to /var/cache/conftool/dbconfig/20240404-082819-marostegui.json [08:28:49] 06SRE, 06Infrastructure-Foundations, 10Packaging, 10Scap, and 2 others: 14Install git-lfs client (at least on scap targets & masters) - 14https://phabricator.wikimedia.org/T180628#9687246 (10hashar) [08:30:33] (03PS1) 10Brouberol: wmnet: fix typo in mpic staging record [dns] - 10https://gerrit.wikimedia.org/r/1017020 (https://phabricator.wikimedia.org/T361339) [08:30:49] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1016378 (https://phabricator.wikimedia.org/T361786) [08:31:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T355609)', diff saved to https://phabricator.wikimedia.org/P59433 and previous config saved to /var/cache/conftool/dbconfig/20240404-083147-marostegui.json [08:31:50] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:33:20] (03CR) 10Volans: [C:03+1] "LGTM" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [08:36:35] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: add logstash_oidc client [puppet] - 10https://gerrit.wikimedia.org/r/1016301 (https://phabricator.wikimedia.org/T337818) (owner: 10Filippo Giunchedi) [08:36:40] (03PS3) 10Filippo Giunchedi: hieradata: add logstash_oidc client 
[puppet] - 10https://gerrit.wikimedia.org/r/1016301 (https://phabricator.wikimedia.org/T337818) [08:39:18] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] hieradata: add logstash_oidc client [puppet] - 10https://gerrit.wikimedia.org/r/1016301 (https://phabricator.wikimedia.org/T337818) (owner: 10Filippo Giunchedi) [08:43:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P59434 and previous config saved to /var/cache/conftool/dbconfig/20240404-084327-marostegui.json [08:46:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59435 and previous config saved to /var/cache/conftool/dbconfig/20240404-084655-marostegui.json [08:53:48] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2123 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1016379 (https://phabricator.wikimedia.org/T361789) [08:55:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s5 T361789 [08:55:19] T361789: Switchover s5 master (db2113 -> db2123) - https://phabricator.wikimedia.org/T361789 [08:55:38] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::backup_source [08:55:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s5 T361789 [08:56:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2123 with weight 0 T361789', diff saved to https://phabricator.wikimedia.org/P59436 and previous config saved to /var/cache/conftool/dbconfig/20240404-085606-arnaudb.json [08:58:27] (03PS1) 10Muehlenhoff: Switch mariadb::backup_source to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017026 (https://phabricator.wikimedia.org/T349619) [08:58:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T356166)', diff 
saved to https://phabricator.wikimedia.org/P59437 and previous config saved to /var/cache/conftool/dbconfig/20240404-085834-marostegui.json [08:58:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [08:58:38] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [08:58:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [08:58:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T356166)', diff saved to https://phabricator.wikimedia.org/P59438 and previous config saved to /var/cache/conftool/dbconfig/20240404-085856-marostegui.json [09:00:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T356166)', diff saved to https://phabricator.wikimedia.org/P59439 and previous config saved to /var/cache/conftool/dbconfig/20240404-090007-marostegui.json [09:00:38] (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::backup_source to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017026 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:02:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59440 and previous config saved to /var/cache/conftool/dbconfig/20240404-090202-marostegui.json [09:03:27] 10SRE-swift-storage, 10Phabricator, 07git-lfs, 10Release-Engineering-Team (Seen): 14Connect Phabricator to swift for storage of git-lfs and file uploads. - 14https://phabricator.wikimedia.org/T182085#9687425 (10MatthewVernon) 14there was maybe a suggestion of using it for files uploaded to phab? 
[09:11:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::backup_source [09:12:01] (03CR) 10Santiago Faci: [C:03+1] idp: add mpic(_next) clients [puppet] - 10https://gerrit.wikimedia.org/r/1017014 (https://phabricator.wikimedia.org/T361341) (owner: 10Brouberol) [09:12:26] (03CR) 10Santiago Faci: [C:03+1] wmnet: fix typo in mpic staging record [dns] - 10https://gerrit.wikimedia.org/r/1017020 (https://phabricator.wikimedia.org/T361339) (owner: 10Brouberol) [09:12:59] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2123 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1016379 (https://phabricator.wikimedia.org/T361789) (owner: 10Gerrit maintenance bot) [09:14:16] !log Starting s5 codfw failover from db2113 to db2123 - T361789 [09:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:20] T361789: Switchover s5 master (db2113 -> db2123) - https://phabricator.wikimedia.org/T361789 [09:14:51] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9687499 (10MoritzMuehlenhoff) [09:15:00] (03CR) 10Santiago Faci: [C:03+1] trafficserver: add redirection config for mpic(-test) [puppet] - 10https://gerrit.wikimedia.org/r/1017005 (https://phabricator.wikimedia.org/T361340) (owner: 10Brouberol) [09:15:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2123 to s5 primary T361789', diff saved to https://phabricator.wikimedia.org/P59441 and previous config saved to /var/cache/conftool/dbconfig/20240404-091512-arnaudb.json [09:15:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P59442 and previous config saved to /var/cache/conftool/dbconfig/20240404-091521-marostegui.json [09:17:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T355609)', diff saved to 
https://phabricator.wikimedia.org/P59443 and previous config saved to /var/cache/conftool/dbconfig/20240404-091709-marostegui.json [09:17:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:17:14] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:17:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:17:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T355609)', diff saved to https://phabricator.wikimedia.org/P59444 and previous config saved to /var/cache/conftool/dbconfig/20240404-091732-marostegui.json [09:18:58] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T361798 (10Ospingou) 03NEW [09:18:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'bump db2113 weight', diff saved to https://phabricator.wikimedia.org/P59445 and previous config saved to /var/cache/conftool/dbconfig/20240404-091858-arnaudb.json [09:19:22] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T361798#9687542 (10Ospingou) [09:23:58] 06SRE, 10Maps: Allow Wikimedia Maps usage on wikidata.pl - https://phabricator.wikimedia.org/T344678#9687566 (10Wargo) Still relevant?
[09:30:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P59446 and previous config saved to /var/cache/conftool/dbconfig/20240404-093028-marostegui.json [09:31:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc [09:34:07] (03CR) 10Brouberol: [C:03+2] wmnet: fix typo in mpic staging record [dns] - 10https://gerrit.wikimedia.org/r/1017020 (https://phabricator.wikimedia.org/T361339) (owner: 10Brouberol) [09:37:17] (03CR) 10Brouberol: [C:03+2] deployment_server: provisiom mpic(-next) view/deploy users [puppet] - 10https://gerrit.wikimedia.org/r/1016929 (https://phabricator.wikimedia.org/T361336) (owner: 10Brouberol) [09:40:05] (03PS1) 10Muehlenhoff: Switch mariadb::misc to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017029 (https://phabricator.wikimedia.org/T349619) [09:41:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2104.codfw.wmnet [09:42:43] (03CR) 10Brouberol: [C:03+2] trafficserver: add redirection config for mpic(-test) [puppet] - 10https://gerrit.wikimedia.org/r/1017005 (https://phabricator.wikimedia.org/T361340) (owner: 10Brouberol) [09:43:18] (03PS1) 10Marostegui: site.pp: Remove db2104 [puppet] - 10https://gerrit.wikimedia.org/r/1017030 (https://phabricator.wikimedia.org/T361779) [09:44:01] (03CR) 10Marostegui: [C:03+2] site.pp: Remove db2104 [puppet] - 10https://gerrit.wikimedia.org/r/1017030 (https://phabricator.wikimedia.org/T361779) (owner: 10Marostegui) [09:44:43] (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::misc to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017029 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:45:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T356166)', diff saved to https://phabricator.wikimedia.org/P59448 and previous config saved to 
/var/cache/conftool/dbconfig/20240404-094536-marostegui.json [09:45:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [09:45:43] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [09:45:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [09:45:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:46:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:46:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T356166)', diff saved to https://phabricator.wikimedia.org/P59449 and previous config saved to /var/cache/conftool/dbconfig/20240404-094608-marostegui.json [09:46:11] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [09:48:06] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2104.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:49:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2104.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:49:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:49:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2104.codfw.wmnet [09:49:23] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2104.codfw.wmnet - 
https://phabricator.wikimedia.org/T361779#9687706 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `db2104.codfw.wmnet` - db2104.codfw.wmnet (**PASS**) - Downtimed h... [09:50:00] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2104.codfw.wmnet - https://phabricator.wikimedia.org/T361779#9687711 (10Marostegui) Ready for DC-Ops [09:51:02] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2104.codfw.wmnet - https://phabricator.wikimedia.org/T361779#9687707 (10Marostegui) a:05Marostegui→03None [09:52:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::misc [09:54:25] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9687715 (10MoritzMuehlenhoff) [09:57:34] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::objectstash [09:59:13] (03CR) 10David Caro: [C:03+1] harbor: upgrade from 2.9.0 to 2.10.1 [puppet] - 10https://gerrit.wikimedia.org/r/1016724 (https://phabricator.wikimedia.org/T354507) (owner: 10Slavina Stefanova) [09:59:49] (03CR) 10Slavina Stefanova: [C:03+1] harbor: upgrade from 2.9.0 to 2.10.1 [puppet] - 10https://gerrit.wikimedia.org/r/1016724 (https://phabricator.wikimedia.org/T354507) (owner: 10Slavina Stefanova) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1000) [10:00:19] (03PS2) 10AikoChou: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014545 (https://phabricator.wikimedia.org/T360423) [10:00:59] (03PS1) 10Muehlenhoff: Switch mariadb::objectstash to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017032 (https://phabricator.wikimedia.org/T349619) [10:01:02] (03CR) 10David Caro: [C:03+2] harbor: upgrade from 2.9.0 to 
2.10.1 [puppet] - 10https://gerrit.wikimedia.org/r/1016724 (https://phabricator.wikimedia.org/T354507) (owner: 10Slavina Stefanova) [10:02:55] (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::objectstash to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1017032 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:02:59] (03PS3) 10Brouberol: idp: add mpic(_next) clients [puppet] - 10https://gerrit.wikimedia.org/r/1017014 (https://phabricator.wikimedia.org/T361341) [10:03:28] dcaro: I'll merge your patch along, ok? [10:03:43] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016382 [10:04:02] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016382 (owner: 10PipelineBot) [10:04:05] moritzm: yes please [10:04:48] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016382 (owner: 10PipelineBot) [10:05:22] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:05:48] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:06:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T355609)', diff saved to https://phabricator.wikimedia.org/P59450 and previous config saved to /var/cache/conftool/dbconfig/20240404-100612-marostegui.json [10:06:16] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:07:07] dcaro: ack, now merged [10:07:16] thanks!
[10:07:43] (03PS1) 10Brouberol: mpic: scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017034 (https://phabricator.wikimedia.org/T361343) [10:07:54] (03PS1) 10Jgiannelos: mobileapps: Use codfw as cassandra local DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017035 (https://phabricator.wikimedia.org/T350507) [10:08:50] (03PS2) 10Jgiannelos: mobileapps: Use codfw as cassandra local DC on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017035 (https://phabricator.wikimedia.org/T350507) [10:09:52] (03CR) 10Jgiannelos: "Related error from staging: https://phabricator.wikimedia.org/T350507#9687740" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017035 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [10:10:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::objectstash [10:13:36] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9687751 (10MoritzMuehlenhoff) [10:21:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59451 and previous config saved to /var/cache/conftool/dbconfig/20240404-102120-marostegui.json [10:23:27] (03CR) 10Hnowlan: "This user probably needs an entry in modules/cassandra/templates/users/ too" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [10:28:31] (03CR) 10Santiago Faci: [C:03+1] idp: add mpic(_next) clients [puppet] - 10https://gerrit.wikimedia.org/r/1017014 (https://phabricator.wikimedia.org/T361341) (owner: 10Brouberol) [10:28:48] (03CR) 10Brouberol: [C:03+2] idp: add mpic(_next) clients [puppet] - 10https://gerrit.wikimedia.org/r/1017014 (https://phabricator.wikimedia.org/T361341) (owner: 10Brouberol) [10:29:06] (03PS1) 10Esanders: End EditCheck 
add-a-reference A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017038 (https://phabricator.wikimedia.org/T361727) [10:31:44] (03CR) 10Hnowlan: [C:03+1] mobileapps: Use codfw as cassandra local DC on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017035 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [10:32:11] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Use codfw as cassandra local DC on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017035 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [10:32:57] (03PS2) 10Brouberol: Remove leftovers from old an-coord nodes [puppet] - 10https://gerrit.wikimedia.org/r/1016308 (https://phabricator.wikimedia.org/T353774) (owner: 10Muehlenhoff) [10:33:10] (03Merged) 10jenkins-bot: mobileapps: Use codfw as cassandra local DC on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017035 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [10:33:48] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:33:51] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:34:03] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:34:34] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:34:48] (03Abandoned) 10Muehlenhoff: Remove leftovers from old an-coord nodes [puppet] - 10https://gerrit.wikimedia.org/r/1016308 (https://phabricator.wikimedia.org/T353774) (owner: 10Muehlenhoff) [10:36:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59453 and previous config saved to /var/cache/conftool/dbconfig/20240404-103628-marostegui.json [10:51:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T355609)', diff saved to https://phabricator.wikimedia.org/P59455 
and previous config saved to /var/cache/conftool/dbconfig/20240404-105135-marostegui.json [10:51:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [10:51:39] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:51:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [10:51:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T355609)', diff saved to https://phabricator.wikimedia.org/P59456 and previous config saved to /var/cache/conftool/dbconfig/20240404-105158-marostegui.json [11:27:07] On March 24 I used "Email this user" on enWS to email another user, asking for a copy. Today I've received 6 (and counting) copies of that email, and the original recipient reports getting the same. [11:28:01] Started around 08:15 UTC today, and last came in 10 minutes ago. [11:29:15] That sounds to me like a hiccup either down in the email transport (MTA), or in the jobqueue / cron / whatever that connects MediaWiki with the email transport. [11:30:01] I haven't seen any other users reporting the same problem on-wiki, so I don't know if it's just this message or a general problem. [11:30:34] But it could point at an awful lot of email getting resent to an awful lot of users. [11:32:31] (Traffic bill over quota) firing: (3) Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:37:31] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:40:09] The only things that look remotely connected time-wise are T361750 (but that's VM quota stuff, so I don't see it impacting anything) and a couple of db nodes getting repooled and eventually promoted to master.
[11:40:10] T361750: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750 [11:40:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T355609)', diff saved to https://phabricator.wikimedia.org/P59457 and previous config saved to /var/cache/conftool/dbconfig/20240404-114012-marostegui.json [11:40:20] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:41:03] I don't see any newly opened tasks on Phab about this, and I'm not finding any old tasks with obvious relevance. [11:49:29] !log ayounsi@cumin1002 START - Cookbook sre.network.debug for Netbox circuit ID 108 [11:49:42] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108 [11:51:41] xover: I would create a private paste (https://phabricator.wikimedia.org/paste/edit/form/45/ - Change "Visible to" to Allow members of project acl*security) with copies of the full headers, then create a Phabricator task about the issue and reference the P#### number in it so the SRE team that handles mail can have a look [11:52:31] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:55:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59458 and previous config saved to /var/cache/conftool/dbconfig/20240404-115520-marostegui.json [11:57:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T356166)', diff saved to https://phabricator.wikimedia.org/P59459 and previous config saved to /var/cache/conftool/dbconfig/20240404-115709-marostegui.json [11:57:14] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [11:57:18] +1 on a ticket + private paste [11:57:31] (Traffic bill over 
quota) resolved: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1200) [12:08:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: provisionning db2207.codfw.wmnet - T355422 [12:08:29] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [12:08:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: provisionning db2207.codfw.wmnet - T355422 [12:08:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: provisionning db2207.codfw.wmnet - T355422 [12:08:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: provisionning db2207.codfw.wmnet - T355422 [12:09:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2107 in db2207 for T355422', diff saved to https://phabricator.wikimedia.org/P59460 and previous config saved to /var/cache/conftool/dbconfig/20240404-121008-arnaudb.json [12:10:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59461 and previous config saved to /var/cache/conftool/dbconfig/20240404-121027-marostegui.json [12:11:58] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:12:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P59462 and previous config saved to /var/cache/conftool/dbconfig/20240404-121218-marostegui.json [12:12:27] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2107.codfw.wmnet onto db2207.codfw.wmnet [12:15:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: provisionning db2213.codfw.wmnet - T355422 [12:16:02] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [12:16:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: provisionning db2213.codfw.wmnet - T355422 [12:16:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: provisionning db2213.codfw.wmnet - T355422 [12:16:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: provisionning db2213.codfw.wmnet - T355422 [12:17:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2113 in db2213 for T355422', diff saved to https://phabricator.wikimedia.org/P59463 and previous config saved to /var/cache/conftool/dbconfig/20240404-121722-arnaudb.json [12:18:23] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2113.codfw.wmnet onto db2213.codfw.wmnet [12:25:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T355609)', diff saved to https://phabricator.wikimedia.org/P59465 and previous config saved to /var/cache/conftool/dbconfig/20240404-122535-marostegui.json [12:25:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [12:25:39] T355609: Make cuc_id a bigint - 
https://phabricator.wikimedia.org/T355609 [12:25:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [12:25:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T355609)', diff saved to https://phabricator.wikimedia.org/P59466 and previous config saved to /var/cache/conftool/dbconfig/20240404-122557-marostegui.json [12:27:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P59467 and previous config saved to /var/cache/conftool/dbconfig/20240404-122727-marostegui.json [12:36:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:36:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:36:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:36:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:36:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59468 and previous config saved to /var/cache/conftool/dbconfig/20240404-123645-arnaudb.json [12:36:48] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:42:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T356166)', diff saved to https://phabricator.wikimedia.org/P59469 and previous config saved to /var/cache/conftool/dbconfig/20240404-124235-marostegui.json [12:42:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on 
db1206.eqiad.wmnet with reason: Maintenance [12:42:38] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [12:42:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [12:42:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T356166)', diff saved to https://phabricator.wikimedia.org/P59470 and previous config saved to /var/cache/conftool/dbconfig/20240404-124257-marostegui.json [12:56:33] (03PS2) 10Brouberol: mpic: scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017034 (https://phabricator.wikimedia.org/T361343) [12:56:49] (03PS3) 10Muehlenhoff: prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013074 [12:58:21] (03CR) 10Muehlenhoff: [C:03+2] debmonitor: Remove obsolete discovery certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016723 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [12:58:25] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014545 (https://phabricator.wikimedia.org/T360423) (owner: 10AikoChou) [12:59:00] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove dummy cert for debmonitor [labs/private] - 10https://gerrit.wikimedia.org/r/1016726 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [12:59:14] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014545 (https://phabricator.wikimedia.org/T360423) (owner: 10AikoChou) [12:59:26] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9688084 (10MoritzMuehlenhoff) [12:59:42] (03PS1) 10Gmodena: analytics: refinery: add 
webrequest_frontend timer [puppet] - 10https://gerrit.wikimedia.org/r/1017041 (https://phabricator.wikimedia.org/T314956) [12:59:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1013542 (owner: 10Muehlenhoff) [12:59:54] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1013521 (owner: 10Muehlenhoff) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1300). [13:00:05] esanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59471 and previous config saved to /var/cache/conftool/dbconfig/20240404-130022-arnaudb.json [13:00:47] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:00:48] (03CR) 10Muehlenhoff: [C:03+2] cloudceph::mon: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013542 (owner: 10Muehlenhoff) [13:01:07] I can’t deploy yet but probably later in the window [13:02:06] (03CR) 10AikoChou: [C:03+2] ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014545 (https://phabricator.wikimedia.org/T360423) (owner: 10AikoChou) [13:02:12] (03CR) 10Ayounsi: [C:03+1] tests: fix typos in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016814 (owner: 10Volans) [13:02:24] (03Merged) 10jenkins-bot: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014545 
(https://phabricator.wikimedia.org/T360423) (owner: 10AikoChou) [13:02:30] (03CR) 10Muehlenhoff: [C:03+2] cloudceph::osd: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013521 (owner: 10Muehlenhoff) [13:03:17] (03CR) 10Volans: [C:03+2] tests: fix typos in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016814 (owner: 10Volans) [13:04:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2113.codfw.wmnet onto db2213.codfw.wmnet [13:04:45] (03PS7) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 [13:04:55] (03PS1) 10Ayounsi: add_ip6_mapped - don't fail if the host already have a /128 address [puppet] - 10https://gerrit.wikimedia.org/r/1017047 (https://phabricator.wikimedia.org/T300152) [13:05:01] (03Merged) 10jenkins-bot: tests: fix typos in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016814 (owner: 10Volans) [13:05:30] (03PS8) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 [13:05:44] (03PS9) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 [13:06:16] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/1017047 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:06:24] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/output/1017047/1796/testvm2006.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1017047 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:06:28] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the patience ;)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [13:07:24] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1017047 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) 
[13:07:32] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016384 [13:07:48] (03CR) 10Ayounsi: Netbox: add functions to get and set device name (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [13:08:04] (03CR) 10Ayounsi: [C:03+2] Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [13:08:24] (03PS9) 10Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) [13:08:32] (03PS4) 10Ayounsi: Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) [13:08:36] (03CR) 10Ilias Sarantopoulos: Add new version for amd-pytorch image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [13:08:38] alright, I can deploy now if esanders is around [13:09:03] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Phase out cergen for Fundraising services - https://phabricator.wikimedia.org/T360779#9688365 (10MoritzMuehlenhoff) As for the host where to export the keys, the cumin hosts seems like the best choice. [13:09:07] (03Merged) 10jenkins-bot: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [13:09:29] 06SRE, 06Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTBase Sunsetting, and 3 others: 14Setup allowed list for MCS decom - 14https://phabricator.wikimedia.org/T340036#9688380 (10akosiaris) 14I guess it's about time I ask if it is ok to remove those exceptions now and return 403 to everyo... 
[13:09:51] (03CR) 10CI reject: [V:04-1] Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi) [13:10:25] (03CR) 10Ayounsi: "Note that it also remove the resource once it has been applied once. For example https://puppet-compiler.wmflabs.org/output/1017047/1798/n" [puppet] - 10https://gerrit.wikimedia.org/r/1017047 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:10:51] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835 (10WDoranWMF) 03NEW [13:10:57] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9688437 (10WDoranWMF) p:05Triage→03High [13:11:15] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9688441 (10WDoranWMF) [13:12:21] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: remove duplicate key type from gitlab known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1013004 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [13:15:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T355609)', diff saved to https://phabricator.wikimedia.org/P59472 and previous config saved to /var/cache/conftool/dbconfig/20240404-131504-marostegui.json [13:15:09] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:15:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P59473 and previous config saved to 
/var/cache/conftool/dbconfig/20240404-131529-arnaudb.json [13:18:29] (03PS1) 10Alexandros Kosiaris: changeprop: Remove all MCS endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017054 (https://phabricator.wikimedia.org/T361483) [13:18:42] (03CR) 10CI reject: [V:04-1] changeprop: Remove all MCS endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017054 (https://phabricator.wikimedia.org/T361483) (owner: 10Alexandros Kosiaris) [13:21:12] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Corresponding DiscussionTools code was completely removed in change I432ec0a24b (commit 5ba0bfa026), in the wmf/1.42.0-wmf.19 train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004749 (owner: 10Esanders) [13:21:28] I can deploy that config cleanup at least, I don’t think esanders needs to confirm that one [13:22:06] (03PS2) 10Lucas Werkmeister (WMDE): DiscussionTools: Remove no-op config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004749 (owner: 10Esanders) [13:22:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004749 (owner: 10Esanders) [13:23:54] (03PS1) 10Slyngshede: Update error pages to Codex design. [software/bitu] - 10https://gerrit.wikimedia.org/r/1017056 [13:24:02] (03Merged) 10jenkins-bot: DiscussionTools: Remove no-op config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004749 (owner: 10Esanders) [13:24:26] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1004749|DiscussionTools: Remove no-op config]] [13:24:33] Lucas_WMDE: hi [13:24:45] hi! 
[13:24:54] I hope it’s okay that I already started with the no-op config removal [13:25:08] Thanks [13:25:31] (03PS1) 10Jcrespo: mariadb: Reenable notifications for backup source host db2198 [puppet] - 10https://gerrit.wikimedia.org/r/1017057 (https://phabricator.wikimedia.org/T355422) [13:26:44] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and esanders: Backport for [[gerrit:1004749|DiscussionTools: Remove no-op config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:26:53] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and esanders: Continuing with sync [13:26:58] They're all fairly low risk [13:27:02] ok [13:27:49] (03CR) 10Jcrespo: "Heads up for the DBAs that the host has been recovered with yesterday's content." [puppet] - 10https://gerrit.wikimedia.org/r/1017057 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [13:29:16] edsanders: I’m guessing the request mentioned at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1015083 is not publicly visible because it happened on collabwiki itself? [13:29:31] (which apparently “can only be seen by authorized users” according to its main page) [13:30:00] Yeah, it was internal [13:30:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P59474 and previous config saved to /var/cache/conftool/dbconfig/20240404-133012-marostegui.json [13:30:27] ok [13:30:32] It's a minor behaviour tweak to VE [13:30:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P59475 and previous config saved to /var/cache/conftool/dbconfig/20240404-133037-arnaudb.json [13:31:14] (03CR) 10Alexandros Kosiaris: [C:03+1] "That's a lot of hacks, ouch. If it is tested that it works, it should allow us to move forward for a while, but it feels somewhat brittle." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [13:39:36] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1004749|DiscussionTools: Remove no-op config]] (duration: 15m 10s) [13:40:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2107.codfw.wmnet onto db2207.codfw.wmnet [13:41:06] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1017038|End EditCheck add-a-reference A/B test (T361727)]] [13:41:09] T361727: [Config] Stop the Edit Check (references) A/B test - https://phabricator.wikimedia.org/T361727 [13:43:21] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and esanders: Backport for [[gerrit:1017038|End EditCheck add-a-reference A/B test (T361727)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:43:32] edsanders: can you test the ended A/B test on mwdebug? [13:43:37] I'd like to backport a translation change for T361695 once others are done [13:43:38] T361695: The log type {log_type_one} has the same translation as {log_type_two} for {lang}. {log_type_one} will not be displayed in the drop down menu on Special:Log. - https://phabricator.wikimedia.org/T361695 [13:43:43] jouncebot: next [13:43:43] In 2 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1600) [13:43:53] Dreamy_Jazz: it’ll probably be after the end of the window but I think that’s okay [13:44:31] Okay. I may not be able to do it right now as I would also need someone to give it a review (as it would be picking a specific translation change to wmf branches) [13:44:50] Which I don't(?) 
think I can self-merge [13:45:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P59476 and previous config saved to /var/cache/conftool/dbconfig/20240404-134519-marostegui.json [13:45:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59477 and previous config saved to /var/cache/conftool/dbconfig/20240404-134544-arnaudb.json [13:45:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:45:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:45:55] Lucas_WMDE: looking [13:46:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:46:01] I'll also have some security patches to deploy after that. [13:46:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T360332)', diff saved to https://phabricator.wikimedia.org/P59478 and previous config saved to /var/cache/conftool/dbconfig/20240404-134607-arnaudb.json [13:46:18] But it looks like the calendar is free for the next few hours for such a security deploy [13:46:54] yeah [13:47:03] and I can try to review the change, I’ll probably be around after the window for a while [13:47:46] :) [13:48:52] Lucas_WMDE: looks good [13:48:55] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and esanders: Continuing with sync [13:48:59] alright, thanks! [13:54:10] feels like the sync-prod-k8s step in scap is taking longer now [13:54:18] which I guess makes sense as we move more and more traffic to k8s [13:54:37] I assume it’s only restarting a few pods(?) 
at a time, to make sure there’s always enough capacity to handle requests [13:55:15] I'm not sure, but from the incident yesterday this graph can tell you how the k8s deployment is going: https://grafana.wikimedia.org/d/p8RgaNXGk/calico-typha?orgId=1&from=now-12h&to=now&viewPanel=78 [13:55:17] would be cool to have more visibility into it though, like the in-flight / ok / fail / left numbers for bare-metal steps [13:55:29] The spikes correlate with deployments [13:55:40] * Lucas_WMDE looks up what calico does [13:55:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 10%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59479 and previous config saved to /var/cache/conftool/dbconfig/20240404-135547-arnaudb.json [13:55:57] Lucas_WMDE: glad its not just me [13:57:53] doesn’t sound like the kind of service that should use *that* much CPU but what do I know [13:58:19] That was the service that crashed yesterday and caused the site incident (IIRC) [13:58:35] is that when the last sentence of the first paragraph at https://wikitech.wikimedia.org/wiki/Calico#Typha was added ._. 
[13:58:48] huh, no, last edit was in february [13:59:01] then it was prescient I guess [13:59:57] BTW https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1017062 is the translation change I want to make - It will reduce logstash spam over the weekend - See https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=h@f84680b&_a=h@84299a4 [14:00:12] I’m guessing each deployment generates a lot of events in k8s, and Typha filters those events, so it has more to do when k8s is “active” [14:00:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 10%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59480 and previous config saved to /var/cache/conftool/dbconfig/20240404-140011-arnaudb.json [14:00:23] (scap reached php-fpm-restart now btw) [14:00:25] The change to that message key for sk is already applied to master [14:00:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T355609)', diff saved to https://phabricator.wikimedia.org/P59481 and previous config saved to /var/cache/conftool/dbconfig/20240404-140027-marostegui.json [14:00:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [14:00:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:00:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [14:00:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T355609)', diff saved to https://phabricator.wikimedia.org/P59482 and previous config saved to /var/cache/conftool/dbconfig/20240404-140050-marostegui.json [14:01:10] I see [14:01:11] Actually the dashboard link didn't work. 
This one should work: https://logstash.wikimedia.org/goto/8e66ad6c51d36cdc4afb9844758db4a9 [14:01:12] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1017038|End EditCheck add-a-reference A/B test (T361727)]] (duration: 20m 05s) [14:01:15] T361727: [Config] Stop the Edit Check (references) A/B test - https://phabricator.wikimedia.org/T361727 [14:02:17] The bug that the logstash errors complain about has been around for 16 years (?!), but the logstash error is new so that these issues can be found. [14:02:31] neat ^^ [14:02:40] edsanders: deploying the third change now, I hope you still have time [14:03:22] “Finished sync-prod-k8s (duration: 07m 07s)” “Finished php-fpm-restarts (duration: 02m 28s)” – the bare-metal restarts now take less time than the k8s restarts, that’s neat [14:03:24] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1015083|Enable wgVisualEditorAllowExternalLinkPaste at collabwiki]] [14:04:08] :D. I imagine that soon it will be under 1 minute for the bare-metal restarts [14:04:55] It would be nice to have more visual output on the command that does the k8s restarts (i.e. a progress bar of some kind) [14:05:23] yeah [14:05:26] (to both ^^) [14:05:40] !log lucaswerkmeister-wmde@deploy1002 esanders and lucaswerkmeister-wmde: Backport for [[gerrit:1015083|Enable wgVisualEditorAllowExternalLinkPaste at collabwiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:14] edsanders: can you test the collabwiki change on mwdebug? [14:08:22] Lucas_WMDE: looking [14:09:21] Lucas_WMDE: looks good [14:09:48] !log lucaswerkmeister-wmde@deploy1002 esanders and lucaswerkmeister-wmde: Continuing with sync [14:09:52] cool, thanks! [14:10:50] Dreamy_Jazz: do you want to deploy the change yourself or should I do it? 
[14:10:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T360332)', diff saved to https://phabricator.wikimedia.org/P59483 and previous config saved to /var/cache/conftool/dbconfig/20240404-141051-arnaudb.json [14:10:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:10:54] (once the current deployment is done) [14:11:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 20%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59484 and previous config saved to /var/cache/conftool/dbconfig/20240404-141100-arnaudb.json [14:11:12] I can deploy it, but I just didn't want to self-deploy without having someone say that it's fine to deploy [14:11:23] But I've now got a +1 on it, so it should be fine. [14:11:47] alright :) [14:11:51] I’ll ping you then [14:11:55] Thanks! [14:12:21] I'll do my security deploys after that. I'll stop once the puppet window starts. [14:12:31] sounds good [14:15:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 20%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59485 and previous config saved to /var/cache/conftool/dbconfig/20240404-141517-arnaudb.json [14:22:08] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1015083|Enable wgVisualEditorAllowExternalLinkPaste at collabwiki]] (duration: 18m 43s) [14:22:15] * Lucas_WMDE done [14:22:17] Dreamy_Jazz: all yours [14:22:23] Thanks! 
[14:25:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P59486 and previous config saved to /var/cache/conftool/dbconfig/20240404-142558-arnaudb.json [14:26:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 30%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59487 and previous config saved to /var/cache/conftool/dbconfig/20240404-142606-arnaudb.json [14:30:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 1%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59488 and previous config saved to /var/cache/conftool/dbconfig/20240404-143006-arnaudb.json [14:30:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 1%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59489 and previous config saved to /var/cache/conftool/dbconfig/20240404-143020-arnaudb.json [14:30:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 30%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59490 and previous config saved to /var/cache/conftool/dbconfig/20240404-143030-arnaudb.json [14:33:42] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1017062|Remove sk translation of centralauth-rightslog-name (T361695)]] [14:33:45] T361695: The log type {log_type_one} has the same translation as {log_type_two} for {lang}. {log_type_one} will not be displayed in the drop down menu on Special:Log. 
- https://phabricator.wikimedia.org/T361695 [14:36:01] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1017062|Remove sk translation of centralauth-rightslog-name (T361695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:17] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [14:36:54] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9688647 (10hnowlan) [14:38:07] (03PS1) 10Dreamy Jazz: Remove sk translation of centralauth-rightslog-name [extensions/CentralAuth] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1017062 (https://phabricator.wikimedia.org/T361695) [14:38:15] (03CR) 10Arnaudb: [C:03+1] mariadb: Reenable notifications for backup source host db2198 [puppet] - 10https://gerrit.wikimedia.org/r/1017057 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [14:38:27] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:39] (03CR) 10Jforrester: [C:03+1] Remove sk translation of centralauth-rightslog-name [extensions/CentralAuth] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1017062 (https://phabricator.wikimedia.org/T361695) (owner: 10Dreamy Jazz) [14:38:59] (03PS2) 10Lucas Werkmeister (WMDE): Enable wgVisualEditorAllowExternalLinkPaste at collabwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015083 (owner: 10Esanders) [14:39:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015083 (owner: 10Esanders) [14:39:07] (03Merged) 10jenkins-bot: Enable wgVisualEditorAllowExternalLinkPaste 
at collabwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015083 (owner: 10Esanders) [14:39:11] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016385 [14:39:50] (03PS1) 10Arnaudb: mariadb: toggle notifications and roles for db2213 db2207 [puppet] - 10https://gerrit.wikimedia.org/r/1017066 (https://phabricator.wikimedia.org/T355422) [14:40:26] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Remove sk translation of centralauth-rightslog-name [extensions/CentralAuth] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1017062 (https://phabricator.wikimedia.org/T361695) (owner: 10Dreamy Jazz) [14:41:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P59491 and previous config saved to /var/cache/conftool/dbconfig/20240404-144105-arnaudb.json [14:41:09] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1800/console" [puppet] - 10https://gerrit.wikimedia.org/r/1013571 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [14:41:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 50%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59492 and previous config saved to /var/cache/conftool/dbconfig/20240404-144111-arnaudb.json [14:41:21] (03PS1) 10Cathal Mooney: Netbox custom script to add additional IPv4 addresses to host [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) [14:41:25] (03CR) 10Ilias Sarantopoulos: "I built image docker-registry.wikimedia.org/amd-pytorch21:2.1.2rocm5.5-1 and it is 10.2GB" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [14:42:47] (03CR) 10Elukey: [C:03+2] Add new 
version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [14:42:51] (03CR) 10Elukey: [V:03+2 C:03+2] Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [14:44:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CentralAuth] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1017062 (https://phabricator.wikimedia.org/T361695) (owner: 10Dreamy Jazz) [14:44:32] (03CR) 10Eevans: "Thanks for having a look! If you have any thoughts about the name of the user as well (`dbdev`), I am open to bikeshedding (I don't much " [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [14:45:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 2%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59493 and previous config saved to /var/cache/conftool/dbconfig/20240404-144511-arnaudb.json [14:45:16] (03PS6) 10Eevans: (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [14:45:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 2%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59494 and previous config saved to /var/cache/conftool/dbconfig/20240404-144526-arnaudb.json [14:45:28] (03CR) 10Marostegui: [C:03+1] mariadb: toggle notifications and roles for db2213 db2207 [puppet] - 10https://gerrit.wikimedia.org/r/1017066 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [14:45:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 50%: Post clone repool (src)', diff saved to 
https://phabricator.wikimedia.org/P59495 and previous config saved to /var/cache/conftool/dbconfig/20240404-144536-arnaudb.json [14:45:42] (03CR) 10Arnaudb: [C:03+2] mariadb: toggle notifications and roles for db2213 db2207 [puppet] - 10https://gerrit.wikimedia.org/r/1017066 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [14:46:59] (03CR) 10Cathal Mooney: "Sorry for the delay reviewing this. LGTM overall, played with it on -next and worked as expected. I'll reply on task with some challenge" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1012680 (https://phabricator.wikimedia.org/T360297) (owner: 10Ayounsi) [14:47:15] (03Merged) 10jenkins-bot: Remove sk translation of centralauth-rightslog-name [extensions/CentralAuth] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1017062 (https://phabricator.wikimedia.org/T361695) (owner: 10Dreamy Jazz) [14:47:35] 10SRE-swift-storage: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844 (10MatthewVernon) 03NEW [14:47:39] 10SRE-swift-storage: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844#9688867 (10MatthewVernon) p:05Triage→03High [14:48:09] (03PS1) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [14:48:32] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1017062|Remove sk translation of centralauth-rightslog-name (T361695)]] (duration: 14m 49s) [14:48:35] T361695: The log type {log_type_one} has the same translation as {log_type_two} for {lang}. {log_type_one} will not be displayed in the drop down menu on Special:Log. 
- https://phabricator.wikimedia.org/T361695 [14:49:45] (03PS1) 10Fabfur: benthos: fix unit tests to reflect recent changes in schema [puppet] - 10https://gerrit.wikimedia.org/r/1017090 (https://phabricator.wikimedia.org/T358109) [14:50:05] Now deploying a security patch [14:50:16] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1801/console" [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [14:50:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T355609)', diff saved to https://phabricator.wikimedia.org/P59496 and previous config saved to /var/cache/conftool/dbconfig/20240404-145041-marostegui.json [14:50:44] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:50:48] (03PS2) 10Fabfur: benthos: add metric for ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1017088 (https://phabricator.wikimedia.org/T361845) [14:51:10] (03CR) 10FebinBellamy: [C:03+1] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016385 (owner: 10PipelineBot) [14:51:48] (03CR) 10Fabfur: [C:03+2] benthos: fix unit tests to reflect recent changes in schema [puppet] - 10https://gerrit.wikimedia.org/r/1017090 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:52:22] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016385 (owner: 10PipelineBot) [14:53:21] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016385 (owner: 10PipelineBot) [14:56:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T360332)', diff saved to https://phabricator.wikimedia.org/P59497 and previous config saved to /var/cache/conftool/dbconfig/20240404-145613-arnaudb.json 
[14:56:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:56:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 70%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59498 and previous config saved to /var/cache/conftool/dbconfig/20240404-145617-arnaudb.json [14:56:18] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:56:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:56:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:57:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:57:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T360332)', diff saved to https://phabricator.wikimedia.org/P59499 and previous config saved to /var/cache/conftool/dbconfig/20240404-145714-arnaudb.json [14:57:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 36.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:58:27] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 4%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59500 and previous config saved 
to /var/cache/conftool/dbconfig/20240404-150017-arnaudb.json [15:00:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 4%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59501 and previous config saved to /var/cache/conftool/dbconfig/20240404-150032-arnaudb.json [15:00:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 70%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59502 and previous config saved to /var/cache/conftool/dbconfig/20240404-150041-arnaudb.json [15:01:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T360332)', diff saved to https://phabricator.wikimedia.org/P59503 and previous config saved to /var/cache/conftool/dbconfig/20240404-150145-arnaudb.json [15:01:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:03:44] !log dreamyjazz Deployed security patch for T361295 [15:05:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P59504 and previous config saved to /var/cache/conftool/dbconfig/20240404-150549-marostegui.json [15:07:01] Doing another security deploy [15:07:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 35.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:08:01] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:08:23] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:08:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T356166)', diff saved to 
https://phabricator.wikimedia.org/P59505 and previous config saved to /var/cache/conftool/dbconfig/20240404-150833-marostegui.json [15:08:38] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [15:10:23] !log beginning rolling hardware upgrades on titan200[12] T361229 [15:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] T361229: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229 [15:11:15] 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9689044 (10herron) [15:11:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 100%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59506 and previous config saved to /var/cache/conftool/dbconfig/20240404-151123-arnaudb.json [15:15:03] jouncebot: next [15:15:03] In 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1600) [15:15:11] Dreamy_Jazz: can you ping me when you’re done? I’d like to test something on mwdebug [15:15:20] Sure. 
[15:15:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 8%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59507 and previous config saved to /var/cache/conftool/dbconfig/20240404-151524-arnaudb.json [15:15:25] thx [15:15:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 8%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59508 and previous config saved to /var/cache/conftool/dbconfig/20240404-151537-arnaudb.json [15:15:42] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:15:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2113 (re)pooling @ 100%: Post clone repool (src)', diff saved to https://phabricator.wikimedia.org/P59509 and previous config saved to /var/cache/conftool/dbconfig/20240404-151547-arnaudb.json [15:16:33] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:16:35] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:16:45] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:16:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P59510 and previous config saved to /var/cache/conftool/dbconfig/20240404-151653-arnaudb.json [15:18:27] (JobUnavailable) firing: (4) Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:44] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:20:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P59511 and previous config saved to 
/var/cache/conftool/dbconfig/20240404-152056-marostegui.json [15:22:18] !log dreamyjazz Deployed security patch for T361296 [15:22:26] Lucas_WMDE: Done. [15:22:32] alright, thanks! [15:22:44] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:22:51] * Lucas_WMDE testing some stuff on mwdebug1002 [15:23:06] (if anyone else wants to deploy, that’s fine, just let me know that my changes on mwdebug1002 will be wiped by scap ^^) [15:23:26] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:23:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P59512 and previous config saved to /var/cache/conftool/dbconfig/20240404-152341-marostegui.json [15:24:25] 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9689099 (10Jhancock.wm) [15:27:21] alright, nothing to test after all ^^ [15:27:33] (wanted to look into T315510 but can’t reproduce the error) [15:27:34] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [15:27:38] * Lucas_WMDE done, all clear [15:29:14] db1124 down [15:29:19] hmm [15:29:24] acked [15:29:24] !incidents [15:29:24] 4565 (ACKED) Host db2214 (paged) - PING - Packet loss = 100% [15:29:24] 4563 (RESOLVED) ProbeDown sre (10.192.0.56 ip4 kubemaster2001:6443 probes/custom http_codfw_kube_apiserver_ip4 codfw) [15:29:25] 4562 (RESOLVED) ProbeDown sre (10.2.2.51 ip4 shellbox:4008 probes/service http_shellbox_ip4 eqiad) [15:29:27] 2214 [15:29:31] depooling first [15:29:37] jynus: hi [15:29:42] sorry, wrong person [15:29:52] sukhe: thanks [15:30:04] was it for j*yme :-) ? 
[15:30:15] no, I forgot again you are not a DBA [15:30:25] !log sukhe@cumin2002 dbctl commit (dc=all): 'depool db2214', diff saved to https://phabricator.wikimedia.org/P59513 and previous config saved to /var/cache/conftool/dbconfig/20240404-153023-sukhe.json [15:30:26] a db server died [15:30:26] maybe I can still help [15:30:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 16%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59514 and previous config saved to /var/cache/conftool/dbconfig/20240404-153030-arnaudb.json [15:30:32] jynus: db2114 is down [15:30:34] but yeah, just depool [15:30:37] done [15:30:40] ty [15:30:41] want me to file a task? [15:30:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 16%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59515 and previous config saved to /var/cache/conftool/dbconfig/20240404-153043-arnaudb.json [15:30:54] if you have the bandwidth yes, even if it is empty [15:30:59] on it [15:31:05] just with "X crashed" [15:31:05] is it a coincidence I see db2213 being repooled [15:31:09] while db2214 goes down [15:31:18] arnaudb: ^ [15:31:26] well, there are pools and depools all the time [15:31:33] I wouldn't look more into that [15:31:43] there is like 1 every 15 minutes 14/7 [15:31:48] should we make a ticket and move on? 
[15:31:49] *24/7 [15:31:52] ack [15:32:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P59516 and previous config saved to /var/cache/conftool/dbconfig/20240404-153200-arnaudb.json [15:32:07] yeah, unless there is ongoing production impact [15:32:47] even without the depool, mw already moves the connections away [15:32:57] but it will try to connect to it all the time, creating log spam [15:33:27] (JobUnavailable) resolved: (2) Reduced availability for job thanos-query-frontend in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:28] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9689123 (10hnowlan) What external paths should we be routing to what internal paths for this service? [15:33:50] it's backup but I will not pool it unless someone has looked at it [15:33:58] took a look at console anyways [15:34:01] server is up [15:34:06] so like lost networking [15:34:15] oh, ok [15:34:21] sukhe: yeah, better leave a check for the dbas [15:34:31] what does icinga say? [15:34:37] we could maybe downtime it [15:35:24] if there are no dbas around, I will do it for a couple of days [15:35:42] icinga says the server is up but could not connect to check slave lag. 
rescheduling checks [15:35:49] mutante: thanks [15:36:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T355609)', diff saved to https://phabricator.wikimedia.org/P59517 and previous config saved to /var/cache/conftool/dbconfig/20240404-153603-marostegui.json [15:36:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [15:36:07] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:36:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2214.codfw.wmnet with reason: depooled, see T361851 [15:36:11] T361851: db2214 is down - https://phabricator.wikimedia.org/T361851 [15:36:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [15:36:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2214.codfw.wmnet with reason: depooled, see T361851 [15:36:26] yeah, mutante, after a server crash mysql doesn't start automatically [15:36:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T355609)', diff saved to https://phabricator.wikimedia.org/P59518 and previous config saved to /var/cache/conftool/dbconfig/20240404-153626-marostegui.json [15:36:54] downtimed, depool *not removed*, task is at T361851 [15:36:57] so it is the process that is stopped, and we should leave it there until then [15:37:18] if it was me, after a crash, I would recover from the provisioning server to ensure 100% data consistency [15:37:40] thank you so much, sukhe, you saved me from doing all that! 
[15:38:02] jynus: np hth [15:38:04] so, icinga won't recover but i think that's normal [15:38:10] since mysql is NOT started on reboot on purpose [15:38:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:38:22] mutante: indeed [15:38:40] I can check but usually it is a memory crash [15:38:48] that's the #1 cause [15:38:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P59519 and previous config saved to /var/cache/conftool/dbconfig/20240404-153850-marostegui.json [15:40:12] the bad news is that I think those are just-bought servers [15:40:51] I ran 'ipmi-sel' but nothing from today [15:41:28] a bunch on Feb 28th [15:41:42] yeah looks very recent in netbox [15:41:46] the number of errors was almost 0, so probably the server was not fully pooled yet: https://logstash.wikimedia.org/goto/dba938f602530a9d3e4ebf869b0bf4b9 [15:42:13] so that's good in that it was being set up, bad in that it was newly set up [15:42:16] checking the hw logs [15:42:30] have to go afk for a little bit [15:43:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:43:28] mutante: thanks for checking! 
[15:43:36] (ProbeDown) firing: Service titan2002:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#titan2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:43:46] * sukhe stands by for another page [15:45:00] !log ayounsi@cumin1002 START - Cookbook sre.network.debug for Netbox circuit ID 108 [15:45:03] can someone log into db2214 ilo? It goes in a loop for me [15:45:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108 [15:45:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:45:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59520 and previous config saved to /var/cache/conftool/dbconfig/20240404-154535-arnaudb.json [15:45:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 25%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59521 and previous config saved to /var/cache/conftool/dbconfig/20240404-154549-arnaudb.json [15:47:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T360332)', diff saved to https://phabricator.wikimedia.org/P59522 and previous config saved to /var/cache/conftool/dbconfig/20240404-154707-arnaudb.json [15:47:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [15:47:11] T360332: Make the cupe_actor column nullable on WMF wikis - 
https://phabricator.wikimedia.org/T360332 [15:47:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [15:47:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T360332)', diff saved to https://phabricator.wikimedia.org/P59523 and previous config saved to /var/cache/conftool/dbconfig/20240404-154730-arnaudb.json [15:48:27] (JobUnavailable) firing: (3) Reduced availability for job thanos-query in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:49] the logs just say "generic crash" [15:50:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T360332)', diff saved to https://phabricator.wikimedia.org/P59524 and previous config saved to /var/cache/conftool/dbconfig/20240404-155000-arnaudb.json [15:50:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:53:10] 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9689218 (10Jhancock.wm) [15:53:27] (JobUnavailable) resolved: (3) Reduced availability for job thanos-query in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:53:38] 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - 
https://phabricator.wikimedia.org/T361229#9689223 (10herron) [15:53:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T356166)', diff saved to https://phabricator.wikimedia.org/P59525 and previous config saved to /var/cache/conftool/dbconfig/20240404-155357-marostegui.json [15:54:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance [15:54:02] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [15:54:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance [15:54:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T356166)', diff saved to https://phabricator.wikimedia.org/P59526 and previous config saved to /var/cache/conftool/dbconfig/20240404-155420-marostegui.json [15:54:37] !log depooling cp3068 for reimage (T360430) [15:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:44] T360430: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430 [15:55:04] 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9689231 (10herron) SSD and RAM upgrades have been installed. @fgiunchedi how did you want to configure the raid/filesystems on titan2001? 
[15:55:11] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3068.esams.wmnet [15:55:32] (03CR) 10Fabfur: [C:03+2] cp3068: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1015970 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh) [15:56:57] (03CR) 10Alexandros Kosiaris: "> I 'll try and see if I can reproduce" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [15:57:53] jouncebot: nowandnext [15:57:53] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [15:57:53] In 0 hour(s) and 2 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1600) [15:58:15] Gosh, a puppet window with actual patches in it. [15:58:18] Ah well. [15:58:23] :) [15:58:37] So much for my sneaky plan to deploy Wikifunctions services. [15:58:37] they are all simple and don't affect anything active right now. [15:58:48] honestly, you probably could. [15:58:50] Hmm, in that case maybe we can go in parallel. [15:58:52] * James_F nods. [15:59:39] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-03-05-140533 to 2024-04-04-132719 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017060 (https://phabricator.wikimedia.org/T348370) [15:59:42] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-03-05-140533 to 2024-04-04-132719 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017060 (https://phabricator.wikimedia.org/T348370) (owner: 10Jforrester) [16:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1600). [16:00:05] dwisehaupt: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:39] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-03-05-140533 to 2024-04-04-132719 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017060 (https://phabricator.wikimedia.org/T348370) (owner: 10Jforrester) [16:00:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59527 and previous config saved to /var/cache/conftool/dbconfig/20240404-160041-arnaudb.json [16:00:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59528 and previous config saved to /var/cache/conftool/dbconfig/20240404-160055-arnaudb.json [16:01:10] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:01:48] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:02:24] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:02:29] dwisehaupt: 👋 want to deploy these in any particular order? [16:02:31] (ProbeDown) resolved: (2) Service titan2002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:43] no particular order, they are all independent. [16:03:05] and as i mentioned, the service isn't in use yet so nothing to coordinate on our side. [16:03:34] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:03:38] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:03:39] okay cool -- I can just merge them all at once if that works for you? [16:03:45] that works. thanks! 
[16:04:10] i wasn't sure if i should have rebased them before the window. sorry if that adds any pain. [16:04:24] well, not really pain, just more work. [16:04:25] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3068.esams.wmnet with OS bullseye [16:04:35] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9689289 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3068.esams.wmnet with OS bullseye [16:04:37] (03CR) 10RLazarus: [C:03+2] Add cv and drush bin dirs to PATH on community crm [puppet] - 10https://gerrit.wikimedia.org/r/1016013 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:04:47] (03CR) 10RLazarus: [C:03+2] Force CIVICRM_TEMPLATE_COMPILE_CHECK to false [puppet] - 10https://gerrit.wikimedia.org/r/1016014 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:04:48] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:04:57] (03CR) 10RLazarus: [C:03+2] Enable the mariadb slow query log for civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016016 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:05:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P59529 and previous config saved to /var/cache/conftool/dbconfig/20240404-160508-arnaudb.json [16:05:37] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-02-26-150300 to 2024-04-03-210033 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017061 (https://phabricator.wikimedia.org/T320507) [16:05:43] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-02-26-150300 to 2024-04-03-210033 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017061 (https://phabricator.wikimedia.org/T320507) (owner: 10Jforrester) [16:06:18] dwisehaupt: when 
there's a conflict and it has to be done manually, it's nice to do it first, but not really a requirement -- otherwise doesn't matter [16:06:31] don't sweat it either way, basically :) [16:06:45] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-02-26-150300 to 2024-04-03-210033 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017061 (https://phabricator.wikimedia.org/T320507) (owner: 10Jforrester) [16:07:02] and, puppet-merge complete -- I'll let you run puppet and test on your own, yeah? let me know if you need a rollback or anything else [16:07:06] heh. if there was a conflict on these it's because i was sleep coding. :) [16:07:22] cool thanks! yeah. i can do it from here. [16:07:35] 👍 [16:07:37] thanks! [16:07:41] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9689309 (10Scott_French) Additionally, two timeline questions: * When do you anticipate having a min... [16:08:47] looks good from here. 
:) [16:09:04] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:10:34] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:12:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59530 and previous config saved to /var/cache/conftool/dbconfig/20240404-161547-arnaudb.json [16:16:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59531 and previous config saved to /var/cache/conftool/dbconfig/20240404-161601-arnaudb.json [16:20:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P59532 and previous config saved to /var/cache/conftool/dbconfig/20240404-162019-arnaudb.json [16:21:09] 10ops-codfw: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856 (10Jhancock.wm) 03NEW [16:22:35] 10ops-codfw: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9689404 (10Jhancock.wm) refresh task: https://phabricator.wikimedia.org/T325215 [16:23:14] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 11), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9689413 (10Scott_French) a:03Scott_French [16:27:15] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3068.esams.wmnet with reason: host reimage [16:27:25] (03PS2) 
10Jcrespo: mariadb: Reenable notifications for backup source host db2198 [puppet] - 10https://gerrit.wikimedia.org/r/1017057 (https://phabricator.wikimedia.org/T355422) [16:28:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T355609)', diff saved to https://phabricator.wikimedia.org/P59533 and previous config saved to /var/cache/conftool/dbconfig/20240404-162801-marostegui.json [16:28:04] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:29:00] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: 14decommission db2104.codfw.wmnet - 14https://phabricator.wikimedia.org/T361779#9689433 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:30:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59534 and previous config saved to /var/cache/conftool/dbconfig/20240404-163053-arnaudb.json [16:31:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 100%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59535 and previous config saved to /var/cache/conftool/dbconfig/20240404-163107-arnaudb.json [16:31:13] (03CR) 10Jcrespo: [C:03+2] mariadb: Reenable notifications for backup source host db2198 [puppet] - 10https://gerrit.wikimedia.org/r/1017057 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [16:32:30] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3068.esams.wmnet with reason: host reimage [16:35:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T360332)', diff saved to https://phabricator.wikimedia.org/P59536 and previous config saved to /var/cache/conftool/dbconfig/20240404-163526-arnaudb.json [16:35:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:35:30] T360332: Make the 
cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:35:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:35:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T360332)', diff saved to https://phabricator.wikimedia.org/P59537 and previous config saved to /var/cache/conftool/dbconfig/20240404-163549-arnaudb.json [16:38:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T360332)', diff saved to https://phabricator.wikimedia.org/P59538 and previous config saved to /var/cache/conftool/dbconfig/20240404-163819-arnaudb.json [16:41:21] (03CR) 10Scott French: "Thanks for the review, Riccardo." [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762) (owner: 10Scott French) [16:41:36] (03CR) 10Scott French: [C:03+2] Improve etcdmirror shutdown behavior [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762) (owner: 10Scott French) [16:42:24] (03Merged) 10jenkins-bot: Improve etcdmirror shutdown behavior [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762) (owner: 10Scott French) [16:42:58] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm [16:43:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P59539 and previous config saved to /var/cache/conftool/dbconfig/20240404-164309-marostegui.json [16:45:04] (03PS1) 10Cory Massaro: Update config variables to point to the correct WASM binaries. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017102 [16:45:48] (03PS2) 10Cory Massaro: Update config variables to point to the correct WASM binaries. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1017102 (https://phabricator.wikimedia.org/T361854) [16:49:06] (03CR) 10Bking: [V:03+1] Remove flink RBAC snowflakes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015343 (https://phabricator.wikimedia.org/T326409) (owner: 10JMeybohm) [16:49:15] (03CR) 10Bking: [C:03+1] Remove flink RBAC snowflakes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015343 (https://phabricator.wikimedia.org/T326409) (owner: 10JMeybohm) [16:49:41] (03PS2) 10Yahya: Enable abusefilter block at bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) [16:51:55] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [16:52:54] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Platform-SRE (2024.03.25 - 2024.04.14): 14create and deploy new Elastic Curator deb package - 14https://phabricator.wikimedia.org/T361105#9689591 (10bking) 05Resolved→03Declined [16:53:15] (03PS4) 10Scott French: Add support for an optional ignored-keys pattern [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1008944 (https://phabricator.wikimedia.org/T358636) [16:53:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P59540 and previous config saved to /var/cache/conftool/dbconfig/20240404-165328-arnaudb.json [16:54:42] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [16:54:57] (03CR) 10Scott French: [C:03+2] Add support for an optional ignored-keys pattern [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1008944 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [16:55:47] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host cp3068.esams.wmnet with OS bullseye [16:55:50] (Merged) jenkins-bot: Add support for an optional ignored-keys pattern [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1008944 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [16:56:00] ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9689603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3068.esams.wmnet with OS bullseye completed: - cp3068 (**PASS**)... [16:58:02] (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: Yahya) [16:58:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P59541 and previous config saved to /var/cache/conftool/dbconfig/20240404-165816-marostegui.json [16:58:54] (CR) CI reject: [V:-1] Enable abusefilter block at bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: Yahya) [17:00:04] bd808: Time to snap out of that daydream and deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1700). 
[17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1700) [17:03:48] (PS3) Jforrester: wikifunctions: Update evaluator config to point to the correct WASM binaries [deployment-charts] - https://gerrit.wikimedia.org/r/1017102 (https://phabricator.wikimedia.org/T361854) (owner: Cory Massaro) [17:04:09] (PS8) Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) [17:04:19] (CR) Anzx: [C:-1] Enable abusefilter block at bnwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: Yahya) [17:07:57] (CR) Jforrester: [C:+2] wikifunctions: Update evaluator config to point to the correct WASM binaries [deployment-charts] - https://gerrit.wikimedia.org/r/1017102 (https://phabricator.wikimedia.org/T361854) (owner: Cory Massaro) [17:08:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P59542 and previous config saved to /var/cache/conftool/dbconfig/20240404-170836-arnaudb.json [17:08:39] (PS9) Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) [17:08:50] (Merged) jenkins-bot: wikifunctions: Update evaluator config to point to the correct WASM binaries [deployment-charts] - https://gerrit.wikimedia.org/r/1017102 (https://phabricator.wikimedia.org/T361854) (owner: Cory Massaro) [17:09:12] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3068.esams.wmnet [17:11:27] ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9689660 (Fabfur) [17:12:12] SRE, 
10MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860 (10Xover) 03NEW [17:12:25] (03CR) 10Scott French: "Thank you, Riccardo!" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:12:35] (03PS3) 10Yahya: Enable abusefilter block at bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) [17:12:50] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:13:20] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:13:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T355609)', diff saved to https://phabricator.wikimedia.org/P59543 and previous config saved to /var/cache/conftool/dbconfig/20240404-171324-marostegui.json [17:13:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [17:13:37] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:13:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [17:13:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T355609)', diff saved to https://phabricator.wikimedia.org/P59544 and previous config saved to /var/cache/conftool/dbconfig/20240404-171347-marostegui.json [17:13:49] p858snake|cloud: See T361860. Phab limits who can create private pastes, but the headers contain nothing obviously of relevance. [17:13:59] T361860: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860 [17:14:42] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T361798#9689684 (10Aklapper) @Ospingou: Hi and welcome! 
:) Based on this request, I assume that you are a WMF staff member or contractor. Could you please [connect](https://phabricator.wikimedia.org/set... [17:15:30] (CR) Anzx: Enable abusefilter block at bnwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: Yahya) [17:15:38] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [17:16:57] (PS1) RLazarus: admin: Add ospingou to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1017107 (https://phabricator.wikimedia.org/T361798) [17:17:29] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [17:17:36] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [17:18:00] ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9689708 (ssingh) Traffic has been reimaging hosts in esams (we have done three so far for T360430) and we observed that we didn't have... 
[17:19:37] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [17:20:43] (03CR) 10Ssingh: [C:03+1] admin: Add ospingou to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1017107 (https://phabricator.wikimedia.org/T361798) (owner: 10RLazarus) [17:21:15] (03CR) 10RLazarus: [C:03+2] admin: Add ospingou to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1017107 (https://phabricator.wikimedia.org/T361798) (owner: 10RLazarus) [17:22:22] !log installing qemu security updates on bookworm [17:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T360332)', diff saved to https://phabricator.wikimedia.org/P59545 and previous config saved to /var/cache/conftool/dbconfig/20240404-172343-arnaudb.json [17:23:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:23:55] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:24:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:24:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T360332)', diff saved to https://phabricator.wikimedia.org/P59546 and previous config saved to /var/cache/conftool/dbconfig/20240404-172408-arnaudb.json [17:25:36] (03PS4) 10Yahya: Enable abusefilter block at bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) [17:27:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T360332)', diff saved to https://phabricator.wikimedia.org/P59547 and previous config saved to /var/cache/conftool/dbconfig/20240404-172739-arnaudb.json [17:28:00] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: 14Grant Access to 
for  - 14https://phabricator.wikimedia.org/T361798#9689764 (10RLazarus) 05Open→03Resolved p:05Triage→03Medium a:03RLazarus 14Done! ` rzl@mwmaint1002:~$ ldapsearch -x cn=wmf | grep ospingou member: uid=os... [17:28:53] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [17:29:48] !log installing isl bugfix updates from Bookworm point release [17:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:02] (03CR) 10Anzx: [C:03+1] Enable abusefilter block at bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [17:31:04] (03CR) 10Anzx: [C:03+1] Enable abusefilter block at bnwiki (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [17:34:13] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9689810 (10MoritzMuehlenhoff) [17:35:00] (03CR) 10Anzx: [C:03+1] "please schedule this for deployment in any of backport windows https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [17:35:54] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:36:00] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:37:27] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4052.ulsfo.wmnet [17:40:51] (03CR) 10Ebernhardson: WIP: Add Flink alerts for Cirrus Streaming Updater (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [17:42:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1202', diff saved to https://phabricator.wikimedia.org/P59548 and previous config saved to /var/cache/conftool/dbconfig/20240404-174246-arnaudb.json [17:45:53] 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9689860 (10andrea.denisse) 05Open→03In progress [17:46:31] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4052.ulsfo.wmnet [17:48:28] !depool cp4052 to prepare for reimaging [17:48:28] for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done [17:48:32] !log depool cp4052 to prepare for reimaging [17:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:34] every single time [17:48:41] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm [17:49:23] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [17:52:43] (03CR) 10Scott French: [C:03+2] Improve support for mirroring the full keyspace [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:52:45] that wm-bot response sure is something [17:53:02] a form of documentation-out-of-date I didn't even know we had [17:53:19] 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9689894 (10andrea.denisse) [17:53:25] ha, I think we keep it around to remember the old times or something :P [17:53:32] (03Merged) 10jenkins-bot: Improve support for mirroring the full keyspace [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 
10Scott French) [17:57:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P59549 and previous config saved to /var/cache/conftool/dbconfig/20240404-175756-arnaudb.json [18:00:04] jnuche and jeena: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T1800). [18:00:36] (03CR) 10Yahya: [C:03+1] "done. Thank you for the help." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016882 (https://phabricator.wikimedia.org/T361852) (owner: 10Yahya) [18:01:53] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [18:04:35] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [18:05:17] (03PS1) 10Bking: flink-kubernetes-operator: restart failed jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017115 [18:07:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T356166)', diff saved to https://phabricator.wikimedia.org/P59550 and previous config saved to /var/cache/conftool/dbconfig/20240404-180733-marostegui.json [18:07:37] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:09:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T355609)', diff saved to https://phabricator.wikimedia.org/P59551 and previous config saved to /var/cache/conftool/dbconfig/20240404-180913-marostegui.json [18:09:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:13:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T360332)', diff saved to https://phabricator.wikimedia.org/P59552 and previous config 
saved to /var/cache/conftool/dbconfig/20240404-181303-arnaudb.json [18:13:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [18:13:07] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:13:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [18:13:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T360332)', diff saved to https://phabricator.wikimedia.org/P59553 and previous config saved to /var/cache/conftool/dbconfig/20240404-181326-arnaudb.json [18:16:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T360332)', diff saved to https://phabricator.wikimedia.org/P59554 and previous config saved to /var/cache/conftool/dbconfig/20240404-181601-arnaudb.json [18:16:55] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [18:19:45] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9690028 (10RLazarus) p:05Triage→03High Clinic duty SRE here -- I/F, can you start investigating this at the MTA end? Triaging this to High in case... 
[18:19:50] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [18:22:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P59555 and previous config saved to /var/cache/conftool/dbconfig/20240404-182241-marostegui.json [18:23:32] (03PS1) 10Andrea Denisse: performance: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1017117 (https://phabricator.wikimedia.org/T360414) [18:23:48] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye [18:24:12] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [18:24:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P59556 and previous config saved to /var/cache/conftool/dbconfig/20240404-182421-marostegui.json [18:26:39] (03PS2) 10Andrea Denisse: performance: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1017117 (https://phabricator.wikimedia.org/T360414) [18:27:20] (03CR) 10Dzahn: [C:03+1] "look good to me - we went through this in a 1:1 and checked all the existing names on the SAN on the old cert - but they are all not neede" [puppet] - 10https://gerrit.wikimedia.org/r/1017117 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:28:14] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9690061 (10ssingh) Update: I ran the firmware-upgrade cookbook on cp4052 and updated it's firmware to `6.10.30.20`, did a `racreset` to... 
[18:31:01] !log Disabling Puppet on the webperf hosts part of the cergen to CFSSL migration - T360414 [18:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:05] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [18:31:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P59557 and previous config saved to /var/cache/conftool/dbconfig/20240404-183108-arnaudb.json [18:36:31] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9690121 (10ssingh) Any other opinions/thoughts on how we can try and fix this and where? I am very happy to do the legwork but kind of l... [18:37:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P59558 and previous config saved to /var/cache/conftool/dbconfig/20240404-183748-marostegui.json [18:39:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P59559 and previous config saved to /var/cache/conftool/dbconfig/20240404-183928-marostegui.json [18:39:34] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on performance.wikimedia.org with reason: Downtiming the webperf hosts part of the cergen to CFSSL migration - T360414 [18:39:35] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on performance.wikimedia.org with reason: Downtiming the webperf hosts part of the cergen to CFSSL migration - T360414 [18:39:37] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [18:40:58] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 
webperf2003.codfw.wmnet,webperf1003.eqiad.wmnet with reason: Downtiming the webperf hosts part of the cergen to CFSSL migration - T360414 [18:41:14] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on webperf2003.codfw.wmnet,webperf1003.eqiad.wmnet with reason: Downtiming the webperf hosts part of the cergen to CFSSL migration - T360414 [18:41:28] (03CR) 10Andrea Denisse: [C:03+2] performance: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1017117 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:41:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1017117 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:43:12] (03PS4) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [18:43:31] (03CR) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [18:44:56] (03PS3) 10Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909763 [18:46:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P59560 and previous config saved to /var/cache/conftool/dbconfig/20240404-184616-arnaudb.json [18:46:50] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [18:49:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [18:51:20] 06SRE, 06SRE Observability: prometheus-icinga-am.service Fails to Start on alert2001 - 
https://phabricator.wikimedia.org/T358838#9690161 (10lmata) [18:51:55] (03CR) 10CI reject: [V:04-1] elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [18:52:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T356166)', diff saved to https://phabricator.wikimedia.org/P59561 and previous config saved to /var/cache/conftool/dbconfig/20240404-185256-marostegui.json [18:52:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance [18:53:00] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [18:53:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance [18:53:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T356166)', diff saved to https://phabricator.wikimedia.org/P59562 and previous config saved to /var/cache/conftool/dbconfig/20240404-185319-marostegui.json [18:54:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T355609)', diff saved to https://phabricator.wikimedia.org/P59563 and previous config saved to /var/cache/conftool/dbconfig/20240404-185436-marostegui.json [18:54:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [18:54:39] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:54:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [18:54:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T355609)', diff saved to https://phabricator.wikimedia.org/P59564 and previous config saved to 
/var/cache/conftool/dbconfig/20240404-185458-marostegui.json [18:57:39] 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9690206 (10Gehel) p:05Triage→03High [18:57:41] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9690201 (10RobH) [18:57:47] (03PS5) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [18:57:48] (03CR) 10Urbanecm: [C:04-1] "(needs rebase)" [puppet] - 10https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:57:56] 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9690208 (10Gehel) [18:58:09] (03PS1) 10Andrew Bogott: openstack: nova-compute: persist compute node id for cloudvirt1031 [puppet] - 10https://gerrit.wikimedia.org/r/1017124 (https://phabricator.wikimedia.org/T357631) [19:00:57] (03CR) 10Andrew Bogott: "Arturo, I'm going to merge this now to fix an alert, but I'm interested in your opinion after the fact." 
[puppet] - 10https://gerrit.wikimedia.org/r/1017124 (https://phabricator.wikimedia.org/T357631) (owner: 10Andrew Bogott) [19:00:58] (03CR) 10Andrew Bogott: [C:03+2] openstack: nova-compute: persist compute node id for cloudvirt1031 [puppet] - 10https://gerrit.wikimedia.org/r/1017124 (https://phabricator.wikimedia.org/T357631) (owner: 10Andrew Bogott) [19:01:18] (03PS6) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [19:01:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T360332)', diff saved to https://phabricator.wikimedia.org/P59565 and previous config saved to /var/cache/conftool/dbconfig/20240404-190123-arnaudb.json [19:01:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [19:01:34] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:01:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [19:01:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T360332)', diff saved to https://phabricator.wikimedia.org/P59566 and previous config saved to /var/cache/conftool/dbconfig/20240404-190146-arnaudb.json [19:05:27] (03PS7) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [19:06:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T360332)', diff saved to https://phabricator.wikimedia.org/P59567 and previous config saved to /var/cache/conftool/dbconfig/20240404-190616-arnaudb.json [19:10:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet 
with OS bullseye [19:11:00] (03PS2) 10Ryan Kemper: flink-kubernetes-operator: restart failed jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870) (owner: 10Bking) [19:11:32] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=(cdn|ats-be) [19:12:07] (03CR) 10CI reject: [V:04-1] elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [19:13:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:21:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P59568 and previous config saved to /var/cache/conftool/dbconfig/20240404-192123-arnaudb.json [19:24:07] 10ops-codfw, 06Infrastructure-Foundations, 10netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871 (10Papaul) 03NEW [19:25:00] (03PS3) 10Ryan Kemper: flink-kubernetes-operator: restart failed jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870) (owner: 10Bking) [19:36:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P59569 and previous config saved to /var/cache/conftool/dbconfig/20240404-193631-arnaudb.json [19:47:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T355609)', diff saved to https://phabricator.wikimedia.org/P59570 and previous config saved to /var/cache/conftool/dbconfig/20240404-194739-marostegui.json 
[19:47:43] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[19:50:03] (PS1) Andrea Denisse: ssl: Remove performance.discovery.wmnet.crt certificate [puppet] - https://gerrit.wikimedia.org/r/1017125 (https://phabricator.wikimedia.org/T360414)
[19:51:07] (CR) Dzahn: [C:+1] ssl: Remove performance.discovery.wmnet.crt certificate [puppet] - https://gerrit.wikimedia.org/r/1017125 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[19:51:26] (CR) Andrea Denisse: [C:+2] ssl: Remove performance.discovery.wmnet.crt certificate [puppet] - https://gerrit.wikimedia.org/r/1017125 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[19:51:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T360332)', diff saved to https://phabricator.wikimedia.org/P59571 and previous config saved to /var/cache/conftool/dbconfig/20240404-195138-arnaudb.json
[19:51:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:51:47] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[19:51:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:52:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance
[19:52:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance
[19:53:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance
[19:53:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance
[19:53:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2108
(T360332)', diff saved to https://phabricator.wikimedia.org/P59572 and previous config saved to /var/cache/conftool/dbconfig/20240404-195333-arnaudb.json
[19:56:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T360332)', diff saved to https://phabricator.wikimedia.org/P59573 and previous config saved to /var/cache/conftool/dbconfig/20240404-195615-arnaudb.json
[19:57:18] (PS2) Scott French: WIP: role::configcluster: Add a dedicated ACL for /spicerack keyspace [puppet] - https://gerrit.wikimedia.org/r/1016456
[19:57:18] (PS4) Scott French: WIP: profile::etcd::tlsproxy: Add support for path-level read-only mode [puppet] - https://gerrit.wikimedia.org/r/1016457
[19:57:18] (PS4) Scott French: DNM: role::configcluster: Make /spicerack read-only [puppet] - https://gerrit.wikimedia.org/r/1016458
[19:59:01] (CR) Scott French: "Thanks for reviewing this chain, Riccardo!" [puppet] - https://gerrit.wikimedia.org/r/1016456 (owner: Scott French)
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240404T2000).
[20:00:04] ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:12] (PS1) Andrea Denisse: Delete dummy TLS certificate for the performance host [labs/private] - https://gerrit.wikimedia.org/r/1017146 (https://phabricator.wikimedia.org/T333615)
[20:00:36] (CR) Andrea Denisse: [C:+2] Delete dummy TLS certificate for the performance host [labs/private] - https://gerrit.wikimedia.org/r/1017146 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[20:00:42] (CR) Andrea Denisse: [V:+2 C:+2] Delete dummy TLS certificate for the performance host [labs/private] - https://gerrit.wikimedia.org/r/1017146 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[20:02:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P59574 and previous config saved to /var/cache/conftool/dbconfig/20240404-200247-marostegui.json
[20:03:29] SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9690601 (andrea.denisse)
[20:04:00] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9690604 (andrea.denisse)
[20:04:25] (CR) Ebernhardson: [C:+1] flink-kubernetes-operator: restart failed jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870) (owner: Bking)
[20:04:45] (CR) Ebernhardson: [C:+1] flink-kubernetes-operator: restart failed jobs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870) (owner: Bking)
[20:08:35] * cjming waves -- is around tho late
[20:08:39] ebernhardson: do you need someone to deploy?
[20:11:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P59575 and previous config saved to /var/cache/conftool/dbconfig/20240404-201126-arnaudb.json
[20:12:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:17:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P59576 and previous config saved to /var/cache/conftool/dbconfig/20240404-201755-marostegui.json
[20:22:50] (CR) Andrew Bogott: "Weirdly this causes cloud-init to also get uninstalled, which breaks VM setup in cloud-vps. See https://phabricator.wikimedia.org/T361749" [puppet] - https://gerrit.wikimedia.org/r/1016345 (owner: Muehlenhoff)
[20:24:50] (PS4) Bking: flink-kubernetes-operator: restart failed jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870)
[20:26:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P59577 and previous config saved to /var/cache/conftool/dbconfig/20240404-202634-arnaudb.json
[20:30:57] (PS5) Bking: flink-kubernetes-operator: restart failed jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870)
[20:31:14] (CR) Bking: flink-kubernetes-operator: restart failed jobs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870) (owner: Bking)
[20:31:51] (PS1) Andrew Bogott: cloud-vps instances: ensure cloud-init is forever installed [puppet] - https://gerrit.wikimedia.org/r/1017150
(https://phabricator.wikimedia.org/T361749)
[20:33:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T355609)', diff saved to https://phabricator.wikimedia.org/P59578 and previous config saved to /var/cache/conftool/dbconfig/20240404-203302-marostegui.json
[20:33:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[20:33:14] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[20:33:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[20:34:01] (PS2) Andrew Bogott: cloud-vps instances: ensure cloud-init is forever installed [puppet] - https://gerrit.wikimedia.org/r/1017150 (https://phabricator.wikimedia.org/T361749)
[20:34:11] SRE, MediaWiki-General, MediaWiki-libs-Stats, observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685#9690698 (lmata)
[20:34:53] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1017150 (https://phabricator.wikimedia.org/T361749) (owner: Andrew Bogott)
[20:40:05] (PS3) Andrew Bogott: cloud-vps instances: ensure cloud-init is forever installed [puppet] - https://gerrit.wikimedia.org/r/1017150 (https://phabricator.wikimedia.org/T361749)
[20:40:19] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1017150 (https://phabricator.wikimedia.org/T361749) (owner: Andrew Bogott)
[20:40:36] (PS1) Tchanders: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath [mediawiki-config] - https://gerrit.wikimedia.org/r/1017152 (https://phabricator.wikimedia.org/T361884)
[20:41:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T360332)', diff saved to https://phabricator.wikimedia.org/P59579 and previous config saved to
/var/cache/conftool/dbconfig/20240404-204141-arnaudb.json
[20:41:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance
[20:41:45] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[20:41:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance
[20:42:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2120 (T360332)', diff saved to https://phabricator.wikimedia.org/P59580 and previous config saved to /var/cache/conftool/dbconfig/20240404-204204-arnaudb.json
[20:44:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T360332)', diff saved to https://phabricator.wikimedia.org/P59581 and previous config saved to /var/cache/conftool/dbconfig/20240404-204446-arnaudb.json
[20:51:34] (PS1) Cwhite: wmerrors: add config and code to copy stats to dogstatsd [puppet] - https://gerrit.wikimedia.org/r/1017078 (https://phabricator.wikimedia.org/T356814)
[20:54:52] (CR) Andrew Bogott: [C:+2] cloud-vps instances: ensure cloud-init is forever installed [puppet] - https://gerrit.wikimedia.org/r/1017150 (https://phabricator.wikimedia.org/T361749) (owner: Andrew Bogott)
[20:59:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P59582 and previous config saved to /var/cache/conftool/dbconfig/20240404-205953-arnaudb.json
[21:02:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T356166)', diff saved to https://phabricator.wikimedia.org/P59583 and previous config saved to /var/cache/conftool/dbconfig/20240404-210230-marostegui.json
[21:02:35] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[21:11:56] (PS1) Andrew Bogott: Revert "Uninstall eject on VMs" [puppet] - https://gerrit.wikimedia.org/r/1017155 (https://phabricator.wikimedia.org/T361749)
[21:12:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[21:12:41] (CR) Andrew Bogott: "reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017155" [puppet] - https://gerrit.wikimedia.org/r/1016345 (owner: Muehlenhoff)
[21:12:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[21:12:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T355609)', diff saved to https://phabricator.wikimedia.org/P59584 and previous config saved to /var/cache/conftool/dbconfig/20240404-211248-marostegui.json
[21:12:52] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[21:15:01] (CR) Andrew Bogott: [C:+2] Revert "Uninstall eject on VMs" [puppet] - https://gerrit.wikimedia.org/r/1017155 (https://phabricator.wikimedia.org/T361749) (owner: Andrew Bogott)
[21:15:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P59585 and previous config saved to /var/cache/conftool/dbconfig/20240404-211501-arnaudb.json
[21:17:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P59586 and previous config saved to /var/cache/conftool/dbconfig/20240404-211738-marostegui.json
[21:24:51] (PS12) Bking: WIP: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[21:26:46] (PS13) Bking: WIP: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[21:27:51] (CR) CI reject: [V:-1] WIP: Add Flink alerts
for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[21:30:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T360332)', diff saved to https://phabricator.wikimedia.org/P59587 and previous config saved to /var/cache/conftool/dbconfig/20240404-213008-arnaudb.json
[21:30:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance
[21:30:17] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[21:30:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance
[21:30:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T360332)', diff saved to https://phabricator.wikimedia.org/P59588 and previous config saved to /var/cache/conftool/dbconfig/20240404-213031-arnaudb.json
[21:32:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P59589 and previous config saved to /var/cache/conftool/dbconfig/20240404-213245-marostegui.json
[21:33:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T360332)', diff saved to https://phabricator.wikimedia.org/P59590 and previous config saved to /var/cache/conftool/dbconfig/20240404-213317-arnaudb.json
[21:40:24] (PS14) Bking: WIP: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[21:41:30] (CR) CI reject: [V:-1] WIP: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[21:44:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.091s -
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:47:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T356166)', diff saved to https://phabricator.wikimedia.org/P59591 and previous config saved to /var/cache/conftool/dbconfig/20240404-214753-marostegui.json
[21:47:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[21:48:02] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[21:48:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[21:48:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T356166)', diff saved to https://phabricator.wikimedia.org/P59592 and previous config saved to /var/cache/conftool/dbconfig/20240404-214817-marostegui.json
[21:48:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P59593 and previous config saved to /var/cache/conftool/dbconfig/20240404-214824-arnaudb.json
[21:49:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 895.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:55:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T355609)', diff saved to
https://phabricator.wikimedia.org/P59594 and previous config saved to /var/cache/conftool/dbconfig/20240404-215557-marostegui.json
[21:56:01] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[22:02:57] (PS15) Bking: WIP: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[22:03:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P59595 and previous config saved to /var/cache/conftool/dbconfig/20240404-220331-arnaudb.json
[22:08:33] (CR) Bking: WIP: Add Flink alerts for Cirrus Streaming Updater (3 comments) [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[22:11:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P59596 and previous config saved to /var/cache/conftool/dbconfig/20240404-221104-marostegui.json
[22:18:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T360332)', diff saved to https://phabricator.wikimedia.org/P59597 and previous config saved to /var/cache/conftool/dbconfig/20240404-221839-arnaudb.json
[22:18:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance
[22:18:44] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[22:18:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance
[22:19:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T360332)', diff saved to https://phabricator.wikimedia.org/P59598 and previous config saved to /var/cache/conftool/dbconfig/20240404-221903-arnaudb.json
[22:21:42] !log arnaudb@cumin1002 dbctl commit
(dc=all): 'Repooling after maintenance db2150 (T360332)', diff saved to https://phabricator.wikimedia.org/P59599 and previous config saved to /var/cache/conftool/dbconfig/20240404-222141-arnaudb.json
[22:26:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P59600 and previous config saved to /var/cache/conftool/dbconfig/20240404-222612-marostegui.json
[22:26:45] (PS2) Tim Starling: Switch block schema to read-new/write-new mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1006181 (https://phabricator.wikimedia.org/T355034)
[22:30:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.162s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:35:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.162s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:36:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P59601 and previous config saved to /var/cache/conftool/dbconfig/20240404-223649-arnaudb.json
[22:41:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T355609)', diff saved to https://phabricator.wikimedia.org/P59602 and previous config saved to /var/cache/conftool/dbconfig/20240404-224119-marostegui.json
[22:41:22] !log marostegui@cumin1002 START - Cookbook
sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[22:41:23] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[22:41:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[22:51:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P59603 and previous config saved to /var/cache/conftool/dbconfig/20240404-225156-arnaudb.json
[23:07:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T360332)', diff saved to https://phabricator.wikimedia.org/P59604 and previous config saved to /var/cache/conftool/dbconfig/20240404-230704-arnaudb.json
[23:07:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance
[23:07:08] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[23:07:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance
[23:07:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[23:07:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[23:07:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T360332)', diff saved to https://phabricator.wikimedia.org/P59605 and previous config saved to /var/cache/conftool/dbconfig/20240404-230743-arnaudb.json
[23:10:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T360332)', diff saved to https://phabricator.wikimedia.org/P59606 and previous config saved to /var/cache/conftool/dbconfig/20240404-231020-arnaudb.json
[23:17:06] (CR) Cwhite:
[C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1015045 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[23:21:42] (PS5) Krinkle: codesearch: Enable network=host and set CODESEARCH_HOUND_BASE [puppet] - https://gerrit.wikimedia.org/r/1016480 (https://phabricator.wikimedia.org/T361899)
[23:25:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P59608 and previous config saved to /var/cache/conftool/dbconfig/20240404-232528-arnaudb.json
[23:33:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1017079
[23:38:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1017079 (owner: TrainBranchBot)
[23:40:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P59609 and previous config saved to /var/cache/conftool/dbconfig/20240404-234035-arnaudb.json
[23:55:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T360332)', diff saved to https://phabricator.wikimedia.org/P59610 and previous config saved to /var/cache/conftool/dbconfig/20240404-235543-arnaudb.json
[23:55:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance
[23:55:47] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[23:55:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on
db2168.codfw.wmnet with reason: Maintenance
[23:56:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59611 and previous config saved to /var/cache/conftool/dbconfig/20240404-235606-arnaudb.json
[23:58:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:58:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59612 and previous config saved to /var/cache/conftool/dbconfig/20240404-235843-arnaudb.json