[00:01:21] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1015459 (owner: TrainBranchBot)
[00:25:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 869.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:30:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 869.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:03:53] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T361533 (phaultfinder) NEW
[01:07:14] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.25 [core] (wmf/1.42.0-wmf.25) - https://gerrit.wikimedia.org/r/1015460 (https://phabricator.wikimedia.org/T360157)
[01:07:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.42.0-wmf.25 [core] (wmf/1.42.0-wmf.25) - https://gerrit.wikimedia.org/r/1015460 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[01:13:01] (PS1) Andrew Bogott: Add another profile::openstack::eqiad1::nova::db_pass [labs/private] - https://gerrit.wikimedia.org/r/1016053
[01:13:22] (CR) Andrew Bogott: [V:+2 C:+2] Add another profile::openstack::eqiad1::nova::db_pass [labs/private] - https://gerrit.wikimedia.org/r/1016053 (owner: Andrew Bogott)
[01:13:45] (CR) Andrew Bogott: [C:+2] role::wmcs::openstack::eqiad1::cinder_backups: include envscripts [puppet] - https://gerrit.wikimedia.org/r/1016023 (owner: Andrew Bogott)
[01:16:19] (CR) Andrew Bogott: [C:+2] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016023 (owner: Andrew Bogott)
[01:24:03] (PS1) Andrew Bogott: openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055
[01:24:49] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T361535 (phaultfinder) NEW
[01:28:06] (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.25 [core] (wmf/1.42.0-wmf.25) - https://gerrit.wikimedia.org/r/1015460 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[01:28:48] (PS2) Andrew Bogott: openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055
[01:38:20] (PS1) Andrew Bogott: Move an eqiad-specific designate setting to the 'common' tree [puppet] - https://gerrit.wikimedia.org/r/1016058
[01:40:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[01:42:10] (PS3) Andrew Bogott: openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055
[01:45:17] (PS4) Andrew Bogott: openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055
[01:48:52] (CR) Andrew Bogott: [V:+2 C:+2] openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055 (owner: Andrew Bogott)
[01:49:00] (CR) Andrew Bogott: [C:+2] Move an eqiad-specific designate setting to the 'common' tree [puppet] - https://gerrit.wikimedia.org/r/1016058 (owner: Andrew Bogott)
[01:50:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[02:00:06] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T0200)
[02:18:27] (PS1) Andrew Bogott: eqiad1::cinder_backups: include observerenv rather than envscripts [puppet] - https://gerrit.wikimedia.org/r/1016060
[02:18:46] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016060 (owner: Andrew Bogott)
[02:22:30] (CR) Andrew Bogott: [C:+2] eqiad1::cinder_backups: include observerenv rather than envscripts [puppet] - https://gerrit.wikimedia.org/r/1016060 (owner: Andrew Bogott)
[02:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:37:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:49] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1015304 (owner: L10n-bot)
[02:41:26] (PS1) Andrew Bogott: wmcs-backup: use novaobserver instead of novaadmin [puppet] - https://gerrit.wikimedia.org/r/1016064
[02:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:46:06] (CR) Andrew Bogott: [C:+2] wmcs-backup: use novaobserver instead of novaadmin [puppet] - https://gerrit.wikimedia.org/r/1016064 (owner: Andrew Bogott)
[02:50:39] (PS1) Tim Starling: WMCS: Read from the new block/block_target tables [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034)
[02:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:51:33] (CR) Tim Starling: "This is untested and I have a low confidence in it. I would appreciate advice on how to test it." [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T0300)
[03:02:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:25] !log mwpresync@deploy1002 Pruned MediaWiki: 1.42.0-wmf.22 (duration: 03m 20s)
[03:04:57] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016069 (https://phabricator.wikimedia.org/T360157)
[03:04:59] (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016069 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[03:05:35] (CR) CI reject: [V:-1] testwikis wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016069 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[03:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:12:21] ops-codfw, SRE, cloud-services-team, DC-Ops: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537 (Andrew) NEW
[03:16:00] ops-codfw, SRE, cloud-services-team, DC-Ops: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537#9678267 (Andrew)
[03:17:52] ops-codfw, SRE, cloud-services-team, DC-Ops: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537#9678269 (Andrew) @cmooney do you recall if we have special secret routing set up someplace to make this work for the old cloudbackup hosts?
[03:20:31] ops-codfw, SRE: Inbound interface errors - https://phabricator.wikimedia.org/T361533#9678270 (Papaul) Open→Resolved a:Papaul
[03:22:09] ops-codfw, SRE, Infrastructure-Foundations, netops, Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9678272 (Papaul)
[03:24:00] ops-codfw, SRE, Infrastructure-Foundations, netops, Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9678273 (Papaul) Open→Resolved a:Papaul
[03:53:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.339s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:58:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 823.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:05:49] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1015462
[04:20:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.03s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:49:23] (PS8) Sg912: SLO queries for AQS 2.0 geo analytics [grafana-grizzly] - https://gerrit.wikimedia.org/r/1009472 (https://phabricator.wikimedia.org/T358751)
[04:50:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 815.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:53:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[04:53:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[04:53:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59107 and previous config saved to /var/cache/conftool/dbconfig/20240402-045353-root.json
[04:55:20] (PS3) Dwisehaupt: Add dyna and discovery records for community-crm [dns] - https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995)
[04:55:57] (PS1) Marostegui: es1024: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1016076 (https://phabricator.wikimedia.org/T358746)
[04:56:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1024 T358746', diff saved to https://phabricator.wikimedia.org/P59108 and previous config saved to /var/cache/conftool/dbconfig/20240402-045559-root.json
[04:56:02] T358746: Upgrade es5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358746
[04:56:14] (CR) CI reject: [V:-1] Add dyna and discovery records for community-crm [dns] - https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995) (owner: Dwisehaupt)
[04:56:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[04:57:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[04:57:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T356166)', diff saved to https://phabricator.wikimedia.org/P59109 and previous config saved to /var/cache/conftool/dbconfig/20240402-045716-marostegui.json
[04:57:19] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[04:57:40] (CR) Marostegui: [C:+2] es1024: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1016076 (https://phabricator.wikimedia.org/T358746) (owner: Marostegui)
[04:58:00] (PS4) Dwisehaupt: Add dyna and discovery records for community-crm [dns] - https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995)
[04:58:50] (CR) CI reject: [V:-1] Add dyna and discovery records for community-crm [dns] - https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995) (owner: Dwisehaupt)
[05:02:13] (PS1) KartikMistry: Update cxserver to 2024-04-01-160720-production [deployment-charts] - https://gerrit.wikimedia.org/r/1016077 (https://phabricator.wikimedia.org/T333969)
[05:04:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59110 and previous config saved to /var/cache/conftool/dbconfig/20240402-050436-root.json
[05:08:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59111 and previous config saved to /var/cache/conftool/dbconfig/20240402-050859-root.json
[05:19:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59112 and previous config saved to /var/cache/conftool/dbconfig/20240402-051942-root.json
[05:24:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59113 and previous config saved to /var/cache/conftool/dbconfig/20240402-052404-root.json
[05:31:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:34:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59114 and previous config saved to /var/cache/conftool/dbconfig/20240402-053447-root.json
[05:39:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59115 and previous config saved to /var/cache/conftool/dbconfig/20240402-053910-root.json
[05:44:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1229 T361543', diff saved to https://phabricator.wikimedia.org/P59116 and previous config saved to /var/cache/conftool/dbconfig/20240402-054408-root.json
[05:44:12] T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543
[05:45:35] (PS1) Marostegui: db1229: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1016080 (https://phabricator.wikimedia.org/T361543)
[05:46:28] (CR) Marostegui: [C:+2] db1229: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1016080 (https://phabricator.wikimedia.org/T361543) (owner: Marostegui)
[05:46:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bookworm
[05:49:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59117 and previous config saved to /var/cache/conftool/dbconfig/20240402-054953-root.json
[05:53:55] (PS1) Marostegui: Revert "db1229: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1016032
[05:54:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59118 and previous config saved to /var/cache/conftool/dbconfig/20240402-055416-root.json
[05:59:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T0600)
[06:00:05] kormat, marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T0600).
[06:03:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage
[06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59119 and previous config saved to /var/cache/conftool/dbconfig/20240402-060459-root.json
[06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59120 and previous config saved to /var/cache/conftool/dbconfig/20240402-060922-root.json
[06:12:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T356166)', diff saved to https://phabricator.wikimedia.org/P59121 and previous config saved to /var/cache/conftool/dbconfig/20240402-061206-marostegui.json
[06:12:09] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:15:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 845.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:20:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59122 and previous config saved to /var/cache/conftool/dbconfig/20240402-062004-root.json
[06:20:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 845.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:23:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1229.eqiad.wmnet with OS bookworm
[06:24:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59123 and previous config saved to /var/cache/conftool/dbconfig/20240402-062427-root.json
[06:26:52] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9678542 (Urbanecm) >>! In T351202#9639225, @Dzahn wrote:...
[06:27:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P59124 and previous config saved to /var/cache/conftool/dbconfig/20240402-062713-marostegui.json
[06:29:51] (PS2) Elukey: profile::pki::multirootca::monitoring: add workaround for python3-crypto [puppet] - https://gerrit.wikimedia.org/r/1015541 (https://phabricator.wikimedia.org/T360595)
[06:30:27] (CR) Elukey: profile::pki::multirootca::monitoring: add workaround for python3-crypto (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1015541 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[06:30:43] (CR) Elukey: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1771/console" [puppet] - https://gerrit.wikimedia.org/r/1015541 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[06:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:34:57] (CR) Elukey: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1772/console" [puppet] - https://gerrit.wikimedia.org/r/1015541 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[06:35:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59125 and previous config saved to /var/cache/conftool/dbconfig/20240402-063510-root.json
[06:35:25] (CR) Marostegui: [C:+2] Revert "db1229: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1016032 (owner: Marostegui)
[06:36:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59126 and previous config saved to /var/cache/conftool/dbconfig/20240402-063607-root.json
[06:41:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[06:42:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P59127 and previous config saved to /var/cache/conftool/dbconfig/20240402-064221-marostegui.json
[06:48:22] (CR) AOkoth: [C:+2] miscweb: remove profile::microsites::security [puppet] - https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) (owner: AOkoth)
[06:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:51:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59128 and previous config saved to /var/cache/conftool/dbconfig/20240402-065113-root.json
[06:57:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T356166)', diff saved to https://phabricator.wikimedia.org/P59129 and previous config saved to /var/cache/conftool/dbconfig/20240402-065728-marostegui.json
[06:57:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[06:57:32] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[06:57:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[06:57:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T356166)', diff saved to https://phabricator.wikimedia.org/P59130 and previous config saved to /var/cache/conftool/dbconfig/20240402-065751-marostegui.json
[07:00:04] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T0700). nyaa~
[07:00:04] gmodena: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:18] I'm around
[07:01:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:02:43] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9678629 (MoritzMuehlenhoff)
[07:03:07] !log installing util-linux security updates
[07:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59131 and previous config saved to /var/cache/conftool/dbconfig/20240402-070619-root.json
[07:10:14] jouncebot: now
[07:10:14] For the next 0 hour(s) and 49 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T0700)
[07:10:36] gmodena: good morning, I will proceed with your patch
[07:10:46] I apologize for the delay, the week-end has been long and tedious! :D
[07:11:01] hashar thanks & no worries at all
[07:11:11] I'm also just back from a very long weekend :D
[07:11:15] (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[07:11:24] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015260 (https://phabricator.wikimedia.org/T314956) (owner: Gmodena)
[07:12:06] (Merged) jenkins-bot: webrequest: disable canary events. [mediawiki-config] - https://gerrit.wikimedia.org/r/1015260 (https://phabricator.wikimedia.org/T314956) (owner: Gmodena)
[07:12:44] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]]
[07:12:47] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9678635 (Urbanecm) We also need to move the list of user...
[07:12:48] T314956: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956
[07:12:48] T351117: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117
[07:21:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59132 and previous config saved to /var/cache/conftool/dbconfig/20240402-072125-root.json
[07:22:48] gmodena: the images have been built and are being pulled :)
[07:22:55] hashar ack
[07:23:34] the first deploy of the day usually takes half an hour or so
[07:24:11] partly due to the localization cache being rebuilt (it is something like 3.5 G)
[07:26:29] TIL
[07:26:32] thanks for the heads up
[07:27:39] I'll be around to test the whole morning
[07:28:20] !log hashar@deploy1002 gmodena and hashar: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:28:26] T314956: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956
[07:28:26] T351117: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117
[07:28:47] gmodena: that is ready to be tested, if that is ever testable
[07:28:48] :)
[07:28:50] (CR) Elukey: [V:+1 C:+2] profile::pki::multirootca::monitoring: add workaround for python3-crypto [puppet] - https://gerrit.wikimedia.org/r/1015541 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[07:30:05] hashar tested on mwdebug1001. A call to api.php?action=streamconfigs&streams=webrequest.frontend.rc0 produces the expected output.
[07:30:26] !log hashar@deploy1002 gmodena and hashar: Continuing with sync
[07:30:29] excellent
[07:36:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59133 and previous config saved to /var/cache/conftool/dbconfig/20240402-073631-root.json
[07:38:31] ops-codfw, SRE, cloud-services-team, DC-Ops: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537#9678727 (taavi) Open→Resolved
[07:39:25] !log update firewall policy on cr-eqiad, cr-codfw T361537
[07:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:28] T361537: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537
[07:39:35] ops-codfw, SRE, cloud-services-team, DC-Ops: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537#9678725 (taavi) a:Andrew→taavi I ran the Capirca netbox script and that updated the firewall policy on `cr*-eqiad`: `lang=diff [edit fire...
[07:42:59] (03CR) 10Majavah: "I don't have concerns with this specific extension, but unless there's a specific reason not to I'd like to only add new functionality to " [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle) [07:43:42] (03PS1) 10Elukey: profile::pki::multirootca::monitoring: rework prometheus-client pkg [puppet] - 10https://gerrit.wikimedia.org/r/1016289 (https://phabricator.wikimedia.org/T360595) [07:45:15] 06SRE, 07SecTeam-Processed, 07Security, 07Vuln-VulnComponent: 14[CVE-2024-3094] SSH backdoor vulnerability in liblzma in Debian Sid - 14https://phabricator.wikimedia.org/T361420#9678774 (10Aklapper) 05Resolved→03Invalid [07:45:36] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1773/console" [puppet] - 10https://gerrit.wikimedia.org/r/1016289 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [07:46:48] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1015260|webrequest: disable canary events. 
(T314956 T351117)]] (duration: 34m 03s)
[07:46:51] T314956: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956
[07:46:52] T351117: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117
[07:47:00] yeah half an hour
[07:47:33] !log UTC morning backport window completed
[07:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:14] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 38278
[07:49:53] (CR) Elukey: [V:+1 C:+2] profile::pki::multirootca::monitoring: rework prometheus-client pkg [puppet] - https://gerrit.wikimedia.org/r/1016289 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[07:51:22] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; e...
- https://phabricator.wikimedia.org/T361218#9678784
[07:51:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59134 and previous config saved to /var/cache/conftool/dbconfig/20240402-075136-root.json
[07:51:55] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38278
[07:55:37] (PS1) Muehlenhoff: Move alert* Hiera config to the role level [puppet] - https://gerrit.wikimedia.org/r/1016290 (https://phabricator.wikimedia.org/T333615)
[07:59:08] (CR) Majavah: [C:+1] wmcs puppetservers: stop pulling hiera from /etc/puppet/secrets [puppet] - https://gerrit.wikimedia.org/r/1015392 (owner: Andrew Bogott)
[07:59:46] (CR) DCausse: [C:-1] updateQueryServiceLag: tune the min query rate of a pooled server [puppet] - https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) (owner: DCausse)
[08:00:05] jnuche and jeena: MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T0800). Please do the needful.
[08:00:35] morning, I'll start the train in a few minutes
[08:00:52] presync needs to be re-run, so I'll begin with that
[08:02:49] !log restore SRE business hours routing/escalation after the holidays - T350192
[08:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:52] T350192: On-call batphone escalation configuration holidays FY2023-24 - https://phabricator.wikimedia.org/T350192
[08:06:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59137 and previous config saved to /var/cache/conftool/dbconfig/20240402-080642-root.json
[08:07:03] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9678812 (MoritzMuehlenhoff)
[08:10:22] !log jnuche@deploy1002 Pruned MediaWiki: 1.42.0-wmf.23 (duration: 03m 45s)
[08:10:36] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016291 (https://phabricator.wikimedia.org/T360157)
[08:10:37] (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016291 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[08:10:53] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325#9678820 (ayounsi) For the record I looked deeper at gNMI to configure Juniper devices. Some of the findings: current code fails with a reply from the switch ab...
[08:11:20] (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016291 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[08:11:49] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.42.0-wmf.25 refs T360157
[08:12:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T356166)', diff saved to https://phabricator.wikimedia.org/P59138 and previous config saved to /var/cache/conftool/dbconfig/20240402-081229-marostegui.json
[08:14:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1197', diff saved to https://phabricator.wikimedia.org/P59139 and previous config saved to /var/cache/conftool/dbconfig/20240402-081408-marostegui.json
[08:15:01] (PS1) Marostegui: db1197: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1016292
[08:15:05] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9678829 (MoritzMuehlenhoff)
[08:16:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1197.eqiad.wmnet with OS bookworm
[08:16:00] (CR) Marostegui: [C:+2] db1197: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1016292 (owner: Marostegui)
[08:17:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2115.codfw.wmnet with reason: provisionning db2215.codfw.wmnet - T355422
[08:17:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2115.codfw.wmnet with reason: provisionning db2215.codfw.wmnet - T355422
[08:17:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2215.codfw.wmnet with reason: provisionning db2215.codfw.wmnet - T355422
[08:17:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2215.codfw.wmnet
with reason: provisionning db2215.codfw.wmnet - T355422
[08:17:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2115 in db2215 for T355422', diff saved to https://phabricator.wikimedia.org/P59140 and previous config saved to /var/cache/conftool/dbconfig/20240402-081741-arnaudb.json
[08:17:57] (PS1) Muehlenhoff: Remove now obsolete site.pp entry [puppet] - https://gerrit.wikimedia.org/r/1016293 (https://phabricator.wikimedia.org/T341895)
[08:19:13] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2115.codfw.wmnet onto db2215.codfw.wmnet
[08:20:02] (PS2) Muehlenhoff: Remove now obsolete site.pp entry [puppet] - https://gerrit.wikimedia.org/r/1016293 (https://phabricator.wikimedia.org/T341895)
[08:27:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P59141 and previous config saved to /var/cache/conftool/dbconfig/20240402-082737-marostegui.json
[08:28:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage
[08:29:47] (CR) Filippo Giunchedi: [C:+1] "Thank you!"
[puppet] - https://gerrit.wikimedia.org/r/1016290 (https://phabricator.wikimedia.org/T333615) (owner: Muehlenhoff)
[08:30:36] (PS1) Muehlenhoff: Remove stub secret [labs/private] - https://gerrit.wikimedia.org/r/1016295 (https://phabricator.wikimedia.org/T360413)
[08:31:31] (PS5) Volans: external clouds: allow to get prefixes from RIPE [puppet] - https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534)
[08:31:31] (PS1) Volans: external clouds: get prefixes also from MaxMindDB [puppet] - https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534)
[08:32:38] (CR) CI reject: [V:-1] external clouds: get prefixes also from MaxMindDB [puppet] - https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534) (owner: Volans)
[08:32:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage
[08:34:41] (PS2) Volans: external clouds: get prefixes also from MaxMindDB [puppet] - https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534)
[08:35:25] (CR) CI reject: [V:-1] external clouds: get prefixes also from MaxMindDB [puppet] - https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534) (owner: Volans)
[08:41:27] (CR) Jelto: [V:+1 C:+2] gitlab_runner: allow dockerfile frontend on gitlab-runner2004 [puppet] - https://gerrit.wikimedia.org/r/1014485 (https://phabricator.wikimedia.org/T357612) (owner: Jelto)
[08:42:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P59142 and previous config saved to /var/cache/conftool/dbconfig/20240402-084244-marostegui.json
[08:43:09] (PS1) Marostegui: Revert "db1197: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1016033
[08:53:38] !log marostegui@cumin1002 END (PASS) -
Cookbook sre.hosts.reimage (exit_code=0) for host db1197.eqiad.wmnet with OS bookworm
[08:54:06] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1015960 (owner: L10n-bot)
[08:57:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T356166)', diff saved to https://phabricator.wikimedia.org/P59143 and previous config saved to /var/cache/conftool/dbconfig/20240402-085752-marostegui.json
[08:57:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[08:57:55] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[08:57:57] (PS3) Volans: external clouds: get prefixes also from MaxMindDB [puppet] - https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534)
[08:58:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[08:58:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T356166)', diff saved to https://phabricator.wikimedia.org/P59144 and previous config saved to /var/cache/conftool/dbconfig/20240402-085814-marostegui.json
[08:58:50] (CR) Marostegui: [C:+2] Revert "db1197: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1016033 (owner: Marostegui)
[08:59:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59145 and previous config saved to /var/cache/conftool/dbconfig/20240402-085917-root.json
[09:01:17] (CR) CI reject: [V:-1] external clouds: get prefixes also from MaxMindDB [puppet] - https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534) (owner: Volans)
[09:02:53] !log jnuche@deploy1002 Finished
scap: testwikis wikis to 1.42.0-wmf.25 refs T360157 (duration: 51m 03s)
[09:03:01] T360157: 1.42.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T360157
[09:07:37] (PS1) TrainBranchBot: group0 wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016300 (https://phabricator.wikimedia.org/T360157)
[09:07:38] (CR) TrainBranchBot: [C:+2] group0 wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016300 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[09:08:26] (Merged) jenkins-bot: group0 wikis to 1.42.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1016300 (https://phabricator.wikimedia.org/T360157) (owner: TrainBranchBot)
[09:10:18] (CR) Muehlenhoff: [C:+2] Move alert* Hiera config to the role level [puppet] - https://gerrit.wikimedia.org/r/1016290 (https://phabricator.wikimedia.org/T333615) (owner: Muehlenhoff)
[09:12:16] (PS1) Filippo Giunchedi: hieradata: add logstash_oidc client [puppet] - https://gerrit.wikimedia.org/r/1016301 (https://phabricator.wikimedia.org/T337818)
[09:13:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to reimage db1185', diff saved to https://phabricator.wikimedia.org/P59146 and previous config saved to /var/cache/conftool/dbconfig/20240402-091303-arnaudb.json
[09:13:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: Silence for reimaging
[09:14:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: Silence for reimaging
[09:14:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59147 and previous config saved to /var/cache/conftool/dbconfig/20240402-091422-root.json
[09:15:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bookworm
[09:16:52] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1015392 (owner: Andrew Bogott)
[09:22:24] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.25 refs T360157
[09:22:29] T360157: 1.42.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T360157
[09:28:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
[09:29:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59148 and previous config saved to /var/cache/conftool/dbconfig/20240402-092928-root.json
[09:30:37] (CR) Volans: [C:-1] "-1 for now as it requires a backported package. As for CI, we'll need the wmf-style ignore anyway but depends if we prefer PS2 or PS3 appr" [puppet] - https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534) (owner: Volans)
[09:32:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
[09:40:27] (PS5) Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986)
[09:40:43] (PS1) Filippo Giunchedi: hieradata: bump ops prometheus retention_size [puppet] - https://gerrit.wikimedia.org/r/1016304 (https://phabricator.wikimedia.org/T360537)
[09:40:44] (PS1) Filippo Giunchedi: hieradata: bump k8s prometheus retention_size [puppet] - https://gerrit.wikimedia.org/r/1016305 (https://phabricator.wikimedia.org/T360537)
[09:42:24] (CR) Majavah: [C:+1] cloud puppetservers: remove hooks preventing local commit/merge/rebase [puppet] - https://gerrit.wikimedia.org/r/1015625 (owner: Andrew Bogott)
[09:42:47] (CR) Majavah: [C:+2] ldap: Pass typed data to sssd class [puppet] -
https://gerrit.wikimedia.org/r/1013324 (owner: Majavah)
[09:43:13] (CR) David Caro: [C:+1] cloud puppetservers: remove hooks preventing local commit/merge/rebase [puppet] - https://gerrit.wikimedia.org/r/1015625 (owner: Andrew Bogott)
[09:44:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59149 and previous config saved to /var/cache/conftool/dbconfig/20240402-094433-root.json
[09:44:54] (CR) Majavah: [C:+2] cloud puppetservers: remove hooks preventing local commit/merge/rebase [puppet] - https://gerrit.wikimedia.org/r/1015625 (owner: Andrew Bogott)
[09:46:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2115.codfw.wmnet onto db2215.codfw.wmnet
[09:53:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS bookworm
[09:59:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59150 and previous config saved to /var/cache/conftool/dbconfig/20240402-095939-root.json
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1000)
[10:00:48] (CR) Fabfur: [C:+2] cp3066: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015968 (https://phabricator.wikimedia.org/T360430) (owner: Ssingh)
[10:01:09] (CR) Fabfur: [C:+1] cp3066: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015968 (https://phabricator.wikimedia.org/T360430) (owner: Ssingh)
[10:06:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook -
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:07:59] (PS1) Fabfur: benthos: add 2 more hosts [puppet] - https://gerrit.wikimedia.org/r/1016306 (https://phabricator.wikimedia.org/T358109)
[10:09:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[10:09:35] (CR) Fabfur: [C:+2] benthos: add 2 more hosts [puppet] - https://gerrit.wikimedia.org/r/1016306 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[10:11:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:12:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:14:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59151 and previous config saved to /var/cache/conftool/dbconfig/20240402-101445-root.json
[10:15:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 893.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:15:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T356166)', diff saved to https://phabricator.wikimedia.org/P59152 and previous config saved to /var/cache/conftool/dbconfig/20240402-101538-marostegui.json
[10:15:41] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[10:16:30] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:16:43] (PS1) Santiago Faci: edit and editor analytics: Updating mediawiki snapshot version [deployment-charts] - https://gerrit.wikimedia.org/r/1016307
[10:20:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 852.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:21:30] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:24:15] (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 -
https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[10:26:30] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:29:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59153 and previous config saved to /var/cache/conftool/dbconfig/20240402-102951-root.json
[10:30:04] (CR) Santiago Faci: "s" [deployment-charts] - https://gerrit.wikimedia.org/r/1016307 (owner: Santiago Faci)
[10:30:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P59154 and previous config saved to /var/cache/conftool/dbconfig/20240402-103045-marostegui.json
[10:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:34:31] (PS1) Muehlenhoff: Remove leftovers from old an-coord nodes [puppet] - https://gerrit.wikimedia.org/r/1016308 (https://phabricator.wikimedia.org/T353774)
[10:36:25] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016308 (https://phabricator.wikimedia.org/T353774) (owner: Muehlenhoff)
[10:38:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 935.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:45:53] !log marostegui@cumin1002 dbctl commit
(dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P59155 and previous config saved to /var/cache/conftool/dbconfig/20240402-104552-marostegui.json
[10:48:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 918.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:01:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T356166)', diff saved to https://phabricator.wikimedia.org/P59156 and previous config saved to /var/cache/conftool/dbconfig/20240402-110100-marostegui.json
[11:01:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[11:01:04] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[11:01:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[11:01:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T356166)', diff saved to https://phabricator.wikimedia.org/P59157 and previous config saved to /var/cache/conftool/dbconfig/20240402-110122-marostegui.json
[11:17:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook -
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:22:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:25:06] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[11:25:52] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[11:27:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:28:39] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[11:29:07] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[11:29:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:29:21] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[11:29:51] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[11:31:30] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[11:31:41] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[11:32:27] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[11:32:44] !log sfaci@deploy1002 helmfile [codfw] DONE
helmfile.d/services/editor-analytics: apply
[11:32:57] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[11:33:15] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[11:34:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:49:16] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1035.eqiad.wmnet with OS bookworm
[11:57:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 863.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:58:13] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1035
[11:58:35] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1035
[12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1200)
[12:04:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1188 T361543', diff saved to https://phabricator.wikimedia.org/P59158 and previous config saved to /var/cache/conftool/dbconfig/20240402-120455-root.json
[12:04:59] T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543
[12:06:26] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage
[12:07:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1188.eqiad.wmnet with OS bookworm
[12:07:15]
(MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 953.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:09:29] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage
[12:11:26] !log hnowlan@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[12:11:27] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[12:12:09] !log hnowlan@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[12:13:08] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[12:13:32] !log installing pillow security updates
[12:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T356166)', diff saved to https://phabricator.wikimedia.org/P59159 and previous config saved to /var/cache/conftool/dbconfig/20240402-121819-marostegui.json
[12:18:23] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[12:18:28] !log hnowlan@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[12:18:28] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[12:19:11] !log hnowlan@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[12:19:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1188.eqiad.wmnet with reason: host reimage
[12:20:07] !log hnowlan@deploy2002 helmfile [eqiad]
[main] DONE helmfile.d/services/mw-jobrunner : sync
[12:22:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1188.eqiad.wmnet with reason: host reimage
[12:28:08] !log taavi@deploy1002 ~ $ sudo systemctl kill train-presync.service # T361580
[12:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:11] T361580: Train presync timer in unhealthy state - https://phabricator.wikimedia.org/T361580
[12:33:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P59160 and previous config saved to /var/cache/conftool/dbconfig/20240402-123326-marostegui.json
[12:39:19] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1035.eqiad.wmnet with OS bookworm
[12:44:22] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9679513 (MoritzMuehlenhoff)
[12:44:28] (PS1) Muehlenhoff: analytics_cluster::coordinator: Configure Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619)
[12:44:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1188.eqiad.wmnet with OS bookworm
[12:45:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59161 and previous config saved to /var/cache/conftool/dbconfig/20240402-124506-root.json
[12:45:19] (CR) Hnowlan: [C:+1] edit and editor analytics: Updating mediawiki snapshot version [deployment-charts] - https://gerrit.wikimedia.org/r/1016307 (owner: Santiago Faci)
[12:45:23] (CR) Muehlenhoff: [C:+1] "That's fine, all firewall services used on the stewards hosts use nftables-compatible service definitions.
Note that the hosts will need t" [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [12:45:31] (03CR) 10Muehlenhoff: [C:04-1] "Hold back with that one, not all firewall definitions are ported." [puppet] - 10https://gerrit.wikimedia.org/r/1013648 (owner: 10Dzahn) [12:45:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:46:11] (03CR) 10Santiago Faci: [C:03+2] edit and editor analytics: Updating mediawiki snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016307 (owner: 10Santiago Faci) [12:46:19] (03Merged) 10jenkins-bot: edit and editor analytics: Updating mediawiki snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016307 (owner: 10Santiago Faci) [12:46:47] (03PS1) 10Muehlenhoff: Remove obsolete stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1016312 (https://phabricator.wikimedia.org/T360412) [12:47:15] (03PS1) 10Muehlenhoff: Remove now obsolete certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016313 (https://phabricator.wikimedia.org/T360412) [12:47:45] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9679605 (10MoritzMuehlenhoff) [12:48:06] (03PS1) 10Muehlenhoff: schema: Remove obsolete certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016315 (https://phabricator.wikimedia.org/T360412) [12:48:11] (03PS1) 10Muehlenhoff: schema: Remove dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1016316 (https://phabricator.wikimedia.org/T360412) [12:48:32] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9679624 (10MoritzMuehlenhoff) [12:48:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1198', diff saved to https://phabricator.wikimedia.org/P59162 and previous config saved to /var/cache/conftool/dbconfig/20240402-124834-marostegui.json [12:49:22] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance puppetdb2003:9100) - https://phabricator.wikimedia.org/T361578 (10LSobanski) 03NEW [12:49:46] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1035: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016318 (https://phabricator.wikimedia.org/T319184) [12:50:36] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 10database-backups, and 3 others: 14db2100 crashed (memory error) - 14https://phabricator.wikimedia.org/T361037#9679684 (10jcrespo) 05Open→03Declined 14Yes, we have redundancy for the backups and this will actually simplify things. I will take care of... [12:52:01] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9679718 (10jijiki) >>! In T360596#9676049, @akosiaris wrote: > > My 2, operationally minded, cents says to wait for the d... [12:52:46] jouncebot: nowandnext [12:52:46] For the next 0 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1200) [12:52:46] In 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1300) [12:54:00] (03CR) 10FNegri: [C:03+2] R:wmcs::db::toolsdb: remove unnecessary config [puppet] - 10https://gerrit.wikimedia.org/r/1015580 (https://phabricator.wikimedia.org/T344717) (owner: 10FNegri) [12:54:51] I can self deploy my change. [12:54:58] For the backport window [12:55:14] So I can run the window too. 
[12:55:45] (just saw another change being added as I sent the message :) ) [12:55:57] I won’t be available during the window [12:57:19] TheresNoTime: Are you wanting to self-deploy your config change? [12:57:53] Dreamy_Jazz: feel free to deploy it! [12:58:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 5%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59163 and previous config saved to /var/cache/conftool/dbconfig/20240402-125825-arnaudb.json [12:58:34] (CR) Samwilson: [C:+1] "Looks G to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/1016334 (https://phabricator.wikimedia.org/T355548) (owner: Samtar) [12:58:41] I'll be in a meeting at the same time, so if you can deploy your change that would be ideal. However, I'm happy to deploy it if you can't. [12:59:11] Dreamy_Jazz: ack, I will self-deploy after you've done yours [12:59:16] :D [12:59:22] (nb. T361577 is really noisy) [12:59:23] T361577: Error: Typed property MediaWiki\Rest\RequestBase::$parsedBody must not be accessed before initialization - https://phabricator.wikimedia.org/T361577 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1300). [13:00:05] Dreamy_Jazz and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59164 and previous config saved to /var/cache/conftool/dbconfig/20240402-130012-root.json [13:00:14] \o [13:00:30] all yours, ping me when complete please! :) [13:00:44] Sure.
[13:01:04] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015373 (https://phabricator.wikimedia.org/T353496) (owner: Tchanders) [13:01:05] SRE-tools, homer, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9679992 (ayounsi) Ping? :) [13:01:57] (Merged) jenkins-bot: Deploy partial action blocks everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1015373 (https://phabricator.wikimedia.org/T353496) (owner: Tchanders) [13:02:22] (PS1) Arnaudb: mariadb: removes db2100 after memory failure [puppet] - https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) [13:02:31] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1015373|Deploy partial action blocks everywhere (T353496)]] [13:02:34] T353496: Deploy partial action blocks to remaining wikis - https://phabricator.wikimedia.org/T353496 [13:03:17] (PS2) Arnaudb: mariadb: removes db2100 after memory failure [puppet] - https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) [13:03:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T356166)', diff saved to https://phabricator.wikimedia.org/P59165 and previous config saved to /var/cache/conftool/dbconfig/20240402-130341-marostegui.json [13:03:45] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [13:03:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:03:46] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1016304 (https://phabricator.wikimedia.org/T360537) (owner: Filippo Giunchedi) [13:03:53] oh nice, partial action blocks to all projects!
[13:03:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:04:00] (CR) Arnaudb: "the last patch also edits" [puppet] - https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) (owner: Arnaudb) [13:04:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:04:03] Yup :) [13:04:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:04:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T356166)', diff saved to https://phabricator.wikimedia.org/P59166 and previous config saved to /var/cache/conftool/dbconfig/20240402-130423-marostegui.json [13:04:59] !log dreamyjazz@deploy1002 dreamyjazz and tchanders: Backport for [[gerrit:1015373|Deploy partial action blocks everywhere (T353496)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:52] !log dreamyjazz@deploy1002 dreamyjazz and tchanders: Continuing with sync [13:07:18] SRE-tools, homer, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9680010 (ayounsi) p:Low→High [13:13:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 10%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59167 and previous config saved to /var/cache/conftool/dbconfig/20240402-131330-arnaudb.json [13:14:00] (CR) Elukey: [C:+1] "Left some nits/questions, if they are not relevant feel free to proceed :)" [puppet] - https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [13:15:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 10%: Repooling', diff
saved to https://phabricator.wikimedia.org/P59168 and previous config saved to /var/cache/conftool/dbconfig/20240402-131517-root.json [13:18:04] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1015373|Deploy partial action blocks everywhere (T353496)]] (duration: 15m 33s) [13:18:08] T353496: Deploy partial action blocks to remaining wikis - https://phabricator.wikimedia.org/T353496 [13:18:30] TheresNoTime: Done, feel free to go ahead with your config change. [13:18:37] Dreamy_Jazz: ack, thank you [13:18:42] (03PS2) 10Samtar: InitialiseSettings: Enable Edit Recovery on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016334 (https://phabricator.wikimedia.org/T355548) [13:19:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016334 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar) [13:20:28] (03Merged) 10jenkins-bot: InitialiseSettings: Enable Edit Recovery on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016334 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar) [13:21:02] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1016334|InitialiseSettings: Enable Edit Recovery on all projects (T355548)]] [13:21:04] T355548: Edit Recovery deployment - https://phabricator.wikimedia.org/T355548 [13:23:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:23:22] !log samtar@deploy1002 samtar: Backport for [[gerrit:1016334|InitialiseSettings: Enable Edit Recovery on all projects (T355548)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:23:30] * TheresNoTime 
testing [13:23:41] (03CR) 10Effie Mouzeli: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [13:25:52] !log samtar@deploy1002 samtar: Continuing with sync [13:28:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:28:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 15%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59169 and previous config saved to /var/cache/conftool/dbconfig/20240402-132836-arnaudb.json [13:30:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59170 and previous config saved to /var/cache/conftool/dbconfig/20240402-133023-root.json [13:31:24] (03PS1) 10Muehlenhoff: Uninstall eject on VMs [puppet] - 10https://gerrit.wikimedia.org/r/1016345 [13:32:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016345 (owner: 10Muehlenhoff) [13:32:33] !log depool cp3066 for reimage (T360430) [13:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:36] T360430: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430 [13:33:00] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [13:33:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:34:35] 
(03PS1) 10Andrew Bogott: profile::wmcs::kubeadm::etcd: install etcd package before referencing uid [puppet] - 10https://gerrit.wikimedia.org/r/1016346 [13:35:17] (03PS2) 10Andrew Bogott: profile::wmcs::kubeadm::etcd: install etcd package before referencing uid [puppet] - 10https://gerrit.wikimedia.org/r/1016346 (https://phabricator.wikimedia.org/T349207) [13:35:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016346 (https://phabricator.wikimedia.org/T349207) (owner: 10Andrew Bogott) [13:37:28] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1016334|InitialiseSettings: Enable Edit Recovery on all projects (T355548)]] (duration: 16m 26s) [13:37:31] T355548: Edit Recovery deployment - https://phabricator.wikimedia.org/T355548 [13:38:07] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3066.esams.wmnet with OS bullseye [13:38:16] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9680152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3066.esams.wmnet with OS bullseye [13:38:17] (03PS3) 10Ayounsi: Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) [13:38:45] !log closing UTC afternoon backport window [13:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:58] (03CR) 10Fabfur: [C:03+2] cp3066: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1015968 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh) [13:42:58] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9680164 (10ssingh) [13:43:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 25%: Post reimage repool', diff saved to 
https://phabricator.wikimedia.org/P59171 and previous config saved to /var/cache/conftool/dbconfig/20240402-134342-arnaudb.json [13:45:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59172 and previous config saved to /var/cache/conftool/dbconfig/20240402-134528-root.json [13:45:58] (03CR) 10CI reject: [V:04-1] Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi) [13:47:00] (03CR) 10Herron: [C:03+1] hieradata: bump ops prometheus retention_size [puppet] - 10https://gerrit.wikimedia.org/r/1016304 (https://phabricator.wikimedia.org/T360537) (owner: 10Filippo Giunchedi) [13:47:30] (03Abandoned) 10Herron: grafana-client: initial packaging [debs/python-grafana-client] - 10https://gerrit.wikimedia.org/r/983477 (owner: 10Herron) [13:47:44] (03Abandoned) 10Herron: initial import of 3.10.0 [debs/python-grafana-client] - 10https://gerrit.wikimedia.org/r/983480 (owner: 10Herron) [13:48:38] (03Abandoned) 10Herron: verlib2: initial packaging [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983468 (owner: 10Herron) [13:48:47] (03Abandoned) 10Herron: initial import from upstream 0.2.0 [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983475 (owner: 10Herron) [13:50:14] (03Abandoned) 10Herron: graphite-web: switch logrotate to copytruncate [puppet] - 10https://gerrit.wikimedia.org/r/966881 (owner: 10Herron) [13:51:24] (03Abandoned) 10Herron: envoy: manage strip_matching_host_port setting and enable on thanos-fe [puppet] - 10https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: 10Herron) [13:52:11] (03CR) 10Herron: [C:03+1] wmcs puppetservers: stop pulling hiera from /etc/puppet/secrets [puppet] - 10https://gerrit.wikimedia.org/r/1015392 (owner: 10Andrew Bogott) [13:52:35] (03Abandoned) 10Herron: prometheus: apt::pin prometheus package to 
bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/967969 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron) [13:52:51] (03CR) 10Marostegui: mariadb: removes db2100 after memory failure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) (owner: 10Arnaudb) [13:54:02] (03Abandoned) 10Herron: wip [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/944287 (owner: 10Herron) [13:54:39] (03Abandoned) 10Herron: admin: add common approval guidelines to group descriptions [puppet] - 10https://gerrit.wikimedia.org/r/425074 (owner: 10Herron) [13:57:37] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9680325 (10ssingh) For posterity, an annotated Grafana dashboard that shows incoming traffic to esams after and during the depool and power-off events: https://grafana.wikim... [13:58:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59173 and previous config saved to /var/cache/conftool/dbconfig/20240402-135847-arnaudb.json [13:58:52] (03Abandoned) 10Herron: logstash: shrink es cluster back to 3 nodes, remove retired hosts [puppet] - 10https://gerrit.wikimedia.org/r/493098 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [14:00:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59174 and previous config saved to /var/cache/conftool/dbconfig/20240402-140035-root.json [14:00:42] (03CR) 10Volans: Netbox: add functions to get and set device name (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [14:01:51] (03Abandoned) 10Herron: rsyslog-shipper: enable omkafka action queue and retry [puppet] - 10https://gerrit.wikimedia.org/r/486169 (https://phabricator.wikimedia.org/T214176) 
(owner: 10Herron) [14:02:32] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3066.esams.wmnet with reason: host reimage [14:02:35] (03Abandoned) 10Herron: add default vlaue for kafka_shipper::kafka_brokers [puppet] - 10https://gerrit.wikimedia.org/r/480790 (owner: 10Herron) [14:02:53] (03Abandoned) 10Herron: logstash: ship zookeeper logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476977 (https://phabricator.wikimedia.org/T63789) (owner: 10Herron) [14:03:22] (03Abandoned) 10Herron: puppetdb: set jetty.ini host = 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/436166 (owner: 10Herron) [14:03:51] (03Abandoned) 10Herron: change wikipedia.com DMARC domain and subdomain policies to reject [dns] - 10https://gerrit.wikimedia.org/r/409407 (https://phabricator.wikimedia.org/T184230) (owner: 10Herron) [14:03:55] (03Abandoned) 10Herron: change wikipedia.com SPF record to fail all (-all) [dns] - 10https://gerrit.wikimedia.org/r/409406 (https://phabricator.wikimedia.org/T184230) (owner: 10Herron) [14:04:00] (03Abandoned) 10Herron: change wikipedia.com zone from symlink to file [dns] - 10https://gerrit.wikimedia.org/r/409405 (https://phabricator.wikimedia.org/T184230) (owner: 10Herron) [14:05:05] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3066.esams.wmnet with reason: host reimage [14:05:45] (03Abandoned) 10Herron: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 (https://phabricator.wikimedia.org/T126989) (owner: 10Herron) [14:07:40] (03CR) 10Stevemunene: "Hello, we had a similar change to decommission an-coord nodes and all related services going on. 
The change however did not include the re" [puppet] - 10https://gerrit.wikimedia.org/r/1016308 (https://phabricator.wikimedia.org/T353774) (owner: 10Muehlenhoff) [14:08:13] (03CR) 10Stevemunene: "I think we should also include the the removal of the The old replica role as it is no longer used/needed. this is mentioned by Moritz her" [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [14:08:59] (03Abandoned) 10Herron: thanos::rule: add cluster_site:sli_etcd_http_error_ratio:rate5m recording rule [puppet] - 10https://gerrit.wikimedia.org/r/717473 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [14:09:04] !log imported jenkins 2.440.2 to thirdparty/ci for buster-wikimedia T360759 [14:09:06] (03Abandoned) 10Herron: rsyslog_recieve: logrotate set maxage and rotate empty logs [puppet] - 10https://gerrit.wikimedia.org/r/701576 (https://phabricator.wikimedia.org/T285371) (owner: 10Herron) [14:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:08] (03Abandoned) 10Herron: profile::mail: add mta hiera option profile::mail::mta [puppet] - 10https://gerrit.wikimedia.org/r/688391 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [14:09:09] T360759: Jenkins core security advisory - 2024-03-20 - https://phabricator.wikimedia.org/T360759 [14:09:10] (03Abandoned) 10Herron: WIP: icinga: add check_sysctl.sh script [puppet] - 10https://gerrit.wikimedia.org/r/376566 (https://phabricator.wikimedia.org/T160060) (owner: 10Herron) [14:09:12] (03Abandoned) 10Herron: pontoon: add hiera settings for o11y-grafana [puppet] - 10https://gerrit.wikimedia.org/r/671187 (owner: 10Herron) [14:09:14] (03Abandoned) 10Herron: puppet-agent: remove --show_diff from scheduled puppet-run script [puppet] - 10https://gerrit.wikimedia.org/r/434719 (https://phabricator.wikimedia.org/T1) (owner: 10Herron) [14:09:16] (03Abandoned) 10Herron: graphite-carbon: disable internal log rotation 
and use logrotate [puppet] - 10https://gerrit.wikimedia.org/r/628423 (https://phabricator.wikimedia.org/T263103) (owner: 10Herron) [14:09:18] (03Abandoned) 10Herron: prometheus: reduce prometheus.svc.eqsin TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/628853 (owner: 10Herron) [14:09:21] (03Abandoned) 10Herron: check_confd_template: glob fixup and add detail to alerts [puppet] - 10https://gerrit.wikimedia.org/r/575598 (owner: 10Herron) [14:09:25] (03Abandoned) 10Herron: prometheus: add alert for widespread systemd failed units [puppet] - 10https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [14:09:29] (03Abandoned) 10Herron: logstash: add tcp json_lines localhost compatability endpoint [puppet] - 10https://gerrit.wikimedia.org/r/496021 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [14:09:33] (03Abandoned) 10Herron: logstash: send varnish syslogs via kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/498467 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [14:11:45] (fwiw I'm around to backport 1016347 for T361577 if needed, but I'm quite sure some deployers are already on the task) [14:11:48] T361577: Error: Typed property MediaWiki\Rest\RequestBase::$parsedBody must not be accessed before initialization - https://phabricator.wikimedia.org/T361577 [14:12:29] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1016346 (https://phabricator.wikimedia.org/T349207) (owner: 10Andrew Bogott) [14:13:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:13:33] (03CR) 10Muehlenhoff: "Sure thing, feel free to fold that in and abandon mine" 
[puppet] - 10https://gerrit.wikimedia.org/r/1016308 (https://phabricator.wikimedia.org/T353774) (owner: 10Muehlenhoff) [14:13:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59176 and previous config saved to /var/cache/conftool/dbconfig/20240402-141353-arnaudb.json [14:13:58] (03CR) 10David Caro: [C:03+1] "The change from require -> before https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/50b8b961d8327fa3be4c86d2ff7a6e08ec998" [puppet] - 10https://gerrit.wikimedia.org/r/1016346 (https://phabricator.wikimedia.org/T349207) (owner: 10Andrew Bogott) [14:15:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59177 and previous config saved to /var/cache/conftool/dbconfig/20240402-141541-root.json [14:16:22] (03CR) 10David Caro: [C:03+1] "It mentions the etcd-server requiring the certs, but maybe that's not true anymore? 
(https://gerrit.wikimedia.org/r/plugins/gitiles/operat" [puppet] - 10https://gerrit.wikimedia.org/r/1016346 (https://phabricator.wikimedia.org/T349207) (owner: 10Andrew Bogott) [14:18:09] 10ops-eqiad, 06SRE, 10Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9680420 (10herron) CC from IRC chat -- We've tentatively scheduled this for this Weds afternoon (Eastern TZ, 4/3/2024) [14:18:14] (03CR) 10Andrew Bogott: [C:03+2] profile::wmcs::kubeadm::etcd: install etcd package before referencing uid [puppet] - 10https://gerrit.wikimedia.org/r/1016346 (https://phabricator.wikimedia.org/T349207) (owner: 10Andrew Bogott) [14:18:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:19:41] (03CR) 10Fabfur: [V:03+1 C:03+2] hiera: minor fix for benthos env_variables structure [puppet] - 10https://gerrit.wikimedia.org/r/1007299 (https://phabricator.wikimedia.org/T358647) (owner: 10Fabfur) [14:26:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T356166)', diff saved to https://phabricator.wikimedia.org/P59178 and previous config saved to /var/cache/conftool/dbconfig/20240402-142650-marostegui.json [14:26:53] (03CR) 10Muehlenhoff: "This is good to merge, only needs some rebase/sync to the latest git tree. the SPDX work has been backlogged for quite a while since other" [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:26:54] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [14:27:02] (03CR) 10Muehlenhoff: "This is good to merge, only needs some rebase/sync to the latest git tree. 
the SPDX work has been backlogged for quite a while since other" [puppet] - 10https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:27:20] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015465 [14:28:42] (03CR) 10JHathaway: [C:03+1] Enable the mariadb slow query log for civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016016 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [14:28:58] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3066.esams.wmnet with OS bullseye [14:28:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59179 and previous config saved to /var/cache/conftool/dbconfig/20240402-142859-arnaudb.json [14:29:08] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9680474 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3066.esams.wmnet with OS bullseye completed: - cp3066 (**PASS**)... [14:29:26] 10ops-codfw, 06SRE: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9680476 (10Jhancock.wm) a:03Jhancock.wm this error reoccured. A fatal error was detected on a component at bus 101 device 0 function 0. I'm gonna open a troubleshooting ticket with Dell because I'm not 100% sure w... [14:29:33] (03CR) 10JHathaway: [C:03+1] "We have started to migrate away from exposing puppet certs to using cfssl to generate certs, is that possible on the funding tech side?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [14:30:14] (03Abandoned) 10Hnowlan: mw-jobrunner: reduce replicas further [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005121 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [14:31:41] (03PS6) 10JMeybohm: k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) [14:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:32:44] (03CR) 10JHathaway: [C:03+1] "looks good, could we remove it everywhere, do we still have cd drives on physical servers?" [puppet] - 10https://gerrit.wikimedia.org/r/1016345 (owner: 10Muehlenhoff) [14:33:19] (03CR) 10JMeybohm: k8s/apiserver: Add option to configure audit logging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [14:34:47] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1775/co" [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [14:37:12] (03CR) 10Muehlenhoff: "It's really needed on baremetal servers: eject only gets installed if d-i detects an optical drive and we don't have any on baremetal. 
Thi" [puppet] - 10https://gerrit.wikimedia.org/r/1016345 (owner: 10Muehlenhoff) [14:37:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:42] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Platform-SRE (2024.03.25 - 2024.04.14): create and deploy new Elastic Curator deb package - https://phabricator.wikimedia.org/T361105#9680514 (10bking) a:05RKemper→03bking [14:38:44] (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016359 (https://phabricator.wikimedia.org/T356933) [14:38:46] !log repooling cp3066 after reimage (T360430) [14:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:49] T360430: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430 [14:38:49] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet [14:38:53] (03CR) 10Jcrespo: [C:04-1] mariadb: removes db2100 after memory failure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) (owner: 10Arnaudb) [14:39:51] (03PS3) 10Arnaudb: mariadb: removes db2100 after memory failure [puppet] - 10https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) [14:40:00] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1036: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016363 (https://phabricator.wikimedia.org/T319184) [14:40:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:40:19] (03CR) 
10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 (owner: 10Hnowlan) [14:40:27] (03CR) 10Arnaudb: mariadb: removes db2100 after memory failure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) (owner: 10Arnaudb) [14:41:01] (03CR) 10Muehlenhoff: [C:03+2] Uninstall eject on VMs [puppet] - 10https://gerrit.wikimedia.org/r/1016345 (owner: 10Muehlenhoff) [14:41:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P59180 and previous config saved to /var/cache/conftool/dbconfig/20240402-144158-marostegui.json [14:42:14] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9680533 (10Fabfur) cp3066 has been reimaged successfully, no evidence of errors [14:42:17] (03CR) 10David Caro: [C:03+1] cloudvirt1036: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016363 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [14:42:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to reimage db1230', diff saved to https://phabricator.wikimedia.org/P59181 and previous config saved to /var/cache/conftool/dbconfig/20240402-144221-arnaudb.json [14:42:35] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9680539 (10Fabfur) [14:42:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1230.eqiad.wmnet with reason: Silence for reimaging [14:42:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1230.eqiad.wmnet with reason: Silence for reimaging [14:44:50] (03CR) 10Elukey: [C:03+1] k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 
(https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [14:45:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1230.eqiad.wmnet with OS bookworm [14:45:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:47:03] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2215 [puppet] - 10https://gerrit.wikimedia.org/r/1016366 (https://phabricator.wikimedia.org/T355422) [14:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:51:55] (03PS1) 10Samtar: rest: add default null to nullable typed prop [core] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016043 (https://phabricator.wikimedia.org/T361577) [14:53:43] 10ops-codfw: aqs2001.codfw.wmnet down - https://phabricator.wikimedia.org/T361603 (10Eevans) 03NEW [14:53:59] 10ops-codfw, 10Cassandra: aqs2001.codfw.wmnet down - https://phabricator.wikimedia.org/T361603#9680612 (10Eevans) p:05Triage→03High [14:54:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:55:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9680630 (10MoritzMuehlenhoff) [14:56:28] !log installing mariadb security updates (as packaged in Debian, unrelated to the wmf-mariadb packages) [14:56:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P59182 and previous config saved to /var/cache/conftool/dbconfig/20240402-145705-marostegui.json [14:57:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1230.eqiad.wmnet with reason: host reimage [14:59:09] 06SRE, 06Commons, 06Data-Persistence (work done), 10MediaWiki-extensions-WikibaseClient, and 7 others: [C-DIS][SW] Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730#9680651 (10ArthurTaylor) [14:59:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:00:04] eoghan, jelto, and arnoldokoth: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1500). 
[15:01:22] 06SRE, 06Commons, 06Data-Persistence (work done), 10MediaWiki-extensions-WikibaseClient, and 7 others: [C-DIS][SW] Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730#9680653 (10ArthurTaylor) [15:01:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1230.eqiad.wmnet with reason: host reimage [15:02:08] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2214, db2219, db2220 [puppet] - 10https://gerrit.wikimedia.org/r/1016367 (https://phabricator.wikimedia.org/T355422) [15:02:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:02:33] (03CR) 10Jgreen: [C:03+2] Add cv and drush bin dirs to PATH on community crm [puppet] - 10https://gerrit.wikimedia.org/r/1016013 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:02:37] (03CR) 10Marostegui: [C:03+1] "Assuming all green!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1016367 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [15:02:47] (03CR) 10Jgreen: [C:03+1] Add cv and drush bin dirs to PATH on community crm [puppet] - 10https://gerrit.wikimedia.org/r/1016013 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:03:09] (03CR) 10Marostegui: [C:03+1] mariadb: toggle notifications for db2215 [puppet] - 10https://gerrit.wikimedia.org/r/1016366 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [15:03:20] (03CR) 10Arnaudb: [C:03+2] mariadb: toggle notifications for db2214, db2219, db2220 [puppet] - 10https://gerrit.wikimedia.org/r/1016367 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [15:03:33] (03CR) 10Arnaudb: [C:03+2] mariadb: toggle notifications for db2215 [puppet] - 10https://gerrit.wikimedia.org/r/1016366 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [15:03:48] (03CR) 10Jgreen: [C:03+1] Force CIVICRM_TEMPLATE_COMPILE_CHECK to false [puppet] - 10https://gerrit.wikimedia.org/r/1016014 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:04:46] (03CR) 10Jgreen: [C:03+1] Enable the mariadb slow query log for civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016016 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:05:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 1%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59183 and previous config saved to /var/cache/conftool/dbconfig/20240402-150509-arnaudb.json [15:05:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 1%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59184 and previous config saved to /var/cache/conftool/dbconfig/20240402-150516-arnaudb.json [15:05:21] (03CR) 10Jgreen: [C:03+1] Enable https with apache for community civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016018 
(https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:05:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 1%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59185 and previous config saved to /var/cache/conftool/dbconfig/20240402-150525-arnaudb.json [15:05:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 1%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59186 and previous config saved to /var/cache/conftool/dbconfig/20240402-150538-arnaudb.json [15:05:39] (03CR) 10JHathaway: external clouds: get prefixes also from MaxMindDB (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534) (owner: 10Volans) [15:07:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:08:42] (03PS1) 10Elukey: Remove profile::pki::client::auth_key from common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1016364 (https://phabricator.wikimedia.org/T360595) [15:08:58] (03CR) 10Elukey: [V:03+2 C:03+2] Remove profile::pki::client::auth_key from common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1016364 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [15:12:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T356166)', diff saved to https://phabricator.wikimedia.org/P59187 and previous config saved to /var/cache/conftool/dbconfig/20240402-151213-marostegui.json [15:12:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:12:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on 
db1223.eqiad.wmnet with reason: Maintenance [15:12:17] jouncebot: nowandnext [15:12:17] For the next 0 hour(s) and 47 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1500) [15:12:17] In 0 hour(s) and 47 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1600) [15:12:18] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [15:12:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1223.eqiad.wmnet with reason: Maintenance [15:12:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T356166)', diff saved to https://phabricator.wikimedia.org/P59188 and previous config saved to /var/cache/conftool/dbconfig/20240402-151235-marostegui.json [15:12:54] (03CR) 10Jcrespo: [C:03+1] "All good from me." [puppet] - 10https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) (owner: 10Arnaudb) [15:14:24] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9680727 (10MoritzMuehlenhoff) [15:16:11] (03CR) 10Majavah: "Cloud VPS instances don't read profile hiera so removing this means that provisioning any new instances sing profile::pki::client will be " [labs/private] - 10https://gerrit.wikimedia.org/r/1016364 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [15:16:44] (03PS1) 10Elukey: Revert "Remove profile::pki::client::auth_key from common.yaml" [labs/private] - 10https://gerrit.wikimedia.org/r/1016044 [15:16:49] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "Remove profile::pki::client::auth_key from common.yaml" [labs/private] - 10https://gerrit.wikimedia.org/r/1016044 (owner: 10Elukey) [15:18:01] jouncebot: nowandnext [15:18:01] For the next 0 hour(s) and 41 minute(s): SRE Collaboration Services office hours 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1500) [15:18:01] In 0 hour(s) and 41 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1600) [15:18:35] (03CR) 10Jaime Nuche: "Fix looks simple enough. I'll backport it" [core] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016043 (https://phabricator.wikimedia.org/T361577) (owner: 10Samtar) [15:20:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1002 using scap backport" [core] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016043 (https://phabricator.wikimedia.org/T361577) (owner: 10Samtar) [15:20:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 2%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59189 and previous config saved to /var/cache/conftool/dbconfig/20240402-152015-arnaudb.json [15:20:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 2%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59190 and previous config saved to /var/cache/conftool/dbconfig/20240402-152023-arnaudb.json [15:20:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 2%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59191 and previous config saved to /var/cache/conftool/dbconfig/20240402-152031-arnaudb.json [15:20:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 2%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59192 and previous config saved to /var/cache/conftool/dbconfig/20240402-152044-arnaudb.json [15:20:52] (03CR) 10Elukey: [V:03+2 C:03+2] "Duly noted, reverted :)" [labs/private] - 10https://gerrit.wikimedia.org/r/1016364 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [15:21:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1230.eqiad.wmnet with OS bookworm [15:23:25] 
!log jgiannelos@deploy1002 Started deploy [restbase/deploy@c4d19d7]: (no justification provided) [15:23:40] !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@c4d19d7]: (no justification provided) (duration: 00m 16s) [15:25:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 5%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59193 and previous config saved to /var/cache/conftool/dbconfig/20240402-152527-arnaudb.json [15:27:07] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2024-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9680831 (10Pginer-WMF) [15:30:20] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015465 (owner: 10PipelineBot) [15:30:52] (03PS1) 10Elukey: Remove profile::pki::client's specific hiera config [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) [15:31:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 905ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:31:30] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015465 (owner: 10PipelineBot) [15:31:47] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1036.eqiad.wmnet with OS bookworm [15:35:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 4%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59194 and previous config saved to /var/cache/conftool/dbconfig/20240402-153521-arnaudb.json [15:35:30] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 4%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59195 and previous config saved to /var/cache/conftool/dbconfig/20240402-153529-arnaudb.json [15:35:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 4%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59196 and previous config saved to /var/cache/conftool/dbconfig/20240402-153536-arnaudb.json [15:35:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 4%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59197 and previous config saved to /var/cache/conftool/dbconfig/20240402-153550-arnaudb.json [15:36:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 822.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:37:25] (SystemdUnitFailed) firing: git_pull_charts.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:40:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59198 and previous config saved to /var/cache/conftool/dbconfig/20240402-154033-arnaudb.json [15:44:50] (03CR) 10Dwisehaupt: "Sure thing. I was following the pattern we are currently using in our environment. 
I'll have a look at the PKI wikitech page and see if I " [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:44:57] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9681003 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1036.eqiad.wmnet... [15:45:23] 10ops-codfw, 06SRE, 10Cassandra: aqs2001.codfw.wmnet down - https://phabricator.wikimedia.org/T361603#9681018 (10Eevans) a:03Jhancock.wm [15:45:47] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9681019 (10Dzahn) Thank you, will do! Yes, let's use /srv/... [15:45:54] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9681022 (10bking) @lmata Let's go ahead and disable this alert. We'll make a new one once the releva... 
[15:45:59] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9681025 (10bking) [15:46:56] (03CR) 10Dzahn: [C:03+2] Remove stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1016295 (https://phabricator.wikimedia.org/T360413) (owner: 10Muehlenhoff) [15:46:57] (03CR) 10Dzahn: [V:03+2 C:03+2] Remove stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1016295 (https://phabricator.wikimedia.org/T360413) (owner: 10Muehlenhoff) [15:47:09] (03Merged) 10jenkins-bot: rest: add default null to nullable typed prop [core] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016043 (https://phabricator.wikimedia.org/T361577) (owner: 10Samtar) [15:47:29] (03PS1) 10Alexandros Kosiaris: ores: Remove old ORES DNS entries [dns] - 10https://gerrit.wikimedia.org/r/1016389 [15:47:38] !log jnuche@deploy1002 Started scap: Backport for [[gerrit:1016043|rest: add default null to nullable typed prop (T361577)]] [15:47:42] T361577: Error: Typed property MediaWiki\Rest\RequestBase::$parsedBody must not be accessed before initialization - https://phabricator.wikimedia.org/T361577 [15:49:49] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage [15:50:04] !log jnuche@deploy1002 samtar and jnuche: Backport for [[gerrit:1016043|rest: add default null to nullable typed prop (T361577)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:50:14] !log jnuche@deploy1002 samtar and jnuche: Continuing with sync [15:50:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 8%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59199 and previous config saved to /var/cache/conftool/dbconfig/20240402-155026-arnaudb.json [15:50:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 
(re)pooling @ 8%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59200 and previous config saved to /var/cache/conftool/dbconfig/20240402-155035-arnaudb.json [15:50:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 8%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59201 and previous config saved to /var/cache/conftool/dbconfig/20240402-155042-arnaudb.json [15:50:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 8%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59202 and previous config saved to /var/cache/conftool/dbconfig/20240402-155056-arnaudb.json [15:51:32] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1036: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016363 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [15:51:54] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage [15:52:15] (03PS6) 10Elukey: Rework the amd-pytorch22's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) [15:52:21] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9681068 (10aborrero) [15:52:58] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016359 (https://phabricator.wikimedia.org/T356933) (owner: 10Peter Fischer) [15:54:20] (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016359 (https://phabricator.wikimedia.org/T356933) (owner: 10Peter Fischer) [15:54:20] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply 
[15:54:34] (03PS7) 10Elukey: Rework the amd-pytorch22's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) [15:54:50] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:54:53] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:54:59] (03PS8) 10Elukey: Rework the amd-pytorch22's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) [15:55:07] (03PS1) 10Alexandros Kosiaris: changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483) [15:55:22] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1036 [15:55:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 15%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59203 and previous config saved to /var/cache/conftool/dbconfig/20240402-155538-arnaudb.json [15:55:42] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1036 [15:56:07] (03CR) 10CI reject: [V:04-1] changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483) (owner: 10Alexandros Kosiaris) [15:57:20] (03PS1) 10Cwhite: logstash: provision and commission logging-hd200[123] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1016368 (https://phabricator.wikimedia.org/T352517) [15:57:21] (03PS1) 10Cwhite: spicerack: update logging-eqiad host to logging-hd1001 [puppet] - 10https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517) [15:57:46] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye 
[15:57:49] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:57:52] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:57:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9681134 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [15:57:59] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:58:32] 10ops-codfw, 06SRE, 10Cassandra: 14aqs2001.codfw.wmnet down - 14https://phabricator.wikimedia.org/T361603#9681145 (10Jhancock.wm) 05Open→03Resolved 14replaced the SFP, server is pingable again. [16:00:04] jhathaway and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:01:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 944.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:02:08] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2001.codfw.wmnet with OS bookworm [16:02:18] !log jnuche@deploy1002 Finished scap: Backport for [[gerrit:1016043|rest: add default null to nullable typed prop (T361577)]] (duration: 14m 39s) [16:02:22] T361577: Error: Typed property MediaWiki\Rest\RequestBase::$parsedBody must not be accessed before initialization - https://phabricator.wikimedia.org/T361577 [16:04:01] jnuche: puppet window is a no-op fyi [16:04:28] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:04:39] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:05:21] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [16:05:32] (03CR) 10Cwhite: [C:03+2] spicerack: update logging-eqiad host to logging-hd1001 [puppet] - 10https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [16:05:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 16%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59204 and previous config saved to /var/cache/conftool/dbconfig/20240402-160532-arnaudb.json [16:05:40] rzl: thx 👍 [16:05:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 16%: Post clone repool (dst)', diff saved to 
https://phabricator.wikimedia.org/P59205 and previous config saved to /var/cache/conftool/dbconfig/20240402-160540-arnaudb.json [16:05:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 16%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59206 and previous config saved to /var/cache/conftool/dbconfig/20240402-160547-arnaudb.json [16:05:53] (03CR) 10Cwhite: spicerack: update logging-eqiad host to logging-hd1001 [puppet] - 10https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [16:06:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 16%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59207 and previous config saved to /var/cache/conftool/dbconfig/20240402-160602-arnaudb.json [16:07:25] (03PS3) 10RLazarus: MachineVision being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1015010 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [16:07:44] (03CR) 10CI reject: [V:04-1] MachineVision being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1015010 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [16:09:52] (sorry for toe-stepping there jnuche :D) [16:10:14] hehehe, no worries, thanks for being on top of the task :) [16:10:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59208 and previous config saved to /var/cache/conftool/dbconfig/20240402-161044-arnaudb.json [16:11:01] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1006.eqiad.wmnet with reason: host reimage [16:11:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 854.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:11:20] (03PS4) 10RLazarus: MachineVision being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1015010 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [16:12:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9681245 (10VRiley-WMF) [16:13:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: 14Q3:rack/setup/install dbprov100[56] - 14https://phabricator.wikimedia.org/T355353#9681247 (10VRiley-WMF) 14This is now completed. [16:13:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: 14Q3:rack/setup/install dbprov100[56] - 14https://phabricator.wikimedia.org/T355353#9681248 (10VRiley-WMF) 05Open→03Resolved [16:13:43] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1777/co" [puppet] - 10https://gerrit.wikimedia.org/r/1015010 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [16:13:44] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9681250 (10Marostegui) I'd prefer to also work with hostnames rather than IPs. I think under 10ms is good enough but this is just a feeling th... 
[16:13:50] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1006.eqiad.wmnet with reason: host reimage [16:15:36] (03CR) 10Volans: "FYI inline, currently a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [16:16:29] (03PS1) 10BryanDavis: toolforge: Exclude /usr/bin/sudo from Wheel of Misfortune [puppet] - 10https://gerrit.wikimedia.org/r/1016392 [16:16:50] (03CR) 10Volans: "ignore my previous comment" [puppet] - 10https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [16:17:20] (03CR) 10Cwhite: spicerack: update logging-eqiad host to logging-hd1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [16:17:29] (ProbeDown) firing: (2) Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:12] (03CR) 10RLazarus: [V:03+1 C:03+2] "Shepherding this through as Cormac is OOO." 
[puppet] - 10https://gerrit.wikimedia.org/r/1015010 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [16:20:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 25%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59209 and previous config saved to /var/cache/conftool/dbconfig/20240402-162038-arnaudb.json [16:20:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 25%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59210 and previous config saved to /var/cache/conftool/dbconfig/20240402-162046-arnaudb.json [16:20:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 25%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59211 and previous config saved to /var/cache/conftool/dbconfig/20240402-162053-arnaudb.json [16:21:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59212 and previous config saved to /var/cache/conftool/dbconfig/20240402-162107-arnaudb.json [16:22:16] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1036.eqiad.wmnet with OS bookworm [16:22:27] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9681261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1036.eqiad.wmnet with... 
[16:22:29] (ProbeDown) resolved: (2) Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:35] (03Abandoned) 10Arlolra: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015462 (owner: 10PipelineBot) [16:23:37] 10ops-eqiad, 06SRE: PDU sensor over limit - https://phabricator.wikimedia.org/T361535#9681263 (10VRiley-WMF) a:03VRiley-WMF [16:25:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59213 and previous config saved to /var/cache/conftool/dbconfig/20240402-162550-arnaudb.json [16:26:08] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015448 (owner: 10PipelineBot) [16:26:14] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012664 (owner: 10PipelineBot) [16:26:19] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011431 (owner: 10PipelineBot) [16:26:24] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009955 (owner: 10PipelineBot) [16:26:33] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008915 (owner: 10PipelineBot) [16:26:38] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006197 (owner: 10PipelineBot) [16:26:45] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005527 (owner: 10PipelineBot) [16:26:49] (03Abandoned) 10Jgiannelos: 
mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003863 (owner: 10PipelineBot) [16:26:52] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002409 (owner: 10PipelineBot) [16:26:55] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/999008 (owner: 10PipelineBot) [16:26:59] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/997493 (owner: 10PipelineBot) [16:27:02] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/995361 (owner: 10PipelineBot) [16:27:55] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:28:09] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:29:34] (03CR) 10Majavah: [C:03+2] toolforge: Exclude /usr/bin/sudo from Wheel of Misfortune [puppet] - 10https://gerrit.wikimedia.org/r/1016392 (owner: 10BryanDavis) [16:32:18] (03PS1) 10Hnowlan: mobileapps: remove cidr notation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016395 [16:33:10] (03CR) 10Jgiannelos: [C:03+1] mobileapps: remove cidr notation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016395 (owner: 10Hnowlan) [16:34:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T356166)', diff saved to https://phabricator.wikimedia.org/P59214 and previous config saved to /var/cache/conftool/dbconfig/20240402-163413-marostegui.json [16:34:16] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [16:35:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 50%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59215 and previous config saved to 
/var/cache/conftool/dbconfig/20240402-163544-arnaudb.json [16:35:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 50%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59216 and previous config saved to /var/cache/conftool/dbconfig/20240402-163552-arnaudb.json [16:36:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 50%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59217 and previous config saved to /var/cache/conftool/dbconfig/20240402-163559-arnaudb.json [16:36:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 50%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59218 and previous config saved to /var/cache/conftool/dbconfig/20240402-163613-arnaudb.json [16:36:43] (03CR) 10Jgiannelos: [C:03+2] mobileapps: remove cidr notation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016395 (owner: 10Hnowlan) [16:37:25] (SystemdUnitFailed) resolved: git_pull_charts.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:30] (03CR) 10Dzahn: [C:04-1] "This is what actually shows which sites still use it:" [puppet] - 10https://gerrit.wikimedia.org/r/1014605 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [16:37:38] (03Merged) 10jenkins-bot: mobileapps: remove cidr notation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016395 (owner: 10Hnowlan) [16:38:52] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:38:59] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:40:43] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:40:51] (03PS2) 10Dzahn: miscweb: switch envoy SSL provider to cfssl 
[puppet] - 10https://gerrit.wikimedia.org/r/1014605 (https://phabricator.wikimedia.org/T360413) [16:40:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59219 and previous config saved to /var/cache/conftool/dbconfig/20240402-164055-arnaudb.json [16:42:28] (03PS1) 10Jgiannelos: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016396 [16:43:57] (03Abandoned) 10Jgiannelos: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016396 (owner: 10Jgiannelos) [16:44:43] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:45:18] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:46:14] (03PS1) 10Dzahn: httpbb: move tests for security.wikimedia.org to k8s test file [puppet] - 10https://gerrit.wikimedia.org/r/1016398 (https://phabricator.wikimedia.org/T350796) [16:47:14] (03PS2) 10Dzahn: httpbb: move tests for security.wikimedia.org to k8s test file [puppet] - 10https://gerrit.wikimedia.org/r/1016398 (https://phabricator.wikimedia.org/T350796) [16:49:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P59220 and previous config saved to /var/cache/conftool/dbconfig/20240402-164920-marostegui.json [16:50:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 75%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59221 and previous config saved to /var/cache/conftool/dbconfig/20240402-165049-arnaudb.json [16:50:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 75%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59222 and previous config saved to /var/cache/conftool/dbconfig/20240402-165058-arnaudb.json [16:51:05] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'db2219 (re)pooling @ 75%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59223 and previous config saved to /var/cache/conftool/dbconfig/20240402-165105-arnaudb.json [16:51:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59224 and previous config saved to /var/cache/conftool/dbconfig/20240402-165119-arnaudb.json [16:53:49] (03PS1) 10Dzahn: httpbb: add missing virtual hosts to legacy miscweb tests [puppet] - 10https://gerrit.wikimedia.org/r/1016399 (https://phabricator.wikimedia.org/T360413) [16:56:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59225 and previous config saved to /var/cache/conftool/dbconfig/20240402-165601-arnaudb.json [16:58:24] (03CR) 10Dzahn: [C:03+1] "[deploy1002:~] $ httpbb ./test_miscweb.yaml --hosts=miscweb1003.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/1016399 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [16:59:01] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:59:09] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:00:01] (03CR) 10Dzahn: [C:03+1] "If I was missing a host on the cert it would show in the httpbb test as "Caused by SSLError(SSLCertVerificationError"" [puppet] - 10https://gerrit.wikimedia.org/r/1016399 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1700) [17:03:47] (03CR) 10Dzahn: "cert before the change and how to check SANs on it:" [puppet] - 10https://gerrit.wikimedia.org/r/1014605 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:04:28] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P59226 and previous config saved to /var/cache/conftool/dbconfig/20240402-170427-marostegui.json [17:05:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 100%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59227 and previous config saved to /var/cache/conftool/dbconfig/20240402-170555-arnaudb.json [17:06:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 100%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59228 and previous config saved to /var/cache/conftool/dbconfig/20240402-170603-arnaudb.json [17:06:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 100%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59229 and previous config saved to /var/cache/conftool/dbconfig/20240402-170610-arnaudb.json [17:06:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P59230 and previous config saved to /var/cache/conftool/dbconfig/20240402-170625-arnaudb.json [17:13:41] !log Creating cu_useragent table on WMF wikis - T359312 [17:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:45] T359312: Create cu_useragent table - https://phabricator.wikimedia.org/T359312 [17:19:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T356166)', diff saved to https://phabricator.wikimedia.org/P59232 and previous config saved to /var/cache/conftool/dbconfig/20240402-171935-marostegui.json [17:19:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [17:19:40] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [17:19:51] !log marostegui@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [17:23:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:24:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:26:19] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logging-hd2001.codfw.wmnet with OS bookworm [17:46:16] (03CR) 10Dzahn: [C:03+2] miscweb: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014605 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:49:58] (03CR) 10Dzahn: [C:03+2] "root@miscweb2003:/# openssl x509 -noout -ext subjectAltName -in /etc/envoy/ssl/discovery__webserver-misc-apps_discovery_wmnet_server.chain" [puppet] - 10https://gerrit.wikimedia.org/r/1014605 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:51:08] (03CR) 10Dzahn: [C:03+2] "restarted envoyproxy and:" [puppet] - 10https://gerrit.wikimedia.org/r/1014605 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:53:01] (03Abandoned) 10Dzahn: etherpad: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013648 (owner: 10Dzahn) [17:53:49] (03CR) 10Dzahn: "neutral on this one - if everyone else likes it I am not against it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1015392 (owner: 10Andrew Bogott) [17:54:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 823.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:57:50] (03CR) 10Dzahn: [C:03+2] "[deploy1002:~] $ httpbb ./test_miscweb.yaml --hosts=miscweb1003.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/1014605 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:58:09] (03PS1) 10Dzahn: delete webserver-misc-* dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/1016412 (https://phabricator.wikimedia.org/T360413) [17:58:42] (03CR) 10Dzahn: [V:03+2 C:03+2] delete webserver-misc-* dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/1016412 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:58:47] (03PS2) 10Dzahn: delete webserver-misc-* dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/1016412 (https://phabricator.wikimedia.org/T360413) [17:59:08] (03CR) 10Dzahn: [V:03+2 C:03+2] delete webserver-misc-* dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/1016412 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:59:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 821.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:00:05] jnuche and jeena: Your horoscope predicts another MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T1800). [18:00:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 902.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:00:55] (03PS1) 10Dzahn: ssl delete webserver-misc-apps.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/1016413 (https://phabricator.wikimedia.org/T360413) [18:04:01] (03CR) 10Dzahn: [C:03+2] ssl delete webserver-misc-apps.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/1016413 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:04:06] (03PS2) 10Dzahn: ssl delete webserver-misc-apps.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/1016413 (https://phabricator.wikimedia.org/T360413) [18:04:30] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 810.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:09:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 898.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:10:11] (03CR) 10Dzahn: [V:03+2 C:03+2] ssl delete webserver-misc-apps.discovery.wmnet cert [puppet] - 
10https://gerrit.wikimedia.org/r/1016413 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:14:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 829.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:15:45] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9681866 (10Dzahn) [18:16:05] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9681867 (10Dzahn) [18:21:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:43] (03CR) 10Dzahn: [C:03+2] httpbb: move tests for security.wikimedia.org to k8s test file [puppet] - 10https://gerrit.wikimedia.org/r/1016398 (https://phabricator.wikimedia.org/T350796) (owner: 10Dzahn) [18:31:58] (03CR) 10Dzahn: [C:03+2] httpbb: add missing virtual hosts to legacy miscweb tests [puppet] - 10https://gerrit.wikimedia.org/r/1016399 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:32:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:11] 10ops-eqiad, 06SRE: PDU sensor over limit - https://phabricator.wikimedia.org/T361535#9681933 (10VRiley-WMF) 
Rebalanced power cords. [18:39:19] 10ops-eqiad, 06SRE: 14PDU sensor over limit - 14https://phabricator.wikimedia.org/T361535#9681934 (10VRiley-WMF) 05Open→03Resolved [18:40:38] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9681935 (10VRiley-WMF) Opened ticket with dell in order to see what they could assist with since when first contacting them, it was on the day the warranty expired. Awa... [18:41:38] (03CR) 10TheDJ: lists: Allow images from upload.wikimedia.org in CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755) (owner: 10Legoktm) [18:43:25] 10ops-codfw, 06Data-Platform-SRE: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9681937 (10bking) [18:44:16] (03PS1) 10Ebernhardson: envoy: Enable xfp: https for mw-api-into-async-ro [puppet] - 10https://gerrit.wikimedia.org/r/1016422 [18:45:17] (03PS1) 10TheDJ: Fix incorrect : in CSP statement [puppet] - 10https://gerrit.wikimedia.org/r/1016423 (https://phabricator.wikimedia.org/T353755) [18:48:22] (03CR) 10CI reject: [V:04-1] Fix incorrect : in CSP statement [puppet] - 10https://gerrit.wikimedia.org/r/1016423 (https://phabricator.wikimedia.org/T353755) (owner: 10TheDJ) [18:50:20] (03PS2) 10Ebernhardson: envoy: Enable xfp: https for mw-api-int-async-ro [puppet] - 10https://gerrit.wikimedia.org/r/1016422 [18:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:51:03] (03PS1) 10Ryan Kemper: elastic: rpl custom 3rdparty curator w deb default [puppet] - 10https://gerrit.wikimedia.org/r/1016424 (https://phabricator.wikimedia.org/T354670) [18:52:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016424 
(https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [18:52:30] (03PS3) 10Ebernhardson: envoy: Enable xfp: https for mw-api-int-async-ro [puppet] - 10https://gerrit.wikimedia.org/r/1016422 [18:57:12] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9681952 (10lmata) @bking ack! thank you for the confirmation! [18:58:00] (03PS2) 10Ryan Kemper: elastic: rpl custom 3rdparty curator w deb default [puppet] - 10https://gerrit.wikimedia.org/r/1016424 (https://phabricator.wikimedia.org/T354670) [18:58:00] (03PS1) 10Ryan Kemper: elastic: remove no-longer-needed package [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) [18:58:17] 07sre-alert-triage, 10SRE Observability (FY2023/2024-Q4): Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9681954 (10lmata) @andrea.denisse could you disable this alert please? 
[18:59:47] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016424 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:00:01] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:01:05] (03PS2) 10Ryan Kemper: elastic: remove wmf 3rd party curator [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) [19:02:48] (03CR) 10Bking: [C:03+1] elastic: rpl custom 3rdparty curator w deb default [puppet] - 10https://gerrit.wikimedia.org/r/1016424 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:03:16] (03PS1) 10Ryan Kemper: elastic: move failing host elastic2088 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1016427 (https://phabricator.wikimedia.org/T361525) [19:03:46] (03CR) 10DCausse: [C:03+1] envoy: Enable xfp: https for mw-api-int-async-ro [puppet] - 10https://gerrit.wikimedia.org/r/1016422 (owner: 10Ebernhardson) [19:04:19] (03CR) 10Bking: [C:03+1] elastic: move failing host elastic2088 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1016427 (https://phabricator.wikimedia.org/T361525) (owner: 10Ryan Kemper) [19:04:21] (03CR) 10Ryan Kemper: [C:03+2] elastic: move failing host elastic2088 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1016427 (https://phabricator.wikimedia.org/T361525) (owner: 10Ryan Kemper) [19:04:54] (03PS3) 10Ryan Kemper: elastic: rpl custom 3rdparty curator w deb default [puppet] - 10https://gerrit.wikimedia.org/r/1016424 (https://phabricator.wikimedia.org/T354670) [19:04:54] (03PS3) 10Ryan Kemper: elastic: remove wmf 3rd party curator [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) [19:07:55] (03CR) 10Ryan Kemper: [C:03+2] elastic: rpl custom 3rdparty curator w deb default [puppet] - 10https://gerrit.wikimedia.org/r/1016424 
(https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:19:01] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:21:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:42] (03CR) 10Ryan Kemper: "@moritzm (or any knowledgeable onlookers) See https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016424 for context. TLDR we have switc" [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:28:17] (03CR) 10Ryan Kemper: [C:03+2] "my terminology was a bit imprecise; the thirdparty package isn't necessarily "custom built". but for whatever reason the thirdparty packag" [puppet] - 10https://gerrit.wikimedia.org/r/1016424 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:28:34] (03PS2) 10Dwisehaupt: Enable https with apache for community civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) [19:29:04] (03CR) 10CI reject: [V:04-1] Enable https with apache for community civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:30:31] (03CR) 10Dwisehaupt: "I have updated the change to use cfssl (I think). I'm pretty sure there is something I've missed so would love to get feedback." 
[puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:31:54] (03CR) 10Ryan Kemper: "Further context that I've come across:" [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:33:08] (03CR) 10Ryan Kemper: "^ Meant to also include" [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [19:54:46] (03PS4) 10Volans: external clouds: get prefixes also from MaxMindDB [puppet] - 10https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534) [19:55:05] (03CR) 10Volans: external clouds: get prefixes also from MaxMindDB (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1016296 (https://phabricator.wikimedia.org/T303534) (owner: 10Volans) [19:55:13] (03PS2) 10Krinkle: php82-sssd: add php-yaml [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) [19:55:26] (03CR) 10Krinkle: "OK. I went for consistency instead, but either works for me." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240402T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:58] Confirmed :> [20:05:32] !log sfaci@deploy1002 Started deploy [airflow-dags/analytics@75163c7]: (no justification provided) [20:06:27] !log sfaci@deploy1002 Finished deploy [airflow-dags/analytics@75163c7]: (no justification provided) (duration: 00m 54s) [20:06:46] !log sfaci@deploy1002 Started deploy [airflow-dags/analytics@75163c7]: (no justification provided) [20:07:33] !log sfaci@deploy1002 Finished deploy [airflow-dags/analytics@75163c7]: (no justification provided) (duration: 00m 46s) [20:11:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:11:47] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Platform-SRE (2024.03.25 - 2024.04.14): 14create and deploy new Elastic Curator deb package - 14https://phabricator.wikimedia.org/T361105#9682101 (10bking) 05Open→03Resolved 14While working... [20:12:34] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9682106 (10bking) Per subtask, we no longer need to cut a custom package... 
[20:13:05] (PS2) TheDJ: Fix incorrect : in CSP statement [puppet] - https://gerrit.wikimedia.org/r/1016423 (https://phabricator.wikimedia.org/T353755) [20:13:15] !log sfaci@deploy1002 Started deploy [airflow-dags/analytics@75163c7]: (no justification provided) [20:13:31] !log sfaci@deploy1002 Finished deploy [airflow-dags/analytics@75163c7]: (no justification provided) (duration: 00m 16s) [20:17:56] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9682112 (Volans) @bking I'm not sure what do you mean. As mentioned ear... [20:19:31] (CR) Majavah: [C:+2] Fix incorrect : in CSP statement [puppet] - https://gerrit.wikimedia.org/r/1016423 (https://phabricator.wikimedia.org/T353755) (owner: TheDJ) [20:29:27] (PS1) Ahmon Dancy: logstash_checker.py: Fix _mwdeploy_query for k8s-less realm [puppet] - https://gerrit.wikimedia.org/r/1016436 [20:30:01] (CR) CI reject: [V:-1] logstash_checker.py: Fix _mwdeploy_query for k8s-less realm [puppet] - https://gerrit.wikimedia.org/r/1016436 (owner: Ahmon Dancy) [20:31:31] (PS2) Ahmon Dancy: logstash_checker.py: Fix _mwdeploy_query for k8s-less realm [puppet] - https://gerrit.wikimedia.org/r/1016436 [20:43:50] (PS1) Dzahn: langlist: add igl (Igala) project language [dns] - https://gerrit.wikimedia.org/r/1016437 (https://phabricator.wikimedia.org/T361644) [20:45:00] (CR) Dzahn: [C:+1] "https://en.wikipedia.org/wiki/Igala_language" [dns] - https://gerrit.wikimedia.org/r/1016437 (https://phabricator.wikimedia.org/T361644) (owner: Dzahn) [20:45:08] (CR) Dzahn: [C:+2] langlist: add igl (Igala) project language [dns] - https://gerrit.wikimedia.org/r/1016437 (https://phabricator.wikimedia.org/T361644) (owner: Dzahn) [20:46:37] !log DNS - added new project language 'igl' - Igala is a Yoruboid 
language, spoken by the Igala ethnic group of Nigeria (800,000 speakers) T361644 [20:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:40] T361644: Create Wikipedia Igala - https://phabricator.wikimedia.org/T361644 [20:49:06] (CR) Ebernhardson: [C:+2] cirrus: Check backfill status prior to reindexing [deployment-charts] - https://gerrit.wikimedia.org/r/1008895 (owner: Ebernhardson) [20:49:10] (CR) Ebernhardson: [C:+2] cirrus: More reliable reporting of reindexing status [deployment-charts] - https://gerrit.wikimedia.org/r/1014593 (owner: Ebernhardson) [20:50:06] (Merged) jenkins-bot: cirrus: Check backfill status prior to reindexing [deployment-charts] - https://gerrit.wikimedia.org/r/1008895 (owner: Ebernhardson) [20:50:16] (Merged) jenkins-bot: cirrus: More reliable reporting of reindexing status [deployment-charts] - https://gerrit.wikimedia.org/r/1014593 (owner: Ebernhardson) [20:55:58] (PS1) Volans: puppet: PuppetServer.destroy improvement [software/spicerack] - https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) [20:56:13] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, Puppet (Puppet 7.0): Spicerack puppetserver.destroy() raises an exception when certificate does not exist - https://phabricator.wikimedia.org/T360293#9682181 (Volans) I've sent a proposal implementation in the patch above [20:56:34] ops-codfw, Data-Platform-SRE, Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9682183 (bking) Hello DC Ops, This host is unreachable via SSH. We went ahead and shut it off from the DRAC; it's all yours if you need to send it back/replace hardware/etc. 
[20:56:43] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, Puppet (Puppet 7.0): Spicerack puppetserver.destroy() raises an exception when certificate does not exist - https://phabricator.wikimedia.org/T360293#9682184 (Volans) a:Volans [20:58:36] (PS1) Dzahn: stewards: let puppet create /srv/exports [puppet] - https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) [21:11:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:12:35] (PS1) Dzahn: stewards: puppetize steward-onboarder config file and paths [puppet] - https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) [21:14:05] SRE, Wikimedia-Mailing-lists, ContentSecurityPolicy, Patch-For-Review: Icon of daily-image-l broken by CSP - https://phabricator.wikimedia.org/T353755#9682223 (TheDJ) Open→Resolved {F44216186} [21:15:48] (CR) CI reject: [V:-1] stewards: puppetize steward-onboarder config file and paths [puppet] - https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [21:16:39] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2001.codfw.wmnet with OS bookworm [21:17:14] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4): Remove elasticsearch-curator dependency from Elastic cookbooks - https://phabricator.wikimedia.org/T361647 (bking) NEW [21:17:53] (PS2) Dzahn: stewards: puppetize steward-onboarder config file and paths [puppet] - https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) [21:19:20] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4): 
Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9682261 (bking) a:RKemper→None [21:19:57] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9682264 (bking) Oops, thank you for pointing that out. I'm discussing... [21:21:08] (CR) Peter Fischer: [C:+1] envoy: Enable xfp: https for mw-api-int-async-ro [puppet] - https://gerrit.wikimedia.org/r/1016422 (owner: Ebernhardson) [21:21:24] !log bking@cumin2002 START - Cookbook sre.ganeti.resource-report [21:21:25] !log bking@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [21:21:35] !log bking@cumin2002 START - Cookbook sre.ganeti.resource-report [21:21:36] !log bking@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [21:24:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:24:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:26:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:26:30] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:28:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:28:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:29:10] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists, Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9682286 (Dzahn) >>! In T351202#967... 
[21:30:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:30:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:31:19] (CR) Ryan Kemper: [C:+2] envoy: Enable xfp: https for mw-api-int-async-ro [puppet] - https://gerrit.wikimedia.org/r/1016422 (owner: Ebernhardson) [21:31:38] (CR) Urbanecm: "I might be missing something, but judging from how /srv/repos looks like, the export directory will not be writable. Can we set the owner " [puppet] - https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [21:31:40] !log cwhite@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host logging-hd2001.codfw.wmnet with OS bookworm [21:32:01] (PS3) Dwisehaupt: Enable https with apache for community civicrm [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) [21:32:10] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2001.codfw.wmnet with OS bookworm [21:32:13] (CR) Urbanecm: [C:+1] stewards: puppetize steward-onboarder config file and paths [puppet] - https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [21:32:20] (CR) Urbanecm: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [21:32:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:32:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:34:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:34:30] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:34:36] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:35:10] 
(CR) CI reject: [V:-1] Enable https with apache for community civicrm [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt) [21:36:15] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:36:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:36:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:38:26] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:38:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:38:34] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:38:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:40:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:40:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:41:47] (CR) Thcipriani: [C:+1] "tested and working." 
[puppet] - https://gerrit.wikimedia.org/r/1016436 (owner: Ahmon Dancy) [21:42:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:42:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:43:23] (PS1) Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T361537) [21:43:24] (PS1) Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) [21:44:03] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T361537) (owner: Andrew Bogott) [21:44:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:44:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:46:29] (CR) CI reject: [V:-1] cinder backups: move schedule config from a template into hiera [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T361537) (owner: Andrew Bogott) [21:46:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:46:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:48:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:48:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:49:28] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:49:54] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:50:01] !log pfischer@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [21:50:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:50:30] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:51:21] (PS2) Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T361537) [21:51:22] (PS2) Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) [21:51:29] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T361537) (owner: Andrew Bogott) [21:52:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:52:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:54:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:54:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:56:33] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:56:38] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:58:47] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:58:51] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:00:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:01:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:01:16] (CR) Dwisehaupt: [V:+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1779/console" [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt) [22:02:24] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2001.codfw.wmnet with reason: host reimage [22:02:26] (CR) Dwisehaupt: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1780/co" [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt) [22:03:16] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:03:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:04:20] (CR) Urbanecm: [C:-1] "actually, does not lgtm. the users.yaml file would be hard to provide this way." [puppet] - https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [22:05:11] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2001.codfw.wmnet with reason: host reimage [22:05:28] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:05:32] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:07:16] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:07:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:07:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:07:50] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:08:45] !log pfischer@deploy1002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [22:08:58] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:09:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:09:27] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:11:23] (PS3) Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) [22:11:25] (PS3) Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) [22:11:25] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:11:30] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:13:28] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:13:32] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:15:30] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:15:34] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:16:30] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists, Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9682432 (Urbanecm) >>! In T351202#... 
[22:17:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:17:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:19:14] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:19:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:21:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:21:21] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:23:19] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:23:23] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:25:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:25:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:27:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:27:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:28:32] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2001.codfw.wmnet with OS bookworm [22:29:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:29:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:31:18] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2003.codfw.wmnet with OS bookworm [22:31:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:31:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:32:17] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:33:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:33:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:34:30] (PS1) Scott French: Add a dedicated ACL for /spicerack keyspace [puppet] - https://gerrit.wikimedia.org/r/1016456 [22:34:30] (PS1) Scott French: Add support for path-level read-only mode [puppet] - https://gerrit.wikimedia.org/r/1016457 [22:34:30] (PS1) Scott French: PCC TEST: Make /spicerack read-only [puppet] - https://gerrit.wikimedia.org/r/1016458 [22:35:34] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:35:38] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:37:46] (CR) CI reject: [V:-1] Add support for path-level read-only mode [puppet] - https://gerrit.wikimedia.org/r/1016457 (owner: Scott French) [22:38:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:38:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:40:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:40:11] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:42:09] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:42:13] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:44:04] (PS2) Scott French: Add support for path-level read-only mode [puppet] - https://gerrit.wikimedia.org/r/1016457 [22:44:04] (PS2) Scott French: PCC TEST: Make /spicerack read-only [puppet] - 
https://gerrit.wikimedia.org/r/1016458 [22:45:11] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) (owner: Andrew Bogott) [22:46:37] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:46:41] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:47:39] (CR) CI reject: [V:-1] Add support for path-level read-only mode [puppet] - https://gerrit.wikimedia.org/r/1016457 (owner: Scott French) [22:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:52:14] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:52:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:52:45] (PS1) Andrew Bogott: Move some codfw1dev passwords from 'codfw' site to 'common' [labs/private] - https://gerrit.wikimedia.org/r/1016460 [22:57:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:58:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:01:01] (PS4) Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) [23:02:09] (PS3) Scott French: Add support for path-level read-only mode [puppet] - https://gerrit.wikimedia.org/r/1016457 [23:02:09] (PS3) Scott French: PCC TEST: Make /spicerack read-only [puppet] - https://gerrit.wikimedia.org/r/1016458 [23:03:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:03:47] !log @deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/cirrus-streaming-updater: apply [23:06:18] (PS2) Andrew Bogott: Move some codfw1dev passwords from 'codfw' site to 'common' [labs/private] - https://gerrit.wikimedia.org/r/1016460 [23:09:10] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:09:14] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:09:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.044s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:11:35] (PS3) Andrew Bogott: Move some codfw1dev passwords from 'codfw' site to 'common' [labs/private] - https://gerrit.wikimedia.org/r/1016460 [23:14:06] (CR) Andrew Bogott: [V:+2 C:+2] Move some codfw1dev passwords from 'codfw' site to 'common' [labs/private] - https://gerrit.wikimedia.org/r/1016460 (owner: Andrew Bogott) [23:14:55] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) (owner: Andrew Bogott) [23:15:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:15:12] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:15:51] (CR) Scott French: "PCC: https://puppet-compiler.wmflabs.org/output/1016456/1781/" [puppet] - https://gerrit.wikimedia.org/r/1016456 (owner: Scott French) [23:16:23] (CR) Scott French: "PCC: https://puppet-compiler.wmflabs.org/output/1016457/1785/" [puppet] - https://gerrit.wikimedia.org/r/1016457 (owner: Scott French) [23:16:40] (CR) Scott French: "PCC: https://puppet-compiler.wmflabs.org/output/1016458/1786/" 
[puppet] - https://gerrit.wikimedia.org/r/1016458 (owner: Scott French) [23:19:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.156s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:22:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:22:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:28:31] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logging-hd2003.codfw.wmnet with OS bookworm [23:28:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:29:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:30:08] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2003.codfw.wmnet with OS bookworm [23:36:05] (CR) Pppery: [C:-1] "Self -1 since I intend to poke this further once the parent patch is merged." 
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763) (owner: Pppery) [23:37:46] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1016371 [23:37:46] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1016371 (owner: TrainBranchBot) [23:42:12] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:42:16] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:48:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:48:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:50:06] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2002.codfw.wmnet with OS bookworm [23:50:58] SRE, Infrastructure-Foundations, serviceops-radar, Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741#9682572 (Pppery) [23:59:44] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:59:48] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:59:56] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2003.codfw.wmnet with reason: host reimage