[00:00:14] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on netflow2003.codfw.wmnet with reason: reboot netflow2003
[00:00:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on netflow2003.codfw.wmnet with reason: reboot netflow2003
[00:00:37] SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10005374 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=23e26d8b-bf98-4528-9f4f-f796eb123261) set by cmooney@cumin1002 for 0:15:00 on 1 host(s) and th...
[00:02:02] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056048 (owner: TrainBranchBot)
[00:02:05] !log cmooney@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netflow2003.codfw.wmnet
[00:02:19] SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10005377 (ops-monitoring-bot) VM netflow2003.codfw.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM
[00:05:00] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow2003.codfw.wmnet
[00:05:38] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:09:20] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:16:03] (PS1) Eevans: data-gateway: Upgrade to v1.0.10 [deployment-charts] - https://gerrit.wikimedia.org/r/1056049
[00:20:05] (CR) Eevans: [C:+2] data-gateway: Upgrade to v1.0.10 [deployment-charts] - https://gerrit.wikimedia.org/r/1056049 (owner: Eevans)
[00:20:59] (Merged) jenkins-bot: data-gateway: Upgrade to v1.0.10 [deployment-charts] - https://gerrit.wikimedia.org/r/1056049 (owner: Eevans)
[00:22:02] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply
[00:22:21] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply
[00:24:20] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:20] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:30:38] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:08:02] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.15 [core] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1056056 (https://phabricator.wikimedia.org/T366960)
[01:08:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.15 [core] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1056056 (https://phabricator.wikimedia.org/T366960) (owner: TrainBranchBot)
[01:19:20] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:19:45] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:23:04] (CR) DDesouza: [C:+2] miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1055513 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[01:24:07] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:24:09] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:24:10] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:24:11] (Merged) jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1055513 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[01:24:12] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:24:13] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:24:16] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:24:29] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:24:31] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:24:32] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:24:34] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:24:36] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:24:38] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:27:03] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:27:18] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:27:19] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:27:43] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:27:44] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:28:05] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:35:06] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.15 [core] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1056056 (https://phabricator.wikimedia.org/T366960) (owner: TrainBranchBot)
[01:53:46] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370731 (phaultfinder) NEW
[01:53:47] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370733 (phaultfinder) NEW
[01:53:48] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370735 (phaultfinder) NEW
[01:53:49] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370734 (phaultfinder) NEW
[01:55:24] (CR) RLazarus: "Thanks for doing this!" [puppet] - https://gerrit.wikimedia.org/r/1054968 (https://phabricator.wikimedia.org/T369921) (owner: Scott French)
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T0200)
[02:01:04] (PS1) Pppery: Update to 2024.19 (part 2) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1056058 (https://phabricator.wikimedia.org/T363188)
[02:31:16] SRE, Charts, serviceops, Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739 (Catrope) NEW
[02:37:31] SRE, Charts, serviceops, Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10005577 (Catrope) The Chart extension is still in early development, so this is by no means the final form of the code, but for now we have a simpl...
[02:39:20] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:44] (PS2) Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492)
[02:42:07] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott)
[02:47:12] (PS3) Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492)
[02:47:24] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott)
[02:59:20] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T0300)
[03:01:51] (PS1) TrainBranchBot: testwikis to 1.43.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1056060 (https://phabricator.wikimedia.org/T366960)
[03:01:53] (CR) TrainBranchBot: [C:+2] testwikis to 1.43.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1056060 (https://phabricator.wikimedia.org/T366960) (owner: TrainBranchBot)
[03:02:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:02:35] (Merged) jenkins-bot: testwikis to 1.43.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1056060 (https://phabricator.wikimedia.org/T366960) (owner: TrainBranchBot)
[03:03:06] !log mwpresync@deploy1002 Started scap sync-world: testwikis to 1.43.0-wmf.15 refs T366960
[03:03:10] T366960: 1.43.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T366960
[03:54:56] !log mwpresync@deploy1002 Finished scap: testwikis to 1.43.0-wmf.15 refs T366960 (duration: 51m 50s)
[03:55:04] T366960: 1.43.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T366960
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T0400)
[04:01:08] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.12 (duration: 01m 00s)
[04:14:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T367856)', diff saved to https://phabricator.wikimedia.org/P66888 and previous config saved to /var/cache/conftool/dbconfig/20240723-041442-marostegui.json
[04:14:48] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[04:29:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P66889 and previous config saved to /var/cache/conftool/dbconfig/20240723-042950-marostegui.json
[04:33:12] (PS1) Clare Ming: Add MPIC service listener proxy [puppet] - https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234)
[04:34:36] (PS14) Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234)
[04:44:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P66890 and previous config saved to /var/cache/conftool/dbconfig/20240723-044457-marostegui.json
[05:00:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T367856)', diff saved to https://phabricator.wikimedia.org/P66891 and previous config saved to /var/cache/conftool/dbconfig/20240723-050004-marostegui.json
[05:00:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2173.codfw.wmnet with reason: Maintenance
[05:00:17] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:00:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2173.codfw.wmnet with reason: Maintenance
[05:00:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2186.codfw.wmnet with reason: Maintenance
[05:00:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2186.codfw.wmnet with reason: Maintenance
[05:00:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T367856)', diff saved to https://phabricator.wikimedia.org/P66892 and previous config saved to /var/cache/conftool/dbconfig/20240723-050042-marostegui.json
[05:01:28] (PS15) Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234)
[05:05:35] (CR) Clare Ming: "thanks @cgoubert@wikimedia.org -- I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056062 << does this look correct? and ar" [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[05:15:53] (CR) Clare Ming: "@cgoubert@wikimedia.org one thing that i'm confused about is should I reference `https://mpic.svc.eqiad.wmnet:30443` anywhere? this is the" [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[05:19:05] (CR) Clare Ming: "one more link for reference:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[05:19:45] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:39:38] SRE, Charts, serviceops, Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10005664 (Legoktm) > However, the Chart extension's use case would involve shelling out to a Node.js script, which would need to install dependencie...
[05:53:59] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10005666 (phaultfinder)
[05:54:01] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10005667 (phaultfinder)
[05:54:03] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10005668 (phaultfinder)
[05:54:04] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10005669 (phaultfinder)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T0600)
[06:00:05] marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T0600). nyaa~
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:53:57] marostegui: OK to deploy cxserver? Updating staging first.
[06:54:33] kart_: go for it
[06:54:44] Thanks
[06:55:34] (CR) KartikMistry: [C:+2] Update cxserver to 2024-07-22-050142-production [deployment-charts] - https://gerrit.wikimedia.org/r/1055876 (https://phabricator.wikimedia.org/T363968) (owner: KartikMistry)
[06:56:26] (Merged) jenkins-bot: Update cxserver to 2024-07-22-050142-production [deployment-charts] - https://gerrit.wikimedia.org/r/1055876 (https://phabricator.wikimedia.org/T363968) (owner: KartikMistry)
[06:56:36] (PS1) Ayounsi: Interface validator: fix connected_endpoints type [software/netbox-extras] - https://gerrit.wikimedia.org/r/1056076 (https://phabricator.wikimedia.org/T336275)
[06:58:15] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:58:40] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:00:05] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:37] OK. I'm here.
[07:02:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:11] (PS2) KartikMistry: uzwiki: Limit publishing in CX to 'patroller' and 'sysop' groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1055653 (https://phabricator.wikimedia.org/T370387)
[07:07:06] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055653 (https://phabricator.wikimedia.org/T370387) (owner: KartikMistry)
[07:07:42] (Merged) jenkins-bot: uzwiki: Limit publishing in CX to 'patroller' and 'sysop' groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1055653 (https://phabricator.wikimedia.org/T370387) (owner: KartikMistry)
[07:08:31] !log kartik@deploy1002 Started scap sync-world: Backport for [[gerrit:1055653|uzwiki: Limit publishing in CX to 'patroller' and 'sysop' groups (T370387)]]
[07:08:35] T370387: Set wgContentTranslationPublishRequirements for uzwiki - https://phabricator.wikimedia.org/T370387
[07:15:31] !log kartik@deploy1002 kartik: Backport for [[gerrit:1055653|uzwiki: Limit publishing in CX to 'patroller' and 'sysop' groups (T370387)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:15:35] T370387: Set wgContentTranslationPublishRequirements for uzwiki - https://phabricator.wikimedia.org/T370387
[07:17:05] !log kartik@deploy1002 kartik: Continuing with sync
[07:22:08] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1055653|uzwiki: Limit publishing in CX to 'patroller' and 'sysop' groups (T370387)]] (duration: 13m 37s)
[07:22:13] T370387: Set wgContentTranslationPublishRequirements for uzwiki - https://phabricator.wikimedia.org/T370387
[07:24:51] (CR) MVernon: [C:+2] Prepare for more new-style ms-be nodes [puppet] - https://gerrit.wikimedia.org/r/1055254 (https://phabricator.wikimedia.org/T368928) (owner: MVernon)
[07:25:07] (CR) MVernon: [C:+2] Thanos: use new-style swift storage layout for forthcoming backends [puppet] - https://gerrit.wikimedia.org/r/1055255 (https://phabricator.wikimedia.org/T368445) (owner: MVernon)
[07:25:26] (PS2) MVernon: Thanos: use new-style swift storage layout for forthcoming backends [puppet] - https://gerrit.wikimedia.org/r/1055255 (https://phabricator.wikimedia.org/T368445)
[07:27:11] (CR) MVernon: [V:+2 C:+2] Thanos: use new-style swift storage layout for forthcoming backends [puppet] - https://gerrit.wikimedia.org/r/1055255 (https://phabricator.wikimedia.org/T368445) (owner: MVernon)
[07:30:29] (CR) Brouberol: [C:+2] Create a new chart for growbook using scaffolding. [deployment-charts] - https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: Btullis)
[07:32:13] (CR) Cathal Mooney: [C:+1] "nice!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1056076 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[07:32:13] (PS1) Brouberol: growthbook: define helmfile and production values [deployment-charts] - https://gerrit.wikimedia.org/r/1056078 (https://phabricator.wikimedia.org/T365839)
[07:35:04] (CR) Ayounsi: [C:+2] Interface validator: fix connected_endpoints type [software/netbox-extras] - https://gerrit.wikimedia.org/r/1056076 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[07:36:03] (Merged) jenkins-bot: Interface validator: fix connected_endpoints type [software/netbox-extras] - https://gerrit.wikimedia.org/r/1056076 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[07:43:57] (CR) DCausse: [C:+2] team-search-platform: migrate cirrus_cluster_checks [alerts] - https://gerrit.wikimedia.org/r/1054317 (https://phabricator.wikimedia.org/T359033) (owner: DCausse)
[07:45:42] (Merged) jenkins-bot: team-search-platform: migrate cirrus_cluster_checks [alerts] - https://gerrit.wikimedia.org/r/1054317 (https://phabricator.wikimedia.org/T359033) (owner: DCausse)
[07:48:14] (CR) Ayounsi: [C:+1] "some nits but overall lgtm" [puppet] - https://gerrit.wikimedia.org/r/1056026 (https://phabricator.wikimedia.org/T370164) (owner: Cathal Mooney)
[07:50:45] (CR) DCausse: [C:+1] "checked the latency thresholds against the incident yesterday and they appear that they would have captured the incident" [alerts] - https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) (owner: DCausse)
[07:53:22] (CR) DCausse: [C:-1] "might not be necessary if we're OK waiting for private wiki support, the sole wiki unsupported will remain wikitech" [mediawiki-config] - https://gerrit.wikimedia.org/r/1052135 (owner: DCausse)
[07:58:10] (CR) Ayounsi: [C:+1] Widen netmak for allowed in BGP prefixes codfw frack [homer/public] - https://gerrit.wikimedia.org/r/1056029 (https://phabricator.wikimedia.org/T370164) (owner: Cathal Mooney)
[07:59:23] (CR) Kosta Harlan: [C:+1] Define wgGlobalBlockingCentralWiki as 'metawiki' [mediawiki-config] - https://gerrit.wikimedia.org/r/1056014 (https://phabricator.wikimedia.org/T370457) (owner: Dreamy Jazz)
[08:08:03] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[08:08:44] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1055496 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[08:09:51] (CR) Stevemunene: [C:+1] "nit: some typos on the commit message" [deployment-charts] - https://gerrit.wikimedia.org/r/1056078 (https://phabricator.wikimedia.org/T365839) (owner: Brouberol)
[08:12:00] (CR) Volans: [C:-1] "It would unfortunately not work this way, see inline for details" [software/spicerack] - https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: Scott French)
[08:12:26] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott)
[08:13:10] (PS2) Brouberol: growthbook: define helmfile and production values [deployment-charts] - https://gerrit.wikimedia.org/r/1056078 (https://phabricator.wikimedia.org/T365839)
[08:14:14] (CR) Brouberol: [C:+2] growthbook: define helmfile and production values [deployment-charts] - https://gerrit.wikimedia.org/r/1056078 (https://phabricator.wikimedia.org/T365839) (owner: Brouberol)
[08:16:14] (CR) Hoo man: [C:+1] Enable mul language code on Wikidata (limited mode) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: Lucas Werkmeister (WMDE))
[08:17:31] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[08:20:32] (PS1) Arturo Borrero Gonzalez: openstack: opentofu: init modules before runnig plan [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037)
[08:20:49] (PS2) Arturo Borrero Gonzalez: openstack: opentofu: init modules before runnig plan [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037)
[08:24:13] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[08:24:25] (CR) CI reject: [V:-1] openstack: opentofu: init modules before runnig plan [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[08:24:56] (PS3) Arturo Borrero Gonzalez: openstack: opentofu: init modules before runnig plan [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037)
[08:25:02] (PS6) Elukey: profile::puppetmaster::frontend: allow puppetservers via ssh [puppet] - https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023)
[08:26:02] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3380/co" [puppet] - https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:26:45] (CR) Elukey: [V:+1] profile::puppetmaster::frontend: allow puppetservers via ssh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:26:50] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[08:27:35] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[08:38:22] (PS4) Arturo Borrero Gonzalez: openstack: opentofu: init modules before runnig plan [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037)
[08:38:30] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[08:41:33] (PS1) Brouberol: growthbook: small fixes to the chart [deployment-charts] - https://gerrit.wikimedia.org/r/1056118 (https://phabricator.wikimedia.org/T365839)
[08:49:19] (PS5) Arturo Borrero Gonzalez: openstack: opentofu: init modules before runnig plan [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037)
[08:49:28] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[08:58:22] (CR) Jelto: [C:+2] etherpad: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055496 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[08:58:40] jouncebot: nowandnext
[08:58:40] No deployments scheduled for the next 1 hour(s) and 1 minute(s)
[08:58:40] In 1 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1000)
[08:59:38] Going to do a no-op deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1056014
[08:59:49] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1056014 (https://phabricator.wikimedia.org/T370457) (owner: Dreamy Jazz)
[09:00:11] (CR) FNegri: openstack: opentofu: init modules before runnig plan (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[09:00:44] (Merged) jenkins-bot: Define wgGlobalBlockingCentralWiki as 'metawiki' [mediawiki-config] - https://gerrit.wikimedia.org/r/1056014 (https://phabricator.wikimedia.org/T370457) (owner: Dreamy Jazz)
[09:01:17] !log dreamyjazz@deploy1002 Started scap sync-world: Backport for [[gerrit:1056014|Define wgGlobalBlockingCentralWiki as 'metawiki' (T370457)]]
[09:01:21] T370457: Add global block log link to the global block message on Special:Contributions - https://phabricator.wikimedia.org/T370457
[09:02:19] (CR) Stevemunene: [C:+1] growthbook: small fixes to the chart [deployment-charts] - https://gerrit.wikimedia.org/r/1056118 (https://phabricator.wikimedia.org/T365839) (owner: Brouberol)
[09:02:30] (CR) Brouberol: [C:+2] growthbook: small fixes to the chart [deployment-charts] - https://gerrit.wikimedia.org/r/1056118 (https://phabricator.wikimedia.org/T365839) (owner: Brouberol)
[09:05:35] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[09:07:47] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1056014|Define wgGlobalBlockingCentralWiki as 'metawiki' (T370457)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:07:51] T370457: Add global block log link to the global block message on Special:Contributions - https://phabricator.wikimedia.org/T370457
[09:07:54] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync
[09:12:47] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1056014|Define wgGlobalBlockingCentralWiki as 'metawiki' (T370457)]] (duration: 11m 29s)
[09:13:02] Finished my deploy
[09:14:58] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[09:15:54] (CR) Cathal Mooney: [C:+2] Widen netmak for allowed in BGP prefixes codfw frack [homer/public] - https://gerrit.wikimedia.org/r/1056029 (https://phabricator.wikimedia.org/T370164) (owner: Cathal Mooney)
[09:16:24] (Merged) jenkins-bot: Widen netmak for allowed in BGP prefixes codfw frack [homer/public] - https://gerrit.wikimedia.org/r/1056029 (https://phabricator.wikimedia.org/T370164) (owner: Cathal Mooney)
[09:18:08] SRE, Charts, serviceops, Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10005968 (akosiaris) What @Legoktm suggested. If you have already a JSON input for that command and expect back an SVG (it looks this way judging fro...
[09:18:12] (PS5) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[09:19:06] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3381/co" [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[09:19:45] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:21:03] (PS2) Cathal Mooney: Add new mgmt range for frack codfw to network defs [puppet] - https://gerrit.wikimedia.org/r/1056026 (https://phabricator.wikimedia.org/T370164)
[09:21:41] (Restored) Hashar: tox: pin style dependencies to avoid CI failures [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1043748 (owner: Hashar)
[09:21:56] (PS6) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[09:22:26] (CR) Hashar: "I am restoring this patch since that has hit me locally due to pip installing whatever version I had in my cache :)" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1043748 (owner: Hashar)
[09:22:47] (CR) Cathal Mooney: Add new mgmt range for frack codfw to network defs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1056026 (https://phabricator.wikimedia.org/T370164) (owner: Cathal Mooney)
[09:22:51] (CR) Ayounsi: [C:+1] Add monitoring definitions for new codfw row C/D switches [puppet] - https://gerrit.wikimedia.org/r/1056031 (https://phabricator.wikimedia.org/T369106) (owner: Cathal Mooney)
[09:23:18] (CR) Elukey: admin: add dcops to the system adm POSIX group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[09:23:33] (PS1) Brouberol: growthbook: bind ferretdb service to 0.0.0.0 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/1056123 (https://phabricator.wikimedia.org/T365839)
[09:25:04] (CR) Cathal Mooney: [C:+2] Add monitoring definitions for new codfw row C/D switches [puppet] - https://gerrit.wikimedia.org/r/1056031 (https://phabricator.wikimedia.org/T369106) (owner: Cathal Mooney)
[09:26:27] (PS6) Arturo Borrero Gonzalez: openstack: opentofu: init modules before runnig plan [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037)
[09:29:35] (CR) FNegri: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[09:31:51] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[09:32:08] (CR) Arturo Borrero Gonzalez: openstack: opentofu: init modules before runnig plan (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: Arturo Borrero Gonzalez)
[09:32:26] (CR) Alexandros Kosiaris: [C:-1] Add MPIC service listener proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[09:33:28] (CR) Cathal Mooney: [C:+2] Add new mgmt range for frack codfw to network defs [puppet] - https://gerrit.wikimedia.org/r/1056026 (https://phabricator.wikimedia.org/T370164) (owner: Cathal Mooney)
[09:34:18] (CR) Ayounsi: [C:+1] profile::puppetmaster::frontend: allow puppetservers via ssh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[09:34:23] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: opentofu: init modules before runnig plan [puppet] - 10https://gerrit.wikimedia.org/r/1056117 (https://phabricator.wikimedia.org/T370037) (owner: 10Arturo Borrero Gonzalez) [09:35:09] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:35:18] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:39:06] (03CR) 10Stevemunene: [C:03+1] growthbook: bind ferretdb service to 0.0.0.0 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056123 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [09:39:37] (03CR) 10Brouberol: [C:03+2] growthbook: bind ferretdb service to 0.0.0.0 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056123 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [09:39:56] (03CR) 10Cathal Mooney: [C:03+1] "Awesome, LGTM thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1055543 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [09:40:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:41:09] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:41:17] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:44:47] (03CR) 10Elukey: [V:03+1] profile::puppetmaster::frontend: allow puppetservers via ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [09:45:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:47:41] (03PS10) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [09:48:48] (03CR) 10CI reject: [V:04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [09:50:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:52:52] (03PS1) 
10Brouberol: growthbook: add mesh service, configuration, container and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056126 (https://phabricator.wikimedia.org/T365839) [09:53:04] (03CR) 10Clément Goubert: "It would be indirectly referenced in the service definition @akosiaris@wikimedia.org mentioned in the linked patchset. For consistency, I " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [09:55:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:56:26] (03CR) 10Kamila Součková: [C:03+2] Revert "benthos/mw_accesslog_metrics: Add buffer" [puppet] - 10https://gerrit.wikimedia.org/r/1055399 (owner: 10Kamila Součková) [09:56:29] (03PS2) 10Brouberol: growthbook: add mesh service, configuration, container and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056126 (https://phabricator.wikimedia.org/T365839) [09:57:48] (03CR) 10Clément Goubert: Add MPIC service listener proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [09:58:29] (03PS3) 10Brouberol: growthbook: add mesh service, configuration, container and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056126 (https://phabricator.wikimedia.org/T365839) [09:58:45] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10006113 (10phaultfinder) [09:58:48] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10006112 (10phaultfinder) [09:58:51] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - 
https://phabricator.wikimedia.org/T370736#10006114 (10phaultfinder) [09:58:52] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10006115 (10phaultfinder) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1000) [10:00:14] (03CR) 10Stevemunene: [C:03+1] growthbook: add mesh service, configuration, container and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056126 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [10:00:20] (03CR) 10Brouberol: [C:03+2] growthbook: add mesh service, configuration, container and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056126 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [10:02:25] RESOLVED: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:02:59] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:03:07] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:03:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:03:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:08:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:08:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:09:01] !hmm [10:09:41] databases :/ [10:10:26] s4 writes [10:11:27] Amir1: how can I tell what cluster27/28/29 are wrt sections? [10:11:45] let me get you [10:11:47] that's es [10:12:06] well it's complaining :p [10:12:15] 06SRE, 10MW-on-K8s, 10Observability-Logging, 06serviceops: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076#10006143 (10kamila) FTR, I have reverted the buffer patch, as it shouldn't be necessary now that we have more partitions thanks to T3692... 
[10:12:29] looks like a big spike of writes https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s4&var-role=All&from=now-3h&to=now [10:13:01] hnowlan: yep, first sustained high s4 writes, then a spike to es [10:13:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:13:38] claime: check src/etcd.php in mw config repo [10:14:30] Amir1: ack [10:14:53] I'm not seeing a jump on master [10:14:53] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-3h&to=now&var-job=All&var-server=db1238&var-port=9104&viewPanel=2&refresh=1m [10:15:12] it might be large queries (instead of lots of small ones) [10:15:16] (03PS1) 10Elukey: Revert^2 "Netbox 4: point prod service to new servers" [puppet] - 10https://gerrit.wikimedia.org/r/1056130 [10:16:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:17:15] claime: in general https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/src/etcd.php#81 [10:20:18] https://usercontent.irccloud-cdn.com/file/45xUsR82/grafik.png [10:21:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:21:26] (03CR) 10Ayounsi: [C:03+1] Revert^2 "Netbox 4: point prod service to new servers" 
[puppet] - 10https://gerrit.wikimedia.org/r/1056130 (owner: 10Elukey) [10:22:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:22:51] I got the binlogs during the high reads, it's /home/ladsgroup/binlog_cry in db1238 [10:24:11] number-wise linter is still the largest [10:24:23] (one third) [10:24:38] https://www.irccloud.com/pastebin/Dvz6ZeKk/ [10:24:59] but let me check for large transactions [10:26:30] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:27:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:29:15] Should I open a status page incident? 
[10:29:37] It is user facing, but not in a major way [10:29:52] I haven't found any specific traffic pattern so far [10:30:01] volans: me neither [10:31:22] (03CR) 10Elukey: [C:03+2] Revert^2 "Netbox 4: point prod service to new servers" [puppet] - 10https://gerrit.wikimedia.org/r/1056130 (owner: 10Elukey) [10:31:39] there are no large transactions from what I'm seeing; there are 238 transactions that took 1s to exec, but that's not really an issue [10:31:48] nothing 2s or more [10:34:28] https://www.irccloud.com/pastebin/oLVd6owi/ [10:34:32] holy shit [10:34:51] dayum [10:35:30] still one third of writes is linter; it's now just delete ops, and as you can see it'll take a while to go through them [10:35:56] I can start manually dropping them so it ends sooner [10:36:21] (large batches are more effective) [10:37:00] It's calmed down for now as far as user-facing errors go, but is this going to keep happening or is it temporary for the backfill of linter data? [10:37:23] Because if it's going to keep happening, I fear it's completely unsustainable [10:39:11] !log Cordoning kubernetes1025.eqiad.wmnet kubernetes1026.eqiad.wmnet kubernetes1052.eqiad.wmnet kubernetes1053.eqiad.wmnet kubernetes1054.eqiad.wmnet kubernetes1055.eqiad.wmnet kubernetes1056.eqiad.wmnet mw1496.eqiad.wmnet for T365998 [10:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:15] T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998 [10:41:24] claime: the lint writes might be coming from deferred updates or something else, but if they aren't, we can reduce the linter job concurrency [10:42:49] (03PS1) 10Giuseppe Lavagetto: puppet-lint.rc: add specification of our indentation [puppet] - 10https://gerrit.wikimedia.org/r/1056131 [10:43:24] (03CR) 10CI reject: [V:04-1] puppet-lint.rc: add specification of our indentation [puppet] - 10https://gerrit.wikimedia.org/r/1056131 (owner: 10Giuseppe Lavagetto) 
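A minimal sketch of the batched cleanup idea raised here: dropping rows in fixed-size batches with a pause between batches, presumably to limit write load while the backlog of deletes is worked through. The helper name `delete_in_batches`, the `run_batch` callback, and the default pause are hypothetical and not taken from the channel. `run_batch` is assumed to issue one size-limited delete (for example a `DELETE ... LIMIT 1000` statement) and to return the number of rows it removed.

```python
import time
from typing import Callable, Optional


def delete_in_batches(
    run_batch: Callable[[int], int],
    batch_size: int = 1000,
    pause_s: float = 1.0,
    max_batches: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run size-limited deletes until a batch comes back short.

    run_batch(batch_size) must delete at most batch_size rows and
    return how many rows were actually deleted (hypothetical callback).
    """
    total = 0
    batches = 0
    while max_batches is None or batches < max_batches:
        deleted = run_batch(batch_size)
        total += deleted
        batches += 1
        # A short batch means nothing (or almost nothing) is left to delete.
        if deleted < batch_size:
            break
        # Pause between batches so other writes are not starved.
        sleep(pause_s)
    return total


class FakeTable:
    """Stand-in for a real table, used only to exercise the loop."""

    def __init__(self, rows: int) -> None:
        self.rows = rows

    def delete_batch(self, limit: int) -> int:
        n = min(limit, self.rows)
        self.rows -= n
        return n


if __name__ == "__main__":
    table = FakeTable(2500)
    removed = delete_in_batches(table.delete_batch, batch_size=1000, pause_s=0.0)
    print(removed, table.rows)
```

With a real database, `run_batch` would wrap the client library's execute call and return its affected-row count; the batch size and pause are tuning knobs to trade cleanup speed against load.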
[10:43:32] Amir1: let me check what the linter jobs say in jobqueue [10:44:38] Amir1: I wouldn't be surprised if this aligned with the issues btw https://grafana.wikimedia.org/goto/disdKbuIg?orgId=1 [10:45:03] I want to cry [10:45:22] *hugs* [10:45:26] average 800 per sec? [10:45:27] how [10:47:01] (03PS1) 10FNegri: Don't use proxy for wikitech-static [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1056132 [10:47:28] I'm going to start running "delete from linter where linter_cat = 23 limit 1000;" in a loop in mwmaint [10:47:39] (with sleep obviously) [10:51:02] !log running "delete from linter where linter_cat = 23 limit 1000;" in a loop in mwmaint (T370304) [10:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:07] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304 [10:51:33] It's got 100 concurrency but outside spikes we're not even there [10:52:35] but yeah, average 800 jobs/s over the last 7 days [10:52:55] (both insertion and processing) [10:53:54] that's not entirely outside of the historical trend [10:54:16] no :/ [10:54:26] those spikes are odd, but are mostly preceded by a drop to 0 [10:54:36] one part of the problem might be that it's running on commons mostly [10:54:51] so it's focused on doing a lot of writes on commons for something that's global [10:54:58] the linter job has become more expensive recently on the jobrunner end [10:55:04] sharding it might help [10:55:34] sorry no, I am wrong on that one, that's refreshlinks [10:58:07] (03CR) 10Volans: "might be good to add it, I have it in my local config." 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1056132 (owner: 10FNegri) [10:58:56] There's more jobs than historically though, average on now-30d to now-15d is 650, last 7d average is 821 [10:59:12] (processed per s) [11:00:26] (03PS2) 10Giuseppe Lavagetto: puppet-lint.rc: add specification of our indentation [puppet] - 10https://gerrit.wikimedia.org/r/1056131 [11:02:23] half a million is dropped, gonna take a day or two to get rid of all of them [11:02:37] Amir1: <3 [11:02:59] I can drop the concurrency, do we care much about it being potentially backlogged? [11:03:38] nope [11:03:41] (03PS3) 10Effie Mouzeli: app.job: update module (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049571 (https://phabricator.wikimedia.org/T356885) [11:03:52] +1 [11:04:15] sharding is a more elegant solution IMHO but it's a bit of work [11:04:41] I'd say drop it by 10 or 20 to start and see how the backlog does [11:05:05] are there any scripts causing these big spikes? [11:05:58] (03PS4) 10Effie Mouzeli: cronjobs : update modules (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049571 (https://phabricator.wikimedia.org/T356885) [11:06:12] I'd assume this is behind this? 
https://phabricator.wikimedia.org/T367417 (h/t hnowlan) [11:06:26] (03CR) 10Jbond: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1056131 (owner: 10Giuseppe Lavagetto) [11:07:11] templatelinks runs happen all the time, they basically reparse the whole wiki every month or so [11:07:16] (03PS1) 10Clément Goubert: jobqueue: Lower concurrency for RecordLint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056133 (https://phabricator.wikimedia.org/T370304) [11:07:23] there is always a widely used template that has changed [11:07:37] (03PS2) 10Clément Goubert: jobqueue: Lower concurrency for RecordLint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056133 (https://phabricator.wikimedia.org/T370304) [11:08:22] yeah last time there was a risky linting spike it was a huge template change, was just about to check that [11:08:33] hnowlan: ha, I dropped it to 50 to start x) [11:08:48] I'll put it back up to 80 [11:09:01] claime: shrug, let's see how bad that does [11:09:03] it's a queue after all [11:09:22] (03CR) 10Hnowlan: [C:03+1] jobqueue: Lower concurrency for RecordLint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056133 (https://phabricator.wikimedia.org/T370304) (owner: 10Clément Goubert) [11:09:27] but I think 50 will be too low [11:09:38] but we can work back up [11:09:54] (03CR) 10Clément Goubert: [C:03+2] jobqueue: Lower concurrency for RecordLint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056133 (https://phabricator.wikimedia.org/T370304) (owner: 10Clément Goubert) [11:11:03] (03Merged) 10jenkins-bot: jobqueue: Lower concurrency for RecordLint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056133 (https://phabricator.wikimedia.org/T370304) (owner: 10Clément Goubert) [11:13:02] even if it gets backed up for days, it's totally fine [11:14:07] it's not anything critical [11:15:47] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:16:03] my kingdom for an 
exhaustive list of jobs, what they do and their criticality [11:16:07] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:17:14] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:17:43] (03PS1) 10Elukey: Revert^3 "Netbox 4: point prod service to new servers" [puppet] - 10https://gerrit.wikimedia.org/r/1056134 [11:18:24] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:18:40] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:18:44] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006345 (10cmooney) So, we hit a bit of a speed-bump in codfw with the gnmic stats once the new switches were made live there. We now have 36 active gnmic subscriptions... [11:19:08] !log Lowered concurrency of RecordLint job to 50 - T370304 [11:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:12] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. 
- https://phabricator.wikimedia.org/T370304 [11:19:38] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:19:44] (03PS1) 10Cathal Mooney: Tweak gnmic parameters to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1056136 (https://phabricator.wikimedia.org/T369384) [11:20:02] (03PS2) 10FNegri: Don't use proxy for wikitech-static [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1056132 [11:20:13] (03CR) 10FNegri: Don't use proxy for wikitech-static (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1056132 (owner: 10FNegri) [11:20:51] I'm going through binlog of db1199 in case there is a fast transaction that writes a lot of rows [11:21:00] (03PS2) 10Cathal Mooney: Tweak gnmic parameters to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1056136 (https://phabricator.wikimedia.org/T369384) [11:23:44] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1056134 (owner: 10Elukey) [11:24:20] (03CR) 10Ayounsi: [C:03+1] Revert^3 "Netbox 4: point prod service to new servers" [puppet] - 10https://gerrit.wikimedia.org/r/1056134 (owner: 10Elukey) [11:25:58] (03CR) 10Elukey: [C:03+2] Revert^3 "Netbox 4: point prod service to new servers" [puppet] - 10https://gerrit.wikimedia.org/r/1056134 (owner: 10Elukey) [11:26:48] (03PS1) 10Brouberol: growthbook: split chart into 2 charts (frontend/backend) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056137 (https://phabricator.wikimedia.org/T365839) [11:27:29] (03CR) 10CI reject: [V:04-1] growthbook: split chart into 2 charts (frontend/backend) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056137 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [11:28:19] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: 
name=(kubernetes1025|kubernetes1026|kubernetes1052|kubernetes1053|kubernetes1054|kubernetes1055|kubernetes1056|mw1496).eqiad.wmnet,cluster=kubernetes,service=kubesvc [11:38:02] (03CR) 10Ayounsi: [C:03+1] Tweak gnmic parameters to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1056136 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [11:39:19] it changed absolutely nothing to the processing rate, backlog raised up a bit but is regressing to what it was before [11:39:50] (03PS2) 10Volans: Filter out NaN data from Prometheus [software/statograph] - 10https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) [11:41:33] (03CR) 10CI reject: [V:04-1] Filter out NaN data from Prometheus [software/statograph] - 10https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) (owner: 10Volans) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1200) [12:04:15] (03PS1) 10Alexandros Kosiaris: imagecatalog: Vary gunicorn package on Debian version [puppet] - 10https://gerrit.wikimedia.org/r/1056144 (https://phabricator.wikimedia.org/T364417) [12:10:31] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056144 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [12:12:16] (03PS2) 10Alexandros Kosiaris: imagecatalog: Vary gunicorn package on Debian version [puppet] - 10https://gerrit.wikimedia.org/r/1056144 (https://phabricator.wikimedia.org/T364417) [12:13:12] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056144 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [12:14:05] (03CR) 10Filippo Giunchedi: [C:03+1] Tweak gnmic parameters to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1056136 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [12:18:19] 
(03CR) 10Cathal Mooney: [C:03+2] Tweak gnmic parameters to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1056136 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [12:21:35] Insertion rate of RecordLint jobs: 1k/s [12:21:36] * claime braces [12:23:04] 20kw/s to s6 [12:25:48] (03PS1) 10Kosta Harlan: AbuseFilter: Enable showcaptcha consequence everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) [12:26:03] (03PS4) 10Brouberol: growthbook: split chart into 2 charts (frontend/backend) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056137 (https://phabricator.wikimedia.org/T365839) [12:26:15] (03CR) 10Kosta Harlan: [C:04-2] "Pending QA." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [12:26:53] (03CR) 10Alexandros Kosiaris: [C:03+2] imagecatalog: Vary gunicorn package on Debian version [puppet] - 10https://gerrit.wikimedia.org/r/1056144 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [12:27:53] (03PS1) 10Brouberol: growthbook: provision DNS records for each backend/frontend service [dns] - 10https://gerrit.wikimedia.org/r/1056147 (https://phabricator.wikimedia.org/T365839) [12:28:23] (03PS2) 10Brouberol: growthbook: provision DNS records for each backend/frontend service [dns] - 10https://gerrit.wikimedia.org/r/1056147 (https://phabricator.wikimedia.org/T365839) [12:32:55] (03PS5) 10Brouberol: growthbook: split chart into 2 charts (frontend/backend) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056137 (https://phabricator.wikimedia.org/T365839) [12:36:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:21] (03PS6) 
10Brouberol: growthbook: split chart into 2 charts (frontend/backend) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056137 (https://phabricator.wikimedia.org/T365839) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1300) [13:00:05] Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] o/ [13:00:15] o/ [13:00:43] o/ [13:01:58] I can deploy! [13:03:19] (03PS3) 10Daimona Eaytoy: [arwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055206 (https://phabricator.wikimedia.org/T370066) [13:03:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055206 (https://phabricator.wikimedia.org/T370066) (owner: 10Daimona Eaytoy) [13:03:41] hmmm [13:03:41] “Wikimedia\Rdbms\Platform\SQLPlatform::makeList: empty input for field wbtl_type_id” [13:04:33] (03Merged) 10jenkins-bot: [arwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055206 (https://phabricator.wikimedia.org/T370066) (owner: 10Daimona Eaytoy) [13:05:03] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1055206|[arwiki] Enable the CampaignEvents extension (T370066)]] [13:05:10] T370066: Release CampaignEvents extension to Arabic Wikipedia - https://phabricator.wikimedia.org/T370066 [13:05:17] !log Cordoning dse-k8s-worker1008.eqiad.wmnet for T365998 [13:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:22] T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998 
[13:05:40] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=dse-k8s-worker1008.eqiad.wmnet,cluster=dse-k8s,service=kubesvc [13:06:22] Daimona: any maintenance script or something needed for CampaignEvents? [13:06:41] (filed T370769 for the error I mentioned above btw, unrelated) [13:06:42] T370769: InvalidArgumentException: Wikimedia\Rdbms\Platform\SQLPlatform::makeList: empty input for field wbtl_type_id - https://phabricator.wikimedia.org/T370769 [13:07:41] Nope, no scripts or anything [13:10:14] ok [13:10:39] ah, one of the testserver checks failed, that’s why scap seemed to take longer than usual [13:10:43] * Lucas_WMDE wasn’t looking at that screen for a moment [13:10:45] * Lucas_WMDE retries [13:10:50] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1055206|[arwiki] Enable the CampaignEvents extension (T370066)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:55] T370066: Release CampaignEvents extension to Arabic Wikipedia - https://phabricator.wikimedia.org/T370066 [13:10:56] ok, now it worked [13:11:00] Daimona: please test :) [13:11:28] !log cmooney@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netflow1002.eqiad.wmnet [13:11:50] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006695 (10ops-monitoring-bot) VM netflow1002.eqiad.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:12:43] hm, sudden spike of “Call to a member function audienceCan() on null” in logspam-watch [13:12:46] * Lucas_WMDE looks which wiki that’s on [13:13:07] ok, not arwiki, probably not related then [13:13:15] (wikidatawiki, in fact) [13:15:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow1002.eqiad.wmnet [13:16:24] !log cmooney@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netflow3003.esams.wmnet [13:16:50] 
06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006699 (10ops-monitoring-bot) VM netflow3003.esams.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:19:24] Lucas_WMDE looks good! [13:19:30] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, daimona: Continuing with sync [13:19:43] ok the audienceCan just seems to be due to replication lag, cirrusbuilddoc trying to get a revision before it’s visible [13:19:45] Daimona: thanks! [13:20:24] (03PS1) 10Lucas Werkmeister (WMDE): MoveLogFormatter::getPreloadTitles: Handle bad titles [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056155 (https://phabricator.wikimedia.org/T370396) [13:20:27] (03PS1) 10Alexandros Kosiaris: imagecatalog: Force uid/gid for imagecatalog user [puppet] - 10https://gerrit.wikimedia.org/r/1056156 (https://phabricator.wikimedia.org/T364417) [13:20:37] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "backporting" [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056155 (https://phabricator.wikimedia.org/T370396) (owner: 10Lucas Werkmeister (WMDE)) [13:20:42] I’ll deploy ^ afterwards [13:21:30] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056156 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:22:25] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow3003.esams.wmnet [13:23:44] !log cmooney@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netflow4002.ulsfo.wmnet [13:24:06] !log cmooney@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netflow5002.eqsin.wmnet [13:24:10] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006727 (10ops-monitoring-bot) VM netflow4002.ulsfo.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM 
[13:24:21] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1055206|[arwiki] Enable the CampaignEvents extension (T370066)]] (duration: 19m 17s) [13:24:26] T370066: Release CampaignEvents extension to Arabic Wikipedia - https://phabricator.wikimedia.org/T370066 [13:24:27] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006728 (10ops-monitoring-bot) VM netflow5002.eqsin.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:26:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056155 (https://phabricator.wikimedia.org/T370396) (owner: 10Lucas Werkmeister (WMDE)) [13:27:37] (03CR) 10Alexandros Kosiaris: [C:03+2] imagecatalog: Force uid/gid for imagecatalog user [puppet] - 10https://gerrit.wikimedia.org/r/1056156 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:29:05] (03CR) 10Stevemunene: [C:03+1] growthbook: provision DNS records for each backend/frontend service [dns] - 10https://gerrit.wikimedia.org/r/1056147 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [13:29:12] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow4002.ulsfo.wmnet [13:29:40] (03CR) 10Brouberol: [C:03+2] growthbook: provision DNS records for each backend/frontend service [dns] - 10https://gerrit.wikimedia.org/r/1056147 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [13:30:29] !log cmooney@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netflow6001.drmrs.wmnet [13:30:47] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006766 (10ops-monitoring-bot) VM netflow6001.drmrs.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:31:48] !log 
cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow5002.eqsin.wmnet [13:31:57] (03PS1) 10Alexandros Kosiaris: imagecatalog: Actually comment about the uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/1056157 (https://phabricator.wikimedia.org/T364417) [13:32:47] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: dns6001: reduce anycast_hc logging level and backups [puppet] - 10https://gerrit.wikimedia.org/r/1056000 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [13:33:04] (03CR) 10Alexandros Kosiaris: [C:03+2] imagecatalog: Actually comment about the uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/1056157 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:33:21] (03PS5) 10Clément Goubert: Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) [13:33:50] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006779 (10cmooney) In Eqiad our netflow VM was also running a little hot, and swapping to disk. I've now increased the resources for it and also the other netflow VMs i... 
[13:34:22] (03CR) 10CI reject: [V:04-1] Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:34:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow6001.drmrs.wmnet [13:34:31] !log cmooney@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netflow7001.magru.wmnet [13:34:50] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006783 (10ops-monitoring-bot) VM netflow7001.magru.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:36:45] (03PS1) 10Alexandros Kosiaris: imagecatalog: Remove require for data directory [puppet] - 10https://gerrit.wikimedia.org/r/1056158 (https://phabricator.wikimedia.org/T364417) [13:37:22] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org [reason: upgrading anycast-hc: T370068] [13:37:27] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [13:37:36] (03PS6) 10Clément Goubert: Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) [13:37:43] (03CR) 10Alexandros Kosiaris: [C:03+2] imagecatalog: Remove require for data directory [puppet] - 10https://gerrit.wikimedia.org/r/1056158 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:37:45] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056137 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [13:37:47] 06SRE, 06Infrastructure-Foundations, 10netops: Set Leaf switches in Codfw rows C & D to active and make new vlans live - https://phabricator.wikimedia.org/T370629#10006786 (10cmooney) 05Open→03Resolved All actions complete. 
@papaul, @Jhancock.wm please note that after this change if running the netb... [13:38:08] (03CR) 10Brouberol: [C:03+2] growthbook: split chart into 2 charts (frontend/backend) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056137 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [13:38:30] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow7001.magru.wmnet [13:38:39] (03CR) 10CI reject: [V:04-1] Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:40:53] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org [reason: finished upgrading anycast-hc: T370068] [13:41:03] (03PS7) 10Clément Goubert: Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) [13:41:38] (03CR) 10Southparkfan: [C:04-1] "No opinion about removal of the obsolete records, one comment about apus" [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:41:50] (03PS4) 10Clément Goubert: service.yaml: Switch api and appserver to lvs_setup 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050381 (https://phabricator.wikimedia.org/T367949) [13:41:50] (03PS5) 10Clément Goubert: Remove legacy appservers from profile::lvs::realserver::pools 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) [13:41:50] (03PS5) 10Clément Goubert: Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) [13:42:31] (03CR) 10Ssingh: [C:03+1] "What do you think about running it by olly? I too want to get an answer to this!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [13:42:32] (03CR) 10Ayounsi: [C:03+2] border-in: remove git-ssh term [homer/public] - 10https://gerrit.wikimedia.org/r/1055543 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [13:43:12] (03Merged) 10jenkins-bot: border-in: remove git-ssh term [homer/public] - 10https://gerrit.wikimedia.org/r/1055543 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [13:43:28] (03CR) 10Clément Goubert: "Done" [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:43:50] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:44:19] !log cdobbins@cumin1002:~$ sudo cumin 'A:cp' 'disable-puppet "merging CR #1041705"' [13:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:54] (03CR) 10Southparkfan: [C:03+1] "'appservers', you have served us so well, for such a long amount of time" [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:46:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 17.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:46:47] (03CR) 10CDobbins: [V:03+1 C:03+2] varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [13:47:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633#10006824 (10Clement_Goubert) [13:47:13] 
(03Merged) 10jenkins-bot: MoveLogFormatter::getPreloadTitles: Handle bad titles [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056155 (https://phabricator.wikimedia.org/T370396) (owner: 10Lucas Werkmeister (WMDE)) [13:47:18] finally [13:47:46] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1056155|MoveLogFormatter::getPreloadTitles: Handle bad titles (T370396)]] [13:47:52] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:48:01] T370396: Error: Call to a member function inNamespace() on null - https://phabricator.wikimedia.org/T370396 [13:49:15] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host deploy1003.eqiad.wmnet with OS bullseye [13:49:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:07] !log deploy CR1055543: border-in: remove git-ssh term [13:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:18] !log running authdns-update after dns6001 depool [13:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:07] (03CR) 10Ayounsi: [C:03+2] border-in: remove squid and nrpe filters, expand LVS filter [homer/public] - 10https://gerrit.wikimedia.org/r/1055544 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [13:51:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 17.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:51:43] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1056155|MoveLogFormatter::getPreloadTitles: Handle bad titles (T370396)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:51:44] (03Merged) 10jenkins-bot: border-in: 
remove squid and nrpe filters, expand LVS filter [homer/public] - 10https://gerrit.wikimedia.org/r/1055544 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [13:51:52] !log deploy CR1055544 border-in: remove squid and nrpe filters, expand LVS filter [13:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:56] alright, https://inh.wikipedia.org/wiki/%D0%93%D3%8F%D1%83%D0%BB%D0%B0%D0%BA%D1%85%D0%B0:%D0%9A%D0%B5%D1%80%D0%B4%D0%B0_%D1%85%D1%83%D0%B2%D1%86%D0%B0%D0%BC%D0%B0%D1%88?hidepageedits=1&hidenewpages=1&hidecategorization=1&hideWikibase=1&hidenewuserlog=1&limit=500&days=30&urlversion=2&uselang=en is an internal error normally… [13:52:02] XioNoX: nice! [13:52:04] …and works on mwdebug \o/ [13:52:07] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [13:53:11] I hope we don't break the border policies ✌️ [13:54:47] Southparkfan: we won't be breaking millions of computers, what's the worst that can happen :) [13:55:03] Southparkfan: deploying the first two first, then later on the other two [13:56:25] haha, breaking access to the sum of all human knowledge isn't much better [13:56:38] and that's why it makes sense to do the authdns and bgp/bfd ones in a second round, yeah [13:57:10] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1056155|MoveLogFormatter::getPreloadTitles: Handle bad titles (T370396)]] (duration: 09m 24s) [13:57:15] T370396: Error: Call to a member function inNamespace() on null - https://phabricator.wikimedia.org/T370396 [13:57:30] (03PS2) 10Clare Ming: Add MPIC service listener proxy [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) [13:58:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10006880 (10Jclark-ctr) a:05Jclark-ctr→03cmooney [13:58:42] !log UTC afternoon backport+config window 
done [13:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10006894 (10Jclark-ctr) a:03Papaul [13:59:03] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10006896 (10phaultfinder) [13:59:05] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10006897 (10phaultfinder) [13:59:08] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10006898 (10phaultfinder) [13:59:12] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10006899 (10phaultfinder) [13:59:34] (03PS1) 10Clare Ming: Add MPIC service port [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) [13:59:40] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#10006906 (10Jclark-ctr) @BTullis can this drive be changed at any time? 
[14:00:00] (03PS1) 10Brouberol: growthbook: define one TLS hostname per subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056164 (https://phabricator.wikimedia.org/T365839) [14:00:02] (03PS1) 10Brouberol: growthbook: fix volume/configmap name problem [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056165 (https://phabricator.wikimedia.org/T365839) [14:00:15] (03PS3) 10Clare Ming: Add MPIC service listener proxy [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) [14:01:45] (03PS2) 10Brouberol: growthbook: fix volume/configmap name problem [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056165 (https://phabricator.wikimedia.org/T365839) [14:02:00] (03PS4) 10Clare Ming: Add MPIC service listener proxy [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) [14:02:03] (03CR) 10Giuseppe Lavagetto: [C:04-1] "please note this won't work with our puppet-lint:" [puppet] - 10https://gerrit.wikimedia.org/r/1056131 (owner: 10Giuseppe Lavagetto) [14:02:26] (03CR) 10CI reject: [V:04-1] Add MPIC service listener proxy [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:03:19] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage [14:03:35] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [14:03:41] (03CR) 10Clare Ming: Add MPIC service listener proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:03:43] (03CR) 10Stevemunene: [C:03+1] growthbook: define one TLS hostname per subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056164 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [14:04:26] (03PS5) 10Clare Ming: Add MPIC service listener proxy [puppet] - 
10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) [14:04:38] (03CR) 10Stevemunene: [C:03+1] growthbook: fix volume/configmap name problem [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056165 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [14:05:17] (03PS11) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [14:05:54] noob question for the k8s experts: I was just surprised to see a “kubernetes.host” set to “parse2020.codfw.wmnet” in logstash, for an mw-api-ext request [14:06:16] (03CR) 10CI reject: [V:04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [14:06:26] is that just expected, that some k8s hosts have an “older” host name at the moment, until they’ve been relabeled? [14:06:49] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage [14:06:50] Lucas_WMDE: yes, we have reused the names of the ex-mediawiki servers [14:06:52] (03PS3) 10Giuseppe Lavagetto: puppet-lint.rc: add specification of our indentation [puppet] - 10https://gerrit.wikimedia.org/r/1056131 [14:07:03] alright, then I won’t pay further attention to that [14:07:04] thanks :) [14:07:07] cheers [14:09:59] !log cdobbins@cumin1002:~$ sudo cumin 'A:cp' 'run-puppet-agent "merging CR #1041705"' [14:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:18] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet [14:26:00] (03CR) 10JHathaway: [C:03+1] admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) 
[14:30:00] (03CR) 10JHathaway: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [14:33:00] 06SRE, 06Traffic, 13Patch-For-Review: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718#10007033 (10CDobbins) This has been deployed as of 14:25 on 7/23/24, with CR #1041705. 1. I added a n... [14:33:18] 06SRE, 06Traffic, 13Patch-For-Review: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718#10007036 (10CDobbins) 05Open→03Resolved a:03CDobbins [14:35:36] (03CR) 10Elukey: [V:03+1 C:03+2] profile::puppetmaster::frontend: allow puppetservers via ssh [puppet] - 10https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [14:39:20] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:18] (03CR) 10Scott French: [C:03+1] Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:45:01] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T370667#10007055 (10VRiley-WMF) a:03VRiley-WMF [14:46:25] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10007060 (10ssingh) On `dns6001`, we have anycast-hc 0.9.8 running with the patch to change the logging level to WARN for when a service is dow... 
[14:48:37] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-f3-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f3-eqiad [14:48:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-f3-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f3-eqiad [14:49:00] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007067 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=85a0a04b-e091-4107-9bc3-7c9ca22300c8) set by cmooney... [14:50:53] (03CR) 10Scott French: "I think the `service_setup` change might have been lost in the rebase?" [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:51:27] (03CR) 10Ayounsi: [C:03+2] border-in: remove authdns filter [homer/public] - 10https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [14:51:33] (03CR) 10CI reject: [V:04-1] border-in: remove authdns filter [homer/public] - 10https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [14:51:53] (03PS4) 10Southparkfan: border-in: remove authdns filter [homer/public] - 10https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) [14:53:32] (03PS1) 10Elukey: WIP dhcp: add dhcp_filename and dhcp_options [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 [14:54:44] !log moss-be1003 into maintenance mode for network downtime T365998 [14:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:49] T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998 [14:55:52] (03PS79) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 
(https://phabricator.wikimedia.org/T310822) [14:56:17] (03CR) 10CI reject: [V:04-1] prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [14:56:26] (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [14:57:05] (03Merged) 10jenkins-bot: border-in: remove authdns filter [homer/public] - 10https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [14:57:07] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007073 (10MatthewVernon) @cmooney Swift (ms-be) and Ceph (moss-be) ready when you are. [14:58:29] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [14:59:16] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [14:59:20] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:30] !log deploy CR1055546 border-in: remove authdns filter [14:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:37] sukhe: ^ [14:59:56] nice, thanks! [15:00:04] eoghan, jelto, arnoldokoth, and mutante: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1500). 
[15:00:06] doing codfw first [15:00:22] !log rebooting lsw1-f3-eqiad to complete JunOS upgrade (T365998) [15:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:29] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10007074 (10Jhancock.wm) @Papaul now that the C/D lsw's are live-ish we can move it over to 10G. I can swap the cable if you can update the link. [15:00:31] T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998 [15:00:46] (03PS80) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [15:00:49] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-f3-eqiad,lsw1-f3-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f3-eqiad [15:00:56] (03CR) 10CI reject: [V:04-1] WIP dhcp: add dhcp_filename and dhcp_options [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 (owner: 10Elukey) [15:00:57] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10007079 (10elukey) ` (2) puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet... 
[15:01:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-f3-eqiad,lsw1-f3-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f3-eqiad [15:01:12] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 25 hosts with reason: JunOS upgrade lsw1-f3-eqiad [15:01:15] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007080 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=71f4229e-483c-4848-9bc3-6926b62b02ae) set by cmooney... [15:01:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 25 hosts with reason: JunOS upgrade lsw1-f3-eqiad [15:01:45] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007081 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=18d9056a-9166-4006-b516-a07496523fd2) set by cmooney... 
[15:02:11] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:02:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:02:27] (03PS6) 10Clément Goubert: Remove legacy appservers from profile::lvs::realserver::pools 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) [15:02:42] (03CR) 10Clément Goubert: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [15:02:47] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:03:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:03:22] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:03:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:03:48] !log brennen@deploy1002 Started deploy [phabricator/deployment@7335128]: deploy phab2002 for T370776 [15:03:52] T370776: Deploy Phabricator/Phorge 2024-07-23 - https://phabricator.wikimedia.org/T370776 [15:04:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10007112 (10Jhancock.wm) a:03Jhancock.wm [15:05:05] !log brennen@deploy1002 Finished deploy [phabricator/deployment@7335128]: deploy phab2002 for T370776 (duration: 01m 17s) [15:05:30] !log brennen@deploy1002 Started deploy [phabricator/deployment@3902e30]: deploy phab2002 for T370776 (redux, first deploy a mistaken no-op) [15:06:05] !log 
brennen@deploy1002 Finished deploy [phabricator/deployment@3902e30]: deploy phab2002 for T370776 (redux, first deploy a mistaken no-op) (duration: 00m 34s) [15:06:29] !log brennen@deploy1002 Started deploy [phabricator/deployment@3902e30]: deploy phab1004 for T370776 [15:07:02] !log brennen@deploy1002 Finished deploy [phabricator/deployment@3902e30]: deploy phab1004 for T370776 (duration: 00m 33s) [15:07:06] (03PS7) 10Clément Goubert: Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) [15:07:26] (03PS2) 10Elukey: WIP dhcp: add dhcp_filename and dhcp_options [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 [15:08:00] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370736#10007126 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated psu2 and cable. cleared. [15:08:48] 10ops-codfw, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370732#10007131 (10fnegri) The same error happened last month (T368211), and was fixed by @Jhancock.wm by reseating the cable. [15:08:52] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10007136 (10Papaul) @Jhancock.wm yes i can take care of the link. 
Thanks [15:10:35] (03PS1) 10Anzx: knwikisource: Enable local uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056142 (https://phabricator.wikimedia.org/T370765) [15:11:21] (03PS2) 10Anzx: knwikisource: Enable local uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056142 (https://phabricator.wikimedia.org/T370765) [15:12:16] 10ops-codfw, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370732#10007161 (10fnegri) The alert went back to green 1 minute after posting the comment above :) [15:13:55] (03PS1) 10Cathal Mooney: Remove clear_dhcp_cache function from reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1056182 (https://phabricator.wikimedia.org/T306421) [15:14:12] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370735#10007173 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated psu2 and cable. alert cleared. [15:14:33] (03PS3) 10Anzx: knwikisource: Enable local uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056142 (https://phabricator.wikimedia.org/T370765) [15:15:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:17:02] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370734#10007211 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cable. alert cleared. [15:17:20] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370733#10007215 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated psu2 and cable. alert cleared. 
[15:17:21] (03PS1) 10Cathal Mooney: Set DHCP relay function to 'forward-only' mode for all EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1056183 (https://phabricator.wikimedia.org/T306421) [15:20:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:20:21] 10ops-codfw, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370732#10007219 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:20:28] !log find /srv/mediawiki/images/wikitech/archive -type f | xargs delete on wikitech-static, drive is full of nonsense [15:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:36] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007228 (10cmooney) Upgrade complete, things look ok network wise and all host are back pinging again. Thanks all for the assis... 
[15:21:42] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T370731#10007225 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:21:43] (03CR) 10Ayounsi: [C:03+1] "😎" [homer/public] - 10https://gerrit.wikimedia.org/r/1056183 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [15:22:14] !log Uncordoning dse-k8s-worker1008.eqiad.wmnet after T365998 [15:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:21] T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998 [15:22:41] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=dse-k8s-worker1008.eqiad.wmnet,cluster=dse-k8s,service=kubesvc [15:23:21] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10007234 (10Jhancock.wm) a:03Jhancock.wm [15:23:34] (03CR) 10Cathal Mooney: [C:03+2] Set DHCP relay function to 'forward-only' mode for all EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1056183 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [15:23:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10007236 (10Jhancock.wm) a:03Jhancock.wm [15:24:15] !log moss-be1003 out of maintenance mode after network downtime T365998 [15:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:27] (03Merged) 10jenkins-bot: Set DHCP relay function to 'forward-only' mode for all EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1056183 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [15:24:28] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: 
name=(kubernetes1025|kubernetes1026|kubernetes1052|kubernetes1053|kubernetes1054|kubernetes1055|kubernetes1056|mw1496).eqiad.wmnet,cluster=kubernetes,service=kubesvc [reason: Uncordoning following T365998] [15:25:01] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10007251 (10Jhancock.wm) a:03Jhancock.wm [15:26:43] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007276 (10MatthewVernon) Both Ceph and Swift back to normal, thanks. [15:29:20] (03PS1) 10Brouberol: gobblin: ignore GobblinKafkaRecordsExtractedNotEqualRecordsExpected for compacted topics [alerts] - 10https://gerrit.wikimedia.org/r/1056188 [15:29:27] 06SRE-OnFire, 10Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790#10007297 (10jhathaway) [15:30:01] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1056182 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [15:31:26] 06SRE-OnFire, 10Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783 (10jhathaway) 03NEW [15:31:57] 06SRE-OnFire, 10Incident Tooling: corto: implement changing the IC - https://phabricator.wikimedia.org/T370784 (10jhathaway) 03NEW [15:32:34] (03CR) 10Brouberol: [C:03+2] growthbook: define one TLS hostname per subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056164 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [15:32:59] 06SRE-OnFire, 10Incident Tooling: corto: implement changing the IC - https://phabricator.wikimedia.org/T370784#10007331 (10jhathaway) [15:33:01] 06SRE-OnFire, 10Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10007332 (10jhathaway) [15:33:04] 06SRE-OnFire, 10Incident Tooling: 
introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790#10007333 (10jhathaway) [15:34:43] 06SRE-OnFire, 10Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785 (10jhathaway) 03NEW [15:35:07] (03CR) 10Brouberol: [C:03+2] growthbook: fix volume/configmap name problem [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056165 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [15:36:02] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786 (10jhathaway) 03NEW [15:36:06] sukhe: deploy almost done, but I'm a bit surprised/concerned about the graphs on https://grafana.wikimedia.org/d/Jj8MztfZz/authoritative-dns?orgId=1&refresh=30s&from=now-3h&to=now [15:36:15] (03PS1) 10Brouberol: growthbook: fix typo in tls hostname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056191 (https://phabricator.wikimedia.org/T365839) [15:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:16] 06SRE-OnFire, 10Incident Tooling: corto: CI & packaging - https://phabricator.wikimedia.org/T370788 (10jhathaway) 03NEW [15:38:18] 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789 (10jhathaway) 03NEW [15:39:38] (03PS1) 10Ayounsi: Revert "border-in: remove authdns filter" [homer/public] - 10https://gerrit.wikimedia.org/r/1056192 [15:39:56] sukhe: yeah.. 
I have to revert it, it's actually blocking legit udp traffic [15:40:48] (03CR) 10Clément Goubert: Add MPIC service listener proxy (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [15:41:25] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1056192 (cc topranks) [15:41:32] (03CR) 10Clément Goubert: Add MPIC service port (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [15:41:35] yes, 208.80.153.231 times out [15:41:40] in a meeting but what's up? [15:41:48] oh ok [15:41:54] so does 208.80.154.238 [15:42:31] reverted manually in eqiad/codfw [15:42:46] how did we all get this wrong damn [15:42:57] so... DNS requests were blocked? [15:43:00] I have to step away to go to an important appointment [15:43:04] ns1.wm.o doesn't respond, ns2.wm.o does [15:43:11] damnit [15:43:13] topranks: udp requests [15:43:36] surprising [15:43:49] The NS IPs are in 'LVS-service-ips' ? [15:44:00] yeah... for ns1 and ns0 [15:44:12] dunno how I didn't catch it [15:44:14] anyway [15:44:19] ah shit, I never looked and just assumed based on the name of that [15:44:22] well none of us did [15:44:28] in the prefix-list: 208.80.153.224/27; 208.80.154.224/27; [15:44:39] that... makes sense [15:44:43] but I thought that was a separate list [15:44:45] the comment on the term we removed literally says "the next term will block" [15:44:50] the change has been manually rolled back [15:44:56] I have to step away [15:45:01] caught by the udp-lvs-ddos term [15:45:13] yeah I was convinced it was on a different range [15:45:13] I looked at it and assumed it used to be just a "block udp", and later the "dest prefix LVS-service-IPs" was added, thus making it not apply to the DNS IPs [15:45:14] anyway.. 
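The overlap discussed in the channel can be checked mechanically. What follows is a minimal Python sketch, not part of homer or any Wikimedia tooling: it assumes the two /27 ranges quoted above are the full contents of the 'LVS-service-ips' prefix-list, and that ns0 and ns1 are the two addresses named in the conversation. The function names are hypothetical.

```python
import ipaddress

# Ranges quoted in the channel as the contents of the prefix-list
# (assumption: this is the complete list).
LVS_SERVICE_PREFIXES = [
    ipaddress.ip_network("208.80.153.224/27"),
    ipaddress.ip_network("208.80.154.224/27"),
]

# Authoritative DNS addresses mentioned in the discussion.
AUTHDNS_IPS = {
    "ns0": ipaddress.ip_address("208.80.154.238"),
    "ns1": ipaddress.ip_address("208.80.153.231"),
}


def covering_prefixes(addr, prefixes):
    """Return every prefix in `prefixes` that contains `addr`."""
    return [prefix for prefix in prefixes if addr in prefix]


def overlap_report(hosts, prefixes):
    """Map each host name to the prefixes (as strings) covering its address."""
    return {
        name: [str(prefix) for prefix in covering_prefixes(addr, prefixes)]
        for name, addr in hosts.items()
    }


if __name__ == "__main__":
    for name, hits in overlap_report(AUTHDNS_IPS, LVS_SERVICE_PREFIXES).items():
        status = "covered by " + ", ".join(hits) if hits else "not covered"
        print(f"{name}: {status}")
```

A check of this kind, run against the rendered filter before removing a term, would flag any service address that falls through to a later blocking term.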
[15:45:26] it is what it is, caught quick [15:45:37] POPs are not impacted as it's only the anycast IP [15:45:38] at least DNS is quite resilient in design that way [15:45:53] I'll resume the revert when I'm back [15:46:24] (03CR) 10Brouberol: [C:03+2] growthbook: fix typo in tls hostname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056191 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [15:46:25] graphs look better too :) [15:46:39] eqiad should have also been serving requests on the NS2 IP right? [15:46:51] XioNoX: I can finish the tidy up [15:46:52] yep, ns1 back here as well [15:47:13] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:47:21] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:47:50] thanks folks! a weird one [15:47:59] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [15:48:03] topranks: yeah exactly, so the impact is null [15:48:21] maybe some slower responses waiting on timeout, but nothing disastrous [15:48:49] (03CR) 10Southparkfan: [C:03+1] "208.80.154.238 # ns0 and 208.80.153.231 # ns1 are actually part of LVS-service-ips, therefore legitimate AuthDNS to ns0 and ns1 is blocked" [homer/public] - 10https://gerrit.wikimedia.org/r/1056192 (owner: 10Ayounsi) [15:49:06] sorry for this :) [15:49:15] Southparkfan: it's on all of us :) [15:49:37] even the eight-eyes principle failed us [15:49:39] it's recovering so all's good [15:49:44] (03PS1) 10Elukey: role::puppetmaster::backend: allow puppetservers to connect via ssh [puppet] - 10https://gerrit.wikimedia.org/r/1056193 (https://phabricator.wikimedia.org/T368023) [15:50:50] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3382/co" [puppet] - 10https://gerrit.wikimedia.org/r/1056193 (https://phabricator.wikimedia.org/T368023) (owner: 
10Elukey) [15:50:53] for my understanding - the ns3/anycast authdns stayed online, and only ns1/ns2 were impacted? [15:51:07] or ns0/ns1 I should say I guess [15:51:13] only ns2 is anycast [15:51:30] Southparkfan: correct [15:51:30] ok right, that explains why dig @ns2.wm.o worked [15:52:05] although the reason is nothing to do with anycast, it's just because that IP is from a different range not part of 'LVS-service-ips' [15:52:33] !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephmon1004.eqiad.wmnet [15:52:36] yep, prefix list contains ranges covering ns0/ns1, but not ns2 [15:53:32] (03CR) 10Elukey: admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:53:37] (03PS16) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [15:53:59] (03PS2) 10BCornwall: haproxy: Calculate rate of haproxy restarts [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) [15:54:19] (03PS3) 10BCornwall: haproxy: Calculate increase of haproxy restarts [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) [15:54:24] (03CR) 10Elukey: "Added David as well only as FYI for the cloud part :)" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:55:59] (03CR) 10CI reject: [V:04-1] haproxy: Calculate increase of haproxy restarts [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [15:56:03] (03CR) 10Elukey: [V:03+1] "Jesse sorry for this extra review but I totally forgot that puppetmaster backends need to receive the post-commit updates as well (from pu" [puppet] - 10https://gerrit.wikimedia.org/r/1056193 
(https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [15:56:08] (03CR) 10Cathal Mooney: [C:03+2] Revert "border-in: remove authdns filter" [homer/public] - 10https://gerrit.wikimedia.org/r/1056192 (owner: 10Ayounsi) [15:56:59] (03Merged) 10jenkins-bot: Revert "border-in: remove authdns filter" [homer/public] - 10https://gerrit.wikimedia.org/r/1056192 (owner: 10Ayounsi) [15:58:04] (03CR) 10Scott French: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [15:58:23] (03CR) 10Scott French: [C:03+1] Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [15:58:27] (03PS81) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [15:58:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#10007492 (10elukey) Next steps: - Immediate: I/F is going to add code to Spicerack and the reimage cookbook to force a tftp-only boot, so we'll be able... [16:00:04] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1600). [16:00:04] Lucas_WMDE and tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:08] o/ [16:00:18] (03CR) 10JHathaway: [C:03+1] role::puppetmaster::backend: allow puppetservers to connect via ssh [puppet] - 10https://gerrit.wikimedia.org/r/1056193 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [16:01:17] (03CR) 10Cathal Mooney: [C:03+2] Remove clear_dhcp_cache function from reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1056182 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [16:01:19] o/ [16:01:23] o/ [16:02:09] (03PS4) 10BCornwall: haproxy: Calculate increase of haproxy restarts [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) [16:03:03] (03CR) 10Elukey: [V:03+1 C:03+2] role::puppetmaster::backend: allow puppetservers to connect via ssh [puppet] - 10https://gerrit.wikimedia.org/r/1056193 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [16:03:05] tgr|away: & Lucas_WMDE I'll merge both in, any special roll out procedures [16:03:54] (03PS1) 10Brouberol: growthbook: bump chart version to allow subchart upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056196 (https://phabricator.wikimedia.org/T365839) [16:04:08] I imagine a Varnish restart or reload? Nothing beyond that. 
[16:04:25] nothing for me, I think [16:04:38] I assume a `systemctl daemon-reload` to pick up the unit file changes happens automatically [16:04:56] (03Merged) 10jenkins-bot: Remove clear_dhcp_cache function from reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1056182 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [16:04:57] if you are talking about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1030591 [16:04:59] (03PS17) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [16:05:00] (03CR) 10Brouberol: [C:03+2] growthbook: bump chart version to allow subchart upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056196 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [16:05:05] okay, I'm in a meeting, so for the varnish piece, I think I may need to wait till after the meeting, as I have never done a varnish restart [16:05:07] then a varnish reload that puppet automatically does [16:05:15] ah good [16:05:24] bless notify [16:05:25] no restart required (and we shouldn't as it will clear the caches) [16:05:34] nod [16:05:53] okay, sounds as if both patches can go out via normal puppet runs then [16:06:17] just disable Puppet on A:cp-text, try on one host and then roll it to all others (that's how at least we do it) [16:06:34] nod [16:06:41] okay, will do that after this meeting [16:07:17] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:07:28] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:09:45] (03CR) 10Clare Ming: Add MPIC service port (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [16:10:18] (03CR) 10Clare Ming: Add MPIC service listener proxy (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [16:10:24] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10007542 (10elukey) And of course past-Luca forgot about puppetmaster backends, those needs to be updated as well (and they ar... [16:11:20] (03PS1) 10Cathal Mooney: Rename LVS-service-IPs prefix-list [homer/public] - 10https://gerrit.wikimedia.org/r/1056198 (https://phabricator.wikimedia.org/T370156) [16:14:40] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudcephmon1004.eqiad.wmnet [16:14:41] (03CR) 10Ssingh: [C:03+1] "Thanks :)" [homer/public] - 10https://gerrit.wikimedia.org/r/1056198 (https://phabricator.wikimedia.org/T370156) (owner: 10Cathal Mooney) [16:15:21] (03CR) 10Clare Ming: Deploy MetricsPlatform to beta cluster (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [16:16:59] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [16:17:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10007580 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye [16:18:07] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10007582 (10Scott_French) Silenced ProbeDown for api-https:443 and appservers-https:443 for 24h: * f6f67d8d-6381-43b3-9262-9a8cf58f2b19 * ed0d352b-fb83-4bd4-... 
[16:21:28] (03PS3) 10Herron: ipmi-sel: create task on critical ipmi sel events [alerts] - 10https://gerrit.wikimedia.org/r/1054649 (https://phabricator.wikimedia.org/T368088) [16:23:32] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007599 (10cmooney) 05Open→03Resolved [16:24:27] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#10007601 (10cmooney) 05Open→03Resolved [16:24:42] (03CR) 10Milimetric: [C:03+2] "Thank you for the change, may the alerts be quieter 😊" [alerts] - 10https://gerrit.wikimedia.org/r/1056188 (owner: 10Brouberol) [16:25:53] (03Merged) 10jenkins-bot: gobblin: ignore GobblinKafkaRecordsExtractedNotEqualRecordsExpected for compacted topics [alerts] - 10https://gerrit.wikimedia.org/r/1056188 (owner: 10Brouberol) [16:28:37] 06SRE-OnFire, 10Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785#10007607 (10lmata) [16:29:39] (03CR) 10BCornwall: "I brought it up and they preferred increase() as well." [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [16:30:07] (03PS1) 10Hashar: puppetmaster: set git::clone umask to match requested file mode [puppet] - 10https://gerrit.wikimedia.org/r/1056201 (https://phabricator.wikimedia.org/T338277) [16:30:52] (03CR) 10Ssingh: "Thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [16:30:57] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056201 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:32:57] (03PS4) 10Herron: ipmi-sel: create task on critical ipmi sel events [alerts] - 10https://gerrit.wikimedia.org/r/1054649 (https://phabricator.wikimedia.org/T368088) [16:37:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T370667#10007636 (10VRiley-WMF) 05Open→03Resolved Rebalanced Power. [16:37:56] 06SRE-OnFire, 10Incident Tooling: corto: CI & packaging - https://phabricator.wikimedia.org/T370788#10007632 (10BCornwall) 05Open→03Resolved a:03BCornwall We have CI using blubber/kokkuri: https://gitlab.wikimedia.org/repos/sre/corto/-/blob/main/.pipeline/blubber.yaml?ref_type=heads For .deb builds,... [16:39:42] does anybody happen to know how long jhathaway’s meeting might take? 😅 [16:39:59] almost done, sorry [16:40:01] I wanted to send out an email to ops-l once my Puppet change was merged (just in case it breaks something) but it’s getting kinda late here ^^ [16:40:03] ah ok, thanks [16:43:27] (03CR) 10JHathaway: [C:03+2] systemd::timer::job: Use TimeoutStartSec= [puppet] - 10https://gerrit.wikimedia.org/r/1054603 (https://phabricator.wikimedia.org/T370171) (owner: 10Lucas Werkmeister (WMDE)) [16:43:53] \o/ [16:46:10] Lucas_WMDE: merged in [16:47:25] thanks! [16:48:09] email sent [16:49:23] (03CR) 10Hashar: "On I6e72813e2e6a637c20a3bd5a455665ea93450fc1, I remove `umask` entirely in favor of computing it from the requested `mode` for files. The" [puppet] - 10https://gerrit.wikimedia.org/r/1056201 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:50:43] tgr|away: still around for your patch? 
[16:52:08] jhathaway: yes [16:52:25] okay, disabling cp-test nodes [16:52:31] for puppet [16:53:44] (03CR) 10Ebernhardson: [C:03+2] team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) (owner: 10DCausse) [16:54:18] (03PS2) 10Hashar: puppetmaster: set git::clone umask to match requested file mode [puppet] - 10https://gerrit.wikimedia.org/r/1056201 (https://phabricator.wikimedia.org/T338277) [16:54:44] (03CR) 10JHathaway: [C:03+2] varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:54:50] jhathaway: we use cp4037 for testing fwiw (basically any host in ulsfo), so feel free to merge there first to test and check [16:54:56] (03Merged) 10jenkins-bot: team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) (owner: 10DCausse) [16:55:16] sukhe: thanks, will do, any reason to use that node in particular? [16:56:17] not at all, just the first cp text node so we use that. you can pick any cp-text in ulsfo [16:56:51] and why ulsfo? [16:57:26] it gets the least amount of traffic and is our usual testbed for any such stuff [16:57:58] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:58:16] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:58:26] (one can argue that margu is that DC now but we don't hit that as the first DC for other reasons, such as the unique devices experiment) [16:58:50] thanks makes sense [16:59:59] !log applying varnish change on cp4037, 1030591 [17:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. 
You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1700). [17:01:22] here ^ will be starting on this work momentarily [17:01:42] sukhe: any tips or docs on how to assess success? [17:02:05] (03CR) 10Scott French: [C:03+2] Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:02:15] swfrench-wmf: Is it okay if I do a `scap sync-world` while you're working? [17:02:47] (03PS5) 10Dzahn: firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:02:55] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:02:57] dancy: I don't anticipate it causing problems, so no objections. thanks for checking, though. [17:02:58] jhathaway: as long as varnish reloaded successfully (which it seems it does), that's basically it [17:03:04] * Lucas_WMDE off [17:03:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10007769 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye ex... [17:03:07] OK [17:03:08] thanks again for the deploy jhathaway! [17:03:12] (03CR) 10CI reject: [V:04-1] firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:03:23] Lucas_WMDE: no problem [17:03:31] sukhe: okay [17:03:40] jhathaway: I can test if you tell me what domain to hit. But probably easier to test once fully deployed, I don't think it's a risky change. 
[17:03:50] swfrench-wmf: no gate-and-submit on dns, you have to submit manually [17:03:50] !log dancy@deploy1002 Started scap sync-world: testing [17:04:03] claime: thanks, yes - did do :) [17:04:04] tgr|away: ok, I'll finish the deploy [17:04:06] ah you did [17:04:13] sorry <3 [17:04:24] tgr|away: if you can tell us which cp node you are hitting, we can merge the change there and you can try against that [17:04:41] !log dancy@deploy1002 sync-world aborted: testing (duration: 00m 51s) [17:05:05] swfrench-wmf: Done testing [17:05:17] dancy: ack, thank you [17:05:32] (03CR) 10Dzahn: [V:03+1 C:03+2] planet: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:05:36] sukhe: cp3068 [17:05:56] jhathaway: if you need one more test, ^ [17:06:43] !log ran authdns-update on dns1004 to pick up removal of appservers / api records - T367949 [17:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:47] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:06:54] (03CR) 10Clare Ming: Add MPIC service listener proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:07:44] cgoubert@cumin1002:~$ dig +short appserver.discovery.wmnet [17:07:46] cgoubert@cumin1002:~$ [17:07:48] F [17:09:06] I've borked the domain, but the result is the same) [17:09:08] rip [17:09:40] (03CR) 10Scott French: [C:03+2] service.yaml: Switch api and appserver to lvs_setup 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050381 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:10:26] 🫗 [17:10:31] (03CR) 10Clare Ming: Add MPIC service listener proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:10:38] FIRING: SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:06] tgr|away: ready to test [17:11:37] claime: an nxdomain I'm happy to see [17:12:00] swfrench-wmf: ain't that often you have one :p [17:12:27] (03PS1) 10WMDE-Fisch: Respect wgTranslateNumerals in Cite footnote markers [extensions/Cite] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056204 (https://phabricator.wikimedia.org/T370585) [17:12:54] (03PS1) 10WMDE-Fisch: Respect wgTranslateNumerals in Cite footnote markers [extensions/Cite] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056205 (https://phabricator.wikimedia.org/T370585) [17:13:55] "It was DNS! 😁" [17:14:03] (03CR) 10Clément Goubert: Add MPIC service listener proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:14:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:25] thanks jhathaway! It works. 
[17:15:40] tgr|away: amazing [17:17:09] !log run-puppet-agent on A:dnsbox to pick up switch to lvs_setup - T367949 [17:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:13] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:17:54] (03PS2) 10Gergő Tisza: debug: Enable Special:WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030590 (https://phabricator.wikimedia.org/T350094) [17:19:36] (03PS3) 10Gergő Tisza: debug: Enable Special:WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030590 (https://phabricator.wikimedia.org/T350094) [17:20:09] (03PS7) 10Clément Goubert: Remove legacy appservers from profile::lvs::realserver::pools 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) [17:20:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030590 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:21:53] (03CR) 10Scott French: [C:03+2] Remove legacy appservers from profile::lvs::realserver::pools 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:22:49] (03PS1) 10Dzahn: phorge: add UnsafeAllow3F rewrite flag [puppet] - 10https://gerrit.wikimedia.org/r/1056207 (https://phabricator.wikimedia.org/T370110) [17:26:22] (03CR) 10CI reject: [V:04-1] phorge: add UnsafeAllow3F rewrite flag [puppet] - 10https://gerrit.wikimedia.org/r/1056207 (https://phabricator.wikimedia.org/T370110) (owner: 10Dzahn) [17:27:09] (03CR) 10Dzahn: [C:03+2] phorge: add UnsafeAllow3F rewrite flag [puppet] - 10https://gerrit.wikimedia.org/r/1056207 (https://phabricator.wikimedia.org/T370110) (owner: 10Dzahn) [17:28:04] !log run-puppet-agent on 
O:lvs::balancer to pick up switch to service_setup, removal of profile::lvs::realserver::pools - T367949 [17:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:08] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:28:42] hmm profile::mediawiki::webserver fails tests [17:29:14] seems related: parameter 'realserver_ips' variant 0 expects size to be at least 1, got 0 [17:29:19] patiently waits a while [17:29:44] mutante: where? [17:29:51] https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster/5955/console [17:29:58] unrelated puppet changes, CI downvotes [17:30:04] looking [17:30:47] swfrench-wmf: do your prod thing [17:30:52] I'll take a look at this [17:30:57] it's just the tests: ./modules/profile/spec/classes/profile_mediawiki_webserver_spec.rb:45 [17:31:22] yeah, I didn't think about them when carving out the patches [17:31:42] claime: thanks, and ack [17:31:43] probably just need to remove this block since it's no longer required [17:32:02] yeah, this sounds like a test that asserts configuration-as-fixture? 
[17:32:05] sukhe: yep that's what I'm doing [17:32:11] <3 [17:32:15] :) [17:33:51] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T367949) [17:33:56] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:35:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/Cite] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056204 (https://phabricator.wikimedia.org/T370585) (owner: 10WMDE-Fisch) [17:35:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/Cite] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056205 (https://phabricator.wikimedia.org/T370585) (owner: 10WMDE-Fisch) [17:37:24] sukhe: just had a thought as I starting running the cookbook, that I think might be happening - it's going to wrap up by waiting for icinga checks to clear, but in the service turndown case they never will because the LVS diff check should be failing. does that sound plausible? [17:37:27] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:37:39] very much so [17:37:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10007870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:38:11] sukhe: any objections if I switch back to cumin? 
[17:38:19] so we can just let it finish, I don't have many good ideas around this other than ACKing that check before and then using skipping known icinga checks [17:38:36] swfrench-wmf: none at all, all we ask is the logging of the restart. the cookbook is optiona [17:38:39] l [17:38:55] if it's helpful you can go to Icinga web UI and click "reschedule" on any check you are waiting for and it should be near instant instead of waiting a couple minutes [17:39:53] sukhe: great, I'll do that, and clean up the old service IPs in the final step as planned [17:40:34] mutante: thanks for the tip! alas, it's more the tool waiting on something that will never happen (with a linear backoff) [17:40:55] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T367949) [17:40:56] ack [17:40:57] fair enough. I will think a bit more about how to solve this particular problem. so far the only idea I have is the above since if that's the only check that is failing, that is expected [17:40:59] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:41:28] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs2014.codfw.wmnet [17:41:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2014.codfw.wmnet [17:41:37] (03PS1) 10Clément Goubert: Remove appserver tests [puppet] - 10https://gerrit.wikimedia.org/r/1056212 [17:41:54] sukhe: thanks for cleaning up the downtime [17:42:41] np [17:43:27] (03PS2) 10Clément Goubert: Remove appserver tests [puppet] - 10https://gerrit.wikimedia.org/r/1056212 (https://phabricator.wikimedia.org/T367949) [17:44:20] mutante: in this case it's bit tricky though because the only way to resolve that alert is to remove the IPs from LVS, which is the last step and is done manually [17:44:32] (03CR) 10Dzahn: [C:03+1] "I was about to add the bug link:) thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1056212 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:44:40] !log sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service' - T367949 [17:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:15] sukhe: yea, I wouldn't worry about it too much since it seems the entire thing is a one-time change [17:45:58] (03PS2) 10Dzahn: phorge: add UnsafeAllow3F rewrite flag [puppet] - 10https://gerrit.wikimedia.org/r/1056207 (https://phabricator.wikimedia.org/T370110) [17:46:06] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1056207 (https://phabricator.wikimedia.org/T370110) (owner: 10Dzahn) [17:46:22] (03CR) 10Clément Goubert: [C:03+2] Remove appserver tests [puppet] - 10https://gerrit.wikimedia.org/r/1056212 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [17:46:29] !log nshahquinn-wmf@deploy1002 Started deploy [airflow-dags/analytics_product@ebd9e13]: (no justification provided) [17:46:37] !log nshahquinn-wmf@deploy1002 Finished deploy [airflow-dags/analytics_product@ebd9e13]: (no justification provided) (duration: 00m 07s) [17:46:51] quickly merging CI change to unblock [17:47:00] claime: thank you so much! [17:47:14] (03PS8) 10Clément Goubert: Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) [17:47:25] Might as well hit that rebase button as well [17:47:26] claime: thanks! 
[17:47:37] Don't forget to smash that rebase button [17:47:40] [17:47:49] heh [17:47:53] likes and subscribes [17:47:58] It's late ok [17:48:15] merged, mutante you should be good now [17:48:28] * swfrench-wmf pacing around waiting for 5m to elapse before moving on to eqiad [17:48:33] (03CR) 10Dzahn: [V:03+1 C:03+2] "recheck recheck recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1056207 (https://phabricator.wikimedia.org/T370110) (owner: 10Dzahn) [17:48:46] claime: thank you :) [17:49:12] I confirm CI is working [17:49:58] yay [17:50:56] * sukhe hands claime a sock [17:51:07] since t-shirt is breaking wikis and this was not that :) [17:51:15] !log sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service' - T367949 [17:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:20] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:51:21] lol [17:51:52] Need to break and fix CI twice to get a pair of socks [17:51:54] Inflation smh [17:51:55] never get to full suit with tie level [17:54:22] (03PS6) 10Dzahn: firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:57:15] (03CR) 10CI reject: [V:04-1] firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:58:25] !log sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service' - T367949 [17:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:30] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [17:59:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:04] dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T1800). [18:00:24] (03PS1) 10Kgraessle: When user is reverted by Automoderator, send them a talk page message - one last non primitive data type left [extensions/AutoModerator] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056214 (https://phabricator.wikimedia.org/T355930) [18:01:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/AutoModerator] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056214 (https://phabricator.wikimedia.org/T355930) (owner: 10Kgraessle) [18:01:14] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1001.eqiad.wmnet [18:01:16] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056215 (https://phabricator.wikimedia.org/T366960) [18:01:17] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056215 (https://phabricator.wikimedia.org/T366960) (owner: 10TrainBranchBot) [18:01:27] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.vrts.upgrade (exit_code=99) on VRTS host vrts1001.eqiad.wmnet [18:01:59] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056215 (https://phabricator.wikimedia.org/T366960) (owner: 10TrainBranchBot) [18:03:30] dancy: FYI, we're running a bit over, but at this point the dust has mostly settled, so feel free to move ahead in parallel [18:03:52] swfrench-wmf: thx. 
deploying now [18:04:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:04:20] FIRING: [3x] ProbeDown: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:40] dduvall: ack and thanks :) [18:05:55] o/ [18:06:00] uh oh. "Error: UPGRADE FAILED: release pinkunicorn failed, and has been rolled back due to atomic being set: Get "https://kubemaster.svc.codfw.wmnet:6443/api/v1/namespaces/mw-debug/services/mediawiki-pinkunicorn-tls-service": dial tcp 10.2.1.8:6443: connect: connection refused" [18:06:12] ooh that's new. [18:06:36] Looks like a try-again situation. [18:07:26] yeah, X-W-D for k8s-codfw and k8s-eqiad works [18:07:50] i'll run `scap train` again [18:08:18] re: vrts1001 - it's scheduled maintenance [18:08:23] (03PS1) 10AOkoth: vrts: remove TicketCounter.log cp line [cookbooks] - 10https://gerrit.wikimedia.org/r/1056216 (https://phabricator.wikimedia.org/T366078) [18:08:30] !log sudo cumin 'A:lvs-secondary-codfw or A:lvs-low-traffic-codfw' 'ipvsa [18:08:30] dm --delete-service --tcp-service 10.2.1.22:443' (api-https codfw) - T367949 [18:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:33] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [18:09:19] claime, dancy: it appears to have only failed on namespace mw-debug [18:09:20] RESOLVED: [3x] ProbeDown: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:10:04] alright. no failure this time [18:10:04] !log sudo cumin 'A:lvs-secondary-codfw or A:lvs-low-traffic-codfw' 'ipvsa [18:10:05] dm --delete-service --tcp-service 10.2.1.1:443' (appservers-https codfw) - T367949 [18:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:44] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1001.eqiad.wmnet [18:11:31] !log sudo cumin 'A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad' 'ipvsa [18:11:31] dm --delete-service --tcp-service 10.2.2.22:443' (api-https eqiad) - T367949 [18:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:53] !log sudo cumin 'A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad' 'ipvsadm --delete-service --tcp-service 10.2.2.22:443' (api-https eqiad) - T367949 [18:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:02] ^ now with no extra newlines :) [18:12:02] dduvall: That looks like an actual failure to connect to the kube api [18:12:10] !log aokoth@cumin1002 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1001.eqiad.wmnet [18:12:13] right [18:12:31] mediawiki-multiversion-debug:2024-07-23-180726-publish [18:12:43] that's the image currently deployed on mw-debug codfw [18:13:29] !log sudo cumin 'A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad' 'ipvsadm --delete-service --tcp-service 10.2.2.1:443' (appservers-https eqiad) - T367949 [18:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:33] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [18:13:38] claime: perhaps only a subset of the helm operations failed. 
it's not super clear [18:14:02] (03CR) 10Ssingh: [C:03+1] haproxy: Calculate increase of haproxy restarts [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [18:14:36] (03PS4) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [18:14:36] (03PS1) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [18:14:54] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.15 refs T366960 [18:14:58] T366960: 1.43.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T366960 [18:15:19] 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10008083 (10BCornwall) I'm guessing we want a Ganeti corto1001.eqiad.wmnet with internal-only networking and the typical debian packaging, puppetization, /etc/corto, and systemd service, etc? Happy to han... [18:15:39] dduvall: helmfile would have rolled back everything. 
There is a spike in events and CPU usage for the server [18:15:40] (03CR) 10BCornwall: [C:03+2] haproxy: Calculate increase of haproxy restarts [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [18:15:42] bunch of events [18:16:08] (03PS2) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [18:16:09] (03PS5) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [18:16:19] errors are back down, but I'm not sure if there's an actual problem rn [18:16:19] (03PS7) 10Dzahn: firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [18:16:39] https://grafana.wikimedia.org/goto/3RUVnaXSg?orgId=1 [18:16:42] I don't like that spike [18:16:53] (03Merged) 10jenkins-bot: haproxy: Calculate increase of haproxy restarts [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [18:18:12] (03CR) 10CI reject: [V:04-1] puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [18:19:29] (03PS3) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [18:19:29] (03PS6) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [18:19:29] (03CR) 10CI reject: [V:04-1] puppetserver refactor: split out git service user 
profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [18:19:53] (03CR) 10CI reject: [V:04-1] puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [18:20:05] (03CR) 10Scott French: [C:03+2] Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [18:20:10] (03CR) 10EoghanGaffney: [C:03+1] vrts: remove TicketCounter.log cp line [cookbooks] - 10https://gerrit.wikimedia.org/r/1056216 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:20:40] claime, dancy: all is well now it seems. there were multiple exceptions related to not being able to connect to kubemaster, but the first failure appears to have been in scap's deployment monitor. the rollback then occurred and failed as well [18:20:49] (03CR) 10Jsn.sherman: [C:03+1] "looks good to me!" 
[extensions/AutoModerator] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056214 (https://phabricator.wikimedia.org/T355930) (owner: 10Kgraessle) [18:20:54] https://www.irccloud.com/pastebin/yPjnF5Wl/ [18:22:07] Active: active (running) since Tue 2024-07-23 18:07:01 UTC; 14min ago [18:22:17] I think you hit the api server restart [18:22:38] claime: ah, good find [18:22:39] (03PS4) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [18:22:39] (03PS7) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [18:23:23] (03CR) 10Scardenasmolinar: [C:03+1] When user is reverted by Automoderator, send them a talk page message - one last non primitive data type left [extensions/AutoModerator] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056214 (https://phabricator.wikimedia.org/T355930) (owner: 10Kgraessle) [18:24:34] dduvall: claime: confirmed I do NOT get connection refused on kubemaster 6443 from deploy1002 [18:25:16] yeah so the kubernetes api servers restarted for some reason it's too late for me to search for [18:25:33] ah, ok [18:25:58] just repeat that, i would say [18:26:05] (03CR) 10CI reject: [V:04-1] puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [18:26:10] i can connect with telnet [18:26:13] (03CR) 10CI reject: [V:04-1] cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [18:27:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_api-https.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:27:43] (03PS5) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [18:27:43] (03PS8) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [18:28:01] (03PS1) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 [18:28:25] (03CR) 10CI reject: [V:04-1] logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (owner: 10Ahmon Dancy) [18:29:37] (03PS2) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 [18:29:44] the confd alerts above are related to cleanup work and talked about in -traffic [18:31:17] (03CR) 10CI reject: [V:04-1] puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [18:31:21] (03CR) 10CI reject: [V:04-1] cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [18:32:41] FIRING: [16x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_api-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:32:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) 
[18:36:00] thanks, mutante. yeah the subsequent `scap train` succeeded [18:37:03] 06SRE, 10observability, 06Traffic: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000#10008167 (10BCornwall) 05In progress→03Stalled [18:37:41] FIRING: [16x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_api-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:40:09] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [18:40:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10008171 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye ex... 
[18:40:45] (03CR) 10Ryan Kemper: [C:03+2] wdqs: add main and scholarly puppet config [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [18:42:19] !log puppetmaster1001/puppetmaster2001 - rm /var/run/confd-template/_srv_config-master_pybal_codfw_api-https.err to clear pybal icinga alerts after T367949 [18:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:30] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [18:45:07] !log puppetmaster1001/puppetmaster2001 - rm /var/run/confd-template/*.err to clear pybal icinga alerts after T367949 [18:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:25] 14SRE-Sprint-Week-Sustainability-March2023, 06Traffic, 10Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - https://phabricator.wikimedia.org/T330682#10008202 (10BCornwall) @CDanis Friendly ping. 
[18:47:15] (03CR) 10Awight: [C:03+1] Respect wgTranslateNumerals in Cite footnote markers [extensions/Cite] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056204 (https://phabricator.wikimedia.org/T370585) (owner: 10WMDE-Fisch) [18:47:21] (03CR) 10Awight: [C:03+1] Respect wgTranslateNumerals in Cite footnote markers [extensions/Cite] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056205 (https://phabricator.wikimedia.org/T370585) (owner: 10WMDE-Fisch) [18:47:41] RESOLVED: [16x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_api-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:49:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:49:22] 14SRE-Sprint-Week-Sustainability-March2023, 06Traffic, 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106#10008209 (10BCornwall) 05Open→03Stalled [18:50:03] (03PS1) 10Ryan Kemper: wdqs graph split: fix tab alignment [puppet] - 10https://gerrit.wikimedia.org/r/1056230 (https://phabricator.wikimedia.org/T364368) [18:50:32] (03CR) 10Gehel: [C:03+1] wdqs graph split: fix tab alignment [puppet] - 10https://gerrit.wikimedia.org/r/1056230 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [18:50:42] (03CR) 10Ryan Kemper: [C:03+2] wdqs graph split: fix tab alignment [puppet] - 10https://gerrit.wikimedia.org/r/1056230 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [18:54:17] (03PS1) 10Scott French: Remove has_lvs: true from appserver / api_appserver hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056231 
(https://phabricator.wikimedia.org/T367949) [18:54:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:55:56] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056231 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [18:57:35] 14SRE-Sprint-Week-Sustainability-March2023, 06Traffic, 10Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - https://phabricator.wikimedia.org/T330682#10008250 (10Vgutierrez) 05Stalled→03Invalid cp3050 is now longer being used, definitely this task can be... [18:58:58] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@01e1952]: (no justification provided) [18:59:14] (03CR) 10Dzahn: [C:03+1] "Eoghan, wanna try this together some time?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1055492 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:59:29] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@01e1952]: (no justification provided) (duration: 00m 30s) [19:00:19] (03PS2) 10Scott French: Set has_lvs: false on appserver / api_appserver hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056231 (https://phabricator.wikimedia.org/T367949) [19:00:42] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056231 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:01:07] (03CR) 10Dzahn: [C:03+1] "makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/1056231 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:01:29] (03CR) 10Ssingh: [C:03+1] Set has_lvs: false on appserver / api_appserver hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056231 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:05:29] (03CR) 10Scott French: "Thank you both!" [puppet] - 10https://gerrit.wikimedia.org/r/1056231 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:05:30] (03CR) 10Scott French: [C:03+2] Set has_lvs: false on appserver / api_appserver hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056231 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:10:11] (03PS8) 10Dzahn: firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [19:13:00] (03CR) 10CI reject: [V:04-1] firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [19:14:07] well.. 
this V-1 here is another one related to lvs realserver [19:14:16] but let's see first if it stays [19:14:52] mutante: looking [19:16:45] (03PS3) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 [19:16:45] (03PS1) 10Ahmon Dancy: logspam.pl: Consolidate several more persistent log messages [puppet] - 10https://gerrit.wikimedia.org/r/1056232 [19:18:03] (03PS6) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [19:18:03] modules/profile/spec/classes/profile_mediawiki_webserver_spec.rb [19:18:03] (03PS9) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [19:18:03] (03PS1) 10Andrew Bogott: trivial/test patch [puppet] - 10https://gerrit.wikimedia.org/r/1056233 [19:18:07] let(:params) { super().merge({:has_lvs => true}) } [19:18:09] .with_realserver_ips(['10.2.2.26', '10.2.2.5']) [19:18:57] (03PS2) 10Ahmon Dancy: logspam: Consolidate several more persistent log messages [puppet] - 10https://gerrit.wikimedia.org/r/1056232 [19:20:36] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056233 (owner: 10Andrew Bogott) [19:21:38] (03CR) 10CI reject: [V:04-1] puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [19:21:44] (03CR) 10CI reject: [V:04-1] cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [19:24:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:59] (03PS2) 10Andrew Bogott: trivial/test patch [puppet] - 10https://gerrit.wikimedia.org/r/1056233 [19:24:59] (03PS7) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [19:24:59] (03PS10) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [19:25:38] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:19] mutante: for lack of a better idea, I can point this at jobrunner for now, and we can sort out how to proceed [19:26:49] swfrench-wmf: see the spec file above, we should remove class lvs_realserver from there [19:28:12] sukhe: wait, I'm confused: the spec file you linked above should have all tests pass as-is [19:29:24] (03CR) 10CI reject: [V:04-1] trivial/test patch [puppet] - 10https://gerrit.wikimedia.org/r/1056233 (owner: 10Andrew Bogott) [19:29:26] (03CR) 10CI reject: [V:04-1] puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [19:29:34] (03CR) 10CI reject: [V:04-1] cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [19:30:25] but it doesn't seem to as per the CI because it tries to compile the catalog and fails. 
meaning that since we are not using lvs anymore for this, the test for modules/profile/spec/classes/profile_mediawiki_webserver_spec.rb should be updated to remove the include for class lvs::realserver [19:31:28] on the other hand, I do see other failures down the chain too [19:31:29] hmm [19:33:00] sukhe: for the cookbook discussion of before, it could either use downtime_services() for the specific service you want to downtime before it fires (so it's not alerting) or yes use skip_acked=True in wait_for_optimal() and ack the alert [19:33:07] * volans about to disappear into the night [19:33:15] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [19:33:45] volans: yeah thanks, I remember the discussion. I will add you on the review depending on which path to take for the fix. [19:34:34] feel free, just don't tell jo.bo ;) [19:34:53] oh sorry [19:34:58] I will add eluke.y :) [19:35:02] * sukhe muscle memory [19:36:18] hey, how can I help? 
[19:36:31] alright, to summarize the current state, there are a couple of breakages in test specs [19:36:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:10] https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster/5977/consoleFull from the patch linked by mutante above ^ is what I'm working from [19:37:14] rspec './modules/profile/spec/classes/profile_lvs_realserver_spec.rb[1:1:2:1]' # profile::lvs::realserver on debian-11-x86_64 with conftool is expected to compile into a catalogue without dependency cycles [19:37:21] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10008340 (10Volans) [19:37:44] modules/profile/spec/classes/profile_services_proxy_envoy_spec.rb is broken due to the disappearance of appservers-https [19:38:01] modules/profile/spec/classes/profile_lvs_realserver_spec.rb is broken due to pools [19:38:02] modules/profile/spec/classes/profile_lvs_realserver_spec.rb is broken similarly, but in a different way [19:38:04] 'api-https' => {'services' => ['apache2', 'php', 'mcrouter']}, [19:38:07] 'appservers-https' => {'services' => ['apache2', 'php', 'mcrouter', 'nginx']}, [19:38:11] precisely, yeah [19:38:21] so yeah, that's what I see as well [19:38:45] I'm working on a patch that addresses both, but if folks could take a look through those CI failures to see if there are additional ones to fix, that would be greatly appreciated [19:38:47] IMO, grepping for api-https and appservers-https etc [19:38:53] should fix it [19:39:23] swfrench-wmf: I think that should be it =1 [19:39:25] +1 [19:39:34] great, thank you! 
[19:39:46] I trust sukhe, but looking :) [19:40:32] (03PS3) 10Andrew Bogott: trivial/test patch [puppet] - 10https://gerrit.wikimedia.org/r/1056233 [19:40:32] (03PS8) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [19:40:32] (03PS11) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [19:40:32] (03PS1) 10Andrew Bogott: Revert "Remove conftool-data and service catalog for legacy appservers 3/3" [puppet] - 10https://gerrit.wikimedia.org/r/1056236 [19:40:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:41:34] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10008343 (10Volans) I took the liberty to add a cleanup item to the task description. If that should be part of another task feel to move it around. [19:44:16] swfrench-wmf: here to help if I can but don't want to step on your toes since you are already working on a patch [19:44:19] but don't hesitate to ping [19:44:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:47] sukhe: thanks! 
I'll ping for review once I have a patch [19:47:55] (belatedly yes I agree that's everything) [19:49:59] (03PS1) 10Scott French: Remove references to turned down service in spec files [puppet] - 10https://gerrit.wikimedia.org/r/1056239 (https://phabricator.wikimedia.org/T367949) [19:51:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [19:51:11] alright, let's see how that does ... waiting on CI [19:51:45] looking [19:52:30] of course, there's a typo :) [19:52:41] (03CR) 10CI reject: [V:04-1] Remove references to turned down service in spec files [puppet] - 10https://gerrit.wikimedia.org/r/1056239 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:53:01] let me confirm that's the only issue [19:53:37] fixing [19:53:40] looks good otherwise [19:54:06] (03CR) 10AOkoth: [C:03+2] vrts: remove TicketCounter.log cp line [cookbooks] - 10https://gerrit.wikimedia.org/r/1056216 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [19:55:29] (03PS2) 10Scott French: Remove references to turned down service in spec files [puppet] - 10https://gerrit.wikimedia.org/r/1056239 (https://phabricator.wikimedia.org/T367949) [19:56:38] (03CR) 10Ssingh: [C:03+1] "nice fix!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1056239 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:57:11] swfrench-wmf: I guess we wait for CI but +1 [19:57:13] (03CR) 10RLazarus: [C:03+1] Remove references to turned down service in spec files [puppet] - 10https://gerrit.wikimedia.org/r/1056239 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:57:14] 06SRE, 06Infrastructure-Foundations, 10netops: Add data to automation for new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#10008422 (10cmooney) 05Open→03Resolved [19:57:44] sukhe: rzl: thank you both :) yeah, let's see if I've indeed whacked all the moles [19:57:46] (03Merged) 10jenkins-bot: vrts: remove TicketCounter.log cp line [cookbooks] - 10https://gerrit.wikimedia.org/r/1056216 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [19:58:13] seems like you did :) [19:58:55] yayyy off we go to merrrrge [19:59:14] (03CR) 10Scott French: [C:03+2] "Thank you both for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1056239 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240723T2000). [20:00:04] tgr and WMDE-Fisch: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:16] \o [20:01:41] o/ [20:02:26] ^ please check with swfrench-wmf before deploying just to make sure [20:02:35] (03PS1) 10Scott French: Noop change to test CI [puppet] - 10https://gerrit.wikimedia.org/r/1056242 [20:02:44] (03PS9) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [20:02:44] (03PS12) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [20:03:01] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [20:03:34] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [20:03:58] WMDE-Fisch: tgr|away: I believe you should be good to proceed. 
please let me know if you observe any issues on the first backport [20:05:20] (03CR) 10AOkoth: prometheus: puppetise sql_exporter (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [20:05:34] thanks, I'll deploy then [20:05:58] (03PS4) 10Gergő Tisza: debug: Enable Special:WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030590 (https://phabricator.wikimedia.org/T350094) [20:06:08] 06SRE, 06Traffic: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821 (10ssingh) 03NEW [20:08:06] (03PS10) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) [20:08:06] (03PS13) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [20:08:24] (03Abandoned) 10Scott French: Noop change to test CI [puppet] - 10https://gerrit.wikimedia.org/r/1056242 (owner: 10Scott French) [20:09:25] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [20:11:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030590 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [20:11:54] (03Merged) 10jenkins-bot: debug: Enable Special:WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030590 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [20:12:24] !log tgr@deploy1002 Started scap sync-world: Backport for [[gerrit:1030590|debug: Enable Special:WikimediaDebug (T350094)]] [20:12:26] 06SRE, 06Traffic: Research and respond to Let's Encrypt's intent to 
deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10008529 (10BBlack) Firefox has historically been the reason we've been stapling OCSP for the past many years. If our certificate has an OCSP URI in its metadata, th... [20:12:32] T350094: Enable verbose logging without installing the WikimediaDebug extension - https://phabricator.wikimedia.org/T350094 [20:13:28] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [20:14:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:14:29] 06SRE, 06Traffic: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10008544 (10BBlack) Note also Digicert's annual renewal is coming soon in T368560 . We should maybe look at whether the OCSP URI is optional in the form for making t... [20:14:56] !log tgr@deploy1002 tgr: Backport for [[gerrit:1030590|debug: Enable Special:WikimediaDebug (T350094)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:15:38] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:57] !log tgr@deploy1002 tgr: Continuing with sync [20:17:58] WMDE-Fisch: should I do yours or do you want to self-deploy? 
[20:18:23] tgr|away: would be nice, if you could do it [20:18:44] (03PS9) 10Dzahn: firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [20:19:08] (03CR) 10Gergő Tisza: [C:03+2] Respect wgTranslateNumerals in Cite footnote markers [extensions/Cite] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056204 (https://phabricator.wikimedia.org/T370585) (owner: 10WMDE-Fisch) [20:19:10] (03CR) 10Gergő Tisza: [C:03+2] Respect wgTranslateNumerals in Cite footnote markers [extensions/Cite] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056205 (https://phabricator.wikimedia.org/T370585) (owner: 10WMDE-Fisch) [20:19:13] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989#10008570 (10Kappakayala) Is there an update on this? We have a new team member joining us and this will be super helpful as we onboard them. [20:19:36] (03CR) 10Brennen Bearnes: [C:03+1] "I think that these (sometimes?) 
may indicate real new code breakage rather than just database weather:" [puppet] - 10https://gerrit.wikimedia.org/r/1056232 (owner: 10Ahmon Dancy) [20:21:53] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1030590|debug: Enable Special:WikimediaDebug (T350094)]] (duration: 09m 28s) [20:22:07] (03PS4) 10Dzahn: doc: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055488 (https://phabricator.wikimedia.org/T370677) [20:22:17] T350094: Enable verbose logging without installing the WikimediaDebug extension - https://phabricator.wikimedia.org/T350094 [20:23:27] (03PS3) 10Dzahn: aphlict: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) [20:24:37] (03PS4) 10Dzahn: phabricator: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) [20:25:01] (03CR) 10CI reject: [V:04-1] phabricator: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:26:23] (03CR) 10Brennen Bearnes: [C:03+1] logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (owner: 10Ahmon Dancy) [20:30:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:15] (03CR) 10Krinkle: [C:03+1] sshkey_list: Fix stray quote [software/bitu] - 10https://gerrit.wikimedia.org/r/1055998 (owner: 10Bartosz Dziewoński) [20:34:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:07] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [20:38:10] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [20:41:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10008619 (10Jclark-ctr) [20:42:03] (03CR) 10Krinkle: logspam: Consolidate CurlFactory cURL errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (owner: 10Ahmon Dancy) [20:43:12] (03Merged) 10jenkins-bot: Respect wgTranslateNumerals in Cite footnote markers [extensions/Cite] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056204 (https://phabricator.wikimedia.org/T370585) (owner: 10WMDE-Fisch) [20:43:21] (03Merged) 10jenkins-bot: Respect wgTranslateNumerals in Cite footnote markers [extensions/Cite] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056205 (https://phabricator.wikimedia.org/T370585) (owner: 10WMDE-Fisch) [20:44:08] !log tgr@deploy1002 Started scap sync-world: Backport for [[gerrit:1056204|Respect wgTranslateNumerals in Cite footnote markers (T370585)]], [[gerrit:1056205|Respect wgTranslateNumerals in Cite footnote markers (T370585)]] [20:44:12] T370585: Reference numbering appears incorrectly in arwiki (due to ignoring wgTranslateNumerals) - https://phabricator.wikimedia.org/T370585 [20:46:01] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10008630 (10Jclark-ctr) [20:46:26] !log tgr@deploy1002 wmde-fisch, tgr: Backport for [[gerrit:1056204|Respect wgTranslateNumerals in Cite footnote markers (T370585)]], [[gerrit:1056205|Respect wgTranslateNumerals in Cite footnote markers (T370585)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:07] * 
WMDE-Fisch testing [20:47:50] tgr|away: Works [20:48:07] !log tgr@deploy1002 wmde-fisch, tgr: Continuing with sync [20:50:00] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10008641 (10Jclark-ctr) [20:53:43] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1056204|Respect wgTranslateNumerals in Cite footnote markers (T370585)]], [[gerrit:1056205|Respect wgTranslateNumerals in Cite footnote markers (T370585)]] (duration: 09m 34s) [20:53:47] T370585: Reference numbering appears incorrectly in arwiki (due to ignoring wgTranslateNumerals) - https://phabricator.wikimedia.org/T370585 [20:54:17] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10008645 (10Jclark-ctr) @Arnoldokoth if you can update Partitioning/Raid: HW Raid: Y/N, Partman recipe and/or desired Raid Level: TODO and Update the operations/puppet [20:54:46] !log UTC late deploys done [20:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:35] tgr|away: Works. Thanks for getting it done. ;-) [20:55:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10008650 (10Jclark-ctr) [20:56:18] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10008652 (10Scott_French) Many thanks, all who helped get this out the door. At this point, the LVS service turndown is done, and we've shaken out a handful... 
[20:56:40] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10008653 (10Scott_French) [20:59:08] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10008664 (10Jclark-ctr) [21:01:46] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10008684 (10Jclark-ctr) @Dzahn can you update operations/puppet repo [21:04:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:38] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:48] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10008698 (10VRiley-WMF) [21:19:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:16] 06SRE, 06Traffic, 13Patch-For-Review: purged issues while kafka brokers are 
restarted - https://phabricator.wikimedia.org/T334078#10008822 (10BCornwall) 05Open→03Stalled [21:41:24] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10008833 (10Krinkle) [21:43:12] 06SRE, 06Traffic: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097#10008836 (10BCornwall) 05In progress→03Stalled [21:52:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P66894 and previous config saved to /var/cache/conftool/dbconfig/20240723-215225-ladsgroup.json [21:53:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P66895 and previous config saved to /var/cache/conftool/dbconfig/20240723-215309-ladsgroup.json [21:53:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P66896 and previous config saved to /var/cache/conftool/dbconfig/20240723-215338-ladsgroup.json [21:54:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:55:06] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10008870 (10Ladsgroup) I'm repooling the replicas now. 
[21:55:38] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:03:59] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:05:16] (03CR) 10Scott French: "Thanks, Riccardo!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [22:07:26] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt pc1017 - jclark@cumin1002" [22:07:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P66897 and previous config saved to /var/cache/conftool/dbconfig/20240723-220731-ladsgroup.json [22:08:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P66898 and previous config saved to /var/cache/conftool/dbconfig/20240723-220815-ladsgroup.json [22:08:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P66899 and previous config saved to /var/cache/conftool/dbconfig/20240723-220844-ladsgroup.json [22:08:52] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt pc1017 - jclark@cumin1002" [22:08:52] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:22:09] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T370672#10008928 (10Jhancock.wm) a:03Jhancock.wm [22:22:37] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P66900 and previous config saved to /var/cache/conftool/dbconfig/20240723-222236-ladsgroup.json [22:22:44] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [22:22:45] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [22:23:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P66901 and previous config saved to /var/cache/conftool/dbconfig/20240723-222320-ladsgroup.json [22:23:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P66902 and previous config saved to /var/cache/conftool/dbconfig/20240723-222349-ladsgroup.json [22:24:13] (03PS1) 10JHathaway: fix Puppet::FileServing::Content for puppet 7 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1056258 (https://phabricator.wikimedia.org/T367547) [22:29:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:36:37] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10008969 (10Jhancock.wm) [22:37:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 
100%: Maint over', diff saved to https://phabricator.wikimedia.org/P66903 and previous config saved to /var/cache/conftool/dbconfig/20240723-223742-ladsgroup.json [22:38:23] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10008971 (10Jhancock.wm) [22:38:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P66904 and previous config saved to /var/cache/conftool/dbconfig/20240723-223826-ladsgroup.json [22:38:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P66905 and previous config saved to /var/cache/conftool/dbconfig/20240723-223855-ladsgroup.json [22:38:57] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10008973 (10Jhancock.wm) [22:39:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10008974 (10Jhancock.wm) [22:45:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:49:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:54:11] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:56:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:57:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host 
pc1017.mgmt.eqiad.wmnet with reboot policy FORCED [22:59:32] (03PS7) 10Scott French: mediawiki-cache-warmup: support 'clone' for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1054968 (https://phabricator.wikimedia.org/T369921) [22:59:32] (03PS6) 10Scott French: deployment_server: install the cache warmup script [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) [23:03:11] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 14.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:05:05] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10008996 (10Jhancock.wm) @Dzahn could you update the puppet repo for us when you have a moment? thanks in advance! 
[23:06:26] (PS1) BryanDavis: hieradata: Update Striker to 2024-07-20-113830-production [puppet] - https://gerrit.wikimedia.org/r/1056263 (https://phabricator.wikimedia.org/T369395)
[23:09:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1017.mgmt.eqiad.wmnet with reboot policy FORCED
[23:11:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm
[23:11:42] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10009021 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm
[23:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 18.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:12:40] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[23:13:00] (CR) Scott French: "Thank you for the review!" [puppet] - https://gerrit.wikimedia.org/r/1054968 (https://phabricator.wikimedia.org/T369921) (owner: Scott French)
[23:16:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding pc2017 to codfw - jhancock@cumin2002"
[23:17:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding pc2017 to codfw - jhancock@cumin2002"
[23:17:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:18:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 6.126% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:19:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:19:34] (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2024-07-17-145014 to 2024-07-19-164024 [deployment-charts] - https://gerrit.wikimedia.org/r/1056266 (https://phabricator.wikimedia.org/T57876)
[23:19:44] (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2024-07-17-145805 to 2024-07-23-225548 [deployment-charts] - https://gerrit.wikimedia.org/r/1056267
[23:20:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host pc2017.mgmt.codfw.wmnet with reboot policy FORCED
[23:24:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:28:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 14.74% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:28:54] ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10009085 (Papaul) @Jhancock.wm i setup the node to use xe-0/0/39 ` papaul@lsw1-d8-codfw# run show interfaces xe-0/0/39 descriptions Interface Admin Link Description...
[23:34:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:36:54] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:36:54] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:33] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056268
[23:38:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056268 (owner: TrainBranchBot)
[23:39:04] (CR) Scott French: [C:+1] "While very much out of depth, this sounds reasonable to me! LGTM" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1056258 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[23:39:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2017.mgmt.codfw.wmnet with reboot policy FORCED
[23:42:34] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc2017']
[23:43:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['pc2017']
[23:43:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc2017']
[23:44:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 0.4967% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:49:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pc2017']
[23:53:58] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 0.2111% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:54:18] upstream connect error or disconnect/reset before headers. reset reason: connection failure
[23:54:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2017.codfw.wmnet with OS bookworm
[23:54:46] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10009097 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2017.codfw.wmnet with OS bookworm
[23:55:38] FIRING: [15x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:58:36] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1017.eqiad.wmnet with OS bookworm
[23:58:46] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10009102 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm executed with errors: - pc1017 (*...
[23:58:50] easing up now
[23:58:57] RESOLVED: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 0.04153% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:59:20] FIRING: [15x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown