[00:06:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P33775 and previous config saved to /var/cache/conftool/dbconfig/20220905-000606-ladsgroup.json
[00:13:24] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P33776 and previous config saved to /var/cache/conftool/dbconfig/20220905-002112-ladsgroup.json
[00:36:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P33777 and previous config saved to /var/cache/conftool/dbconfig/20220905-003619-ladsgroup.json
[00:36:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[00:36:22] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[00:36:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[01:11:54] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:14:22] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:21:44] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:30:06] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:36:45] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:06] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:05:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:06:45] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:04] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:20:32] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:40:16] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:46:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P33778 and previous config saved to /var/cache/conftool/dbconfig/20220905-024602-ladsgroup.json
[02:46:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[02:46:05] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:46:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[03:27:10] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:36] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:18] PROBLEM - BGP
status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:01:58] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:13:33] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:16:26] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:18:28] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:42:48] (PS7) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[04:58:04] (PS1) Stang: Upload missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829329 (https://phabricator.wikimedia.org/T317004)
[04:59:43] (PS1) Stang: Fix missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829330 (https://phabricator.wikimedia.org/T317004)
[05:02:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:10:38] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:17:48] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK -
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:38:24] (CR) Tim Starling: [C: +2] Multi-DC: go back to testwiki only [puppet] - https://gerrit.wikimedia.org/r/828677 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling)
[05:53:54] (PS3) Giuseppe Lavagetto: deployment-prep: serve php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/824216 (https://phabricator.wikimedia.org/T306042)
[05:55:15] (CR) Giuseppe Lavagetto: [C: +2] deployment-prep: serve php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/824216 (https://phabricator.wikimedia.org/T306042) (owner: Giuseppe Lavagetto)
[05:56:49] (PS5) Giuseppe Lavagetto: Move 10% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736)
[06:07:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[06:08:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[06:11:38] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:54] (PS1) Giuseppe Lavagetto: canary_appserver: use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829550 (https://phabricator.wikimedia.org/T271736)
[06:13:56] (PS1) Giuseppe Lavagetto: jobrunner: convert to use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736)
[06:13:58] (PS1) Giuseppe Lavagetto: api appserver: convert to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829552 (https://phabricator.wikimedia.org/T271736)
[06:14:00] (PS1) Giuseppe Lavagetto: appserver: convert to php 7.4 by default
[puppet] - https://gerrit.wikimedia.org/r/829553
[06:28:24] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr2-eqiad:xe-4/1/3
[06:28:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr2-eqiad:xe-4/1/3
[06:30:48] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:36] SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ayounsi) From Lumen diagnostic tool: > SERVICE ALARMS NEEDS ATTENTION We have detected equipment alarms. Further Investigation is required. > Ticket ID...
[06:44:42] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T317009 - The acknowledgement expires at: 2022-09-06 06:44:25. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:42] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T317009 - The acknowledgement expires at: 2022-09-06 06:44:25. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:09] (PS3) Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version [puppet] - https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042)
[06:50:38] SRE, Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (Ladsgroup) One random note: Can dumps migrate to apache from nginx?
To standardize our infra so I don't look for apache logs in hurry in a Sunday.
[06:54:22] (CR) Muehlenhoff: [C: +2] webperf: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829121 (owner: Muehlenhoff)
[06:55:21] (CR) Muehlenhoff: [C: +2] releases: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:55:26] (PS4) Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version [puppet] - https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042)
[06:55:28] (PS3) Muehlenhoff: releases: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013)
[07:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T0700)
[07:00:05] koi and _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:20] o/
[07:00:32] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37110/console" [puppet] - https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: Giuseppe Lavagetto)
[07:00:34] o/
[07:00:35] Joe can self-serve
[07:01:05] <_joe_> Amir1: actually, I'd like to be served :D
[07:01:15] <_joe_> jokes aside, should I just go now or wait?
[07:01:23] https://deploy-commands.toolforge.org/bacc/823678
[07:01:33] _joe_: you go first, it's important
[07:01:39] _joe_: deploy your patch first, I'll deploy koi's after?
:)
[07:01:44] <_joe_> ack
[07:02:17] (CR) Giuseppe Lavagetto: [C: +2] Move 10% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:03:04] (Merged) jenkins-bot: Move 10% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:04:05] <_joe_> syncing
[07:07:10] once you all are done, ping me
[07:07:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:07:48] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823678|Move 10% of traffic to php 7.4 (T271736)]] (duration: 03m 50s)
[07:07:50] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736
[07:08:26] (PS1) Ladsgroup: Make English Wikipedia read new on templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/829556 (https://phabricator.wikimedia.org/T306673)
[07:10:13] _joe_: are you done with your patch?
[07:10:21] <_joe_> urbanecm: yes
[07:10:25] <_joe_> sorry, the log was enough
[07:10:30] <_joe_> it's a simple config knob
[07:10:38] I wasn't sure if there's any follow-up or anything :)
[07:10:47] Thanks, going ahead with koi's patches now.
[07:10:49] <_joe_> and tbh, I don't expect great changes until we switch 100% of the traffic over :)
[07:10:54] <_joe_> yes, sorry koi for the wait
[07:12:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:12:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:14:00] (PS2) Urbanecm: Upload missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829329 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:14:05] (CR) Urbanecm: [C: +2] Upload missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829329 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:14:16] (PS2) Urbanecm: Fix missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829330 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:14:19] (CR) Urbanecm: [C: +2] Fix missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829330 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:14:53] (Merged) jenkins-bot: Upload missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829329 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:15:15] (Merged) jenkins-bot: Fix missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829330 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:15:32] koi: your patch is at mwdebug1001
[07:15:38] looking
[07:15:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:16:52] urbanecm: both logo on these two sites LGTM
[07:17:21] great, syncing
[07:19:35] !log installing ghostscript security updates
[07:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:22:02] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: ff2e1082d8b3fe0ba93cd37a1b516dece84a834b: Upload missing logo for mniwiktionary and frwikiquote (T317004) (duration: 03m 50s)
[07:22:04] T317004: Missing logo for mniwiktionary and frwikiquote - https://phabricator.wikimedia.org/T317004
[07:25:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:25:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:25:39] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 739920ceb09358a2ea89d82494522876fffd2621: Fix missing logo for mniwiktionary and frwikiquote (T317004) (duration: 03m 36s)
[07:25:46] koi: should be live now
[07:25:48] (PS1) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[07:26:04] Amir1: over to you :)
[07:26:15] awesome
[07:26:30] (PS2) Ladsgroup: Make English Wikipedia read new on templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/829556 (https://phabricator.wikimedia.org/T306673)
[07:26:40] (CR) Ladsgroup: [C: +2] Make English Wikipedia read new on templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/829556 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[07:27:34] (Merged) jenkins-bot: Make English Wikipedia read new on templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/829556 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[07:29:09] (CR) Elukey: Add a helmfile configuration for the dse-k8s-eqiad cluster (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[07:29:14] !log
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:32:52] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:829556|Make English Wikipedia read new on templatelinks migration (T306673)]] (duration: 03m 31s)
[07:32:56] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673
[07:34:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:38:14] (PS1) Ayounsi: Rename Telia to Arelion [homer/public] - https://gerrit.wikimedia.org/r/829558
[07:38:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:38:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:42:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:44:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:48:24] (CR) Muehlenhoff: Systemd timer: Cleanup a few dangling absent cronjob references. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:51:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:22] (PS1) Slyngshede: P:openldap::management rename variable. [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673)
[07:52:58] (CR) CI reject: [V: -1] P:openldap::management rename variable.
[puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:57:21] (PS8) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[07:57:23] (PS1) Stang: Replace wordmark/tagline with correct naming style [mediawiki-config] - https://gerrit.wikimedia.org/r/829561 (https://phabricator.wikimedia.org/T307705)
[07:58:05] (PS2) Slyngshede: P:openldap::management rename variable. [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673)
[08:01:37] !log rename Telia to Arelion in Netbox
[08:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:52] (PS1) Ladsgroup: Stop writing to old templatelinks fields in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/829562 (https://phabricator.wikimedia.org/T312865)
[08:03:25] (PS1) Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705)
[08:04:51] (PS2) Ladsgroup: Stop writing to old templatelinks fields in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/829562 (https://phabricator.wikimedia.org/T312865)
[08:05:06] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:06:07] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37111/console" [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:07:00] (CR) Ladsgroup: [C: +2] Stop writing to old templatelinks fields in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/829562 (https://phabricator.wikimedia.org/T312865) (owner:
Ladsgroup)
[08:07:58] (Merged) jenkins-bot: Stop writing to old templatelinks fields in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/829562 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[08:08:37] (PS2) David Caro: dynamicproxy: add simple compile test [puppet] - https://gerrit.wikimedia.org/r/826299
[08:09:50] (CR) Filippo Giunchedi: [C: +1] tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 (owner: David Caro)
[08:10:17] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37112/console" [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:12:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:13:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[08:13:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[08:14:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[08:14:11] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:829562|Stop writing to old templatelinks fields in s7 (T312865)]] (duration: 03m 51s)
[08:14:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[08:14:15] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865
[08:14:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[08:14:19] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime
(exit_code=97) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[08:14:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[08:15:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[08:15:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:15:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:15:40] (CR) David Caro: [C: +2] "Safe and small enough, merging." [puppet] - https://gerrit.wikimedia.org/r/826299 (owner: David Caro)
[08:17:39] (CR) David Caro: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) (owner: Majavah)
[08:18:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:18:28] (CR) David Caro: [C: +2] tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 (owner: David Caro)
[08:19:52] (PS2) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:20:14] (CR) David Caro: [C: +2] P:toolforge::grid: remove legacy host key stuff [puppet] - https://gerrit.wikimedia.org/r/829305 (owner: Majavah)
[08:21:19] (CR) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:21:28] (PS3) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references.
[puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:22:00] (Merged) jenkins-bot: tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 (owner: David Caro)
[08:24:48] (CR) Filippo Giunchedi: [V: +1 C: +1] udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:27:26] SRE: Degraded RAID on cp4021 - https://phabricator.wikimedia.org/T293225 (Volans) Open→Resolved a:Volans Obsolete, resolving.
[08:28:00] (CR) Filippo Giunchedi: [C: +1] logstash: output thanos-query syslogs to kafka and local file [puppet] - https://gerrit.wikimedia.org/r/828960 (https://phabricator.wikimedia.org/T316867) (owner: Herron)
[08:30:24] (CR) Giuseppe Lavagetto: [C: +1] P:prometheus::nutcracker_exporter: Order service and package [puppet] - https://gerrit.wikimedia.org/r/828504 (owner: Clément Goubert)
[08:30:31] (CR) Muehlenhoff: Systemd timer: Cleanup a few dangling absent cronjob references.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:31:21] (CR) Giuseppe Lavagetto: [C: +1] P:mediawiki::php: Order wmerrors config and package install [puppet] - https://gerrit.wikimedia.org/r/828500 (owner: Clément Goubert)
[08:32:17] (PS3) Muehlenhoff: udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013)
[08:33:00] (PS2) Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705)
[08:35:12] (CR) Muehlenhoff: [C: +2] udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:37:14] jouncebot: next
[08:37:14] In 4 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T1300)
[08:39:39] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
[08:40:22] (CR) Vgutierrez: [C: +1] "preview available here: https://grafana.wikimedia.org/dashboard/snapshot/8g27u7vLB6Hlc0EA0FK3zRmzf7WaivB3?orgId=1" [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) (owner: Vgutierrez)
[08:41:30] (PS4) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references.
[puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:41:34] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:42:53] (CR) Klausman: [C: +1] api-gateway: Distinguish between internal host and host header setting [deployment-charts] - https://gerrit.wikimedia.org/r/829216 (owner: Hnowlan)
[08:43:06] (PS5) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:43:29] SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (fgiunchedi) Thank you for following up, I think the culprit is the fact that the S3 compat API stores chunk...
[08:44:04] (CR) Muehlenhoff: Systemd timer: Cleanup a few dangling absent cronjob references.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:48:48] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet
[08:48:57] SRE, Observability-Metrics: Not all carbon service start at graphite reboot - https://phabricator.wikimedia.org/T316747 (fgiunchedi) Open→Resolved a:fgiunchedi
[08:49:58] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:50:49] (PS3) Samtar: CommonSettings: Load Phonos extension [mediawiki-config] - https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294)
[08:51:04] (CR) Samtar: CommonSettings: Load Phonos extension [mediawiki-config] - https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) (owner: Samtar)
[08:51:22] (PS3) Muehlenhoff: k8s: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013)
[08:53:04] (PS6) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:53:53] (CR) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:54:29] (CR) Slyngshede: [V: +1 C: +2] P:openldap::management rename variable.
[puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:55:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1006.eqiad.wmnet
[08:57:02] PROBLEM - Ensure legal html en.wb on en.wikibooks.org is CRITICAL: Text\sis\savailable\sunder\sthe a\shref=\/\/creativecommons\.org\/licenses\/by-sa\/3\.0\/Creative\sCommons\sAttribution-ShareAlike\sLicense./a: additional\sterms\smay\sapply\. html not found https://phabricator.wikimedia.org/project/members/28/
[08:59:01] (Abandoned) Jon Harald Søby: Remove GeoCrumbs from the Wikimedia Incubator [mediawiki-config] - https://gerrit.wikimedia.org/r/826279 (https://phabricator.wikimedia.org/T316109) (owner: Jon Harald Søby)
[09:02:06] SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ayounsi) > I am seeing an issue on our SUBSEA portion of the circuit. I am engaging my SUBSEA group at this time.
[09:03:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1006.eqiad.wmnet [09:04:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1008.eqiad.wmnet [09:04:41] !log hnowlan@deploy1002 Started deploy [restbase/deploy@a571f9a]: Add pcmwiki T310880 [09:04:43] T310880: Post-creation work for pcmwiki - https://phabricator.wikimedia.org/T310880 [09:05:46] (03CR) 10Muehlenhoff: [C: 03+2] k8s: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:05:47] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@a571f9a]: Add pcmwiki T310880 (duration: 01m 06s) [09:06:12] (03PS3) 10Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) [09:11:46] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1008.eqiad.wmnet [09:14:10] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1010.eqiad.wmnet [09:16:01] (03PS1) 10David Caro: p::metricsinfra:haproxy: move to epp template [puppet] - 10https://gerrit.wikimedia.org/r/829743 [09:17:04] (03CR) 10Ayounsi: [C: 03+2] Squid: permit production networks instead of aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [09:17:05] !log installing flac security updates [09:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:55] !log Squid: permit production networks instead of aggregate_networks - T265864 [09:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:57] T265864: Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 [09:20:01] (03CR) 10CI reject: [V: 04-1] p::metricsinfra:haproxy: 
move to epp template [puppet] - 10https://gerrit.wikimedia.org/r/829743 (owner: 10David Caro) [09:22:26] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:22:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1010.eqiad.wmnet [09:23:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye [09:23:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [09:23:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye [09:23:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [09:23:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T314041)', diff saved to https://phabricator.wikimedia.org/P33779 and previous config saved to /var/cache/conftool/dbconfig/20220905-092338-ladsgroup.json [09:23:41] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [09:24:27] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [09:24:36] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:24:51] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[09:25:36] (03PS2) 10Jbond: raid: use modern nrpe defines [puppet] - 10https://gerrit.wikimedia.org/r/825740 (owner: 10Majavah) [09:25:40] !log deployed calico to dse-k8s cluster T310174 [09:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:43] T310174: Configure routing for dse-k8s cluster - https://phabricator.wikimedia.org/T310174 [09:25:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/825740 (owner: 10Majavah) [09:26:22] (03PS1) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) [09:27:15] (03PS2) 10David Caro: p::metricsinfra:haproxy: move to epp template [puppet] - 10https://gerrit.wikimedia.org/r/829743 [09:27:17] (03PS2) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) [09:28:20] (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/816105 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:28:22] (03PS3) 10David Caro: p::metricsinfra:haproxy: move to epp template [puppet] - 10https://gerrit.wikimedia.org/r/829743 [09:28:24] (03PS3) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) [09:28:52] (03CR) 10Jbond: [C: 03+2] C:admin: add support for deprecated groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) (owner: 10Jbond) [09:29:18] (03PS1) 10Jelto: gitlab: reduce backup_keep_time to 1d [puppet] - 10https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463) [09:29:26] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1012.eqiad.wmnet [09:30:40] (03CR) 10CI 
reject: [V: 04-1] p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [09:32:09] (03CR) 10Jbond: [C: 03+2] apt::noupgrade: remove [puppet] - 10https://gerrit.wikimedia.org/r/826350 (owner: 10Majavah) [09:32:15] (03CR) 10Jbond: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/826350 (owner: 10Majavah) [09:32:51] (03PS4) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) [09:33:15] (03CR) 10CI reject: [V: 04-1] p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [09:34:55] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:35:41] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage [09:37:03] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[09:37:29] (03CR) 10Clément Goubert: [C: 04-1] "Comment on where fcgi_proxies ordering comes in needs to be corrected, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto) [09:38:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1012.eqiad.wmnet [09:39:31] (03CR) 10Jbond: "LGTM but will leave to wmcs for the final approval" [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [09:39:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage [09:39:37] (03CR) 10Jbond: [C: 03+1] Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [09:41:29] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:44:15] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Volans) a:05Volans→03None @Andrew what is the issue that you're still seeing? It looks good to me. I see that the host is correctly... 
[09:44:22] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37113/lists1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [09:44:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218) (owner: 10Ayounsi) [09:46:44] (03PS4) 10Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) [09:47:05] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:54:17] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:54:47] (03CR) 10Jbond: "lgtm but see comment inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/826798 (owner: 10Muehlenhoff) [09:56:22] !log hnowlan@deploy1002 Started deploy [restbase/deploy@79b3cd2]: Add guwwiktionary and bjnwiktionary T309058 T312216 [09:56:25] T312216: Add bjnwiktionary to RESTBase - https://phabricator.wikimedia.org/T312216 [09:56:26] T309058: Add guwwiktionary to RESTBase - https://phabricator.wikimedia.org/T309058 [09:56:44] (03PS1) 10Volans: ganeti-netbox-sync: fix dry-run behaviour [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) [09:57:14] (03PS5) 10Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) [09:57:17] (03CR) 10Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default 
php version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto) [09:58:20] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto) [09:59:04] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37114/mx1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828019 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [10:02:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::jobrunner: allow picking a default php version [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto) [10:02:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "This LGTM, though I see from https://phabricator.wikimedia.org/T315866#8194791 the alert might be noisy :| Let's go ahead with it though a" [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [10:03:17] (03PS1) 10Muehlenhoff: Mark several access groups as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/829754 [10:03:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [10:03:52] (03CR) 10Ayounsi: [C: 03+1] ganeti-netbox-sync: fix dry-run behaviour [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) (owner: 10Volans) [10:04:16] (03CR) 10Jbond: [C: 03+1] Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [10:04:21] (03PS2) 10Muehlenhoff: Mark several access groups as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/829754 (https://phabricator.wikimedia.org/T248161) [10:05:11] 
(03CR) 10Muehlenhoff: Allow cookbooks to handle restarts based on running one of more commands (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/826798 (owner: 10Muehlenhoff) [10:05:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1013.eqiad.wmnet [10:05:54] (03CR) 10Jbond: "thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/825755 (owner: 10Clément Goubert) [10:05:58] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37115/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [10:08:51] (03PS3) 10Giuseppe Lavagetto: deployment-prep: convert jobrunner to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/824218 (https://phabricator.wikimedia.org/T306042) [10:09:16] (03CR) 10Roman Stolar: [C: 03+1] "LGTM" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [10:11:27] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@79b3cd2]: Add guwwiktionary and bjnwiktionary T309058 T312216 (duration: 15m 05s) [10:11:30] T312216: Add bjnwiktionary to RESTBase - https://phabricator.wikimedia.org/T312216 [10:11:31] T309058: Add guwwiktionary to RESTBase - https://phabricator.wikimedia.org/T309058 [10:12:16] (03PS1) 10David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - 10https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) [10:12:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment-prep: convert jobrunner to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/824218 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto) [10:13:00] (03PS2) 10David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - 10https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) [10:13:33] !log upgrade 
python-pynetbox to 6.6 on netbox frontends - T310745 [10:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:36] T310745: Upgrade pynetbox - https://phabricator.wikimedia.org/T310745 [10:14:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1013.eqiad.wmnet [10:16:09] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:17:31] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1014.eqiad.wmnet [10:17:52] (03CR) 10FNegri: [C: 03+1] "Thanks for the responses, I think this can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/790710 (https://phabricator.wikimedia.org/T290494) (owner: 10Majavah) [10:19:48] (03PS2) 10Clément Goubert: C:cpufrequtils: Order install configuration and service [puppet] - 10https://gerrit.wikimedia.org/r/828502 [10:20:31] (03CR) 10CI reject: [V: 04-1] C:cpufrequtils: Order install configuration and service [puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert) [10:21:22] (03CR) 10Clément Goubert: "Can you take a look? ensure_packages ordering is iffy on a new install and makes us do multiple puppet runs to achieve the desired state." 
[puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert) [10:21:45] (03PS1) 10Vgutierrez: mtail::atsbackend: Increase processing time to 150ms [puppet] - 10https://gerrit.wikimedia.org/r/829757 (https://phabricator.wikimedia.org/T316921) [10:22:15] (03PS3) 10Clément Goubert: C:cpufrequtils: Order install configuration and service [puppet] - 10https://gerrit.wikimedia.org/r/828502 [10:22:17] (03CR) 10CI reject: [V: 04-1] mtail::atsbackend: Increase processing time to 150ms [puppet] - 10https://gerrit.wikimedia.org/r/829757 (https://phabricator.wikimedia.org/T316921) (owner: 10Vgutierrez) [10:22:58] (03CR) 10Jbond: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [10:23:11] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:23:31] (03CR) 10Vlad.shapik: [C: 03+1] "Looks good to me." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [10:24:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1014.eqiad.wmnet [10:25:27] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[10:25:51] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:prometheus::nutcracker_exporter: Order service and package [puppet] - 10https://gerrit.wikimedia.org/r/828504 (owner: 10Clément Goubert) [10:25:59] (03PS1) 10Ayounsi: Enable pynetbox threading for generate_dns_snippets.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829758 (https://phabricator.wikimedia.org/T311486) [10:26:17] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10jbond) >>! In T296832#8205878, @Volans wrote: >>>! In T296832#8143427, @cmooney wrote: >> For a bit of context the ab... [10:26:23] (03PS2) 10Vgutierrez: mtail::atsbackend: Increase processing time to 150ms [puppet] - 10https://gerrit.wikimedia.org/r/829757 (https://phabricator.wikimedia.org/T316921) [10:26:47] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Distinguish between internal host and host header setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/829216 (owner: 10Hnowlan) [10:26:55] (03CR) 10Volans: [C: 03+1] "LGTM, let's test it!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829758 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [10:27:05] (03CR) 10Hnowlan: [C: 03+2] Fix environment in prep stage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [10:27:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1015.eqiad.wmnet [10:28:00] (03CR) 10Ayounsi: [C: 03+2] Enable pynetbox threading for generate_dns_snippets.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829758 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [10:28:13] (03PS3) 10Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) [10:29:34] (03PS3) 10Clément Goubert: P:mediawiki::php: Order wmerrors config and package install [puppet] - 10https://gerrit.wikimedia.org/r/828500 [10:29:40] (03PS1) 10Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829760 (https://phabricator.wikimedia.org/T307705) [10:30:13] (03CR) 10Clément Goubert: [C: 03+2] C:ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494 (owner: 10Clément Goubert) [10:30:15] (03CR) 10Alexandros Kosiaris: "Adding Arnold and Daniel. 
As far as I am concerned, this LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [10:30:47] (03Merged) 10jenkins-bot: api-gateway: Distinguish between internal host and host header setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/829216 (owner: 10Hnowlan) [10:31:11] (03CR) 10Jbond: global: drop owner/group => root from file resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [10:31:48] 10SRE, 10Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (10jcrespo) I believe this was the impact and subsequent mitigation on eqord router (but hopfully someone can confirm): {F35509646} [10:31:51] 10SRE-swift-storage, 10User-fgiunchedi: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi Resolving since Thanos retention has been trimmed, more space is being freed as part of {T314835} [10:32:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 (owner: 10Muehlenhoff) [10:32:15] (03CR) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [10:32:24] (03Abandoned) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [10:34:14] (03CR) 10Clément Goubert: [C: 03+2] P:mediawiki::php: Order wmerrors config and package install [puppet] - 10https://gerrit.wikimedia.org/r/828500 (owner: 10Clément Goubert) [10:34:25] (03PS2) 10Muehlenhoff: Remove sre.misc-clusters.sretest [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 [10:35:29] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[10:36:04] (03CR) 10Vgutierrez: [C: 03+2] mtail::atsbackend: Increase processing time to 150ms [puppet] - 10https://gerrit.wikimedia.org/r/829757 (https://phabricator.wikimedia.org/T316921) (owner: 10Vgutierrez) [10:36:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1015.eqiad.wmnet [10:37:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829200 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:37:27] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:37:36] (03PS2) 10Ayounsi: Bump pynetbox to ~= 6.6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) [10:37:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:37:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T314041)', diff saved to https://phabricator.wikimedia.org/P33780 and previous config saved to /var/cache/conftool/dbconfig/20220905-103749-ladsgroup.json [10:37:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:37:52] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:39:40] (03PS2) 10Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829760 (https://phabricator.wikimedia.org/T307705) [10:41:21] (03CR) 10Muehlenhoff: [C: 03+2] Remove sre.misc-clusters.sretest [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 (owner: 10Muehlenhoff) [10:41:56] (03Merged) 10jenkins-bot: Fix environment in prep stage 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [10:43:18] (03PS1) 10MVernon: swift: ms-be2037/sdg1 failed; ms-be2067/sdc1 fixed [puppet] - 10https://gerrit.wikimedia.org/r/829763 (https://phabricator.wikimedia.org/T314049) [10:43:23] (03PS1) 10Stang: Drop unused wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705) [10:44:25] (03CR) 10JMeybohm: [C: 03+1] "Fine by me :)" [puppet] - 10https://gerrit.wikimedia.org/r/820748 (owner: 10Muehlenhoff) [10:44:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove sre.misc-clusters.sretest (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 (owner: 10Muehlenhoff) [10:45:41] 10SRE, 10Cloud-VPS (Project-requests), 10Patch-For-Review, 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10jbond) > one is e.g. Keith losing his VMs because of not knowing the context TBH i think this is the issue we need to reso... 
[10:49:07] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:50:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:51:45] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:52:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P33781 and previous config saved to /var/cache/conftool/dbconfig/20220905-105255-ladsgroup.json [10:52:59] (03PS1) 10Hnowlan: Fix online-tests in blubber container [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/829786 (https://phabricator.wikimedia.org/T312104) [10:55:19] !log set thanos ring replicas to 3.90 T311690 [10:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:22] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [10:55:46] (03CR) 10Jbond: [C: 04-1] Add clean-stale-puppet-certs script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829321 (owner: 10Andrew Bogott) [10:57:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:00:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) (owner: 10Volans) [11:01:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/829754 (https://phabricator.wikimedia.org/T248161) (owner: 
10Muehlenhoff) [11:02:38] (03CR) 10Slavina Stefanova: bullseye0: Add bullseye buildpack build/run images (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) (owner: 10David Caro) [11:02:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [11:03:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829743 (owner: 10David Caro) [11:04:42] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1003.eqiad.wmnet [11:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P33782 and previous config saved to /var/cache/conftool/dbconfig/20220905-110801-ladsgroup.json [11:12:19] (03PS1) 10Jbond: C:cpufrequtils: Add package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/829791 [11:12:21] (03PS1) 10Jbond: C:cpufrequtils: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/829792 [11:15:22] (03CR) 10Jbond: "See comments, i created a new CR in https://gerrit.wikimedia.org/r/c/operations/puppet/+/829791 with the suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert) [11:15:23] !log cgoubert@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=parsoid,name=parse1003.eqiad.wmnet [11:15:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37116/console" [puppet] - 10https://gerrit.wikimedia.org/r/829791 (owner: 10Jbond) [11:15:57] !log tstarling@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2142-2144].codfw.wmnet with reason: T316847 x2 failure test [11:16:00] T316847: Production test of x2 failure modes - https://phabricator.wikimedia.org/T316847 
[11:16:12] !log tstarling@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2142-2144].codfw.wmnet with reason: T316847 x2 failure test [11:16:19] !log pooled parse1003.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638 [11:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:cpufrequtils: Add package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/829791 (owner: 10Jbond) [11:17:52] (03CR) 10Jbond: [C: 03+2] C:cpufrequtils: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/829792 (owner: 10Jbond) [11:18:14] !log on db2142: stopped mariadb replication [11:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T314041)', diff saved to https://phabricator.wikimedia.org/P33783 and previous config saved to /var/cache/conftool/dbconfig/20220905-112308-ladsgroup.json [11:23:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:23:11] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:23:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:24:55] !log depooled wtp1036.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638 [11:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:38] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1003.eqiad.wmnet [11:27:38] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1003.eqiad.wmnet [11:29:58] !log on db2142: set master_delay=30 and restarted replication T316847 [11:30:00] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:01] T316847: Production test of x2 failure modes - https://phabricator.wikimedia.org/T316847 [11:30:01] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1004.eqiad.wmnet [11:32:11] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:32:42] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1034-1036].eqiad.wmnet with reason: Downtiming replaced wtp servers [11:32:47] (03CR) 10Roman Stolar: [C: 03+1] "Great!" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/829786 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [11:32:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1034-1036].eqiad.wmnet with reason: Downtiming replaced wtp servers [11:34:08] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1034.eqiad.wmnet [11:34:16] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1035.eqiad.wmnet [11:36:40] !log Set wtp103[4-5].eqiad.wmnet inactive pending decommission https://phabricator.wikimedia.org/T317025 [11:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:36] !log on db2142: dropping inbound mysql traffic T316847 [11:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:38] T316847: Production test of x2 failure modes - https://phabricator.wikimedia.org/T316847 [11:39:59] 10SRE, 10LDAP-Access-Requests: Grant Access to archiva-deployers for ebysans - https://phabricator.wikimedia.org/T317030 (10BTullis) [11:40:52] !log jnuche@deploy1002 
Installing scap version "4.16.0" for 584 hosts [11:40:56] SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ayounsi) > LUMEN Subsea group performed a cold reset on a card in Bude England to restore service. I am seeing traffic up at this time. [11:41:08] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr2-eqiad:xe-4/1/3 [11:41:10] !log jnuche@deploy1002 Installation of scap version "4.16.0" completed for 584 hosts [11:41:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr2-eqiad:xe-4/1/3 [11:41:25] SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ops-monitoring-bot) ===== Automated diagnostic for Netbox interface ID cr2-eqiad:xe-4/1/3 --- **Interface cr2-eqiad:xe-4/1/3** - admin-status: up - oper-... 
[11:42:21] (CR) Clément Goubert: "Dropping in favour of https://gerrit.wikimedia.org/r/c/operations/puppet/+/829791" [puppet] - https://gerrit.wikimedia.org/r/828502 (owner: Clément Goubert) [11:42:27] SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ayounsi) Open→Resolved a:ayounsi [11:42:35] (Abandoned) Clément Goubert: C:cpufrequtils: Order install configuration and service [puppet] - https://gerrit.wikimedia.org/r/828502 (owner: Clément Goubert) [11:43:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [11:43:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [11:43:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T312863)', diff saved to https://phabricator.wikimedia.org/P33784 and previous config saved to /var/cache/conftool/dbconfig/20220905-114352-ladsgroup.json [11:43:55] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [11:44:58] (PS2) Ayounsi: Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) [11:45:06] (PS1) Hnowlan: Add script for automating joining a single node to the cluster [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/829807 (https://phabricator.wikimedia.org/T309619) [11:46:33] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:47:41] (CR) Muehlenhoff: [C: +2] puppet_compiler: Assign SPDX headers [puppet] - 
https://gerrit.wikimedia.org/r/829200 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [11:47:45] SRE, LDAP-Access-Requests: Grant Access to archiva-deployers for ebysans - https://phabricator.wikimedia.org/T317030 (BTullis) I have applied this change. {F35509708,width=60%} [11:51:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet [11:51:13] RECOVERY - mediawiki-installation DSH group on parse1004 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:52:54] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1004.eqiad.wmnet [11:52:54] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1004.eqiad.wmnet [11:53:52] !log pooled parse1004.eqiad.wmnet (php 7.4 only) in parsoid cluster T312638 [11:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:55] T312638: Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638 [11:55:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet [11:55:22] !log on db2142: rejecting inbound mysql traffic T316847 [11:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:24] T316847: Production test of x2 failure modes - https://phabricator.wikimedia.org/T316847 [11:56:07] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [11:56:17] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse[1001-1004].eqiad.wmnet [11:56:18] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[1001-1004].eqiad.wmnet [11:59:09] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 
physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:01:14] (CR) Jbond: "took another pass thanks" [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede) [12:02:29] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:03:17] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:05:48] (CR) Filippo Giunchedi: [C: +1] swift: ms-be2037/sdg1 failed; ms-be2067/sdc1 fixed [puppet] - https://gerrit.wikimedia.org/r/829763 (https://phabricator.wikimedia.org/T314049) (owner: MVernon) [12:09:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet [12:10:00] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1007.eqiad.wmnet with OS bullseye [12:10:06] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye exec... 
[12:10:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1004.mgmt [12:10:23] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1004.mgmt [12:10:37] !log tstarling@cumin1001 START - Cookbook sre.hosts.remove-downtime for db[2142-2144].codfw.wmnet [12:10:38] !log tstarling@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db[2142-2144].codfw.wmnet [12:13:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet [12:14:10] !log depooled wtp1037.eqiad.wmnet from parsoid cluster T312638 [12:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:12] T312638: Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638 [12:14:29] (CR) MVernon: [C: +2] swift: ms-be2037/sdg1 failed; ms-be2067/sdc1 fixed [puppet] - https://gerrit.wikimedia.org/r/829763 (https://phabricator.wikimedia.org/T314049) (owner: MVernon) [12:16:40] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1005.eqiad.wmnet [12:16:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye [12:17:05] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye [12:18:44] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet [12:20:01] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:20:19] !log 
cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 18 hosts with reason: Downtime pending inclusion in production [12:20:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 18 hosts with reason: Downtime pending inclusion in production [12:22:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet [12:24:03] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet [12:25:44] (PS1) David Caro: build: use the standard path to get the docker binary [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811 [12:26:32] (CR) David Caro: bullseye0: Add bullseye buildpack build/run images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [12:31:20] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1005,parse1005.mgmt [12:31:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1005,parse1005.mgmt [12:31:55] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:33:26] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host datahubsearch1003.eqiad.wmnet [12:38:39] (PS1) Jcrespo: bacula: Add production and db storage hosts to the backup cluster [puppet] - https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582) [12:47:39] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
an-presto1007.eqiad.wmnet with OS bullseye [12:47:43] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye exec... [12:48:32] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1011.eqiad.wmnet with OS bullseye [12:48:38] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye [12:49:19] (PS2) Jcrespo: bacula: Add production and db storage hosts to the backup cluster [puppet] - https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582) [12:50:06] (PS2) Ayounsi: Add FHRP group support to generate_dns_snippets [software/netbox-extras] - https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218) [12:51:06] (PS3) Jcrespo: bacula: Add production and db storage hosts to the backup cluster [puppet] - https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582) [12:51:12] (CR) Slavina Stefanova: bullseye0: Add bullseye buildpack build/run images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [12:56:17] PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:31] RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T1300). [13:00:05] Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:23] o/ [13:00:54] Hi [13:01:20] hi Tchanders. I can deploy today :) [13:01:22] (PS2) Urbanecm: Enable partial action blocks on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/829012 (https://phabricator.wikimedia.org/T315525) (owner: Tchanders) [13:01:25] (CR) Urbanecm: [C: +2] Enable partial action blocks on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/829012 (https://phabricator.wikimedia.org/T315525) (owner: Tchanders) [13:01:34] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage [13:02:17] (Merged) jenkins-bot: Enable partial action blocks on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/829012 (https://phabricator.wikimedia.org/T315525) (owner: Tchanders) [13:03:35] Tchanders: your patch is at mwdebug1001. can you check? [13:03:46] urbanecm: Thanks, testing... 
[13:03:51] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:10] urbanecm: Looks good to me [13:05:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage [13:05:15] thanks, syncing [13:06:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:07:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:07:36] !log disabling puppet in codfw and the edges temporarily [13:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:23] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:08:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:08:45] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:09:07] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: edbcee4d9a901ce475ebcc53e4c4bc18e04bc2b8: Enable partial action blocks on fawiki (T315525) (duration: 03m 34s) [13:09:11] T315525: Deploy action blocks to pilot wikis - https://phabricator.wikimedia.org/T315525 [13:09:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:09:14] Tchanders: and, should be live [13:09:16] anything else? 
[13:09:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:09:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T314041)', diff saved to https://phabricator.wikimedia.org/P33785 and previous config saved to /var/cache/conftool/dbconfig/20220905-130944-ladsgroup.json [13:09:45] urbanecm: Thank you, as always! All good now [13:09:49] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:09:49] okay! [13:09:59] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis) @Cmjohnson - sorry to trouble you with this old ticket, but I'm having an issue with three of these new an-presto hosts. * an-presto... [13:09:59] !log UTC afternoon B&C window done [13:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 0:15:00 on puppetdb2002.codfw.wmnet with reason: Temporarily stop puppetdb [13:11:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on puppetdb2002.codfw.wmnet with reason: Temporarily stop puppetdb [13:13:02] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1011.eqiad.wmnet with OS bullseye [13:13:08] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye exec... 
[13:14:55] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [13:16:25] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1135512 bytes in 5.848 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [13:17:49] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:18:47] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:22:07] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:25:36] uh [13:27:32] (CR) Muehlenhoff: [C: +2] keyholder: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [13:30:58] SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (Volans) Great, thanks for checking it @jbond [13:31:13] !log wdqs1009 sudo systemctl stop wdqs-blazegraph.service [13:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:33] (CR) Volans: [C: +2] ganeti-netbox-sync: fix dry-run behaviour [software/netbox-extras] - https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) (owner: Volans) 
[13:33:04] (Merged) jenkins-bot: ganeti-netbox-sync: fix dry-run behaviour [software/netbox-extras] - https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) (owner: Volans) [13:34:25] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:36:18] (PS3) Volans: Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi) [13:36:24] (CR) Volans: [C: +2] Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi) [13:37:25] (Merged) jenkins-bot: Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi) [13:38:31] (CR) Jbond: "LGTM but some minor issues" [puppet] - https://gerrit.wikimedia.org/r/829016 (owner: Giuseppe Lavagetto) [13:39:22] (CR) Elukey: [C: +1] Mark several access groups as deprecated [puppet] - https://gerrit.wikimedia.org/r/829754 (https://phabricator.wikimedia.org/T248161) (owner: Muehlenhoff) [13:41:00] (CR) Jbond: [C: -1] gitlab: add gitlab::release::binary [puppet] - https://gerrit.wikimedia.org/r/829016 (owner: Giuseppe Lavagetto) [13:41:47] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[13:47:30] (CR) Muehlenhoff: [C: +2] Mark several access groups as deprecated [puppet] - https://gerrit.wikimedia.org/r/829754 (https://phabricator.wikimedia.org/T248161) (owner: Muehlenhoff) [13:48:28] !log pooled parse1005.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [13:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:33] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [13:50:19] (CR) Giuseppe Lavagetto: [V: +1 C: +2] api_appserver: convert all canaries to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829217 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto) [13:50:29] (PS1) Volans: tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) [13:51:49] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [13:52:07] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:16] (CR) CI reject: [V: -1] tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) (owner: Volans) [13:54:26] (CR) Ayounsi: [C: +1] tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) (owner: Volans) [14:01:33] !log depooled wtp1037.eqiad.wmnet from parsoid cluster T307219 [14:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:36] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:02:04] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[14:02:47] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [14:11:16] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1006.eqiad.wmnet [14:13:34] (PS2) Volans: tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) [14:13:45] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:19:59] (CR) Filippo Giunchedi: [C: +1] p::wmcs:prometheus: Add cloudvps federation job [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro) [14:20:06] (CR) Volans: [C: +2] tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) (owner: Volans) [14:21:01] (Merged) jenkins-bot: tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) (owner: Volans) [14:21:55] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1006,parse1006.mgmt [14:21:55] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1006,parse1006.mgmt [14:22:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T314041)', diff saved to https://phabricator.wikimedia.org/P33786 and previous config saved to /var/cache/conftool/dbconfig/20220905-142240-ladsgroup.json [14:22:42] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:22:57] (CR) Giuseppe Lavagetto: [C: +2] canary_appserver: use php 7.4 by default 
[puppet] - https://gerrit.wikimedia.org/r/829550 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto) [14:23:39] !log pooled parse1006.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [14:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:42] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:26:20] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [14:26:38] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [14:28:46] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [14:28:58] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [14:29:56] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [14:29:58] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [14:30:26] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [14:30:27] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[14:32:23] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:34] !log depooled wtp1039.eqiad.wmnet from parsoid cluster T307219 [14:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:38] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:34:57] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:37:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P33788 and previous config saved to /var/cache/conftool/dbconfig/20220905-143746-ladsgroup.json [14:40:42] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1036-1038].eqiad.wmnet with reason: Downtiming replace wtp servers [14:40:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1036-1038].eqiad.wmnet with reason: Downtiming replace wtp servers [14:42:00] (PS1) Btullis: Add an istio custom deploy configuration for dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/829822 (https://phabricator.wikimedia.org/T310175) [14:42:08] (CR) FNegri: "If you don't see any downsides, I would suggest rebasing this change on the main branch, combining this patch with https://gerrit.wikimedi" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro) [14:46:43] !add 100G to prometheus codfw / global instance [14:46:46] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1036.eqiad.wmnet [14:47:03] !log cgoubert@puppetmaster1001 
conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1037.eqiad.wmnet [14:48:11] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:48:51] !log Set wtp103[6-7].eqiad.wmnet inactive pending decommission T317025 [14:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:54] T317025: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 [14:50:56] (CR) Majavah: [C: -1] "This won't work as cloudmetrics* hosts are in the production private network and so can't access cloud vps endpoints directly" [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro) [14:50:58] (PS2) Btullis: Add an istio custom deploy configuration for dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/829822 (https://phabricator.wikimedia.org/T310175) [14:52:40] (CR) Majavah: [C: -1] p::metricsinfra:haproxy: Allow exposing federation endpoints (3 comments) [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: David Caro) [14:52:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P33789 and previous config saved to /var/cache/conftool/dbconfig/20220905-145252-ladsgroup.json [14:53:05] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:53:23] (CR) Hnowlan: [C: +2] Fix online-tests in blubber container [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/829786 (https://phabricator.wikimedia.org/T312104) 
(owner: Hnowlan) [15:02:55] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:04:36] !log updating docker.io on gitlab-runners [15:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:46] (Merged) jenkins-bot: Fix online-tests in blubber container [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/829786 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan) [15:06:33] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff) [15:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T314041)', diff saved to https://phabricator.wikimedia.org/P33790 and previous config saved to /var/cache/conftool/dbconfig/20220905-150758-ladsgroup.json [15:08:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:08:02] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:08:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:08:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:08:27] SRE, ops-codfw: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (colewhite) p:Triage→High The cluster will remain in a degraded state until replacements are installed. Please replace the failed disks as soon as possible. Thanks! 
[15:08:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:08:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33791 and previous config saved to /var/cache/conftool/dbconfig/20220905-150837-ladsgroup.json [15:09:17] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1007.eqiad.wmnet [15:09:43] RECOVERY - memcached socket on parse1007 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [15:15:15] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.253 second response time https://wikitech.wikimedia.org/wiki/Swift [15:16:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1007,parse1007.mgmt [15:16:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1007,parse1007.mgmt [15:17:24] SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer) a:Jelto→pfischer [15:17:37] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [15:17:49] (CR) Muehlenhoff: [C: +1] "Looks good!"
[puppet] - https://gerrit.wikimedia.org/r/826536 (owner: Majavah) [15:18:26] (CR) Muehlenhoff: [C: +2] cadvisor_exporter: Remove check fo Stretch [puppet] - https://gerrit.wikimedia.org/r/820748 (owner: Muehlenhoff) [15:19:07] !log pooled parse1007.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [15:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:11] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [15:19:53] Puppet, SRE, Infrastructure-Foundations, User-jbond: Add a deprecated flag to admin groups - https://phabricator.wikimedia.org/T248161 (jbond) Open→Resolved implmented [15:23:21] (PS1) Muehlenhoff: Remove peek-admins grup [puppet] - https://gerrit.wikimedia.org/r/829828 [15:27:39] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:28:24] !log depooled wtp1040.eqiad.wmnet from parsoid cluster T307219 [15:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:27] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [15:29:37] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218) (owner: Ayounsi) [15:30:04] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T1530).
[15:30:48] !log installing apache2 security updates [15:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:51] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1038.eqiad.wmnet [15:33:12] (CR) FNegri: [C: +1] "The combined diff of this patch and 829031 is minimal (it's basically only s/buster/bullseye/) and I verified I can build the new image us" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro) [15:33:58] (CR) FNegri: [C: +1] "Works fine on my Mac" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811 (owner: David Caro) [15:36:12] (PS1) Volans: ganeti-netbox-sync: fail on missing cluster group [software/netbox-extras] - https://gerrit.wikimedia.org/r/829830 [15:41:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:46:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:46:14] (CR) Raymond Ndibe: [C: +1] "not explicitly tested but shutil.which("docker") returns expected value on my machine so +1" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811 (owner: David Caro) [15:48:26] (CR) FNegri: "LGTM, I only have a small question (see inline comment). What would be the best way to test this?
What is a scenario where you want to "ch" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [15:52:04] (PS4) David Caro: p::metricsinfra:haproxy: move to epp template [puppet] - https://gerrit.wikimedia.org/r/829743 [15:52:06] (PS5) David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) [15:52:08] (CR) David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints (3 comments) [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: David Caro) [15:52:10] (PS3) David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) [15:52:12] (CR) David Caro: p::wmcs:prometheus: Add cloudvps federation job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro) [15:55:40] (CR) David Caro: bullseye0: Improve the install-packages script (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [15:59:39] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:04:16] (CR) David Caro: Remove buster0 buildpacks images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro) [16:05:00] (PS2) David Caro: bullseye0: Improve the install-packages script [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) [16:05:02] (CR) David Caro:
bullseye0: Improve the install-packages script (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [16:05:05] (PS2) David Caro: build: use the standard path to get the docker binary [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811 [16:12:45] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:17:05] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_navigationtiming_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:39] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main [16:27:14] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [16:29:23] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:30:15] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:32:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:39:14] (CR) Ayounsi: [C: +2] Add FHRP group support to generate_dns_snippets [software/netbox-extras] - https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218) (owner: Ayounsi) [16:44:16] SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi) [16:44:57] SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi) [16:45:52] (CR) Ayounsi: [C: +1] ganeti-netbox-sync: fail on missing cluster group [software/netbox-extras] - https://gerrit.wikimedia.org/r/829830 (owner: Volans) [16:47:11] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:00:04] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T1700) [17:00:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:25] (CR) Hashar: Json schema from Gerrit Java event classes (6 comments) [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: Hashar) [17:09:33] (PS4) Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) [17:16:39] (CR) Hashar: "Instead of having all the java classes in the same directory, I have split them in:" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: Hashar) [17:16:45] (CR) FNegri: [C: +1] bullseye0: Improve the install-packages script [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [17:17:45] (CR) Hashar: "The coverage report yields:" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: Hashar) [17:20:28] (CR) FNegri: [C: +1] Remove buster0 buildpacks images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro) [17:29:02] (CR) Volans: [C: +1] "post-merge +1" [software/spicerack] - https://gerrit.wikimedia.org/r/818114 (owner: Jbond) [17:29:13] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:59] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:35:51] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [17:37:08] (PS1) Volans: Simplify cumin query in comment for confd [dns] - https://gerrit.wikimedia.org/r/829856 (https://phabricator.wikimedia.org/T314489) [17:39:31] (PS5) Volans: cli: Add ability to override the amount of retries and backoffs [software/debmonitor] - https://gerrit.wikimedia.org/r/812556 (owner: Jbond) [17:39:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312863)', diff saved to https://phabricator.wikimedia.org/P33792 and previous config saved to /var/cache/conftool/dbconfig/20220905-173951-ladsgroup.json [17:39:55] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [17:41:52] (CR) Volans: [C: +1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/812556 (owner: Jbond) [17:51:31] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.092 second response time https://wikitech.wikimedia.org/wiki/Swift [17:53:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [17:53:53] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Swift [17:54:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [17:54:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T314041)', diff saved to
https://phabricator.wikimedia.org/P33793 and previous config saved to /var/cache/conftool/dbconfig/20220905-175423-ladsgroup.json [17:54:26] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:54:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P33794 and previous config saved to /var/cache/conftool/dbconfig/20220905-175457-ladsgroup.json [17:56:25] (PS1) Ladsgroup: Improvements on css [software/pampinus] - https://gerrit.wikimedia.org/r/829858 [17:59:44] (PS3) Ayounsi: Spicerack: add configuration file and API key for PeeringDB [puppet] - https://gerrit.wikimedia.org/r/819562 [18:04:39] (CR) Volans: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/819568 (owner: Ayounsi) [18:07:23] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P33795 and previous config saved to /var/cache/conftool/dbconfig/20220905-181003-ladsgroup.json [18:13:52] (PS4) AOkoth: vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673 [18:19:17] (PS5) AOkoth: vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) [18:23:14] (CR) CI reject: [V: -1] vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth) [18:24:06] (CR) AOkoth: vrts: install vrts script (5 comments) [puppet] -
https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth) [18:25:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312863)', diff saved to https://phabricator.wikimedia.org/P33796 and previous config saved to /var/cache/conftool/dbconfig/20220905-182510-ladsgroup.json [18:25:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [18:25:13] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [18:25:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [18:25:20] (PS6) AOkoth: vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) [18:30:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Maint needs to be redone', diff saved to https://phabricator.wikimedia.org/P33797 and previous config saved to /var/cache/conftool/dbconfig/20220905-183017-ladsgroup.json [18:31:05] (CR) AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37120/" [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth) [18:45:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Maint needs to be redone', diff saved to https://phabricator.wikimedia.org/P33798 and previous config saved to /var/cache/conftool/dbconfig/20220905-184522-ladsgroup.json [19:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Maint needs to be redone', diff saved to https://phabricator.wikimedia.org/P33799 and previous config saved to /var/cache/conftool/dbconfig/20220905-190027-ladsgroup.json [19:15:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Maint needs to be
redone', diff saved to https://phabricator.wikimedia.org/P33800 and previous config saved to /var/cache/conftool/dbconfig/20220905-191532-ladsgroup.json [19:19:39] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.088 second response time https://wikitech.wikimedia.org/wiki/Swift [19:21:59] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift [19:25:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:25:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:25:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:25:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:25:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33801 and previous config saved to /var/cache/conftool/dbconfig/20220905-192554-ladsgroup.json [19:25:57] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:50:29] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:15] indeed, nothing to do [20:18:35] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.081 second response time https://wikitech.wikimedia.org/wiki/Swift [20:20:55] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [20:24:53] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:32:39] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:38:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33802 and previous config saved to /var/cache/conftool/dbconfig/20220905-203824-ladsgroup.json [20:38:28] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [20:49:12] (PS8) Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 [20:53:15] SRE, Traffic, affects-Kiwix-and-openZIM: HTTP 500 against api.php?action=parse API on tr.wikipedia.org - https://phabricator.wikimedia.org/T317011 (Platonides) I suspect it's a timeout at Varnish level, and it then got cached somewhere. I was always getting 200, even when asking every dc: ` for dc i...
[20:53:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P33803 and previous config saved to /var/cache/conftool/dbconfig/20220905-205330-ladsgroup.json [20:55:39] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.288 second response time https://wikitech.wikimedia.org/wiki/Swift [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T2100). [21:00:29] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift [21:01:15] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.247 second response time https://wikitech.wikimedia.org/wiki/Swift [21:03:33] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Swift [21:08:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P33804 and previous config saved to /var/cache/conftool/dbconfig/20220905-210837-ladsgroup.json [21:23:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33805 and previous config saved to /var/cache/conftool/dbconfig/20220905-212343-ladsgroup.json [21:23:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [21:23:47] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:24:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [21:24:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T314041)', diff saved to https://phabricator.wikimedia.org/P33806 and previous config saved to /var/cache/conftool/dbconfig/20220905-212415-ladsgroup.json [21:25:18] (CR) Volans: [C: +2] ganeti-netbox-sync: fail on missing cluster group [software/netbox-extras] - https://gerrit.wikimedia.org/r/829830 (owner: Volans) [21:26:01] (Merged) jenkins-bot: ganeti-netbox-sync: fail on missing cluster group [software/netbox-extras] - https://gerrit.wikimedia.org/r/829830 (owner: Volans) [21:29:05] SRE, MediaWiki-General, Traffic-Icebox, MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (Legoktm) It seems like page history caches are not being invalidated properly, which I su... [21:30:06] (PS1) Volans: ganeti-netbox-sync: add missing space in exception [software/netbox-extras] - https://gerrit.wikimedia.org/r/829864 [21:30:22] (CR) Volans: [C: +2] "Just adding a space, self-merging" [software/netbox-extras] - https://gerrit.wikimedia.org/r/829864 (owner: Volans) [21:31:07] (Merged) jenkins-bot: ganeti-netbox-sync: add missing space in exception [software/netbox-extras] - https://gerrit.wikimedia.org/r/829864 (owner: Volans) [21:32:24] SRE, MediaWiki-Page-history, Traffic, Regression: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Legoktm) [21:39:27] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: an-presto1011, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011,
ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://w [21:39:27] wikimedia.org/wiki/Puppet%23check_puppet_run_changes [21:54:33] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:57:12] (CR) Volans: [C: +1] "LGTM, just couple of typos inline" [cookbooks] - https://gerrit.wikimedia.org/r/818111 (owner: Jbond) [22:03:27] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: an-presto1011, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://w [22:03:27] wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:14:09] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.246 second response time https://wikitech.wikimedia.org/wiki/Swift [22:14:51] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.168 second response time https://wikitech.wikimedia.org/wiki/Swift [22:14:53] (CR) Volans: "post-merge FYI comment" [cookbooks] - https://gerrit.wikimedia.org/r/823704 (https://phabricator.wikimedia.org/T315360) (owner: Ryan Kemper) [22:16:31] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:08] (Abandoned) Volans: admin: add sre-admins to the check for ops [puppet] - https://gerrit.wikimedia.org/r/818061 (owner: Volans) [22:22:09] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.030 second
response time https://wikitech.wikimedia.org/wiki/Swift [22:36:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T314041)', diff saved to https://phabricator.wikimedia.org/P33807 and previous config saved to /var/cache/conftool/dbconfig/20220905-223657-ladsgroup.json [22:37:00] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [22:52:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P33808 and previous config saved to /var/cache/conftool/dbconfig/20220905-225203-ladsgroup.json [22:55:55] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:07:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P33809 and previous config saved to /var/cache/conftool/dbconfig/20220905-230709-ladsgroup.json [23:22:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T314041)', diff saved to https://phabricator.wikimedia.org/P33810 and previous config saved to /var/cache/conftool/dbconfig/20220905-232216-ladsgroup.json [23:22:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [23:22:19] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [23:22:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [23:22:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33811 and previous config saved to /var/cache/conftool/dbconfig/20220905-232237-ladsgroup.json