[00:07:15] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-07-05 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:13:10] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-07-05 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:20:29] SRE, Performance-Team, Traffic, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[00:28:25] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-07-05 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:10:13] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-07-12 00:00:01 (3256 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:21] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:02:49] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-07-12 00:00:01 (3235 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:14:39] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:11] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:20:09] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-system-prune-dangling.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:03] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-07-12 00:00:02 (3235 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:22:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:32:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:51:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:56:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:59:13] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:03:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:08:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:09:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:14:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:18:33] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:21:27] (PS3) Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[04:29:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:34:18] (ProbeDown) resolved: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:35:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:40:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:44:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:28] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (Marostegui) Thank you Chris - just started db1137 again.
[04:52:39] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[05:00:42] (PS1) Marostegui: db2162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/813361 (https://phabricator.wikimedia.org/T311493)
[05:03:32] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) No, that's probably from the overload a few days ago that made it stall (all the queries are there hanging). Just fixed...
[05:04:38] (CR) Marostegui: [C: +2] db2162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/813361 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:06:49] (PS1) Marostegui: instances.yaml: Add db2162 to dbctl [puppet] - https://gerrit.wikimedia.org/r/813362 (https://phabricator.wikimedia.org/T311493)
[05:10:11] (CR) Marostegui: [C: +2] instances.yaml: Add db2162 to dbctl [puppet] - https://gerrit.wikimedia.org/r/813362 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:12:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2162 in s8 T311493', diff saved to https://phabricator.wikimedia.org/P31023 and previous config saved to /var/cache/conftool/dbconfig/20220713-051239-marostegui.json
[05:12:43] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[05:13:44] (PS1) Marostegui: Revert "db1137: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/813307
[05:16:35] (CR) Marostegui: [C: +2] Revert "db1137: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/813307 (owner: Marostegui)
[05:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31024 and previous config saved to /var/cache/conftool/dbconfig/20220713-051701-root.json
[05:32:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31025 and previous config saved to /var/cache/conftool/dbconfig/20220713-053205-root.json
[05:47:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31026 and previous config saved to /var/cache/conftool/dbconfig/20220713-054709-root.json
[06:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31027 and previous config saved to /var/cache/conftool/dbconfig/20220713-060213-root.json
[06:16:38] !log analytics/refinery deployment
[06:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:58] !log aqu@deploy1002 Started deploy [analytics/refinery@bd39e67]: Regular analytics weekly train [analytics/refinery@bd39e67]
[06:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31028 and previous config saved to /var/cache/conftool/dbconfig/20220713-061717-root.json
[06:23:44] (PS1) Ayounsi: interface_automation: refresh the interfaces after deleting cables [software/netbox-extras] - https://gerrit.wikimedia.org/r/813577
[06:32:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31029 and previous config saved to /var/cache/conftool/dbconfig/20220713-063221-root.json
[06:33:30] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/813257 (owner: Jbond)
[06:34:50] (CR) Filippo Giunchedi: [C: +1] "LGTM, though the single to double quote conversion makes it hard to clearly see what's going on" [puppet] - https://gerrit.wikimedia.org/r/813267 (owner: David Caro)
[06:39:11] (CR) Filippo Giunchedi: "LGTM overall, see inline" [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[06:44:00] !log aqu@deploy1002 Finished deploy [analytics/refinery@bd39e67]: Regular analytics weekly train [analytics/refinery@bd39e67] (duration: 27m 02s)
[06:45:11] !log analytics/refinery deploy aborted, no more space to deploy in /srv on an-launcher1002 eqiad
[06:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31030 and previous config saved to /var/cache/conftool/dbconfig/20220713-064725-root.json
[06:56:28] SRE-swift-storage, User-fgiunchedi: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (fgiunchedi) >>! In T311690#8068456, @gerritbot wrote: > Change 811933 **merged** by Filippo Giunchedi: > %%%[operations/puppet@production] thanos: trim raw samples retention to 54 weeks%%% > https://...
[07:00:05] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220713T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:02] (CR) Majavah: wmcs: Add novafullstack alerts (2 comments) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[07:02:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31031 and previous config saved to /var/cache/conftool/dbconfig/20220713-070229-root.json
[07:16:57] (PS1) Muehlenhoff: Extend access for jm [puppet] - https://gerrit.wikimedia.org/r/813582
[07:18:40] (CR) Majavah: wmcs: don't page for most checks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/813267 (owner: David Caro)
[07:20:47] (CR) Muehlenhoff: [C: +2] Extend access for jm [puppet] - https://gerrit.wikimedia.org/r/813582 (owner: Muehlenhoff)
[07:50:23] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[07:51:14] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[07:51:19] (PS3) JMeybohm: k8s: Retry checks for expected pods on drain [software/spicerack] - https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661)
[07:51:21] (PS7) JMeybohm: k8s: Add KubernetesNode.taints propertry [software/spicerack] - https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661)
[07:51:23] (PS5) JMeybohm: k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661)
[07:52:19] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[07:52:55] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[07:59:13] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:03:08] (CR) Muehlenhoff: [C: +1] "Looks good. I'm 99% sure the package had definitely gotten installed before, probably it got lost in the refactor towards the define, not " [puppet] - https://gerrit.wikimedia.org/r/813251 (owner: Jbond)
[08:05:02] !log 'systemctl restart rsyslog' on kubernetes2007.codfw.wmnet,kubernetes2010.codfw.wmnet,kubernetes2014.codfw.wmnet,kubernetes2020.codfw.wmnet,kubernetes2009.codfw.wmnet
[08:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:58] (KubernetesRsyslogDown) resolved: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:13:39] SRE, SRE-Access-Requests: Requesting access to _security IRC channel for TheresNoTime - https://phabricator.wikimedia.org/T312771 (Joe) Open→Resolved a:Joe
[08:15:14] SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Joe) a:Joe
[08:20:19] SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Joe) Hi @Aline_Bruenger_WMDE can you please provide the email with which you've registered your developer account on wikitech? Thanks!
[08:22:09] SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Joe) p:Triage→Medium
[08:22:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31032 and previous config saved to /var/cache/conftool/dbconfig/20220713-082236-ladsgroup.json
[08:30:02] (PS1) Giuseppe Lavagetto: admin: add ddw to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/813588 (https://phabricator.wikimedia.org/T312675)
[08:32:01] (CR) Giuseppe Lavagetto: [C: +2] admin: add ddw to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/813588 (https://phabricator.wikimedia.org/T312675) (owner: Giuseppe Lavagetto)
[08:37:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31033 and previous config saved to /var/cache/conftool/dbconfig/20220713-083740-ladsgroup.json
[08:38:32] (PS1) Ayounsi: Interface description: handle patch panels properly [software/homer/deploy] - https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710)
[08:38:37] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for ddw - https://phabricator.wikimedia.org/T312675 (Joe) Open→Resolved p:Triage→Medium a:dr0ptp4kt→Joe @Ddwaal-WMF you should now be able to access all those resources.
[08:38:41] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:40:10] (CR) Ayounsi: "Example diff:" [software/homer/deploy] - https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710) (owner: Ayounsi)
[08:45:23] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (Joe) Hi @Bethany can you confirm you've signed the L3 document and you've read it?
[08:47:10] (CR) David Caro: wmcs: don't page for most checks (3 comments) [puppet] - https://gerrit.wikimedia.org/r/813267 (owner: David Caro)
[08:47:21] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (Joe)
[08:52:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31034 and previous config saved to /var/cache/conftool/dbconfig/20220713-085244-ladsgroup.json
[08:56:01] (CR) David Caro: wmcs: Add novafullstack alerts (9 comments) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[09:01:23] (PS1) Majavah: P:toolforge::proxy: add blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/813590
[09:07:28] (CR) David Caro: [C: +2] "thanks! <3" [puppet] - https://gerrit.wikimedia.org/r/813590 (owner: Majavah)
[09:07:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31035 and previous config saved to /var/cache/conftool/dbconfig/20220713-090748-ladsgroup.json
[09:08:01] (CR) Jbond: [C: +1] "LGTM lets give it a go" [software/netbox-extras] - https://gerrit.wikimedia.org/r/813577 (owner: Ayounsi)
[09:10:05] (PS2) Jbond: spdx: fix convert role/profile jobs [puppet] - https://gerrit.wikimedia.org/r/813256
[09:10:16] (PS2) Jbond: idp: add spdx headers to idp role and profile [puppet] - https://gerrit.wikimedia.org/r/813257
[09:15:43] (CR) Jbond: [C: +2] P:aptrepo: install python3-apt required by validate_cmd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813251 (owner: Jbond)
[09:17:30] (CR) Jbond: [C: +2] spdx: fix convert role/profile jobs [puppet] - https://gerrit.wikimedia.org/r/813256 (owner: Jbond)
[09:17:34] (CR) Jbond: [C: +2] idp: add spdx headers to idp role and profile [puppet] - https://gerrit.wikimedia.org/r/813257 (owner: Jbond)
[09:21:11] (PS23) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[09:21:47] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: Jbond)
[09:46:08] (CR) Filippo Giunchedi: wmcs: Add novafullstack alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[09:46:49] (CR) Filippo Giunchedi: [C: +2] Fix query string quoting in dashboard/runbook URLs [alerts] - https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[09:46:56] (CR) Filippo Giunchedi: [C: +2] Test for unquoted query strings in runbook/dashboard [alerts] - https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[09:49:44] (PS1) Lucas Werkmeister (WMDE): Configure $wgBabelCategoryNames on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/813594 (https://phabricator.wikimedia.org/T312920)
[09:51:58] (PS1) Ayounsi: Netbox _get_circuits: add patch panel support [software/homer] - https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710)
[09:51:59] (PS1) Zabe: acme_chief: Add SPDX headers to acme_chief profile [puppet] - https://gerrit.wikimedia.org/r/813595 (https://phabricator.wikimedia.org/T308013)
[09:52:01] (PS1) Zabe: admin: Add SPDX headers to admin profile [puppet] - https://gerrit.wikimedia.org/r/813596 (https://phabricator.wikimedia.org/T308013)
[09:52:04] (PS1) Zabe: airflow: Add SPDX headers to airflow profile [puppet] - https://gerrit.wikimedia.org/r/813597 (https://phabricator.wikimedia.org/T308013)
[09:52:05] (PS1) Zabe: alertmanager: Add SPDX headers to alertmanager profile [puppet] - https://gerrit.wikimedia.org/r/813598 (https://phabricator.wikimedia.org/T308013)
[09:52:07] (PS1) Zabe: apt: Add SPDX headers to apt profile [puppet] - https://gerrit.wikimedia.org/r/813599 (https://phabricator.wikimedia.org/T308013)
[09:52:09] (PS1) Zabe: aqs: Add SPDX headers to aqs profile [puppet] - https://gerrit.wikimedia.org/r/813600 (https://phabricator.wikimedia.org/T308013)
[09:52:11] (PS1) Zabe: archiva: Add SPDX headers to archiva profile [puppet] - https://gerrit.wikimedia.org/r/813601 (https://phabricator.wikimedia.org/T308013)
[09:52:13] (PS1) Zabe: base: Add SPDX headers to base profile [puppet] - https://gerrit.wikimedia.org/r/813602 (https://phabricator.wikimedia.org/T308013)
[09:52:15] (PS1) Zabe: bird: Add SPDX headers to bird profile [puppet] - https://gerrit.wikimedia.org/r/813603 (https://phabricator.wikimedia.org/T308013)
[09:52:30] (CR) CI reject: [V: -1] Test for unquoted query strings in runbook/dashboard [alerts] - https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[09:52:55] (PS2) Filippo Giunchedi: Fix query string quoting in dashboard/runbook URLs [alerts] - https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817)
[09:53:59] (CR) Filippo Giunchedi: [V: +2] Fix query string quoting in dashboard/runbook URLs [alerts] - https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[09:54:09] (PS3) Filippo Giunchedi: Test for unquoted query strings in runbook/dashboard [alerts] - https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817)
[09:55:49] (PS3) JMeybohm: Remove statsd from _scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/812333
[09:57:04] (CR) JMeybohm: [C: -1] New service: function-evaluator (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: Ori)
[10:02:57] (CR) Filippo Giunchedi: [V: +2] Test for unquoted query strings in runbook/dashboard [alerts] - https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[10:05:01] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:06:57] (CR) Lucas Werkmeister (WMDE): Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) (owner: Lucas Werkmeister (WMDE))
[10:12:12] (CR) CI reject: [V: -1] Netbox _get_circuits: add patch panel support [software/homer] - https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: Ayounsi)
[10:23:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2012.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[10:23:21] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[10:23:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2012.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[10:25:18] !log draining ganeti1028 T311686
[10:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:48] oh no, wikibugs left
[10:32:44] (CR) Michael Große: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/813609 (https://phabricator.wikimedia.org/T306016) (owner: Michael Große)
[10:38:02] !log aqu@deploy1002 Started deploy [analytics/refinery@bd39e67]: Regular analytics weekly train (2nd try. --force) [analytics/refinery@bd39e67]
[10:42:54] !log aqu@deploy1002 Finished deploy [analytics/refinery@bd39e67]: Regular analytics weekly train (2nd try. --force) [analytics/refinery@bd39e67] (duration: 04m 52s)
[11:13:37] (PS1) Hnowlan: Basic blubber file for thumbor [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104)
[11:31:47] (CR) Jbond: [C: +2] apt: Add SPDX headers to apt profile [puppet] - https://gerrit.wikimedia.org/r/813599 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[11:32:51] (CR) Ayounsi: "Example diff before:" [software/homer] - https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: Ayounsi)
[11:43:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: codfw s8 sanitarium master switch
[11:43:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: codfw s8 sanitarium master switch
[11:49:03] (PS1) Marostegui: mariadb: db2082 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/813619 (https://phabricator.wikimedia.org/T311493)
[11:50:23] (CR) Marostegui: [C: +2] mariadb: db2082 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/813619 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[11:58:33] (CR) Ayounsi: [C: +2] interface_automation: refresh the interfaces after deleting cables [software/netbox-extras] - https://gerrit.wikimedia.org/r/813577 (owner: Ayounsi)
[11:59:57] (Merged) jenkins-bot: interface_automation: refresh the interfaces after deleting cables [software/netbox-extras] - https://gerrit.wikimedia.org/r/813577 (owner: Ayounsi)
[12:06:24] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:06:33] (CR) David Caro: wmcs: Add novafullstack alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[12:08:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2018.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[12:08:48] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[12:08:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2018.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[12:12:11] !log draining ganeti2028 T311686
[12:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:49] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (Ottomata) Approved once we have the details. Please see https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request? for help...
[12:59:45] (CR) Nskaggs: [C: +1] "+1, this looks like a rather straightforward shuffle, with no logic changes. Thanks for moving code out of __init__.py into grid.py. I too" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812901 (owner: David Caro)
[13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220713T1300).
[13:00:04] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] o/
[13:00:16] still in a meeting, I can deploy in 5 minutes
[13:01:22] (CR) Nskaggs: [C: +1] wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810914 (owner: David Caro)
[13:03:15] alright, let’s go
[13:04:11] (CR) Lucas Werkmeister (WMDE): [C: +2] Configure $wgBabelCategoryNames on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/813594 (https://phabricator.wikimedia.org/T312920) (owner: Lucas Werkmeister (WMDE))
[13:04:53] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic2049.codfw.wmnet
[13:04:56] (Merged) jenkins-bot: Configure $wgBabelCategoryNames on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/813594 (https://phabricator.wikimedia.org/T312920) (owner: Lucas Werkmeister (WMDE))
[13:05:15] testing on mwdebug1001
[13:05:23] !log bking@elastic2049 rebooting for read-only fs
[13:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:37] yup, does what it’s supposed to
[13:06:47] (CR) David Caro: wmcs: use run_* instead of run_sync/run_async (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812901 (owner: David Caro)
[13:08:46] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813594|Configure $wgBabelCategoryNames on Test Wikidata (T312920)]] (duration: 02m 51s)
[13:08:51] T312920: Configure $wgBabelCategoryNames on Test Wikidata - https://phabricator.wikimedia.org/T312920
[13:09:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:10:09] (PS3) Lucas Werkmeister (WMDE): Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441)
[13:10:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:10:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:10:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:11:35] (CR) Filippo Giunchedi: wmcs: Add novafullstack alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[13:12:10] (CR) Lucas Werkmeister (WMDE): [C: +2] Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) (owner: Lucas Werkmeister (WMDE))
[13:12:58] (PS24) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[13:13:54] (Merged) jenkins-bot: Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) (owner: Lucas Werkmeister (WMDE))
[13:14:15] this one can’t be tested, I’ll sync it directly
[13:16:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:17:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:17:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:17:20] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790399|Configure wgLexemeLexicalCategoryItemIds on Wikidata (T307441)]] (duration: 02m 45s)
[13:17:24] T307441: configure correct Q-IDs for lexical categories in production for deployment - https://phabricator.wikimedia.org/T307441
[13:17:44] I think that’s it, unless anyone else has something to deploy?
[13:18:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:20:07] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host elastic2049.codfw.wmnet [13:20:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36265/console" [puppet] - 10https://gerrit.wikimedia.org/r/677558 (https://phabricator.wikimedia.org/T267395) (owner: 10Jbond) [13:20:35] !log UTC afternoon backport window done [13:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:41] (03CR) 10CI reject: [V: 04-1] initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 (https://phabricator.wikimedia.org/T267395) (owner: 10Jbond) [13:25:56] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:26:46] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2049 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [13:26:58] PROBLEM - SSH on elastic2049 is CRITICAL: connect to address 10.192.32.179 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:27:26] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2049 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [13:30:38] (03PS6) 10Jbond: initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 (https://phabricator.wikimedia.org/T267395) [13:33:38] ACKNOWLEDGEMENT - SSH on elastic2049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King alert should have been suppressed, my apologies https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:33:38] ACKNOWLEDGEMENT - Host elastic2049 is DOWN: PING CRITICAL - 
Packet loss = 100% Brian_King alert should have been suppressed, my apologies [13:35:30] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:37:35] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2049.codfw.wmnet with OS bullseye [13:37:44] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye [13:37:57] (03CR) 10Jbond: [C: 03+2] initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 (https://phabricator.wikimedia.org/T267395) (owner: 10Jbond) [13:39:50] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:42:54] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:43:20] looking [13:43:27] here too [13:43:35] here [13:43:38] here [13:43:40] here if needed [13:44:04] * jbond here [13:44:12] around [13:44:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from x1 master', diff saved to https://phabricator.wikimedia.org/P31037 and previous config saved to /var/cache/conftool/dbconfig/20220713-134413-marostegui.json [13:44:25] 
https://librenms.wikimedia.org/graphs/to=1657719600/id=17846/type=port_bits/from=1657633200/ brief spike [13:45:06] yep.. upload is getting some love [13:45:20] https://grafana.wikimedia.org/d/uz11QGcnk/haproxy-tls-cluster-view?orgId=1&var-cluster=upload&var-site=eqsin&viewPanel=6&from=now-3h&to=now [13:47:06] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2049.codfw.wmnet with OS bullseye [13:47:14] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye executed... [13:47:54] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [14:00:39] !log aqu@deploy1002 Started deploy [analytics/refinery@bd39e67] (thin): Regular analytics weekly train THIN [analytics/refinery@bd39e67] [14:00:47] !log aqu@deploy1002 Finished deploy [analytics/refinery@bd39e67] (thin): Regular analytics weekly train THIN [analytics/refinery@bd39e67] (duration: 00m 07s) [14:01:05] !log aqu@deploy1002 Started deploy [analytics/refinery@bd39e67] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bd39e67] [14:04:30] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2049.codfw.wmnet with OS bullseye [14:04:38] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to 
Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye [14:08:47] !log aqu@deploy1002 Finished deploy [analytics/refinery@bd39e67] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bd39e67] (duration: 07m 42s) [14:11:23] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2049.codfw.wmnet with OS bullseye [14:11:30] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye executed... [14:15:44] (03PS1) 10Elukey: ml-services: test event production for editquality in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/813633 (https://phabricator.wikimedia.org/T301878) [14:18:59] !log Deployed refinery using scap, then deployed onto hdfs [14:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:28] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@03c1a05]: Deploy [airflow-dags/analytics_test@03c1a05] [14:34:41] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@03c1a05]: Deploy [airflow-dags/analytics_test@03c1a05] (duration: 00m 12s) [14:36:28] (03PS8) 10Ottomata: [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [14:37:28] (03PS7) 10Ori: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) [14:37:30] (03CR) 10CI reject: [V: 04-1] [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 
10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [14:37:34] (03CR) 10Ori: New service: function-evaluator (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [14:38:27] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2049.codfw.wmnet with OS bullseye [14:38:36] (03CR) 10CI reject: [V: 04-1] New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [14:38:36] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye [14:41:12] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:41:53] (03PS8) 10Ori: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) [14:42:27] (03CR) 10Ori: New service: function-evaluator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [14:51:21] 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10bking) This is still an ongoing issue. I tried reimaging to Bullseye, but the installer cannot detect any hard drives. DC Ops, are you able to take a look? 
[14:52:39] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2049.codfw.wmnet with OS bullseye [14:52:47] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye executed... [14:53:27] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10Cmjohnson) 05Open→03Resolved Disk replaced [14:54:36] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10Cmjohnson) @btullis I am a little behind but can we do this now? [15:01:04] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Cmjohnson) @Eevans Are you ready for these moves yet? [15:03:52] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Cmjohnson) @marostegui row C is very tight and I can only move 2 of the several servers that need to go. I would like to move these 2 of yours first. Can we schedule this for tomorrow 14 Ju... [15:07:29] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Marostegui) @Cmjohnson that works for me. I will get those two hosts ready for you [15:09:45] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10Marostegui) Thanks! 
[15:10:32] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9edd1ab]: Deploy [airflow-dags/analytics_test@9edd1ab] [15:10:40] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9edd1ab]: Deploy [airflow-dags/analytics_test@9edd1ab] (duration: 00m 08s) [15:11:02] (03CR) 10Nskaggs: [C: 03+1] wmcs: use run_* instead of run_sync/run_async (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 (owner: 10David Caro) [15:11:40] (03PS3) 10Ori: [Beta] PoolCounter configuration for Wikilambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813294 (https://phabricator.wikimedia.org/T310644) [15:11:50] (03CR) 10Ori: [C: 03+2] [Beta] PoolCounter configuration for Wikilambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813294 (https://phabricator.wikimedia.org/T310644) (owner: 10Ori) [15:12:07] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@9edd1ab]: Deploy [airflow-dags/analytics@9edd1ab] [15:12:18] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@9edd1ab]: Deploy [airflow-dags/analytics@9edd1ab] (duration: 00m 10s) [15:13:02] (03Merged) 10jenkins-bot: [Beta] PoolCounter configuration for Wikilambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813294 (https://phabricator.wikimedia.org/T310644) (owner: 10Ori) [15:18:27] (03PS9) 10Ottomata: [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [15:19:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:20:17] (03CR) 10CI reject: [V: 04-1] [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [15:20:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:20:25] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [15:21:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:22:57] 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10bking) a:03Papaul [15:24:55] (03PS25) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) [15:25:25] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Cmjohnson) [15:32:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Bethany) Seeking access to "All of the above" - LDAP membership in the wmf or nda LDAP group. - Shell (posix) membership in the `analytics-privatedat... [15:49:00] (03PS26) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) [15:49:21] (03PS27) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) [15:54:58] (03CR) 10Elukey: [C: 03+2] ml-services: test event production for editquality in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/813633 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [15:55:47] (03PS10) 10Mark Bergsma: sre: Port swift alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) [15:56:03] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2040.codfw.wmnet with OS bullseye [15:56:11] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2040.codfw.wmnet with OS bullseye [15:56:59] (03PS1) 10Majavah: prometheus: blackbox: add support for matching status codes [puppet] - 10https://gerrit.wikimedia.org/r/813649 [15:57:01] (03PS1) 10Majavah: P:toolforge: update web_domain default value [puppet] - 10https://gerrit.wikimedia.org/r/813650 [15:57:03] (03PS1) 10Majavah: P:toolforge:k8s:haproxy: add k8s-ingress monitoring [puppet] - 10https://gerrit.wikimedia.org/r/813651 [15:58:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:58:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:58:58] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:59:16] (03CR) 10CI reject: [V: 04-1] P:toolforge:k8s:haproxy: add k8s-ingress monitoring [puppet] - 10https://gerrit.wikimedia.org/r/813651 (owner: 10Majavah) [15:59:24] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2040.codfw.wmnet with OS bullseye [15:59:33] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2040.codfw.wmnet with OS bullseye executed... 
[15:59:47] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:03] (03PS2) 10Majavah: P:toolforge:k8s:haproxy: add k8s-ingress monitoring [puppet] - 10https://gerrit.wikimedia.org/r/813651 [16:13:06] (03PS1) 10Majavah: prometheus: blackbox_exporter: remove un-managed module files [puppet] - 10https://gerrit.wikimedia.org/r/813654 [16:17:24] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@e58e61d]: (no justification provided) [16:17:31] 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) the install error message said ` No root file system │ │ No root file system is defined. │ │... [16:17:34] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@e58e61d]: (no justification provided) (duration: 00m 10s) [16:20:56] (03CR) 10David Caro: prometheus: blackbox: add support for matching status codes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813649 (owner: 10Majavah) [16:21:07] (03CR) 10David Caro: [C: 03+2] P:toolforge: update web_domain default value [puppet] - 10https://gerrit.wikimedia.org/r/813650 (owner: 10Majavah) [16:21:58] (03CR) 10Majavah: prometheus: blackbox: add support for matching status codes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813649 (owner: 10Majavah) [16:22:20] (03CR) 10David Caro: [C: 03+2] P:toolforge:k8s:haproxy: add k8s-ingress monitoring [puppet] - 10https://gerrit.wikimedia.org/r/813651 (owner: 10Majavah) [16:24:01] (03CR) 10David Caro: prometheus: blackbox: add support for matching status codes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813649 (owner: 10Majavah) [16:24:03] (03CR) 10David Caro: [C: 03+2] prometheus: blackbox: add support for matching status codes [puppet] - 10https://gerrit.wikimedia.org/r/813649 (owner: 10Majavah) [16:24:39] 10SRE, 
10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Ottomata) Approved. [16:26:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Reedy) [16:28:59] (03PS1) 10Thcipriani: CampaignEvents: backport extension for Jul 18 beta deploy [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) [16:32:06] (03CR) 10BryanDavis: [C: 03+1] "This probably has the effect of keeping this script from ever returning non-zero, but that's not large step from the prior behavior. A fol" [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro) [16:38:47] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:41:53] (03CR) 10Zabe: CampaignEvents: backport extension for Jul 18 beta deploy (031 comment) [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) (owner: 10Thcipriani) [16:42:28] (03PS2) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed [puppet] - 10https://gerrit.wikimedia.org/r/812819 [16:43:10] (03CR) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro) [16:44:05] (03PS3) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed [puppet] - 10https://gerrit.wikimedia.org/r/812819 [16:44:08] (03CR) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro) [16:48:53] (03CR) 10Thcipriani: CampaignEvents: backport extension for Jul 18 beta deploy (031 comment) [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 
(https://phabricator.wikimedia.org/T311752) (owner: 10Thcipriani) [16:56:30] (03CR) 10BryanDavis: [C: 03+1] "untested, but the code reads much better and seems to have more robust error handling which was the goal :)" [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro) [16:56:38] Posted in #wikimedia-tech but probably also relevant here. The code coverage report for patches seems to be taking an extraordinary amount of time to complete. [16:57:02] Some patches are at 42 mins completed with 0 mins left to go according to https://integration.wikimedia.org/zuul/ [16:57:31] Answered in the tech channel [16:58:52] Dreamy_Jazz: #wikimedia-releng would be where anyone likely to fix it typically hangs out. Looking at https://integration.wikimedia.org/ci/ I think there is some general zuul<->jenkins sadness happening as there are lots of free executors. [17:03:42] (03CR) 10David Caro: [C: 03+2] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro) [17:04:07] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:23] (03PS2) 10David Caro: wmcs: Add novafullstack alerts [alerts] - 10https://gerrit.wikimedia.org/r/813274 [17:12:25] (03CR) 10David Caro: wmcs: Add novafullstack alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/813274 (owner: 10David Caro) [17:25:44] (03CR) 10David Caro: wmcs: use run_* instead of run_sync/run_async (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 (owner: 10David Caro) [17:35:38] (03CR) 10Nskaggs: [C: 03+1] "I asked some "teach me more about how WMF uses puppet" style questions, nothing blocking." 
[puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro) [17:37:27] (03PS7) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 [17:37:29] (03PS7) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 [17:37:31] (03PS3) 10David Caro: ceph: add alert handling to ceph custer downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812900 [17:37:33] (03PS3) 10David Caro: wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 [17:37:35] (03PS3) 10David Caro: toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812902 [17:37:37] (03PS1) 10David Caro: wmcs: move openstack/__init__.py to openstack/common.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/813661 [17:37:39] (03PS1) 10David Caro: wmcs: move wmcs/__init__.py to wmcs/libs/common.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/813662 [17:48:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [17:48:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmne... 
[17:48:20] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [17:48:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet wi... [17:48:45] (03PS3) 10DDesouza: QuickSurveys: Disable 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) [17:52:17] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:51] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:56] (03CR) 10David Caro: novafullstack: remove leaked VMs test, moved to alertmanager (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro) [17:54:30] !log upload dnsdist_1.7.2-1+wmf11u1 to apt.wm.org (bullseye) - T305589 [17:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:34] T305589: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 [17:56:37] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) >>! In T307035#8075107, @Cmjohnson wrote: > @Eevans Are you ready for these moves yet? We're no closer to having those new servers up than we were at my last updat... 
[17:58:28] (03PS4) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) [17:58:37] (03CR) 10Nskaggs: [C: 03+1] novafullstack: remove leaked VMs test, moved to alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro) [17:59:27] (03PS5) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) [18:00:05] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220713T1800) [18:04:29] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:47] (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [18:15:27] (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [18:16:20] (03CR) 10DDesouza: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [18:20:32] !log upload pdns-recursor_4.6.2-1+wmf11u1 to apt.wm.org (bullseye) - T305589 [18:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:36] T305589: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 [18:21:08] (03PS1) 10Bartosz Dziewoński: Avoid localized digits in internal timestamps in JS [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813666 (https://phabricator.wikimedia.org/T312828) [18:25:01] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - 
https://phabricator.wikimedia.org/T312827 (10Aklapper) >>! In T312827#8075269, @Bethany wrote: > - LDAP membership in the wmf or nda LDAP group. That seems to be already the case per https://ldap.t... [18:32:05] 10SRE, 10Data-Engineering, 10Discovery, 10Event-Platform, 10Platform Team Workboards (Clinic Duty Team): Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Ottomata) Happened again today. There was a mediawiki.recentchange event with a 2015 timestamp. [18:33:54] hashar: i added comments to your jsonschemaconverter patch, not sure if you saw them [18:41:41] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:45:50] (03PS2) 10David Caro: wmcs: don't page for most checks [puppet] - 10https://gerrit.wikimedia.org/r/813267 [18:45:52] (03PS3) 10David Caro: novafullstack: remove leaked VMs test, moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/813275 [18:45:54] (03CR) 10David Caro: novafullstack: remove leaked VMs test, moved to alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro) [18:54:51] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-system-prune-dangling.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:17] (03PS1) 10Bartosz Dziewoński: Disable DiscussionTools beta feature at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813691 (https://phabricator.wikimedia.org/T310960) [19:08:03] (03PS1) 10Cmjohnson: Adding cloudweb1003/4 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/813697 (https://phabricator.wikimedia.org/T305414) [19:09:28] (03CR) 10CI reject: [V: 04-1] Adding cloudweb1003/4 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/813697 
(https://phabricator.wikimedia.org/T305414) (owner: 10Cmjohnson) [19:33:15] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10lmata) [19:35:01] 10SRE, 10ops-codfw, 10Patch-For-Review: Document codfw breakout patch panels in Netbox - https://phabricator.wikimedia.org/T304710 (10Papaul) 05Open→03Resolved a:03Papaul This is complete https://netbox.wikimedia.org/dcim/devices/3582/rear-ports/ https://netbox.wikimedia.org/dcim/devices/3581/front-ports/ [19:41:35] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata) [19:44:31] 10SRE, 10SRE Observability (FY2022/2023-Q1): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10lmata) [19:54:35] 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10bking) @Papaul the DRAC does not detect any hard drives. I checked under "Storage" in the Web UI, and it says " RAC0501: There are no physical disks to be displayed. 1. Che... [19:59:23] MatmaRex: I think you accidentally removed my change on the deployment page. :) I restored it. [19:59:34] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2040.codfw.wmnet with OS bullseye [19:59:43] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2040.codfw.wmnet with OS bullseye [19:59:44] If it's due to another reason, it can be deployed later. [20:00:04] RoanKattouw, Urbanecm, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220713T2000). [20:00:05] zabe, danisztls, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:22] danisztls: oh oops! sorry. i wonder how i managed to do that [20:00:33] no problem [20:00:53] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:53] hey o/ [20:01:00] just asked because maybe it had a reason [20:08:31] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:09:24] anyone deploying? [20:12:31] I asked Lucas as they're around if they can [20:13:16] he's coming [20:14:18] o/ [20:14:54] still lots of annoying UploadFromChunks errors in logspam-watch, I see [20:14:56] meh [20:14:57] MatmaRex, zabe, danisztls: ^ [20:15:07] o/ [20:15:37] (03PS2) 10Lucas Werkmeister (WMDE): Undeploy CongressLookup (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813338 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:16:45] thanks [20:17:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Undeploy CongressLookup (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813338 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:18:09] * Lucas_WMDE looks up what this extension used to do [20:18:27] wow [20:18:31] (03Merged) 10jenkins-bot: Undeploy CongressLookup (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813338 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:18:56] I feel like that wouldn’t be built as a MediaWiki extension today [20:19:03] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
elastic2040.codfw.wmnet with reason: host reimage [20:19:42] zabe: first CongressLookup undeploy is on mwdebug1001 [20:19:48] I guess it can’t really be tested [20:19:56] * Lucas_WMDE quickly checks the wiki isn’t totally broken [20:19:56] nah, I don't see how [20:20:16] I will keep an eye on logstash after the sync [20:20:52] whoop, almost synced the wrong file though [20:20:59] CommonSettings, not InitialiseSettings ^^ [20:21:28] (03PS2) 10Lucas Werkmeister (WMDE): Undeploy CongressLookup (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813339 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:22:37] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2040.codfw.wmnet with reason: host reimage [20:23:50] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:813338|Undeploy CongressLookup (part 1) (T312894)]] (duration: 03m 04s) [20:23:53] T312894: Undeploy CongressLookup - https://phabricator.wikimedia.org/T312894 [20:23:58] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "`grep -rF wmgUseCongressLookup` confirms nothing else uses this global 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813339 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:24:43] (03Merged) 10jenkins-bot: Undeploy CongressLookup (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813339 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:24:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:25:00] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:25:07] second change is on mwdebug1001 [20:25:39] doesn’t look like anything is broken there either, let’s sync [20:25:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:25:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:26:00] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:26:34] API spike seems to have gone down again already, though still above previous levels [20:26:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:26:58] (03PS2) 10Lucas Werkmeister (WMDE): Undeploy CongressLookup (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813340 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:28:43] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813339|Undeploy CongressLookup (part 2) (T312894)]] (duration: 02m 53s) [20:29:40] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:30:06] (03CR) 10Lucas Werkmeister 
(WMDE): [C: 03+2] Undeploy CongressLookup (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813340 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:30:40] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:31:42] (03Merged) 10jenkins-bot: Undeploy CongressLookup (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813340 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe) [20:31:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:44] pulled the third change to mwdebug1001 [20:32:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:33:18] nothing seems to break [20:33:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:33:47] syncing [20:34:14] (I looked at the last undeployed extension just in case the extension-list removal should wait for a train or anything like that, but it looks like CodeReview was also undeployed in one day, so should be fine) [20:35:56] let’s start the gate-and-submit for MatmaRex’ backport already [20:36:12] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Avoid localized digits in internal timestamps in JS [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813666 (https://phabricator.wikimedia.org/T312828) (owner: 10Bartosz Dziewoński) [20:36:44] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:813340|Undeploy CongressLookup (part 3) (T312894)]] 
(duration: 03m 00s) [20:36:49] T312894: Undeploy CongressLookup - https://phabricator.wikimedia.org/T312894 [20:38:15] extension-list is used to build the LocalisationCache and since the extension is no longer used it should be fine to no longer build the messages for it [20:38:29] yeah, and I expect that’ll take effect with the next train [20:38:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:39:47] danisztls: I’m trying to figure out if QuickSurvey removals usually also disable the extension again [20:39:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:39:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:40:07] Lucas_WMDE: that's not documented AFAIK [20:40:50] I’d feel more comfortable with a config change to just remove the surveys but keep the extension, tbh [20:40:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:41:03] ok, will patch it [20:41:11] like in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/656116 [20:41:28] (i.e., for the actual survey entry, remove jawiki completely instead of setting it to [] ?) [20:43:03] (03Merged) 10jenkins-bot: Avoid localized digits in internal timestamps in JS [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813666 (https://phabricator.wikimedia.org/T312828) (owner: 10Bartosz Dziewoński) [20:43:16] ok, in the meantime, let’s merge that backport, which went quicker than expected [20:43:18] ok [20:43:23] (than I expected, at least ^^) [20:44:16] MatmaRex: the localised digits backport should be on mwdebug1001, can you test it? 
[20:44:31] (03PS6) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) [20:44:33] yeah [20:44:40] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2040.codfw.wmnet with OS bullseye [20:44:51] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2040.codfw.wmnet with OS bullseye completed... [20:45:05] Lucas_WMDE: seems good [20:45:18] (03CR) 10Lucas Werkmeister (WMDE): QuickSurveys: Undeploy 'research-incentive' (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:45:22] ok thanks [20:46:07] syncing [20:48:26] danisztls: I left a comment on the commit message, otherwise it looks good [20:48:29] Lucas_WMDE: related discussion https://phabricator.wikimedia.org/T213459, in the past QS extension caused a performance hit even when wiki had no surveys enabled but that looks to be fixed [20:48:44] I see [20:48:54] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/DiscussionTools/modules/CommentItem.js: Backport: [[gerrit:813666|Avoid localized digits in internal timestamps in JS (T312828)]] (duration: 02m 49s) [20:48:58] T312828: "Could not find the comment you're replying to on the page" (caused by bugs with changed timestamp format) - https://phabricator.wikimedia.org/T312828 [20:49:00] (03PS7) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) [20:49:07] I wouldn’t mind a follow-up change to disable the extension again [20:49:15] just don’t feel comfortable deploying that now, with not 
much time left in the window [20:49:15] (03CR) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:49:30] Lucas_WMDE: ok [20:49:37] fixed the commit msg [20:49:41] since I didn’t quickly spot any changes like that in the git log [20:49:50] thanks! [20:50:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "The extension could also be undeployed from jawiki later, but we dropped that from this change because I didn’t feel as confident about it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:51:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:52:36] (03PS8) 10Lucas Werkmeister (WMDE): QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:52:40] (03CR) 10Lucas Werkmeister (WMDE): "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:53:07] I thought Zuul/Gerrit auto-rebased config changes? but this time it didn’t work [20:53:17] meh [20:53:27] Sometimes it randomly says no [20:53:35] (03Merged) 10jenkins-bot: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:53:39] jgit sucks [20:53:44] Not all them times are actual conflicts [20:54:05] What Reedy says and Reedy can I bribe you into a CR while you're here [20:55:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:55:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:55:11] danisztls: okay, config change is on mwdebug1001, can you test it? 
[20:55:17] Lucas_WMDE: sure [20:55:24] jouncebot: next [20:55:24] In 9 hour(s) and 4 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T0600) [20:55:39] alright, then we’ll just run over a bit to do MatmaRex’ last config change as well [20:55:49] if nothing else is happening after this window [20:56:00] Lucas_WMDE: lgtm, wiki still opens, survey dont show [20:56:06] alright, thanks [20:58:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:59:17] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:812377|QuickSurveys: Undeploy 'research-incentive' (T311015)]] (1/2, prod) (duration: 02m 48s) [20:59:21] T311015: Deploy QuickSurvey on Japanese Wikipedia - https://phabricator.wikimedia.org/T311015 [20:59:41] Lucas_WMDE: thanks! [21:00:34] (03PS2) 10Lucas Werkmeister (WMDE): Disable DiscussionTools beta feature at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813691 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [21:01:07] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:24] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:812377|QuickSurveys: Undeploy 'research-incentive' (T311015)]] (2/2, beta) (duration: 02m 58s) [21:02:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Disable DiscussionTools beta feature at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813691 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [21:03:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:04:18] (03Merged) 10jenkins-bot: Disable DiscussionTools beta feature at mediawikiwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/813691 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [21:04:31] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:00] MatmaRex: mediawikiwiki DiscussionTools beta disablement is on mwdebug1001, please test [21:05:44] Lucas_WMDE: yep, seems good [21:05:49] ok! [21:06:08] syncing [21:06:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:06:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:06:34] Lucas_WMDE: thanks for showing so late to help [21:06:44] np :) [21:08:30] (03PS1) 10Krinkle: ResourceLoader: Remove DependencyStore::renew [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813670 (https://phabricator.wikimedia.org/T113916) [21:08:53] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813691|Disable DiscussionTools beta feature at mediawikiwiki (T310960)]] (duration: 02m 47s) [21:08:58] T310960: [Config Change] Make all DiscussionTools available by default at mediawiki.org - https://phabricator.wikimedia.org/T310960 [21:09:05] !log UTC late backport+config window done [21:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:14] thanks Lucas_WMDE [21:10:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:15:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:17:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:17:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:18:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:44:13] 10SRE, 10ops-codfw, 
10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) @bking the IDRAC will only detect hard drivers under storage in the Web UI if the system has a HW raid controller. [21:44:23] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:31] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:56:54] 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) I can check on the physical disks when i am on site tomorrow. [22:09:03] !log bking@elastic2055 staging NIC firmware updates for elastic2055-2060 [22:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:18] !log bking@elastic2055 successfully staged NIC firmware updates for elastic2055-2060 [22:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:50] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) Discussed some of the sre.reimage cookbook failure scenarios with @RKemper and @EBernhardson today: - After a reimage failure, th... 
[22:32:09] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:45] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:40] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Papaul) >>! In T289135#8076562, @bking wrote: > Discussed some of the sre.reimage cookbook failure scenarios with @RKemper and @EBernhardson... [22:46:53] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:36] (03PS1) 10Cwhite: profile: make loki data directory configurable [puppet] - 10https://gerrit.wikimedia.org/r/813715 (https://phabricator.wikimedia.org/T222826) [23:11:27] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:21:14] 10ops-eqiad, 10DC-Ops: Please verify location of an-worker1111.eqiad.wmnet - https://phabricator.wikimedia.org/T298785 (10wiki_willy) Hi @BTullis - I'm just coming across this request now. It was missing the "ops-eqiad" project tag, so looks like it fell through the cracks. I'll add the appropriate tag and e... [23:22:12] 10ops-eqiad, 10DC-Ops: Please verify location of an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T298621 (10wiki_willy) Adding "ops-eqiad' project tag [23:27:54] 10ops-eqiad, 10DC-Ops: Failed disk on analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T293111 (10wiki_willy) Looks like this was missing the "ops-eqiad" project tag, so it fell through the cracks. 
@BTullis - since the hardware was installed to refresh this host in T293922, do you still need this... [23:30:17] 10ops-eqiad, 10DC-Ops: Relabel db1183 to be dbstore1007 - https://phabricator.wikimedia.org/T284126 (10wiki_willy) Looks like this one fell through the cracks, so adding the "ops-eqiad" project tag. @Cmjohnson or @Jclark-ctr - can one of you guys see if this one already has the correct label on it? Thanks, W... [23:34:50] 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10wiki_willy) a:03Papaul Looks like this one fell through the cracks without the "ops-codfw" project tag, so adding it back in. cc @Papaul [23:51:58] (03PS1) 10Cwhite: hiera: deploy and enable loki on grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/813724 (https://phabricator.wikimedia.org/T222826) [23:52:35] 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10RobH)