[00:07:15] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-07-05 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:13:10] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-07-05 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:20:29] <wikibugs>	 SRE, Performance-Team, Traffic, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[00:28:25] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-07-05 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:10:13] <icinga-wm>	 RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-07-12 00:00:01 (3256 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:21] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:45] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:02:49] <icinga-wm>	 RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-07-12 00:00:01 (3235 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:14:39] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:11] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:20:09] <icinga-wm>	 PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-system-prune-dangling.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:03] <icinga-wm>	 RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-07-12 00:00:02 (3235 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:22:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:32:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:51:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:56:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:59:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:03:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:08:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:09:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:14:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:18:33] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:21:27] <wikibugs>	 (PS3) Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[04:29:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:34:18] <jinxer-wm>	 (ProbeDown) resolved: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:35:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:40:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:44:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:28] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (Marostegui) Thank you Chris - just started db1137 again.
[04:52:39] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[05:00:42] <wikibugs>	 (PS1) Marostegui: db2162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/813361 (https://phabricator.wikimedia.org/T311493)
[05:03:32] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) No, that's probably from the overload a few days ago that made it stall (all the queries are there hanging). Just fixed...
[05:04:38] <wikibugs>	 (CR) Marostegui: [C: +2] db2162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/813361 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:06:49] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2162 to dbctl [puppet] - https://gerrit.wikimedia.org/r/813362 (https://phabricator.wikimedia.org/T311493)
[05:10:11] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2162 to dbctl [puppet] - https://gerrit.wikimedia.org/r/813362 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:12:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2162 in s8 T311493', diff saved to https://phabricator.wikimedia.org/P31023 and previous config saved to /var/cache/conftool/dbconfig/20220713-051239-marostegui.json
[05:12:43] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[05:13:44] <wikibugs>	 (PS1) Marostegui: Revert "db1137: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/813307
[05:16:35] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1137: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/813307 (owner: Marostegui)
[05:17:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31024 and previous config saved to /var/cache/conftool/dbconfig/20220713-051701-root.json
[05:32:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31025 and previous config saved to /var/cache/conftool/dbconfig/20220713-053205-root.json
[05:47:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31026 and previous config saved to /var/cache/conftool/dbconfig/20220713-054709-root.json
[06:02:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31027 and previous config saved to /var/cache/conftool/dbconfig/20220713-060213-root.json
[06:16:38] <aqu>	 !log analytics/refinery deployment
[06:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:58] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@bd39e67]: Regular analytics weekly train [analytics/refinery@bd39e67]
[06:17:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31028 and previous config saved to /var/cache/conftool/dbconfig/20220713-061717-root.json
[06:23:44] <wikibugs>	 (PS1) Ayounsi: interface_automation: refresh the interfaces after deleting cables [software/netbox-extras] - https://gerrit.wikimedia.org/r/813577
[06:32:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31029 and previous config saved to /var/cache/conftool/dbconfig/20220713-063221-root.json
[06:33:30] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/813257 (owner: Jbond)
[06:34:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, though the single to double quote conversion makes it hard to clearly see what's going on" [puppet] - https://gerrit.wikimedia.org/r/813267 (owner: David Caro)
[06:39:11] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see inline" [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[06:44:00] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@bd39e67]: Regular analytics weekly train [analytics/refinery@bd39e67] (duration: 27m 02s)
[06:45:11] <aqu>	 !log analytics/refinery deploy aborted, no more space to deploy in /srv on an-launcher1002 eqiad
[06:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31030 and previous config saved to /var/cache/conftool/dbconfig/20220713-064725-root.json
[06:56:28] <wikibugs>	 SRE-swift-storage, User-fgiunchedi: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (fgiunchedi) >>! In T311690#8068456, @gerritbot wrote: > Change 811933 **merged** by Filippo Giunchedi: > %%%[operations/puppet@production] thanos: trim raw samples retention to 54 weeks%%% > https://...
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220713T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:02] <wikibugs>	 (CR) Majavah: wmcs: Add novafullstack alerts (2 comments) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[07:02:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31031 and previous config saved to /var/cache/conftool/dbconfig/20220713-070229-root.json
[07:16:57] <wikibugs>	 (PS1) Muehlenhoff: Extend access for jm [puppet] - https://gerrit.wikimedia.org/r/813582
[07:18:40] <wikibugs>	 (CR) Majavah: wmcs: don't page for most checks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/813267 (owner: David Caro)
[07:20:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend access for jm [puppet] - https://gerrit.wikimedia.org/r/813582 (owner: Muehlenhoff)
[07:50:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[07:51:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[07:51:19] <wikibugs>	 (PS3) JMeybohm: k8s: Retry checks for expected pods on drain [software/spicerack] - https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661)
[07:51:21] <wikibugs>	 (PS7) JMeybohm: k8s: Add KubernetesNode.taints propertry [software/spicerack] - https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661)
[07:51:23] <wikibugs>	 (PS5) JMeybohm: k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661)
[07:52:19] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[07:52:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[07:59:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:03:08] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good. I'm 99% sure the package had definitely gotten installed before, probably it got lost in the refactor towards the define, not " [puppet] - https://gerrit.wikimedia.org/r/813251 (owner: Jbond)
[08:05:02] <jayme>	 !log 'systemctl restart rsyslog' on kubernetes2007.codfw.wmnet,kubernetes2010.codfw.wmnet,kubernetes2014.codfw.wmnet,kubernetes2020.codfw.wmnet,kubernetes2009.codfw.wmnet
[08:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:13:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to _security IRC channel for TheresNoTime - https://phabricator.wikimedia.org/T312771 (Joe) Open→Resolved a:Joe
[08:15:14] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Joe) a:Joe
[08:20:19] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Joe) Hi @Aline_Bruenger_WMDE can you please provide the email with which you've registered your developer account on wikitech?  Thanks!
[08:22:09] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Joe) p:Triage→Medium
[08:22:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31032 and previous config saved to /var/cache/conftool/dbconfig/20220713-082236-ladsgroup.json
[08:30:02] <wikibugs>	 (PS1) Giuseppe Lavagetto: admin: add ddw to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/813588 (https://phabricator.wikimedia.org/T312675)
[08:32:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] admin: add ddw to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/813588 (https://phabricator.wikimedia.org/T312675) (owner: Giuseppe Lavagetto)
[08:37:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31033 and previous config saved to /var/cache/conftool/dbconfig/20220713-083740-ladsgroup.json
[08:38:32] <wikibugs>	 (PS1) Ayounsi: Interface description: handle patch panels properly [software/homer/deploy] - https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710)
[08:38:37] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for ddw - https://phabricator.wikimedia.org/T312675 (Joe) Open→Resolved p:Triage→Medium a:dr0ptp4kt→Joe @Ddwaal-WMF you should now be able to access all those resources.
[08:38:41] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:40:10] <wikibugs>	 (CR) Ayounsi: "Example diff:" [software/homer/deploy] - https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710) (owner: Ayounsi)
[08:45:23] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users  for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (Joe) Hi @Bethany can you confirm you've signed the L3 document and you've read it?
[08:47:10] <wikibugs>	 (CR) David Caro: wmcs: don't page for most checks (3 comments) [puppet] - https://gerrit.wikimedia.org/r/813267 (owner: David Caro)
[08:47:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users  for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (Joe)
[08:52:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31034 and previous config saved to /var/cache/conftool/dbconfig/20220713-085244-ladsgroup.json
[08:56:01] <wikibugs>	 (CR) David Caro: wmcs: Add novafullstack alerts (9 comments) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[09:01:23] <wikibugs>	 (PS1) Majavah: P:toolforge::proxy: add blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/813590
[09:07:28] <wikibugs>	 (CR) David Caro: [C: +2] "thanks! <3" [puppet] - https://gerrit.wikimedia.org/r/813590 (owner: Majavah)
[09:07:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31035 and previous config saved to /var/cache/conftool/dbconfig/20220713-090748-ladsgroup.json
[09:08:01] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM lets give it a go" [software/netbox-extras] - https://gerrit.wikimedia.org/r/813577 (owner: Ayounsi)
[09:10:05] <wikibugs>	 (PS2) Jbond: spdx: fix convert role/profile jobs [puppet] - https://gerrit.wikimedia.org/r/813256
[09:10:16] <wikibugs>	 (PS2) Jbond: idp: add spdx headers to idp role and profile [puppet] - https://gerrit.wikimedia.org/r/813257
[09:15:43] <wikibugs>	 (CR) Jbond: [C: +2] P:aptrepo: install python3-apt required by validate_cmd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813251 (owner: Jbond)
[09:17:30] <wikibugs>	 (CR) Jbond: [C: +2] spdx: fix convert role/profile jobs [puppet] - https://gerrit.wikimedia.org/r/813256 (owner: Jbond)
[09:17:34] <wikibugs>	 (CR) Jbond: [C: +2] idp: add spdx headers to idp role and profile [puppet] - https://gerrit.wikimedia.org/r/813257 (owner: Jbond)
[09:21:11] <wikibugs>	 (PS23) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[09:21:47] <wikibugs>	 (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: Jbond)
[09:46:08] <wikibugs>	 (CR) Filippo Giunchedi: wmcs: Add novafullstack alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[09:46:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Fix query string quoting in dashboard/runbook URLs [alerts] - https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[09:46:56] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Test for unquoted query strings in runbook/dashboard [alerts] - https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[09:49:44] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Configure $wgBabelCategoryNames on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/813594 (https://phabricator.wikimedia.org/T312920)
[09:51:58] <wikibugs>	 (PS1) Ayounsi: Netbox _get_circuits: add patch panel support [software/homer] - https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710)
[09:51:59] <wikibugs>	 (PS1) Zabe: acme_chief: Add SPDX headers to acme_chief profile [puppet] - https://gerrit.wikimedia.org/r/813595 (https://phabricator.wikimedia.org/T308013)
[09:52:01] <wikibugs>	 (PS1) Zabe: admin: Add SPDX headers to admin profile [puppet] - https://gerrit.wikimedia.org/r/813596 (https://phabricator.wikimedia.org/T308013)
[09:52:04] <wikibugs>	 (PS1) Zabe: airflow: Add SPDX headers to airflow profile [puppet] - https://gerrit.wikimedia.org/r/813597 (https://phabricator.wikimedia.org/T308013)
[09:52:05] <wikibugs>	 (PS1) Zabe: alertmanager: Add SPDX headers to alertmanager profile [puppet] - https://gerrit.wikimedia.org/r/813598 (https://phabricator.wikimedia.org/T308013)
[09:52:07] <wikibugs>	 (PS1) Zabe: apt: Add SPDX headers to apt profile [puppet] - https://gerrit.wikimedia.org/r/813599 (https://phabricator.wikimedia.org/T308013)
[09:52:09] <wikibugs>	 (PS1) Zabe: aqs: Add SPDX headers to aqs profile [puppet] - https://gerrit.wikimedia.org/r/813600 (https://phabricator.wikimedia.org/T308013)
[09:52:11] <wikibugs>	 (PS1) Zabe: archiva: Add SPDX headers to archiva profile [puppet] - https://gerrit.wikimedia.org/r/813601 (https://phabricator.wikimedia.org/T308013)
[09:52:13] <wikibugs>	 (PS1) Zabe: base: Add SPDX headers to base profile [puppet] - https://gerrit.wikimedia.org/r/813602 (https://phabricator.wikimedia.org/T308013)
[09:52:15] <wikibugs>	 (PS1) Zabe: bird: Add SPDX headers to bird profile [puppet] - https://gerrit.wikimedia.org/r/813603 (https://phabricator.wikimedia.org/T308013)
[09:52:30] <wikibugs>	 (CR) CI reject: [V: -1] Test for unquoted query strings in runbook/dashboard [alerts] - https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[09:52:55] <wikibugs>	 (PS2) Filippo Giunchedi: Fix query string quoting in dashboard/runbook URLs [alerts] - https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817)
[09:53:59] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2] Fix query string quoting in dashboard/runbook URLs [alerts] - https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[09:54:09] <wikibugs>	 (PS3) Filippo Giunchedi: Test for unquoted query strings in runbook/dashboard [alerts] - https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817)
[09:55:49] <wikibugs>	 (PS3) JMeybohm: Remove statsd from _scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/812333
[09:57:04] <wikibugs>	 (CR) JMeybohm: [C: -1] New service: function-evaluator (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: Ori)
[10:02:57] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2] Test for unquoted query strings in runbook/dashboard [alerts] - https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: Filippo Giunchedi)
[10:05:01] <icinga-wm>	 PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:06:57] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) (owner: Lucas Werkmeister (WMDE))
[10:12:12] <wikibugs>	 (CR) CI reject: [V: -1] Netbox _get_circuits: add patch panel support [software/homer] - https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: Ayounsi)
[10:23:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2012.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[10:23:21] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[10:23:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2012.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[10:25:18] <moritzm>	 !log draining ganeti1028 T311686
[10:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:48] <Lucas_WMDE>	 oh no, wikibugs left
[10:32:44] <wikibugs>	 (CR) Michael Große: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/813609 (https://phabricator.wikimedia.org/T306016) (owner: Michael Große)
[10:38:02] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@bd39e67]: Regular analytics weekly train (2nd try. --force) [analytics/refinery@bd39e67]
[10:42:54] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@bd39e67]: Regular analytics weekly train (2nd try. --force) [analytics/refinery@bd39e67] (duration: 04m 52s)
[11:13:37] <wikibugs>	 (PS1) Hnowlan: Basic blubber file for thumbor [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104)
[11:31:47] <wikibugs>	 (CR) Jbond: [C: +2] apt: Add SPDX headers to apt profile [puppet] - https://gerrit.wikimedia.org/r/813599 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[11:32:51] <wikibugs>	 (CR) Ayounsi: "Example diff before:" [software/homer] - https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: Ayounsi)
[11:43:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: codfw s8 sanitarium master switch
[11:43:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: codfw s8 sanitarium master switch
[11:49:03] <wikibugs>	 (PS1) Marostegui: mariadb: db2082 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/813619 (https://phabricator.wikimedia.org/T311493)
[11:50:23] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: db2082 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/813619 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[11:58:33] <wikibugs>	 (CR) Ayounsi: [C: +2] interface_automation: refresh the interfaces after deleting cables [software/netbox-extras] - https://gerrit.wikimedia.org/r/813577 (owner: Ayounsi)
[11:59:57] <wikibugs>	 (Merged) jenkins-bot: interface_automation: refresh the interfaces after deleting cables [software/netbox-extras] - https://gerrit.wikimedia.org/r/813577 (owner: Ayounsi)
[12:06:24] <icinga-wm>	 RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:06:33] <wikibugs>	 (03CR) 10David Caro: wmcs: Add novafullstack alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/813274 (owner: 10David Caro)
[12:08:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2018.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[12:08:48] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[12:08:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2018.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[12:12:11] <moritzm>	 !log draining ganeti2028 T311686
[12:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:49] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users  for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Ottomata) Approved once we have the details.  Please see https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request? for help...
[12:59:45] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] "+1, this looks like a rather straightforward shuffle, with no logic changes. Thanks for moving code out of __init__.py into grid.py. I too" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 (owner: 10David Caro)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220713T1300).
[13:00:04] <jouncebot>	 Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <Lucas_WMDE>	 o/
[13:00:16] <Lucas_WMDE>	 still in a meeting, I can deploy in 5 minutes
[13:01:22] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 (owner: 10David Caro)
[13:03:15] <Lucas_WMDE>	 alright, let’s go
[13:04:11] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Configure $wgBabelCategoryNames on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813594 (https://phabricator.wikimedia.org/T312920) (owner: 10Lucas Werkmeister (WMDE))
[13:04:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic2049.codfw.wmnet
[13:04:56] <wikibugs>	 (03Merged) 10jenkins-bot: Configure $wgBabelCategoryNames on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813594 (https://phabricator.wikimedia.org/T312920) (owner: 10Lucas Werkmeister (WMDE))
[13:05:15] <Lucas_WMDE>	 testing on mwdebug1001
[13:05:23] <inflatador>	 !log bking@elastic2049 rebooting for read-only fs
[13:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:37] <Lucas_WMDE>	 yup, does what it’s supposed to
[13:06:47] <wikibugs>	 (03CR) 10David Caro: wmcs: use run_* instead of run_sync/run_async (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 (owner: 10David Caro)
[13:08:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813594|Configure $wgBabelCategoryNames on Test Wikidata (T312920)]] (duration: 02m 51s)
[13:08:51] <stashbot>	 T312920: Configure $wgBabelCategoryNames on Test Wikidata - https://phabricator.wikimedia.org/T312920
[13:09:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:10:09] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441)
[13:10:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:10:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:10:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:11:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: wmcs: Add novafullstack alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/813274 (owner: 10David Caro)
[13:12:10] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) (owner: 10Lucas Werkmeister (WMDE))
[13:12:58] <wikibugs>	 (03PS24) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[13:13:54] <wikibugs>	 (03Merged) 10jenkins-bot: Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) (owner: 10Lucas Werkmeister (WMDE))
[13:14:15] <Lucas_WMDE>	 this one can’t be tested, I’ll sync it directly
[13:16:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:17:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:17:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:17:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790399|Configure wgLexemeLexicalCategoryItemIds on Wikidata (T307441)]] (duration: 02m 45s)
[13:17:24] <stashbot>	 T307441: configure correct Q-IDs for lexical categories in production for deployment - https://phabricator.wikimedia.org/T307441
[13:17:44] <Lucas_WMDE>	 I think that’s it, unless anyone else has something to deploy?
[13:18:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:20:07] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host elastic2049.codfw.wmnet
[13:20:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36265/console" [puppet] - 10https://gerrit.wikimedia.org/r/677558 (https://phabricator.wikimedia.org/T267395) (owner: 10Jbond)
[13:20:35] <Lucas_WMDE>	 !log UTC afternoon backport window done
[13:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 (https://phabricator.wikimedia.org/T267395) (owner: 10Jbond)
[13:25:56] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:26:46] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2049 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[13:26:58] <icinga-wm>	 PROBLEM - SSH on elastic2049 is CRITICAL: connect to address 10.192.32.179 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:27:26] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2049 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[13:30:38] <wikibugs>	 (03PS6) 10Jbond: initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 (https://phabricator.wikimedia.org/T267395)
[13:33:38] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on elastic2049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King alert should have been suppressed, my apologies https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:33:38] <icinga-wm>	 ACKNOWLEDGEMENT - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% Brian_King alert should have been suppressed, my apologies
[13:35:30] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:37:35] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2049.codfw.wmnet with OS bullseye
[13:37:44] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye
[13:37:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 (https://phabricator.wikimedia.org/T267395) (owner: 10Jbond)
[13:39:50] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:42:54] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:43:20] <XioNoX>	 looking
[13:43:27] <godog>	 here too
[13:43:35] <Amir1>	 here
[13:43:38] <bblack>	 here
[13:43:40] <Emperor>	 here if needed
[13:44:04] * jbond here
[13:44:12] <moritzm>	 around
[13:44:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from x1 master', diff saved to https://phabricator.wikimedia.org/P31037 and previous config saved to /var/cache/conftool/dbconfig/20220713-134413-marostegui.json
[13:44:25] <XioNoX>	 https://librenms.wikimedia.org/graphs/to=1657719600/id=17846/type=port_bits/from=1657633200/ brief spike
[13:45:06] <vgutierrez>	 yep.. upload is getting some love
[13:45:20] <vgutierrez>	 https://grafana.wikimedia.org/d/uz11QGcnk/haproxy-tls-cluster-view?orgId=1&var-cluster=upload&var-site=eqsin&viewPanel=6&from=now-3h&to=now
[13:47:06] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2049.codfw.wmnet with OS bullseye
[13:47:14] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye executed...
[13:47:54] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[14:00:39] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@bd39e67] (thin): Regular analytics weekly train THIN [analytics/refinery@bd39e67]
[14:00:47] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@bd39e67] (thin): Regular analytics weekly train THIN [analytics/refinery@bd39e67] (duration: 00m 07s)
[14:01:05] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@bd39e67] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bd39e67]
[14:04:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2049.codfw.wmnet with OS bullseye
[14:04:38] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye
[14:08:47] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@bd39e67] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bd39e67] (duration: 07m 42s)
[14:11:23] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2049.codfw.wmnet with OS bullseye
[14:11:30] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye executed...
[14:15:44] <wikibugs>	 (03PS1) 10Elukey: ml-services: test event production for editquality in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/813633 (https://phabricator.wikimedia.org/T301878)
[14:18:59] <aqu>	 !log Deployed refinery using scap, then deployed onto hdfs
[14:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:28] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@03c1a05]: Deploy [airflow-dags/analytics_test@03c1a05]
[14:34:41] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@03c1a05]: Deploy [airflow-dags/analytics_test@03c1a05] (duration: 00m 12s)
[14:36:28] <wikibugs>	 (03PS8) 10Ottomata: [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu)
[14:37:28] <wikibugs>	 (03PS7) 10Ori: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698)
[14:37:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu)
[14:37:34] <wikibugs>	 (03CR) 10Ori: New service: function-evaluator (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori)
[14:38:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2049.codfw.wmnet with OS bullseye
[14:38:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori)
[14:38:36] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye
[14:41:12] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:41:53] <wikibugs>	 (03PS8) 10Ori: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698)
[14:42:27] <wikibugs>	 (03CR) 10Ori: New service: function-evaluator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori)
[14:51:21] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10bking) This is still an ongoing issue. I tried reimaging to Bullseye, but the installer cannot detect any hard drives.   DC Ops, are you able to take a look?
[14:52:39] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2049.codfw.wmnet with OS bullseye
[14:52:47] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2049.codfw.wmnet with OS bullseye executed...
[14:53:27] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10Cmjohnson) 05Open→03Resolved Disk replaced
[14:54:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10Cmjohnson) @btullis I am a little behind but can we do this now?
[15:01:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Cmjohnson) @Eevans Are you ready for these moves yet?
[15:03:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Cmjohnson) @marostegui row C is very tight and I can only move 2 of the several servers that need to go.  I would like to move these 2 of yours first.  Can we schedule this for tomorrow 14 Ju...
[15:07:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Marostegui) @Cmjohnson that works for me. I will get those two hosts ready for you
[15:09:45] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10Marostegui) Thanks!
[15:10:32] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9edd1ab]: Deploy [airflow-dags/analytics_test@9edd1ab]
[15:10:40] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9edd1ab]: Deploy [airflow-dags/analytics_test@9edd1ab] (duration: 00m 08s)
[15:11:02] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] wmcs: use run_* instead of run_sync/run_async (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 (owner: 10David Caro)
[15:11:40] <wikibugs>	 (03PS3) 10Ori: [Beta] PoolCounter configuration for Wikilambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813294 (https://phabricator.wikimedia.org/T310644)
[15:11:50] <wikibugs>	 (03CR) 10Ori: [C: 03+2] [Beta] PoolCounter configuration for Wikilambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813294 (https://phabricator.wikimedia.org/T310644) (owner: 10Ori)
[15:12:07] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@9edd1ab]: Deploy [airflow-dags/analytics@9edd1ab]
[15:12:18] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@9edd1ab]: Deploy [airflow-dags/analytics@9edd1ab] (duration: 00m 10s)
[15:13:02] <wikibugs>	 (03Merged) 10jenkins-bot: [Beta] PoolCounter configuration for Wikilambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813294 (https://phabricator.wikimedia.org/T310644) (owner: 10Ori)
[15:18:27] <wikibugs>	 (03PS9) 10Ottomata: [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu)
[15:19:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:20:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Build spark yarn archive for Spark 3 from conda-analytics package [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu)
[15:20:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:20:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:21:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:22:57] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10bking) a:03Papaul
[15:24:55] <wikibugs>	 (03PS25) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[15:25:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Cmjohnson)
[15:32:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users  for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Bethany) Seeking access to "All of the above"   - LDAP membership in the wmf or nda LDAP group.   - Shell (posix) membership in the `analytics-privatedat...
[15:49:00] <wikibugs>	 (03PS26) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[15:49:21] <wikibugs>	 (03PS27) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[15:54:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: test event production for editquality in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/813633 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[15:55:47] <wikibugs>	 (03PS10) 10Mark Bergsma: sre: Port swift alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765)
[15:56:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2040.codfw.wmnet with OS bullseye
[15:56:11] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2040.codfw.wmnet with OS bullseye
[15:56:59] <wikibugs>	 (03PS1) 10Majavah: prometheus: blackbox: add support for matching status codes [puppet] - 10https://gerrit.wikimedia.org/r/813649
[15:57:01] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: update web_domain default value [puppet] - 10https://gerrit.wikimedia.org/r/813650
[15:57:03] <wikibugs>	 (03PS1) 10Majavah: P:toolforge:k8s:haproxy: add k8s-ingress monitoring [puppet] - 10https://gerrit.wikimedia.org/r/813651
[15:58:06] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:58:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:58:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:59:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:toolforge:k8s:haproxy: add k8s-ingress monitoring [puppet] - 10https://gerrit.wikimedia.org/r/813651 (owner: 10Majavah)
[15:59:24] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2040.codfw.wmnet with OS bullseye
[15:59:33] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2040.codfw.wmnet with OS bullseye executed...
[15:59:47] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:00:03] <wikibugs>	 (03PS2) 10Majavah: P:toolforge:k8s:haproxy: add k8s-ingress monitoring [puppet] - 10https://gerrit.wikimedia.org/r/813651
[16:13:06] <wikibugs>	 (03PS1) 10Majavah: prometheus: blackbox_exporter: remove un-managed module files [puppet] - 10https://gerrit.wikimedia.org/r/813654
[16:17:24] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@e58e61d]: (no justification provided)
[16:17:31] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) the install error message said  `              No root file system               │               │ No root file system is defined.                 │               │...
[16:17:34] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@e58e61d]: (no justification provided) (duration: 00m 10s)
[16:20:56] <wikibugs>	 (03CR) 10David Caro: prometheus: blackbox: add support for matching status codes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813649 (owner: 10Majavah)
[16:21:07] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] P:toolforge: update web_domain default value [puppet] - 10https://gerrit.wikimedia.org/r/813650 (owner: 10Majavah)
[16:21:58] <wikibugs>	 (03CR) 10Majavah: prometheus: blackbox: add support for matching status codes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813649 (owner: 10Majavah)
[16:22:20] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] P:toolforge:k8s:haproxy: add k8s-ingress monitoring [puppet] - 10https://gerrit.wikimedia.org/r/813651 (owner: 10Majavah)
[16:24:01] <wikibugs>	 (03CR) 10David Caro: prometheus: blackbox: add support for matching status codes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813649 (owner: 10Majavah)
[16:24:03] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] prometheus: blackbox: add support for matching status codes [puppet] - 10https://gerrit.wikimedia.org/r/813649 (owner: 10Majavah)
[16:24:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users  for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Ottomata) Approved.
[16:26:19] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Reedy)
[16:28:59] <wikibugs>	 (03PS1) 10Thcipriani: CampaignEvents: backport extension for Jul 18 beta deploy [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752)
[16:32:06] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "This probably has the effect of keeping this script from ever returning non-zero, but that's not large step from the prior behavior. A fol" [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro)
[16:38:47] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:41:53] <wikibugs>	 (03CR) 10Zabe: CampaignEvents: backport extension for Jul 18 beta deploy (031 comment) [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) (owner: 10Thcipriani)
[16:42:28] <wikibugs>	 (03PS2) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed [puppet] - 10https://gerrit.wikimedia.org/r/812819
[16:43:10] <wikibugs>	 (03CR) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro)
[16:44:05] <wikibugs>	 (03PS3) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed [puppet] - 10https://gerrit.wikimedia.org/r/812819
[16:44:08] <wikibugs>	 (03CR) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro)
[16:48:53] <wikibugs>	 (03CR) 10Thcipriani: CampaignEvents: backport extension for Jul 18 beta deploy (031 comment) [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) (owner: 10Thcipriani)
[16:56:30] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "untested, but the code reads much better and seems to have more robust error handling which was the goal :)" [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro)
[16:56:38] <Dreamy_Jazz>	 Posted in #wikimedia-tech but probably also relevant here. The code coverage report for patches seems to be taking an extraordinary amount of time to complete.
[16:57:02] <Dreamy_Jazz>	 Some patches are at 42 mins completed with 0 mins left to go according to https://integration.wikimedia.org/zuul/
[16:57:31] <Dreamy_Jazz>	 Answered in the tech channel
[16:58:52] <bd808>	 Dreamy_Jazz: #wikimedia-releng would be where anyone likely to fix it typically hangs out. Looking at https://integration.wikimedia.org/ci/ I think there is some general zuul<->jenkins sadness happening as there are lots of free executors.
[17:03:42] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/812819 (owner: 10David Caro)
[17:04:07] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:12:23] <wikibugs>	 (03PS2) 10David Caro: wmcs: Add novafullstack alerts [alerts] - 10https://gerrit.wikimedia.org/r/813274
[17:12:25] <wikibugs>	 (03CR) 10David Caro: wmcs: Add novafullstack alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/813274 (owner: 10David Caro)
[17:25:44] <wikibugs>	 (03CR) 10David Caro: wmcs: use run_* instead of run_sync/run_async (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 (owner: 10David Caro)
[17:35:38] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] "I asked some "teach me more about how WMF uses puppet" style questions, nothing blocking." [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro)
[17:37:27] <wikibugs>	 (03PS7) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914
[17:37:29] <wikibugs>	 (03PS7) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915
[17:37:31] <wikibugs>	 (03PS3) 10David Caro: ceph: add alert handling to ceph custer downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812900
[17:37:33] <wikibugs>	 (03PS3) 10David Caro: wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901
[17:37:35] <wikibugs>	 (03PS3) 10David Caro: toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812902
[17:37:37] <wikibugs>	 (03PS1) 10David Caro: wmcs: move openstack/__init__.py to openstack/common.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/813661
[17:37:39] <wikibugs>	 (03PS1) 10David Caro: wmcs: move wmcs/__init__.py to wmcs/libs/common.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/813662
[17:48:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[17:48:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmne...
[17:48:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye
[17:48:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet wi...
[17:48:45] <wikibugs>	 (03PS3) 10DDesouza: QuickSurveys: Disable 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015)
[17:52:17] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:52:51] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:52:56] <wikibugs>	 (03CR) 10David Caro: novafullstack: remove leaked VMs test, moved to alertmanager (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro)
[17:54:30] <sukhe>	 !log upload dnsdist_1.7.2-1+wmf11u1 to apt.wm.org (bullseye) - T305589
[17:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:34] <stashbot>	 T305589: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589
[17:56:37] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) >>! In T307035#8075107, @Cmjohnson wrote: > @Eevans Are you ready for these moves yet?   We're no closer to having those new servers up than we were at my last updat...
[17:58:28] <wikibugs>	 (03PS4) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015)
[17:58:37] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] novafullstack: remove leaked VMs test, moved to alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro)
[17:59:27] <wikibugs>	 (03PS5) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015)
[18:00:05] <jouncebot>	 Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220713T1800)
[18:04:29] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:13:47] <wikibugs>	 (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[18:15:27] <wikibugs>	 (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[18:16:20] <wikibugs>	 (03CR) 10DDesouza: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[18:20:32] <sukhe>	 !log upload pdns-recursor_4.6.2-1+wmf11u1 to apt.wm.org (bullseye) - T305589
[18:20:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:36] <stashbot>	 T305589: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589
[18:21:08] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Avoid localized digits in internal timestamps in JS [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813666 (https://phabricator.wikimedia.org/T312828)
[18:25:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Aklapper) >>! In T312827#8075269, @Bethany wrote: >   - LDAP membership in the wmf or nda LDAP group. That seems to be already the case per https://ldap.t...
[18:32:05] <wikibugs>	 10SRE, 10Data-Engineering, 10Discovery, 10Event-Platform, 10Platform Team Workboards (Clinic Duty Team): Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Ottomata) Happened again today. There was a mediawiki.recentchange event with a 2015 timestamp.
[18:33:54] <ottomata>	 hashar:  i added comments to your jsonschemaconverter patch, not sure if you saw them
[18:41:41] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:45:50] <wikibugs>	 (03PS2) 10David Caro: wmcs: don't page for most checks [puppet] - 10https://gerrit.wikimedia.org/r/813267
[18:45:52] <wikibugs>	 (03PS3) 10David Caro: novafullstack: remove leaked VMs test, moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/813275
[18:45:54] <wikibugs>	 (03CR) 10David Caro: novafullstack: remove leaked VMs test, moved to alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro)
[18:54:51] <icinga-wm>	 PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-system-prune-dangling.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:17] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Disable DiscussionTools beta feature at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813691 (https://phabricator.wikimedia.org/T310960)
[19:08:03] <wikibugs>	 (03PS1) 10Cmjohnson: Adding cloudweb1003/4 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/813697 (https://phabricator.wikimedia.org/T305414)
[19:09:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Adding cloudweb1003/4 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/813697 (https://phabricator.wikimedia.org/T305414) (owner: 10Cmjohnson)
[19:33:15] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10lmata)
[19:35:01] <wikibugs>	 10SRE, 10ops-codfw, 10Patch-For-Review: Document codfw breakout patch panels in Netbox - https://phabricator.wikimedia.org/T304710 (10Papaul) 05Open→03Resolved a:03Papaul This is complete https://netbox.wikimedia.org/dcim/devices/3582/rear-ports/ https://netbox.wikimedia.org/dcim/devices/3581/front-ports/
[19:41:35] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata)
[19:44:31] <wikibugs>	 10SRE, 10SRE Observability (FY2022/2023-Q1): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10lmata)
[19:54:35] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10bking) @Papaul the DRAC does not detect any hard drives.   I checked under "Storage" in the Web UI, and it says " RAC0501: There are no physical disks to be displayed. 1. Che...
[19:59:23] <danisztls>	 MatmaRex: I think you accidentally removed my change on the deployment page. :) I restored it.
[19:59:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2040.codfw.wmnet with OS bullseye
[19:59:43] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2040.codfw.wmnet with OS bullseye
[19:59:44] <danisztls>	 If it's due to some other reason, it can be deployed later.
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220713T2000).
[20:00:05] <jouncebot>	 zabe, danisztls, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:22] <MatmaRex>	 danisztls: oh oops! sorry. i wonder how i managed to do that
[20:00:33] <danisztls>	 no problem 
[20:00:53] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:53] <zabe>	 hey o/
[20:01:00] <danisztls>	 just asked because maybe it had a reason
[20:08:31] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:09:24] <MatmaRex>	 anyone deploying?
[20:12:31] <RhinosF1>	 I asked Lucas as they're around if they can
[20:13:16] <RhinosF1>	 he's coming
[20:14:18] <Lucas_WMDE>	 o/
[20:14:54] <Lucas_WMDE>	 still lots of annoying UploadFromChunks errors in logspam-watch, I see
[20:14:56] <Lucas_WMDE>	 meh
[20:14:57] <RhinosF1>	 MatmaRex, zabe, danisztls: ^
[20:15:07] <danisztls>	 o/
[20:15:37] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Undeploy CongressLookup (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813338 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:16:45] <MatmaRex>	 thanks
[20:17:40] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Undeploy CongressLookup (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813338 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:18:09] * Lucas_WMDE looks up what this extension used to do
[20:18:27] <Lucas_WMDE>	 wow
[20:18:31] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy CongressLookup (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813338 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:18:56] <Lucas_WMDE>	 I feel like that wouldn’t be built as a MediaWiki extension today
[20:19:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2040.codfw.wmnet with reason: host reimage
[20:19:42] <Lucas_WMDE>	 zabe: first CongressLookup undeploy is on mwdebug1001
[20:19:48] <Lucas_WMDE>	 I guess it can’t really be tested
[20:19:56] * Lucas_WMDE quickly checks the wiki isn’t totally broken
[20:19:56] <zabe>	 nah, I don't see how
[20:20:16] <zabe>	 I will keep an eye on logstash after the sync
[20:20:52] <Lucas_WMDE>	 whoop, almost synced the wrong file though
[20:20:59] <Lucas_WMDE>	 CommonSettings, not InitialiseSettings ^^
[20:21:28] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Undeploy CongressLookup (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813339 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:22:37] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2040.codfw.wmnet with reason: host reimage
[20:23:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:813338|Undeploy CongressLookup (part 1) (T312894)]] (duration: 03m 04s)
[20:23:53] <stashbot>	 T312894: Undeploy CongressLookup - https://phabricator.wikimedia.org/T312894
[20:23:58] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "`grep -rF wmgUseCongressLookup` confirms nothing else uses this global 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813339 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:24:43] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy CongressLookup (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813339 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:24:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:25:00] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:25:07] <Lucas_WMDE>	 second change is on mwdebug1001
[20:25:39] <Lucas_WMDE>	 doesn’t look like anything is broken there either, let’s sync
[20:25:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:25:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:26:00] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:26:34] <Lucas_WMDE>	 API spike seems to have gone down again already, though still above previous levels
[20:26:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:26:58] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Undeploy CongressLookup (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813340 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:28:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813339|Undeploy CongressLookup (part 2) (T312894)]] (duration: 02m 53s)
[20:29:40] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:30:06] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Undeploy CongressLookup (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813340 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:30:40] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:31:42] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy CongressLookup (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813340 (https://phabricator.wikimedia.org/T312894) (owner: 10Zabe)
[20:31:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:32:44] <Lucas_WMDE>	 pulled the third change to mwdebug1001
[20:32:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:32:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:33:18] <zabe>	 nothing seems to break
[20:33:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:33:47] <Lucas_WMDE>	 syncing
[20:34:14] <Lucas_WMDE>	 (I looked at the last undeployed extension just in case the extension-list removal should wait for a train or anything like that, but it looks like CodeReview was also undeployed in one day, so should be fine)
[20:35:56] <Lucas_WMDE>	 let’s start the gate-and-submit for MatmaRex’ backport already
[20:36:12] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Avoid localized digits in internal timestamps in JS [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813666 (https://phabricator.wikimedia.org/T312828) (owner: 10Bartosz Dziewoński)
[20:36:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:813340|Undeploy CongressLookup (part 3) (T312894)]] (duration: 03m 00s)
[20:36:49] <stashbot>	 T312894: Undeploy CongressLookup - https://phabricator.wikimedia.org/T312894
[20:38:15] <zabe>	 extension-list is used to build the LocalisationCache and since the extension is no longer used it should be fine to no longer build the messages for it
[20:38:29] <Lucas_WMDE>	 yeah, and I expect that’ll take effect with the next train
[20:38:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:39:47] <Lucas_WMDE>	 danisztls: I’m trying to figure out if QuickSurvey removals usually also disable the extension again
[20:39:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:39:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:40:07] <danisztls>	 Lucas_WMDE: that's not documented AFAIK
[20:40:50] <Lucas_WMDE>	 I’d feel more comfortable with a config change to just remove the surveys but keep the extension, tbh
[20:40:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:41:03] <danisztls>	 ok, will patch it
[20:41:11] <Lucas_WMDE>	 like in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/656116
[20:41:28] <Lucas_WMDE>	 (i.e., for the actual survey entry, remove jawiki completely instead of setting it to [] ?)
[20:43:03] <wikibugs>	 (03Merged) 10jenkins-bot: Avoid localized digits in internal timestamps in JS [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813666 (https://phabricator.wikimedia.org/T312828) (owner: 10Bartosz Dziewoński)
[20:43:16] <Lucas_WMDE>	 ok, in the meantime, let’s merge that backport, which went quicker than expected
[20:43:18] <danisztls>	 ok
[20:43:23] <Lucas_WMDE>	 (than I expected, at least ^^)
[20:44:16] <Lucas_WMDE>	 MatmaRex: the localised digits backport should be on mwdebug1001, can you test it?
[20:44:31] <wikibugs>	 (03PS6) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015)
[20:44:33] <MatmaRex>	 yeah
[20:44:40] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2040.codfw.wmnet with OS bullseye
[20:44:51] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2040.codfw.wmnet with OS bullseye completed...
[20:45:05] <MatmaRex>	 Lucas_WMDE: seems good
[20:45:18] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): QuickSurveys: Undeploy 'research-incentive' (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:45:22] <Lucas_WMDE>	 ok thanks
[20:46:07] <Lucas_WMDE>	 syncing
[20:48:26] <Lucas_WMDE>	 danisztls: I left a comment on the commit message, otherwise it looks good
[20:48:29] <danisztls>	 Lucas_WMDE: related discussion https://phabricator.wikimedia.org/T213459, in the past QS extension caused a performance hit even when wiki had no surveys enabled but that looks to be fixed
[20:48:44] <Lucas_WMDE>	 I see
[20:48:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/DiscussionTools/modules/CommentItem.js: Backport: [[gerrit:813666|Avoid localized digits in internal timestamps in JS (T312828)]] (duration: 02m 49s)
[20:48:58] <stashbot>	 T312828: "Could not find the comment you're replying to on the page" (caused by bugs with changed timestamp format) - https://phabricator.wikimedia.org/T312828
[20:49:00] <wikibugs>	 (03PS7) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015)
[20:49:07] <Lucas_WMDE>	 I wouldn’t mind a follow-up change to disable the extension again
[20:49:15] <Lucas_WMDE>	 just don’t feel comfortable deploying that now, with not much time left in the window
[20:49:15] <wikibugs>	 (03CR) 10DDesouza: QuickSurveys: Undeploy 'research-incentive' (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:49:30] <danisztls>	 Lucas_WMDE: ok
[20:49:37] <danisztls>	 fixed the commit msg
[20:49:41] <Lucas_WMDE>	 since I didn’t quickly spot any changes like that in the git log
[20:49:50] <Lucas_WMDE>	 thanks!
[20:50:39] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "The extension could also be undeployed from jawiki later, but we dropped that from this change because I didn’t feel as confident about it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:51:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:52:36] <wikibugs>	 (03PS8) 10Lucas Werkmeister (WMDE): QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:52:40] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:53:07] <Lucas_WMDE>	 I thought Zuul/Gerrit auto-rebased config changes? but this time it didn’t work
[20:53:17] <Lucas_WMDE>	 meh
[20:53:27] <RhinosF1>	 Sometimes it randomly says no
[20:53:35] <wikibugs>	 (03Merged) 10jenkins-bot: QuickSurveys: Undeploy 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:53:39] <Reedy>	 jgit sucks
[20:53:44] <RhinosF1>	 Not all of those times are actual conflicts
[20:54:05] <RhinosF1>	 What Reedy says and Reedy can I bribe you into a CR while you're here
[20:55:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:55:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:55:11] <Lucas_WMDE>	 danisztls: okay, config change is on mwdebug1001, can you test it?
[20:55:17] <danisztls>	 Lucas_WMDE: sure
[20:55:24] <Lucas_WMDE>	 jouncebot: next
[20:55:24] <jouncebot>	 In 9 hour(s) and 4 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T0600)
[20:55:39] <Lucas_WMDE>	 alright, then we’ll just run over a bit to do MatmaRex’ last config change as well
[20:55:49] <Lucas_WMDE>	 if nothing else is happening after this window
[20:56:00] <danisztls>	 Lucas_WMDE: lgtm, wiki still opens, survey doesn't show
[20:56:06] <Lucas_WMDE>	 alright, thanks
[20:58:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:59:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:812377|QuickSurveys: Undeploy 'research-incentive' (T311015)]] (1/2, prod) (duration: 02m 48s)
[20:59:21] <stashbot>	 T311015: Deploy QuickSurvey on Japanese Wikipedia - https://phabricator.wikimedia.org/T311015
[20:59:41] <danisztls>	 Lucas_WMDE: thanks!
[21:00:34] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Disable DiscussionTools beta feature at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813691 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński)
[21:01:07] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:812377|QuickSurveys: Undeploy 'research-incentive' (T311015)]] (2/2, beta) (duration: 02m 58s)
[21:02:45] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Disable DiscussionTools beta feature at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813691 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński)
[21:03:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:04:18] <wikibugs>	 (03Merged) 10jenkins-bot: Disable DiscussionTools beta feature at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813691 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński)
[21:04:31] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:00] <Lucas_WMDE>	 MatmaRex: mediawikiwiki DiscussionTools beta disablement is on mwdebug1001, please test
[21:05:44] <MatmaRex>	 Lucas_WMDE: yep, seems good
[21:05:49] <Lucas_WMDE>	 ok!
[21:06:08] <Lucas_WMDE>	 syncing
[21:06:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:06:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:06:34] <RhinosF1>	 Lucas_WMDE: thanks for showing up so late to help
[21:06:44] <Lucas_WMDE>	 np :)
[21:08:30] <wikibugs>	 (03PS1) 10Krinkle: ResourceLoader: Remove DependencyStore::renew [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813670 (https://phabricator.wikimedia.org/T113916)
[21:08:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813691|Disable DiscussionTools beta feature at mediawikiwiki (T310960)]] (duration: 02m 47s)
[21:08:58] <stashbot>	 T310960: [Config Change] Make all DiscussionTools available by default at mediawiki.org - https://phabricator.wikimedia.org/T310960
[21:09:05] <Lucas_WMDE>	 !log UTC late backport+config window done
[21:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:14] <MatmaRex>	 thanks Lucas_WMDE
[21:10:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:15:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:17:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:17:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:18:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:44:13] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) @bking the IDRAC will only detect hard drives under storage in the Web UI if the system has a HW raid controller.
[21:44:23] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:31] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:56:54] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) I can check on the physical disks when i am on site tomorrow.
[22:09:03] <inflatador>	 !log bking@elastic2055 staging NIC firmware updates for elastic2055-2060
[22:09:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:18] <inflatador>	 !log bking@elastic2055 successfully staged NIC firmware updates for elastic2055-2060
[22:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:50] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) Discussed some of the sre.reimage cookbook failure scenarios with  @RKemper  and @EBernhardson  today:     - After a reimage failure, th...
[22:32:09] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:45] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:40] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Papaul) >>! In T289135#8076562, @bking wrote: > Discussed some of the sre.reimage cookbook failure scenarios with  @RKemper  and @EBernhardson...
[22:46:53] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:01:36] <wikibugs>	 (03PS1) 10Cwhite: profile: make loki data directory configurable [puppet] - 10https://gerrit.wikimedia.org/r/813715 (https://phabricator.wikimedia.org/T222826)
[23:11:27] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:21:14] <wikibugs>	 10ops-eqiad, 10DC-Ops: Please verify location of an-worker1111.eqiad.wmnet - https://phabricator.wikimedia.org/T298785 (10wiki_willy) Hi @BTullis - I'm just coming across this request now.  It was missing the "ops-eqiad" project tag, so looks like it fell through the cracks.  I'll add the appropriate tag and e...
[23:22:12] <wikibugs>	 10ops-eqiad, 10DC-Ops: Please verify location of an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T298621 (10wiki_willy) Adding "ops-eqiad' project tag
[23:27:54] <wikibugs>	 10ops-eqiad, 10DC-Ops: Failed disk on analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T293111 (10wiki_willy) Looks like this was missing the "ops-eqiad" project tag, so it fell through the cracks.  @BTullis - since the hardware was installed to refresh this host in T293922, do you still need this...
[23:30:17] <wikibugs>	 10ops-eqiad, 10DC-Ops: Relabel db1183 to be dbstore1007 - https://phabricator.wikimedia.org/T284126 (10wiki_willy) Looks like this one fell through the cracks, so adding the "ops-eqiad" project tag.  @Cmjohnson or @Jclark-ctr - can one of you guys see if this one already has the correct label on it?  Thanks, W...
[23:34:50] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10wiki_willy) a:03Papaul Looks like this one fell through the cracks without the "ops-codfw" project tag, so adding it back in.  cc @Papaul
[23:51:58] <wikibugs>	 (03PS1) 10Cwhite: hiera: deploy and enable loki on grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/813724 (https://phabricator.wikimedia.org/T222826)
[23:52:35] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10RobH)