[00:05:08] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:50] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32365 and previous config saved to /var/cache/conftool/dbconfig/20220812-001715-ladsgroup.json
[00:17:20] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[00:17:30] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (cmooney) p:Triage→High
[00:21:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) p:Triage→Medium
[00:32:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32366 and previous config saved to /var/cache/conftool/dbconfig/20220812-003221-ladsgroup.json
[00:33:53] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[00:36:48] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:18] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:39:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:40:28] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:42:40] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:46] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32367 and previous config saved to /var/cache/conftool/dbconfig/20220812-004727-ladsgroup.json
[00:53:00] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (tstarling) I grepped for rsvg in exec.log and found nothing, going back to May, so it looks like T260504 is sufficiently complete that we don't have to upgrade librsvg on the...
[00:54:18] <jinxer-wm>	 (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:56:34] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (ori) We got alerts about the Beta Cluster cert being close to expiry...
[01:02:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32368 and previous config saved to /var/cache/conftool/dbconfig/20220812-010233-ladsgroup.json
[01:02:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[01:02:39] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[01:02:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[01:02:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:03:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:03:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32369 and previous config saved to /var/cache/conftool/dbconfig/20220812-010312-ladsgroup.json
[01:35:43] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (cmooney) a:cmooney Thanks @ayounsi   > One surprising point though is that the path through the...
[01:35:56] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (cmooney)
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:50] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:03:00] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:54] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:49] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[02:13:46] <jinxer-wm>	 (Emergency syslog message) firing: (2) Alert for device lsw1-f2-eqiad.mgmt.eqiad.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[02:17:18] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:46] <jinxer-wm>	 (Emergency syslog message) resolved: (2) Device lsw1-f2-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:48] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (cmooney) I ended up issuing this command: ` request app-engine service restart packet-forwarding-engin...
[02:43:08] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:53:28] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:48] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:26] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:44] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:10] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:22] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:44] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:52] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:10] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:30:52] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:14] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:58:56] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:18] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:42] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:06] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:22] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:46] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:44] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:36] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:46] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:38] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:58] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:54] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:44] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:14] <wikibugs>	 SRE-OnFire, Wikidata, Wikidata-Query-Service, wdwb-tech, and 4 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (Joe) Open→Stalled a:Joe→None Hi, any news on this front?  I'll release this bug as its completion doesn't dep...
[05:16:02] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:54] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:04] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:10] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:16] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=elastic110.*
[05:59:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (RKemper) (Following is just related to bringing these hosts back into service)  Pooled the hosts:  ` r...
[06:01:19] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=elastic10[8-9][0-9].*
[06:02:30] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:28:18] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting, Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (Joe) >>! In T314840#8144986, @fgiunchedi wrote: > Thank you for vopsbot, looks really good and useful! >  > A perhaps silly/minor thing: I think we should be using `-` ins...
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220812T0700)
[07:01:54] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:44] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:14:22] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:00] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:44] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:44] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:27:42] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:02] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:44] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:04] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (dcaro) @Cmjohnson Hi!  While trying to setup the first of the hosts here, we noticed that it had only 7 1.8T non-os hard drives, but in t...
[08:15:06] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:06] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:46] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:06] <wikibugs>	 (CR) Jbond: [C: -1] "please check with" [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[08:52:22] <wikibugs>	 (CR) Jbond: [C: +1] "LGTm assuming odimitrijevic re-approves on task" [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[08:59:30] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:50] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:09] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting, Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (Joe)
[09:14:08] <wikibugs>	 (PS7) Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - https://gerrit.wikimedia.org/r/820749
[09:20:17] <wikibugs>	 (PS1) Giuseppe Lavagetto: kubernetes::mediawiki::releases: allow scap users to write releases files [puppet] - https://gerrit.wikimedia.org/r/822610
[09:22:13] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM. I'll deploy this on Tuesday (bank holiday on Monday)" [puppet] - https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) (owner: Ori)
[09:46:08] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36717/console" [puppet] - https://gerrit.wikimedia.org/r/822610 (owner: Giuseppe Lavagetto)
[09:49:56] <wikibugs>	 SRE, Traffic, Upstream: metric discrepancies between ATS 9.x and ATS 8.x - https://phabricator.wikimedia.org/T315064 (Vgutierrez)
[09:59:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "I think the patch does what we want it to, but I'd wait for you to be around so we can run some tests." [puppet] - https://gerrit.wikimedia.org/r/822610 (owner: Giuseppe Lavagetto)
[10:04:57] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (soworu) SSH key: SHA256:9+15cZ0xKHi7PqAzF0LR1NXfsD5ex8PbiojwKfqoLSk soworu@wmf2559  @Vgutierrez Just analytics if fine. I need it to view the extent of use of the plugin.  @O...
[10:07:50] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:09:42] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) > Netbox drives the infrastructure, and not the other way around.  Fully agree that's best.  But unfortunate...
[10:12:27] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) Having thought about it in more detail I think it's best to keep the multihop for the iBGP EVPN sessions.  Reason being that even if a Leaf loses a Spine lin...
[10:13:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) @ayounsi be interested if you've any thoughts on that.
[10:19:20] <wikibugs>	 (PS3) Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:20:04] <wikibugs>	 (CR) CI reject: [V: -1] Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:22:23] <wikibugs>	 (PS4) Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:23:14] <wikibugs>	 (CR) CI reject: [V: -1] Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:24:04] <wikibugs>	 (PS5) Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:28:49] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[10:39:12] <wikibugs>	 (CR) Jcrespo: "Hey, some comments here- most are actually my fault for the initial commit (copy & paste). Let me know what you think of the others. Some," [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:41:44] <wikibugs>	 (CR) Jcrespo: "addendum for the latest update." [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[11:08:50] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[11:15:54] <wikibugs>	 (PS8) Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - https://gerrit.wikimedia.org/r/820749
[11:20:22] <wikibugs>	 (PS9) Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - https://gerrit.wikimedia.org/r/820749
[11:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:46:29] <wikibugs>	 (CR) Jaime Nuche: "Incorporated the changes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/807510" [puppet] - https://gerrit.wikimedia.org/r/820749 (owner: Jaime Nuche)
[11:47:50] <wikibugs>	 (CR) Jaime Nuche: "I ended up adding these changes here https://gerrit.wikimedia.org/r/c/operations/puppet/+/820749" [puppet] - https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) (owner: Jbond)
[12:07:58] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:48] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[12:18:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (ayounsi) yeah I agree +1 on having a stable iBGP capable of handling link failure.  The OSPF adjacency check should be used but IIRC it assumes there are as many v4 s...
[12:21:00] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 90%, RTA = 1622.02 ms
[12:22:32] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:23] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi) > Would a cookbook be an idea possibly? That we could run ourselves to update a specific network port to mat...
[12:31:56] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:46] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:38:49] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[12:53:50] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[13:08:16] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:08:50] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[13:28:40] <wikibugs>	 (PS1) Ladsgroup: snapshot: Add linktarget [puppet] - https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063)
[13:37:13] <wikibugs>	 (CR) ArielGlenn: "Given the comment in https://phabricator.wikimedia.org/T305064#7818583 I am reluctant to just add the table wholesale like this. I think s" [puppet] - https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063) (owner: Ladsgroup)
[13:40:25] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (Ottomata) If you just need to view a superset dashboard, you do not need ssh access. LDAP + group membership in analytics-privatedata-users is sufficient.
[13:41:54] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[13:41:59] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[13:47:16] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 11.05 ms
[13:47:42] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1063.eqiad.wmnet with OS bullseye
[13:47:48] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1063.eqiad.wmnet with OS bullseye
[13:49:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (Andrew) Open→Resolved
[13:49:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (Andrew)
[13:53:50] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[13:54:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:54:57] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:00:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (Andrew) I just noticed that this server is still marked as 'failed' in netbox; shall I switch it back to 'active'?
[14:02:10] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
[14:05:45] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
[14:07:48] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:08] <wikibugs>	 (PS1) Andrew Bogott: Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854)
[14:09:10] <wikibugs>	 (PS1) Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854)
[14:09:58] <wikibugs>	 (CR) CI reject: [V: -1] Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854) (owner: Andrew Bogott)
[14:12:02] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:13:18] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 90%, RTA = 0.82 ms
[14:15:58] <wikibugs>	 (PS2) Andrew Bogott: Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854)
[14:16:00] <wikibugs>	 (PS2) Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854)
[14:17:42] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Error Unknown column tl_namespace in field list on query. Default database: itwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:18:32] <wikibugs>	 (PS3) Andrew Bogott: Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854)
[14:18:34] <wikibugs>	 (PS3) Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854)
[14:21:02] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:21:52] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86720.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:22:47] <wikibugs>	 (PS4) Andrew Bogott: Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854)
[14:22:49] <wikibugs>	 (PS4) Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854)
[14:23:22] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:24:13] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1063.eqiad.wmnet with OS bullseye
[14:24:20] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1063.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[14:24:34] <wikibugs>	 (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:24:36] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854) (owner: Andrew Bogott)
[14:27:04] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[14:28:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
[14:28:57] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1061.eqiad.wmnet with OS bullseye
[14:29:04] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1061.eqiad.wmnet with OS bullseye
[14:29:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
[14:40:57] <wikibugs>	 (Abandoned) Jbond: P:mediawiki::scap_client: add parameter to indicate scap master [puppet] - https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) (owner: Jbond)
[14:41:14] <wikibugs>	 SRE, Traffic, Patch-For-Review: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (ssingh)
[14:43:35] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
[14:46:18] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-tls
[14:46:18] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-be
[14:46:18] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=varnish-fe
[14:46:29] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
[14:49:23] <wikibugs>	 SRE, Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (BTullis) >>! In T304289#8031526, @Volans wrote: > Also freeipmi is installed fleetwide  Thanks @Volans - I've confirmed that this worked on an unresponsive `druid1006.mgmt`. ` sudo bmc-dev...
[14:49:57] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:51:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854) (owner: Andrew Bogott)
[14:53:41] <wikibugs>	 (PS1) Andrew Bogott: wikimediacloud.org: replace cloudcontrol2001-dev with cloudcontrol2005-dev [dns] - https://gerrit.wikimedia.org/r/822643 (https://phabricator.wikimedia.org/T306854)
[14:54:31] <wikibugs>	 (PS2) Andrew Bogott: wikimediacloud.org: replace cloudcontrol2001-dev with cloudcontrol2005-dev [dns] - https://gerrit.wikimedia.org/r/822643 (https://phabricator.wikimedia.org/T306854)
[14:56:07] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wikimediacloud.org: replace cloudcontrol2001-dev with cloudcontrol2005-dev [dns] - https://gerrit.wikimedia.org/r/822643 (https://phabricator.wikimedia.org/T306854) (owner: Andrew Bogott)
[14:59:13] <wikibugs>	 (CR) BCornwall: [C: +1] "I see that group approver is still needed on the ticket but the code/commit message looks fine to me." [puppet] - https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) (owner: Dzahn)
[15:04:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:04:15] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1061.eqiad.wmnet with OS bullseye
[15:04:21] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1061.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[15:07:12] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon1002.wikimedia.org
[15:07:14] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts netmon1002.wikimedia.org
[15:09:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:09:23] <wikibugs>	 (CR) BCornwall: [C: +2] geodns: Move eqsin, drmrs and esams around in Asia [dns] - https://gerrit.wikimedia.org/r/816053 (https://phabricator.wikimedia.org/T311472) (owner: BCornwall)
[15:09:27] <wikibugs>	 (PS4) BCornwall: geodns: Move eqsin, drmrs and esams around in Asia [dns] - https://gerrit.wikimedia.org/r/816053 (https://phabricator.wikimedia.org/T311472)
[15:11:23] <wikibugs>	 (PS1) Andrew Bogott: acme_chief: permit access to cloudcontrol2005-dev [puppet] - https://gerrit.wikimedia.org/r/822646 (https://phabricator.wikimedia.org/T306854)
[15:12:23] <wikibugs>	 (CR) Andrew Bogott: [C: +2] acme_chief: permit access to cloudcontrol2005-dev [puppet] - https://gerrit.wikimedia.org/r/822646 (https://phabricator.wikimedia.org/T306854) (owner: Andrew Bogott)
[15:13:07] <wikibugs>	 (CR) Herron: [C: +1] logstash: use logstash routing for w3creportingapi stream [puppet] - https://gerrit.wikimedia.org/r/822450 (https://phabricator.wikimedia.org/T305175) (owner: Cwhite)
[15:13:22] <wikibugs>	 (CR) Herron: [C: +1] logstash: add target index validation to tests [puppet] - https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: Cwhite)
[15:13:39] <wikibugs>	 (CR) Herron: [C: +1] logstash: update production w3creportingapi guard condition [puppet] - https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175) (owner: Cwhite)
[15:18:27] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (jhathaway) @TAndic I am happy to hop on a call with ITS to explore solutions, let me know how you want to proceed when you return.
[15:19:22] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: Add new flag [cookbooks] - https://gerrit.wikimedia.org/r/822648
[15:23:38] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Add new flag [cookbooks] - https://gerrit.wikimedia.org/r/822648 (owner: Jbond)
[15:24:29] <wikibugs>	 (PS2) Jbond: sre.hardware.upgrade-firmware: Add new flag [cookbooks] - https://gerrit.wikimedia.org/r/822648
[15:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:28:37] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.upgrade-firmware: Add new flag [cookbooks] - https://gerrit.wikimedia.org/r/822648 (owner: Jbond)
[15:31:33] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
[15:31:41] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['netmon2002.wikimedia.org']
[15:31:59] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:56] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "Whoops! despite the non-chronological naming, cloudcontrol2003-dev is actually the server in need of replacement. So, I'll submit a new pa" [dns] - https://gerrit.wikimedia.org/r/822643 (https://phabricator.wikimedia.org/T306854) (owner: Andrew Bogott)
[15:36:37] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows (mailman3 template names use colon in file names) - https://phabricator.wikimedia.org/T314698 (jhathaway) As background we don't use the upstream templates because th...
[15:37:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
[15:38:16] <wikibugs>	 (PS1) Andrew Bogott: Replace cloudcontrol2003-dev with cloudcontrol2001-dev [puppet] - https://gerrit.wikimedia.org/r/822651 (https://phabricator.wikimedia.org/T315089)
[15:39:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Replace cloudcontrol2003-dev with cloudcontrol2001-dev [puppet] - https://gerrit.wikimedia.org/r/822651 (https://phabricator.wikimedia.org/T315089) (owner: Andrew Bogott)
[15:42:07] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 on dbstore1007 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:42:40] <wikibugs>	 (CR) Krinkle: [C: +1] Remove unused config for Echo notification emails (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/820546 (https://phabricator.wikimedia.org/T314604) (owner: Bartosz Dziewoński)
[15:43:55] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1067.eqiad.wmnet with OS bullseye
[15:44:02] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1067.eqiad.wmnet with OS bullseye
[15:44:19] <wikibugs>	 (PS1) Andrew Bogott: Renumber some arbitrary cloudcontrol200x-dev settings [puppet] - https://gerrit.wikimedia.org/r/822652
[15:46:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Renumber some arbitrary cloudcontrol200x-dev settings [puppet] - https://gerrit.wikimedia.org/r/822652 (owner: Andrew Bogott)
[15:47:08] <wikibugs>	 (PS1) Andrew Bogott: wikimediacloud.org: Rearrange rabbitmq cnames for codfw1dev [dns] - https://gerrit.wikimedia.org/r/822653 (https://phabricator.wikimedia.org/T315089)
[15:48:20] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wikimediacloud.org: Rearrange rabbitmq cnames for codfw1dev [dns] - https://gerrit.wikimedia.org/r/822653 (https://phabricator.wikimedia.org/T315089) (owner: Andrew Bogott)
[15:53:27] <wikibugs>	 (PS1) Andrew Bogott: Move cloudcontrol2003-dev to role::spare [puppet] - https://gerrit.wikimedia.org/r/822654 (https://phabricator.wikimedia.org/T315089)
[15:53:51] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[15:55:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Move cloudcontrol2003-dev to role::spare [puppet] - https://gerrit.wikimedia.org/r/822654 (https://phabricator.wikimedia.org/T315089) (owner: Andrew Bogott)
[15:56:28] <wikibugs>	 SRE, Citoid, Editing-team, Patch-For-Review: Migrate citoid and zotero production services to node12 - https://phabricator.wikimedia.org/T290753 (Mvolz)
[15:57:08] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Citoid, Editing-team: Upgrade deployment-docker-citoid01 host to Buster - https://phabricator.wikimedia.org/T306049 (Mvolz)
[15:58:18] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
[15:59:11] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:03:07] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
[16:04:51] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BCornwall)
[16:05:26] <wikibugs>	 SRE, Traffic, Patch-For-Review: DRMRS: Geodns Configuration -- Phase 2 - https://phabricator.wikimedia.org/T311472 (BCornwall) In progress→Resolved Changes have been deployed for all three continents!
[16:07:29] <wikibugs>	 (PS1) BBlack: Add wikifunctions to Varnish as a 302 [puppet] - https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904)
[16:08:20] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['netmon2002.wikimedia.org']
[16:11:56] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2003-dev.wikimedia.org
[16:12:26] <wikibugs>	 (CR) Samtar: "Unsure of the CI failure, but it appears to be non-voting 😊" [mediawiki-config] - https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: Samtar)
[16:16:42] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[16:17:16] <wikibugs>	 (CR) Krinkle: [C: +1] Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247 (owner: Tim Starling)
[16:17:41] <wikibugs>	 (PS4) Krinkle: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247 (owner: Tim Starling)
[16:17:43] <wikibugs>	 SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (Andrew) cloudcontrol2005-dev and clouddb2002-dev are now in service.  I don't feel confident setting up...
[16:21:32] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:21:33] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2003-dev.wikimedia.org
[16:21:43] <wikibugs>	 (PS1) MVernon: swift: move swift ring manager repo [puppet] - https://gerrit.wikimedia.org/r/822659
[16:23:54] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudcontrol2003-dev.wikimedia.org - https://phabricator.wikimedia.org/T315089 (Andrew) a:Andrew→Papaul
[16:24:00] <wikibugs>	 (PS1) Papaul: Add netmon2002 to site.pp [puppet] - https://gerrit.wikimedia.org/r/822660 (https://phabricator.wikimedia.org/T313867)
[16:25:59] <wikibugs>	 (PS1) Andrew Bogott: Remove references to cloudcontrol2003-dev [puppet] - https://gerrit.wikimedia.org/r/822661 (https://phabricator.wikimedia.org/T315089)
[16:26:30] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1067.eqiad.wmnet with OS bullseye
[16:26:36] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1067.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[16:26:44] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:27:08] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove references to cloudcontrol2003-dev [puppet] - https://gerrit.wikimedia.org/r/822661 (https://phabricator.wikimedia.org/T315089) (owner: Andrew Bogott)
[16:30:43] <wikibugs>	 (CR) Krinkle: extension-list: Add Phonos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: Samtar)
[16:31:55] <wikibugs>	 (PS2) Papaul: Add netmon2002 to site.pp [puppet] - https://gerrit.wikimedia.org/r/822660 (https://phabricator.wikimedia.org/T313867)
[16:32:29] <wikibugs>	 (CR) Samtar: extension-list: Add Phonos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: Samtar)
[16:34:07] <wikibugs>	 (CR) Papaul: [C: +2] Add netmon2002 to site.pp [puppet] - https://gerrit.wikimedia.org/r/822660 (https://phabricator.wikimedia.org/T313867) (owner: Papaul)
[16:38:49] <wikibugs>	 SRE, ops-codfw, DC-Ops, SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (Papaul)
[16:40:27] <wikibugs>	 (PS1) Andrew Bogott: Use service names for codfw1dev rabbitmq servers [puppet] - https://gerrit.wikimedia.org/r/822662
[16:42:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye
[16:42:27] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Use service names for codfw1dev rabbitmq servers [puppet] - https://gerrit.wikimedia.org/r/822662 (owner: Andrew Bogott)
[16:42:34] <wikibugs>	 SRE, ops-codfw, DC-Ops, SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host netmon2002.wikimedia.org with OS bullseye
[16:46:21] <icinga-wm>	 PROBLEM - Check systemd state on elastic1067 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:49:57] <icinga-wm>	 RECOVERY - Check systemd state on elastic1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:57:38] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: If the system is new reboot with redfish [cookbooks] - https://gerrit.wikimedia.org/r/822665
[16:57:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:58:39] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 361 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:01:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
[17:01:32] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.upgrade-firmware: If the system is new reboot with redfish [cookbooks] - https://gerrit.wikimedia.org/r/822665 (owner: Jbond)
[17:02:53] <icinga-wm>	 PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
[17:05:37] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:06:26] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - https://gerrit.wikimedia.org/r/820749 (owner: Jaime Nuche)
[17:11:31] <wikibugs>	 (PS1) Krinkle: Remove reference to unreachable eventlogging-procesor service [mediawiki-config] - https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230)
[17:11:53] <wikibugs>	 (PS4) Krinkle: Remove references to the 'electron' service [mediawiki-config] - https://gerrit.wikimedia.org/r/634935 (owner: Giuseppe Lavagetto)
[17:13:28] <wikibugs>	 (PS2) Krinkle: Remove reference to unreachable eventlogging-processor service [mediawiki-config] - https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230)
[17:13:53] <wikibugs>	 (CR) Ahmon Dancy: [C: -1] kubernetes::mediawiki::releases: allow scap users to write releases files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/822610 (owner: Giuseppe Lavagetto)
[17:16:23] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:19:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye
[17:19:50] <wikibugs>	 SRE, ops-codfw, DC-Ops, SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host netmon2002.wikimedia.org with OS bullseye completed: - netmon2002 (*...
[17:21:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon2002.wikimedia.org
[17:21:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts netmon2002.wikimedia.org
[17:24:32] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1064.eqiad.wmnet with OS bullseye
[17:24:33] <wikibugs>	 SRE, ops-codfw, DC-Ops, SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (Papaul)
[17:24:42] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1064.eqiad.wmnet with OS bullseye
[17:25:13] <wikibugs>	 SRE, ops-codfw, DC-Ops, SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (Papaul) Open→Resolved @fgiunchedi this is complete
[17:26:19] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 55 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:26:39] <wikibugs>	 (PS3) Krinkle: Remove reference to unreachable eventlogging-processor service [mediawiki-config] - https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230)
[17:39:04] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
[17:39:50] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:42:39] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
[17:55:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:55:13] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[18:00:43] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1064.eqiad.wmnet with OS bullseye
[18:00:48] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1064.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[18:00:51] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:08:13] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1066.eqiad.wmnet with OS bullseye
[18:08:19] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1066.eqiad.wmnet with OS bullseye
[18:21:07] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:22:05] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: access request: add user demon to shell group gerrit-roots - https://phabricator.wikimedia.org/T315048 (thcipriani) >>! In T315048#8147180, @Dzahn wrote: > @thcipriani You are group approver for this shell group.  Approved!
[18:22:35] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
[18:25:32] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
[18:27:15] <wikibugs>	 (CR) Dzahn: [C: +2] admin: add demon to gerrit-root [puppet] - https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) (owner: Dzahn)
[18:27:23] <wikibugs>	 (CR) Dzahn: [C: +2] "approval was added on ticket" [puppet] - https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) (owner: Dzahn)
[18:29:39] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: access request: add user demon to shell group gerrit-roots - https://phabricator.wikimedia.org/T315048 (Dzahn) Open→Resolved a:Dzahn @demon You have shell and root again on gerrit servers. now they are `gerrit1001.wikimedia.org` and `gerrit2002.w...
[18:40:25] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:42:57] <wikibugs>	 SRE, Acme-chief, Traffic-Icebox: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (BCornwall) a:Vgutierrez @Vgutierrez since this was merged, can this ticket be closed?
[18:48:36] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1066.eqiad.wmnet with OS bullseye
[18:48:43] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1066.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[18:49:33] <icinga-wm>	 PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:52:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32371 and previous config saved to /var/cache/conftool/dbconfig/20220812-185243-ladsgroup.json
[18:52:50] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[18:54:01] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1054.eqiad.wmnet with OS bullseye
[18:54:07] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1054.eqiad.wmnet with OS bullseye
[18:54:55] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (BCornwall) @Ottomata, @soworu: In this case, shall I alter the access request to membership to analytics-privatedata-users? And if so, @Ottomata, do you approve?
[18:58:09] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (soworu) @BCornwall, If that is the case, please do as needed, subject to @Ottomata approval. Thanks.
[18:58:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
[18:58:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
[18:59:42] <wikibugs>	 (CR) Ladsgroup: [C: +1] "looks good for when it's moved." [puppet] - https://gerrit.wikimedia.org/r/822659 (owner: MVernon)
[19:07:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32372 and previous config saved to /var/cache/conftool/dbconfig/20220812-190749-ladsgroup.json
[19:09:21] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
[19:12:53] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
[19:16:19] <icinga-wm>	 RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:15] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[19:21:19] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:22:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32373 and previous config saved to /var/cache/conftool/dbconfig/20220812-192255-ladsgroup.json
[19:23:52] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[19:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:30:41] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[19:33:11] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1054.eqiad.wmnet with OS bullseye
[19:33:17] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1054.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[19:38:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32374 and previous config saved to /var/cache/conftool/dbconfig/20220812-193801-ladsgroup.json
[19:38:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[19:38:05] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[19:38:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[19:38:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32375 and previous config saved to /var/cache/conftool/dbconfig/20220812-193822-ladsgroup.json
[19:40:03] <icinga-wm>	 PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:42:07] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1048.eqiad.wmnet with OS bullseye
[19:42:14] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1048.eqiad.wmnet with OS bullseye
[19:42:32] <wikibugs>	 (PS1) BCornwall: admin: Move soworu-01 from ldap-only to analytics [puppet] - https://gerrit.wikimedia.org/r/822680 (https://phabricator.wikimedia.org/T313213)
[19:43:29] <icinga-wm>	 RECOVERY - Check systemd state on elastic1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:49:05] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "yes, this all looks good to me. user needs to be upgraded from ldap_only to shell section but does not need actual shell.. so no SSH key. " [puppet] - https://gerrit.wikimedia.org/r/822680 (https://phabricator.wikimedia.org/T313213) (owner: BCornwall)
[19:53:15] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
[19:55:55] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
[20:12:03] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1048.eqiad.wmnet with OS bullseye
[20:12:09] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1048.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[20:23:33] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (BCornwall)
[20:23:53] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[20:24:48] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1053.eqiad.wmnet with OS bullseye
[20:24:54] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1053.eqiad.wmnet with OS bullseye
[20:33:24] <wikibugs>	 (PS1) Andrew Bogott: profile::openstack::codfw1dev::db: install wmf-mariadb104 [puppet] - https://gerrit.wikimedia.org/r/822683 (https://phabricator.wikimedia.org/T310795)
[20:36:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] profile::openstack::codfw1dev::db: install wmf-mariadb104 [puppet] - https://gerrit.wikimedia.org/r/822683 (https://phabricator.wikimedia.org/T310795) (owner: Andrew Bogott)
[20:39:38] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
[20:39:40] <wikibugs>	 (PS1) Andrew Bogott: profile::openstack::codfw1dev::db: update ref to wmf-mariadb104 [puppet] - https://gerrit.wikimedia.org/r/822685 (https://phabricator.wikimedia.org/T310795)
[20:42:33] <wikibugs>	 (CR) Andrew Bogott: [C: +2] profile::openstack::codfw1dev::db: update ref to wmf-mariadb104 [puppet] - https://gerrit.wikimedia.org/r/822685 (https://phabricator.wikimedia.org/T310795) (owner: Andrew Bogott)
[20:42:59] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
[20:50:09] <wikibugs>	 (CR) Dzahn: "Unable to execute query for alias dse-k8s: Unexpected boolean operator 'or' with hosts ''" [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis)
[20:50:57] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye
[20:52:19] <wikibugs>	 (CR) Dzahn: Add roles and cumin aliases for the new dse_k8s cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis)
[21:04:42] <wikibugs>	 (PS1) Sergio Gimeno: Declare mediawiki.createaccount_blocked_user schema [mediawiki-config] - https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018)
[21:06:03] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1053.eqiad.wmnet with OS bullseye
[21:06:10] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1053.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[21:06:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
[21:10:07] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
[21:12:41] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1071.eqiad.wmnet with OS bullseye
[21:12:50] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1071.eqiad.wmnet with OS bullseye
[21:13:53] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[21:21:37] <icinga-wm>	 PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:23:58] <wikibugs>	 SRE, Community-Tech, Data-Persistence (Consultation), MediaWiki-extensions-Phonos, serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (MusikAnimal) Thanks, all! I've created {T315119} and have already starte...
[21:25:03] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
[21:27:47] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
[21:28:00] <wikibugs>	 (PS1) Brennen Bearnes: scap: add permission mangling, reorder checks [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953)
[21:45:12] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb2002-dev.codfw.wmnet with OS bullseye
[21:47:33] <icinga-wm>	 RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:48:54] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1071.eqiad.wmnet with OS bullseye
[21:49:01] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1071.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[21:52:21] <icinga-wm>	 PROBLEM - puppet last run on wcqs2003 is CRITICAL: CRITICAL: Puppet has been disabled for 604901 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:52:39] <icinga-wm>	 PROBLEM - puppet last run on wcqs2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604919 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:53:31] <icinga-wm>	 PROBLEM - puppet last run on wcqs2001 is CRITICAL: CRITICAL: Puppet has been disabled for 604971 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:55:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:55:09] <icinga-wm>	 PROBLEM - puppet last run on wcqs1001 is CRITICAL: CRITICAL: Puppet has been disabled for 605069 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:55:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[21:55:35] <icinga-wm>	 PROBLEM - puppet last run on wcqs1003 is CRITICAL: CRITICAL: Puppet has been disabled for 605095 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:55:35] <icinga-wm>	 PROBLEM - puppet last run on wcqs1002 is CRITICAL: CRITICAL: Puppet has been disabled for 605095 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:55:55] <icinga-wm>	 PROBLEM - Check systemd state on elastic1071 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:57:55] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (BCornwall) Stalled→In progress
[21:58:08] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (BCornwall) a:soworu→Ottomata
[22:00:31] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (BCornwall) a:MRaishWMF→odimitrijevic
[22:01:24] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:11] <wikibugs>	 (CR) BCornwall: admin: Add SSH key to mraish user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[22:14:00] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[22:14:04] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[22:15:28] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[22:17:48] <wikibugs>	 (CR) Dzahn: Add systemd timer to run scap stage-train on Tuesday morning (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: Ahmon Dancy)
[22:19:10] <icinga-wm>	 RECOVERY - Check systemd state on elastic1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:35] <wikibugs>	 (CR) Dzahn: "looks mostly good. a nitpick inline about the $realm check though. also https://puppet-compiler.wmflabs.org/pcc-worker1002/36727/ and do y" [puppet] - https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: Ahmon Dancy)
[22:24:06] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:27:24] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms
[22:31:48] <wikibugs>	 (CR) Dzahn: "compiled this in the puppet compiler and host list: 'C:spamassassin'. so it's used by lists,mx,otrs and tools-mail. the result surprised m" [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[22:33:17] <wikibugs>	 (CR) Dzahn: "something is odd with the compiler. there is not even output at https://puppet-compiler.wmflabs.org/pcc-worker1001/36728/otrs1001.eqiad.wm" [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[22:39:29] <wikibugs>	 (CR) Dzahn: [C: -1] "systemd::timer::job does not have a parameter called 'owner'," [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[22:50:23] <wikibugs>	 (CR) Dzahn: define osm::planet_sync move from cron to systemd timers. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[23:00:11] <wikibugs>	 (CR) Dzahn: [C: +1] "seems good. some nitpicks/comments inline, compiles like this: https://puppet-compiler.wmflabs.org/pcc-worker1001/36730/maps1009.eqiad.wmn" [puppet] - https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[23:07:22] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:28] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[23:14:56] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.81 ms
[23:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:28:26] <jem>	 Krinkle and mutante: I've just created T315121 about the statistics problem in new wikis, as you asked about two weeks ago, with both of you as subscribers... thanks in advance
[23:28:26] <stashbot>	 T315121: After new wikis are created/imported from Incubator, statistics should be updated - https://phabricator.wikimedia.org/T315121
[23:38:41] <mutante>	 !log [mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.timer T315121
[23:38:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:45] <stashbot>	 T315121: After new wikis are created/imported from Incubator, statistics should be updated - https://phabricator.wikimedia.org/T315121
[23:41:22] <mutante>	 !log wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg T315121
[23:41:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:35] <mutante>	 jem: please reload for comments and ..stats should be fixed?^ :)
[23:41:52] <mutante>	 one action in prod and one in cloud
[23:42:42] <mutante>	 blk:  total="3377",good="804",edits="15316",users="203",activeusers="29",admins="0",images="0"
[23:42:58] <mutante>	 kcg: total="1994",good="452",edits="15911",users="420",activeusers="16",admins="1",images="0"
[23:43:10] <mutante>	 blk needs an admin I suppose