[00:05:08] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:50] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32365 and previous config saved to /var/cache/conftool/dbconfig/20220812-001715-ladsgroup.json
[00:17:20] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[00:17:30] SRE, ops-eqiad, Infrastructure-Foundations, netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (cmooney) p:Triage→High
[00:21:36] SRE, Infrastructure-Foundations, netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) p:Triage→Medium
[00:32:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32366 and previous config saved to /var/cache/conftool/dbconfig/20220812-003221-ladsgroup.json
[00:33:53] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[00:36:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:18] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:39:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:40:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:42:40] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:46] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32367 and previous config saved to /var/cache/conftool/dbconfig/20220812-004727-ladsgroup.json
[00:53:00] SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (tstarling) I grepped for rsvg in exec.log and found nothing, going back to May, so it looks like T260504 is sufficiently complete that we don't have to upgrade librsvg on the...
[00:54:18] (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:56:34] SRE, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (ori) We got alerts about the Beta Cluster cert being close to expiry...
[01:02:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32368 and previous config saved to /var/cache/conftool/dbconfig/20220812-010233-ladsgroup.json
[01:02:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[01:02:39] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[01:02:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[01:02:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:03:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:03:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32369 and previous config saved to /var/cache/conftool/dbconfig/20220812-010312-ladsgroup.json
[01:35:43] SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (cmooney) a:cmooney Thanks @ayounsi > One surprising point though is that the path through the...
[01:35:56] SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (cmooney)
[01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:50] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:03:00] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:54] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:49] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[02:13:46] (Emergency syslog message) firing: (2) Alert for device lsw1-f2-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[02:17:18] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:46] (Emergency syslog message) resolved: (2) Device lsw1-f2-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:48] SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (cmooney) I ended up issuing this command: ` request app-engine service restart packet-forwarding-engin...
[02:43:08] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:53:28] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:48] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:26] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:44] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:10] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:22] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:44] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:52] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:10] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:30:52] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:14] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:58:56] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:18] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:42] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:06] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:22] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:46] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:44] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:36] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:46] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:38] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:58] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:54] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:44] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:14] SRE-OnFire, Wikidata, Wikidata-Query-Service, wdwb-tech, and 4 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (Joe) Open→Stalled a:Joe→None Hi, any news on this front? I'll release this bug as its completion doesn't dep...
[05:16:02] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:54] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:04] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:10] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:16] !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=elastic110.*
[05:59:58] SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (RKemper) (Following is just related to bringing these hosts back into service) Pooled the hosts: ` r...
[06:01:19] !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=elastic10[8-9][0-9].*
[06:02:30] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:28:18] SRE, SRE-OnFire, Observability-Alerting, Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (Joe) >>! In T314840#8144986, @fgiunchedi wrote: > Thank you for vopsbot, looks really good and useful! > > A perhaps silly/minor thing: I think we should be using `-` ins...
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220812T0700)
[07:01:54] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:44] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:14:22] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:00] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:44] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:44] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:27:42] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:02] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:44] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:04] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (dcaro) @Cmjohnson Hi! While trying to setup the first of the hosts here, we noticed that it had only 7 1.8T non-os hard drives, but in t...
[08:15:06] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:06] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:46] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:06] (CR) Jbond: [C: -1] "please check with" [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[08:52:22] (CR) Jbond: [C: +1] "LGTm assuming odimitrijevic re-approves on task" [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[08:59:30] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:50] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:09] SRE, SRE-OnFire, Observability-Alerting, Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (Joe)
[09:14:08] (PS7) Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - https://gerrit.wikimedia.org/r/820749
[09:20:17] (PS1) Giuseppe Lavagetto: kubernetes::mediawiki::releases: allow scap users to write releases files [puppet] - https://gerrit.wikimedia.org/r/822610
[09:22:13] (CR) Vgutierrez: [C: +1] "LGTM. I'll deploy this on Tuesday (bank holiday on Monday)" [puppet] - https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) (owner: Ori)
[09:46:08] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36717/console" [puppet] - https://gerrit.wikimedia.org/r/822610 (owner: Giuseppe Lavagetto)
[09:49:56] SRE, Traffic, Upstream: metric discrepancies between ATS 9.x and ATS 8.x - https://phabricator.wikimedia.org/T315064 (Vgutierrez)
[09:59:05] (CR) Giuseppe Lavagetto: [V: +1] "I think the patch does what we want it to, but I'd wait for you to be around so we can run some tests." [puppet] - https://gerrit.wikimedia.org/r/822610 (owner: Giuseppe Lavagetto)
[10:04:57] SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (soworu) SSH key: SHA256:9+15cZ0xKHi7PqAzF0LR1NXfsD5ex8PbiojwKfqoLSk soworu@wmf2559 @Vgutierrez Just analytics if fine. I need it to view the extent of use of the plugin. @O...
[10:07:50] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:09:42] SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) > Netbox drives the infrastructure, and not the other way around. Fully agree that's best. But unfortunate...
[10:12:27] SRE, Infrastructure-Foundations, netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) Having thought about it in more detail I think it's best to keep the multihop for the iBGP EVPN sessions. Reason being that even if a Leaf loses a Spine lin...
[10:13:25] SRE, Infrastructure-Foundations, netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) @ayounsi be interested if you've any thoughts on that.
[10:19:20] (PS3) Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:20:04] (CR) CI reject: [V: -1] Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:22:23] (PS4) Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:23:14] (CR) CI reject: [V: -1] Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:24:04] (PS5) Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:28:49] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[10:39:12] (CR) Jcrespo: "Hey, some comments here- most are actually my fault for the initial commit (copy & paste). Let me know what you think of the others. Some," [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[10:41:44] (CR) Jcrespo: "addendum for the latest update." [software/wmfbackups] - https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[11:08:50] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[11:15:54] (PS8) Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - https://gerrit.wikimedia.org/r/820749
[11:20:22] (PS9) Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - https://gerrit.wikimedia.org/r/820749
[11:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:46:29] (CR) Jaime Nuche: "Incorporated the changes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/807510" [puppet] - https://gerrit.wikimedia.org/r/820749 (owner: Jaime Nuche)
[11:47:50] (CR) Jaime Nuche: "I ended up adding these changes here https://gerrit.wikimedia.org/r/c/operations/puppet/+/820749" [puppet] - https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) (owner: Jbond)
[12:07:58] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:48] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[12:18:01] SRE, Infrastructure-Foundations, netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (ayounsi) yeah I agree +1 on having a stable iBGP capable of handling link failure. The OSPF adjacency check should be used but IIRC it assumes there are as many v4 s...
[12:21:00] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 90%, RTA = 1622.02 ms
[12:22:32] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:23] SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi) > Would a cookbook be an idea possibly? That we could run ourselves to update a specific network port to mat...
[12:31:56] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:46] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:38:49] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[12:53:50] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[13:08:16] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:08:50] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[13:28:40] (PS1) Ladsgroup: snapshot: Add linktarget [puppet] - https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063)
[13:37:13] (CR) ArielGlenn: "Given the comment in https://phabricator.wikimedia.org/T305064#7818583 I am reluctant to just add the table wholesale like this. I think s" [puppet] - https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063) (owner: Ladsgroup)
[13:40:25] SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (Ottomata) If you just need to view a superset dashboard, you do not need ssh access. LDAP + group membership in analytics-privatedata-users is sufficient.
[13:41:54] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[13:41:59] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[13:47:16] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 11.05 ms
[13:47:42] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1063.eqiad.wmnet with OS bullseye
[13:47:48] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1063.eqiad.wmnet with OS bullseye
[13:49:02] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (Andrew) Open→Resolved
[13:49:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (Andrew)
[13:53:50] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[13:54:45] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:54:57] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:00:31] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (Andrew) I just noticed that this server is still marked as 'failed' in netbox; shall I switch it back to 'active'?
[14:02:10] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
[14:05:45] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
[14:07:48] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:08] (PS1) Andrew Bogott: Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854)
[14:09:10] (PS1) Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854)
[14:09:58] (CR) CI reject: [V: -1] Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854) (owner: Andrew Bogott)
[14:12:02] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:13:18] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 90%, RTA = 0.82 ms
[14:15:58] (PS2) Andrew Bogott: Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854)
[14:16:00] (PS2) Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854)
[14:17:42] PROBLEM - MariaDB Replica SQL: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Error Unknown column tl_namespace in field list on query. Default database: itwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:18:32] (PS3) Andrew Bogott: Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854)
[14:18:34] (PS3) Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854)
[14:21:02] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:21:52] PROBLEM - MariaDB Replica Lag: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86720.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:22:47] (PS4) Andrew Bogott: Make cloudcontrol2005-dev a cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854)
[14:22:49] (PS4) Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854)
[14:23:22] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:24:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1063.eqiad.wmnet with OS bullseye
[14:24:20] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1063.eqiad.wmnet with OS bullseye completed: - elastic1063 (...
[14:24:34] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:24:36] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudcontrol2005-dev a cloudcontrol node [puppet] - 10https://gerrit.wikimedia.org/r/822637 (https://phabricator.wikimedia.org/T306854) (owner: 10Andrew Bogott) [14:27:04] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [14:28:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint [14:28:57] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1061.eqiad.wmnet with OS bullseye [14:29:04] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1061.eqiad.wmnet with OS bullseye [14:29:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint [14:40:57] (03Abandoned) 10Jbond: P:mediawiki::scap_client: add parameter to indicate scap master [puppet] - 10https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) (owner: 10Jbond) [14:41:14] 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ssingh) [14:43:35] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage [14:46:18] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-tls [14:46:18] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-be [14:46:18] !log 
sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=varnish-fe [14:46:29] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage [14:49:23] 10SRE, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10BTullis) >>! In T304289#8031526, @Volans wrote: > Also freeipmi is installed fleetwide Thanks @Volans - I've confirmed that this worked on an unresponsive `druid1006.mgmt`. ` sudo bmc-dev... [14:49:57] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:51:19] (03CR) 10Andrew Bogott: [C: 03+2] Replace cloudcontrol2001-dev with cloudcontrol2005-dev. [puppet] - 10https://gerrit.wikimedia.org/r/822638 (https://phabricator.wikimedia.org/T306854) (owner: 10Andrew Bogott) [14:53:41] (03PS1) 10Andrew Bogott: wikimediacloud.org: replace cloudcontrol2001-dev with cloudcontrol2005-dev [dns] - 10https://gerrit.wikimedia.org/r/822643 (https://phabricator.wikimedia.org/T306854) [14:54:31] (03PS2) 10Andrew Bogott: wikimediacloud.org: replace cloudcontrol2001-dev with cloudcontrol2005-dev [dns] - 10https://gerrit.wikimedia.org/r/822643 (https://phabricator.wikimedia.org/T306854) [14:56:07] (03CR) 10Andrew Bogott: [C: 03+2] wikimediacloud.org: replace cloudcontrol2001-dev with cloudcontrol2005-dev [dns] - 10https://gerrit.wikimedia.org/r/822643 (https://phabricator.wikimedia.org/T306854) (owner: 10Andrew Bogott) [14:59:13] (03CR) 10BCornwall: [C: 03+1] "I see that group approver is still needed on the ticket but the code/commit message looks fine to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) (owner: 10Dzahn) [15:04:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:15] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1061.eqiad.wmnet with OS bullseye [15:04:21] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1061.eqiad.wmnet with OS bullseye completed: - elastic1063 (... [15:07:12] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon1002.wikimedia.org [15:07:14] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts netmon1002.wikimedia.org [15:09:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:23] (03CR) 10BCornwall: [C: 03+2] geodns: Move eqsin, drmrs and esams around in Asia [dns] - 10https://gerrit.wikimedia.org/r/816053 (https://phabricator.wikimedia.org/T311472) (owner: 10BCornwall) [15:09:27] (03PS4) 10BCornwall: geodns: Move eqsin, drmrs and esams around in Asia [dns] - 10https://gerrit.wikimedia.org/r/816053 (https://phabricator.wikimedia.org/T311472) [15:11:23] (03PS1) 10Andrew Bogott: acme_chief: permit access to cloudcontrol2005-dev 
[puppet] - 10https://gerrit.wikimedia.org/r/822646 (https://phabricator.wikimedia.org/T306854) [15:12:23] (03CR) 10Andrew Bogott: [C: 03+2] acme_chief: permit access to cloudcontrol2005-dev [puppet] - 10https://gerrit.wikimedia.org/r/822646 (https://phabricator.wikimedia.org/T306854) (owner: 10Andrew Bogott) [15:13:07] (03CR) 10Herron: [C: 03+1] logstash: use logstash routing for w3creportingapi stream [puppet] - 10https://gerrit.wikimedia.org/r/822450 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [15:13:22] (03CR) 10Herron: [C: 03+1] logstash: add target index validation to tests [puppet] - 10https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [15:13:39] (03CR) 10Herron: [C: 03+1] logstash: update production w3creportingapi guard condition [puppet] - 10https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [15:18:27] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @TAndic I am happy to hop on a call with ITS to explore solutions, let me know how you want to proceed when you return. 
[15:19:22] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: Add new flag [cookbooks] - 10https://gerrit.wikimedia.org/r/822648 [15:23:38] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add new flag [cookbooks] - 10https://gerrit.wikimedia.org/r/822648 (owner: 10Jbond) [15:24:29] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: Add new flag [cookbooks] - 10https://gerrit.wikimedia.org/r/822648 [15:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:28:37] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: Add new flag [cookbooks] - 10https://gerrit.wikimedia.org/r/822648 (owner: 10Jbond) [15:31:33] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org'] [15:31:41] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['netmon2002.wikimedia.org'] [15:31:59] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:56] (03CR) 10Andrew Bogott: [C: 03+2] "Whoops! despite the non-chronological naming, cloudcontrol2003-dev is actually the server in need of replacement. So, I'll submit a new pa" [dns] - 10https://gerrit.wikimedia.org/r/822643 (https://phabricator.wikimedia.org/T306854) (owner: 10Andrew Bogott) [15:36:37] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows (mailman3 template names use colon in file names) - https://phabricator.wikimedia.org/T314698 (10jhathaway) As background we don't use the upstream templates because th... 
[15:37:36] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org'] [15:38:16] (03PS1) 10Andrew Bogott: Replace cloudcontrol2003-dev with cloudcontrol2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/822651 (https://phabricator.wikimedia.org/T315089) [15:39:34] (03CR) 10Andrew Bogott: [C: 03+2] Replace cloudcontrol2003-dev with cloudcontrol2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/822651 (https://phabricator.wikimedia.org/T315089) (owner: 10Andrew Bogott) [15:42:07] RECOVERY - MariaDB Replica SQL: s2 on dbstore1007 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:42:40] (03CR) 10Krinkle: [C: 03+1] Remove unused config for Echo notification emails (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820546 (https://phabricator.wikimedia.org/T314604) (owner: 10Bartosz Dziewoński) [15:43:55] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1067.eqiad.wmnet with OS bullseye [15:44:02] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1067.eqiad.wmnet with OS bullseye [15:44:19] (03PS1) 10Andrew Bogott: Renumber some arbitrary cloudcontrol200x-dev settings [puppet] - 10https://gerrit.wikimedia.org/r/822652 [15:46:34] (03CR) 10Andrew Bogott: [C: 03+2] Renumber some arbitrary cloudcontrol200x-dev settings [puppet] - 10https://gerrit.wikimedia.org/r/822652 (owner: 10Andrew Bogott) [15:47:08] (03PS1) 10Andrew Bogott: wikimediacloud.org: Rearrange rabbitmq cnames for codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/822653 (https://phabricator.wikimedia.org/T315089) [15:48:20] (03CR) 10Andrew Bogott: [C: 03+2] wikimediacloud.org: Rearrange rabbitmq cnames for 
codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/822653 (https://phabricator.wikimedia.org/T315089) (owner: 10Andrew Bogott) [15:53:27] (03PS1) 10Andrew Bogott: Move cloudcontrol2003-dev to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/822654 (https://phabricator.wikimedia.org/T315089) [15:53:51] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [15:55:19] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudcontrol2003-dev to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/822654 (https://phabricator.wikimedia.org/T315089) (owner: 10Andrew Bogott) [15:56:28] 10SRE, 10Citoid, 10Editing-team, 10Patch-For-Review: Migrate citoid and zotero production services to node12 - https://phabricator.wikimedia.org/T290753 (10Mvolz) [15:57:08] 10SRE, 10Beta-Cluster-Infrastructure, 10Citoid, 10Editing-team: Upgrade deployment-docker-citoid01 host to Buster - https://phabricator.wikimedia.org/T306049 (10Mvolz) [15:58:18] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage [15:59:11] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:03:07] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage [16:04:51] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BCornwall) [16:05:26] 10SRE, 10Traffic, 10Patch-For-Review: DRMRS: Geodns Configuration -- Phase 2 - https://phabricator.wikimedia.org/T311472 (10BCornwall) 05In progress→03Resolved Changes have been deployed for all three continents! 
[16:07:29] (03PS1) 10BBlack: Add wikifunctions to Varnish as a 302 [puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904) [16:08:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['netmon2002.wikimedia.org'] [16:11:56] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2003-dev.wikimedia.org [16:12:26] (03CR) 10Samtar: "Unsure of the CI failure, but it appears to be non-voting 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [16:16:42] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [16:17:16] (03CR) 10Krinkle: [C: 03+1] Explicitly declare replaceable settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820247 (owner: 10Tim Starling) [16:17:41] (03PS4) 10Krinkle: Explicitly declare replaceable settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820247 (owner: 10Tim Starling) [16:17:43] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Andrew) cloudcontrol2005-dev and clouddb2002-dev are now in service. I don't feel confident setting up... 
[16:21:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:21:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2003-dev.wikimedia.org [16:21:43] (03PS1) 10MVernon: swift: move swift ring manager repo [puppet] - 10https://gerrit.wikimedia.org/r/822659 [16:23:54] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudcontrol2003-dev.wikimedia.org - https://phabricator.wikimedia.org/T315089 (10Andrew) a:05Andrew→03Papaul [16:24:00] (03PS1) 10Papaul: Add netmon2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/822660 (https://phabricator.wikimedia.org/T313867) [16:25:59] (03PS1) 10Andrew Bogott: Remove references to cloudcontrol2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/822661 (https://phabricator.wikimedia.org/T315089) [16:26:30] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1067.eqiad.wmnet with OS bullseye [16:26:36] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1067.eqiad.wmnet with OS bullseye completed: - elastic1063 (... 
[16:26:44] RECOVERY - MariaDB Replica Lag: s2 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:27:08] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudcontrol2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/822661 (https://phabricator.wikimedia.org/T315089) (owner: 10Andrew Bogott) [16:30:43] (03CR) 10Krinkle: extension-list: Add Phonos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [16:31:55] (03PS2) 10Papaul: Add netmon2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/822660 (https://phabricator.wikimedia.org/T313867) [16:32:29] (03CR) 10Samtar: extension-list: Add Phonos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [16:34:07] (03CR) 10Papaul: [C: 03+2] Add netmon2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/822660 (https://phabricator.wikimedia.org/T313867) (owner: 10Papaul) [16:38:49] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10Papaul) [16:40:27] (03PS1) 10Andrew Bogott: Use service names for codfw1dev rabbitmq servers [puppet] - 10https://gerrit.wikimedia.org/r/822662 [16:42:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye [16:42:27] (03CR) 10Andrew Bogott: [C: 03+2] Use service names for codfw1dev rabbitmq servers [puppet] - 10https://gerrit.wikimedia.org/r/822662 (owner: 10Andrew Bogott) [16:42:34] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host netmon2002.wikimedia.org with OS 
bullseye [16:46:21] PROBLEM - Check systemd state on elastic1067 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:57] RECOVERY - Check systemd state on elastic1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:37] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:57:38] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: If the system is new reboot with redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/822665 [16:57:45] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:58:39] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 361 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:01:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage [17:01:32] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: If the system is new reboot with redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/822665 (owner: 10Jbond) [17:02:53] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:41] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage [17:05:37] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:06:26] (03CR) 10Ahmon Dancy: [C: 03+1] scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [17:11:31] (03PS1) 10Krinkle: Remove reference to unreachable eventlogging-procesor service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) [17:11:53] (03PS4) 10Krinkle: Remove references to the 'electron' service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634935 (owner: 10Giuseppe Lavagetto) [17:13:28] (03PS2) 10Krinkle: Remove reference to unreachable eventlogging-processor service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) [17:13:53] (03CR) 10Ahmon Dancy: [C: 04-1] kubernetes::mediawiki::releases: allow scap users to write releases files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/822610 (owner: 10Giuseppe Lavagetto) [17:16:23] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:19:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye [17:19:50] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by pt1979@cumin2002 for host netmon2002.wikimedia.org with OS bullseye completed: - netmon2002 (*... [17:21:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon2002.wikimedia.org [17:21:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts netmon2002.wikimedia.org [17:24:32] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1064.eqiad.wmnet with OS bullseye [17:24:33] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10Papaul) [17:24:42] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1064.eqiad.wmnet with OS bullseye [17:25:13] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10Papaul) 05Open→03Resolved @fgiunchedi this is complete [17:26:19] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 55 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:26:39] (03PS3) 10Krinkle: Remove reference to unreachable eventlogging-processor service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) [17:39:04] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage [17:39:50] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[17:42:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage [17:55:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [18:00:43] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1064.eqiad.wmnet with OS bullseye [18:00:48] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1064.eqiad.wmnet with OS bullseye completed: - elastic1063 (... 
[18:00:51] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:08:13] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1066.eqiad.wmnet with OS bullseye [18:08:19] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1066.eqiad.wmnet with OS bullseye [18:21:07] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:22:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: access request: add user demon to shell group gerrit-roots - https://phabricator.wikimedia.org/T315048 (10thcipriani) >>! In T315048#8147180, @Dzahn wrote: > @thcipriani You are group approver for this shell group. Approved! [18:22:35] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage [18:25:32] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage [18:27:15] (03CR) 10Dzahn: [C: 03+2] admin: add demon to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) (owner: 10Dzahn) [18:27:23] (03CR) 10Dzahn: [C: 03+2] "approval was added on ticket" [puppet] - 10https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) (owner: 10Dzahn) [18:29:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: access request: add user demon to shell group gerrit-roots - https://phabricator.wikimedia.org/T315048 (10Dzahn) 05Open→03Resolved a:03Dzahn @demon You have shell and root again on gerrit servers. 
now they are`gerrit1001.wikimedia.org` and `gerrit2002.w... [18:40:25] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:42:57] 10SRE, 10Acme-chief, 10Traffic-Icebox: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (10BCornwall) a:03Vgutierrez @Vgutierrez since this was merged, can this ticket be closed? [18:48:36] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1066.eqiad.wmnet with OS bullseye [18:48:43] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1066.eqiad.wmnet with OS bullseye completed: - elastic1063 (... [18:49:33] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32371 and previous config saved to /var/cache/conftool/dbconfig/20220812-185243-ladsgroup.json [18:52:50] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [18:54:01] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1054.eqiad.wmnet with OS bullseye [18:54:07] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1054.eqiad.wmnet with OS bullseye [18:54:55] 10SRE, 10SRE-Access-Requests: Grant Access to 
analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) @Ottomata, @soworu: In this case, shall I alter the access request to membership to analytics-privatedata-users? And if so, @Ottomata, do you approve? [18:58:09] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10soworu) @BCornwall, If that is the case, please do the as needed, subject to @Ottomata approval. Thanks. [18:58:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint [18:58:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint [18:59:42] (03CR) 10Ladsgroup: [C: 03+1] "looks good for when it's moved." [puppet] - 10https://gerrit.wikimedia.org/r/822659 (owner: 10MVernon) [19:07:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32372 and previous config saved to /var/cache/conftool/dbconfig/20220812-190749-ladsgroup.json [19:09:21] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage [19:12:53] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage [19:16:19] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:15] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:21:19] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:22:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32373 and previous config saved to /var/cache/conftool/dbconfig/20220812-192255-ladsgroup.json [19:23:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [19:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:30:41] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:33:11] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1054.eqiad.wmnet with OS bullseye [19:33:17] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1054.eqiad.wmnet with OS bullseye completed: - elastic1063 (... 
[19:38:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32374 and previous config saved to /var/cache/conftool/dbconfig/20220812-193801-ladsgroup.json [19:38:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:38:05] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [19:38:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:38:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32375 and previous config saved to /var/cache/conftool/dbconfig/20220812-193822-ladsgroup.json [19:40:03] PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:07] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1048.eqiad.wmnet with OS bullseye [19:42:14] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1048.eqiad.wmnet with OS bullseye [19:42:32] (03PS1) 10BCornwall: admin: Move soworu-01 from ldap-only to analytics [puppet] - 10https://gerrit.wikimedia.org/r/822680 (https://phabricator.wikimedia.org/T313213) [19:43:29] RECOVERY - Check systemd state on elastic1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:05] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "yes, this all looks good to me. 
user needs to be upgraded from ldap_only to shell section but does not need actual shell.. so no SSH key. " [puppet] - 10https://gerrit.wikimedia.org/r/822680 (https://phabricator.wikimedia.org/T313213) (owner: 10BCornwall) [19:53:15] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage [19:55:55] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage [20:12:03] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1048.eqiad.wmnet with OS bullseye [20:12:09] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1048.eqiad.wmnet with OS bullseye completed: - elastic1063 (... [20:23:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) [20:23:53] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [20:24:48] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1053.eqiad.wmnet with OS bullseye [20:24:54] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1053.eqiad.wmnet with OS bullseye [20:33:24] (03PS1) 10Andrew Bogott: profile::openstack::codfw1dev::db: install wmf-mariadb104 [puppet] - 10https://gerrit.wikimedia.org/r/822683 (https://phabricator.wikimedia.org/T310795) [20:36:41] (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::codfw1dev::db: install wmf-mariadb104 [puppet] - 
10https://gerrit.wikimedia.org/r/822683 (https://phabricator.wikimedia.org/T310795) (owner: 10Andrew Bogott) [20:39:38] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage [20:39:40] (03PS1) 10Andrew Bogott: profile::openstack::codfw1dev::db: update ref to wmf-mariadb104 [puppet] - 10https://gerrit.wikimedia.org/r/822685 (https://phabricator.wikimedia.org/T310795) [20:42:33] (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::codfw1dev::db: update ref to wmf-mariadb104 [puppet] - 10https://gerrit.wikimedia.org/r/822685 (https://phabricator.wikimedia.org/T310795) (owner: 10Andrew Bogott) [20:42:59] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage [20:50:09] (03CR) 10Dzahn: "Unable to execute query for alias dse-k8s: Unexpected boolean operator 'or' with hosts ''" [puppet] - 10https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [20:50:57] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye [20:52:19] (03CR) 10Dzahn: Add roles and cumin aliases for the new dse_k8s cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [21:04:42] (03PS1) 10Sergio Gimeno: Declare mediawiki.createaccount_blocked_user schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) [21:06:03] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1053.eqiad.wmnet with OS bullseye [21:06:10] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host 
elastic1053.eqiad.wmnet with OS bullseye completed: - elastic1063 (... [21:06:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage [21:10:07] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage [21:12:41] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1071.eqiad.wmnet with OS bullseye [21:12:50] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1071.eqiad.wmnet with OS bullseye [21:13:53] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [21:21:37] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:58] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10MusikAnimal) Thanks, all! I've created {T315119} and have already starte... 
[21:25:03] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage [21:27:47] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage [21:28:00] (03PS1) 10Brennen Bearnes: scap: add permission mangling, reorder checks [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) [21:45:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb2002-dev.codfw.wmnet with OS bullseye [21:47:33] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:48:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1071.eqiad.wmnet with OS bullseye [21:49:01] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1071.eqiad.wmnet with OS bullseye completed: - elastic1063 (... 
[21:52:21] PROBLEM - puppet last run on wcqs2003 is CRITICAL: CRITICAL: Puppet has been disabled for 604901 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:52:39] PROBLEM - puppet last run on wcqs2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604919 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:53:31] PROBLEM - puppet last run on wcqs2001 is CRITICAL: CRITICAL: Puppet has been disabled for 604971 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:55:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:55:09] PROBLEM - puppet last run on wcqs1001 is CRITICAL: CRITICAL: Puppet has been disabled for 605069 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:55:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [21:55:35] PROBLEM - puppet last run on wcqs1003 is CRITICAL: CRITICAL: Puppet has been disabled for 605095 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:55:35] PROBLEM - puppet last run on wcqs1002 is CRITICAL: CRITICAL: Puppet has been disabled for 605095 seconds, message: ebernhardson - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:55:55] PROBLEM - Check systemd state on elastic1071 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) 05Stalled→03In progress [21:58:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) a:05soworu→03Ottomata [22:00:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10BCornwall) a:05MRaishWMF→03odimitrijevic [22:01:24] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:11] (03CR) 10BCornwall: admin: Add SSH key to mraish user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: 10Vgutierrez) [22:14:00] !log ryankemper@cumin1001 END (ERROR) - Cookbook 
sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [22:14:04] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [22:15:28] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:17:48] (03CR) 10Dzahn: Add systemd timer to run scap stage-train on Tuesday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [22:19:10] RECOVERY - Check systemd state on elastic1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:35] (03CR) 10Dzahn: "looks mostly good. a nitpick inline about the $realm check though. also https://puppet-compiler.wmflabs.org/pcc-worker1002/36727/ and do y" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [22:24:06] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:27:24] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [22:31:48] (03CR) 10Dzahn: "compiled this in the puppet compiler and host list: 'C:spamassassin'. so it's used by lists,mx,otrs and tools-mail. the result surprised m" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [22:33:17] (03CR) 10Dzahn: "something is odd with the compiler. 
there is not even output at https://puppet-compiler.wmflabs.org/pcc-worker1001/36728/otrs1001.eqiad.wm" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [22:39:29] (03CR) 10Dzahn: [C: 04-1] "systemd::timer::job does not have a parameter called 'owner'," [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [22:50:23] (03CR) 10Dzahn: define osm::planet_sync move from cron to systemd timers. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [23:00:11] (03CR) 10Dzahn: [C: 03+1] "seems good. some nitpicks/comments inline, compiles like this: https://puppet-compiler.wmflabs.org/pcc-worker1001/36730/maps1009.eqiad.wmn" [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [23:07:22] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:28] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:14:56] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.81 ms [23:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:28:26] Krinkle and mutante: I've just created T315121 about the statistics problem in new wikis, as you asked about two weeks ago, with both of you as subscribers...
thanks in advance [23:28:26] T315121: After new wikis are created/imported from Incubator, statistics should be updated - https://phabricator.wikimedia.org/T315121 [23:38:41] !log [mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.timer T315121 [23:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:45] T315121: After new wikis are created/imported from Incubator, statistics should be updated - https://phabricator.wikimedia.org/T315121 [23:41:22] !log wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg T315121 [23:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:35] jem: please reload for comments and ..stats should be fixed?^ :) [23:41:52] one action in prod and one in cloud [23:42:42] blk: total="3377",good="804",edits="15316",users="203",activeusers="29",admins="0",images="0" [23:42:58] kcg: total="1994",good="452",edits="15911",users="420",activeusers="16",admins="1",images="0" [23:43:10] blk needs an admin I suppose