[00:13:04] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-08-02 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:13:24] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [00:14:06] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-08-02 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:16:50] RECOVERY - SSH on cp1089.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:16:52] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [00:31:20] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [00:31:38] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-08-02 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:32:39] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10bking) [00:33:42] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-08-02 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:34:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10bking) [00:37:07] 10SRE, 10Cloud-VPS, 10Performance-Team (Radar), 10cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Krinkle) [00:43:14] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:26] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:50] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:06:26] 10SRE, 10serviceops, 10Patch-For-Review, 10User-Joe: Set up A/B testing mechanism for PHP7 - https://phabricator.wikimedia.org/T216676 (10Krinkle) [01:16:56] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [01:21:02] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:22:16] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:28] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:10] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-08-09 00:00:02 (3361 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:36:18] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-08-09 00:00:01 (3361 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:05:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:06:28] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.43 ms [02:06:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:06:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:07:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.24 [core] (wmf/1.39.0-wmf.24) - 10https://gerrit.wikimedia.org/r/821806 [02:07:36] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.24 [core] (wmf/1.39.0-wmf.24) - 10https://gerrit.wikimedia.org/r/821806 (owner: 10TrainBranchBot) [02:07:45] (JobUnavailable) resolved: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:16] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.24 [core] (wmf/1.39.0-wmf.24) - 10https://gerrit.wikimedia.org/r/821806 (owner: 10TrainBranchBot) [02:32:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:33:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:33:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:34:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:43:48] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:46:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:46] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-08-09 00:00:02 (3340 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:57:56] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:16] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:55:24] 10SRE, 10Cloud-VPS, 10Performance-Team (Radar), 10cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10ori) >>! 
In T225713#8130251, @tstarling wrote: > The performance impact of setting scaling_governor to `performance` is indeed significant. The median se... [04:10:10] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:12:50] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:08] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:08] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:23:48] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-08-09 00:00:01 (3340 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:24:42] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:40] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:46:36] (03CR) 10Tim Starling: "Timo asked for a benchmark due to concerns about T116550, although we pretty well convinced ourselves that it doesn't affect the deployed " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820904 (owner: 10Tim Starling) [04:52:42] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:02:06] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:06] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance [05:05:24] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance [05:06:27] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance [05:06:41] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance [05:09:04] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance [05:09:18] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance [05:09:35] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance [05:09:54] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance [05:12:34] <_joe_> !log starting to shut down servers in codfw for the PDU maintenance [05:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:58] (03PS2) 10KartikMistry: Enable SectionTranslation on testwiki with new MT support from Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821170 (https://phabricator.wikimedia.org/T313296) [05:19:07] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance [05:19:23] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 
parse[2016-2020].codfw.wmnet with reason: PDU maintenance [05:23:22] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:10] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance [05:24:24] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance [05:32:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:33:58] (KubernetesCalicoDown) firing: kubernetes2013.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:56] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:38:58] (KubernetesCalicoDown) firing: (2) kubernetes2013.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:44:51] <_joe_> uhm the downtime cookbook doesn't downtime alerts on alertmanager [05:46:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on 
kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:15:05] 10SRE, 10SRE-OnFire, 10Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (10Joe) >>! In T314842#8139835, @RhinosF1 wrote: > For the IRC side, it's probably better to check cloak or account: > > - 1: a nickname on IRC normally has a short period of time (although... [06:23:16] 10SRE, 10SRE-OnFire, 10Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (10RhinosF1) > Basically, you're asking to base the bot's reactions on a state that's completely managed by an external source (nickserv/chanserv) and that we don't get with every IRC message... [06:42:30] (03CR) 10Jdlrobson: [C: 03+1] "Feel free to self merge Clare!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821319 (https://phabricator.wikimedia.org/T312573) (owner: 10Clare Ming) [06:54:40] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220810T0700). [07:00:05] kart_ and aharoni: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:20] 10SRE-swift-storage, 10User-fgiunchedi: Expand thanos-swift sd[ab]3 SSDs - https://phabricator.wikimedia.org/T314275 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete, thanos-swift cluster has been expanded online. 
ms cluster has not been expanded, though we had no troubles with ssd space... [07:00:30] * kart_ is here. [07:01:46] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:13] (03CR) 10KartikMistry: [C: 03+2] Enable SectionTranslation on testwiki with new MT support from Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821170 (https://phabricator.wikimedia.org/T313296) (owner: 10KartikMistry) [07:02:29] * kart_ is deploying first patch ie self deploy. [07:03:09] (03Merged) 10jenkins-bot: Enable SectionTranslation on testwiki with new MT support from Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821170 (https://phabricator.wikimedia.org/T313296) (owner: 10KartikMistry) [07:05:35] (03PS1) 10Amire80: arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821732 (https://phabricator.wikimedia.org/T291737) [07:07:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:08:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:08:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:11:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:11:45] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:821170|Enable SectionTranslation on testwiki with new MT support from Google (T313296)]] (duration: 05m 44s) [07:11:49] T313296: Enable Content and Section translation on wikipedias with new MT support from Google - https://phabricator.wikimedia.org/T313296 [07:15:43] aharoni: I'll deploy your patch once CI is done. Please be here to test it. 
[07:15:53] I'm here and ready [07:15:59] thanks [07:20:28] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (dcaro) Related to T314847 [07:25:11] aharoni: oh, and it will take 15 more min to merge the patch after +2 and 20 more min to sync-world (we need to run a full scap for namespace updates!) [07:25:50] Are you sure a full scap is needed? I _think_ that a maintenance script for namespace is enough, although I might be wrong. [07:26:25] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ayounsi) @Cmjohnson there is an outstanding diff: ` Changes for 1 devices: ['lsw1-f3-eqiad.mgmt.eqiad.wmnet'] [edit interfaces ge-0/0/40] - des... [07:26:40] (CR) KartikMistry: [C: +2] "Backport to wmf.23" [core] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/821732 (https://phabricator.wikimedia.org/T291737) (owner: Amire80) [07:27:13] aharoni: Do you know which script to run exactly? I'm not sure. [07:27:32] We did a full scap last time. [07:28:26] If I recall correctly, it's namespaceDupes.php, and its description sounds right. [07:30:26] OK. Let's do that! [07:30:43] We need to run normal scap + script after that. [07:31:20] (I never learned what scap does exactly.) [07:32:03] aharoni: It syncs code/backport changes to all servers. Simple.
[07:32:06] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:07] !log depool thanos-fe2001 for debugging [07:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:40] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [07:39:08] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:42] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:41:03] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10fgiunchedi) Thank you @dcausse for diving deep into this issue and mitigating it! I can confirm that the space has stopped growing at the same r... [07:45:44] (03Merged) 10jenkins-bot: arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821732 (https://phabricator.wikimedia.org/T291737) (owner: 10Amire80) [07:46:52] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [07:48:51] aharoni: Possible to test it via mwdebug1001? [07:49:10] yes [07:49:14] aharoni: I've synced there. Not sure if that requires script run. But, can you try it. [07:49:36] aharoni: also, let me know - how you're testing. I'll keep a note of that. [07:50:19] 10SRE-swift-storage, 10ops-codfw: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10fgiunchedi) [07:50:34] aharoni: There is currently a known issue with namespaceDupes.php. https://phabricator.wikimedia.org/T314711 [07:51:17] Since you are changing the name of the Template namespace this might be a problem. 
[07:51:36] To test, I look at Special:AllPages, open the Namespaces dropdown, and check that the names are new. [07:51:44] It works even with ?uselang=en [07:51:52] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:51:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:52:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [07:52:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [07:52:40] I still don't see that they changed. [07:52:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:52:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:54:13] You need a full scap, aharoni [07:54:27] You can't test namespace changes without a rebuild [07:54:42] OK, I can wait patiently [07:54:46] RhinosF1: Got it. Let me run a full scap then. [07:54:58] kart_: note what PleaseStand said though [07:55:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:55:57] Although, I don't believe the script not running is a blocker. 
It just might make a few pages inaccessible, if there are any using the new names [07:56:05] Cc aharoni PleaseStand [07:56:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json [07:57:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json [07:57:15] (CR) Jbond: [C: +1] sre.network.debug: automatically analyse the remote interface [cookbooks] - https://gerrit.wikimedia.org/r/821597 (owner: Ayounsi) [07:58:53] I think the issue would be that when a template is updated, MediaWiki may not refresh all pages that use it. That, and also maybe cascade protection? [07:59:03] (CR) Jbond: [C: +1] sre.network.debug: allow referencing directly an interface (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/821600 (owner: Ayounsi) [07:59:10] !log kartik@deploy1002 Started scap: Backport: [[gerrit:821732|arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]] [07:59:14] T291737: Request adding and updating namespaces on arywiki - https://phabricator.wikimedia.org/T291737 [08:00:34] PleaseStand: I'm not sure that's affected. Do we need to hold? [08:01:00] Now I see the change with and without mwdebug1001 [08:01:44] aharoni: kart_ started scap [08:01:57] Although I'm not sure if there are concerns about namespaceDupes [08:02:07] It's been forgotten before so I don't think it's that bad [08:02:58] There are some 'Host verification failed' errors in scap. Is it known?
[08:03:15] kart_: any specific server? [08:03:29] RhinosF1: arywiki isn't terribly big, so refreshLinks.php may work if need be [08:03:47] PleaseStand: ok [08:04:04] RhinosF1: labweb1001 and labweb1002 [08:04:26] kart_: they are fine I believe. I'll poke cloud services [08:04:37] Thanks RhinosF1 [08:04:42] PleaseStand: we could also bribe someone to merge Amir's patch [08:04:43] (CR) Ayounsi: [C: +2] sre.network.debug: automatically analyse the remote interface [cookbooks] - https://gerrit.wikimedia.org/r/821597 (owner: Ayounsi) [08:04:45] Also I see `08:02:28 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'deploy2002.codfw.wmnet', 'deploy1002.eqiad.wmnet', 'deploy1002.eqiad.wmnet'] (ran as mwdeploy@mw2289.codfw.wmnet) returned [255]: ssh: connect to host mw2289.codfw.wmnet port 22: Connection timed out` [08:04:47] So now the pages are listed in Special:AllPages with the new name, but when you actually click a link, you see the old namespace name. [08:04:54] (CR) Ayounsi: [C: +2] sre.network.debug: allow referencing directly an interface [cookbooks] - https://gerrit.wikimedia.org/r/821600 (owner: Ayounsi) [08:04:57] If you do a null edit, then it changes to the new name. [08:05:04] aharoni: scap is still running [08:05:07] The script is supposed to change all these names completely. [08:05:12] aharoni: wait wait :) [08:05:18] And we haven't even run any script yet [08:05:20] I'm not sure that scap alone does it. [08:05:29] Yes, it's the usual behavior. [08:05:46] kart_: not sure why 2289 would time out [08:06:01] RhinosF1: mw2289.codfw connection timed out. [08:06:11] Oh, already pasted the message. Sorry. [08:06:30] kart_: probably needs a depool from serviceops if it's broken.
[08:08:33] kart_: https://phabricator.wikimedia.org/T313861 is the labweb issue [08:08:43] I left a note in cloud admin channel [08:09:06] (03Merged) 10jenkins-bot: sre.network.debug: automatically analyse the remote interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi) [08:09:08] (03Merged) 10jenkins-bot: sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi) [08:09:47] !log kartik@deploy1002 Finished scap: Backport: [[gerrit:821732|arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]] (duration: 10m 37s) [08:09:51] T291737: Request adding and updating namespaces on arywiki - https://phabricator.wikimedia.org/T291737 [08:10:17] aharoni: scap has now finished. Can you confirm state? [08:10:52] Special:AllPages looks good [08:11:07] But untouched pages still have the old prefix [08:11:26] Ok. kart_ can hopefully run refreshLinks given the normal way is broken. [08:11:29] RhinosF1: Thanks! [08:11:35] kart_: np [08:11:43] RhinosF1: OK. Let me see that. [08:11:45] I left serviceops a message about mw2289 too [08:12:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json [08:12:36] I need to go have breakfast but PleaseStand filed the bug so they should have advice if they are still here [08:13:20] !log restart replication on db1117:m1 T309074 [08:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:23] T309074: Put netmon1003 in service - https://phabricator.wikimedia.org/T309074 [08:14:04] <_joe_> kart_: the server has been shut down and should also be set to pooled=inactive, let me check why it's still in the distribution list [08:14:26] RhinosF1: Actually, maybe it's not a problem. 
The purpose of namespaceDupes.php is to rename pages that start with the new namespace prefix. [08:14:29] <_joe_> ah sigh [08:14:32] <_joe_> it's a scap proxy [08:14:39] <_joe_> how didn't I notice [08:14:42] <_joe_> sorry kart_ my bad [08:15:01] RhinosF1: We can run namespaceDupes.php, and if it turns out to be necessary, we have the option of running refreshLinks.php [08:15:16] kart_: ^ [08:15:16] tl_namespace and tl_title columns have not been dropped yet on arywiki AFAIK [08:16:18] (PS1) Giuseppe Lavagetto: scap: temporarily remove proxy for ongoing maintenance [puppet] - https://gerrit.wikimedia.org/r/822036 [08:16:22] SRE, SRE-swift-storage: Bump memcache connections and swift-proxy limits - https://phabricator.wikimedia.org/T314914 (fgiunchedi) [08:16:52] (CR) Giuseppe Lavagetto: [V: +2 C: +2] scap: temporarily remove proxy for ongoing maintenance [puppet] - https://gerrit.wikimedia.org/r/822036 (owner: Giuseppe Lavagetto) [08:17:23] _joe_: Thanks! [08:18:10] <_joe_> kart_: if you have further patches to merge, please wait a couple minutes while puppet runs [08:18:43] (CR) Jbond: "couple of minor issues" [software/spicerack] - https://gerrit.wikimedia.org/r/816701 (owner: Ayounsi) [08:19:45] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: Dzahn) [08:22:56] _joe_: No more patches for me. Thanks! [08:23:08] SRE, LDAP-Access-Requests, Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (jbond) [08:23:23] !log Run: mwscript namespaceDupes.php arywiki --fix (T291737) [08:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:26] T291737: Request adding and updating namespaces on arywiki - https://phabricator.wikimedia.org/T291737 [08:23:34] aharoni: See if that's fixed now?
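Note on the 08:14:26 message about what namespaceDupes.php is for: the idea can be sketched roughly as below. This is a simplified illustration only, not the code or the exact matching rules of namespaceDupes.php. The `Page` type, the `find_prefix_conflicts` function and the `NewTemplate` prefix are hypothetical names made up for the example, and it assumes namespace 0 is the main namespace and that prefixes are matched case-insensitively.

```python
from typing import Dict, List, NamedTuple, Tuple


class Page(NamedTuple):
    """A page identified by a namespace id and a title (hypothetical type)."""
    namespace: int
    title: str


def find_prefix_conflicts(
    pages: List[Page], namespace_names: Dict[str, int]
) -> List[Tuple[Page, Page]]:
    """Return (current, target) pairs for main-namespace pages whose title
    starts with a name that is now a real namespace prefix."""
    # Normalise names so that "New Template" and "new_template" compare equal.
    lookup = {
        name.lower().replace(" ", "_"): ns_id
        for name, ns_id in namespace_names.items()
    }
    conflicts = []
    for page in pages:
        # Only main-namespace titles containing a colon can be shadowed.
        if page.namespace != 0 or ":" not in page.title:
            continue
        prefix, rest = page.title.split(":", 1)
        ns_id = lookup.get(prefix.lower().replace(" ", "_"))
        if ns_id is not None and rest:
            # The page should live in namespace ns_id under the bare title.
            conflicts.append((page, Page(ns_id, rest)))
    return conflicts


if __name__ == "__main__":
    pages = [
        Page(0, "NewTemplate:Cite_report"),  # stored before the prefix existed
        Page(0, "Main_page"),
        Page(10, "Infobox"),
    ]
    for current, target in find_prefix_conflicts(pages, {"NewTemplate": 10}):
        print(current, "->", target)
```

A real run would also have to deal with a target title that already exists, which is left out of this sketch.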
[08:24:02] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10jbond) [08:24:12] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2 [08:24:22] (03PS1) 10Ladsgroup: Stop writing to the old templatelinks fields in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822037 (https://phabricator.wikimedia.org/T312865) [08:24:25] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2 [08:25:26] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10jbond) [08:25:30] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2 [08:25:41] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2 [08:26:33] I still see old namespace prefix on https://ary.wikipedia.org/wiki/%D9%85%D9%88%D8%B6%D9%8A%D9%84:Cite_report , for example [08:27:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json [08:27:49] 10SRE-swift-storage: / full on ms-be2028 - https://phabricator.wikimedia.org/T314915 (10MatthewVernon) [08:27:55] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2 [08:28:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2 [08:28:29] aharoni: OK. Let me check refreshlinks. 
[08:28:37] jouncebot: nowandnext [08:28:37] No deployments scheduled for the next 4 hour(s) and 31 minute(s) [08:28:37] In 4 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220810T1300) [08:28:57] kart_: can I deploy something? [08:28:58] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full / [08:29:12] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full / [08:29:16] 10SRE-swift-storage: / full on ms-be2028 - https://phabricator.wikimedia.org/T314915 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=018253d8-e4ec-431d-a247-bbafb27e0dc4) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Trying to fix full / ` ms-be2028.... [08:30:34] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to the old templatelinks fields in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822037 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [08:31:23] (03Merged) 10jenkins-bot: Stop writing to the old templatelinks fields in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822037 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [08:31:33] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap [08:31:34] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10jbond) >>! In T314563#8140631, @BCornwall wrote: > Hi, @Siko_WMDE I'll need you to: > > * Sign the L3 Acknowledgement of Wikimedia Server Access Responsibilitie... 
[08:31:46] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap [08:32:16] !log power off gitlab-runner2004 [08:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:17] (03PS1) 10Filippo Giunchedi: memcached: point to active/used configuration options [puppet] - 10https://gerrit.wikimedia.org/r/822039 (https://phabricator.wikimedia.org/T314914) [08:34:19] (03PS1) 10Filippo Giunchedi: swift: bump proxy memcache max connections [puppet] - 10https://gerrit.wikimedia.org/r/822040 (https://phabricator.wikimedia.org/T314914) [08:35:20] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822037|Stop writing to the old templatelinks fields in s5 (T312865)]] (duration: 03m 29s) [08:35:24] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [08:35:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:36:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:36:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:36:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:37:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:37:30] Amir1: yeah :) [08:37:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:37:56] cool [08:37:57] kart_: is there anything left from aharoni's deploy? 
[08:39:00] Well, I still see the old namespace prefixes on untouched pages :) [08:39:29] It's probably not a very big deal, but if I recall correctly, the script is supposed to fix that [08:39:36] aharoni: you probably need to run the namespaceDupes maintenance script to clean it up [08:39:44] but it's broken (and I broke it) [08:39:47] Oh :) [08:39:53] aharoni: I did that. [08:40:01] Amir1: who can we bribe to merge it? [08:40:03] RhinosF1: Should I run refreshLinks? [08:40:15] Amir1: would that work ^ [08:40:22] seeing as you are here now [08:40:26] I doubt it [08:40:34] it's a new wiki in s5, right? [08:40:54] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/821783 [08:41:07] This would fix it for templatelinks [08:41:18] kart_: I sent a patch for the labweb issue. It probably won't get fixed until the US wakes up, so deployers might see the warning for today. [08:41:36] kart_: please don't run it again on s5, it'll break replication [08:42:01] Amir1: OK. I'm not doing it. Thanks! [08:42:16] if someone reviews and merges the patch, we can backport it [08:42:20] Amir1: arywiki isn't listed on https://noc.wikimedia.org/db.php?dc=eqiad , so it should be in s3 anyway? [08:42:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json [08:42:54] PleaseStand: oh good, new wikis go to s5, that's why I thought it might be there [08:44:13] (03CR) 10Btullis: [V: 03+1] "This has no impact on existing etcd clusters." 
[puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [08:45:26] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:28] to be clear, I'm not sure if the Amir's problem is directly related to the templatelinks migration and the script being broken but it might be [08:47:10] (03CR) 10Btullis: [V: 03+1] Use the chained certificate for the etcd cfssl option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [08:47:25] (03PS1) 10Ladsgroup: maintenance: Add support for links migration to namespaceDupes.php [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821734 (https://phabricator.wikimedia.org/T314711) [08:47:32] (03CR) 10Ladsgroup: [C: 03+2] maintenance: Add support for links migration to namespaceDupes.php [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821734 (https://phabricator.wikimedia.org/T314711) (owner: 10Ladsgroup) [08:48:16] 10SRE-swift-storage: / full on ms-be2028 - https://phabricator.wikimedia.org/T314915 (10MatthewVernon) 05Open→03Resolved This was a "swift is trying to fill `/` instead of a storage device" problem, fixed following the procedure here: https://wikitech.wikimedia.org/wiki/Swift/How_To#Cleanup_fully_used_root_f... 
[08:48:34] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet [08:48:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet [08:48:51] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2 [08:49:02] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2 [08:49:28] !log shutdown dbprov2003 before pdu upgrade T310146 [08:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:31] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [08:50:24] Amir1: I guess you are mainly doing this to avoid the possibility of someone breaking s5 replication? Are you dropping the columns on replicas before the primary server? [08:51:45] PleaseStand: yeah, I'm backporting so that, on the off chance someone runs it on s5, it won't break replication. [08:52:04] We always have to run them on the replicas first, then do a master switchover, and then run them on the old master [08:55:06] Amir1: isn't there a task for regularly running it too? [08:56:20] yeah but I haven't started dropping it on any section until today [09:03:05] I need to go. It looks like Aharoni's issue still persists, and I'll follow up on that later today. 
[09:03:16] RECOVERY - Disk space on ms-be2028 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [09:06:43] (03Merged) 10jenkins-bot: maintenance: Add support for links migration to namespaceDupes.php [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821734 (https://phabricator.wikimedia.org/T314711) (owner: 10Ladsgroup) [09:09:03] (03CR) 10Jbond: [C: 03+1] Use the chained certificate for the etcd cfssl option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:10:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool D5 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json [09:10:42] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [09:10:42] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [09:11:50] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:14:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:14:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:14:27] (03CR) 10Jbond: [C: 03+1] memcached: point to active/used configuration options [puppet] - 10https://gerrit.wikimedia.org/r/822039 (https://phabricator.wikimedia.org/T314914) (owner: 10Filippo Giunchedi) [09:15:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: 
D5 PDU maint (T310146) [09:15:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146) [09:15:43] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [09:15:43] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [09:15:56] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: [[gerrit:821734|maintenance: Add support for links migration to namespaceDupes.php (T314711)]] (duration: 03m 18s) [09:15:59] T314711: Add support for links migration to namespaceDupes.php - https://phabricator.wikimedia.org/T314711 [09:16:38] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:54] PROBLEM - Host schema2003 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 6357.61 ms [09:16:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:17:24] RECOVERY - Host schema2003 is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms [09:28:42] !log shutdown backup2007 before pdu upgrade T310146 [09:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:46] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [09:29:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:29:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:29:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [09:30:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [09:30:09] 10SRE, 10SRE-OnFire, 10Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (10Joe) [09:30:42] PROBLEM - very high load average likely xfs on ms-be2028 is CRITICAL: CRITICAL - load average: 107.33, 102.22, 93.60 https://wikitech.wikimedia.org/wiki/Swift [09:30:46] (03CR) 10Ayounsi: PeeringDB API: initial commit (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [09:31:01] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) 05Resolved→03Open Hi @Papaul I may be missing something obvious, but I don't think the storage is quite right here - as far as I can see there isn't a new disk visible, and if... [09:31:21] (03PS25) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [09:31:39] 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot: UX improvements - https://phabricator.wikimedia.org/T314843 (10Joe) [09:31:52] !log depool services in codfw for upcoming PDU replacement - T309956 [09:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:55] T309956: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 [09:34:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool D6 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json [09:34:36] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6 [09:34:37] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [09:34:46] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) Perhaps relatedly, but perhaps 
not, kern.log is unhappy about /dev/sdz since sdc was removed: ` Aug 3 15:18:02 ms-be2067 kernel: [2595942.387928] sd 0:2:2:0: SCSI device is re mo... [09:36:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146) [09:36:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146) [09:38:38] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [09:38:48] (03CR) 10Jbond: "LGTM some minor or nits, which also cover the CI errors" [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [09:39:10] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:40:04] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:43:02] (03CR) 10Btullis: [V: 03+1 C: 03+2] Use the chained certificate for the etcd cfssl option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:43:46] (03CR) 10Cathal Mooney: "Thanks for the feedback will refactor and see how I get on." 
[puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [09:47:22] PROBLEM - very high load average likely xfs on ms-be2028 is CRITICAL: CRITICAL - load average: 102.71, 100.41, 98.59 https://wikitech.wikimedia.org/wiki/Swift [09:49:09] 10SRE, 10observability: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (10Vgutierrez) [09:49:45] 10SRE, 10Traffic, 10observability: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (10Vgutierrez) Adding Traffic as it's affecting several traffic metrics [09:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool D8 DBs for PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json [09:51:04] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [09:51:04] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [09:52:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146) [09:53:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146) [09:57:49] (03PS1) 10David Caro: cloudnet.show: add router info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822050 [09:58:16] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Ladsgroup) [09:59:46] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Ladsgroup) I removed db2181 and db2192 from `D8` list because they have been decommissioned recently (after creation of this task): {T311623} and {T313003} 
[10:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:01:15] (03PS1) 10Vgutierrez: mtail: Add a -1 bucket as a workaround for T314922 [puppet] - 10https://gerrit.wikimedia.org/r/822051 (https://phabricator.wikimedia.org/T314922) [10:02:32] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance [10:02:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance [10:03:40] (03CR) 10CI reject: [V: 04-1] cloudnet.show: add router info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822050 (owner: 10David Caro) [10:03:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet [10:06:08] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:25] 10SRE, 10Traffic, 10observability, 10Patch-For-Review: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (10fgiunchedi) Reported upstream as https://github.com/google/mtail/issues/675 [10:08:46] 10SRE, 10Traffic, 10observability, 10Patch-For-Review, 10Upstream: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (10Vgutierrez) [10:09:48] (03CR) 10Vgutierrez: [C: 03+2] mtail: Add a -1 bucket as a workaround for T314922 [puppet] - 10https://gerrit.wikimedia.org/r/822051 (https://phabricator.wikimedia.org/T314922) (owner: 10Vgutierrez) [10:13:03] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - 
https://phabricator.wikimedia.org/T314847 (10ayounsi) Thanks for this task and the clear write-up. I agree with the overall problem statement and ideas to solve it. Adding some th... [10:15:48] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:18:03] (03PS1) 10Btullis: Add a new intermediate CA for use with etcd [puppet] - 10https://gerrit.wikimedia.org/r/822053 (https://phabricator.wikimedia.org/T313129) [10:18:12] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:18:16] RECOVERY - very high load average likely xfs on ms-be2028 is OK: OK - load average: 61.96, 65.66, 77.62 https://wikitech.wikimedia.org/wiki/Swift [10:19:12] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance [10:19:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance [10:20:10] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36671/console" [puppet] - 10https://gerrit.wikimedia.org/r/822053 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [10:20:18] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821778 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [10:20:20] (03PS2) 10FNegri: Add cloudcephosd1025 to Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/821778 (https://phabricator.wikimedia.org/T314870) [10:20:40] !log elukey@cumin1001 
START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance [10:20:53] (03CR) 10FNegri: [V: 03+2] Add cloudcephosd1025 to Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/821778 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [10:20:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance [10:21:03] (03PS26) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [10:21:36] (03PS1) 10Vgutierrez: mtail: Tune histogram buckets for trafficserver plugin time metrics [puppet] - 10https://gerrit.wikimedia.org/r/822054 (https://phabricator.wikimedia.org/T309651) [10:22:17] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [10:23:38] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance [10:23:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance [10:24:11] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Ladsgroup) >>! In T310146#8142200, @Ladsgroup wrote: > I removed db2181 and db2182 from `D8` list because they have been decommissioned recently (after creation of... 
[10:24:18] (03CR) 10CI reject: [V: 04-1] mtail: Tune histogram buckets for trafficserver plugin time metrics [puppet] - 10https://gerrit.wikimedia.org/r/822054 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [10:24:40] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance [10:24:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet [10:24:53] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance [10:25:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet [10:25:54] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet [10:26:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet [10:26:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance [10:26:15] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10elukey) [10:26:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance [10:27:42] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Ladsgroup) [10:28:16] (03CR) 10Jbond: "up" [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [10:29:08] (03CR) 10Jbond: C:varnish: Rate limit hotlinking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [10:30:24] (03CR) 10Jbond: [C: 03+2] P:base::firewall: Add requestctl definitions to ferm [puppet] - 10https://gerrit.wikimedia.org/r/817307 
(https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [10:31:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146) [10:31:17] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6 [10:31:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146) [10:31:18] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [10:31:58] (KubernetesCalicoDown) firing: ml-serve2008.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:34:29] (03PS1) 10Filippo Giunchedi: mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) [10:35:09] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Ladsgroup) [10:37:20] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [10:37:25] (03CR) 10CI reject: [V: 04-1] mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi) [10:38:46] (03PS1) 10Btullis: Failover hive to the standby server [dns] - 10https://gerrit.wikimedia.org/r/822058 (https://phabricator.wikimedia.org/T303168) [10:39:31] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:40:10] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - 
https://phabricator.wikimedia.org/T314563 (10Siko_WMDE) Hi @jbond, I signed the L3 document. To create the access request, I did not use a form or link.. Thank you and best regards, Simon [10:41:18] (03CR) 10Btullis: [C: 03+2] Failover hive to the standby server [dns] - 10https://gerrit.wikimedia.org/r/822058 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [10:41:56] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Ladsgroup) [10:42:16] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a new intermediate CA for use with etcd [puppet] - 10https://gerrit.wikimedia.org/r/822053 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [10:42:39] (03PS4) 10Hnowlan: install_server: add reimage role for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) [10:43:42] (03CR) 10Filippo Giunchedi: "Expected to fail now" [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi) [10:44:31] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:44:42] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.1057 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:45:34] (03CR) 10Jbond: [C: 03+2] admin: add Simon Kock to ldap_only admins (nda,wmde) [puppet] - 10https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: 10Dzahn) [10:45:40] (03PS3) 10Jbond: admin: add Simon Kock to ldap_only admins (nda,wmde) [puppet] - 10https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: 10Dzahn) [10:45:47] 10SRE, 10SRE-swift-storage, 10ops-codfw, 
10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10jcrespo) [10:49:35] (03PS2) 10Vgutierrez: mtail: Tune histogram buckets for trafficserver plugin time metrics [puppet] - 10https://gerrit.wikimedia.org/r/822054 (https://phabricator.wikimedia.org/T309651) [10:50:27] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10jbond) [10:51:16] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [10:51:26] jbond: ^^puppet failure seems to be related to requestctl && ferm rules [10:53:12] (03CR) 10Vgutierrez: [C: 03+2] mtail: Tune histogram buckets for trafficserver plugin time metrics [puppet] - 10https://gerrit.wikimedia.org/r/822054 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [10:53:17] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10jbond) >>! In T314563#8140631, @BCornwall wrote: > Hi, @Siko_WMDE I'll need you to: > > * Sign the L3 Acknowledgement of Wikimedia Server Access Responsibilitie... 
[10:54:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [10:59:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-conf1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on acmechief1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti2022 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1012 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2035 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on maps2006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1041 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on miscweb1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be1029 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be1031 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1323 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1337 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:56] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1360 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1381 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1433 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1451 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2274 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2299 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2309 
is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2328 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2324 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2373 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2379 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2404 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ncredir4002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on orespoolcounter2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on registry1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase-dev1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sessionstore1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:00:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase2025 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:20] (03PS1) 10Jbond: P:firewall: fix if clause [puppet] - 10https://gerrit.wikimedia.org/r/822060 [11:01:20] sorry this is me fixing [11:01:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1082 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1103 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1126 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1130 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1064 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1139 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:44] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1073 is 
CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on bast4003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:47] (03PS2) 10Jbond: P:firewall: fix if clause [puppet] - 10https://gerrit.wikimedia.org/r/822060 [11:01:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on conf2006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on rdb1009 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1022 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:02:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:02:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:firewall: fix if clause [puppet] - 10https://gerrit.wikimedia.org/r/822060 (owner: 10Jbond) [11:02:56] hey kart_ we need to do maintenance on codfw which brings down the proxy in front of cxserverdb and it seems it's getting connections from the cx service. Can we shut it down? 
[11:03:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-presto1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-airflow1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1084 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1109 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1141 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1070 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:08] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1068 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1066 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:10] PROBLEM - confd service on apt1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:03:11] PROBLEM - confd service on archiva1002 is CRITICAL: CRITICAL - Expecting active but unit confd is 
activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:03:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:14] oh oh [11:03:15] PROBLEM - confd service on cloudweb1004 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:03:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on dumpsdata1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on druid1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on durum1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1074 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1083 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:35] Amir1: this can be ignored, it's a bad change, I'm fixing it now, sorry for the noise [11:03:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1016 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1024 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on doh5001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on htmldumper1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:40] jbond: do you need help? [11:03:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1029 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2027 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc-gp1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be1035 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1310 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1325 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1329 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:45] PROBLEM - Confd 
template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2037 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1344 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1354 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1090 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1124 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:48] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1376 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1399 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1431 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1437 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1456 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:52] Amir1: can you silence the icinga bot? [11:03:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2266 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2312 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2351 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2381 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1078 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:00] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on parse2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:00] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1015 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1027 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:03] I wish I could [11:04:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1013 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on druid1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on eventlog1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:15] icinga-wm: silence [11:04:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1034 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2025 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1350 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1395 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1421 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:17] PROBLEM - Confd 
template for /etc/ferm/conf.d/00_defs_requestctl on mw1417 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1406 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sessionstore1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1009 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:18] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thumbor1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:18] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2392 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:19] didn't work [11:04:36] ack, hopefully we will be out of the woods shortly [11:04:56] and then the shower of recoveries [11:04:58] :D [11:05:06] indeed :/ [11:05:14] kart_: it's in D6, T310146 [11:05:14] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6 [11:05:15] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [11:05:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-druid1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1105 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1107 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1132 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on contint2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on dragonfly-supernode2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1057 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on durum6002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1026 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on install1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:03] PROBLEM - Confd template 
for /etc/ferm/conf.d/00_defs_requestctl on irc2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:03] PROBLEM - confd service on gerrit2002 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:06:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-main1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:03] PROBLEM - confd service on install5001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:06:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-main2004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kubetcd1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1032 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:10] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2031 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1332 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1349 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1387 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1390 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1367 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1369 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1384 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:14] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1392 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:14] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1411 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1408 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1439 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:16] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1446 is 
CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:16] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2277 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:20] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1112 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:20] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ores2005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on pki-root1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on pki2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ncredir5002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on prometheus5001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on puppetmaster2004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on urldownloader1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1028 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thumbor2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on urldownloader2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:30] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1039 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:33] PROBLEM - confd service on doh2001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:06:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1017 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1021 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-main1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on 
kafka-test1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2033 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1321 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1318 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1324 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2298 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2319 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2375 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mwmaint2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:56] (03PS1) 10Jbond: P:firewall: add if gaurd back [puppet] - 10https://gerrit.wikimedia.org/r/822061 [11:08:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on alert1001 is 
CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1059 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1075 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-presto1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on dns3002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on druid1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1060 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on doc2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on doh4001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on durum4001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1022 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on krb2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:45] PROBLEM - confd service on irc2001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:47] PROBLEM - confd service on labstore1007 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on install4001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2024 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1042 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1039 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ml-cache1002 is CRITICAL: File 
not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on maps2009 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2021 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2031 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1362 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1341 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1436 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1453 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1443 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2301 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:57] PROBLEM - Confd template for 
/etc/ferm/conf.d/00_defs_requestctl on mw2279 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2296 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2326 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2338 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2365 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on orespoolcounter1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ncredir2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on phab1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on planet2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on apifeatureusage2001 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on puppetmaster1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1017 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1021 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1026 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on snapshot1011 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thanos-be1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on db1108 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1032 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1042 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:11] 
PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1077 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ml-cache1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2020 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ml-etcd2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1336 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2403 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2418 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mwdebug2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on stat1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thanos-be2004 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:10:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1067 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:34] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1050 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:34] PROBLEM - confd service on labstore1006 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:11:34] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2029 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:36] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1334 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:36] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ml-staging-etcd2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:38] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-fe2009 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1418 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:42] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2303 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:42] PROBLEM - Confd template for 
/etc/ferm/conf.d/00_defs_requestctl on mw2361 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:42] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on planet1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:44] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2399 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:44] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2407 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:44] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2413 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:44] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1030 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:12:46] PROBLEM - confd service on contint2001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:12:58] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on dns1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:02] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on doh2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:02] PROBLEM - confd service on doh3001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:13:04] 
PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on durum5002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:04] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:04] PROBLEM - confd service on install1003 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:13:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-main2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1026 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:08] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1051 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:12] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2032 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:12] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2038 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:12] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1346 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:12] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1383 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:14] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1450 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:16] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2273 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:16] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2330 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:18] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2394 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:20] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ncredir3001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:20] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1058 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:24] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on conf2004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:28] PROBLEM - confd service on urldownloader1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:13:29] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1007 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:29] PROBLEM - Confd template for 
/etc/ferm/conf.d/00_defs_requestctl on wtp1038 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:29] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thumbor2006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:30] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1054 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:36] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2386 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:38] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:38] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:38] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafkamon1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be1032 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:39] PROBLEM - confd service on install3001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:13:45] jbond: are you sure this is being fixed? 
[11:13:51] SRE, serviceops-radar, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (RhinosF1) > 12:13:30 PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thumbor2006 is CRITICAL: File not fo... [11:14:36] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on acmechief2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:36] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-master1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:36] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1083 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:40] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1114 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:40] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1063 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:42] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1077 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:48] Amir1: yes, the problem is we need puppet to run first on the host, then on alerts [11:15:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [11:16:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1011 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on bast2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on pybal-test2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:11] PROBLEM - confd service on doh1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:16:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc-gp2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2022 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2027 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on durum4002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-main1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on 
kafka-main2005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:51] PROBLEM - confd service on install4001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:16:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ldap-corp1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1033 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1037 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be1030 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1398 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1396 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2293 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:02] (PS1) Ladsgroup: Run clean ups with removeOrphanedEvents in major batches [extensions/Echo] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/821735 (https://phabricator.wikimedia.org/T310428)
[11:17:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-conf1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mwdebug1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on apt2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on phab2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1016 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thumbor1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on chartmuseum1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1037 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1066 is CRITICAL: File 
not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on doh6001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on schema2004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1027 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1118 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1009 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1028 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-logging2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be1038 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1414 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:39] 
(CR) Ladsgroup: [C: +2] Run clean ups with removeOrphanedEvents in major batches [extensions/Echo] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/821735 (https://phabricator.wikimedia.org/T310428) (owner: Ladsgroup) [11:18:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2272 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2362 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ores2007 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1036 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2036 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1131 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1067 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:56] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1074 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:56] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on authdns2001
is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:56] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2292 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1020 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:59] PROBLEM - confd service on doh2002 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:19:01] PROBLEM - confd service on doh4001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:19:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on durum3001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:01] PROBLEM - confd service on lists1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:19:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1348 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1447 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2323 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2364 
is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ncredir1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:09] PROBLEM - confd service on apt2001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:19:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on prometheus6001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase2024 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-coord1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1086 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1108 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-presto1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring 
[11:20:58] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1102 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:58] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1136 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on contint1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:13] PROBLEM - confd service on ldap-corp1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:21:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on druid1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1052 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on maps2008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ml-etcd1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ml-etcd2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1345 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2023 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1429 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2320 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2382 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2380 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mwmaint1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1115 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-druid1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on chartmuseum2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:23] PROBLEM - Confd 
template for /etc/ferm/conf.d/00_defs_requestctl on snapshot1012 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1075 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1025 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2025 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:26] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2300 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:26] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2262 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:26] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2410 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:26] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on gerrit2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on parse2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1100 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-airflow1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on bast1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2012 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on cloudweb1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1056 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on doh4002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on furud is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti2015 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on grafana2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:09] PROBLEM - 
Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1035 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on maps1009 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1040 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ml-etcd2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1444 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:16] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2304 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:16] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2333 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on search-loader2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2397 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2372 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2414 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ncredir6001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on prometheus3001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1011 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on moscovium is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1359 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-druid1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1007 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2011 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:45] PROBLEM - Confd 
template for /etc/ferm/conf.d/00_defs_requestctl on cloudweb1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1052 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1011 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti2023 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1027 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1333 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1385 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1391 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2267 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2377 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on archiva1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2398 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2411 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ncredir2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on pki1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on puppetmaster2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1014 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1047 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:35] PROBLEM - Confd 
template for /etc/ferm/conf.d/00_defs_requestctl on deploy1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1068 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:39] PROBLEM - confd service on doh6001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:23:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on dns5002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-main1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kubestagetcd1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-test1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kubetcd1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1044 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1335 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1426 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2291 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2354 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on netmon2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1028 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1351 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2415 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:24:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on dns4001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:24:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1014 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:24:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl 
on kafka-jumbo1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:24:59] PROBLEM - confd service on doh4002 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:24:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-test1010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2028 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1343 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1366 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1375 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1428 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1424 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2359 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:11] PROBLEM 
- Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2376 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ores1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:17] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase2016 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1389 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1119 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on elastic1053 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on dns2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on durum3002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1023 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2023 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2039 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1371 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mwlog1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:12] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2335 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:26:12] PROBLEM - confd service on netmon2001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:26:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:27:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ores1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:27:51] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1403 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:15] PROBLEM - Confd template for 
/etc/ferm/conf.d/00_defs_requestctl on mw2269 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1093 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:21] PROBLEM - confd service on seaborgium is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:28:23] PROBLEM - confd service on doh3002 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:28:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2029 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2259 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on parse2005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2037 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:33] PROBLEM - confd service on serpens is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:28:47] PROBLEM - Confd 
template for /etc/ferm/conf.d/00_defs_requestctl on maps1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash1031 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on maps1010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-main2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2316 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2383 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2371 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ores1007 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:29:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thanos-be1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:29:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2395 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:29:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thumbor2004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:29:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2366 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:29:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti1020 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:29:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kubestagetcd2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:07] PROBLEM - confd service on install6001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:30:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1393 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:11] PROBLEM - confd service on urldownloader1002 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:30:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kubestagetcd1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1127 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:49] PROBLEM - Confd template for 
/etc/ferm/conf.d/00_defs_requestctl on an-worker1140 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:51] PROBLEM - confd service on dborch1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:30:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on dns4002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1339 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1397 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1435 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1449 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on poolcounter2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on releases2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on prometheus4001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2030 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2356 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1404 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:13] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:34] hmm first recovery, don't know if it's related [11:31:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-druid1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1111 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:53] PROBLEM - confd service on doh6002 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:31:55] PROBLEM - confd service on irc1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:31:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ganeti2019 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc2019 
is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on maps2005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:55] PROBLEM - confd service on ldap-corp2001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:31:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1314 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:32:02] Amir1: we shouldn't get recoveries for these, the check will just get removed [11:32:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on pybal-test2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:32:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sessionstore2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:32:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on krb1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:32:49] noted [11:32:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1095 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:32:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1087 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:03] PROBLEM - confd service on gerrit1001 is CRITICAL: CRITICAL - Expecting active but unit confd is 
activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:33:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-logging1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ml-etcd1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on logstash2026 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ms-be2030 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1364 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1440 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2332 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2363 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on orespoolcounter2004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:23] PROBLEM - Confd 
template for /etc/ferm/conf.d/00_defs_requestctl on registry2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thumbor2005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thanos-be2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs2008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:29] RECOVERY - confd service on doh2001 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:33:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2387 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1029 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:33:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1041 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:34:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2405 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:34:49] PROBLEM - confd service on doh5001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:35:11] (CR) 
Nikerabbit: Enable message bundle on MetaWiki for WikiLearn (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro) [11:35:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-presto1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on deploy2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1048 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on maps2007 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1419 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2268 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2321 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:03] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ores1008 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on ores1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on puppetdb2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thanos-be1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on urldownloader1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1040 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:15] PROBLEM - confd service on cloudweb1003 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:36:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2264 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on gerrit1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:27] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc-gp2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:36:37] PROBLEM - Confd template for 
/etc/ferm/conf.d/00_defs_requestctl on zookeeper-test1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:37:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1407 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:37:11] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wtp1030 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:37:46] (Merged) jenkins-bot: Run clean ups with removeOrphanedEvents in major batches [extensions/Echo] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/821735 (https://phabricator.wikimedia.org/T310428) (owner: Ladsgroup) [11:37:57] (CR) Jbond: [C: +2] P:firewall: add if gaurd back [puppet] - https://gerrit.wikimedia.org/r/822061 (owner: Jbond) [11:38:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1007 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:15] PROBLEM - confd service on doh2001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:38:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:29] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on install2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1053 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:33] 
PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1386 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw1442 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2260 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2265 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2367 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase-dev1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on wdqs1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on bast6001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on parse2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mc1045 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on mw2308 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-logging2002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:39:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on restbase1025 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:39:59] PROBLEM - confd service on doh1002 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:41:51] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:57] PROBLEM - confd service on cloudweb2002-dev is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:43:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:45:27] PROBLEM - confd service on contint1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:45:57] new error jbond ^ [11:46:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:46:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:46:58] Amir1: ack thanks im looking at that now [11:47:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:53:59] 
PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:03] PROBLEM - Cassandra instance data free space on restbase1016 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7206 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [11:57:49] PROBLEM - Cassandra instance data free space on restbase1018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7274 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:00:23] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7186 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:02:25] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 11377 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:05:40] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: [[gerrit:821735|Run clean ups with removeOrphanedEvents in major batches (T310428)]] (duration: 03m 32s) [12:05:44] T310428: removeOrphanedEvents.php is slow - https://phabricator.wikimedia.org/T310428 [12:06:09] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:09:59] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 
0.005923 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:10:19] <_joe_> hnowlan: have you seen the cassandra disk space alerts? ^^ [12:10:28] <_joe_> and well, urandom [12:12:08] sorry, we were focused on the confd issue, I missed those [12:12:25] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:14:23] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:07] PROBLEM - Cassandra instance data free space on restbase1018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7050 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:18:57] PROBLEM - Cassandra instance data free space on restbase1018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7070 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:19:37] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7159 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:21:57] I'm shutting down db hosts for maint [12:24:14] SRE, SRE-OnFire, Observability-Alerting: vopsbot: UX improvements - https://phabricator.wikimedia.org/T314843 (Joe) [12:25:29] <_joe_> Amir1: I would worry about these cassandra hosts. 
[12:25:56] _joe_: yeah, gonna start pinging people about it very soon [12:25:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:27:37] !log remove confd from servers that shouldn't have it [12:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:45] D5 dbs shutdown is done now, going to ping people about cassandra [12:30:46] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [12:34:32] PROBLEM - Cassandra instance data free space on restbase1018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6977 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:35:11] one ping down, 99 to go [12:37:39] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146 [12:37:43] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [12:37:54] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146 [12:38:22] PROBLEM - Cassandra instance data free space on restbase1019 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8316 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:44:52] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7309 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:45:17] SRE, SRE-swift-storage, ops-codfw, DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (bking) [12:47:38] PROBLEM - Cassandra instance data free space on 
restbase1019 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8337 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:53:36] PROBLEM - Cassandra instance data free space on restbase1025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8309 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:55:03] (CR) Bartosz Dziewoński: [C: +1] Enable new topic tool on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/820568 (https://phabricator.wikimedia.org/T313699) (owner: Esanders) [12:55:50] (CR) Jdlrobson: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) (owner: Clare Ming) [12:56:22] PROBLEM - Cassandra instance data free space on restbase1026 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8262 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:57:40] PROBLEM - Cassandra instance data free space on restbase1025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8172 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:58:28] PROBLEM - Cassandra instance data free space on restbase1024 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8046 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:58:38] PROBLEM - Cassandra instance data free space on restbase1029 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7412 MB (18% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:58:38] PROBLEM - Cassandra instance data free space on restbase1030 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8136 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:58:46] <_joe_> ok, this is escalating it seems [12:58:57] <_joe_> I guess 
this is due to the nodes down in codfw [13:00:05] RoanKattouw, Urbanecm, and awight: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220810T1300). Please do the needful. [13:00:05] phuedx, MdsShakil, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:00:59] <_joe_> urandom: around? [13:01:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:01:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json [13:01:09] hi [13:01:11] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [13:01:29] MatmaRex: hello [13:01:59] _joe_: I'd not usually expect them around for another hour or so... [13:02:00] hi, i can deploy today [13:02:18] _joe_: i see some issues, should i wait with (MW) deployment? [13:02:32] _joe_: hnowlan is the other Cassandra person who actually knows stuff... 
[13:02:48] <_joe_> ok, hugh isn't here [13:02:54] <_joe_> so ok, I'll work on this [13:03:04] <_joe_> urbanecm: no idea sorry, ask the oncall people I guess [13:03:15] okay [13:03:16] PROBLEM - Cassandra instance data free space on restbase1022 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8274 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:03:21] (PS1) David Caro: osd.bootstrap_and_add: fix not-needed parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/822072 [13:03:47] jbond: ^^ (some icinga alerts re cassandra happening, i'd like to confirm carrying MW deployment out is ok) [13:04:26] urbanecm: we are trying to figure it out, bug we are unable to contact the people in the know [13:04:32] *but [13:04:55] (PS2) David Caro: cloudnet.show: add router info [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/822050 [13:04:57] (PS2) David Caro: osd.bootstrap_and_add: fix not-needed parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/822072 [13:05:02] PROBLEM - Cassandra instance data free space on restbase1028 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7898 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:05:09] <_joe_> we don't have grafana data of individual partitions? [13:05:44] urbanecm: fyi if you do get to deploying. The labweb hosts were showing errors earlier. As far as we worked out, they can be ignored. I'm waiting for cloud services to remove them from scap. They are mid decom. [13:05:46] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) (owner: Awight) [13:06:09] jynus: ack. i'll wait for now to be on the safe side. let me know if i can resume. 
[13:06:10] _joe_ afaics from restbase1022 the instance commit log dir shows a lot of files for today [13:06:16] *instances [13:06:43] _joe_: do you mean something like this? https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=restbase1028&var-datasource=thanos&var-cluster=restbase&from=now-30m&to=now&viewPanel=12 [13:07:11] so it is the instance-data partition filling up afaics [13:07:12] /srv/cassandra/instance-data [13:07:18] <_joe_> elukey: yes [13:07:33] <_joe_> elukey: I think it's due to turning off too many hosts in codfw [13:07:39] <_joe_> too soon [13:07:40] yeah it could be possible [13:07:51] a lot of data is being written to cassandra eqiad it seems [13:07:52] <_joe_> yeah I'm going to power them back on [13:07:58] PROBLEM - Cassandra instance data free space on restbase1029 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7988 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:07:58] it is going down quite fast- some host at 11% [13:08:17] <_joe_> can someone wake eric up and call hugh on the phone? 
thanks [13:08:20] +1 to power them up, blocking maintenance if needed [13:08:27] I will try [13:08:28] <_joe_> yes maintenance is blocked as of now [13:09:02] PROBLEM - Cassandra instance data free space on restbase1019 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7937 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:09:44] _joe_ lemme know if you want help in bringing up hosts [13:09:51] <_joe_> elukey: yes please [13:09:57] <_joe_> let's coordinate in private [13:10:01] ack [13:10:14] btw I see 81 instances up (between eqiad and codfw) [13:10:17] and 12 down [13:10:42] eric not answering, will try hugh now [13:10:54] (03PS1) 10Jbond: sretest: enable etcd defs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/822075 [13:11:40] (03CR) 10CI reject: [V: 04-1] osd.bootstrap_and_add: fix not-needed parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822072 (owner: 10David Caro) [13:12:00] PROBLEM - Cassandra instance data free space on restbase1028 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8323 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:12:06] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:33] <_joe_> !log powering on restbase2023 [13:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:52] (03CR) 10Jbond: [C: 03+2] sretest: enable etcd defs on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/822075 (owner: 10Jbond) [13:12:58] !log powering on restbase2026 [13:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:07] thanks jynus! 
[13:13:18] (03PS2) 10Filippo Giunchedi: mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) [13:13:20] (03PS1) 10Filippo Giunchedi: mtail: add -1 bucket to mediawiki_access_log [puppet] - 10https://gerrit.wikimedia.org/r/822077 (https://phabricator.wikimedia.org/T314922) [13:13:22] (03PS1) 10Vgutierrez: Release 9.1.3-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/822078 (https://phabricator.wikimedia.org/T309651) [13:13:28] hugh will be here in 7 minutes [13:14:13] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Not sure we need to mention the ticket number in two comments. 😋" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) (owner: 10Awight) [13:14:58] PROBLEM - Cassandra instance data free space on restbase1029 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7940 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:16:29] (03CR) 10CI reject: [V: 04-1] mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi) [13:16:56] (03PS1) 10Ssingh: Revert "Revert "Revert "Revert "Depool codfw for PDU upgrade"""" [dns] - 10https://gerrit.wikimedia.org/r/821742 [13:17:04] !log powering on restbase2027 [13:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:15] (03PS2) 10Ssingh: Depool codfw for PDU upgrade (row D) [dns] - 10https://gerrit.wikimedia.org/r/821742 (https://phabricator.wikimedia.org/T310146) [13:20:42] PROBLEM - Cassandra instance data free space on restbase1021 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8199 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:22:06] PROBLEM - Cassandra instance data free space on restbase1023 is CRITICAL: DISK CRITICAL - free
space: /srv/cassandra/instance-data 7747 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:22:48] PROBLEM - Cassandra instance data free space on restbase1020 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7920 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:22:59] o/ Sorry. Am here now [13:23:48] phuedx: deployment is on hold anyway [13:23:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [13:24:03] RhinosF1: Thanks [13:24:04] PROBLEM - Cassandra instance data free space on restbase1026 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7568 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:24:14] PROBLEM - Cassandra instance data free space on restbase1030 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8285 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:24:14] PROBLEM - Cassandra instance data free space on restbase1027 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8034 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:24:27] (03CR) 10Herron: [C: 03+1] o11y: fix logstash alerts to use 'datasource' grafana variable [alerts] - 10https://gerrit.wikimedia.org/r/821601 (owner: 10Filippo Giunchedi) [13:26:25] (03PS1) 10Jbond: P:base::firewall: enable defs on sretest [puppet] - 10https://gerrit.wikimedia.org/r/822079 [13:26:47] (03CR) 10FNegri: [C: 03+2] osd.bootstrap_and_add: fix not-needed parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822072 (owner: 10David Caro) [13:26:49] (03CR) 10Herron: [C: 03+1] 
logstash: clean up unneeded filters [puppet] - 10https://gerrit.wikimedia.org/r/820569 (owner: 10Cwhite) [13:27:42] PROBLEM - Cassandra instance data free space on restbase1019 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8330 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:27:53] (03PS1) 10Vgutierrez: mtail: Fix trafficserver_backend_client_ttfb histogram [puppet] - 10https://gerrit.wikimedia.org/r/822080 (https://phabricator.wikimedia.org/T309651) [13:27:56] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:48] PROBLEM - Cassandra instance data free space on restbase1024 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7322 MB (18% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:30:02] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: T310146 [13:30:07] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [13:30:19] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: T310146 [13:31:17] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu [13:31:20] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) 05Open→03In progress p:05Triage→03Medium [13:31:29] RECOVERY - Cassandra instance data free space on restbase1030 is OK: DISK OK - free space: /srv/cassandra/instance-data 13524 MB (33% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:31:30] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with 
reason: pdu [13:31:50] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu [13:32:03] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu [13:32:33] PROBLEM - Cassandra instance data free space on restbase1022 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8264 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:32:37] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu [13:32:51] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu [13:34:05] (03CR) 10Jbond: [C: 03+2] P:base::firewall: enable defs on sretest [puppet] - 10https://gerrit.wikimedia.org/r/822079 (owner: 10Jbond) [13:34:10] (03PS1) 10Btullis: Failback hive to primary server [dns] - 10https://gerrit.wikimedia.org/r/822081 (https://phabricator.wikimedia.org/T303168) [13:34:41] PROBLEM - Cassandra instance data free space on restbase1025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8157 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:36:20] urandom: hnowlan ^ [13:37:29] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:05] Amir1: ack [13:38:27] Thanks [13:39:21] PROBLEM - Cassandra instance data free space on restbase1026 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8246 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:39:41] PROBLEM - Cassandra instance data free space on restbase1027 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8290 MB (20% inode=99%): 
https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:42:06] (03CR) 10Btullis: [C: 03+2] Failback hive to primary server [dns] - 10https://gerrit.wikimedia.org/r/822081 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [13:42:14] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: fix logstash alerts to use 'datasource' grafana variable [alerts] - 10https://gerrit.wikimedia.org/r/821601 (owner: 10Filippo Giunchedi) [13:45:56] Deployment will happening today? [13:46:56] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10fgiunchedi) >>! In T314835#8141914, @fgiunchedi wrote: > Thank you @dcausse for diving deep into this issue and mitigating it! I can confirm tha... [13:47:29] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10fgiunchedi) I suspect this being related to {T309074} cc @andrea.denisse [13:47:41] PROBLEM - Cassandra instance data free space on restbase1023 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8353 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:49:29] PROBLEM - Cassandra instance data free space on restbase1019 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7945 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:50:26] going to shut down D6 dbs now [13:50:26] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6 [13:50:49] PROBLEM - Cassandra instance data free space on restbase1022 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8062 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:50:51] PROBLEM - Cassandra instance data free space on restbase1021 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8251 MB (20% 
inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:52:11] PROBLEM - Cassandra instance data free space on restbase1026 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8236 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:52:48] !log powered up restbase2018 [13:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:13] PROBLEM - Cassandra instance data free space on restbase1024 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8150 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:54:13] we didn't have the backport/config deployment, did we? [13:54:25] i guess the outage is ongoing [13:54:45] PROBLEM - Cassandra instance data free space on restbase1021 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8135 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:55:18] MatmaRex: no deployment happened [13:55:46] 10SRE, 10Performance-Team, 10Traffic-Icebox, 10Performance-Team-publish: Consider allowing H2 coalesce for upload.wikimedia.org for images used in wiki articles - https://phabricator.wikimedia.org/T116132 (10Krinkle) [13:55:51] 10SRE, 10Performance-Team, 10Traffic-Icebox, 10Performance-Team-publish: Consider allowing H2 coalesce for upload.wikimedia.org for images used in wiki articles - https://phabricator.wikimedia.org/T116132 (10Krinkle) a:03Krinkle [13:55:56] (03PS1) 10Jbond: P:base::firewall add confd prefix [puppet] - 10https://gerrit.wikimedia.org/r/822085 [13:55:57] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) a:05cmooney→03None [13:56:07] 10SRE, 10Performance-Team, 10Traffic-Icebox, 10Performance-Team-publish: Consider allowing H2 coalesce for upload.wikimedia.org for images used in wiki articles - 
https://phabricator.wikimedia.org/T116132 (10Krinkle) 05Stalled→03Declined [13:58:01] (03PS1) 10Btullis: Configure the new intermediate CA for etcd use [puppet] - 10https://gerrit.wikimedia.org/r/822086 (https://phabricator.wikimedia.org/T313129) [13:59:16] (03CR) 10Btullis: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36672/console" [puppet] - 10https://gerrit.wikimedia.org/r/822053 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [13:59:30] (03CR) 10Jbond: [C: 03+2] P:base::firewall add confd prefix [puppet] - 10https://gerrit.wikimedia.org/r/822085 (owner: 10Jbond) [13:59:43] (03CR) 10FNegri: [C: 03+2] "Tested, works nicely. 👍" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822050 (owner: 10David Caro) [13:59:45] PROBLEM - Cassandra instance data free space on restbase1026 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8190 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:00:17] (03CR) 10Btullis: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36673/console" [puppet] - 10https://gerrit.wikimedia.org/r/822053 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:00:33] PROBLEM - Cassandra instance data free space on restbase1029 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8236 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:02:39] RECOVERY - Cassandra instance data free space on restbase1016 is OK: DISK OK - free space: /srv/cassandra/instance-data 19023 MB (53% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:03:14] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10andrea.denisse) a:03andrea.denisse [14:03:43] 10SRE, 10Infrastructure-Foundations, 10netops: 
Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10andrea.denisse) Thanks @cmooney and @fgiunchedi , I'll work on this today. [14:04:37] PROBLEM - Cassandra instance data free space on restbase1028 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7561 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:05:00] !log flushing tables, restbase1016 [14:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] (03Merged) 10jenkins-bot: cloudnet.show: add router info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822050 (owner: 10David Caro) [14:06:44] (03Merged) 10jenkins-bot: osd.bootstrap_and_add: fix not-needed parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822072 (owner: 10David Caro) [14:07:35] PROBLEM - Cassandra instance data free space on restbase1027 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8105 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:08:07] RECOVERY - Cassandra instance data free space on restbase1018 is OK: DISK OK - free space: /srv/cassandra/instance-data 26235 MB (74% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:08:27] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 25930 MB (73% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:08:35] (03CR) 10Btullis: [V: 03+1 C: 03+2] "These pcc runs accidentally targeted the wrong hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/822053 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:09:07] RECOVERY - Cassandra instance data free space on restbase1026 is OK: DISK OK - free space: /srv/cassandra/instance-data 29341 MB (73% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:09:07] RECOVERY - Cassandra instance data free space on restbase1024 is OK: DISK OK - free space: /srv/cassandra/instance-data 29301 MB (73% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:09:39] PROBLEM - Cassandra instance data free space on restbase1019 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8044 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:09:51] RECOVERY - Cassandra instance data free space on restbase1028 is OK: DISK OK - free space: /srv/cassandra/instance-data 30608 MB (76% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:10:11] RECOVERY - Cassandra instance data free space on restbase1029 is OK: DISK OK - free space: /srv/cassandra/instance-data 29190 MB (73% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:11:26] !log flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029 [14:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:21] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4 [14:12:25] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [14:12:37] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:12:37] PROBLEM - Cassandra instance data free space on restbase1019 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7954 MB (19% inode=99%): 
https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:12:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4 [14:13:04] !log flushing Cassandra tables, restbase1019 [14:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:11] PROBLEM - Cassandra instance data free space on restbase1030 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7946 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:13:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4 [14:13:54] !log flushing Cassandra tables, restbase1030 [14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4 [14:14:09] RECOVERY - Cassandra instance data free space on restbase1019 is OK: DISK OK - free space: /srv/cassandra/instance-data 29659 MB (74% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:14:15] RECOVERY - Cassandra instance data free space on restbase1025 is OK: DISK OK - free space: /srv/cassandra/instance-data 27998 MB (70% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:14:34] (03PS1) 10Btullis: Add a dummy private key for the etcd intermediate CA [labs/private] - 10https://gerrit.wikimedia.org/r/822089 (https://phabricator.wikimedia.org/T313129) [14:14:45] RECOVERY - Cassandra instance data free space on restbase1030 is OK: DISK OK - free space: /srv/cassandra/instance-data 30819 MB (77% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:15:25] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: 
name=cp20[39|40]\.codfw\.wmnet,service=ats-tls [14:15:42] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add a dummy private key for the etcd intermediate CA [labs/private] - 10https://gerrit.wikimedia.org/r/822089 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:17:24] (03CR) 10Herron: [C: 03+1] logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite) [14:17:55] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:18:17] PROBLEM - Cassandra instance data free space on restbase1022 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7971 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:19:19] PROBLEM - Host ores2008 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:23] RECOVERY - Cassandra instance data free space on restbase1023 is OK: DISK OK - free space: /srv/cassandra/instance-data 31124 MB (78% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:19:31] ACKNOWLEDGEMENT - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
Brian_King PDU maintenance T310146 https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:19:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/822086 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:19:39] the ores2008 alert is downtime expired [14:19:53] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:20:03] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:20:05] RECOVERY - Cassandra instance data free space on restbase1022 is OK: DISK OK - free space: /srv/cassandra/instance-data 30621 MB (76% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:21:09] PROBLEM - Host ores2009 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:18] (03PS1) 10Elukey: ml-services: update articlequality's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/822090 (https://phabricator.wikimedia.org/T313915) [14:21:29] RECOVERY - Cassandra instance data free space on restbase1021 is OK: DISK OK - free space: /srv/cassandra/instance-data 29704 MB (74% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:21:49] (03CR) 10Ssingh: [C: 03+1] mtail: Fix trafficserver_backend_client_ttfb histogram [puppet] - 10https://gerrit.wikimedia.org/r/822080 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [14:21:55] RECOVERY - Cassandra instance data free space on restbase1027 is OK: DISK OK - free space: /srv/cassandra/instance-data 30132 MB (75% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:22:09] (03PS3) 10Ssingh: Depool codfw for PDU upgrade (row D) [dns] - 10https://gerrit.wikimedia.org/r/821742 (https://phabricator.wikimedia.org/T310146) [14:22:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:23:23] ^ this is expected because of dns2002. should go to dns2001 as it is anycasted [14:23:33] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap [14:23:36] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap [14:23:41] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) Check again and please resolve this task when done [14:23:43] PROBLEM - Host ml-serve2008 is DOWN: PING CRITICAL - Packet loss = 100% [14:23:43] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap [14:23:46] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap [14:23:49] !log depool codfw for PDU upgrade: rack D [14:23:50] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap [14:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:04] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap [14:24:05] (03CR) 10Ssingh: [C: 03+2] Depool codfw for PDU upgrade (row D) [dns] - 10https://gerrit.wikimedia.org/r/821742 (https://phabricator.wikimedia.org/T310146) (owner: 10Ssingh) [14:24:23] (03CR) 10Vgutierrez: [C: 03+2] mtail: Fix trafficserver_backend_client_ttfb histogram [puppet] - 10https://gerrit.wikimedia.org/r/822080 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [14:25:09] !log power off mc2033 [14:25:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:19] I have to shut down dbproxy2004, it might make cxserver to alert in codfw cc kart_ [14:25:48] !log power off mc-gp2003 [14:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:22] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance [14:27:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance [14:27:42] !log power off cp2039, cp2040 for PDU upgrade: rack D [14:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:53] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet [14:28:24] !log shutting down sessionstore2003 [14:28:25] !log power off kafka-main2004 gracefully [14:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:55] (03PS2) 10Elukey: ml-services: update articlequality's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/822090 (https://phabricator.wikimedia.org/T313915) [14:29:08] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10ssingh) [14:31:15] RECOVERY - Cassandra instance data free space on restbase1020 is OK: DISK OK - free space: /srv/cassandra/instance-data 27338 MB (68% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:32:03] (03PS1) 10Jbond: P:firewall: fix template errors [puppet] - 10https://gerrit.wikimedia.org/r/822091 (https://phabricator.wikimedia.org/T313825) [14:32:13] (KubernetesCalicoDown) firing: ml-serve2008.codfw.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:33:35] (03CR) 10Jbond: [C: 03+2] P:firewall: fix template errors [puppet] - 10https://gerrit.wikimedia.org/r/822091 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [14:33:43] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:34:03] PROBLEM - MariaDB Replica IO: s1 on db2094 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2173.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2173.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:35:31] I am guessing that is codfw sanitarium? [14:35:40] so we can ignore it? 
[14:36:01] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:36:22] (03CR) 10Elukey: [C: 03+2] ml-services: update articlequality's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/822090 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [14:36:30] (03CR) 10Elukey: [V: 03+2 C: 03+2] ml-services: update articlequality's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/822090 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [14:36:40] indeed it is, I will ack it [14:37:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sretest1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:38:23] !log disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- T314941 [14:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:28] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [14:39:18] urandom: what does the reserved space do out of curiosity? [14:40:03] (03CR) 10Andrew Bogott: [C: 03+2] dsh: remove old labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/821733 (https://phabricator.wikimedia.org/T313861) (owner: 10RhinosF1) [14:40:04] is it the ext4 5% reserved space? 
[14:40:11] elukey: yes [14:40:19] ahhh ack [14:40:45] hnowlan: it's reserved for the root user, it's meant to keep another user from so completely filling a volume that root can't fix it :) [14:41:24] it doesn't make sense for a volume like this that is dedicated to this purpose, and the reserved value was established a long time ago when devices were much smaller [14:41:41] PROBLEM - Host ps1-d4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:42:07] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) @Papaul sorry, I don't understand your comment, but I've rechecked, and there are still kernel log errors re `sdz` and the idrac still thinks there's one removed drive.... [14:42:15] PROBLEM - Host dns2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:29] jynus: I downtime it [14:43:15] I acked it already, not sure if you saw my message on -persistence [14:43:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146) [14:43:33] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [14:43:43] PROBLEM - Host cp2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:43] PROBLEM - Host cp2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146) [14:43:56] sorry I was checking dcops [14:44:07] PROBLEM - Host elastic2072.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:13] PROBLEM - Host mc2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:59] PROBLEM - Host mc2051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:59] PROBLEM - Host mc2052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:01] PROBLEM - Host kafka-main2004.mgmt is DOWN: PING CRITICAL - Packet loss = 
100% [14:45:18] (03PS3) 10Andrew Bogott: Move cloudweb100[12] to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/817385 (https://phabricator.wikimedia.org/T313861) [14:45:19] PROBLEM - Host mc-gp2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:19] PROBLEM - Host logstash2029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:20] (03PS2) 10Andrew Bogott: Remove puppet refs to labweb100[12] [puppet] - 10https://gerrit.wikimedia.org/r/817386 (https://phabricator.wikimedia.org/T313861) [14:45:29] PROBLEM - Host mw2289.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:29] PROBLEM - Host mw2288.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:29] PROBLEM - Host mw2290.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:39] PROBLEM - Host sessionstore2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:41] PROBLEM - Host mw2282.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:41] PROBLEM - Host mw2284.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:41] PROBLEM - Host mw2283.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:41] PROBLEM - Host mw2281.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:41] PROBLEM - Host mw2285.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:41] PROBLEM - Host mw2286.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:41] PROBLEM - Host mw2287.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:49] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:55] PROBLEM - Host ores2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:47:11] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2003 is CRITICAL: 62 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [14:47:47] (03PS1) 10Jbond: P:base::firewall: add file definition [puppet] - 
10https://gerrit.wikimedia.org/r/822095 [14:47:55] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 70 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [14:48:57] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 53 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [14:49:22] these ones are due to the host down, expected --^ [14:50:33] PROBLEM - Host elastic2085.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:50:33] PROBLEM - Host elastic2084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:52:28] (03CR) 10Jbond: [C: 03+2] P:base::firewall: add file definition [puppet] - 10https://gerrit.wikimedia.org/r/822095 (owner: 10Jbond) [14:53:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10bking) [14:55:20] (03PS1) 10Vgutierrez: mtail: Add additional buckets for haproxy TTFB metrics [puppet] - 10https://gerrit.wikimedia.org/r/822096 [14:56:06] (03PS2) 10Vgutierrez: mtail: Add additional buckets for haproxy TTFB metrics [puppet] - 10https://gerrit.wikimedia.org/r/822096 [14:58:03] (03CR) 10Ssingh: [C: 03+1] Release 9.1.3-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/822078 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [14:58:11] (03CR) 10Vgutierrez: [C: 03+2] Release 9.1.3-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/822078 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [14:58:49] RECOVERY - Host wdqs2006.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 34.24 ms 
[14:58:49] RECOVERY - Host ores2008.mgmt is UP: PING WARNING - Packet loss = 60%, RTA = 34.48 ms [14:59:39] RECOVERY - Host cp2040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.65 ms [14:59:59] RECOVERY - Host logstash2029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.83 ms [15:00:03] thanks elukey [15:00:33] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [15:00:37] RECOVERY - Host ores2008 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [15:00:59] PROBLEM - ores on ores2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [15:01:06] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap [15:01:07] PROBLEM - ores_workers_running on ores2008 is CRITICAL: PROCS CRITICAL: 2 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:01:07] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [15:01:09] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap [15:01:16] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap [15:01:17] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10bking) [15:01:19] !log jelto@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap [15:01:28] !log power off mc2034 [15:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:37] RECOVERY - Host dns2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [15:01:53] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [15:02:25] !log power off mc2035 [15:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:45] (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:47] I'm shutting down dbproxy2004 right now, this might cause cx alerts, couldn't grab hold of someone from lang team [15:02:55] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:03:07] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:03:07] RECOVERY - ores on ores2008 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [15:03:11] RECOVERY - Host cp2039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 39.62 ms [15:03:27] RECOVERY - ores_workers_running on ores2008 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:03:29] RECOVERY - Host elastic2085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.62 ms [15:03:29] RECOVERY - 
Host elastic2084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [15:03:29] RECOVERY - Host mw2281.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.52 ms [15:03:35] RECOVERY - Host elastic2072.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.61 ms [15:03:47] RECOVERY - Host mw2284.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.56 ms [15:03:47] RECOVERY - Host mc2052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.23 ms [15:03:51] RECOVERY - Host sessionstore2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [15:04:10] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10bking) [15:04:13] RECOVERY - Host mw2286.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.55 ms [15:04:25] RECOVERY - Host mc2051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.72 ms [15:04:27] RECOVERY - Host kafka-main2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.63 ms [15:04:43] RECOVERY - Host mc-gp2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [15:04:43] RECOVERY - Host mc2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [15:04:51] RECOVERY - Host mw2289.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms [15:04:51] RECOVERY - Host mw2288.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [15:04:51] RECOVERY - Host mw2290.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [15:04:57] RECOVERY - Host mw2282.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.55 ms [15:05:07] RECOVERY - Host mw2283.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [15:05:07] RECOVERY - Host mw2285.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [15:05:07] RECOVERY - Host mw2287.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [15:06:57] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:08:05] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/list/{tool} (Get the MT tool between two language pairs) is CRITICAL: Test Get the MT tool between two language pairs returned the unexpected status 503 (expecting: 200): /v1/list/{tool}/{from}/{to} (Get the MT tool between two language pairs) is CRITICAL: Test Get the MT tool between two language pairs returned the unexpected status 503 (expecting: 200): /v2/transl [15:08:05] m}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 503 (expecting: 200): /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an H [15:08:05] ment using TestClient, adapt the links to target language wiki. 
returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200 [15:08:05] o (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /_info/name (retrieve service name) is CRITICAL: Test retrieve service name returned the unexpected status 503 (expecting: 200): /_info/version (retrieve service version) is CRITICAL: Test retrieve service version returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [15:08:58] expected ^ [15:12:15] PROBLEM - Host rdb2010 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:21] (03CR) 10Vgutierrez: [C: 03+2] mtail: Add additional buckets for haproxy TTFB metrics [puppet] - 10https://gerrit.wikimedia.org/r/822096 (owner: 10Vgutierrez) [15:13:03] <_joe_> !log shutting down rdb2010,puppetmaster2002 for d5 maintenance [15:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:16] <_joe_> this ^^ means we'll have puppet failure [15:13:18] <_joe_> *s [15:13:27] PROBLEM - Host puppetmaster2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:31] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1052 site=eqiad tunnel=mc2034_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:13:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance [15:14:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance [15:14:47] <_joe_> !log power off krb2002 
[15:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:36] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet [15:16:37] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet [15:17:11] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet [15:17:11] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet [15:17:23] PROBLEM - Host logstash2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:17:41] PROBLEM - Host restbase2018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:17:57] PROBLEM - Host ores2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:20:03] PROBLEM - Host db2093.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:20:15] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:53] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet [15:20:53] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet [15:24:37] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10bking) [15:25:35] (03PS1) 10Jbond: hieradata: offline puppetmaster[12]002 ready for decomission [puppet] - 10https://gerrit.wikimedia.org/r/822103 [15:26:09] PROBLEM - Host ml-serve2008.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [15:26:27] PROBLEM - Host db2120.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:26:37] PROBLEM - Host db2129.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:26:39] PROBLEM - Host db2172.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:27:16] (03CR) 10Jbond: [C: 03+2] hieradata: offline puppetmaster[12]002 ready for decomission [puppet] - 10https://gerrit.wikimedia.org/r/822103 (owner: 10Jbond) [15:28:17] PROBLEM - Juniper alarms on asw-d-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:28:33] PROBLEM - Host parse2016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:28:33] PROBLEM - Host parse2017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:28:47] PROBLEM - Host puppetmaster2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:28:58] PROBLEM - Host rdb2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:09] PROBLEM - Host ganeti2017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:17] PROBLEM - Host gerrit2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:17] PROBLEM - Host gitlab-runner2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:21] PROBLEM - Host restbase2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:21] PROBLEM - Host restbase2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:35] PROBLEM - Host mc2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:35] PROBLEM - Host mc2034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:30:00] (03PS1) 10Jbond: puppetmaster: remove puppetmaster[12]002 for decom [puppet] - 10https://gerrit.wikimedia.org/r/822104 (https://phabricator.wikimedia.org/T314136) [15:30:01] PROBLEM - Host krb2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:30:17] (03CR) 10Jbond: [C: 03+2] puppetmaster: remove puppetmaster[12]002 for decom [puppet] - 10https://gerrit.wikimedia.org/r/822104 
(https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [15:30:47] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet [15:30:47] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet [15:30:49] PROBLEM - Host wdqs2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:31:25] PROBLEM - Host elastic2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:31:39] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:13] !log remove puppetmaster[12]002 from production [15:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:30] !log (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes -- T314941 [15:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:35] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [15:38:33] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [15:38:56] (03PS1) 10Jbond: P:sretest: test behaviour of empty define [puppet] - 10https://gerrit.wikimedia.org/r/822106 (https://phabricator.wikimedia.org/T313825) [15:40:47] RECOVERY - Host puppetmaster2002 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [15:40:57] RECOVERY - Host puppetmaster2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [15:41:05] RECOVERY - Host restbase2018.mgmt is UP: PING OK - Packet loss = 0%, 
RTA = 34.09 ms [15:41:13] RECOVERY - Host rdb2010 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [15:41:15] RECOVERY - Host ores2009 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [15:41:19] PROBLEM - IPMI Sensor Status on ores2009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:41:29] RECOVERY - Host parse2016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [15:41:29] RECOVERY - Host parse2017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [15:41:53] RECOVERY - Host rdb2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:41:59] RECOVERY - Host ml-serve2008 is UP: PING OK - Packet loss = 0%, RTA = 33.28 ms [15:42:05] RECOVERY - Host ganeti2017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 50.49 ms [15:42:13] RECOVERY - Host gerrit2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.37 ms [15:42:13] RECOVERY - Host gitlab-runner2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:42:17] RECOVERY - Host restbase2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.86 ms [15:42:17] RECOVERY - Host restbase2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.89 ms [15:42:19] RECOVERY - Juniper alarms on asw-d-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:42:31] RECOVERY - Host mc2034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.13 ms [15:42:31] RECOVERY - Host mc2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [15:42:45] PROBLEM - ores_workers_running on ores2009 is CRITICAL: PROCS CRITICAL: 71 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:42:45] PROBLEM - Cassandra instance data free space on restbase1016 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7419 MB (19% inode=99%): 
https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:42:57] RECOVERY - Host krb2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:43:43] RECOVERY - Host wdqs2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [15:43:51] RECOVERY - Host ores2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 47.59 ms [15:43:55] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:44:17] RECOVERY - Host elastic2036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [15:44:27] RECOVERY - Host db2120.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:44:52] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10MatthewVernon) I've seen auth failures with swift-ring-manager sometimes too on thanos, anecdotally associated with high load, but there's never... 
[15:45:05] RECOVERY - ores_workers_running on ores2009 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:45:07] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance [15:45:15] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance [15:45:29] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance [15:45:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:45:35] RECOVERY - Host ml-serve2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [15:45:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance [15:45:43] PROBLEM - Host ganeti2017 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:47] 10SRE, 10LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10BCornwall) 05In progress→03Resolved a:05BCornwall→03jbond [15:45:52] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance [15:45:57] RECOVERY - Host db2093.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.14 ms [15:46:01] RECOVERY - Host db2129.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms 
[15:46:03] PROBLEM - Host ldap-replica2006 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:03] RECOVERY - Host db2172.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [15:46:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance [15:46:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4 [15:46:15] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance [15:46:21] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [15:46:21] PROBLEM - Cassandra instance data free space on restbase1018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7446 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:46:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4 [15:46:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance [15:46:35] !log flushing tables in row A (RESTBase Cassandra cluster) -- T314941 [15:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:38] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [15:47:29] RECOVERY - Cassandra instance data free space on restbase1016 is OK: DISK OK - free space: /srv/cassandra/instance-data 25424 MB (68% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:47:47] (KubernetesCalicoDown) resolved: ml-serve2008.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:49:15] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7377 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:49:41] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance [15:49:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance [15:50:13] RECOVERY - Host ganeti2017 is UP: PING OK - Packet loss = 0%, RTA = 34.93 ms [15:50:17] PROBLEM - IPMI Sensor Status on ganeti2017 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:51:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows - https://phabricator.wikimedia.org/T314698 (10Dzahn) The file names with colons in them are not directly defined in puppet where we could have easily renamed them. The... [15:51:31] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:51:43] (03PS1) 10Eevans: cassanrdra: Increase hint delivery throughput [puppet] - 10https://gerrit.wikimedia.org/r/822110 (https://phabricator.wikimedia.org/T314941) [15:51:56] !log flushing tables in row B (RESTBase Cassandra cluster) -- T314941 [15:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:59] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [15:52:31] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/822110 (https://phabricator.wikimedia.org/T314941) (owner: 10Eevans) [15:53:07] !log poweroff cp2041, 42 for PDU upgrade: rack D7 [15:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:11] D7: Testing: DO not merge - https://phabricator.wikimedia.org/D7 [15:53:57] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 26101 MB (70% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:54:16] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10ssingh) [15:54:34] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet [15:54:34] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet [15:55:21] PROBLEM - Host aqs2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:29] PROBLEM - Host maps2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:29] PROBLEM - Host kubernetes2013.mgmt is DOWN: PING CRITICAL - Packet loss
= 100% [15:55:29] PROBLEM - Host kubernetes2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:56:28] (KubernetesCalicoDown) firing: (2) ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:56:57] PROBLEM - Host ml-serve2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:53] PROBLEM - Host aqs2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:53] PROBLEM - Host aqs2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:53] PROBLEM - Host aqs2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:33] PROBLEM - Host db2130.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:43] PROBLEM - Host dbproxy2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:47] PROBLEM - Juniper alarms on asw-d-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:59:19] (03PS2) 10Eevans: cassanrdra: Increase hint delivery throughput [puppet] - 10https://gerrit.wikimedia.org/r/822110 (https://phabricator.wikimedia.org/T314941) [16:01:01] PROBLEM - Host ganeti2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:01:19] PROBLEM - Host db2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:02:47] PROBLEM - Cassandra instance data free space on restbase1018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7085 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:04:32] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36675/console" [puppet] - 10https://gerrit.wikimedia.org/r/822086 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [16:07:27] RECOVERY - Cassandra instance data free space on restbase1018 is OK: DISK OK - free space: /srv/cassandra/instance-data 17735 MB (47% 
inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:07:39] (03CR) 10FNegri: [C: 03+2] Move cloudweb100[12] to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/817385 (https://phabricator.wikimedia.org/T313861) (owner: 10Andrew Bogott) [16:09:11] !log flushing tables in row D (RESTBase Cassandra cluster) -- T314941 [16:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:15] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [16:09:27] (03CR) 10Btullis: [V: 03+1 C: 03+2] Configure the new intermediate CA for etcd use [puppet] - 10https://gerrit.wikimedia.org/r/822086 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [16:10:52] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work [16:11:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work [16:11:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ba2eda0b-8bbe-4755-9a59-5480b01ae495) set by mvernon@cumin1001 for 1 day, 0:0... 
[16:12:25] (03CR) 10Jbond: [C: 03+2] P:sretest: test behaviour of empty define [puppet] - 10https://gerrit.wikimedia.org/r/822106 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [16:12:49] (03CR) 10Hnowlan: [C: 03+2] cassanrdra: Increase hint delivery throughput [puppet] - 10https://gerrit.wikimedia.org/r/822110 (https://phabricator.wikimedia.org/T314941) (owner: 10Eevans) [16:12:57] RECOVERY - Juniper alarms on asw-d-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:13:25] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org [16:13:56] !log reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: T309651 [16:13:57] RECOVERY - Host ganeti2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.03 ms [16:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:59] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [16:14:15] RECOVERY - Host db2140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:14:38] (03PS1) 10Krinkle: Remove redundant $wgLanguageConverterCacheType CLI override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822114 [16:14:53] RECOVERY - Host aqs2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [16:15:01] RECOVERY - Host maps2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.03 ms [16:15:01] RECOVERY - Host kubernetes2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms [16:15:01] RECOVERY - Host kubernetes2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [16:15:08] (03CR) 10Krinkle: "Demonstrative:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822114 (owner: 10Krinkle) [16:16:15] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:16:22] !log hnowlan@puppetmaster1001 
conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet [16:16:33] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet [16:17:18] RECOVERY - Host aqs2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [16:17:18] RECOVERY - Host aqs2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [16:17:18] RECOVERY - Host aqs2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [16:17:35] RECOVERY - Host ml-serve2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms [16:17:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:17:57] RECOVERY - Host db2130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [16:18:07] RECOVERY - Host dbproxy2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [16:18:20] (03CR) 10MVernon: [C: 03+1] "Useful update, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/822039 (https://phabricator.wikimedia.org/T314914) (owner: 10Filippo Giunchedi) [16:18:23] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:45] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:08] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/822040 (https://phabricator.wikimedia.org/T314914) (owner: 10Filippo Giunchedi) [16:21:28] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:22:20] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet [16:22:21] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet [16:23:01] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet [16:23:01] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet [16:23:08] !log shutting down gerrit2001 [16:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:31] PROBLEM - Host elastic2067 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:08] (03PS1) 10Jbond: Revert "P:sretest: test behaviour of empty define" [puppet] - 10https://gerrit.wikimedia.org/r/821743 [16:24:17] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:25] (03PS1) 10David Caro: p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870) [16:24:27] (03PS1) 10David Caro: p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) [16:24:28] (KubernetesCalicoDown) resolved: (2) kubernetes2013.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:24:29] (03PS1) 10David Caro: p:ceph::osd: also install 
the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870) [16:24:35] (03CR) 10Jbond: [C: 03+2] Revert "P:sretest: test behaviour of empty define" [puppet] - 10https://gerrit.wikimedia.org/r/821743 (owner: 10Jbond) [16:25:20] !log fnegri@cumin1001 START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org [16:25:23] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:47] (03PS1) 10Btullis: Use specific etcd intermediate CA to generate etcd certs in PKI mode [puppet] - 10https://gerrit.wikimedia.org/r/822118 (https://phabricator.wikimedia.org/T313129) [16:25:49] PROBLEM - Host cp2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:26:05] PROBLEM - Host elastic2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:26:05] PROBLEM - Host elastic2054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:26:05] PROBLEM - Host elastic2060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:26:05] PROBLEM - Host elastic2067.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:26:11] PROBLEM - Host elastic2086.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:26:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:26:59] PROBLEM - Host mc2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:01] PROBLEM - Host mc2054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:05] PROBLEM - Host kafka-main2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:05] PROBLEM - Host ms-be2050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:11] PROBLEM - Host ms-be2056.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:13] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox 
[16:27:15] PROBLEM - Host ms-be2059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:20] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36676/console" [puppet] - 10https://gerrit.wikimedia.org/r/822118 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [16:27:43] PROBLEM - Host ms-be2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:28:07] PROBLEM - Host thanos-be2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:28:18] (03CR) 10CI reject: [V: 04-1] p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:28:45] (03CR) 10CI reject: [V: 04-1] p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:29:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2003 is CRITICAL: 32 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [16:29:28] !log restarting Cassandra (RESTBase) -row A- to apply r822110 -- T314941 [16:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:31] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [16:29:54] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10MatthewVernon) Sorry, this is blocking on my having time to work on thanos (and that in turn is blocking on it being in a happy state to work on, complica... 
[16:30:13] (03PS3) 10Cathal Mooney: Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) [16:30:24] (03CR) 10CI reject: [V: 04-1] p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:30:54] !log kubectl uncordon kubernetes2013.codfw.wmnet [16:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:59] PROBLEM - Host cp2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:31:06] !log fnegri@cumin1001 START - Cookbook sre.dns.netbox [16:31:06] !log kubectl uncordon kubernetes2014.codfw.wmnet [16:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:23] PROBLEM - Host elastic2068 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:31] PROBLEM - Host elastic2068.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:31:58] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet [16:31:59] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet [16:32:42] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:32:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org [16:32:52] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: `gerrit2001.wikimedia.org` - gerrit2001.wikimedia.org (**... 
[16:33:24] 10SRE, 10MediaWiki-General, 10Traffic, 10Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) [16:34:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10RKemper) [16:34:52] (03CR) 10Cathal Mooney: "Thanks for the review John, hopefully in a bit better shape now." [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [16:35:03] (03PS1) 10Krinkle: Explicitly set wgMessageCacheType=mcrouter (avoid newAnything in prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822119 (https://phabricator.wikimedia.org/T186673) [16:36:45] (JobUnavailable) firing: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:37:41] PROBLEM - Host wdqs2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:40:49] (03CR) 10Jbond: "LGTM, if you haven't already please test before deploying" [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [16:40:56] (03CR) 10Jbond: [C: 03+1] Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [16:42:37] RECOVERY - IPMI Sensor Status on ores2009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:43:41] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 28 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [16:48:15] RECOVERY - Host elastic2053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.97 ms [16:48:51] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:17] RECOVERY - Host ms-be2050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.96 ms [16:49:25] RECOVERY - Host ms-be2059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:49:59] RECOVERY - Host ms-be2039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms [16:50:23] RECOVERY - Host thanos-be2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [16:50:49] PROBLEM - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:54:09] RECOVERY - Host cp2041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.07 ms [16:54:27] RECOVERY - Host elastic2054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [16:54:27] RECOVERY - Host elastic2060.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.97 ms [16:54:27] (03CR) 10Btullis: [V: 03+1 C: 03+2] Use specific etcd intermediate CA to generate etcd certs in PKI mode [puppet] - 10https://gerrit.wikimedia.org/r/822118 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [16:54:49] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [16:55:21] RECOVERY - Host wdqs2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.56 ms [16:55:29] RECOVERY - 
Host mc2053.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 33.86 ms [16:55:31] RECOVERY - Host kafka-main2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.06 ms [16:55:37] RECOVERY - Host ms-be2056.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [16:55:51] !log fnegri@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:55:51] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org [16:56:30] !log testing ATS 9.1.3-1wm1 on cp6016: T309651 [16:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:33] RECOVERY - Host elastic2067 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [16:56:33] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [16:56:37] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [16:56:47] RECOVERY - Host elastic2068.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.85 ms [16:57:41] RECOVERY - Host elastic2068 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [16:57:47] PROBLEM - Check systemd state on elastic2067 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:01] (03CR) 10Ottomata: analytics:refinery:job:data_purge: Add --allowed-interval to deletion jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns) [16:59:49] RECOVERY - Host cp2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.84 ms [17:00:07] RECOVERY - Host elastic2067.mgmt is UP: PING OK - Packet loss = 0%, 
RTA = 33.73 ms [17:00:13] RECOVERY - Host elastic2086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [17:00:43] RECOVERY - Host mc2054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [17:02:07] !log testing ATS 9.1.3-1wm1 on cp6008: T309651 [17:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:11] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [17:02:27] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7770 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:03:12] (03PS1) 10Btullis: Remove the bootstrap param from the dse-k8s-etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/822120 (https://phabricator.wikimedia.org/T313129) [17:04:01] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host parse2018 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:34] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org [17:05:02] (03CR) 10Btullis: [C: 03+2] Remove the bootstrap param from the dse-k8s-etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/822120 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [17:05:37] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance [17:05:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance [17:06:11] PROBLEM - Host conf2006 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:49] !log flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- T314941 [17:06:50] !log testing ATS 9.1.3-1wm1 on cp4032: T309651 [17:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:53] 
T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [17:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:31] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [17:08:08] !log otto@deploy1002 Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - T270433 - TEST [analytics/refinery@d4dd7e4] [17:08:09] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 26352 MB (70% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:08:11] T270433: Add logic to purging scripts that requires admin action if it's about to delete a lot of data - https://phabricator.wikimedia.org/T270433 [17:08:51] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1054 site=eqiad tunnel=mc2036_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [17:09:15] RECOVERY - Check systemd state on elastic2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:25] PROBLEM - Host restbase2023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:09:33] PROBLEM - Host conf2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:09:38] (03PS2) 10Dzahn: site: remove gerrit2001, merge gerrit1001/2002 regex [puppet] - 10https://gerrit.wikimedia.org/r/820250 (https://phabricator.wikimedia.org/T243027) [17:09:38] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [17:09:45] PROBLEM - Host mc2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:23] PROBLEM - Host krb2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:37] PROBLEM - Host db2152.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:49] PROBLEM - Host ganeti2018.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [17:11:10] PROBLEM - Host parse2018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:11:10] PROBLEM - Host parse2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:11:10] PROBLEM - Host parse2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:12:06] !log testing ATS 9.1.3-1wm1 on cp4026: T309651 [17:12:27] !log otto@deploy1002 Finished deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - T270433 - TEST [analytics/refinery@d4dd7e4] (duration: 04m 19s) [17:12:35] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:12:36] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gerrit2001.wikimedia.org [17:12:50] !log otto@deploy1002 Started deploy [analytics/refinery@d4dd7e4]: Add safety limits to refinery-drop-older-than - T270433 - [analytics/refinery@d4dd7e4] [17:13:03] PROBLEM - Host theemin.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:13:17] PROBLEM - Host db2131.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:13:21] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet [17:13:23] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet [17:13:25] PROBLEM - Host db2174.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:13:25] PROBLEM - Host db2173.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:13:25] PROBLEM - Host db2181.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:13:25] PROBLEM - Host db2182.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:14:15] PROBLEM - Host elastic2067 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:47] RECOVERY - Host elastic2067 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [17:15:05] PROBLEM - Host gerrit2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:15:25] PROBLEM - High 
average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [17:16:57] PROBLEM - Check systemd state on elastic2067 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:31] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:40] !log otto@deploy1002 Finished deploy [analytics/refinery@d4dd7e4]: Add safety limits to refinery-drop-older-than - T270433 - [analytics/refinery@d4dd7e4] (duration: 05m 50s) [17:18:41] RECOVERY - Check systemd state on elastic2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10BCornwall) [17:18:53] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [17:22:05] PROBLEM - Cassandra instance data free space on restbase1018 is CRITICAL: DISK CRITICAL - free 
space: /srv/cassandra/instance-data 7183 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:24:12] (03CR) 10Dzahn: [C: 03+2] "decom cookbook ran" [puppet] - 10https://gerrit.wikimedia.org/r/820250 (https://phabricator.wikimedia.org/T243027) (owner: 10Dzahn) [17:24:18] !log flushing RESTBase Cassandra tables -row D- to (temporarily) free instance-data space -- T314941 [17:26:07] PROBLEM - Host ms-be2069 is DOWN: PING CRITICAL - Packet loss = 100% [17:26:15] RECOVERY - Cassandra instance data free space on restbase1018 is OK: DISK OK - free space: /srv/cassandra/instance-data 25970 MB (69% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:26:25] PROBLEM - Host ms-be2037 is DOWN: PING CRITICAL - Packet loss = 100% [17:26:45] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [17:26:57] PROBLEM - Host kubernetes2014 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:07] RECOVERY - Host ms-be2037 is UP: PING OK - Packet loss = 0%, RTA = 30.96 ms [17:27:11] RECOVERY - Host ms-be2069 is UP: PING OK - Packet loss = 0%, RTA = 30.05 ms [17:27:17] RECOVERY - Host gerrit2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 351.20 ms [17:27:25] RECOVERY - Host restbase2023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 46.68 ms [17:27:27] PROBLEM - Host apifeatureusage2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:27] RECOVERY - Host db2131.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [17:27:27] RECOVERY - Host kubernetes2014 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [17:27:31] RECOVERY - Host conf2006.mgmt is UP: PING 
OK - Packet loss = 0%, RTA = 51.16 ms [17:27:35] RECOVERY - Host apifeatureusage2001 is UP: PING OK - Packet loss = 0%, RTA = 30.12 ms [17:27:37] RECOVERY - Host db2182.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.83 ms [17:27:43] RECOVERY - Host mc2036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [17:27:51] PROBLEM - Host elastic2067 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:01] PROBLEM - Host ms-be2039 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:05] RECOVERY - Host conf2006 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [17:28:31] PROBLEM - Host ms-be2050 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:39] PROBLEM - Host elastic2068 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:43] RECOVERY - Host db2152.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [17:28:49] RECOVERY - Host krb2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.36 ms [17:28:51] RECOVERY - Host parse2018 is UP: PING OK - Packet loss = 0%, RTA = 30.10 ms [17:28:55] RECOVERY - Host ms-be2039 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [17:28:55] RECOVERY - Host elastic2067 is UP: PING OK - Packet loss = 0%, RTA = 30.09 ms [17:28:59] RECOVERY - Host ganeti2018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.12 ms [17:29:05] RECOVERY - Host ms-be2050 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [17:29:11] RECOVERY - Host elastic2068 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [17:29:17] RECOVERY - Host parse2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [17:29:17] RECOVERY - Host parse2018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [17:29:17] RECOVERY - Host parse2020.mgmt 
is UP: PING OK - Packet loss = 0%, RTA = 33.64 ms [17:30:29] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [17:30:37] !log otto@deploy1002 Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - T270433 - TEST [analytics/refinery@6e47e0e] [17:30:42] T270433: Add logic to purging scripts that requires admin action if it's about to delete a lot of data - https://phabricator.wikimedia.org/T270433 [17:30:45] RECOVERY - Host theemin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [17:30:56] !log fnegri@cumin1001 START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org [17:31:09] RECOVERY - Host db2174.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [17:31:09] RECOVERY - Host db2181.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [17:31:09] RECOVERY - Host db2173.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [17:32:05] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:32:12] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Cmjohnson) Sorry about that I thought all the updates merged. 
As far as connected to ports, all of them are now connected with no port description... [17:33:03] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:07] (03PS1) 10BCornwall: admin: add Virginia Poundstone (vpoundstone) [puppet] - 10https://gerrit.wikimedia.org/r/822123 (https://phabricator.wikimedia.org/T314676) [17:33:25] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:34:29] (03CR) 10CI reject: [V: 04-1] admin: add Virginia Poundstone (vpoundstone) [puppet] - 10https://gerrit.wikimedia.org/r/822123 (https://phabricator.wikimedia.org/T314676) (owner: 10BCornwall) [17:34:52] (03PS1) 10Andrea Denisse: netmon: Use netmon1003's IP address for the librenms endpoint [puppet] - 10https://gerrit.wikimedia.org/r/822124 (https://phabricator.wikimedia.org/T309074) [17:34:56] !log otto@deploy1002 Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - T270433 - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s) [17:35:49] !log fnegri@cumin1001 START - Cookbook sre.dns.netbox [17:36:40] !log otto@deploy1002 Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - T270433 - [analytics/refinery@6e47e0e] [17:36:43] T270433: Add logic to purging scripts that requires admin action if it's about to delete a lot of data - https://phabricator.wikimedia.org/T270433 [17:37:24] (03PS2) 10BCornwall: admin: add Virginia Poundstone (vpoundstone) [puppet] - 10https://gerrit.wikimedia.org/r/822123 (https://phabricator.wikimedia.org/T314676) [17:39:37] (03PS1) 10Jbond: P:base::firewall: use either etc or abuse_nets 
[puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T314136) [17:39:49] !log fnegri@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:39:50] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org [17:40:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36677/console" [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [17:41:56] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) For a bit of context the above patch will augment the existing vars under network.interfaces, potentially ad... [17:42:09] !log otto@deploy1002 Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - T270433 - [analytics/refinery@6e47e0e] (duration: 05m 28s) [17:42:12] T270433: Add logic to purging scripts that requires admin action if it's about to delete a lot of data - https://phabricator.wikimedia.org/T270433 [17:44:31] (03PS2) 10Jbond: P:base::firewall: use either etc or abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T314136) [17:46:19] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) [17:46:27] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) 05In progress→03Resolved gerrit2002 is production https://gerrit-replica.wikimedia.org gerrit2001 is shut down and fully decom'ed. 
[17:47:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36678/console" [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [17:47:22] (03PS1) 10Andrea Denisse: netmon: Add the netmon1003 host to the alertmanager API rw [puppet] - 10https://gerrit.wikimedia.org/r/822126 (https://phabricator.wikimedia.org/T309074) [17:47:33] (03CR) 10Ssingh: [C: 03+1] "Verified that the information matches." [puppet] - 10https://gerrit.wikimedia.org/r/822123 (https://phabricator.wikimedia.org/T314676) (owner: 10BCornwall) [17:48:27] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:49:21] (03CR) 10BCornwall: [C: 03+2] admin: add Virginia Poundstone (vpoundstone) [puppet] - 10https://gerrit.wikimedia.org/r/822123 (https://phabricator.wikimedia.org/T314676) (owner: 10BCornwall) [17:49:53] (03Abandoned) 10Andrea Denisse: Revert "netmon: failover to netmon1003" [dns] - 10https://gerrit.wikimedia.org/r/821727 (owner: 10Andrea Denisse) [17:50:06] (03CR) 10FNegri: [C: 03+2] Remove puppet refs to labweb100[12] [puppet] - 10https://gerrit.wikimedia.org/r/817386 (https://phabricator.wikimedia.org/T313861) (owner: 10Andrew Bogott) [17:51:04] (03PS3) 10FNegri: Remove puppet refs to labweb100[12] [puppet] - 10https://gerrit.wikimedia.org/r/817386 (https://phabricator.wikimedia.org/T313861) (owner: 10Andrew Bogott) [17:51:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10BCornwall) The changes have been accepted. Virginia, your new access should be applied shortly. Please feel free to reopen if there are still... 
[17:53:51] RECOVERY - IPMI Sensor Status on ganeti2017 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:55:07] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:56:19] RECOVERY - MariaDB Replica IO: s1 on db2094 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:56:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10BCornwall) 05Open→03Resolved [17:57:32] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10BCornwall) p:05Triage→03Medium [17:57:42] (03PS3) 10Jbond: P:base::firewall: use either etc or abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T314136) [17:57:44] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Bump memcache connections and swift-proxy limits - https://phabricator.wikimedia.org/T314914 (10BCornwall) p:05Triage→03Medium [17:59:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36681/console" [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [17:59:42] (03PS4) 10Jbond: P:base::firewall: use either etc or abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T313825) [18:00:04] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220810T1800) [18:00:35] 10SRE, 10Traffic, 10observability, 10Patch-For-Review, 10Upstream: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (10BCornwall) p:05Triage→03Medium 
[18:00:47] 10SRE, 10ops-eqiad, 10Traffic: SSH on cp1089.mgmt is flapping - https://phabricator.wikimedia.org/T314951 (10ssingh) [18:02:11] 10SRE, 10ops-eqiad, 10Traffic: SSH on cp1089.mgmt is flapping - https://phabricator.wikimedia.org/T314951 (10ssingh) p:05Triage→03Medium [18:05:20] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/36680/" [puppet] - 10https://gerrit.wikimedia.org/r/822126 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [18:05:24] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool D8 DBs after PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json [18:05:33] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [18:05:33] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [18:06:02] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [18:07:09] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet [18:07:10] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet [18:12:50] (03CR) 10Jbond: "pcc https://puppet-compiler.wmflabs.org/pcc-worker1001/36683/" [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [18:13:27] !log truncating codfw Cassandra hints (eqiad datacenter) -- T314941 [18:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:31] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 
[18:13:31] (03PS1) 10Jforrester: inEventSample: Avoid invalid character warning from sampling code, hash into hex [extensions/WikiEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821745 (https://phabricator.wikimedia.org/T314896) [18:16:22] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:53] 10SRE: decom cookbook should ignore site.pp - https://phabricator.wikimedia.org/T314954 (10Dzahn) [18:19:08] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:21:52] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission labweb1001 and labweb1002 - https://phabricator.wikimedia.org/T313861 (10Andrew) a:05Andrew→03Cmjohnson These are ready for DC work. Note that the decom script failed to wipe the drives in labweb1001 so they still need to be wiped if... [18:22:25] !log truncating Cassandra hints (eqiad datacenter) -- T314941 [18:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:30] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [18:22:43] (03PS1) 10Ryan Kemper: elastic: racking info for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822129 (https://phabricator.wikimedia.org/T309810) [18:22:46] (03PS1) 10Stang: trwikiquote: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822130 (https://phabricator.wikimedia.org/T314895) [18:23:56] (03CR) 10Stang: "To deployer: please run createExtensionTables.php before merging this patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822130 (https://phabricator.wikimedia.org/T314895) (owner: 10Stang) [18:26:18] (03PS1) 10Andrew Bogott: role::wmcs::openstack::eqiad1::control: remove rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/822133 (https://phabricator.wikimedia.org/T314522) [18:26:20] (03PS1) 10Andrew 
Bogott: Make cloudcontrol1007 the primary glance server [puppet] - 10https://gerrit.wikimedia.org/r/822134 (https://phabricator.wikimedia.org/T313268) [18:28:50] (03PS1) 10Jbond: O:wikidough: drop wikidough abuse nets [puppet] - 10https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845) [18:30:27] (03PS2) 10Jbond: O:wikidough: drop wikidough abuse nets [puppet] - 10https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845) [18:31:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36685/console" [puppet] - 10https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845) (owner: 10Jbond) [18:34:16] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [18:36:28] (03CR) 10Majavah: "consider also https://gerrit.wikimedia.org/r/c/operations/puppet/+/800949" [puppet] - 10https://gerrit.wikimedia.org/r/822134 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [18:36:42] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye [18:38:11] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye [18:38:25] (03PS2) 10Andrew Bogott: Make cloudcontrol1007 the primary glance server [puppet] - 10https://gerrit.wikimedia.org/r/822134 (https://phabricator.wikimedia.org/T313268) [18:38:27] (03PS2) 10Andrew Bogott: role::wmcs::openstack::eqiad1::control: remove rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/822133 (https://phabricator.wikimedia.org/T314522) [18:38:29] (03PS1) 10Andrew Bogott: cloudcontrol100[34]: move to spare role [puppet] - 10https://gerrit.wikimedia.org/r/822141 (https://phabricator.wikimedia.org/T313268) [18:38:31] (03PS1) 10Andrew Bogott: Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 
(https://phabricator.wikimedia.org/T313268) [18:40:48] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) disk was bad it was replaced now you need to put the replaced disk back in the raid. [18:42:16] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [18:43:46] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10BCornwall) 05In progress→03Resolved @Jclark-ctr Since there's been no activity on this ticket for some time I'm going to go ahead and close it. Pl... [18:43:59] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "all hosts using this profile are bullseye. checked with:" [puppet] - 10https://gerrit.wikimedia.org/r/820656 (owner: 10Muehlenhoff) [18:44:01] (03CR) 10Bking: [V: 03+2] elastic: racking info for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822129 (https://phabricator.wikimedia.org/T309810) (owner: 10Ryan Kemper) [18:44:20] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10BCornwall) [18:46:04] (03CR) 10Dzahn: [V: 03+1 C: 03+2] gitlab_runner: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/820656 (owner: 10Muehlenhoff) [18:46:10] PROBLEM - Host cp2042 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:02] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:07] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:48:00] (03PS1) 10Andrew Bogott: Openstack Glance: monitor service on all nodes [puppet] - 10https://gerrit.wikimedia.org/r/822145 [18:48:42] (03CR) 10CI reject: [V: 04-1] Openstack Glance: monitor service on all nodes [puppet] - 
10https://gerrit.wikimedia.org/r/822145 (owner: 10Andrew Bogott) [18:49:09] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage [18:50:48] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage [18:51:50] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage [18:52:55] (03PS2) 10Andrew Bogott: Openstack Glance: monitor service on all nodes [puppet] - 10https://gerrit.wikimedia.org/r/822145 [18:54:10] 10SRE, 10ops-eqiad, 10Traffic: SSH on cp1089.mgmt is flapping - https://phabricator.wikimedia.org/T314951 (10wiki_willy) a:03Cmjohnson [18:55:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage [18:56:42] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Glance: monitor service on all nodes [puppet] - 10https://gerrit.wikimedia.org/r/822145 (owner: 10Andrew Bogott) [18:57:27] (03Abandoned) 10Andrew Bogott: Make cloudcontrol1007 the primary glance server [puppet] - 10https://gerrit.wikimedia.org/r/822134 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [18:57:41] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [18:59:19] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10MusikAnimal) >>! In T314789#8139432, @Legoktm wrote: > I would recommend... [19:06:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye [19:08:01] (03CR) 10Mforns: [C: 03+1] "Hi @Phuedx! Thanks for putting this together." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [19:08:21] (03PS3) 10Andrew Bogott: role::wmcs::openstack::eqiad1::control: remove rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/822133 (https://phabricator.wikimedia.org/T314522) [19:08:23] (03PS2) 10Andrew Bogott: cloudcontrol100[34]: move to spare role [puppet] - 10https://gerrit.wikimedia.org/r/822141 (https://phabricator.wikimedia.org/T313268) [19:08:25] (03PS2) 10Andrew Bogott: Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) [19:09:00] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye [19:09:53] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:14:35] (03CR) 10Andrew Bogott: [C: 03+2] role::wmcs::openstack::eqiad1::control: remove rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/822133 (https://phabricator.wikimedia.org/T314522) (owner: 10Andrew Bogott) [19:15:39] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [19:15:59] (03PS1) 10Ssingh: Revert "Depool codfw for PDU upgrade (row D)" [dns] - 10https://gerrit.wikimedia.org/r/822147 [19:19:08] (03CR) 10Ryan Kemper: [C: 03+2] elastic: racking info for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822129 (https://phabricator.wikimedia.org/T309810) (owner: 10Ryan Kemper) [19:19:17] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) Status update: I've reached out to techsupport@ as we had a thread on this previous to contacting SRE (#92751 on zendesk), hoping they can he... 
[19:21:11] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:21:56] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrol100[34]: move to spare role [puppet] - 10https://gerrit.wikimedia.org/r/822141 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [19:22:04] (03CR) 10Bking: [V: 03+2 C: 03+1] elastic: racking info for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822129 (https://phabricator.wikimedia.org/T309810) (owner: 10Ryan Kemper) [19:23:33] (03CR) 10Ssingh: [C: 03+2] Revert "Depool codfw for PDU upgrade (row D)" [dns] - 10https://gerrit.wikimedia.org/r/822147 (owner: 10Ssingh) [19:25:15] (03PS3) 10Andrew Bogott: Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) [19:25:17] (03PS1) 10Andrew Bogott: Remove cloudcontrol1003/1004 as openstack nodes [puppet] - 10https://gerrit.wikimedia.org/r/822167 (https://phabricator.wikimedia.org/T313268) [19:27:15] (03CR) 10Andrew Bogott: [C: 03+2] Remove cloudcontrol1003/1004 as openstack nodes [puppet] - 10https://gerrit.wikimedia.org/r/822167 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [19:28:23] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:28] !log testing ATS 9.1.3-1wm1 on cp4026: T309651 [19:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:31] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [19:30:53] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:51] RECOVERY - High average GET latency for mw requests on 
api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:34:23] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet [19:34:23] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet [19:34:35] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/36690/" [puppet] - 10https://gerrit.wikimedia.org/r/822124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [19:35:07] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet [19:35:08] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet [19:35:53] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet [19:35:54] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet [19:36:41] (03CR) 10Phuedx: Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [19:39:03] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:40:15] (03CR) 10Cwhite: [C: 03+2] logstash: clean up unneeded filters [puppet] - 10https://gerrit.wikimedia.org/r/820569 (owner: 10Cwhite) [19:40:21] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [19:48:55] (03CR) 10Cwhite: logstash route k8s logs from proxy,httpd containers to webrequest partition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite) [19:51:10] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet [19:51:11] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet [20:00:05] RoanKattouw, Urbanecm, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220810T2000) [20:00:05] zabe, ori, cjming, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] hi - i can deploy [20:00:21] o/ [20:00:35] zabe: are you around? [20:01:12] (03PS2) 10Clare Ming: Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820646 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:01:41] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:01] cjming, hi [20:02:12] hi! 
[20:02:16] (03CR) 10Clare Ming: [C: 03+2] Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820646 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:02:21] hi [20:03:09] (03Merged) 10jenkins-bot: Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820646 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:03:41] zabe: on mwdebug1002 if it's check-able [20:03:41] (03PS1) 10Ryan Kemper: elastic: allocate psi vs omega for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822169 (https://phabricator.wikimedia.org/T309810) [20:05:04] cjming, lgtm [20:05:10] great - syncing now [20:05:41] (03CR) 10Clare Ming: [C: 03+2] testwiki: set $wgCdnMatchParameterOrder to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822093 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [20:05:52] (03PS2) 10Clare Ming: testwiki: set $wgCdnMatchParameterOrder to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822093 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [20:07:24] cjming: my two patches can be staged together on mwdebug [20:07:35] ori - sounds good [20:07:47] (03CR) 10Clare Ming: [C: 03+2] Support CDN query parameter re-ordering [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821731 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [20:07:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:52] (03CR) 10Clare Ming: [C: 03+2] testwiki: set $wgCdnMatchParameterOrder to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822093 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [20:08:37] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820646|Start writing to cuc_actor everywhere except s4 and s8 (T233004)]] (duration: 03m 15s) [20:08:40] T233004: Update CheckUser for actor and 
comment table - https://phabricator.wikimedia.org/T233004 [20:08:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:09:29] (03Merged) 10jenkins-bot: testwiki: set $wgCdnMatchParameterOrder to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822093 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [20:09:42] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:10:03] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.097 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:11:29] ori: your 1st patch is up on mwdebug1002 -- still waiting for CI on your 2nd patch [20:11:51] ack, thanks [20:12:45] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:17:38] (03PS2) 10Clare Ming: Enable sticky header edit test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821319 (https://phabricator.wikimedia.org/T312573) [20:17:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:18:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:18:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:19:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:23:26] (03Merged) 10jenkins-bot: Support CDN query parameter re-ordering [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821731 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [20:23:50] finally :) I'll need ~2-3m to verify on mwdebug [20:24:07] ori: 
np - take your time [20:26:06] is the core change on mwdebug already? [20:26:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:26:27] ori: it is - sorry - on mwdebug1002 [20:26:43] np, thank you! will confirm in a moment [20:26:56] (03PS2) 10Ryan Kemper: elastic: allocate psi vs omega for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822169 (https://phabricator.wikimedia.org/T309810) [20:28:15] (03CR) 10Bking: [C: 03+1] elastic: allocate psi vs omega for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822169 (https://phabricator.wikimedia.org/T309810) (owner: 10Ryan Kemper) [20:28:27] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: rancid-differ.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:37] cjming: looks ok, but i'm tailing error.log on mwlog1002 and noticed a rash of these errors: [20:29:39] 2022-08-10 20:27:21 [dd20d49732fd6cd3f7b49d2e] mwmaint1002 nowiki 1.39.0-wmf.23 error ERROR: [dd20d49732fd6cd3f7b49d2e] [no req] PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource {"exception_url":"[no req]","reqId":"dd20d49732fd6cd3f7b49d2e","caught_by":"mwe_handler"} [20:29:41] [Exception ErrorException] (/srv/mediawiki/php-1.39.0-wmf.23/includes/libs/http/MultiHttpClient.php:292) PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource [20:29:54] can't see how it's related to my change but want to make sure it's expected? 
[20:30:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:30:23] ori: ya - i'm not sure what those are about [20:30:32] (03PS3) 10Ryan Kemper: elastic: allocate psi vs omega for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822169 (https://phabricator.wikimedia.org/T309810) [20:30:53] ok, looks like they've been happening for a while [20:30:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:30:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:05] my changes should be good to sync [20:31:06] ori: so ok to sync both? [20:31:10] yes [20:31:12] alrighty [20:31:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:33:41] looks like that PHP curl warning has a ticket https://phabricator.wikimedia.org/T288624 [20:33:53] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:56] but it's kinda old [20:34:42] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822093|testwiki: set $wgCdnMatchParameterOrder to false (T314868)]] (duration: 03m 20s) [20:34:47] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [20:34:57] (03CR) 10Bking: [C: 03+1] elastic: allocate psi vs omega for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822169 (https://phabricator.wikimedia.org/T309810) (owner: 10Ryan Kemper) [20:37:00] (JobUnavailable) firing: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:37:10] ori: i'm getting this error when trying to sync your 2nd patch: [20:37:22] php 
lint failed: [20:37:22] Parse error: syntax error, unexpected 'public' (T_PUBLIC), expecting variable (T_VARIABLE) in /srv/mediawiki-staging/php-1.39.0-wmf.23/vendor/symfony/console/Attribute/AsCommand.php on line 21 [20:37:22] Errors parsing /srv/mediawiki-staging/php-1.39.0-wmf.23/vendor/symfony/console/Attribute/AsCommand.php [20:38:23] I don't think it's related to my change. This came up recently, let me see if I can find the reference. [20:38:49] https://phabricator.wikimedia.org/T301344 [20:39:02] (03PS4) 10Ryan Kemper: elastic: allocate psi vs omega for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822169 (https://phabricator.wikimedia.org/T309810) [20:39:42] do i just run "scap sync-world" as the work around? [20:40:04] (03CR) 10Bking: [C: 03+1] elastic: allocate psi vs omega for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/822169 (https://phabricator.wikimedia.org/T309810) (owner: 10Ryan Kemper) [20:40:23] I guess so :/ [20:40:29] MatmaRex: sorry for delaying your changes [20:40:36] ^^ [20:41:08] np [20:43:09] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:25] !log [Elastic] T309810 Bringing in new eqiad elastic hosts [20:43:32] (03PS4) 10Andrew Bogott: Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) [20:43:34] (03PS1) 10Andrew Bogott: Galera monitoring: set expected node count to 3 [puppet] - 10https://gerrit.wikimedia.org/r/822170 (https://phabricator.wikimedia.org/T313268) [20:43:43] !log cjming@deploy1002 Started scap: Backport: [[gerrit:821731|Support CDN query parameter re-ordering (T138093 T314868)]] [20:44:32] ori: ok - seems to be syncing now [20:46:08] (03CR) 10Clare Ming: [C: 03+2] Enable sticky header edit test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821319 
(https://phabricator.wikimedia.org/T312573) (owner: 10Clare Ming) [20:46:57] (03CR) 10Andrew Bogott: [C: 03+2] Galera monitoring: set expected node count to 3 [puppet] - 10https://gerrit.wikimedia.org/r/822170 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [20:47:05] (03Merged) 10jenkins-bot: Enable sticky header edit test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821319 (https://phabricator.wikimedia.org/T312573) (owner: 10Clare Ming) [20:47:35] !log cjming@deploy1002 Finished scap: Backport: [[gerrit:821731|Support CDN query parameter re-ordering (T138093 T314868)]] (duration: 03m 52s) [20:47:54] ori: should be live! [20:48:10] zabe: meant to say your patch is live a while ago [20:48:40] thanks :) [20:48:50] cool, looks good. thank you once again [20:49:04] np! thanks for your patience [20:49:23] (03PS2) 10Clare Ming: Enable new topic tool on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820568 (https://phabricator.wikimedia.org/T313699) (owner: 10Esanders) [20:49:43] MatmaRex: onto yours [20:50:43] (03CR) 10Clare Ming: [C: 03+2] Enable new topic tool on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820568 (https://phabricator.wikimedia.org/T313699) (owner: 10Esanders) [20:51:29] (03Merged) 10jenkins-bot: Enable new topic tool on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820568 (https://phabricator.wikimedia.org/T313699) (owner: 10Esanders) [20:51:33] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:45] (03CR) 10Krinkle: [C: 03+1] Microsecond timestamp resolution in UDP logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820904 (owner: 10Tim Starling) [20:51:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:52:17] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: 
Config: [[gerrit:821319|Enable sticky header edit test on beta cluster (T312573)]] (duration: 03m 08s) [20:52:31] MatmaRex: 1st patch on mwdebug1002 [20:52:38] (03PS2) 10Clare Ming: Remove unused $wgEnableMWSuggest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820533 (owner: 10Bartosz Dziewoński) [20:52:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:52:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:53:18] cjming: looks good [20:53:26] cool syncing [20:53:33] cjming: the second is a no-op, nothing to test [20:53:36] 10SRE, 10MediaWiki-General, 10Traffic: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) [20:53:48] right on - i'll sync then [20:53:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:53:53] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:54:02] (03CR) 10Clare Ming: [C: 03+2] Remove unused $wgEnableMWSuggest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820533 (owner: 10Bartosz Dziewoński) [20:55:18] (03Merged) 10jenkins-bot: Remove unused $wgEnableMWSuggest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820533 (owner: 10Bartosz Dziewoński) [20:55:33] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:56:39] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820568|Enable new topic 
tool on dewiki (T313699)]] (duration: 03m 01s) [20:56:44] T313699: [Config Change] Enable New Topic Tool as opt-out at de.wiki (desktop) - https://phabricator.wikimedia.org/T313699 [20:57:39] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10Legoktm) >>! In T314789#8140256, @TheDJ wrote: > You can use lame and/or... [20:58:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:59:44] wmopbot and ircservserv-wm bots need to be restarted [20:59:54] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820533|Remove unused $wgEnableMWSuggest]] (duration: 03m 04s) [20:59:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:59:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:59:58] MatmaRex: both should be live [20:59:59] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:00:01] wmopbot just joined Sario [21:00:09] PROBLEM - Check systemd state on elastic1084 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:12] thanks cjming [21:00:14] Probably best asking Lego in -cloud for ircservserv-wm_ [21:00:15] PROBLEM - 
Check systemd state on elastic1094 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-eqiad.service,elasticsearch_6@production-search-omega-eqiad.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:24] np! [21:00:36] !log end of UTC late backport window [21:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:01:27] PROBLEM - Check systemd state on elastic1093 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-eqiad.service,elasticsearch_6@production-search-omega-eqiad.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:56] (03CR) 10Ori: "This change is now ready to go. 
Order-independent CDN URL matching landed in the production MediaWiki branch, and I turned it on for testw" [puppet] - 10https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [21:05:37] PROBLEM - Check systemd state on elastic1085 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-eqiad.service,elasticsearch_6@production-search-psi-eqiad.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:27] PROBLEM - Elasticsearch HTTPS for production-search-eqiad on elastic1090 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [21:07:21] PROBLEM - Check systemd state on elastic1087 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:22] ^ expected, these hosts don't get happy until their second puppet run [21:07:23] downtiming [21:07:43] RECOVERY - Elasticsearch HTTPS for production-search-eqiad on elastic1090 is OK: SSL OK - Certificate search.discovery.wmnet valid until 2027-01-23 13:10:52 +0000 (expires in 1626 days) https://wikitech.wikimedia.org/wiki/Search [21:08:43] PROBLEM - Check systemd state on elastic1090 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:30] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810 [21:09:34] T309810: Service implementation for elastic1[084-102].eqiad.wmnet - https://phabricator.wikimedia.org/T309810 [21:09:44] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810 [21:10:16] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: T309810 [21:10:19] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:10:39] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: T309810 [21:14:11] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:14:25] PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:14:59] (03PS2) 10Bartosz Dziewoński: Remove unused config for Echo notification emails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820546 (https://phabricator.wikimedia.org/T314604) [21:15:33] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:16:10] (03PS1) 10Ryan Kemper: elastic: fix bad copypaste [puppet] - 10https://gerrit.wikimedia.org/r/822173 (https://phabricator.wikimedia.org/T309810) [21:17:02] (03CR) 10Bking: [C: 03+1] elastic: fix bad copypaste [puppet] - 10https://gerrit.wikimedia.org/r/822173 (https://phabricator.wikimedia.org/T309810) (owner: 10Ryan Kemper) [21:18:48] (03CR) 10Ryan Kemper: [C: 03+2] 
elastic: fix bad copypaste [puppet] - 10https://gerrit.wikimedia.org/r/822173 (https://phabricator.wikimedia.org/T309810) (owner: 10Ryan Kemper) [21:20:33] RECOVERY - Check systemd state on elastic1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:45] RECOVERY - Check systemd state on elastic1094 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:21:22] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10andrea.denisse) @cmooney Found an interesting behavior regarding the 'rancid' user: `topranks The systemd file for rancid exports it as an environment var I think topranks... [21:22:31] (03PS1) 10Papaul: Add new PDU model for ps1-d[4-8]-codfw [puppet] - 10https://gerrit.wikimedia.org/r/822174 (https://phabricator.wikimedia.org/T310146) [21:23:03] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:23:21] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:41] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet [21:25:15] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet [21:28:03] (03CR) 10Papaul: [C: 03+2] Add new PDU model for ps1-d[4-8]-codfw [puppet] - 10https://gerrit.wikimedia.org/r/822174 (https://phabricator.wikimedia.org/T310146) (owner: 10Papaul) [21:29:17] RECOVERY - Check systemd state on elastic1093 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:41] RECOVERY - Check systemd state on elastic1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:51] RECOVERY - Check systemd state on elastic1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:45] (JobUnavailable) resolved: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:39:07] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) Definitely an odd issue. For comparisons sake we can see that netmon1002 was also trying to save the host key, but it continued after the failure, whereas netmon10... [21:44:17] RECOVERY - Check systemd state on elastic1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:37] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:49:54] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10bcampbell) @TAndic @jhathaway Thanks for the additional background. I was unaware that we had any SMTP relay rules set up for Qualtrics, but it loo... 
[22:02:31] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:51] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @bcampbell from their [[ https://www.qualtrics.com/support/survey-platform/distributions-module/email-distribution/using-a-custom-from-add... [22:10:14] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10bcampbell) @jhathaway qualtrics@wikimedia.org exists as a Google Group, but not a Google user. [22:11:46] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) >>! In T314815#8144089, @bcampbell wrote: > @jhathaway qualtrics@wikimedia.org exists as a Google Group, but not a Google user. how about... [22:11:51] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:24] (03PS1) 10Mary Yang: Add proxy_url to prometheus::blackbox::check:http as a parameter. [puppet] - 10https://gerrit.wikimedia.org/r/822179 (https://phabricator.wikimedia.org/T311457) [22:20:34] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) There was some discussion on irc and interesting observations from Daniel about changes to OpenSSH between buster and bullseye which might account for the different... [22:21:11] (03CR) 10Mary Yang: "Hello Filippo, is this what you suggested in https://phabricator.wikimedia.org/T311457? 
If so, I will send a followup patch for adding the" [puppet] - 10https://gerrit.wikimedia.org/r/822179 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [22:24:38] (03PS1) 10Mary Yang: Use proxy for wikifunctions beta blackbox probe. [puppet] - 10https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) [22:25:26] (03CR) 10CI reject: [V: 04-1] Use proxy for wikifunctions beta blackbox probe. [puppet] - 10https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [22:30:10] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10Dzahn) Confirmed this. The behaviour changed in the newer openssh version in bullseye it seems. On buster we have 7.9, on bullseye we have 8.4 In buster we have in `ssh.c`... [22:32:39] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:46] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10bcampbell) > how about, surveys@wikimedia.org? surveys@ is also a Google Group. I see that a user survey@ exists, though. [22:42:01] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:49] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:53] (03CR) 10Ori: Use proxy for wikifunctions beta blackbox probe. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [23:05:17] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:01] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:05] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:32:19] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:53] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:38:36] (03PS2) 10Mary Yang: Use proxy for wikifunctions beta blackbox probe. [puppet] - 10https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) [23:41:13] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state