[00:19:42] RECOVERY - Disk space on elastic2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[01:36:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:38:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:52:24] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:16] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:34] (PS1) 4nn1l2: fawiki: Remove move-rootuserpages flag from users [mediawiki-config] - https://gerrit.wikimedia.org/r/756150 (https://phabricator.wikimedia.org/T299847)
[06:36:31] (PS1) 4nn1l2: fawiki: Exempt draft namespace from robots control by users [mediawiki-config] - https://gerrit.wikimedia.org/r/756152 (https://phabricator.wikimedia.org/T299850)
[06:52:42] (CR) Legoktm: [C: -1] "Also see my comment on the task." [puppet] - https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T265689) (owner: MichaelSchoenitzer)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220123T0800)
[08:22:52] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:24:10] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:51:48] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:53:08] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:03:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[11:08:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[11:16:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[11:18:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:29:21] (CR) MarcoAurelio: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/1164/" [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[11:30:30] (CR) MarcoAurelio: p::mediawiki::maintenance: Run recountCategories.php monthly on all wikis (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[15:04:08] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:05:14] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:01:04] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:09:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[19:10:10] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: miscweb1002, labstore1006, wdqs1010, build2001, labstore1007 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[20:03:39] (PS1) Andrew Bogott: cinder.conf: adjust backup file size based on performance testing [puppet] - https://gerrit.wikimedia.org/r/756176
[20:04:58] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:05:41] (CR) Andrew Bogott: [C: +2] cinder.conf: adjust backup file size based on performance testing [puppet] - https://gerrit.wikimedia.org/r/756176 (owner: Andrew Bogott)
[20:06:56] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:57:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[20:59:52] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[21:08:14] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:16:56] PROBLEM - Check systemd state on restbase2011 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:00] PROBLEM - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:17:02] PROBLEM - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:17:16] PROBLEM - cassandra-a SSL 10.192.32.152:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:17:16] PROBLEM - cassandra-b SSL 10.192.32.153:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:17:34] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:18:00] PROBLEM - MD RAID on restbase2011 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:18:01] ACKNOWLEDGEMENT - MD RAID on restbase2011 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T299871 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:18:04] SRE, ops-codfw: Degraded RAID on restbase2011 - https://phabricator.wikimedia.org/T299871 (ops-monitoring-bot)
[21:18:06] PROBLEM - cassandra-c SSL 10.192.32.154:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:18:06] PROBLEM - cassandra-a CQL 10.192.32.152:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.152 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:18:26] PROBLEM - cassandra-b service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:18:54] PROBLEM - cassandra-c service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:26:53] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics-test@fa62e75]: (no justification provided)
[21:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:02] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@fa62e75]: (no justification provided) (duration: 00m 09s)
[21:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:24] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, wdqs1010, labstore1006, miscweb1002, labstore1007 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[21:46:00] RECOVERY - cassandra-a service on restbase2011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:53:12] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:02:20] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics-test@37937f6]: (no justification provided)
[22:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:28] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@37937f6]: (no justification provided) (duration: 00m 08s)
[22:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:12] (PS2) Hashar: gerrit: port our theme to JavaScript [puppet] - https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T292759)
[22:31:53] (CR) Hashar: [C: -1] "I have got the result table to show by moving the Javascript code inside Gerrit.install(plugin => {}). Apparently the dom-module can no mo" [puppet] - https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T292759) (owner: Hashar)