[00:19:42] RECOVERY - Disk space on elastic2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[01:36:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:38:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:52:24] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:16] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:34] (PS1) 4nn1l2: fawiki: Remove move-rootuserpages flag from users [mediawiki-config] - https://gerrit.wikimedia.org/r/756150 (https://phabricator.wikimedia.org/T299847)
[06:36:31] (PS1) 4nn1l2: fawiki: Exempt draft namespace from robots control by users [mediawiki-config] - https://gerrit.wikimedia.org/r/756152 (https://phabricator.wikimedia.org/T299850)
[06:52:42] (CR) Legoktm: [C: -1] "Also see my comment on the task." [puppet] - https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T265689) (owner: MichaelSchoenitzer)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220123T0800)
[08:22:52] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:24:10] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:51:48] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:53:08] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:03:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[11:08:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[11:16:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[11:18:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:29:21] (CR) MarcoAurelio: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/1164/" [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[11:30:30] (CR) MarcoAurelio: p::mediawiki::maintenance: Run recountCategories.php monthly on all wikis (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[15:04:08] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:05:14] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:01:04] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:09:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[19:10:10] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: miscweb1002, labstore1006, wdqs1010, build2001, labstore1007 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[20:03:39] (PS1) Andrew Bogott: cinder.conf: adjust backup file size based on performance testing [puppet] - https://gerrit.wikimedia.org/r/756176
[20:04:58] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:05:41] (CR) Andrew Bogott: [C: +2] cinder.conf: adjust backup file size based on performance testing [puppet] - https://gerrit.wikimedia.org/r/756176 (owner: Andrew Bogott)
[20:06:56] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:57:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[20:59:52] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[21:08:14] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:16:56] PROBLEM - Check systemd state on restbase2011 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:00] PROBLEM - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:17:02] PROBLEM - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:17:16] PROBLEM - cassandra-a SSL 10.192.32.152:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:17:16] PROBLEM - cassandra-b SSL 10.192.32.153:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:17:34] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:18:00] PROBLEM - MD RAID on restbase2011 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:18:01] ACKNOWLEDGEMENT - MD RAID on restbase2011 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T299871 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:18:04] SRE, ops-codfw: Degraded RAID on restbase2011 - https://phabricator.wikimedia.org/T299871 (ops-monitoring-bot)
[21:18:06] PROBLEM - cassandra-c SSL 10.192.32.154:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:18:06] PROBLEM - cassandra-a CQL 10.192.32.152:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.152 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:18:26] PROBLEM - cassandra-b service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:18:54] PROBLEM - cassandra-c service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:26:53] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics-test@fa62e75]: (no justification provided)
[21:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:02] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@fa62e75]: (no justification provided) (duration: 00m 09s)
[21:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:24] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, wdqs1010, labstore1006, miscweb1002, labstore1007 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[21:46:00] RECOVERY - cassandra-a service on restbase2011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:53:12] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:02:20] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics-test@37937f6]: (no justification provided)
[22:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:28] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@37937f6]: (no justification provided) (duration: 00m 08s)
[22:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:12] (PS2) Hashar: gerrit: port our theme to JavaScript [puppet] - https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T292759)
[22:31:53] (CR) Hashar: [C: -1] "I have got the result table to show by moving the Javascript code inside Gerrit.install(plugin => {}). Apparently the dom-module can no mo" [puppet] - https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T292759) (owner: Hashar)