[00:04:49] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:49] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:59] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:34] (CR) Ryan Kemper: [C: +2] wdqs: enable the streaming updater on wdqs1006 [puppet] - https://gerrit.wikimedia.org/r/730819 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[02:19:25] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[02:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:14] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:57] SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Legoktm)
[05:33:22] SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Joe) >>! In T293530#7433190, @Legoktm wrote: > The one subtask I haven't filed yet because I haven't had the chance to verify it is having excimer be able to interrupt C functio...
[06:03:19] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:46] SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Legoktm) >>! In T293530#7433611, @Joe wrote: >>>! In T293530#7433190, @Legoktm wrote: >> The one subtask I haven't filed yet because I haven't had the chance to verify it is hav...
[06:13:14] (CR) Giuseppe Lavagetto: [C: -1] "the approach taken here is problematic. It unties the sets of different classes that are the minimum base for setting up a working product" [puppet] - https://gerrit.wikimedia.org/r/730836 (owner: Muehlenhoff)
[06:14:27] (CR) Giuseppe Lavagetto: [C: -1] "And just to clarify: the style guide says "one role per server", which means you can only declare one role in manifests/site.pp; it doesn'" [puppet] - https://gerrit.wikimedia.org/r/730836 (owner: Muehlenhoff)
[06:16:24] (CR) Giuseppe Lavagetto: [C: -1] "Again, this is pointless repetition and removal of composition based on a mistaken read of the puppet style guide." [puppet] - https://gerrit.wikimedia.org/r/731094 (owner: Muehlenhoff)
[06:24:22] SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (tstarling) I think the right way to interrupt slow network operations is with a select timeout. In this case there is MYSQLI_OPT_READ_TIMEOUT. It hasn't been necessary with MySQ...
[06:27:38] SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Joe) >>! In T293530#7433617, @tstarling wrote: > I think the right way to interrupt slow network operations is with a select timeout. In this case there is MYSQLI_OPT_READ_TIMEO...
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211016T0700)
[09:15:29] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:42] (PS1) Majavah: debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220
[12:18:06] (CR) jerkins-bot: [V: -1] debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220 (owner: Majavah)
[12:18:32] (PS2) Majavah: debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220
[12:19:13] (CR) jerkins-bot: [V: -1] debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220 (owner: Majavah)
[12:23:48] (PS3) Majavah: debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220
[12:24:37] (CR) jerkins-bot: [V: -1] debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220 (owner: Majavah)
[12:29:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[12:31:02] (PS4) Majavah: debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220
[12:34:17] (CR) Majavah: debian: Rename package to toolforge-webservice (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220 (owner: Majavah)
[12:34:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[12:34:55] PROBLEM - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[12:43:09] RECOVERY - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is OK: OK: Less than 1.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[15:46:29] PROBLEM - Disk space on aqs1012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-a 134716 MB (3% inode=99%): /srv/cassandra-b 168245 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1012&var-datasource=eqiad+prometheus/ops
[15:50:12] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T293557 (MediaJS)
[15:52:29] SRE, vm-requests: eqiad: 2 of VMs requested for mediawiki - https://phabricator.wikimedia.org/T293557 (MediaJS)
[15:53:41] SRE, vm-requests: eqiad: 2 of VMs requested for mediawiki - https://phabricator.wikimedia.org/T293557 (MediaJS) Open→Declined Never mind.
[16:15:18] SRE, vm-requests: eqiad: 2 of VMs requested for mediawiki - https://phabricator.wikimedia.org/T293557 (RhinosF1) a:akosiaris→None
[22:26:01] PROBLEM - Disk space on aqs1012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-b 134720 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1012&var-datasource=eqiad+prometheus/ops