[00:04:49] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:49] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:59] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:34] <wikibugs>	 (CR) Ryan Kemper: [C: +2] wdqs: enable the streaming updater on wdqs1006 [puppet] - https://gerrit.wikimedia.org/r/730819 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[02:19:25] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[02:19:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:14] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:56:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:57] <wikibugs>	 SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Legoktm)
[05:33:22] <wikibugs>	 SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Joe) >>! In T293530#7433190, @Legoktm wrote: > The one subtask I haven't filed yet because I haven't had the chance to verify it is having excimer be able to interrupt C functio...
[06:03:19] <icinga-wm>	 PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:46] <wikibugs>	 SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Legoktm) >>! In T293530#7433611, @Joe wrote: >>>! In T293530#7433190, @Legoktm wrote: >> The one subtask I haven't filed yet because I haven't had the chance to verify it is hav...
[06:13:14] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "the approach taken here is problematic. It unties the sets of different classes that are the minimum base for setting up a working product" [puppet] - https://gerrit.wikimedia.org/r/730836 (owner: Muehlenhoff)
[06:14:27] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "And just to clarify: the style guide says "one role per server", which means you can only declare one role in manifests/site.pp; it doesn'" [puppet] - https://gerrit.wikimedia.org/r/730836 (owner: Muehlenhoff)
[06:16:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Again, this is pointless repetition and removal of composition based on a mistaken read of the puppet style guide." [puppet] - https://gerrit.wikimedia.org/r/731094 (owner: Muehlenhoff)
[06:24:22] <wikibugs>	 SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (tstarling) I think the right way to interrupt slow network operations is with a select timeout. In this case there is MYSQLI_OPT_READ_TIMEOUT. It hasn't been necessary with MySQ...
[06:27:38] <wikibugs>	 SRE, DBA, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Joe) >>! In T293530#7433617, @tstarling wrote: > I think the right way to interrupt slow network operations is with a select timeout. In this case there is MYSQLI_OPT_READ_TIMEO...
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211016T0700)
[09:15:29] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:42] <wikibugs>	 (PS1) Majavah: debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220
[12:18:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220 (owner: Majavah)
[12:18:32] <wikibugs>	 (PS2) Majavah: debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220
[12:19:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220 (owner: Majavah)
[12:23:48] <wikibugs>	 (PS3) Majavah: debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220
[12:24:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220 (owner: Majavah)
[12:29:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[12:31:02] <wikibugs>	 (PS4) Majavah: debian: Rename package to toolforge-webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220
[12:34:17] <wikibugs>	 (CR) Majavah: debian: Rename package to toolforge-webservice (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/731220 (owner: Majavah)
[12:34:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[12:34:55] <icinga-wm>	 PROBLEM - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[12:43:09] <icinga-wm>	 RECOVERY - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is OK: OK: Less than 1.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[15:46:29] <icinga-wm>	 PROBLEM - Disk space on aqs1012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-a 134716 MB (3% inode=99%): /srv/cassandra-b 168245 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1012&var-datasource=eqiad+prometheus/ops
[15:50:12] <wikibugs>	 SRE, vm-requests: <site>: <number> of VMs requested for <service> - https://phabricator.wikimedia.org/T293557 (MediaJS)
[15:52:29] <wikibugs>	 SRE, vm-requests: eqiad: 2 of VMs requested for mediawiki - https://phabricator.wikimedia.org/T293557 (MediaJS)
[15:53:41] <wikibugs>	 SRE, vm-requests: eqiad: 2 of VMs requested for mediawiki - https://phabricator.wikimedia.org/T293557 (MediaJS) Open→Declined Never mind.
[16:15:18] <wikibugs>	 SRE, vm-requests: eqiad: 2 of VMs requested for mediawiki - https://phabricator.wikimedia.org/T293557 (RhinosF1) a:akosiaris→None
[22:26:01] <icinga-wm>	 PROBLEM - Disk space on aqs1012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-b 134720 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1012&var-datasource=eqiad+prometheus/ops