[00:35:03] (PS1) Tim Starling: Increase wgMaxUserDBWriteDuration to 10 on votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/713024 (https://phabricator.wikimedia.org/T288831)
[01:11:27] (PS8) Zoranzoki21: Add namespace aliases for hr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: Acamicamacaraca)
[01:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[02:05:43] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 33 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[03:49:05] (CR) VolkerE: [C: -1] Adding square logo and wordmark for Wikimania (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[03:51:41] (CR) Krinkle: [C: -1] "Question on task. Not sure if I missed something." [mediawiki-config] - https://gerrit.wikimedia.org/r/712548 (https://phabricator.wikimedia.org/T288702) (owner: Aaron Schulz)
[03:54:35] !log restarting mailman3 on lists1001, bounce runner crashed (T288880)
[03:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:45] T288880: Mailman3 bounce runner crashed: TypeError: unsupported operand type(s) for +: 'NoneType' and 'datetime.timedelta' - https://phabricator.wikimedia.org/T288880
[03:56:23] the queue cleared up, icinga should announce the recoveries in a minute
[03:57:01] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:57:13] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:32:40] (CR) Physikerwelt: [C: +1] "I don't have +2 here an no reviewer with +2 was added..." [mediawiki-config] - https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) (owner: Ppchelko)
[05:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[08:46:57] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[10:35:21] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[12:59:07] (PS1) Majavah: toolforge: Remove unmodified items from ingress-nginx values [puppet] - https://gerrit.wikimedia.org/r/713042
[13:01:18] I think the Debian mirror alert is because of the bullseye release.
[13:05:52] (PS10) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975)
[13:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[14:23:21] (CR) Urbanecm: Clean up temporary variable wgMathUseRestBase (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) (owner: Ppchelko)
[14:46:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 99 probes of 616 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:50:13] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 72 probes of 699 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:52:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 44 probes of 616 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:56:03] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 6 probes of 699 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:58:56] (PS1) Majavah: P::toolforge: Prepare apt repos for bullseye [puppet] - https://gerrit.wikimedia.org/r/713052 (https://phabricator.wikimedia.org/T284590)
[16:04:53] (PS1) Majavah: Add Bullseye based images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713053 (https://phabricator.wikimedia.org/T284590)
[16:12:07] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.28 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[16:20:17] (PS1) Majavah: kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590)
[16:20:57] (CR) jerkins-bot: [V: -1] kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:21:58] (PS2) Majavah: kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590)
[17:32:45] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.25 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[17:51:51] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[18:53:55] (PS10) Juan90264: Adding square logo and wordmark for Wikimania [mediawiki-config] - https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405)
[19:03:19] (CR) Juan90264: [C: +1] Adding square logo and wordmark for Wikimania (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[19:21:59] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:53:39] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:35:27] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.004321 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[20:47:29] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.01052 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[20:53:51] PROBLEM - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:17] PROBLEM - cassandra-a service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:55:23] PROBLEM - cassandra-a SSL 10.192.16.85:7001 on restbase2014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[20:55:33] PROBLEM - cassandra-a CQL 10.192.16.85:9042 on restbase2014 is CRITICAL: connect to address 10.192.16.85 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:15:07] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[21:19:23] RECOVERY - Check systemd state on restbase2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:19:49] RECOVERY - cassandra-a service on restbase2014 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:21:01] RECOVERY - cassandra-a SSL 10.192.16.85:7001 on restbase2014 is OK: SSL OK - Certificate restbase2014-a valid until 2022-10-08 10:53:35 +0000 (expires in 419 days) https://phabricator.wikimedia.org/T120662
[21:23:03] RECOVERY - cassandra-a CQL 10.192.16.85:9042 on restbase2014 is OK: TCP OK - 0.035 second response time on 10.192.16.85 port 9042 https://phabricator.wikimedia.org/T93886
[21:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[23:18:43] PROBLEM - snapshot of s3 in codfw on alert1001 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2021-08-11 22:48:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting