[00:14:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:16:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:47:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:51:16] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:51:58] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:07:40] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 35 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:08:18] sigh
[05:09:30] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:51:20] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:53:04] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:15:02] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:15:56] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:16] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:02:37] (PS1) Majavah: Drop python2 flake8 runs [software/tools-webservice] - https://gerrit.wikimedia.org/r/697095
[09:03:14] (PS1) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975)
[09:26:28] (CR) Legoktm: [C: +1] Add python-build-bullseye image [docker-images/production-images] - https://gerrit.wikimedia.org/r/685462 (owner: Volans)
[09:40:05] (PS2) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975)
[09:40:49] (CR) jerkins-bot: [V: -1] Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: Majavah)
[10:12:34] (PS3) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975)
[10:56:33] (PS1) Majavah: Replace os.execv with subprocess.check_call [software/tools-webservice] - https://gerrit.wikimedia.org/r/697102
[10:57:12] (PS4) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975)
[10:57:21] (CR) jerkins-bot: [V: -1] Replace os.execv with subprocess.check_call [software/tools-webservice] - https://gerrit.wikimedia.org/r/697102 (owner: Majavah)
[10:57:52] (CR) jerkins-bot: [V: -1] Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: Majavah)
[11:00:44] (PS2) Majavah: Replace os.execv with subprocess.check_call [software/tools-webservice] - https://gerrit.wikimedia.org/r/697102
[11:00:46] (PS5) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975)
[11:02:45] (PS6) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975)
[11:10:00] (PS7) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975)
[11:20:18] (CR) Majavah: [C: -2] "Turns out some scripts are still executed on Python 2. Holding this -2 until those are updated." [software/tools-webservice] - https://gerrit.wikimedia.org/r/697095 (owner: Majavah)
[13:23:19] SRE, Wikimedia-Mailing-lists, Upstream: Strange Swedish date format in lists.wikimedia.org - https://phabricator.wikimedia.org/T283967 (Tacsipacsi) It’s https://gitlab.com/mailman/hyperkitty/-/issues/357 upstream (except for the capitalization, which is probably even “upper stream”, since the HyperKi...
[13:42:10] (PS2) Fomafix: Remove aliases 'minnan' and 'zh-cfr' [puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382)
[13:42:27] (PS2) Fomafix: Remove aliases 'minnan' and 'zh-cfr' [dns] - https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382)
[14:07:13] (PS1) Ladsgroup: dumps: Migrate rsync of ngingxlogs from cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673)
[14:08:04] (PS2) Ladsgroup: dumps: Migrate rsync of nginxlogs from cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673)
[14:08:44] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[14:13:40] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100%
[14:40:51] !log elukey@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=cp1087.eqiad.wmnet
[14:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:02] SRE, ops-eqiad, Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (elukey) Resolved→Open The issue came back, the host is down again :( ` ------------------------------------------------------------------------------- Record: 1019 D...
[14:44:39] !log execute apt-get clean on an-airflow1001 to free space
[14:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:21] (CR) Chico Venancio: [C: +1] "> Patch Set 2: Code-Review-2" [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[16:54:09] (PS1) Dzahn: add entrypoints, fix apache2 config, make httpd start and log to stderr [container/miscweb] - https://gerrit.wikimedia.org/r/697140 (https://phabricator.wikimedia.org/T281538)
[16:59:08] (CR) Dzahn: [C: +2] add entrypoints, fix apache2 config, make httpd start and log to stderr [container/miscweb] - https://gerrit.wikimedia.org/r/697140 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[16:59:36] (Merged) jenkins-bot: add entrypoints, fix apache2 config, make httpd start and log to stderr [container/miscweb] - https://gerrit.wikimedia.org/r/697140 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[17:05:52] (PS1) Dzahn: copy httpd.conf from staging to test and production variants [container/miscweb] - https://gerrit.wikimedia.org/r/697142 (https://phabricator.wikimedia.org/T281538)
[17:06:53] (CR) Dzahn: [C: +2] copy httpd.conf from staging to test and production variants [container/miscweb] - https://gerrit.wikimedia.org/r/697142 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[17:07:22] (Merged) jenkins-bot: copy httpd.conf from staging to test and production variants [container/miscweb] - https://gerrit.wikimedia.org/r/697142 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[18:04:10] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:31:27] (PS1) Umherirrender: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199)
[19:42:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[20:05:42] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:06:45] (PS2) Jforrester: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199) (owner: Umherirrender)
[20:30:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[21:14:28] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[21:47:22] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[21:54:35] (PS4) Jcrespo: Revert "Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only"""" [puppet] - https://gerrit.wikimedia.org/r/695830
[21:56:37] (CR) Jcrespo: [C: +2] Revert "Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only"""" [puppet] - https://gerrit.wikimedia.org/r/695830 (owner: Jcrespo)
[22:15:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup2002), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[22:19:50] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[23:03:57] SRE, Phabricator, User-Matthewrbowker: [Discussion] Phabricator has been declared EOL - https://phabricator.wikimedia.org/T283980 (Peachey88)
[23:08:10] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:29:54] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[23:31:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid