[00:02:02] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:38] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1088664
[00:38:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1088664 (owner: TrainBranchBot)
[00:40:29] (PS4) Scott French: Add title-case mapping to support migration to PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603)
[00:44:19] (CR) Scott French: Add title-case mapping to support migration to PHP 8.1 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: Scott French)
[00:54:41] (CR) BCornwall: [C:+1] wikimedia.org: remove obsolete records for pay-lvs100[12].wm.org [dns] - https://gerrit.wikimedia.org/r/1088612 (owner: Ssingh)
[01:08:40] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1088666
[01:08:40] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1088666 (owner: TrainBranchBot)
[01:20:43] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1088664 (owner: TrainBranchBot)
[01:34:22] SRE, Thumbor: "Error: 500, Internal Server Error" during thumbnail generation - https://phabricator.wikimedia.org/T379426#10306114 (Reedy)
[01:34:30] SRE, Thumbor: "Error: 500, Internal Server Error" during thumbnail generation - https://phabricator.wikimedia.org/T379426#10306115 (Reedy) p:Triage→High
[01:40:36] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1088666 (owner: TrainBranchBot)
[02:14:06] SRE, Thumbor, Wikimedia-Incident: "Error: 500, Internal Server Error" during thumbnail generation - https://phabricator.wikimedia.org/T379426#10306119 (Krinkle)
[02:17:13] SRE, Thumbor, Wikimedia-Incident: "Error: 500, Internal Server Error" during thumbnail generation - https://phabricator.wikimedia.org/T379426#10306122 (Reedy) {F57690184 size=full}
[02:33:21] SRE, Thumbor, Wikimedia-Incident: "Error: 500, Internal Server Error" during thumbnail generation - https://phabricator.wikimedia.org/T379426#10306125 (Scott_French) There appears to have been a single thumbor pod in codfw that somehow got wedged (thumbor-main-549679978-74rbk), and had been returning...
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:45:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:46:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:54:32] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:55:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:56:54] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.817 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:22] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52923 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:22:40] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:02:54] PROBLEM - MD RAID on wikikube-worker1256 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:02:55] ACKNOWLEDGEMENT - MD RAID on wikikube-worker1256 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T379454 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:03:04] ops-eqiad, SRE, DC-Ops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454 (ops-monitoring-bot) NEW
[09:22:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:29:49] (Abandoned) Majavah: openstack: wikitech: Stop setting writable LDAP credentials [puppet] - https://gerrit.wikimedia.org/r/1042267 (https://phabricator.wikimedia.org/T367287) (owner: Majavah)
[11:46:00] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 147225520 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:47:00] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:54:18] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:54:20] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:15:56] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1088278 (owner: L10n-bot)
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:48:51] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[14:48:54] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[14:48:55] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:48:58] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:48:59] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[14:49:01] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:08:01] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:59:00] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:01:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:01:22] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:05:57] (PS1) Gergő Tisza: Fix warning about missing central account for temp users [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1088770 (https://phabricator.wikimedia.org/T378289)
[21:06:20] (PS1) Gergő Tisza: Check session provider when autocreating [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1088771 (https://phabricator.wikimedia.org/T378289)
[21:06:40] (CR) Reedy: [C:+1] "Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: Scott French)
[21:07:08] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1088770 (https://phabricator.wikimedia.org/T378289) (owner: Gergő Tisza)
[21:07:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1088771 (https://phabricator.wikimedia.org/T378289) (owner: Gergő Tisza)