[00:10:50] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:10:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140798
[00:10:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140798 (owner: TrainBranchBot)
[00:32:09] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140798 (owner: TrainBranchBot)
[01:11:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:05:50] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 47.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:59:00] ops-drmrs: Inbound errors on interface cr1-drmrs:xe-0/1/3 (Transit: Arelion (IC-370330) {#D0068}) - https://phabricator.wikimedia.org/T393228 (phaultfinder) NEW
[04:13:29] SRE-swift-storage, Commons, MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10787670 (Sreejithk2000) Hi Matthew, I am a commons admin. Please delete it from swift...
[04:16:34] PROBLEM - MD RAID on an-druid1003 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:16:35] ACKNOWLEDGEMENT - MD RAID on an-druid1003 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T393229 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:16:42] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229 (ops-monitoring-bot) NEW
[04:35:00] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:37:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:25:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:35:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:37:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:54:58] PROBLEM - Hadoop NodeManager on an-worker1190 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:57:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:06:58] RECOVERY - Hadoop NodeManager on an-worker1190 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:39:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10787710 (phaultfinder)
[06:45:38] (PS7) Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[06:45:38] (PS1) Jforrester: [BETA CLUSTER] Close en_rtlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1140976
[06:48:46] (PS2) Jforrester: [BETA CLUSTER] Close en_rtlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1140976
[06:48:47] (PS8) Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[06:58:34] (CR) Jforrester: "@hashar, we're closing your wiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/1140976 (owner: Jforrester)
[07:11:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:25:40] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:46:54] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1140492 (owner: PipelineBot)
[07:46:58] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1138389 (owner: PipelineBot)
[07:47:01] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1127763 (owner: PipelineBot)
[07:47:03] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1126708 (owner: PipelineBot)
[07:47:06] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1123458 (owner: PipelineBot)
[07:47:38] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1101139 (owner: PipelineBot)
[07:47:41] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1079342 (owner: PipelineBot)
[07:47:50] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1075005 (owner: PipelineBot)
[07:47:53] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1072896 (owner: PipelineBot)
[07:47:55] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1072001 (owner: PipelineBot)
[07:47:58] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1065911 (owner: PipelineBot)
[07:48:00] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1062487 (owner: PipelineBot)
[07:48:03] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1046678 (owner: PipelineBot)
[08:04:17] (PS1) Ilias Sarantopoulos: ml-services: disable gpu in edit-check [deployment-charts] - https://gerrit.wikimedia.org/r/1140992 (https://phabricator.wikimedia.org/T393154)
[08:05:12] (CR) Ilias Sarantopoulos: "the gpu has already manually been removed from the pod, so I'm just following up to update the config." [deployment-charts] - https://gerrit.wikimedia.org/r/1140992 (https://phabricator.wikimedia.org/T393154) (owner: Ilias Sarantopoulos)
[08:05:38] (CR) Krinkle: [C:-1] "These are meant to be static snapshots that never change for historical baseline performance analysis. Is something external/live relating" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138350 (https://phabricator.wikimedia.org/T379568) (owner: Cyndywikime)
[08:11:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:12:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:15:18] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:15:40] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:16:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:17:32] PROBLEM - Hadoop NodeManager on an-worker1195 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:20:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:20:40] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:21:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:24:32] RECOVERY - Hadoop NodeManager on an-worker1195 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:32:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:37:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:22:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:24:36] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10787970 (phaultfinder)
[09:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10788026 (phaultfinder)
[10:02:21] (PS1) Novem Linguae: core-Permissions: add move-subpages to enwiki templateeditor user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1141034 (https://phabricator.wikimedia.org/T393167)
[10:09:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:14:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:42:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:38:04] (PS1) Bunnypranav: Add checkuserwiki favicon [mediawiki-config] - https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246)
[13:12:05] (CR) Dreamy Jazz: [C:+1] Add checkuserwiki favicon [mediawiki-config] - https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: Bunnypranav)
[13:22:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:23:34] (CR) Bunnypranav: [C:+1] core-Permissions: add move-subpages to enwiki templateeditor user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1141034 (https://phabricator.wikimedia.org/T393167) (owner: Novem Linguae)
[13:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:43] (CR) Krinkle: [C:+2] tests: Fix dynamic property warning in DNSSRVRecordTest and DBRecordCacheTest [mediawiki-config] - https://gerrit.wikimedia.org/r/1140652 (owner: Krinkle)
[13:30:35] (Merged) jenkins-bot: tests: Fix dynamic property warning in DNSSRVRecordTest and DBRecordCacheTest [mediawiki-config] - https://gerrit.wikimedia.org/r/1140652 (owner: Krinkle)
[13:34:35] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[13:34:35] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140658 (owner: Krinkle)
[13:34:35] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140707 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[13:35:40] (Merged) jenkins-bot: multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() [mediawiki-config] - https://gerrit.wikimedia.org/r/1140653 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[13:35:42] (Merged) jenkins-bot: tests: Move buildLogoHTML.php to tests/ alongside buildConfigCache.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1140658 (owner: Krinkle)
[13:35:44] (Merged) jenkins-bot: multiversion: Separate wmf-config reading from actual Multiversion [mediawiki-config] - https://gerrit.wikimedia.org/r/1140707 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[13:36:47] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1140653|multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() (T169821)]], [[gerrit:1140658|tests: Move buildLogoHTML.php to tests/ alongside buildConfigCache.php]], [[gerrit:1140707|multiversion: Separate wmf-config reading from actual Multiversion (T169821)]]
[13:36:50] T169821: Try to make wmf-config/wgConf's per-wiki configuration cache redundant - https://phabricator.wikimedia.org/T169821
[13:41:28] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1140653|multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() (T169821)]], [[gerrit:1140658|tests: Move buildLogoHTML.php to tests/ alongside buildConfigCache.php]], [[gerrit:1140707|multiversion: Separate wmf-config reading from actual Multiversion (T169821)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:46:23] !log krinkle@deploy1003 krinkle: Continuing with sync
[13:53:10] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140653|multiversion: Remove getMWConfigForCacheing() as identical to getConfigGlobals() (T169821)]], [[gerrit:1140658|tests: Move buildLogoHTML.php to tests/ alongside buildConfigCache.php]], [[gerrit:1140707|multiversion: Separate wmf-config reading from actual Multiversion (T169821)]] (duration: 16m 22s)
[13:53:16] T169821: Try to make wmf-config/wgConf's per-wiki configuration cache redundant - https://phabricator.wikimedia.org/T169821
[14:13:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:18:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:45:30] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:04:30] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:33:18] RECOVERY - MD RAID on aqs1015 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:37:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:45:48] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:45:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:49:38] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:49:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:24:38] (PS1) Arturo Borrero Gonzalez: horizon: use VXLAN/IPv6-dualstack as default network for new VMs [puppet] - https://gerrit.wikimedia.org/r/1141188 (https://phabricator.wikimedia.org/T374824)
[16:30:50] (CR) Andrew Bogott: [C:+2] horizon: use VXLAN/IPv6-dualstack as default network for new VMs [puppet] - https://gerrit.wikimedia.org/r/1141188 (https://phabricator.wikimedia.org/T374824) (owner: Arturo Borrero Gonzalez)
[16:34:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - codfw - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[16:36:00] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:37:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:52:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:09] (PS1) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.3 (vanila)§:wq [deployment-charts] - https://gerrit.wikimedia.org/r/1141194 (https://phabricator.wikimedia.org/T393281)
[16:56:21] (PS2) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.3 (vanila) [deployment-charts] - https://gerrit.wikimedia.org/r/1141194 (https://phabricator.wikimedia.org/T393281)
[16:57:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:10:02] (PS3) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.3 (vanila) [deployment-charts] - https://gerrit.wikimedia.org/r/1141194 (https://phabricator.wikimedia.org/T393281)
[17:21:30] (PS1) Effie Mouzeli: cache.mcrouter: upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1141201 (https://phabricator.wikimedia.org/T393281)
[17:22:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:36:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:37:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:51:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:52:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:30:34] PROBLEM - Disk space on contint1002 is CRITICAL: DISK CRITICAL - free space: /srv 36422 MB (5% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=contint1002&var-datasource=eqiad+prometheus/ops
[18:35:53] !log delete a stuck thumbor pod in codfw
[18:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:23] (PS2) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1141201 (https://phabricator.wikimedia.org/T393281)
[18:43:33] (PS1) Majavah: P:microsites: peopleweb: Refresh image on front page [puppet] - https://gerrit.wikimedia.org/r/1141280
[18:44:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - codfw - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[18:49:27] (PS1) Effie Mouzeli: mcrouter: update modules [deployment-charts] - https://gerrit.wikimedia.org/r/1141282
[18:51:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:10:34] PROBLEM - Disk space on contint1002 is CRITICAL: DISK CRITICAL - free space: /srv 36422 MB (5% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=contint1002&var-datasource=eqiad+prometheus/ops
[19:12:53] (Abandoned) Effie Mouzeli: (WIP) services_proxy: rename keepalive upstream_idle_timeout [puppet] - https://gerrit.wikimedia.org/r/1003458 (owner: Effie Mouzeli)
[19:13:17] (Abandoned) Effie Mouzeli: tegola: Switch swift container to tegola-swift-codfw-v003 [deployment-charts] - https://gerrit.wikimedia.org/r/924112 (https://phabricator.wikimedia.org/T333318) (owner: Effie Mouzeli)
[19:13:59] (CR) Effie Mouzeli: [C:+1] mc: remove unused "memcached-pecl" definition from wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/1135529 (https://phabricator.wikimedia.org/T371378) (owner: Krinkle)
[19:14:57] (PS2) Effie Mouzeli: cronjobs: update modules (vanilla) [deployment-charts] - https://gerrit.wikimedia.org/r/1074165
[19:15:13] (PS5) Effie Mouzeli: cronjobs: update to 3.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1074168
[19:16:38] (CR) CI reject: [V:-1] cronjobs: update to 3.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1074168 (owner: Effie Mouzeli)
[19:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:43:18] (PS1) Majavah: P:microsites: peopleweb: Set content as utf-8 by default [puppet] - https://gerrit.wikimedia.org/r/1141311
[20:03:56] SRE, SEO: Index pl.wikinews in Google Publisher Center - https://phabricator.wikimedia.org/T393288#10789054 (Peachey88)
[20:07:51] PROBLEM - Host db1246 #page is DOWN: PING CRITICAL - Packet loss = 100%
[20:08:44] i'll depool
[20:09:10] !log taavi@cumin1002 dbctl commit (dc=all): 'depool db1246', diff saved to https://phabricator.wikimedia.org/P75739 and previous config saved to /var/cache/conftool/dbconfig/20250503-200910-taavi.json
[20:09:34] thanks
[20:09:39] acked the page
[20:11:30] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10789058 (Ladsgroup) Resolved→Open It paged again. https://xkcd.com/2083/
[21:22:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:40:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:50:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:37:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:18] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1141417
[23:38:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1141417 (owner: TrainBranchBot)
[23:41:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:50:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1141417 (owner: TrainBranchBot)