[00:04:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638231 (10phaultfinder) [00:10:37] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:10:48] (03PS1) 10BCornwall: Add dsantamaria to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1127996 (https://phabricator.wikimedia.org/T388693) [00:12:48] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10638247 (10BCornwall) Sorry for the confusion, @DSantamaria, but analytics-privatedata-users doesn't need an SSH key, so... [00:13:27] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10638248 (10BCornwall) [00:23:01] PROBLEM - Host dbprov2005 is DOWN: PING CRITICAL - Packet loss = 100% [00:35:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638269 (10phaultfinder) [00:38:55] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1127998 [00:38:55] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1127998 (owner: 10TrainBranchBot) [00:50:46] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1127998 (owner: 10TrainBranchBot) [00:51:02] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046 [00:51:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046 [01:09:11] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1127999 [01:09:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1127999 (owner: 10TrainBranchBot) [01:27:53] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1127999 (owner: 10TrainBranchBot) [01:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638334 (10phaultfinder) [01:46:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/e039462b8df2c820d4ace985b84434624e2a9c5a68bd62e60b8f29ba7d7f23b6/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638365 (10phaultfinder) [02:06:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:13:37] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:15:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638372 (10phaultfinder) [02:39:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638376 (10phaultfinder) [03:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:05:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638383 (10phaultfinder) [03:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638386 (10phaultfinder) [04:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638389 (10phaultfinder) [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:15:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638402 (10phaultfinder) [05:24:05] (03CR) 10Krinkle: Migrate x2 off LB config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125556 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [05:24:39] (03PS2) 10Krinkle: Migrate x2 off LB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125556 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:15:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638419 (10phaultfinder) [06:30:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:40:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:09:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638426 (10phaultfinder) [07:45:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638440 (10phaultfinder) [08:05:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638447 (10phaultfinder) [09:00:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638479 (10phaultfinder) [09:14:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638486 (10phaultfinder) [10:09:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638505 (10phaultfinder) [10:49:32] (03CR) 10TheDJ: [C:03+1] beta: remove obsolete $wgMwEmbedModuleConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127888 (https://phabricator.wikimedia.org/T100106) (owner: 10Hashar) [10:51:05] (03PS1) 10Aklapper: phabricator weekly changes email: Also exclude Security's Vuln-* tags [puppet] - 10https://gerrit.wikimedia.org/r/1128005 (https://phabricator.wikimedia.org/T387508) [11:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:34] (03CR) 10Aklapper: "Slyngshede: Would you know who could review this? Thanks in advance!" [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [12:44:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638602 (10phaultfinder) [13:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638650 (10phaultfinder) [13:37:26] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:38] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:39] PROBLEM - Hadoop DataNode on an-worker1194 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [13:50:39] RECOVERY - Hadoop DataNode on an-worker1194 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [14:09:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638768 (10phaultfinder) [15:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:47:55] (03PS12) 10Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [16:18:09] (03PS2) 10Kamila Součková: services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127947 (https://phabricator.wikimedia.org/T388390) [16:42:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:43:41] (03CR) 10Kamila Součková: "It is possible that the change to the CI made something flakey: in one of the CI runs, I got a random `in ./helmfile.yaml: no top-level co" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [16:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638859 (10phaultfinder) [18:00:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:11:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:25:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638923 (10phaultfinder) [18:28:05] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10638927 (10DavidEppstein) Possibly this needs to be reopened again? See similar behavior of certain images not being visible for som... [18:33:24] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10638929 (10DavidEppstein) We have similar behavior being reported again at https://en.wikipedia.org/wiki/Wikipedia:SVG_help#Rendering_issue –... [18:44:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638970 (10phaultfinder) [19:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:18] (03PS1) 10Majavah: P:wmcs: wikireplicas: Drop module_deps view [puppet] - 10https://gerrit.wikimedia.org/r/1128014 (https://phabricator.wikimedia.org/T388982) [20:42:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:23] (03CR) 10Jforrester: [C:03+1] P:wmcs: wikireplicas: Drop module_deps view [puppet] - 10https://gerrit.wikimedia.org/r/1128014 (https://phabricator.wikimedia.org/T388982) (owner: 10Majavah) [22:00:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639119 (10phaultfinder) [23:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:09:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639178 (10phaultfinder)