[00:12:05] (03PS3) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774) [00:17:51] (03CR) 10DDesouza: "It looks fine though I'm not familiarized with this codebase." [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [00:21:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990676 [00:38:57] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990676 (owner: 10TrainBranchBot) [00:43:09] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:51] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:57] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:58:15] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990676 (owner: 10TrainBranchBot) [00:59:06] welp, I will ACK and depool [00:59:15] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes 
(http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:00:22] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:02:16] weird, everything looks fine though [01:02:57] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:03:16] https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview?orgId=1&var-cluster=ulsfo%20prometheus%2Fops not a significant increase yeah [01:03:20] ok that resolved itself [01:03:50] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T355098 (10phaultfinder) [01:04:15] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:04:39] anyway, it's for tomorrow now [01:40:22] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:40:57] (ProbeDown) firing: Service 
ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:42:00] (03PS1) 10AntiCompositeNumber: Add global_edit_count to fullviews [puppet] - 10https://gerrit.wikimedia.org/r/990790 (https://phabricator.wikimedia.org/T344108) [01:44:15] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:45:57] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:33:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:38:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:39:16] (JobUnavailable) firing: (2) Reduced availability 
for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0300) [03:05:22] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:05:58] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:07:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.14 [core] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990677 (https://phabricator.wikimedia.org/T354432) [03:07:50] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.14 [core] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990677 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [03:09:15] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:09:16] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:58] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:14:15] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:21:47] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [03:23:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:23:19] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [03:27:35] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.14 [core] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990677 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [03:33:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:38:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:43:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:45:22] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:49:15] (ProbeDown) firing: (2) Service 
ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:53:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:54:15] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:58:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0400) [04:25:22] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[04:29:15] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:30:23] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:34:15] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:42:31] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:43:37] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:48:19] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:48:45] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:09:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:19:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:24:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:29:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:34:26] (03PS8) 10KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) [06:34:30] (03PS1) 10Andrea Denisse: grafana: Create Grafana sysuser and home directory [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) [06:46:01] (03PS2) 10Andrea Denisse: grafana: Create Grafana sysuser and home directory [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0700) [07:00:05] 
kormat, marostegui, and Amir1: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0700). [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:16] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [08:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:04:49] (03PS1) 10Slyngshede: Netfilter, minor improvements to alerts. [alerts] - 10https://gerrit.wikimedia.org/r/990943 [08:05:52] (03CR) 10Slyngshede: "Fix issues highlighted by Taavi." 
[alerts] - 10https://gerrit.wikimedia.org/r/990943 (owner: 10Slyngshede) [08:10:10] (03CR) 10Brouberol: [C: 03+1] Update statsd-exporter mappings for Airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/990688 (https://phabricator.wikimedia.org/T343232) (owner: 10Aqu) [08:28:49] (03CR) 10Muehlenhoff: grafana: Create Grafana sysuser and home directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [08:31:37] (03CR) 10Muehlenhoff: Bump version number to 0.0.4 (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede) [08:34:17] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10MoritzMuehlenhoff) >>! In T352974#9460261, @ABran-WMF wrote: > I ran the following test: with a custom PKI, Nice! Out of interest, which PKI t... [08:37:33] (03PS2) 10Slyngshede: Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 [08:38:13] (03CR) 10Slyngshede: Bump version number to 0.0.4 (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede) [08:39:48] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) >>! In T352974#9461185, @MoritzMuehlenhoff wrote: > Nice! Out of interest, which PKI tool did you use for your tests? As a next step... [08:49:00] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10MoritzMuehlenhoff) >>! In T352974#9461200, @ABran-WMF wrote: >>>! In T352974#9461185, @MoritzMuehlenhoff wrote: >> Nice! Out of interest, which... 
[08:51:56] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [08:56:40] (03CR) 10Muehlenhoff: [C: 03+1] Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede) [08:57:09] (03PS9) 10Slyngshede: Changes to Python infrastucture to help building Debian package. [software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 [08:57:57] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990946 (https://phabricator.wikimedia.org/T354432) [08:57:59] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990946 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [08:58:42] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990946 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [08:59:05] !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.14 refs T354432 [08:59:09] T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432 [08:59:30] (03PS1) 10Muehlenhoff: Remove access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/990947 [09:00:04] jnuche and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0900). 
nyaa~ [09:02:27] (03CR) 10Jelto: [V: 03+1 C: 03+1] trafficserver: switch design.wikimedia.org to wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [09:02:48] (03PS1) 10Majavah: P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) [09:03:55] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Daniram3 out of all services on: 2211 hosts [09:03:55] !log reprepro: Copy grafana v9.4.14 from buster to bookworm [09:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:20] !log reprepro: Copy grafana v9.4.14 from buster to bookworm - T352665 [09:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:24] T352665: Upgrade Grafana hosts to Bookworm - https://phabricator.wikimedia.org/T352665 [09:05:05] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Daniram3 out of all services on: 2211 hosts [09:05:13] (03CR) 10Slyngshede: Changes to Python infrastucture to help building Debian package. 
(032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 (owner: 10Slyngshede) [09:05:26] (03CR) 10Slyngshede: [C: 03+2] Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede) [09:05:30] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede) [09:06:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/990947 (owner: 10Muehlenhoff) [09:07:23] (03CR) 10David Caro: [C: 03+1] P:openstack: nova::compute: restart libvirt api after changing TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/990724 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah) [09:08:15] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@24065.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:46] (03CR) 10David Caro: P:openstack: nova::compute: include certificate chain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah) [09:09:44] (03PS2) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990661 (https://phabricator.wikimedia.org/T349619) [09:09:52] (03PS2) 10Majavah: P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) [09:10:22] (03PS3) 10Majavah: P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) [09:10:36] (03CR) 10Majavah: P:openstack: nova::compute: include certificate chain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah) [09:11:36] (03CR) 10Majavah: [V: 
03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1122/co" [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah) [09:13:04] (03CR) 10Majavah: [C: 03+2] P:openstack: nova::compute: restart libvirt api after changing TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/990724 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah) [09:14:40] (03PS4) 10Majavah: P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) [09:18:13] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline and below" [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:18:28] (03CR) 10Majavah: [C: 03+2] P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah) [09:23:45] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [09:23:47] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1038 [puppet] - 10https://gerrit.wikimedia.org/r/990950 (https://phabricator.wikimedia.org/T349619) [09:24:50] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1039 [puppet] - 10https://gerrit.wikimedia.org/r/990951 (https://phabricator.wikimedia.org/T349619) [09:25:11] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:26:40] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set cloudvirt2004-dev as active - taavi@cumin1002" [09:26:48] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/990952 (https://phabricator.wikimedia.org/T349619) [09:26:50] (03PS1) 10Effie Mouzeli: Switch Mediawiki main 
memcache clusters to puppet 7: mc2039 [puppet] - 10https://gerrit.wikimedia.org/r/990953 (https://phabricator.wikimedia.org/T349619) [09:27:21] (03PS3) 10Effie Mouzeli: (DNM) Switch Mediawiki main memcache clusters to puppet 7: all hosts [puppet] - 10https://gerrit.wikimedia.org/r/990661 (https://phabricator.wikimedia.org/T349619) [09:28:36] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set cloudvirt2004-dev as active - taavi@cumin1002" [09:32:05] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-dpkg-success-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:12] (03CR) 10Filippo Giunchedi: jaeger: add oauth2-proxy sidecar (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [09:32:29] (03PS2) 10Filippo Giunchedi: jaeger: add oauth2-proxy sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) [09:32:43] PROBLEM - Disk space on mwdebug1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=79%): /tmp 0 MB (0% inode=79%): /var/tmp 0 MB (0% inode=79%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1002&var-datasource=eqiad+prometheus/ops [09:33:57] PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=77%): /tmp 0 MB (0% inode=77%): /var/tmp 0 MB (0% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [09:34:07] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-dpkg-success-textfile.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:30] mwdebug is not a happy camper at all, I'll take a quick look [09:38:48] I'd imagine another mediawiki version in /srv/mediawiki brought it over the limit [09:39:19] PROBLEM - Disk space on mw2272 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2272&var-datasource=codfw+prometheus/ops [09:40:58] (03PS1) 10Kosta Harlan: PreAuthenticationProvider: Deny account creation based on ipoid data [extensions/CentralAuth] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990752 (https://phabricator.wikimedia.org/T354928) [09:42:34] similar problem, /srv being 40G on mw2272 means current mw versions don't fit anymore [09:45:33] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:45:39] (03PS1) 10Majavah: report_users: drop dbproxy1018/9 [software] - 10https://gerrit.wikimedia.org/r/990957 (https://phabricator.wikimedia.org/T346947) [09:46:55] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 500 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Swift [09:47:05] (03PS2) 10Majavah: P:etcd: generate wiki replica pool accounts [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T300427) [09:47:58] thoughts/ideas on what's best? trying to gauge how widespread the issue is now [09:48:27] godog: we can just delete some old versions, right? 
[09:49:04] fairly sure that's safe, php-1.42.0-wmf.7 is from november for example [09:49:17] hnowlan: I believe so yeah, no idea how to do that though [09:49:37] PROBLEM - Disk space on mw2283 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2283&var-datasource=codfw+prometheus/ops [09:51:13] PROBLEM - Disk space on mw2282 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2282&var-datasource=codfw+prometheus/ops [09:51:20] I think we can just rm although i don't know what side effects that might have [09:51:50] indeed [09:51:54] same here [09:51:57] !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.14 refs T354432 (duration: 52m 52s) [09:51:57] (03PS1) 10Majavah: Remove cloud-support VLANs from policies [homer/public] - 10https://gerrit.wikimedia.org/r/990960 (https://phabricator.wikimedia.org/T355115) [09:52:01] T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432 [09:52:02] I'll try on a mwdebug host [09:52:35] jnuche: ^ see above, mw hosts running out of disk space [09:53:01] ideally old trains would be removed via the scap command to do so [09:53:04] FWIW I’ve occasionally deleted directories on individual hosts that were left over for some reason https://sal.toolforge.org/log/qOTkE4sBhuQtenzvv3hj [09:53:13] RECOVERY - Disk space on mwdebug1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1002&var-datasource=eqiad+prometheus/ops [09:53:15] but that wasn’t fleet-wide [09:53:27] jnuche: can we ditch old mw versions please? [09:53:36] Lucas_WMDE taavi ack thanks! 
[09:53:37] I guess you’d delete on deployment.e.w and scap sync that? [09:53:43] PROBLEM - Disk space on mw2259 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2259&var-datasource=codfw+prometheus/ops [09:53:49] I'd imagine so too [09:54:05] godog: thanks, I just saw the issue, there's some 30 prod machines affected plus debug and test servers [09:54:18] and yeah, scap was supposed to remove old versions AFAIK [09:54:25] not sure why old dirs were left behind [09:54:27] RECOVERY - Disk space on mwdebug1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [09:54:28] (03PS1) 10Majavah: templates: drop cloud-support1-c-eqiad includes [dns] - 10https://gerrit.wikimedia.org/r/990961 (https://phabricator.wikimedia.org/T355115) [09:55:04] cc hnowlan ^ [09:55:09] PROBLEM - Disk space on mw2271 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2271&var-datasource=codfw+prometheus/ops [09:56:07] PROBLEM - Disk space on mw2286 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2286&var-datasource=codfw+prometheus/ops [09:56:07] scap will sync what it's told to right? there's 5 older php-1.42.0-wmf.x versions in /srv/mediawiki-staging [09:56:13] jnuche: can we force scap to clean up now as taavi mentioned ? 
[09:56:28] (03PS1) 10Majavah: hieradata: drop cloud-support1-c-eqiad from LVS [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) [09:57:14] godog: should be possible, looking it up, failing that I'll just remove the dirs from the deployment server and resync [09:57:33] ack, sgtm [09:57:47] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:07] PROBLEM - Disk space on mw2287 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2287&var-datasource=codfw+prometheus/ops [09:58:31] PROBLEM - Disk space on mw2278 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [09:58:54] in the meantime I'll open a followup task [09:59:07] PROBLEM - Disk space on mw2281 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [09:59:25] PROBLEM - Disk space on mw2285 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2285&var-datasource=codfw+prometheus/ops [10:00:05] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1038.eqiad.wmnet [10:00:37] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1038 [puppet] - 10https://gerrit.wikimedia.org/r/990950 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [10:01:01]
PROBLEM - Disk space on mw2264 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2264&var-datasource=codfw+prometheus/ops [10:01:23] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [10:01:25] PROBLEM - Disk space on mw2267 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2267&var-datasource=codfw+prometheus/ops [10:01:27] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:35] PROBLEM - Disk space on mw2289 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2289&var-datasource=codfw+prometheus/ops [10:01:45] deleted now and resyncing [10:01:49] PROBLEM - Disk space on mw2265 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2265&var-datasource=codfw+prometheus/ops [10:04:00] PROBLEM - Disk space on mw2266 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops [10:04:51] 10SRE, 10serviceops: Too many mw versions caused out of disk space on ~30 mw hosts - 
https://phabricator.wikimedia.org/T355117 (10fgiunchedi) [10:04:57] created ^ [10:05:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1038.eqiad.wmnet [10:05:47] PROBLEM - Disk space on mw2262 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2262&var-datasource=codfw+prometheus/ops [10:05:55] PROBLEM - Disk space on mw2288 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2288&var-datasource=codfw+prometheus/ops [10:06:11] PROBLEM - Disk space on mw2276 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2276&var-datasource=codfw+prometheus/ops [10:06:37] PROBLEM - Disk space on mw2269 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2269&var-datasource=codfw+prometheus/ops [10:06:48] !log jnuche@deploy2002 Pruned MediaWiki: 1.42.0-wmf.7, 1.42.0-wmf.9, 1.42.0-wmf.10, 1.42.0-wmf.12 (duration: 07m 08s) [10:07:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1039.eqiad.wmnet [10:07:38] mmmh, the prune removed the versions from the deploy server, but not from the target hosts, sigh [10:08:42] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1039 [puppet] - 10https://gerrit.wikimedia.org/r/990951 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [10:08:47] jnuche: ack, I can run a cumin command on the mw fleet too, would a simple rm -rf /srv/mediawiki/php-1.42.0-wmf.7 do the 
trick for example ? [10:09:31] godog: yeah, that should work, once there's enough space on the hosts maybe the prune will work [10:09:40] ack, doing cc hnowlan [10:09:48] I think it's trying to sync wmf.14 and bails out once it fails, so it never prunes the other dirs [10:10:07] RECOVERY - Disk space on mw2283 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2283&var-datasource=codfw+prometheus/ops [10:10:28] (03CR) 10Jelto: [V: 03+1 C: 03+2] trafficserver: switch design.wikimedia.org to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [10:10:45] !log manually pruning php-1.42.0-wmf.7 from mw22* - T355117 [10:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:58] T355117: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 [10:11:34] jnuche: I think between your actions and mine we're good now [10:11:43] RECOVERY - Disk space on mw2282 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2282&var-datasource=codfw+prometheus/ops [10:11:50] there will be recoveries coming in [10:12:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1039.eqiad.wmnet [10:13:28] (03PS1) 10Klausman: Add Lift Wing recommendation-api-ng SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/989187 [10:13:49] godog: thanks a lot, can you also run that cumin command for the other old branches?: 1.42.0-wmf.9, 1.42.0-wmf.10, 1.42.0-wmf.12 [10:14:13] RECOVERY - Disk space on mw2259 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2259&var-datasource=codfw+prometheus/ops [10:15:14] jnuche: will do yeah, I've limited the
cumin command to mw22* as those hosts seemed to be problematic [10:15:20] and I'm scared to do it on mw* [10:15:32] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet [10:15:39] RECOVERY - Disk space on mw2271 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2271&var-datasource=codfw+prometheus/ops [10:16:03] godog: I suggest using mw aliases instead of hostname prefixes, as some hosts have been migrated to be k8s hosts [10:16:11] !log clean up also 1.42.0-wmf.9 1.42.0-wmf.10 1.42.0-wmf.12 from mw22* - T355117 [10:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:15] T355117: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 [10:16:29] volans: thank you [10:16:35] RECOVERY - Disk space on mw2286 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2286&var-datasource=codfw+prometheus/ops [10:16:39] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2038.codfw.wmnet [10:17:38] godog: thx, I think the only non-mw22 are debug and test hosts, I can take care of those [10:17:49] (03PS2) 10Klausman: Add Lift Wing recommendation-api-ng SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/989187 (https://phabricator.wikimedia.org/T347262) [10:17:58] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/990952 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [10:18:37] RECOVERY - Disk space on mw2287 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2287&var-datasource=codfw+prometheus/ops [10:19:01] RECOVERY - Disk space on mw2278 is OK: DISK OK
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [10:19:08] jnuche: yeah I think we're good, I've verified with this thanos query to check for > 95% usage https://w.wiki/8rVF [10:19:35] RECOVERY - Disk space on mw2281 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [10:19:55] RECOVERY - Disk space on mw2285 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2285&var-datasource=codfw+prometheus/ops [10:20:23] RECOVERY - Disk space on mw2272 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2272&var-datasource=codfw+prometheus/ops [10:21:13] godog: hum, didn't know about thanos, nice :) [10:21:24] I'm going to wait a few minutes and then retry the presync [10:21:31] RECOVERY - Disk space on mw2264 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2264&var-datasource=codfw+prometheus/ops [10:21:34] it's inevitable [10:21:37] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet [10:21:55] RECOVERY - Disk space on mw2267 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2267&var-datasource=codfw+prometheus/ops [10:22:05] RECOVERY - Disk space on mw2289 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2289&var-datasource=codfw+prometheus/ops [10:22:10] :D [10:22:19] RECOVERY - Disk space on mw2265 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2265&var-datasource=codfw+prometheus/ops [10:24:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2038.codfw.wmnet [10:24:29] RECOVERY - Disk space on mw2266 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops [10:26:15] RECOVERY - Disk space on mw2262 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2262&var-datasource=codfw+prometheus/ops [10:26:25] RECOVERY - Disk space on mw2288 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2288&var-datasource=codfw+prometheus/ops [10:26:41] RECOVERY - Disk space on mw2276 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2276&var-datasource=codfw+prometheus/ops [10:27:07] RECOVERY - Disk space on mw2269 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2269&var-datasource=codfw+prometheus/ops [10:29:09] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet [10:30:15] !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.14 refs T354432 [10:30:19] T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432 [10:30:39] taking a break, bbiab [10:30:41] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2039.codfw.wmnet [10:32:07] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2039 [puppet] - 10https://gerrit.wikimedia.org/r/990953 
(https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [10:32:21] (03PS1) 10Ladsgroup: mariadb: The root grant on localhost is set via unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/990967 [10:35:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet [10:37:43] (03PS2) 10Ladsgroup: mariadb: The root grant on localhost is set via unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/990967 [10:38:43] (03PS6) 10Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 [10:41:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2039.codfw.wmnet [10:43:35] (03PS7) 10Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 [10:47:05] (03CR) 10Jelto: [C: 03+2] miscweb/microsites: move monitoring of design to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [10:47:11] (03PS1) 10Ayounsi: [WIP] Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) [10:47:50] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet [10:51:00] (03PS7) 10Btullis: Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) [10:52:31] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving this since codfw is done and eqiad is tracked in T354684 [10:53:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1124/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [10:53:45] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet [10:59:52] !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.14 refs T354432 (duration: 29m 36s) [10:59:56] T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432 [11:00:02] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) Nice finding Arnaud! >>! In T352974#9461217, @MoritzMuehlenhoff wrote: > > Let's create a separate task for switching Orchestrator... [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1100) [11:01:02] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1040 [puppet] - 10https://gerrit.wikimedia.org/r/990971 [11:01:27] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2040 [puppet] - 10https://gerrit.wikimedia.org/r/990972 [11:01:48] heads-up that I'm still going to be running the train unless infra deployments need to happen [11:02:08] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1041 [puppet] - 10https://gerrit.wikimedia.org/r/990973 [11:02:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one remaining nit inline" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede) [11:03:05] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2041 [puppet] - 10https://gerrit.wikimedia.org/r/990974 [11:03:35] (03CR) 10Brouberol: [C: 03+1] Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:03:46] (03CR) 
10Btullis: [V: 03+1] Switch presto from Puppet to PKI certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:03:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jnuche@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990752 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan) [11:05:23] (03PS2) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2041 [puppet] - 10https://gerrit.wikimedia.org/r/990974 [11:05:52] (03CR) 10Btullis: [V: 03+1 C: 03+2] Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:06:09] (03PS3) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2041 [puppet] - 10https://gerrit.wikimedia.org/r/990974 [11:08:03] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1040.eqiad.wmnet [11:08:31] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1040 [puppet] - 10https://gerrit.wikimedia.org/r/990971 (owner: 10Effie Mouzeli) [11:09:17] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:38] 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10fgiunchedi) The question on how to run debmonitor-client in Pontoon is an interesting one, though unrelated to this issue; debmonitor is not installed b... 
[11:09:53] (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) [11:10:11] (03Merged) 10jenkins-bot: PreAuthenticationProvider: Deny account creation based on ipoid data [extensions/CentralAuth] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990752 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan) [11:12:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1040.eqiad.wmnet [11:13:23] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet [11:15:22] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1041.eqiad.wmnet [11:15:58] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1041 [puppet] - 10https://gerrit.wikimedia.org/r/990973 (owner: 10Effie Mouzeli) [11:16:03] !log jnuche@deploy2002 Started scap: Backport for [[gerrit:990752|PreAuthenticationProvider: Deny account creation based on ipoid data (T354928)]] [11:16:04] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc1041 [puppet] - 10https://gerrit.wikimedia.org/r/990973 (owner: 10Effie Mouzeli) [11:16:07] T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928 [11:19:32] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet [11:21:37] (03PS2) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) [11:23:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1041.eqiad.wmnet [11:26:22] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2040.codfw.wmnet [11:28:39] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2040 
[puppet] - 10https://gerrit.wikimedia.org/r/990972 (owner: 10Effie Mouzeli) [11:30:38] (03CR) 10Filippo Giunchedi: [C: 03+1] Netfilter, minor improvements to alerts. [alerts] - 10https://gerrit.wikimedia.org/r/990943 (owner: 10Slyngshede) [11:33:21] (03CR) 10Slyngshede: [C: 03+2] Netfilter max connection tracking entires. (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:33:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2040.codfw.wmnet [11:33:27] (03CR) 10Slyngshede: [C: 03+2] Netfilter, minor improvements to alerts. [alerts] - 10https://gerrit.wikimedia.org/r/990943 (owner: 10Slyngshede) [11:34:34] (03Merged) 10jenkins-bot: Netfilter, minor improvements to alerts. [alerts] - 10https://gerrit.wikimedia.org/r/990943 (owner: 10Slyngshede) [11:36:04] !log jnuche@deploy2002 jnuche and kharlan: Backport for [[gerrit:990752|PreAuthenticationProvider: Deny account creation based on ipoid data (T354928)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:36:08] T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928 [11:36:31] !log jnuche@deploy2002 jnuche and kharlan: Continuing with sync [11:38:31] (03Abandoned) 10Effie Mouzeli: mediawiki::mcrouter_wancache: upgrade onhost memcached to 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/682166 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [11:39:51] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2041.codfw.wmnet [11:40:36] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2041 [puppet] - 10https://gerrit.wikimedia.org/r/990974 (owner: 10Effie Mouzeli) [11:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2041.codfw.wmnet [11:45:24] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache 
clusters to puppet 7: mc-wf1001 [puppet] - 10https://gerrit.wikimedia.org/r/990984 [11:45:35] !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:990752|PreAuthenticationProvider: Deny account creation based on ipoid data (T354928)]] (duration: 29m 32s) [11:45:39] T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928 [11:47:33] PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:58] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990985 (https://phabricator.wikimedia.org/T354432) [11:48:00] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990985 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [11:49:01] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990985 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [11:56:13] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.14 refs T354432 [11:56:17] T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432 [11:56:37] RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:15] (03PS8) 10Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 [11:57:20] (03CR) 10Slyngshede: Package Debmonitor server as .deb (035 comments) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede) [11:58:07] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1042 [puppet] - 
10https://gerrit.wikimedia.org/r/990986 [11:58:09] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2042 [puppet] - 10https://gerrit.wikimedia.org/r/990987 [11:58:11] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1043 [puppet] - 10https://gerrit.wikimedia.org/r/990988 [11:58:13] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2043 [puppet] - 10https://gerrit.wikimedia.org/r/990989 [11:58:15] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1044 [puppet] - 10https://gerrit.wikimedia.org/r/990990 [11:58:17] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2044 [puppet] - 10https://gerrit.wikimedia.org/r/990991 [11:58:19] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1045 [puppet] - 10https://gerrit.wikimedia.org/r/990992 [11:58:21] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2045 [puppet] - 10https://gerrit.wikimedia.org/r/990993 [11:58:23] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1046 [puppet] - 10https://gerrit.wikimedia.org/r/990994 [11:58:25] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2046 [puppet] - 10https://gerrit.wikimedia.org/r/990995 [11:58:27] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1047 [puppet] - 10https://gerrit.wikimedia.org/r/990996 [11:58:37] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2047 [puppet] - 10https://gerrit.wikimedia.org/r/990997 [11:58:41] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1048 [puppet] - 10https://gerrit.wikimedia.org/r/990998 [11:58:45] (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2048 [puppet] - 10https://gerrit.wikimedia.org/r/990999 [12:05:16] (03PS1) 10Jelto: miscweb: add 
design-style-guide release [deployment-charts] - 10https://gerrit.wikimedia.org/r/991000 (https://phabricator.wikimedia.org/T350791) [12:10:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc-wf1001.eqiad.wmnet [12:11:20] (03PS1) 10Muehlenhoff: Switch mc-wf1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991001 (https://phabricator.wikimedia.org/T349619) [12:11:44] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [12:14:49] (03CR) 10Muehlenhoff: [C: 03+2] Switch mc-wf1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991001 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:16:21] effie: you created change 990990 \o/ ^^ [12:17:47] PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:59] Lucas_WMDE: hahaha [12:18:41] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [12:18:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc-wf1001.eqiad.wmnet [12:22:42] (03PS3) 10Urbanecm: beta: Enable conditional defaults for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) [12:24:27] (03CR) 10Muehlenhoff: Netfilter max connection tracking entires. 
(031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:25:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede) [12:30:03] !log installing systemd bugfix updates from Bullseye point release [12:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:21] RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:29] (03CR) 10Slyngshede: [C: 03+2] Netfilter max connection tracking entires. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:38:18] (03PS1) 10KartikMistry: Set MT threshold for Punjabi Wikipedia to 97 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991002 (https://phabricator.wikimedia.org/T347789) [12:40:56] (03PS1) 10Slyngshede: Netfilter: Remove exclude filter. [alerts] - 10https://gerrit.wikimedia.org/r/991003 (https://phabricator.wikimedia.org/T350694) [12:42:51] (03CR) 10Slyngshede: "Following up on the comments on 989188, I doesn't think it realistic to keep an exclude list in sync with Puppet. The expression to trigge" [alerts] - 10https://gerrit.wikimedia.org/r/991003 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:46:49] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [12:49:08] (03CR) 10EoghanGaffney: [C: 03+1] miscweb: add design-style-guide release [deployment-charts] - 10https://gerrit.wikimedia.org/r/991000 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [12:50:36] (03CR) 10Majavah: Netfilter max connection tracking entires. 
(031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:50:56] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [12:52:02] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet [12:56:21] (03CR) 10Vgutierrez: [C: 04-1] "not sure about this nib of pcc output:" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [12:56:53] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [12:57:18] (03CR) 10Vgutierrez: [C: 04-1] hiera: add acls for heavy ratelimiting abusing ip from list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [12:57:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet [12:59:45] (03CR) 10Jelto: [C: 03+2] miscweb: add design-style-guide release [deployment-charts] - 10https://gerrit.wikimedia.org/r/991000 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [12:59:58] (03CR) 10Muehlenhoff: [C: 03+2] Update associated email address for dreamyjazz (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: 10Muehlenhoff) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1300) [13:00:57] (03Merged) 10jenkins-bot: miscweb: add design-style-guide release [deployment-charts] - 10https://gerrit.wikimedia.org/r/991000 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [13:01:45] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc-wf1001.eqiad.wmnet with OS bullseye [13:02:22] !log reimage mc-wf1001 (part of puppet7 migration) [13:02:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:05:49] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:06:29] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:06:37] (03PS9) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [13:08:25] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:08:42] (03PS6) 10EoghanGaffney: [gerrit] Refactor classes to specify an active host [puppet] - 10https://gerrit.wikimedia.org/r/988464 [13:08:54] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:09:18] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:09:39] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:10:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:10:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [13:11:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:13:15] (03PS1) 10Dreamy Jazz: Support parallel PhotoDNA requests [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990754 (https://phabricator.wikimedia.org/T354408) [13:14:48] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T355098 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [13:14:59] (03CR) 10Ayounsi: [C: 03+1] Remove cloud-support VLANs from policies [homer/public] - 10https://gerrit.wikimedia.org/r/990960 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [13:15:25] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage [13:16:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:16:21] (03PS10) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [13:16:47] (03CR) 10Majavah: [C: 03+2] Remove cloud-support VLANs from policies [homer/public] - 10https://gerrit.wikimedia.org/r/990960 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [13:18:17] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage [13:19:35] (03PS2) 10Majavah: hieradata: drop 
cloud-support1-c-eqiad from LVS [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) [13:19:37] (03PS1) 10Majavah: network: remove cloud-support1-c-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/991006 (https://phabricator.wikimedia.org/T355115) [13:20:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [13:23:49] (03Merged) 10jenkins-bot: Remove cloud-support VLANs from policies [homer/public] - 10https://gerrit.wikimedia.org/r/990960 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [13:25:29] (03PS21) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [13:26:57] (03PS21) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [13:28:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:28:25] (03CR) 10Hnowlan: [C: 03+1] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [13:28:41] (03PS1) 10Filippo Giunchedi: sre: add mw edit failures alert [alerts] - 10https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) [13:28:43] kamila_: seems like you disabled BGP on mw2436/mw2437 yesterday in netbox, but did not commit it via homer. 
homer is now giving me a diff relating to that with an unrelated change, is it ok to deploy that? [13:29:57] taavi: oh shit that was by accident and I have a bug in my script XD [13:30:09] sorry, don't deploy, I'll clean it up [13:30:10] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1126/console" [puppet] - 10https://gerrit.wikimedia.org/r/988464 (owner: 10EoghanGaffney) [13:30:13] thank you, sorryyy [13:30:32] kamila_: no worries, just lemme know when that's fixed [13:30:55] (03PS1) 10Filippo Giunchedi: graphite: remove mw edit failures graphite alerts [puppet] - 10https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) [13:31:41] PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:42] taavi: fixed [13:32:37] thanks :D [13:32:58] (03PS11) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [13:33:11] RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:18] yeah now the diff looks much better [13:34:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "Good point, better go to as simple as we can" [alerts] - 10https://gerrit.wikimedia.org/r/991003 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:34:36] (03PS1) 10Jelto: miscweb/microsites: remove profile::microsites::design [puppet] - 10https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791) [13:35:07] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1001.eqiad.wmnet with OS bullseye [13:35:09] next time I run this script, I will double 
check that I did a clean git checkout and didn't leave my debugging stuff in there '^^ [13:36:03] (03CR) 10Ladsgroup: [C: 03+2] report_users: drop dbproxy1018/9 [software] - 10https://gerrit.wikimedia.org/r/990957 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [13:36:39] (03Merged) 10jenkins-bot: report_users: drop dbproxy1018/9 [software] - 10https://gerrit.wikimedia.org/r/990957 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [13:37:34] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1128/co" [puppet] - 10https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [13:37:47] (03CR) 10Hashar: [C: 04-1] Add base production images containing Java 8 JDK and JRE (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [13:38:03] (03PS6) 10Hashar: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [13:38:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:40:30] (03PS7) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) [13:40:32] (03PS2) 10Btullis: Update the openjdk-11 images to match openjdk-8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/990036 
[13:41:15] (03CR) 10Hashar: [C: 03+1] "I have amended the shell bits which had two `&&` due to some copy pastas. I have build the images locally invoking twice:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [13:43:05] (03CR) 10Hashar: [C: 03+1] "Ben version (PS 7) moves the `&&` at the end of the lines, which is good as well :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [13:44:16] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [13:47:15] (03CR) 10Jelto: "looks mostly good but profile::gerrit::is_replica is removed from hiera but still used in profile::gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/988464 (owner: 10EoghanGaffney) [13:50:09] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [13:50:21] (03CR) 10Jelto: [C: 03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [13:51:27] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [13:54:30] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:54:34] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Wrong filenames in the File history section (timestamp differs from displayed timestamp) - https://phabricator.wikimedia.org/T302985 (10bjh21) I think there may be a mistaken assumption here: > When you saved file on the beginning of the name it 
h... [13:59:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "As far as I can tell, none of the code touched here is reached via web requests (only via maintenance), so hopefully backporting it won’t " [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990754 (https://phabricator.wikimedia.org/T354408) (owner: 10Dreamy Jazz) [14:00:00] I can self-serve my backports [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1400) [14:00:05] Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:53] o/ [14:00:54] Dreamy_Jazz: ack [14:01:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990754 (https://phabricator.wikimedia.org/T354408) (owner: 10Dreamy Jazz) [14:02:24] Both patches I will deploy will only affect manually run maintenance scripts [14:03:48] (03Merged) 10jenkins-bot: Support parallel PhotoDNA requests [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990754 (https://phabricator.wikimedia.org/T354408) (owner: 10Dreamy Jazz) [14:04:14] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:990754|Support parallel PhotoDNA requests (T354408)]] [14:04:25] T354408: Support parallelizing scans to PhotoDNA - https://phabricator.wikimedia.org/T354408 [14:05:46] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:990754|Support parallel PhotoDNA requests (T354408)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:05:53] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:05:56] (03PS1) 10Majavah: alertmanager: 
fix timezone bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/991017 (https://phabricator.wikimedia.org/T347490) [14:06:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:07:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:07:12] 10SRE, 10serviceops: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (10jnuche) The problem was caused by older MW versions being left over on the drive. For instance: ` mwdeploy@mw2272:/srv/mediawiki$ ls composer.json dblists-index.php errorpages l... [14:07:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1144:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54725 and previous config saved to /var/cache/conftool/dbconfig/20240116-140713-marostegui.json [14:07:17] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:07:23] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet [14:07:46] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [14:09:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54726 and previous config saved to /var/cache/conftool/dbconfig/20240116-140938-marostegui.json [14:11:29] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:990754|Support parallel PhotoDNA requests (T354408)]] (duration: 07m 14s) [14:11:48] T354408: Support parallelizing scans to PhotoDNA - https://phabricator.wikimedia.org/T354408 [14:11:48] (03PS1) 10Dreamy Jazz: Add more statsd counters and add logstash logging [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990760 
(https://phabricator.wikimedia.org/T351419) [14:12:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990760 (https://phabricator.wikimedia.org/T351419) (owner: 10Dreamy Jazz) [14:12:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [14:14:04] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet [14:14:29] (03Merged) 10jenkins-bot: Add more statsd counters and add logstash logging [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990760 (https://phabricator.wikimedia.org/T351419) (owner: 10Dreamy Jazz) [14:14:33] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [14:15:58] I received a 502 proxy error when using scap backport for 990760. I will re-try this. 
[14:16:55] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:990760|Add more statsd counters and add logstash logging (T351419)]] [14:16:59] T351419: Create a Grafana chart to plot the number of PhotoDNA requests per day per wiki - https://phabricator.wikimedia.org/T351419 [14:17:01] !log installing 5.10.205 kernels on buster hosts running the 5.10 backport [14:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:33] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:990760|Add more statsd counters and add logstash logging (T351419)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:18:38] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:23:01] PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:10] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:990760|Add more statsd counters and add logstash logging (T351419)]] (duration: 07m 15s) [14:24:14] T351419: Create a Grafana chart to plot the number of PhotoDNA requests per day per wiki - https://phabricator.wikimedia.org/T351419 [14:24:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P54727 and previous config saved to /var/cache/conftool/dbconfig/20240116-142444-marostegui.json [14:26:00] (03PS12) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [14:28:24] (03CR) 10EoghanGaffney: [C: 03+1] miscweb/microsites: remove profile::microsites::design [puppet] - 10https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:29:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 
(https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [14:30:44] I think that is it for the backport window, unless anyone has any other things to deploy? [14:31:41] !log UTC afternoon deploys done [14:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:05] !log installing ca-certificates-java bugfix updates on bookworm [14:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:32] Dreamy_Jazz: thanks for doing the window ^^ [14:33:39] No problem :) [14:36:07] (03PS1) 10Marostegui: db2124: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/991025 (https://phabricator.wikimedia.org/T354506) [14:36:49] (03PS13) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [14:37:36] (03CR) 10Marostegui: [C: 03+2] db2124: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/991025 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [14:39:16] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P54728 and previous config saved to /var/cache/conftool/dbconfig/20240116-143951-marostegui.json [14:42:28] (03PS1) 10Hnowlan: modules: add cassandra client module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) [14:44:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [14:50:02] (03CR) 10Muehlenhoff: Configure ACLs 
for reprepro upload queue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [14:54:17] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54729 and previous config saved to /var/cache/conftool/dbconfig/20240116-145458-marostegui.json [14:55:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:55:03] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:55:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:55:20] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:55:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:55:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:55:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:56:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:56:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T354336)', diff saved to https://phabricator.wikimedia.org/P54730 and previous config saved to 
/var/cache/conftool/dbconfig/20240116-145613-marostegui.json [14:56:15] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/990961 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [14:57:03] (03CR) 10Majavah: [C: 03+2] templates: drop cloud-support1-c-eqiad includes [dns] - 10https://gerrit.wikimedia.org/r/990961 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [14:58:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T354336)', diff saved to https://phabricator.wikimedia.org/P54731 and previous config saved to /var/cache/conftool/dbconfig/20240116-145837-marostegui.json [14:58:55] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old records for cloud-support1-c-eqiad - cmooney@cumin1002" [15:00:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old records for cloud-support1-c-eqiad - cmooney@cumin1002" [15:00:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:00:53] RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:24] (03CR) 10Marostegui: [C: 03+1] mariadb: The root grant on localhost is set via unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/990967 (owner: 10Ladsgroup) [15:04:54] (03PS3) 10Ladsgroup: mariadb: The root grant on localhost is set via unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/990967 [15:04:57] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: The root grant on localhost is set via unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/990967 (owner: 10Ladsgroup) [15:07:53] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:08:35] (03CR) 
10Dzahn: [V: 03+1 C: 03+1] "common/profile/trafficserver/backend.yaml: target: http://design.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [15:10:41] (03PS7) 10EoghanGaffney: [gerrit] Refactor classes to specify an active host [puppet] - 10https://gerrit.wikimedia.org/r/988464 [15:11:30] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:13:30] !log T351400 running mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 20 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-20.txt [15:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P54732 and previous config saved to /var/cache/conftool/dbconfig/20240116-151344-marostegui.json [15:13:44] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [15:14:03] (03PS2) 10Ssingh: depool codfw: do not merge! 
emergency depool patch [dns] - 10https://gerrit.wikimedia.org/r/989534 (https://phabricator.wikimedia.org/T352758) [15:14:12] (03CR) 10CI reject: [V: 04-1] [gerrit] Refactor classes to specify an active host [puppet] - 10https://gerrit.wikimedia.org/r/988464 (owner: 10EoghanGaffney) [15:18:40] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch" [software/spicerack] - 10https://gerrit.wikimedia.org/r/991017 (https://phabricator.wikimedia.org/T347490) (owner: 10Majavah) [15:18:41] !log Stopped mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 20 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-20.txt [15:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:52] !log T351400 running mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 25 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-20.txt [15:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:56] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [15:19:09] (03PS8) 10EoghanGaffney: [gerrit] Refactor classes to specify an active host [puppet] - 10https://gerrit.wikimedia.org/r/988464 [15:19:23] !log Disabling puppet and PyBal on lvs2013 ahead of migration of network link to ssw1-a1-codfw T352784 [15:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:27] T352784: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 [15:21:05] 10SRE-OnFire, 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (10Gehel) [15:21:11] (03CR) 10Majavah: [C: 03+2] alertmanager: fix timezone bug [software/spicerack] 
- 10https://gerrit.wikimedia.org/r/991017 (https://phabricator.wikimedia.org/T347490) (owner: 10Majavah) [15:23:09] (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:23:19] (03CR) 10Brouberol: [C: 03+1] "Looks good, now that https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/989786 is merged" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) (owner: 10Btullis) [15:23:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:25:33] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:25:45] ^^ that's related to my work messed up downtime [15:26:09] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:27:29] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6,lvs2013 with reason: moving lvs hosts codfw T352784 [15:27:29] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1129/console" [puppet] - 10https://gerrit.wikimedia.org/r/988464 (owner: 10EoghanGaffney) 
[15:27:43] T352784: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 [15:27:46] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6,lvs2013 with reason: moving lvs hosts codfw T352784 [15:28:06] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=38432fab-1dd6-4ffe-a093-648c38675985) set by c... [15:28:39] !log stopped mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 25 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-25.txt [15:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:48] (03Merged) 10jenkins-bot: alertmanager: fix timezone bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/991017 (https://phabricator.wikimedia.org/T347490) (owner: 10Majavah) [15:28:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P54733 and previous config saved to /var/cache/conftool/dbconfig/20240116-152850-marostegui.json [15:29:07] !log T351400 running mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt [15:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:11] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [15:32:48] (03CR) 10Ssingh: "I see cloud-support1-c-eqiad: in network/data/data.yaml as well. Should that be removed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [15:33:16] (03CR) 10Majavah: hieradata: drop cloud-support1-c-eqiad from LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [15:33:34] (03CR) 10Ssingh: [C: 03+1] "Ah, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [15:36:23] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10BTullis) Good stuff! Does this mean that we can decline {T354411} now, or would you still prefer that role to be migrated back to puppet 5? [15:36:43] (03CR) 10Btullis: [C: 03+2] Remove remaining references to dbstore100[35] [puppet] - 10https://gerrit.wikimedia.org/r/990707 (https://phabricator.wikimedia.org/T351923) (owner: 10Btullis) [15:37:45] (03PS1) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [15:39:02] 10ops-eqiad, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10BTullis) a:05BTullis→03Jclark-ctr [15:39:50] 10ops-eqiad, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis) a:05BTullis→03Jclark-ctr [15:40:09] (03PS2) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2013 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980426 (https://phabricator.wikimedia.org/T352784) [15:40:29] PROBLEM - Check systemd state on ms-be2072 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1161 (T354336)', diff saved to https://phabricator.wikimedia.org/P54734 and previous config saved to /var/cache/conftool/dbconfig/20240116-154357-marostegui.json [15:43:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:44:01] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:44:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:44:17] PROBLEM - Disk space on ms-be2072 is CRITICAL: DISK CRITICAL - /srv/swift-storage/objects0 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2072&var-datasource=codfw+prometheus/ops [15:44:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T354336)', diff saved to https://phabricator.wikimedia.org/P54735 and previous config saved to /var/cache/conftool/dbconfig/20240116-154419-marostegui.json [15:44:59] (03CR) 10Ssingh: [C: 03+1] Add new codfw per-rack vlans to lvs2013 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980426 (https://phabricator.wikimedia.org/T352784) (owner: 10Cathal Mooney) [15:46:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T354336)', diff saved to https://phabricator.wikimedia.org/P54736 and previous config saved to /var/cache/conftool/dbconfig/20240116-154643-marostegui.json [15:47:08] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (10ABran-WMF) [15:47:41] (03CR) 10Cathal Mooney: [C: 03+2] Add new codfw per-rack vlans to lvs2013 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980426 (https://phabricator.wikimedia.org/T352784) (owner: 10Cathal Mooney) 
[15:47:48] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) 05In progress→03Resolved [15:48:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10ABran-WMF) [15:48:20] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) >>! In T352974#9462685, @BTullis wrote: > Good stuff! > Does this mean that we can decline {T354411} now, or would you still prefer... [15:49:55] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (10Marostegui) p:05Triage→03High [15:50:33] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 214, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:38] (03CR) 10Giuseppe Lavagetto: "Overall LGTM, I would like some additional comments in the files to ease our life in the future, and adding a networkpolicy egress templat" [deployment-charts] - 10https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [15:54:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:54:25] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:55:18] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on re0.cr[1-2]-codfw.mgmt with reason: moving lvs hosts codfw T352784 T352918 [15:55:23] T352784: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 [15:55:24] T352918: Move lvs2012 from private1-b-codfw (row) to 
private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 [15:55:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on re0.cr[1-2]-codfw.mgmt with reason: moving lvs hosts codfw T352784 T352918 [16:00:04] !log stopped mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt [16:00:05] eoghan, jelto, and arnoldokoth: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1600). [16:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P54737 and previous config saved to /var/cache/conftool/dbconfig/20240116-160150-marostegui.json [16:03:09] !log T351400 running on a tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt` [16:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:18] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [16:14:20] (03PS3) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2012 and move row A vlans [puppet] - 10https://gerrit.wikimedia.org/r/980931 (https://phabricator.wikimedia.org/T352909) [16:14:56] (03PS2) 10Hnowlan: modules: add cassandra client module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) [16:15:36] (03CR) 10Hnowlan: modules: add cassandra client module (033 comments) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [16:16:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P54738 and previous config saved to /var/cache/conftool/dbconfig/20240116-161656-marostegui.json [16:19:51] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: deployment [16:20:04] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: deployment [16:20:08] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: deployment [16:20:22] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: deployment [16:20:49] !log phabricator deploy is imminent [16:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:01] !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 for T354969 [16:21:29] !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 for T354969 (duration: 00m 27s) [16:21:30] T354969: Deploy Phabricator/Phorge 2024-01-16 - https://phabricator.wikimedia.org/T354969 [16:22:05] !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: deploy to phab1004 for T354969 [16:22:56] !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: deploy to phab1004 for T354969 (duration: 00m 50s) [16:27:50] 10SRE-OnFire, 10Znuny, 10collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (10Jelto) Paging for ticket.wikimedia.org might be a bit expensive if done similar like pages for mediawiki for example (especially outside of business hours). But that's my... 
[16:31:28] (03CR) 10Jelto: [C: 03+1] "lgtm now. One hiera file hieradata/cloud/eqiad1/devtools/common.yaml still uses the is_replica flag, see open comment." [puppet] - 10https://gerrit.wikimedia.org/r/988464 (owner: 10EoghanGaffney) [16:32:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T354336)', diff saved to https://phabricator.wikimedia.org/P54739 and previous config saved to /var/cache/conftool/dbconfig/20240116-163203-marostegui.json [16:32:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [16:32:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [16:32:20] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:32:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T354336)', diff saved to https://phabricator.wikimedia.org/P54740 and previous config saved to /var/cache/conftool/dbconfig/20240116-163224-marostegui.json [16:32:31] (03PS1) 10C. Scott Ananian: WIP: turn on DT visual enhancements on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991039 [16:32:33] (03CR) 10Btullis: "I'm doing a local build of spark against openjdk-8." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) (owner: 10Btullis) [16:33:04] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on prometheus1005.eqiad.wmnet with reason: memory upgrade [16:33:18] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on prometheus1005.eqiad.wmnet with reason: memory upgrade [16:33:24] 10SRE, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=639b8465-e0d6-4049-bc5d-4c38af1cc396) set by filippo@cumin1002 for 1:00:00 on 1 host(s) and their services with reason... [16:33:42] (03CR) 10Ssingh: [C: 03+1] Add new codfw per-rack vlans to lvs2012 and move row A vlans [puppet] - 10https://gerrit.wikimedia.org/r/980931 (https://phabricator.wikimedia.org/T352909) (owner: 10Cathal Mooney) [16:34:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T354336)', diff saved to https://phabricator.wikimedia.org/P54741 and previous config saved to /var/cache/conftool/dbconfig/20240116-163449-marostegui.json [16:37:42] (03CR) 10Dreamy Jazz: Update associated email address for dreamyjazz (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: 10Muehlenhoff) [16:39:17] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:41:23] (03PS16) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [16:44:16] (JobUnavailable) firing: (3) Reduced availability for job liberica in 
ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:47:10] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (10Papaul) 05Open→03Resolved Link removed [16:47:16] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10Papaul) [16:49:09] (03CR) 10Hnowlan: [C: 03+2] kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:49:16] (JobUnavailable) firing: (3) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:49:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P54742 and previous config saved to /var/cache/conftool/dbconfig/20240116-164957-marostegui.json [16:56:26] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw1460.eqiad.wmnet with OS bullseye [16:56:40] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw1460.eqiad.wmnet with OS bullseye [16:56:42] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on prometheus1006.eqiad.wmnet with reason: memory upgrade [16:56:56] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 
prometheus1006.eqiad.wmnet with reason: memory upgrade [16:57:03] 10SRE, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8a4aebbc-9222-4c62-b55e-a4c6a3f6d9a6) set by filippo@cumin1002 for 1:00:00 on 1 host(s) and their services with reason... [17:00:05] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:25] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) After moving the linecard in cr1, we are seeing the error now in cr1. I emailed Support to request a replacement again [17:04:17] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:05:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P54743 and previous config saved to /var/cache/conftool/dbconfig/20240116-170503-marostegui.json [17:09:17] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:10:05] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1460.eqiad.wmnet with reason: host reimage [17:11:07] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: moving lvs hosts codfw T352784 T352918 
[17:11:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: moving lvs hosts codfw T352784 T352918 [17:11:23] T352784: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 [17:11:23] T352918: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 [17:12:41] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) ` Hello Papaul Sure, no problem, thanks for the troubleshooting you performed, I will proceed with the RMA, please provide me with the following information (please fill out the blank spaces to avoid any misun... [17:12:50] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1460.eqiad.wmnet with reason: host reimage [17:14:16] !log Disabling puppet and PyBal on lvs2012 ahead of migration of network link to lsw1-b2-codfw T352909 [17:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:22] T352909: Move lvs2012 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352909 [17:14:53] (03PS2) 10Btullis: Switch all spark images to use Java 8 as their base JDK/JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) [17:18:11] (03CR) 10Jgiannelos: mobileapps: add Cassandra config support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [17:20:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T354336)', diff saved to https://phabricator.wikimedia.org/P54744 and previous config saved to /var/cache/conftool/dbconfig/20240116-172011-marostegui.json [17:20:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on 
db1210.eqiad.wmnet with reason: Maintenance [17:20:18] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:20:22] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance [17:20:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T354336)', diff saved to https://phabricator.wikimedia.org/P54745 and previous config saved to /var/cache/conftool/dbconfig/20240116-172032-marostegui.json [17:23:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T354336)', diff saved to https://phabricator.wikimedia.org/P54746 and previous config saved to /var/cache/conftool/dbconfig/20240116-172300-marostegui.json [17:24:04] (03CR) 10Cathal Mooney: [C: 03+2] Add new codfw per-rack vlans to lvs2012 and move row A vlans [puppet] - 10https://gerrit.wikimedia.org/r/980931 (https://phabricator.wikimedia.org/T352909) (owner: 10Cathal Mooney) [17:28:40] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:30:40] 10SRE, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (10VRiley-WMF) Added memory and confirmed that these units have come back up and are operating as expected. Closing ticket. 
[17:31:02] 10SRE, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (10VRiley-WMF) 05Open→03Resolved [17:31:04] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10VRiley-WMF) [17:32:19] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1460.eqiad.wmnet with OS bullseye [17:32:32] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw1460.eqiad.wmnet with OS bullseye completed: - mw1460 (**PASS**) - Downt... [17:32:46] (03CR) 10Volans: "Replies inline" [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [17:35:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (10XenoRyet) Approved from my end. [17:38:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P54747 and previous config saved to /var/cache/conftool/dbconfig/20240116-173806-marostegui.json [17:38:53] 10SRE, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (10fgiunchedi) Confirmed on my end too all is well, thank you again @VRiley-WMF ! 
[17:38:59] (03PS3) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) [17:42:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:42:53] (03PS2) 10Slyngshede: Modify password reset to take CN as username. [software/bitu] - 10https://gerrit.wikimedia.org/r/988677 (https://phabricator.wikimedia.org/T338825) [17:43:18] PROBLEM - BGP status on lsw1-b2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:43:54] ^ expected [17:47:26] (03PS3) 10Slyngshede: Modify password reset to take CN as username. [software/bitu] - 10https://gerrit.wikimedia.org/r/988677 (https://phabricator.wikimedia.org/T338825) [17:48:19] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan) [17:48:53] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170 (10JWheeler-WMF) [17:49:00] (03PS2) 10Hnowlan: kubernetes: make 4 codfw jobrunner hosts k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/990723 (https://phabricator.wikimedia.org/T354791) [17:52:00] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170 (10JWheeler-WMF) a:03Arrbee [17:52:59] (03CR) 10Hnowlan: [C: 03+1] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [17:53:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P54748 and previous config saved to /var/cache/conftool/dbconfig/20240116-175313-marostegui.json [17:53:27] (03PS2) 10Cathal Mooney: Add new codfw 
per-rack vlans to lvs2011 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) [17:56:32] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (10cmooney) [17:57:06] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (10cmooney) 05Open→03Resolved Work complete, all looking good. [17:57:48] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (10cmooney) 05Open→03Resolved Work complete without issue. [17:57:54] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney) [17:59:45] (03PS3) 10Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1800) [18:05:00] (03PS1) 10Jdlrobson: Fix text overflow in history page [skins/MinervaNeue] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991049 (https://phabricator.wikimedia.org/T354218) [18:06:30] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) [18:07:01] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2012 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352909 (10cmooney) 05Open→03Resolved Work completed, all looking good. 
[18:07:14] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney) [18:08:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T354336)', diff saved to https://phabricator.wikimedia.org/P54749 and previous config saved to /var/cache/conftool/dbconfig/20240116-180819-marostegui.json [18:08:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [18:08:32] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [18:08:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [18:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54750 and previous config saved to /var/cache/conftool/dbconfig/20240116-180841-marostegui.json [18:09:47] 10SRE, 10ops-eqiad, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [18:10:59] 10SRE, 10ops-eqiad, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10VRiley-WMF) [18:11:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54751 and previous config saved to /var/cache/conftool/dbconfig/20240116-181107-marostegui.json [18:11:14] 10SRE, 10ops-eqiad, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10VRiley-WMF) 05Open→03Resolved [18:12:06] 10SRE, 10ops-eqiad, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - 
https://phabricator.wikimedia.org/T351923 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [18:14:42] (03PS2) 10Dzahn: phabricator: auto-sync /srv/repos between servers [puppet] - 10https://gerrit.wikimedia.org/r/990247 (https://phabricator.wikimedia.org/T354221) [18:15:09] 10SRE, 10ops-eqiad, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10VRiley-WMF) [18:15:28] 10SRE, 10ops-eqiad, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10VRiley-WMF) 05Open→03Resolved [18:17:58] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:03] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:18:09] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:18:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:19:04] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:19:55] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:19:56] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:20:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:45] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:26:11] (03PS1) 10Kamila Součková: mobileapps: switch 
service discovery to k8s only [deployment-charts] - 10https://gerrit.wikimedia.org/r/991043 (https://phabricator.wikimedia.org/T350846) [18:26:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P54752 and previous config saved to /var/cache/conftool/dbconfig/20240116-182613-marostegui.json [18:28:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:33:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/990247/1130/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/990247 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [18:36:20] !log stopped tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt` [18:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:38:02] !log T351400 running on a tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --sleep 1 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-non-job-queue.txt` 
[18:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:06] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [18:40:31] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (10KFrancis) Hi all, I have sent the NDA for signatures. I'll confirm when it's complete. Thanks! [18:40:36] (03PS1) 10Majavah: Add py.typed marker [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991044 [18:40:38] (03PS1) 10Majavah: Typing fixes [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991045 [18:40:40] (03PS1) 10Majavah: Fix documentation generation [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) [18:41:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P54753 and previous config saved to /var/cache/conftool/dbconfig/20240116-184120-marostegui.json [18:41:51] (03CR) 10CI reject: [V: 04-1] Add py.typed marker [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991044 (owner: 10Majavah) [18:41:53] (03CR) 10CI reject: [V: 04-1] Typing fixes [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991045 (owner: 10Majavah) [18:41:59] (03CR) 10CI reject: [V: 04-1] Fix documentation generation [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) (owner: 10Majavah) [18:42:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded 
[18:42:48] !log phab2002 - pulling repo data from phab1004 by running sync script created by rsync::quickdatacopy after gerrit:990247 T354221 [18:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:52] T354221: automate data syncing between phabricator servers - https://phabricator.wikimedia.org/T354221 [18:42:59] (03PS2) 10Majavah: Fix documentation generation [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) [18:43:01] (03PS2) 10Majavah: Add py.typed marker [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991044 [18:43:03] (03PS2) 10Majavah: Typing fixes [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991045 [18:44:00] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-phabricator-repos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:03] (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [18:44:29] (03CR) 10CI reject: [V: 04-1] Add py.typed marker [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991044 (owner: 10Majavah) [18:44:45] (03CR) 10CI reject: [V: 04-1] Fix documentation generation [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) (owner: 10Majavah) [18:44:49] (03CR) 10CI reject: [V: 04-1] Typing fixes [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991045 (owner: 10Majavah) [18:45:34] (03PS3) 10Majavah: Fix documentation generation [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) [18:45:36] (03PS3) 10Majavah: Add py.typed marker [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991044 [18:45:38] (03PS3) 10Majavah: Typing fixes [software/bitu-ldap] - 
10https://gerrit.wikimedia.org/r/991045 [18:50:02] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:02] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1360.eqiad.wmnet with OS bullseye [18:50:37] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1361.eqiad.wmnet with OS bullseye [18:51:14] (03PS4) 10Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 [18:51:14] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1362.eqiad.wmnet with OS bullseye [18:51:45] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1363.eqiad.wmnet with OS bullseye [18:52:14] (03CR) 10Htriedman: "changed list location to helmfile.d/services/eventstreams/values.yaml + updated to most current version of list" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [18:55:33] (03CR) 10Kamila Součková: [C: 03+1] "sorry for the merge conflicts '^^" [puppet] - 10https://gerrit.wikimedia.org/r/990723 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [18:56:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54754 and previous config saved to /var/cache/conftool/dbconfig/20240116-185626-marostegui.json [18:56:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:56:32] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [18:56:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:57:04] !log marostegui@cumin1002 START - 
Cookbook sre.hosts.downtime for 8:00:00 on db1230.eqiad.wmnet with reason: Maintenance [18:57:17] (03CR) 10Dzahn: phabricator: use same db server regardless of DC of phab server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989537 (owner: 10Dzahn) [18:57:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1230.eqiad.wmnet with reason: Maintenance [18:57:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T354336)', diff saved to https://phabricator.wikimedia.org/P54755 and previous config saved to /var/cache/conftool/dbconfig/20240116-185723-marostegui.json [18:59:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T354336)', diff saved to https://phabricator.wikimedia.org/P54756 and previous config saved to /var/cache/conftool/dbconfig/20240116-185949-marostegui.json [19:00:05] jnuche and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1900). 
[19:03:19] PROBLEM - Check systemd state on kubernetes2053 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:50] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1361.eqiad.wmnet with reason: host reimage [19:05:20] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1360.eqiad.wmnet with reason: host reimage [19:05:29] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2053 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:05:33] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1362.eqiad.wmnet with reason: host reimage [19:05:41] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "tested both manually running the "sync" script that is created by this on the passive server and by starting the systemd service on the sa" [puppet] - 10https://gerrit.wikimedia.org/r/990247 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [19:06:33] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1363.eqiad.wmnet with reason: host reimage [19:06:42] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1374.eqiad.wmnet with OS bullseye [19:07:10] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1375.eqiad.wmnet with OS bullseye [19:07:46] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1362.eqiad.wmnet with reason: host reimage [19:07:49] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1376.eqiad.wmnet with OS bullseye [19:08:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1361.eqiad.wmnet with reason: host reimage [19:10:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on mw1363.eqiad.wmnet with reason: host reimage [19:11:45] (03CR) 10Ottomata: [C: 03+1] "one nit, but lgtm:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [19:12:15] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:12:46] RECOVERY - Check systemd state on kubernetes2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:17] (03PS3) 10Dzahn: phabricator: add script/timer to create tarballs of home dirs [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) [19:13:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1360.eqiad.wmnet with reason: host reimage [19:14:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P54757 and previous config saved to /var/cache/conftool/dbconfig/20240116-191456-marostegui.json [19:16:41] PROBLEM - Check for large files in client bucket on mw1362 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.204: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [19:16:51] PROBLEM - Check size of conntrack table on mw1362 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.204: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:17:41] RECOVERY - Check for large files in client bucket on mw1362 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [19:17:51] RECOVERY - Check size of 
conntrack table on mw1362 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:18:30] ^ downtime cookbook failed, I'm reimaging the host [19:18:33] sorry for the noise [19:18:35] PROBLEM - Check whether ferm is active by checking the default input chain on mw2422 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:21:19] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1374.eqiad.wmnet with reason: host reimage [19:21:47] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1375.eqiad.wmnet with reason: host reimage [19:22:25] (03PS5) 10Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 [19:22:25] PROBLEM - Host mw1362 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:51] (03CR) 10Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [19:23:10] (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:23:32] (KubernetesCalicoDown) firing: mw1362.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1362.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:23:45] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on mw1376.eqiad.wmnet with reason: host reimage [19:24:22] RECOVERY - Host mw1362 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [19:24:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1374.eqiad.wmnet with reason: host reimage [19:27:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1376.eqiad.wmnet with reason: host reimage [19:27:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1362.eqiad.wmnet with OS bullseye [19:28:33] (KubernetesCalicoDown) resolved: mw1362.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1362.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:29:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1361.eqiad.wmnet with OS bullseye [19:29:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1375.eqiad.wmnet with reason: host reimage [19:30:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P54758 and previous config saved to /var/cache/conftool/dbconfig/20240116-193002-marostegui.json [19:30:31] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/output/990250/1131/phab1004.eqiad.wmnet/change.phab1004.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [19:31:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1363.eqiad.wmnet with OS bullseye [19:31:05] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2291.codfw.wmnet with OS bullseye [19:31:45] (AppserversUnreachable) resolved: Appserver unavailable for 
cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:31:48] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2292.codfw.wmnet with OS bullseye [19:32:31] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2293.codfw.wmnet with OS bullseye [19:34:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1360.eqiad.wmnet with OS bullseye [19:34:51] (03PS1) 10Jdlrobson: Update checkboxHack target node [skins/MinervaNeue] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991050 (https://phabricator.wikimedia.org/T354315) [19:34:57] (03PS4) 10Dzahn: phabricator: add script/timer to create tarballs of home dirs [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) [19:35:40] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:38:31] (03PS1) 10Ryan Kemper: wdqs: add exp graph split endpoints to alt_names [puppet] - 10https://gerrit.wikimedia.org/r/991088 (https://phabricator.wikimedia.org/T354661) [19:38:58] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991088 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper) [19:42:19] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/991088 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper) [19:42:31] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: add exp graph split endpoints to alt_names [puppet] - 10https://gerrit.wikimedia.org/r/991088 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper) [19:43:26] (03Abandoned) 10Jeena Huneidi: Merge 
remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/990332 (owner: 10Jeena Huneidi) [19:44:18] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/990250/1132/" [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [19:45:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T354336)', diff saved to https://phabricator.wikimedia.org/P54759 and previous config saved to /var/cache/conftool/dbconfig/20240116-194509-marostegui.json [19:45:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:45:20] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [19:45:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:45:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1374.eqiad.wmnet with OS bullseye [19:46:17] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2294.codfw.wmnet with OS bullseye [19:47:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1376.eqiad.wmnet with OS bullseye [19:47:45] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2291.codfw.wmnet with reason: host reimage [19:47:52] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2292.codfw.wmnet with reason: host reimage [19:49:00] RECOVERY - Check whether ferm is active by checking the default input chain on mw2422 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:49:09] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2293.codfw.wmnet with reason: host 
reimage [19:50:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1375.eqiad.wmnet with OS bullseye [19:50:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2291.codfw.wmnet with reason: host reimage [19:52:05] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2295.codfw.wmnet with OS bullseye [19:52:27] (03CR) 10Dzahn: [C: 03+2] phabricator: add script/timer to create tarballs of home dirs [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [19:53:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2293.codfw.wmnet with reason: host reimage [19:54:19] (03PS1) 10Ryan Kemper: wdqs graph-split: subdomain of query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) [19:55:18] (03CR) 10Dzahn: "I don't think you can have a certificate matching that. 
wildcard only for one level" [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper) [19:56:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2292.codfw.wmnet with reason: host reimage [19:56:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2296.codfw.wmnet with OS bullseye [19:59:43] (03CR) 10Dzahn: [C: 03+2] "tested with:" [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [20:02:57] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2294.codfw.wmnet with reason: host reimage [20:03:29] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2297.codfw.wmnet with OS bullseye [20:06:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2294.codfw.wmnet with reason: host reimage [20:08:18] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2295.codfw.wmnet with reason: host reimage [20:11:17] (03CR) 10Ryan Kemper: wdqs graph-split: subdomain of query.wikidata.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper) [20:11:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2295.codfw.wmnet with reason: host reimage [20:12:10] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper) [20:12:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2291.codfw.wmnet with OS bullseye [20:12:42] (03CR) 10Bking: [V: 03+1] wdqs graph-split: subdomain of query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper) [20:13:32] !log kamila@cumin1002 
START - Cookbook sre.hosts.downtime for 2:00:00 on mw2296.codfw.wmnet with reason: host reimage [20:13:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2293.codfw.wmnet with OS bullseye [20:15:47] (03CR) 10Ryan Kemper: [C: 03+2] wdqs graph-split: subdomain of query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper) [20:16:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2292.codfw.wmnet with OS bullseye [20:17:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2296.codfw.wmnet with reason: host reimage [20:18:37] (03PS1) 10Ryan Kemper: wdqs graph-split: new trafficserver rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464) [20:20:00] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2297.codfw.wmnet with reason: host reimage [20:20:19] (03PS2) 10Ryan Kemper: wdqs graph-split: new trafficserver rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464) [20:20:35] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [20:23:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2297.codfw.wmnet with reason: host reimage [20:23:47] (03CR) 10Bking: [C: 03+1] wdqs graph-split: new trafficserver rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [20:24:07] (03CR) 10Ryan Kemper: [C: 03+2] wdqs graph-split: new trafficserver rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [20:24:28] 10SRE-tools, 10Infrastructure-Foundations: 
Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187 (10Volans) p:05Triage→03Medium a:03Volans [20:25:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2294.codfw.wmnet with OS bullseye [20:26:18] !log T351650 Running puppet on `P:trafficserver::backend` following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/991091 [20:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:21] T351650: Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 [20:30:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2295.codfw.wmnet with OS bullseye [20:37:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2296.codfw.wmnet with OS bullseye [20:43:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2297.codfw.wmnet with OS bullseye [20:43:59] (03CR) 10ArielGlenn: [C: 03+1] "Looks great, thanks for the explanation in the comments" [puppet] - 10https://gerrit.wikimedia.org/r/989217 (https://phabricator.wikimedia.org/T354679) (owner: 10Xcollazo) [20:54:58] (03CR) 10Jdlrobson: [C: 03+1] "Will you be able to deploy this change? https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) (owner: 10Anzx) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T2100). [21:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:16] i can deploy today [21:00:37] Jdlrobson: i assume you're around, based on your C+1, but asking just in case :) [21:10:21] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:27:10] (03PS1) 10Jeena Huneidi: Merge remote-tracking branch 'origin' into updateTrainDev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991092 [21:30:27] (03PS2) 10Jeena Huneidi: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991092 [21:32:59] (03CR) 10Jeena Huneidi: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991092 (owner: 10Jeena Huneidi) [21:33:59] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991092 (owner: 10Jeena Huneidi) [21:35:30] hello sorry I'm late for the window urbanecm [21:35:33] i had a last minute call [21:35:35] is it too late? [21:40:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [21:40:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [21:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P54760 and previous config saved to /var/cache/conftool/dbconfig/20240116-214016-ladsgroup.json [21:40:46] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:03:50] Jdlrobson: unfortunately, i just saw the ping. so yes, at this point.
[22:26:26] urbanecm: no worries. I've moved it to tomorrow :) [22:26:37] sounds good! [23:15:04] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [23:15:35] (03CR) 10Cwhite: [C: 03+1] graphite: remove mw edit failures graphite alerts [puppet] - 10https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [23:23:10] (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:38:02] (03PS2) 10Tim Starling: Disable SameSite legacy cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791) [23:41:04] (03CR) 10Cwhite: [V: 03+1 C: 03+1] "PCC NOOP https://puppet-compiler.wmflabs.org/output/990166/1134/" [puppet] - 10https://gerrit.wikimedia.org/r/990166 (https://phabricator.wikimedia.org/T354904) (owner: 10Cwhite) [23:41:32] (03CR) 10Tim Starling: [C: 03+2] Disable SameSite legacy cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791) (owner: 10Tim Starling) [23:42:20] (03Merged) 10jenkins-bot: Disable SameSite legacy cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791) (owner: 10Tim Starling) [23:55:48] !log tstarling@deploy2002 Synchronized wmf-config/CommonSettings.php: Disable wgUseSameSiteLegacyCookies T344791 (duration: 09m 19s) [23:55:53] T344791: Get rid of ss0- SameSite cookie prefix hack - https://phabricator.wikimedia.org/T344791 [23:56:05] (03PS18) 10Herron: grafana: add dashboard datasource usage (graphite) exporter 
[puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [23:57:16] (03CR) 10CI reject: [V: 04-1] grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [23:59:36] (03PS19) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591)