[00:12:05] <wikibugs>	 (03PS3) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774)
[00:17:51] <wikibugs>	 (03CR) 10DDesouza: "It looks fine though I'm not familiarized with this codebase." [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[00:21:21] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:51] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990676
[00:38:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990676 (owner: 10TrainBranchBot)
[00:43:09] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:51] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:57] <jinxer-wm>	 (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:58:15] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990676 (owner: 10TrainBranchBot)
[00:59:06] <sukhe>	 welp, I will ACK and depool
[00:59:15] <jinxer-wm>	 (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:22] <jinxer-wm>	 (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:02:16] <sukhe>	 weird, everything looks fine though
[01:02:57] <jinxer-wm>	 (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:03:16] <sukhe>	 https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview?orgId=1&var-cluster=ulsfo%20prometheus%2Fops not a significant increase yeah
[01:03:20] <sukhe>	 ok that resolved itself
[01:03:50] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T355098 (10phaultfinder)
[01:04:15] <jinxer-wm>	 (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:04:39] <sukhe>	 anyway, it's for tomorrow now
[01:40:22] <jinxer-wm>	 (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:40:57] <jinxer-wm>	 (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:42:00] <wikibugs>	 (03PS1) 10AntiCompositeNumber: Add global_edit_count to fullviews [puppet] - 10https://gerrit.wikimedia.org/r/990790 (https://phabricator.wikimedia.org/T344108)
[01:44:15] <jinxer-wm>	 (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:45:57] <jinxer-wm>	 (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:33:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:38:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:39:16] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0300)
[03:05:22] <jinxer-wm>	 (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:05:58] <jinxer-wm>	 (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:07:44] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.14 [core] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990677 (https://phabricator.wikimedia.org/T354432)
[03:07:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.14 [core] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990677 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot)
[03:09:15] <jinxer-wm>	 (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:09:16] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:58] <jinxer-wm>	 (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:14:15] <jinxer-wm>	 (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:21:47] <icinga-wm>	 PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[03:23:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:23:19] <icinga-wm>	 RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[03:27:35] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.14 [core] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990677 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot)
[03:33:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:38:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:43:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:45:22] <jinxer-wm>	 (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:49:15] <jinxer-wm>	 (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:53:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:54:15] <jinxer-wm>	 (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:58:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0400)
[04:25:22] <jinxer-wm>	 (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:29:15] <jinxer-wm>	 (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:30:23] <jinxer-wm>	 (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:34:15] <jinxer-wm>	 (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:42:31] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:43:37] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:48:19] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:48:45] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:09:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:19:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:24:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:29:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:34:26] <wikibugs>	 (03PS8) 10KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982)
[06:34:30] <wikibugs>	 (03PS1) 10Andrea Denisse: grafana: Create Grafana sysuser and home directory [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[06:46:01] <wikibugs>	 (03PS2) 10Andrea Denisse: grafana: Create Grafana sysuser and home directory [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0700).
[07:04:21] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:16] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:04:49] <wikibugs>	 (03PS1) 10Slyngshede: Netfilter, minor improvements to alerts. [alerts] - 10https://gerrit.wikimedia.org/r/990943
[08:05:52] <wikibugs>	 (03CR) 10Slyngshede: "Fix issues highlighted by Taavi." [alerts] - 10https://gerrit.wikimedia.org/r/990943 (owner: 10Slyngshede)
[08:10:10] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] Update statsd-exporter mappings for Airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/990688 (https://phabricator.wikimedia.org/T343232) (owner: 10Aqu)
[08:28:49] <wikibugs>	 (03CR) 10Muehlenhoff: grafana: Create Grafana sysuser and home directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse)
[08:31:37] <wikibugs>	 (03CR) 10Muehlenhoff: Bump version number to 0.0.4 (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede)
[08:34:17] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10MoritzMuehlenhoff) >>! In T352974#9460261, @ABran-WMF wrote: > I ran the following test: with a custom PKI,   Nice! Out of interest, which PKI t...
[08:37:33] <wikibugs>	 (03PS2) 10Slyngshede: Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701
[08:38:13] <wikibugs>	 (03CR) 10Slyngshede: Bump version number to 0.0.4 (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede)
[08:39:48] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) >>! In T352974#9461185, @MoritzMuehlenhoff wrote: > Nice! Out of interest, which PKI tool did you use for your tests? As a next step...
[08:49:00] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10MoritzMuehlenhoff) >>! In T352974#9461200, @ABran-WMF wrote: >>>! In T352974#9461185, @MoritzMuehlenhoff wrote: >> Nice! Out of interest, which...
[08:51:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi)
[08:56:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede)
[08:57:09] <wikibugs>	 (03PS9) 10Slyngshede: Changes to Python infrastucture to help building Debian package. [software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799
[08:57:57] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990946 (https://phabricator.wikimedia.org/T354432)
[08:57:59] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990946 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot)
[08:58:42] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990946 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot)
[08:59:05] <logmsgbot>	 !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.14  refs T354432
[08:59:09] <stashbot>	 T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432
[08:59:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/990947
[09:00:04] <jouncebot>	 jnuche and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T0900). nyaa~
[09:02:27] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] trafficserver: switch design.wikimedia.org to wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[09:02:48] <wikibugs>	 (03PS1) 10Majavah: P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067)
[09:03:55] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Daniram3 out of all services on: 2211 hosts
[09:03:55] <denisse>	 !log reprepro: Copy grafana v9.4.14 from buster to bookworm
[09:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:20] <denisse>	 !log reprepro: Copy grafana v9.4.14 from buster to bookworm - T352665
[09:04:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:24] <stashbot>	 T352665: Upgrade Grafana hosts to Bookworm - https://phabricator.wikimedia.org/T352665
[09:05:05] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Daniram3 out of all services on: 2211 hosts
[09:05:13] <wikibugs>	 (03CR) 10Slyngshede: Changes to Python infrastucture to help building Debian package. (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 (owner: 10Slyngshede)
[09:05:26] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede)
[09:05:30] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701 (owner: 10Slyngshede)
[09:06:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/990947 (owner: 10Muehlenhoff)
[09:07:23] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] P:openstack: nova::compute: restart libvirt api after changing TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/990724 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah)
[09:08:15] <icinga-wm>	 PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@24065.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:46] <wikibugs>	 (03CR) 10David Caro: P:openstack: nova::compute: include certificate chain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah)
[09:09:44] <wikibugs>	 (03PS2) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990661 (https://phabricator.wikimedia.org/T349619)
[09:09:52] <wikibugs>	 (03PS2) 10Majavah: P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067)
[09:10:22] <wikibugs>	 (03PS3) 10Majavah: P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067)
[09:10:36] <wikibugs>	 (03CR) 10Majavah: P:openstack: nova::compute: include certificate chain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah)
[09:11:36] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1122/co" [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah)
[09:13:04] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:openstack: nova::compute: restart libvirt api after changing TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/990724 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah)
[09:14:40] <wikibugs>	 (03PS4) 10Majavah: P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067)
[09:18:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline and below" [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[09:18:28] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:openstack: nova::compute: include certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/990948 (https://phabricator.wikimedia.org/T355067) (owner: 10Majavah)
[09:23:45] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.dns.netbox
[09:23:47] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1038 [puppet] - 10https://gerrit.wikimedia.org/r/990950 (https://phabricator.wikimedia.org/T349619)
[09:24:50] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1039 [puppet] - 10https://gerrit.wikimedia.org/r/990951 (https://phabricator.wikimedia.org/T349619)
[09:25:11] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:26:40] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set cloudvirt2004-dev as active - taavi@cumin1002"
[09:26:48] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/990952 (https://phabricator.wikimedia.org/T349619)
[09:26:50] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2039 [puppet] - 10https://gerrit.wikimedia.org/r/990953 (https://phabricator.wikimedia.org/T349619)
[09:27:21] <wikibugs>	 (03PS3) 10Effie Mouzeli: (DNM) Switch Mediawiki main memcache clusters to puppet 7: all hosts [puppet] - 10https://gerrit.wikimedia.org/r/990661 (https://phabricator.wikimedia.org/T349619)
[09:28:36] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set cloudvirt2004-dev as active - taavi@cumin1002"
[09:32:05] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-dpkg-success-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: jaeger: add oauth2-proxy sidecar (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi)
[09:32:29] <wikibugs>	 (03PS2) 10Filippo Giunchedi: jaeger: add oauth2-proxy sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555)
[09:32:43] <icinga-wm>	 PROBLEM - Disk space on mwdebug1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=79%): /tmp 0 MB (0% inode=79%): /var/tmp 0 MB (0% inode=79%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1002&var-datasource=eqiad+prometheus/ops
[09:33:57] <icinga-wm>	 PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=77%): /tmp 0 MB (0% inode=77%): /var/tmp 0 MB (0% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops
[09:34:07] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-dpkg-success-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:30] <godog>	 mwdebug is not an happy camper at all, I'll take a quick look
[09:38:48] <godog>	 I'd imagine another mediawiki version in /srv/mediawiki brought it over the limit
[09:39:19] <icinga-wm>	 PROBLEM - Disk space on mw2272 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2272&var-datasource=codfw+prometheus/ops
[09:40:58] <wikibugs>	 (03PS1) 10Kosta Harlan: PreAuthenticationProvider: Deny account creation based on ipoid data [extensions/CentralAuth] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990752 (https://phabricator.wikimedia.org/T354928)
[09:42:34] <godog>	 similar problem, /srv being 40G on mw2272 means current mw versions don't fit anymore
[09:45:33] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:45:39] <wikibugs>	 (03PS1) 10Majavah: report_users: drop dbproxy1018/9 [software] - 10https://gerrit.wikimedia.org/r/990957 (https://phabricator.wikimedia.org/T346947)
[09:46:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 500 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:47:05] <wikibugs>	 (03PS2) 10Majavah: P:etcd: generate wiki replica pool accounts [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T300427)
[09:47:58] <godog>	 thoughts/ideas on what's best? trying to gauge how widespread the issue is now
[09:48:27] <hnowlan>	 godog: we can just delete some old versions right? 
[09:49:04] <hnowlan>	 fairly sure that's safe, php-1.42.0-wmf.7 is from november for example
[09:49:17] <godog>	 hnowlan: I believe so yeah, no idea how to do that though
[09:49:37] <icinga-wm>	 PROBLEM - Disk space on mw2283 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2283&var-datasource=codfw+prometheus/ops
[09:51:13] <icinga-wm>	 PROBLEM - Disk space on mw2282 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2282&var-datasource=codfw+prometheus/ops
[09:51:20] <hnowlan>	 I think we can just rm although i don't know what side effects that might have 
[09:51:50] <godog>	 indeed
[09:51:54] <godog>	 same here
[09:51:57] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.14  refs T354432 (duration: 52m 52s)
[09:51:57] <wikibugs>	 (03PS1) 10Majavah: Remove cloud-support VLANs from policies [homer/public] - 10https://gerrit.wikimedia.org/r/990960 (https://phabricator.wikimedia.org/T355115)
[09:52:01] <stashbot>	 T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432
[09:52:02] <hnowlan>	 I'll try on a mwdebug host
[09:52:35] <godog>	 jnuche: ^ see above, mw hosts running out of disk space
[09:53:01] <taavi>	 ideally old trains would be removed via the scap command to do so
[09:53:04] <Lucas_WMDE>	 FWIW I’ve occasionally deleted directories on individual hosts that were left over for some reason https://sal.toolforge.org/log/qOTkE4sBhuQtenzvv3hj
[09:53:13] <icinga-wm>	 RECOVERY - Disk space on mwdebug1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1002&var-datasource=eqiad+prometheus/ops
[09:53:15] <Lucas_WMDE>	 but that wasn’t fleet-wide
[09:53:27] <godog>	 jnuche: can we ditch old mw versions please?
[09:53:36] <godog>	 Lucas_WMDE taavi ack thanks!
[09:53:37] <Lucas_WMDE>	 I guess you’d delete on deployment.e.w and scap sync that?
[09:53:43] <icinga-wm>	 PROBLEM - Disk space on mw2259 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2259&var-datasource=codfw+prometheus/ops
[09:53:49] <godog>	 I'd imagine so too
[09:54:05] <jnuche>	 godog: thanks, I just saw the issue, there's some 30 prod machines affected plus debug and test servers
[09:54:18] <jnuche>	 and yeah, scap was supposed to remove old versions AFAIK
[09:54:25] <jnuche>	 not sure why old dirs were left behind
[09:54:27] <icinga-wm>	 RECOVERY - Disk space on mwdebug1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops
[09:54:28] <wikibugs>	 (03PS1) 10Majavah: templates: drop cloud-support1-c-eqiad includes [dns] - 10https://gerrit.wikimedia.org/r/990961 (https://phabricator.wikimedia.org/T355115)
[09:55:04] <godog>	 cc hnowlan ^
[09:55:09] <icinga-wm>	 PROBLEM - Disk space on mw2271 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2271&var-datasource=codfw+prometheus/ops
[09:56:07] <icinga-wm>	 PROBLEM - Disk space on mw2286 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2286&var-datasource=codfw+prometheus/ops
[09:56:07] <hnowlan>	 scap will sync what it's told to right? there's 5 older php-1.42.0-wmf.x versions in /srv/mediawiki-staging
[09:56:13] <godog>	 jnuche: can we force scap to clean up now as taavi mentioned ?
[09:56:28] <wikibugs>	 (03PS1) 10Majavah: hieradata: drop cloud-support1-c-eqiad from LVS [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115)
[09:57:14] <jnuche>	 godog: should be possible, looking it up, failing that I'll just remove the dirs from the deployment server and resync
[09:57:33] <godog>	 ack, sgtm
[09:57:47] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:07] <icinga-wm>	 PROBLEM - Disk space on mw2287 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2287&var-datasource=codfw+prometheus/ops
[09:58:31] <icinga-wm>	 PROBLEM - Disk space on mw2278 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops
[09:58:54] <godog>	 in the meantime I'll open a followup task
[09:59:07] <icinga-wm>	 PROBLEM - Disk space on mw2281 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops
[09:59:25] <icinga-wm>	 PROBLEM - Disk space on mw2285 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2285&var-datasource=codfw+prometheus/ops
[10:00:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1038.eqiad.wmnet
[10:00:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1038 [puppet] - 10https://gerrit.wikimedia.org/r/990950 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli)
[10:01:01] <icinga-wm>	 PROBLEM - Disk space on mw2264 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2264&var-datasource=codfw+prometheus/ops
[10:01:23] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah)
[10:01:25] <icinga-wm>	 PROBLEM - Disk space on mw2267 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2267&var-datasource=codfw+prometheus/ops
[10:01:27] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:35] <icinga-wm>	 PROBLEM - Disk space on mw2289 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2289&var-datasource=codfw+prometheus/ops
[10:01:45] <jnuche>	 deleted now and resyncing
[10:01:49] <icinga-wm>	 PROBLEM - Disk space on mw2265 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2265&var-datasource=codfw+prometheus/ops
[10:04:00] <icinga-wm>	 PROBLEM - Disk space on mw2266 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops
[10:04:51] <wikibugs>	 10SRE, 10serviceops: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (10fgiunchedi)
[10:04:57] <godog>	 created ^
[10:05:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1038.eqiad.wmnet
[10:05:47] <icinga-wm>	 PROBLEM - Disk space on mw2262 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2262&var-datasource=codfw+prometheus/ops
[10:05:55] <icinga-wm>	 PROBLEM - Disk space on mw2288 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2288&var-datasource=codfw+prometheus/ops
[10:06:11] <icinga-wm>	 PROBLEM - Disk space on mw2276 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2276&var-datasource=codfw+prometheus/ops
[10:06:37] <icinga-wm>	 PROBLEM - Disk space on mw2269 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2269&var-datasource=codfw+prometheus/ops
[10:06:48] <logmsgbot>	 !log jnuche@deploy2002 Pruned MediaWiki: 1.42.0-wmf.7, 1.42.0-wmf.9, 1.42.0-wmf.10, 1.42.0-wmf.12 (duration: 07m 08s)
[10:07:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1039.eqiad.wmnet
[10:07:38] <jnuche>	 mmmh, the prune removed the versions from the deploy server, but not from the target hosts, sigh
[10:08:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1039 [puppet] - 10https://gerrit.wikimedia.org/r/990951 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli)
[10:08:47] <godog>	 jnuche: ack, I can run a cumin command on the mw fleet too, would a simple rm -rf /srv/mediawiki/php-1.42.0-wmf.7 do the trick for example ?
[10:09:31] <jnuche>	 godog: yeah, that should work, once there's enough space on the hosts maybe the prune will work
[10:09:40] <godog>	 ack, doing cc hnowlan 
[10:09:48] <jnuche>	 I think it's trying to sync wmf.14 and bails out once it fails, so it never prunes the other dirs
[10:10:07] <icinga-wm>	 RECOVERY - Disk space on mw2283 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2283&var-datasource=codfw+prometheus/ops
[10:10:28] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] trafficserver: switch design.wikimedia.org to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[10:10:45] <godog>	 !log manually pruning php-1.42.0-wmf.7 from mw22* - T355117
[10:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:58] <stashbot>	 T355117: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117
[10:11:34] <godog>	 jnuche: I think between your actions and mine we're good now
[10:11:43] <icinga-wm>	 RECOVERY - Disk space on mw2282 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2282&var-datasource=codfw+prometheus/ops
[10:11:50] <godog>	 there will be recoveries coming in
[10:12:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1039.eqiad.wmnet
[10:13:28] <wikibugs>	 (03PS1) 10Klausman: Add Lift Wing recommendation-api-ng SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/989187
[10:13:49] <jnuche>	 godog: thanks a lot, can you also run that cumin command for the other old branches?: 1.42.0-wmf.9, 1.42.0-wmf.10, 1.42.0-wmf.12
[10:14:13] <icinga-wm>	 RECOVERY - Disk space on mw2259 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2259&var-datasource=codfw+prometheus/ops
[10:15:14] <godog>	 jnuche: will do yeah, I've limited the cumin command to mw22* as those hosts seemed to be problematic
[10:15:20] <godog>	 and I'm scared to do it on mw*
[10:15:32] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet
[10:15:39] <icinga-wm>	 RECOVERY - Disk space on mw2271 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2271&var-datasource=codfw+prometheus/ops
[10:16:03] <volans>	 godog: I suggest using mw aliases instead of hostname prefixes as some hosts have been migrated to be k8s hosts
[10:16:11] <godog>	 !log clean up also 1.42.0-wmf.9 1.42.0-wmf.10 1.42.0-wmf.12 from mw22* - T355117
[10:16:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:15] <stashbot>	 T355117: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117
[10:16:29] <godog>	 volans: thank you
[10:16:35] <icinga-wm>	 RECOVERY - Disk space on mw2286 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2286&var-datasource=codfw+prometheus/ops
[10:16:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2038.codfw.wmnet
[10:17:38] <jnuche>	 godog: thx, I think the only non-mw22 are debug and test hosts, I can take care of those
[10:17:49] <wikibugs>	 (03PS2) 10Klausman: Add Lift Wing recommendation-api-ng SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/989187 (https://phabricator.wikimedia.org/T347262)
[10:17:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/990952 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli)
[10:18:37] <icinga-wm>	 RECOVERY - Disk space on mw2287 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2287&var-datasource=codfw+prometheus/ops
[10:19:01] <icinga-wm>	 RECOVERY - Disk space on mw2278 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops
[10:19:08] <godog>	 jnuche: yeah I think we're good, I've verified with this thanos query to check for > 95% usage https://w.wiki/8rVF
[10:19:35] <icinga-wm>	 RECOVERY - Disk space on mw2281 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops
[10:19:55] <icinga-wm>	 RECOVERY - Disk space on mw2285 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2285&var-datasource=codfw+prometheus/ops
[10:20:23] <icinga-wm>	 RECOVERY - Disk space on mw2272 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2272&var-datasource=codfw+prometheus/ops
[10:21:13] <jnuche>	 godog: hum, didn't know about thanos, nice :)
[10:21:24] <jnuche>	 I'm going to wait a few minutes and then retry the presync
[10:21:31] <icinga-wm>	 RECOVERY - Disk space on mw2264 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2264&var-datasource=codfw+prometheus/ops
[10:21:34] <godog>	 it's inevitable
[10:21:37] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet
[10:21:55] <icinga-wm>	 RECOVERY - Disk space on mw2267 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2267&var-datasource=codfw+prometheus/ops
[10:22:05] <icinga-wm>	 RECOVERY - Disk space on mw2289 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2289&var-datasource=codfw+prometheus/ops
[10:22:10] <jnuche>	 :D
[10:22:19] <icinga-wm>	 RECOVERY - Disk space on mw2265 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2265&var-datasource=codfw+prometheus/ops
[10:24:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2038.codfw.wmnet
[10:24:29] <icinga-wm>	 RECOVERY - Disk space on mw2266 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops
[10:26:15] <icinga-wm>	 RECOVERY - Disk space on mw2262 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2262&var-datasource=codfw+prometheus/ops
[10:26:25] <icinga-wm>	 RECOVERY - Disk space on mw2288 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2288&var-datasource=codfw+prometheus/ops
[10:26:41] <icinga-wm>	 RECOVERY - Disk space on mw2276 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2276&var-datasource=codfw+prometheus/ops
[10:27:07] <icinga-wm>	 RECOVERY - Disk space on mw2269 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2269&var-datasource=codfw+prometheus/ops
[10:29:09] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet
[10:30:15] <logmsgbot>	 !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.14  refs T354432
[10:30:19] <stashbot>	 T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432
[10:30:39] <godog>	 taking a break, bbiab
[10:30:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2039.codfw.wmnet
[10:32:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2039 [puppet] - 10https://gerrit.wikimedia.org/r/990953 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli)
[10:32:21] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: The root grant on localhost is set via unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/990967
[10:35:25] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet
[10:37:43] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: The root grant on localhost is set via unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/990967
[10:38:43] <wikibugs>	 (03PS6) 10Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300
[10:41:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2039.codfw.wmnet
[10:43:35] <wikibugs>	 (03PS7) 10Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300
[10:47:05] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb/microsites: move monitoring of design to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[10:47:11] <wikibugs>	 (03PS1) 10Ayounsi: [WIP] Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[10:47:50] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet
[10:51:00] <wikibugs>	 (03PS7) 10Btullis: Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642)
[10:52:31] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving this since codfw is done and eqiad is tracked in T354684
[10:53:17] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1124/co" [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis)
[10:53:45] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet
[10:59:52] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.14  refs T354432 (duration: 29m 36s)
[10:59:56] <stashbot>	 T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432
[11:00:02] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) Nice finding Arnaud!  >>! In T352974#9461217, @MoritzMuehlenhoff wrote: >  > Let's create a separate task for switching Orchestrator...
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1100)
[11:01:02] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1040 [puppet] - 10https://gerrit.wikimedia.org/r/990971
[11:01:27] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2040 [puppet] - 10https://gerrit.wikimedia.org/r/990972
[11:01:48] <jnuche>	 heads-up that I'm still going to be running the train unless infra deployments need to happen
[11:02:08] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1041 [puppet] - 10https://gerrit.wikimedia.org/r/990973
[11:02:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one remaining nit inline" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede)
[11:03:05] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2041 [puppet] - 10https://gerrit.wikimedia.org/r/990974
[11:03:35] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis)
[11:03:46] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Switch presto from Puppet to PKI certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis)
[11:03:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jnuche@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990752 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan)
[11:05:23] <wikibugs>	 (03PS2) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2041 [puppet] - 10https://gerrit.wikimedia.org/r/990974
[11:05:52] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Switch presto from Puppet to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis)
[11:06:09] <wikibugs>	 (03PS3) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2041 [puppet] - 10https://gerrit.wikimedia.org/r/990974
[11:08:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1040.eqiad.wmnet
[11:08:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1040 [puppet] - 10https://gerrit.wikimedia.org/r/990971 (owner: 10Effie Mouzeli)
[11:09:17] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:09:38] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10fgiunchedi) The question on how to run debmonitor-client in Pontoon is an interesting one, though unrelated to this issue; debmonitor is not installed b...
[11:09:53] <wikibugs>	 (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074)
[11:10:11] <wikibugs>	 (03Merged) 10jenkins-bot: PreAuthenticationProvider: Deny account creation based on ipoid data [extensions/CentralAuth] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/990752 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan)
[11:12:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1040.eqiad.wmnet
[11:13:23] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet
[11:15:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1041.eqiad.wmnet
[11:15:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1041 [puppet] - 10https://gerrit.wikimedia.org/r/990973 (owner: 10Effie Mouzeli)
[11:16:03] <logmsgbot>	 !log jnuche@deploy2002 Started scap: Backport for [[gerrit:990752|PreAuthenticationProvider: Deny account creation based on ipoid data (T354928)]]
[11:16:04] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc1041 [puppet] - 10https://gerrit.wikimedia.org/r/990973 (owner: 10Effie Mouzeli)
[11:16:07] <stashbot>	 T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928
[11:19:32] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet
[11:21:37] <wikibugs>	 (03PS2) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074)
[11:23:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1041.eqiad.wmnet
[11:26:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2040.codfw.wmnet
[11:28:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2040 [puppet] - 10https://gerrit.wikimedia.org/r/990972 (owner: 10Effie Mouzeli)
[11:30:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Netfilter, minor improvements to alerts. [alerts] - 10https://gerrit.wikimedia.org/r/990943 (owner: 10Slyngshede)
[11:33:21] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Netfilter max connection tracking entires. (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[11:33:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2040.codfw.wmnet
[11:33:27] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Netfilter, minor improvements to alerts. [alerts] - 10https://gerrit.wikimedia.org/r/990943 (owner: 10Slyngshede)
[11:34:34] <wikibugs>	 (03Merged) 10jenkins-bot: Netfilter, minor improvements to alerts. [alerts] - 10https://gerrit.wikimedia.org/r/990943 (owner: 10Slyngshede)
[11:36:04] <logmsgbot>	 !log jnuche@deploy2002 jnuche and kharlan: Backport for [[gerrit:990752|PreAuthenticationProvider: Deny account creation based on ipoid data (T354928)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:36:08] <stashbot>	 T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928
[11:36:31] <logmsgbot>	 !log jnuche@deploy2002 jnuche and kharlan: Continuing with sync
[11:38:31] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: mediawiki::mcrouter_wancache: upgrade onhost memcached to 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/682166 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli)
[11:39:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2041.codfw.wmnet
[11:40:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2041 [puppet] - 10https://gerrit.wikimedia.org/r/990974 (owner: 10Effie Mouzeli)
[11:45:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2041.codfw.wmnet
[11:45:24] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc-wf1001 [puppet] - 10https://gerrit.wikimedia.org/r/990984
[11:45:35] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:990752|PreAuthenticationProvider: Deny account creation based on ipoid data (T354928)]] (duration: 29m 32s)
[11:45:39] <stashbot>	 T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928
[11:47:33] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:58] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990985 (https://phabricator.wikimedia.org/T354432)
[11:48:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990985 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot)
[11:49:01] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990985 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot)
[11:56:13] <logmsgbot>	 !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.14  refs T354432
[11:56:17] <stashbot>	 T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432
[11:56:37] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:15] <wikibugs>	 (03PS8) 10Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300
[11:57:20] <wikibugs>	 (03CR) 10Slyngshede: Package Debmonitor server as .deb (035 comments) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede)
[11:58:07] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1042 [puppet] - 10https://gerrit.wikimedia.org/r/990986
[11:58:09] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2042 [puppet] - 10https://gerrit.wikimedia.org/r/990987
[11:58:11] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1043 [puppet] - 10https://gerrit.wikimedia.org/r/990988
[11:58:13] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2043 [puppet] - 10https://gerrit.wikimedia.org/r/990989
[11:58:15] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1044 [puppet] - 10https://gerrit.wikimedia.org/r/990990
[11:58:17] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2044 [puppet] - 10https://gerrit.wikimedia.org/r/990991
[11:58:19] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1045 [puppet] - 10https://gerrit.wikimedia.org/r/990992
[11:58:21] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2045 [puppet] - 10https://gerrit.wikimedia.org/r/990993
[11:58:23] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1046 [puppet] - 10https://gerrit.wikimedia.org/r/990994
[11:58:25] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2046 [puppet] - 10https://gerrit.wikimedia.org/r/990995
[11:58:27] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1047 [puppet] - 10https://gerrit.wikimedia.org/r/990996
[11:58:37] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2047 [puppet] - 10https://gerrit.wikimedia.org/r/990997
[11:58:41] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc1048 [puppet] - 10https://gerrit.wikimedia.org/r/990998
[11:58:45] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7: mc2048 [puppet] - 10https://gerrit.wikimedia.org/r/990999
[12:05:16] <wikibugs>	 (03PS1) 10Jelto: miscweb: add design-style-guide release [deployment-charts] - 10https://gerrit.wikimedia.org/r/991000 (https://phabricator.wikimedia.org/T350791)
[12:10:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc-wf1001.eqiad.wmnet
[12:11:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mc-wf1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991001 (https://phabricator.wikimedia.org/T349619)
[12:11:44] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet
[12:14:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch mc-wf1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991001 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:16:21] <Lucas_WMDE>	 effie: you created change 990990 \o/ ^^
[12:17:47] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:59] <effie>	 Lucas_WMDE: hahaha 
[12:18:41] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet
[12:18:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc-wf1001.eqiad.wmnet
[12:22:42] <wikibugs>	 (03PS3) 10Urbanecm: beta: Enable conditional defaults for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225)
[12:24:27] <wikibugs>	 (03CR) 10Muehlenhoff: Netfilter max connection tracking entires. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[12:25:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede)
[12:30:03] <moritzm>	 !log installing systemd bugfix updates from Bullseye point release
[12:30:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:21] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:29] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Netfilter max connection tracking entires. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[12:38:18] <wikibugs>	 (03PS1) 10KartikMistry: Set MT threshold for Punjabi Wikipedia to 97 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991002 (https://phabricator.wikimedia.org/T347789)
[12:40:56] <wikibugs>	 (03PS1) 10Slyngshede: Netfilter: Remove exclude filter. [alerts] - 10https://gerrit.wikimedia.org/r/991003 (https://phabricator.wikimedia.org/T350694)
[12:42:51] <wikibugs>	 (03CR) 10Slyngshede: "Following up on the comments on 989188, I don't think it's realistic to keep an exclude list in sync with Puppet. The expression to trigge" [alerts] - 10https://gerrit.wikimedia.org/r/991003 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[12:46:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff)
[12:49:08] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] miscweb: add design-style-guide release [deployment-charts] - 10https://gerrit.wikimedia.org/r/991000 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[12:50:36] <wikibugs>	 (03CR) 10Majavah: Netfilter max connection tracking entires. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[12:50:56] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet
[12:52:02] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet
[12:56:21] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "not sure about this nib of pcc output:" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur)
[12:56:53] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet
[12:57:18] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] hiera: add acls for heavy ratelimiting abusing ip from list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur)
[12:57:58] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet
[12:59:45] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb: add design-style-guide release [deployment-charts] - 10https://gerrit.wikimedia.org/r/991000 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[12:59:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update associated email address for dreamyjazz (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: 10Muehlenhoff)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1300)
[13:00:57] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: add design-style-guide release [deployment-charts] - 10https://gerrit.wikimedia.org/r/991000 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[13:01:45] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc-wf1001.eqiad.wmnet with OS bullseye
[13:02:22] <effie>	 !log reimage mc-wf1001 (part of puppet7 migration)
[13:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:05:49] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:06:29] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[13:06:37] <wikibugs>	 (03PS9) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349)
[13:08:25] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[13:08:42] <wikibugs>	 (03PS6) 10EoghanGaffney: [gerrit] Refactor classes to specify an active host [puppet] - 10https://gerrit.wikimedia.org/r/988464
[13:08:54] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[13:09:18] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[13:09:39] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[13:10:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:10:52] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff)
[13:11:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:13:15] <wikibugs>	 (03PS1) 10Dreamy Jazz: Support parallel PhotoDNA requests [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990754 (https://phabricator.wikimedia.org/T354408)
[13:14:48] <wikibugs>	 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T355098 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact
[13:14:59] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Remove cloud-support VLANs from policies [homer/public] - 10https://gerrit.wikimedia.org/r/990960 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah)
[13:15:25] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage
[13:16:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:16:21] <wikibugs>	 (03PS10) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349)
[13:16:47] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Remove cloud-support VLANs from policies [homer/public] - 10https://gerrit.wikimedia.org/r/990960 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah)
[13:18:17] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage
[13:19:35] <wikibugs>	 (03PS2) 10Majavah: hieradata: drop cloud-support1-c-eqiad from LVS [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115)
[13:19:37] <wikibugs>	 (03PS1) 10Majavah: network: remove cloud-support1-c-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/991006 (https://phabricator.wikimedia.org/T355115)
[13:20:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff)
[13:23:49] <wikibugs>	 (03Merged) 10jenkins-bot: Remove cloud-support VLANs from policies [homer/public] - 10https://gerrit.wikimedia.org/r/990960 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah)
[13:25:29] <wikibugs>	 (03PS21) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910)
[13:26:57] <wikibugs>	 (03PS21) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910)
[13:28:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:28:25] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková)
[13:28:41] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: add mw edit failures alert [alerts] - 10https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597)
[13:28:43] <taavi>	 kamila_: seems like you disabled BGP on mw2436/mw2437 yesterday in netbox, but did not commit it via homer. homer is now giving me a diff relating to that with an unrelated change, is it ok to deploy that?
[13:29:57] <kamila_>	 taavi: oh shit that was by accident and I have a bug in my script XD
[13:30:09] <kamila_>	 sorry, don't deploy, I'll clean it up
[13:30:10] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1126/console" [puppet] - 10https://gerrit.wikimedia.org/r/988464 (owner: 10EoghanGaffney)
[13:30:13] <kamila_>	 thank you, sorryyy
[13:30:32] <taavi>	 kamila_: no worries, just lemme know when that's fixed
[13:30:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: graphite: remove mw edit failures graphite alerts [puppet] - 10https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597)
[13:31:41] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:42] <kamila_>	 taavi: fixed
[13:32:37] <taavi>	 thanks :D
[13:32:58] <wikibugs>	 (03PS11) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349)
[13:33:11] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:18] <taavi>	 yeah now the diff looks much better
[13:34:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Good point, better go to as simple as we can" [alerts] - 10https://gerrit.wikimedia.org/r/991003 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[13:34:36] <wikibugs>	 (03PS1) 10Jelto: miscweb/microsites: remove profile::microsites::design [puppet] - 10https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791)
[13:35:07] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1001.eqiad.wmnet with OS bullseye
[13:35:09] <kamila_>	 next time I run this script, I will double check that I did a clean git checkout and didn't leave my debugging stuff in there '^^
[13:36:03] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] report_users: drop dbproxy1018/9 [software] - 10https://gerrit.wikimedia.org/r/990957 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah)
[13:36:39] <wikibugs>	 (03Merged) 10jenkins-bot: report_users: drop dbproxy1018/9 [software] - 10https://gerrit.wikimedia.org/r/990957 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah)
[13:37:34] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1128/co" [puppet] - 10https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[13:37:47] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] Add base production images containing Java 8 JDK and JRE (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[13:38:03] <wikibugs>	 (03PS6) 10Hashar: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[13:38:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:40:30] <wikibugs>	 (03PS7) 10Btullis: Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176)
[13:40:32] <wikibugs>	 (03PS2) 10Btullis: Update the openjdk-11 images to match openjdk-8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/990036
[13:41:15] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I have amended the shell bits which had two `&&` due to some copy pastas. I have built the images locally invoking twice:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[13:43:05] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Ben's version (PS 7) moves the `&&` to the end of the lines, which is good as well :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[13:44:16] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Add base production images containing Java 8 JDK and JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[13:47:15] <wikibugs>	 (03CR) 10Jelto: "looks mostly good but profile::gerrit::is_replica is removed from hiera but still used in profile::gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/988464 (owner: 10EoghanGaffney)
[13:50:09] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza)
[13:50:21] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza)
[13:51:27] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza)
[13:54:30] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:54:34] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Wrong filenames in the File history section (timestamp differs from displayed timestamp) - https://phabricator.wikimedia.org/T302985 (10bjh21) I think there may be a mistaken assumption here:  > When you saved file on the beginning of the name it h...
[13:59:30] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "As far as I can tell, none of the code touched here is reached via web requests (only via maintenance), so hopefully backporting it won’t " [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990754 (https://phabricator.wikimedia.org/T354408) (owner: 10Dreamy Jazz)
[14:00:00] <Dreamy_Jazz>	 I can self-serve my backports
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1400)
[14:00:05] <jouncebot>	 Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:53] <Lucas_WMDE>	 o/
[14:00:54] <Lucas_WMDE>	 Dreamy_Jazz: ack
[14:01:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990754 (https://phabricator.wikimedia.org/T354408) (owner: 10Dreamy Jazz)
[14:02:24] <Dreamy_Jazz>	 Both patches I will deploy will only affect manually run maintenance scripts
[14:03:48] <wikibugs>	 (03Merged) 10jenkins-bot: Support parallel PhotoDNA requests [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990754 (https://phabricator.wikimedia.org/T354408) (owner: 10Dreamy Jazz)
[14:04:14] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:990754|Support parallel PhotoDNA requests (T354408)]]
[14:04:25] <stashbot>	 T354408: Support parallelizing scans to PhotoDNA - https://phabricator.wikimedia.org/T354408
[14:05:46] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:990754|Support parallel PhotoDNA requests (T354408)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:05:53] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[14:05:56] <wikibugs>	 (03PS1) 10Majavah: alertmanager: fix timezone bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/991017 (https://phabricator.wikimedia.org/T347490)
[14:06:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[14:07:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[14:07:12] <wikibugs>	 10SRE, 10serviceops: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (10jnuche) The problem was caused by older MW versions being left over on the drive. For instance:  ` mwdeploy@mw2272:/srv/mediawiki$ ls composer.json  dblists-index.php  errorpages  l...
[14:07:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1144:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54725 and previous config saved to /var/cache/conftool/dbconfig/20240116-140713-marostegui.json
[14:07:17] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[14:07:23] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet
[14:07:46] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet
[14:09:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54726 and previous config saved to /var/cache/conftool/dbconfig/20240116-140938-marostegui.json
[14:11:29] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:990754|Support parallel PhotoDNA requests (T354408)]] (duration: 07m 14s)
[14:11:48] <stashbot>	 T354408: Support parallelizing scans to PhotoDNA - https://phabricator.wikimedia.org/T354408
[14:11:48] <wikibugs>	 (03PS1) 10Dreamy Jazz: Add more statsd counters and add logstash logging [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990760 (https://phabricator.wikimedia.org/T351419)
[14:12:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990760 (https://phabricator.wikimedia.org/T351419) (owner: 10Dreamy Jazz)
[14:12:51] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff)
[14:14:04] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet
[14:14:29] <wikibugs>	 (03Merged) 10jenkins-bot: Add more statsd counters and add logstash logging [extensions/MediaModeration] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/990760 (https://phabricator.wikimedia.org/T351419) (owner: 10Dreamy Jazz)
[14:14:33] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet
[14:15:58] <Dreamy_Jazz>	 I received a 502 proxy error when using scap backport for 990760. I will retry this.
[14:16:55] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:990760|Add more statsd counters and add logstash logging (T351419)]]
[14:16:59] <stashbot>	 T351419: Create a Grafana chart to plot the number of PhotoDNA requests per day per wiki - https://phabricator.wikimedia.org/T351419
[14:17:01] <moritzm>	 !log installing 5.10.205 kernels on buster hosts running the 5.10 backport
[14:17:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:33] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:990760|Add more statsd counters and add logstash logging (T351419)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:18:38] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[14:23:01] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:10] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:990760|Add more statsd counters and add logstash logging (T351419)]] (duration: 07m 15s)
[14:24:14] <stashbot>	 T351419: Create a Grafana chart to plot the number of PhotoDNA requests per day per wiki - https://phabricator.wikimedia.org/T351419
[14:24:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P54727 and previous config saved to /var/cache/conftool/dbconfig/20240116-142444-marostegui.json
[14:26:00] <wikibugs>	 (03PS12) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349)
[14:28:24] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] miscweb/microsites: remove profile::microsites::design [puppet] - 10https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[14:29:54] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff)
[14:30:44] <Dreamy_Jazz>	 I think that is it for the backport window, unless anyone has any other things to deploy?
[14:31:41] <Dreamy_Jazz>	 !log UTC afternoon deploys done
[14:31:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:05] <moritzm>	 !log installing ca-certificates-java bugfix updates on bookworm
[14:33:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:32] <Lucas_WMDE>	 Dreamy_Jazz: thanks for doing the window ^^
[14:33:39] <Dreamy_Jazz>	 No problem :)
[14:36:07] <wikibugs>	 (03PS1) 10Marostegui: db2124: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/991025 (https://phabricator.wikimedia.org/T354506)
[14:36:49] <wikibugs>	 (03PS13) 10Muehlenhoff: Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349)
[14:37:36] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2124: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/991025 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui)
[14:39:16] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P54728 and previous config saved to /var/cache/conftool/dbconfig/20240116-143951-marostegui.json
[14:42:28] <wikibugs>	 (PS1) Hnowlan: modules: add cassandra client module [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507)
[14:44:08] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: Muehlenhoff)
[14:50:02] <wikibugs>	 (CR) Muehlenhoff: Configure ACLs for reprepro upload queue (1 comment) [puppet] - https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: Muehlenhoff)
[14:54:17] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54729 and previous config saved to /var/cache/conftool/dbconfig/20240116-145458-marostegui.json
[14:55:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[14:55:03] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[14:55:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[14:55:20] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[14:55:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[14:55:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[14:55:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:56:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:56:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T354336)', diff saved to https://phabricator.wikimedia.org/P54730 and previous config saved to /var/cache/conftool/dbconfig/20240116-145613-marostegui.json
[14:56:15] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/990961 (https://phabricator.wikimedia.org/T355115) (owner: Majavah)
[14:57:03] <wikibugs>	 (CR) Majavah: [C: +2] templates: drop cloud-support1-c-eqiad includes [dns] - https://gerrit.wikimedia.org/r/990961 (https://phabricator.wikimedia.org/T355115) (owner: Majavah)
[14:58:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T354336)', diff saved to https://phabricator.wikimedia.org/P54731 and previous config saved to /var/cache/conftool/dbconfig/20240116-145837-marostegui.json
[14:58:55] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old records for cloud-support1-c-eqiad - cmooney@cumin1002"
[15:00:01] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old records for cloud-support1-c-eqiad - cmooney@cumin1002"
[15:00:01] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:00:53] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:24] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: The root grant on localhost is set via unix_socket [puppet] - https://gerrit.wikimedia.org/r/990967 (owner: Ladsgroup)
[15:04:54] <wikibugs>	 (PS3) Ladsgroup: mariadb: The root grant on localhost is set via unix_socket [puppet] - https://gerrit.wikimedia.org/r/990967
[15:04:57] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: The root grant on localhost is set via unix_socket [puppet] - https://gerrit.wikimedia.org/r/990967 (owner: Ladsgroup)
[15:07:53] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[15:08:35] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "common/profile/trafficserver/backend.yaml:      target: http://design.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[15:10:41] <wikibugs>	 (PS7) EoghanGaffney: [gerrit] Refactor classes to specify an active host [puppet] - https://gerrit.wikimedia.org/r/988464
[15:11:30] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[15:13:30] <Dreamy_Jazz>	 !log T351400 running mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 20 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-20.txt
[15:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P54732 and previous config saved to /var/cache/conftool/dbconfig/20240116-151344-marostegui.json
[15:13:44] <stashbot>	 T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400
[15:14:03] <wikibugs>	 (PS2) Ssingh: depool codfw: do not merge! emergency depool patch [dns] - https://gerrit.wikimedia.org/r/989534 (https://phabricator.wikimedia.org/T352758)
[15:14:12] <wikibugs>	 (CR) CI reject: [V: -1] [gerrit] Refactor classes to specify an active host [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney)
[15:18:40] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks for the patch" [software/spicerack] - https://gerrit.wikimedia.org/r/991017 (https://phabricator.wikimedia.org/T347490) (owner: Majavah)
[15:18:41] <Dreamy_Jazz>	 !log Stopped mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 20 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-20.txt
[15:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:52] <Dreamy_Jazz>	 !log T351400 running mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 25 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-20.txt
[15:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:56] <stashbot>	 T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400
[15:19:09] <wikibugs>	 (PS8) EoghanGaffney: [gerrit] Refactor classes to specify an active host [puppet] - https://gerrit.wikimedia.org/r/988464
[15:19:23] <topranks>	 !log Disabling puppet and PyBal on lvs2013 ahead of migration of network link to ssw1-a1-codfw T352784
[15:19:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:27] <stashbot>	 T352784: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784
[15:21:05] <wikibugs>	 SRE-OnFire, Data-Platform-SRE, Wikidata, Wikidata-Query-Service, and 2 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (Gehel)
[15:21:11] <wikibugs>	 (CR) Majavah: [C: +2] alertmanager: fix timezone bug [software/spicerack] - https://gerrit.wikimedia.org/r/991017 (https://phabricator.wikimedia.org/T347490) (owner: Majavah)
[15:23:09] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:23:19] <wikibugs>	 (CR) Brouberol: [C: +1] "Looks good, now that https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/989786 is merged" [docker-images/production-images] - https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) (owner: Btullis)
[15:23:35] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:25:33] <icinga-wm>	 PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:25:45] <topranks>	 ^^ that's related to my work, I messed up the downtime
[15:26:09] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:27:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[15:27:29] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6,lvs2013 with reason: moving lvs hosts codfw T352784
[15:27:29] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1129/console" [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney)
[15:27:43] <stashbot>	 T352784: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784
[15:27:46] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6,lvs2013 with reason: moving lvs hosts codfw T352784
[15:28:06] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, and 2 others: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=38432fab-1dd6-4ffe-a093-648c38675985) set by c...
[15:28:39] <Dreamy_Jazz>	 !log stopped mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 25 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-25.txt
[15:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:48] <wikibugs>	 (Merged) jenkins-bot: alertmanager: fix timezone bug [software/spicerack] - https://gerrit.wikimedia.org/r/991017 (https://phabricator.wikimedia.org/T347490) (owner: Majavah)
[15:28:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P54733 and previous config saved to /var/cache/conftool/dbconfig/20240116-152850-marostegui.json
[15:29:07] <Dreamy_Jazz>	 !log T351400 running mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt
[15:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:11] <stashbot>	 T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400
[15:32:48] <wikibugs>	 (CR) Ssingh: "I see cloud-support1-c-eqiad: in network/data/data.yaml as well. Should that be removed?" [puppet] - https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: Majavah)
[15:33:16] <wikibugs>	 (CR) Majavah: hieradata: drop cloud-support1-c-eqiad from LVS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: Majavah)
[15:33:34] <wikibugs>	 (CR) Ssingh: [C: +1] "Ah, thanks." [puppet] - https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: Majavah)
[15:36:23] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (BTullis) Good stuff! Does this mean that we can decline {T354411} now, or would you still prefer that role to be migrated back to puppet 5?
[15:36:43] <wikibugs>	 (CR) Btullis: [C: +2] Remove remaining references to dbstore100[35] [puppet] - https://gerrit.wikimedia.org/r/990707 (https://phabricator.wikimedia.org/T351923) (owner: Btullis)
[15:37:45] <wikibugs>	 (PS1) Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507)
[15:39:02] <wikibugs>	 ops-eqiad, Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (BTullis) a:BTullis→Jclark-ctr
[15:39:50] <wikibugs>	 ops-eqiad, Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (BTullis) a:BTullis→Jclark-ctr
[15:40:09] <wikibugs>	 (PS2) Cathal Mooney: Add new codfw per-rack vlans to lvs2013 and move row B vlans [puppet] - https://gerrit.wikimedia.org/r/980426 (https://phabricator.wikimedia.org/T352784)
[15:40:29] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2072 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:43:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T354336)', diff saved to https://phabricator.wikimedia.org/P54734 and previous config saved to /var/cache/conftool/dbconfig/20240116-154357-marostegui.json
[15:43:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[15:44:01] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[15:44:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[15:44:17] <icinga-wm>	 PROBLEM - Disk space on ms-be2072 is CRITICAL: DISK CRITICAL - /srv/swift-storage/objects0 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2072&var-datasource=codfw+prometheus/ops
[15:44:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T354336)', diff saved to https://phabricator.wikimedia.org/P54735 and previous config saved to /var/cache/conftool/dbconfig/20240116-154419-marostegui.json
[15:44:59] <wikibugs>	 (CR) Ssingh: [C: +1] Add new codfw per-rack vlans to lvs2013 and move row B vlans [puppet] - https://gerrit.wikimedia.org/r/980426 (https://phabricator.wikimedia.org/T352784) (owner: Cathal Mooney)
[15:46:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T354336)', diff saved to https://phabricator.wikimedia.org/P54736 and previous config saved to /var/cache/conftool/dbconfig/20240116-154643-marostegui.json
[15:47:08] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (ABran-WMF)
[15:47:41] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add new codfw per-rack vlans to lvs2013 and move row B vlans [puppet] - https://gerrit.wikimedia.org/r/980426 (https://phabricator.wikimedia.org/T352784) (owner: Cathal Mooney)
[15:47:48] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) In progress→Resolved
[15:48:00] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (ABran-WMF)
[15:48:20] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) >>! In T352974#9462685, @BTullis wrote: > Good stuff! > Does this mean that we can decline {T354411} now, or would you still prefer...
[15:49:55] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (Marostegui) p:Triage→High
[15:50:33] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 214, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:51:38] <wikibugs>	 (CR) Giuseppe Lavagetto: "Overall LGTM, I would like some additional comments in the files to ease our life in the future, and adding a networkpolicy egress templat" [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[15:54:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:54:25] <icinga-wm>	 RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:55:18] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on re0.cr[1-2]-codfw.mgmt with reason: moving lvs hosts codfw T352784 T352918
[15:55:23] <stashbot>	 T352784: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784
[15:55:24] <stashbot>	 T352918: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918
[15:55:32] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on re0.cr[1-2]-codfw.mgmt with reason: moving lvs hosts codfw T352784 T352918
[16:00:04] <Dreamy_Jazz>	 !log stopped mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt
[16:00:05] <jouncebot>	 eoghan, jelto, and arnoldokoth: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1600).
[16:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P54737 and previous config saved to /var/cache/conftool/dbconfig/20240116-160150-marostegui.json
[16:03:09] <Dreamy_Jazz>	 !log T351400 running on a tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt`
[16:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:18] <stashbot>	 T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400
[16:14:20] <wikibugs>	 (PS3) Cathal Mooney: Add new codfw per-rack vlans to lvs2012 and move row A vlans [puppet] - https://gerrit.wikimedia.org/r/980931 (https://phabricator.wikimedia.org/T352909)
[16:14:56] <wikibugs>	 (PS2) Hnowlan: modules: add cassandra client module [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507)
[16:15:36] <wikibugs>	 (CR) Hnowlan: modules: add cassandra client module (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[16:16:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P54738 and previous config saved to /var/cache/conftool/dbconfig/20240116-161656-marostegui.json
[16:19:51] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: deployment
[16:20:04] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: deployment
[16:20:08] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: deployment
[16:20:22] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: deployment
[16:20:49] <mutante>	 !log phabricator deploy is imminent
[16:20:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:01] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 for T354969
[16:21:29] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 for T354969 (duration: 00m 27s)
[16:21:30] <stashbot>	 T354969: Deploy Phabricator/Phorge 2024-01-16 - https://phabricator.wikimedia.org/T354969
[16:22:05] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: deploy to phab1004 for T354969
[16:22:56] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: deploy to phab1004 for T354969 (duration: 00m 50s)
[16:27:50] <wikibugs>	 SRE-OnFire, Znuny, collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (Jelto) Paging for ticket.wikimedia.org might be a bit expensive if done similarly to pages for mediawiki, for example (especially outside of business hours). But that's my...
[16:31:28] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm now. One hiera file hieradata/cloud/eqiad1/devtools/common.yaml still uses the is_replica flag, see open comment." [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney)
[16:32:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T354336)', diff saved to https://phabricator.wikimedia.org/P54739 and previous config saved to /var/cache/conftool/dbconfig/20240116-163203-marostegui.json
[16:32:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[16:32:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[16:32:20] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[16:32:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T354336)', diff saved to https://phabricator.wikimedia.org/P54740 and previous config saved to /var/cache/conftool/dbconfig/20240116-163224-marostegui.json
[16:32:31] <wikibugs>	 (PS1) C. Scott Ananian: WIP: turn on DT visual enhancements on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/991039
[16:32:33] <wikibugs>	 (CR) Btullis: "I'm doing a local build of spark against openjdk-8." [docker-images/production-images] - https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) (owner: Btullis)
[16:33:04] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on prometheus1005.eqiad.wmnet with reason: memory upgrade
[16:33:18] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on prometheus1005.eqiad.wmnet with reason: memory upgrade
[16:33:24] <wikibugs>	 SRE, ops-eqiad, Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=639b8465-e0d6-4049-bc5d-4c38af1cc396) set by filippo@cumin1002 for 1:00:00 on 1 host(s) and their services with reason...
[16:33:42] <wikibugs>	 (CR) Ssingh: [C: +1] Add new codfw per-rack vlans to lvs2012 and move row A vlans [puppet] - https://gerrit.wikimedia.org/r/980931 (https://phabricator.wikimedia.org/T352909) (owner: Cathal Mooney)
[16:34:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T354336)', diff saved to https://phabricator.wikimedia.org/P54741 and previous config saved to /var/cache/conftool/dbconfig/20240116-163449-marostegui.json
[16:37:42] <wikibugs>	 (CR) Dreamy Jazz: Update associated email address for dreamyjazz (1 comment) [puppet] - https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: Muehlenhoff)
[16:39:17] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:41:23] <wikibugs>	 (PS16) Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624)
[16:44:16] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:47:10] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (Papaul) Open→Resolved Link removed
[16:47:16] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (Papaul)
[16:49:09] <wikibugs>	 (CR) Hnowlan: [C: +2] kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[16:49:16] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:49:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P54742 and previous config saved to /var/cache/conftool/dbconfig/20240116-164957-marostegui.json
[16:56:26] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw1460.eqiad.wmnet with OS bullseye
[16:56:40] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw1460.eqiad.wmnet with OS bullseye
[16:56:42] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on prometheus1006.eqiad.wmnet with reason: memory upgrade
[16:56:56] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on prometheus1006.eqiad.wmnet with reason: memory upgrade
[16:57:03] <wikibugs>	 SRE, ops-eqiad, Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8a4aebbc-9222-4c62-b55e-a4c6a3f6d9a6) set by filippo@cumin1002 for 1:00:00 on 1 host(s) and their services with reason...
[17:00:05] <jouncebot>	 jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:01:25] <wikibugs>	 SRE, ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (Papaul) After moving the linecard to cr1, we are now seeing the error in cr1. I emailed Support to request a replacement again
[17:04:17] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:05:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P54743 and previous config saved to /var/cache/conftool/dbconfig/20240116-170503-marostegui.json
[17:09:17] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:10:05] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1460.eqiad.wmnet with reason: host reimage
[17:11:07] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: moving lvs hosts codfw T352784 T352918
[17:11:21] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: moving lvs hosts codfw T352784 T352918
[17:11:23] <stashbot>	 T352784: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784
[17:11:23] <stashbot>	 T352918: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918
[17:12:41] <wikibugs>	 SRE, ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (Papaul) ` Hello Papaul     Sure, no problem, thanks for the troubleshooting you performed, I will proceed with the RMA, please provide me with the following information (please fill out the blank spaces to avoid any misun...
[17:12:50] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1460.eqiad.wmnet with reason: host reimage
[17:14:16] <topranks>	 !log Disabling puppet and PyBal on lvs2012 ahead of migration of network link to lsw1-b2-codfw T352909
[17:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:22] <stashbot>	 T352909: Move lvs2012 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352909
[17:14:53] <wikibugs>	 (PS2) Btullis: Switch all spark images to use Java 8 as their base JDK/JRE [docker-images/production-images] - https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777)
[17:18:11] <wikibugs>	 (CR) Jgiannelos: mobileapps: add Cassandra config support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[17:20:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T354336)', diff saved to https://phabricator.wikimedia.org/P54744 and previous config saved to /var/cache/conftool/dbconfig/20240116-172011-marostegui.json
[17:20:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[17:20:18] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[17:20:22] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[17:20:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T354336)', diff saved to https://phabricator.wikimedia.org/P54745 and previous config saved to /var/cache/conftool/dbconfig/20240116-172032-marostegui.json
[17:23:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T354336)', diff saved to https://phabricator.wikimedia.org/P54746 and previous config saved to /var/cache/conftool/dbconfig/20240116-172300-marostegui.json
[17:24:04] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add new codfw per-rack vlans to lvs2012 and move row A vlans [puppet] - https://gerrit.wikimedia.org/r/980931 (https://phabricator.wikimedia.org/T352909) (owner: Cathal Mooney)
[17:28:40] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:30:40] <wikibugs>	 SRE, ops-eqiad, Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (VRiley-WMF) Added memory and confirmed that these units have come back up and are operating as expected. Closing ticket.
[17:31:02] <wikibugs>	 SRE, ops-eqiad, Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (VRiley-WMF) Open→Resolved
[17:31:04] <wikibugs>	 SRE, ops-codfw, ops-eqiad, Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (VRiley-WMF)
[17:32:19] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1460.eqiad.wmnet with OS bullseye
[17:32:32] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw1460.eqiad.wmnet with OS bullseye completed: - mw1460 (**PASS**)   - Downt...
[17:32:46] <wikibugs>	 (CR) Volans: "Replies inline" [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: Arnaudb)
[17:35:04] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (XenoRyet) Approved from my end.
[17:38:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P54747 and previous config saved to /var/cache/conftool/dbconfig/20240116-173806-marostegui.json
[17:38:53] <wikibugs>	 SRE, ops-eqiad, Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (fgiunchedi) Confirmed on my end too all is well, thank you again @VRiley-WMF !
[17:38:59] <wikibugs>	 (PS3) Kamila Součková: Move mw api servers to kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074)
[17:42:30] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:42:53] <wikibugs>	 (PS2) Slyngshede: Modify password reset to take CN as username. [software/bitu] - https://gerrit.wikimedia.org/r/988677 (https://phabricator.wikimedia.org/T338825)
[17:43:18] <icinga-wm>	 PROBLEM - BGP status on lsw1-b2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:43:54] <sukhe>	 ^ expected
[17:47:26] <wikibugs>	 (PS3) Slyngshede: Modify password reset to take CN as username. [software/bitu] - https://gerrit.wikimedia.org/r/988677 (https://phabricator.wikimedia.org/T338825)
[17:48:19] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan)
[17:48:53] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <INSERT LDAP GROUP> for <INSERT USERNAME> - https://phabricator.wikimedia.org/T355170 (JWheeler-WMF)
[17:49:00] <wikibugs>	 (PS2) Hnowlan: kubernetes: make 4 codfw jobrunner hosts k8s workers [puppet] - https://gerrit.wikimedia.org/r/990723 (https://phabricator.wikimedia.org/T354791)
[17:52:00] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <ldap/wmf> for <JWheeler-WMF> - https://phabricator.wikimedia.org/T355170 (JWheeler-WMF) a:Arrbee
[17:52:59] <wikibugs>	 (CR) Hnowlan: [C: +1] Move mw api servers to kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) (owner: Kamila Součková)
[17:53:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P54748 and previous config saved to /var/cache/conftool/dbconfig/20240116-175313-marostegui.json
[17:53:27] <wikibugs>	 (PS2) Cathal Mooney: Add new codfw per-rack vlans to lvs2011 and move row B vlans [puppet] - https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912)
[17:56:32] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney)
[17:57:06] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (cmooney) Open→Resolved Work complete, all looking good.
[17:57:48] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney) Open→Resolved Work complete without issue.
[17:57:54] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney)
[17:59:45] <wikibugs>	 (PS3) Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/988114
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1800)
[18:05:00] <wikibugs>	 (PS1) Jdlrobson: Fix text overflow in history page [skins/MinervaNeue] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/991049 (https://phabricator.wikimedia.org/T354218)
[18:06:30] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (cmooney)
[18:07:01] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Move lvs2012 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352909 (cmooney) Open→Resolved Work completed, all looking good.
[18:07:14] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney)
[18:08:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T354336)', diff saved to https://phabricator.wikimedia.org/P54749 and previous config saved to /var/cache/conftool/dbconfig/20240116-180819-marostegui.json
[18:08:22] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[18:08:32] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[18:08:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[18:08:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54750 and previous config saved to /var/cache/conftool/dbconfig/20240116-180841-marostegui.json
[18:09:47] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (VRiley-WMF) a:Jclark-ctr→VRiley-WMF
[18:10:59] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (VRiley-WMF)
[18:11:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54751 and previous config saved to /var/cache/conftool/dbconfig/20240116-181107-marostegui.json
[18:11:14] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (VRiley-WMF) Open→Resolved
[18:12:06] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (VRiley-WMF) a:Jclark-ctr→VRiley-WMF
[18:14:42] <wikibugs>	 (PS2) Dzahn: phabricator: auto-sync /srv/repos between servers [puppet] - https://gerrit.wikimedia.org/r/990247 (https://phabricator.wikimedia.org/T354221)
[18:15:09] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (VRiley-WMF)
[18:15:28] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (VRiley-WMF) Open→Resolved
[18:17:58] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:18:03] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[18:18:09] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[18:18:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:19:04] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[18:19:55] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[18:19:56] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[18:20:10] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:20:45] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[18:26:11] <wikibugs>	 (PS1) Kamila Součková: mobileapps: switch service discovery to k8s only [deployment-charts] - https://gerrit.wikimedia.org/r/991043 (https://phabricator.wikimedia.org/T350846)
[18:26:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P54752 and previous config saved to /var/cache/conftool/dbconfig/20240116-182613-marostegui.json
[18:28:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:33:46] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/990247/1130/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/990247 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[18:36:20] <Dreamy_Jazz>	 !log stopped tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt`
[18:36:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:38:02] <Dreamy_Jazz>	 !log T351400 running on a tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --sleep 1 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-non-job-queue.txt`
[18:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:06] <stashbot>	 T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400
[18:40:31] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (KFrancis) Hi all, I have sent the NDA for signatures.  I'll confirm when it's complete.  Thanks!
[18:40:36] <wikibugs>	 (PS1) Majavah: Add py.typed marker [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991044
[18:40:38] <wikibugs>	 (PS1) Majavah: Typing fixes [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991045
[18:40:40] <wikibugs>	 (PS1) Majavah: Fix documentation generation [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174)
[18:41:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P54753 and previous config saved to /var/cache/conftool/dbconfig/20240116-184120-marostegui.json
[18:41:51] <wikibugs>	 (CR) CI reject: [V: -1] Add py.typed marker [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991044 (owner: Majavah)
[18:41:53] <wikibugs>	 (CR) CI reject: [V: -1] Typing fixes [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991045 (owner: Majavah)
[18:41:59] <wikibugs>	 (CR) CI reject: [V: -1] Fix documentation generation [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) (owner: Majavah)
[18:42:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:42:48] <mutante>	 !log phab2002 - pulling repo data from phab1004 by running sync script created by rsync::quickdatacopy after gerrit:990247 T354221
[18:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:52] <stashbot>	 T354221: automate data syncing between phabricator servers - https://phabricator.wikimedia.org/T354221
[18:42:59] <wikibugs>	 (PS2) Majavah: Fix documentation generation [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174)
[18:43:01] <wikibugs>	 (PS2) Majavah: Add py.typed marker [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991044
[18:43:03] <wikibugs>	 (PS2) Majavah: Typing fixes [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991045
[18:44:00] <icinga-wm>	 PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-phabricator-repos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:44:03] <wikibugs>	 (CR) Kamila Součková: [C: +2] Move mw api servers to kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/990981 (https://phabricator.wikimedia.org/T351074) (owner: Kamila Součková)
[18:44:29] <wikibugs>	 (CR) CI reject: [V: -1] Add py.typed marker [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991044 (owner: Majavah)
[18:44:45] <wikibugs>	 (CR) CI reject: [V: -1] Fix documentation generation [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) (owner: Majavah)
[18:44:49] <wikibugs>	 (CR) CI reject: [V: -1] Typing fixes [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991045 (owner: Majavah)
[18:45:34] <wikibugs>	 (PS3) Majavah: Fix documentation generation [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174)
[18:45:36] <wikibugs>	 (PS3) Majavah: Add py.typed marker [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991044
[18:45:38] <wikibugs>	 (PS3) Majavah: Typing fixes [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991045
[18:50:02] <icinga-wm>	 RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:50:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1360.eqiad.wmnet with OS bullseye
[18:50:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1361.eqiad.wmnet with OS bullseye
[18:51:14] <wikibugs>	 (PS4) Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/988114
[18:51:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1362.eqiad.wmnet with OS bullseye
[18:51:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1363.eqiad.wmnet with OS bullseye
[18:52:14] <wikibugs>	 (CR) Htriedman: "changed list location to helmfile.d/services/eventstreams/values.yaml + updated to most current version of list" [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (owner: Htriedman)
[18:55:33] <wikibugs>	 (CR) Kamila Součková: [C: +1] "sorry for the merge conflicts '^^" [puppet] - https://gerrit.wikimedia.org/r/990723 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[18:56:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54754 and previous config saved to /var/cache/conftool/dbconfig/20240116-185626-marostegui.json
[18:56:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[18:56:32] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[18:56:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[18:57:04] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[18:57:17] <wikibugs>	 (CR) Dzahn: phabricator: use same db server regardless of DC of phab server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989537 (owner: Dzahn)
[18:57:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[18:57:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T354336)', diff saved to https://phabricator.wikimedia.org/P54755 and previous config saved to /var/cache/conftool/dbconfig/20240116-185723-marostegui.json
[18:59:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T354336)', diff saved to https://phabricator.wikimedia.org/P54756 and previous config saved to /var/cache/conftool/dbconfig/20240116-185949-marostegui.json
[19:00:05] <jouncebot>	 jnuche and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T1900).
[19:03:19] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2053 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1361.eqiad.wmnet with reason: host reimage
[19:05:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1360.eqiad.wmnet with reason: host reimage
[19:05:29] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2053 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:05:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1362.eqiad.wmnet with reason: host reimage
[19:05:41] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "tested both manually running the "sync" script that is created by this on the passive server and by starting the systemd service on the sa" [puppet] - https://gerrit.wikimedia.org/r/990247 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[19:06:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1363.eqiad.wmnet with reason: host reimage
[19:06:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1374.eqiad.wmnet with OS bullseye
[19:07:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1375.eqiad.wmnet with OS bullseye
[19:07:46] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1362.eqiad.wmnet with reason: host reimage
[19:07:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1376.eqiad.wmnet with OS bullseye
[19:08:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1361.eqiad.wmnet with reason: host reimage
[19:10:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1363.eqiad.wmnet with reason: host reimage
[19:11:45] <wikibugs>	 (CR) Ottomata: [C: +1] "one nit, but lgtm:" [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (owner: Htriedman)
[19:12:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[19:12:46] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:17] <wikibugs>	 (PS3) Dzahn: phabricator: add script/timer to create tarballs of home dirs [puppet] - https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221)
[19:13:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1360.eqiad.wmnet with reason: host reimage
[19:14:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P54757 and previous config saved to /var/cache/conftool/dbconfig/20240116-191456-marostegui.json
[19:16:41] <icinga-wm>	 PROBLEM - Check for large files in client bucket on mw1362 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.204: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[19:16:51] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1362 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.204: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:17:41] <icinga-wm>	 RECOVERY - Check for large files in client bucket on mw1362 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[19:17:51] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1362 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:18:30] <kamila_>	 ^ downtime cookbook failed, I'm reimaging the host
[19:18:33] <kamila_>	 sorry for the noise
[19:18:35] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2422 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:21:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1374.eqiad.wmnet with reason: host reimage
[19:21:47] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1375.eqiad.wmnet with reason: host reimage
[19:22:25] <wikibugs>	 (PS5) Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/988114
[19:22:25] <icinga-wm>	 PROBLEM - Host mw1362 is DOWN: PING CRITICAL - Packet loss = 100%
[19:22:51] <wikibugs>	 (CR) Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (owner: Htriedman)
[19:23:10] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:23:32] <jinxer-wm>	 (KubernetesCalicoDown) firing: mw1362.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1362.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:23:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1376.eqiad.wmnet with reason: host reimage
[19:24:22] <icinga-wm>	 RECOVERY - Host mw1362 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[19:24:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1374.eqiad.wmnet with reason: host reimage
[19:27:11] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1376.eqiad.wmnet with reason: host reimage
[19:27:34] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1362.eqiad.wmnet with OS bullseye
[19:28:33] <jinxer-wm>	 (KubernetesCalicoDown) resolved: mw1362.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1362.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:29:36] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1361.eqiad.wmnet with OS bullseye
[19:29:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1375.eqiad.wmnet with reason: host reimage
[19:30:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P54758 and previous config saved to /var/cache/conftool/dbconfig/20240116-193002-marostegui.json
[19:30:31] <wikibugs>	 (CR) Dzahn: [V: -1] "https://puppet-compiler.wmflabs.org/output/990250/1131/phab1004.eqiad.wmnet/change.phab1004.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[19:31:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1363.eqiad.wmnet with OS bullseye
[19:31:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2291.codfw.wmnet with OS bullseye
[19:31:45] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[19:31:48] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2292.codfw.wmnet with OS bullseye
[19:32:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2293.codfw.wmnet with OS bullseye
[19:34:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1360.eqiad.wmnet with OS bullseye
[19:34:51] <wikibugs>	 (PS1) Jdlrobson: Update checkboxHack target node [skins/MinervaNeue] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/991050 (https://phabricator.wikimedia.org/T354315)
[19:34:57] <wikibugs>	 (PS4) Dzahn: phabricator: add script/timer to create tarballs of home dirs [puppet] - https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221)
[19:35:40] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:38:31] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: add exp graph split endpoints to alt_names [puppet] - 10https://gerrit.wikimedia.org/r/991088 (https://phabricator.wikimedia.org/T354661)
[19:38:58] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991088 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper)
[19:42:19] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/991088 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper)
[19:42:31] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: add exp graph split endpoints to alt_names [puppet] - 10https://gerrit.wikimedia.org/r/991088 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper)
[19:43:26] <wikibugs>	 (03Abandoned) 10Jeena Huneidi: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/990332 (owner: 10Jeena Huneidi)
[19:44:18] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/990250/1132/" [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn)
[19:45:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T354336)', diff saved to https://phabricator.wikimedia.org/P54759 and previous config saved to /var/cache/conftool/dbconfig/20240116-194509-marostegui.json
[19:45:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:45:20] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[19:45:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:45:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1374.eqiad.wmnet with OS bullseye
[19:46:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2294.codfw.wmnet with OS bullseye
[19:47:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1376.eqiad.wmnet with OS bullseye
[19:47:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2291.codfw.wmnet with reason: host reimage
[19:47:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2292.codfw.wmnet with reason: host reimage
[19:49:00] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2422 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:49:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2293.codfw.wmnet with reason: host reimage
[19:50:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1375.eqiad.wmnet with OS bullseye
[19:50:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2291.codfw.wmnet with reason: host reimage
[19:52:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2295.codfw.wmnet with OS bullseye
[19:52:27] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: add script/timer to create tarballs of home dirs [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn)
[19:53:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2293.codfw.wmnet with reason: host reimage
[19:54:19] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs graph-split: subdomain of query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661)
[19:55:18] <wikibugs>	 (03CR) 10Dzahn: "I don't think you can have a certificate matching that. wildcard only for one level" [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper)
[19:56:20] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2292.codfw.wmnet with reason: host reimage
[19:56:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2296.codfw.wmnet with OS bullseye
[19:59:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "tested with:" [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn)
[20:02:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2294.codfw.wmnet with reason: host reimage
[20:03:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2297.codfw.wmnet with OS bullseye
[20:06:18] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2294.codfw.wmnet with reason: host reimage
[20:08:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2295.codfw.wmnet with reason: host reimage
[20:11:17] <wikibugs>	 (03CR) 10Ryan Kemper: wdqs graph-split: subdomain of query.wikidata.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper)
[20:11:32] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2295.codfw.wmnet with reason: host reimage
[20:12:10] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper)
[20:12:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2291.codfw.wmnet with OS bullseye
[20:12:42] <wikibugs>	 (03CR) 10Bking: [V: 03+1] wdqs graph-split: subdomain of query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper)
[20:13:32] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2296.codfw.wmnet with reason: host reimage
[20:13:32] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2293.codfw.wmnet with OS bullseye
[20:15:47] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs graph-split: subdomain of query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/991089 (https://phabricator.wikimedia.org/T354661) (owner: 10Ryan Kemper)
[20:16:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2292.codfw.wmnet with OS bullseye
[20:17:02] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2296.codfw.wmnet with reason: host reimage
[20:18:37] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs graph-split: new trafficserver rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464)
[20:20:00] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2297.codfw.wmnet with reason: host reimage
[20:20:19] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs graph-split: new trafficserver rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464)
[20:20:35] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper)
[20:23:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2297.codfw.wmnet with reason: host reimage
[20:23:47] <wikibugs>	 (03CR) 10Bking: [C: 03+1] wdqs graph-split: new trafficserver rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper)
[20:24:07] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs graph-split: new trafficserver rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/991091 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper)
[20:24:28] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187 (10Volans) p:05Triage→03Medium a:03Volans
[20:25:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2294.codfw.wmnet with OS bullseye
[20:26:18] <ryankemper>	 !log T351650 Running puppet on `P:trafficserver::backend` following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/991091
[20:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:21] <stashbot>	 T351650: Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650
[20:30:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2295.codfw.wmnet with OS bullseye
[20:37:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2296.codfw.wmnet with OS bullseye
[20:43:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2297.codfw.wmnet with OS bullseye
[20:43:59] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+1] "Looks great, thanks for the explanation in the comments" [puppet] - 10https://gerrit.wikimedia.org/r/989217 (https://phabricator.wikimedia.org/T354679) (owner: 10Xcollazo)
[20:54:58] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] "Will you be able to deploy this change? https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) (owner: 10Anzx)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240116T2100).
[21:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:16] <urbanecm>	 i can deploy today
[21:00:37] <urbanecm>	 Jdlrobson: i assume you're around, based on your C+1, but asking just in case :)
[21:10:21] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:27:10] <wikibugs>	 (03PS1) 10Jeena Huneidi: Merge remote-tracking branch 'origin' into updateTrainDev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991092
[21:30:27] <wikibugs>	 (03PS2) 10Jeena Huneidi: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991092
[21:32:59] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991092 (owner: 10Jeena Huneidi)
[21:33:59] <wikibugs>	 (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991092 (owner: 10Jeena Huneidi)
[21:35:30] <Jdlrobson>	 hello sorry i'm late for the window urbanecm
[21:35:33] <Jdlrobson>	 i had a last minute call
[21:35:35] <Jdlrobson>	 is it too late?
[21:40:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[21:40:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[21:40:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P54760 and previous config saved to /var/cache/conftool/dbconfig/20240116-214016-ladsgroup.json
[21:40:46] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[22:03:50] <urbanecm>	 Jdlrobson: unfortunately, i just saw the ping. so yes, at this point.
[22:26:26] <Jdlrobson>	 urbanecm: no worries. I've moved it to tomorrow :)
[22:26:37] <urbanecm>	 sounds good!
[23:15:04] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi)
[23:15:35] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] graphite: remove mw edit failures graphite alerts [puppet] - 10https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi)
[23:23:10] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:38:02] <wikibugs>	 (03PS2) 10Tim Starling: Disable SameSite legacy cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791)
[23:41:04] <wikibugs>	 (03CR) 10Cwhite: [V: 03+1 C: 03+1] "PCC NOOP https://puppet-compiler.wmflabs.org/output/990166/1134/" [puppet] - 10https://gerrit.wikimedia.org/r/990166 (https://phabricator.wikimedia.org/T354904) (owner: 10Cwhite)
[23:41:32] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Disable SameSite legacy cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791) (owner: 10Tim Starling)
[23:42:20] <wikibugs>	 (03Merged) 10jenkins-bot: Disable SameSite legacy cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791) (owner: 10Tim Starling)
[23:55:48] <logmsgbot>	 !log tstarling@deploy2002 Synchronized wmf-config/CommonSettings.php: Disable wgUseSameSiteLegacyCookies T344791 (duration: 09m 19s)
[23:55:53] <stashbot>	 T344791: Get rid of ss0- SameSite cookie prefix hack - https://phabricator.wikimedia.org/T344791
[23:56:05] <wikibugs>	 (03PS18) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591)
[23:57:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron)
[23:59:36] <wikibugs>	 (03PS19) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591)