[00:07:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T343198)', diff saved to https://phabricator.wikimedia.org/P52314 and previous config saved to /var/cache/conftool/dbconfig/20230908-000742-arnaudb.json
[00:07:46] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:22:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P52315 and previous config saved to /var/cache/conftool/dbconfig/20230908-002248-arnaudb.json
[00:37:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P52316 and previous config saved to /var/cache/conftool/dbconfig/20230908-003755-arnaudb.json
[00:38:24] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/955017
[00:38:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/955017 (owner: TrainBranchBot)
[00:46:51] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T343198)', diff saved to https://phabricator.wikimedia.org/P52317 and previous config saved to /var/cache/conftool/dbconfig/20230908-005301-arnaudb.json
[00:53:03] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[00:53:05] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:53:16] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[00:53:17] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/955017 (owner: TrainBranchBot)
[00:53:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52318 and previous config saved to /var/cache/conftool/dbconfig/20230908-005323-arnaudb.json
[01:24:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:29:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:38:13] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) (owner: Herron)
[01:55:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:00:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:06:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:29] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:16:27] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:42:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52319 and previous config saved to /var/cache/conftool/dbconfig/20230908-034241-arnaudb.json
[03:42:45] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[03:45:39] <wikibugs>	 (PS2) Andrea Denisse: superset: Move superset logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790)
[03:49:52] <wikibugs>	 (CR) Andrea Denisse: superset: Move superset logs to statsd-exporter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[03:57:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P52320 and previous config saved to /var/cache/conftool/dbconfig/20230908-035747-arnaudb.json
[04:12:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P52321 and previous config saved to /var/cache/conftool/dbconfig/20230908-041254-arnaudb.json
[04:23:09] <wikibugs>	 (CR) Ryan Kemper: [C: +2] wdqs-internal: switch wdqs1016 from public to internal role [puppet] - https://gerrit.wikimedia.org/r/955396 (https://phabricator.wikimedia.org/T314890) (owner: Bking)
[04:28:00] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52322 and previous config saved to /var/cache/conftool/dbconfig/20230908-042800-arnaudb.json
[04:28:02] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[04:28:04] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[04:28:15] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[04:28:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T343198)', diff saved to https://phabricator.wikimedia.org/P52323 and previous config saved to /var/cache/conftool/dbconfig/20230908-042821-arnaudb.json
[04:29:53] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS bullseye
[04:54:59] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage
[04:57:29] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage
[05:01:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:14:43] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/955699 (owner: Majavah)
[05:22:09] <wikibugs>	 (PS1) Muehlenhoff: Decom furud [puppet] - https://gerrit.wikimedia.org/r/955859 (https://phabricator.wikimedia.org/T347867)
[05:23:44] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1016.eqiad.wmnet with OS bullseye
[05:24:16] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[05:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:44:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT endpointslices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:49:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT endpointslices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230908T0600)
[06:20:14] <wikibugs>	 (PS3) Andrea Denisse: superset: Move superset logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790)
[06:20:59] <wikibugs>	 (CR) Andrea Denisse: superset: Move superset logs to statsd-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[06:21:25] <wikibugs>	 (CR) EoghanGaffney: "One small question, then I think we're ready to go!" [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth)
[06:22:57] <wikibugs>	 (PS4) Muehlenhoff: Use a single ensure for managing the nftables state [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497)
[06:23:17] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:24:12] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/955776/43179/" [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[06:24:26] <wikibugs>	 (CR) Andrea Denisse: superset: Move superset logs to statsd-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[06:29:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Adapt transition code for ferm -> nftables [puppet] - https://gerrit.wikimedia.org/r/955774 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:32:20] <wikibugs>	 (PS1) JMeybohm: eventrouter: Update to 0.4.0-1 [deployment-charts] - https://gerrit.wikimedia.org/r/955863 (https://phabricator.wikimedia.org/T329826)
[06:33:57] <wikibugs>	 (CR) JMeybohm: [C: +1] charts: add the python-webapp chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955584 (owner: Elukey)
[06:34:27] <wikibugs>	 (CR) JMeybohm: [C: +1] python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725 (owner: Elukey)
[06:35:59] <wikibugs>	 (CR) JMeybohm: [C: +2] eventrouter: Update to 0.4.0-1 [deployment-charts] - https://gerrit.wikimedia.org/r/955863 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[06:38:27] <wikibugs>	 (Merged) jenkins-bot: eventrouter: Update to 0.4.0-1 [deployment-charts] - https://gerrit.wikimedia.org/r/955863 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[06:45:58] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST clusterroles) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:49:18] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725 (owner: Elukey)
[06:49:42] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] charts: add the python-webapp chart [deployment-charts] - https://gerrit.wikimedia.org/r/955584 (owner: Elukey)
[06:49:56] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:19] <wikibugs>	 (PS1) Muehlenhoff: Pass down the ensure to the requestctl settings [puppet] - https://gerrit.wikimedia.org/r/955865 (https://phabricator.wikimedia.org/T336497)
[06:53:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST clusterroles) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:56:38] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/955865 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:58:37] <wikibugs>	 (PS3) Slyngshede: LDAPBACKEND: Add validator for checking CommonName [software/bitu] - https://gerrit.wikimedia.org/r/955713
[06:58:49] <wikibugs>	 (CR) Slyngshede: LDAPBACKEND: Add validator for checking CommonName (2 comments) [software/bitu] - https://gerrit.wikimedia.org/r/955713 (owner: Slyngshede)
[06:59:15] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/955713 (owner: Slyngshede)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230908T0700)
[07:02:03] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[07:13:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T343198)', diff saved to https://phabricator.wikimedia.org/P52324 and previous config saved to /var/cache/conftool/dbconfig/20230908-071322-arnaudb.json
[07:13:25] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[07:16:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:20:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[07:21:07] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[07:21:16] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:21:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[07:21:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[07:22:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[07:22:12] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[07:22:34] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[07:23:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[07:23:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[07:23:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[07:24:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[07:24:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[07:24:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[07:25:18] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[07:25:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[07:25:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:26:07] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:26:18] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:27:01] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] aptrepo: amend pin to allow grafana 9.4.x [puppet] - https://gerrit.wikimedia.org/r/955014 (https://phabricator.wikimedia.org/T345362) (owner: Cwhite)
[07:27:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] profile::prometheus::statsd_exporter: add support for empty mappings [puppet] - https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) (owner: Herron)
[07:27:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:28:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P52325 and previous config saved to /var/cache/conftool/dbconfig/20230908-072828-arnaudb.json
[07:31:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "It seems the change to install statsd_exporter class got lost between patchsets, other than that LGTM" [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[07:31:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) Open→Stalled p:Triage→Medium `deployment` membership requires the approval of @thcipriani and `analytics-privatedata-us...
[07:32:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez)
[07:32:15] <wikibugs>	 SRE, Traffic: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743 (ayounsi)
[07:32:19] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (ayounsi)
[07:34:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) we are also pending on @acooper submitting their public SSH key
[07:40:33] <wikibugs>	 (CR) Elukey: [C: +2] charts: add the python-webapp chart [deployment-charts] - https://gerrit.wikimedia.org/r/955584 (owner: Elukey)
[07:40:47] <wikibugs>	 (CR) Elukey: [C: +2] python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725 (owner: Elukey)
[07:40:53] <wikibugs>	 (PS2) Elukey: python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725
[07:40:55] <wikibugs>	 (CR) CI reject: [V: -1] python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725 (owner: Elukey)
[07:43:08] <wikibugs>	 (PS1) Elukey: ml-services: move ores-legacy to the python-webapp chart [deployment-charts] - https://gerrit.wikimedia.org/r/955869
[07:43:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P52326 and previous config saved to /var/cache/conftool/dbconfig/20230908-074334-arnaudb.json
[07:52:11] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: move ores-legacy to the python-webapp chart [deployment-charts] - https://gerrit.wikimedia.org/r/955869 (owner: Elukey)
[07:53:15] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43180/console" [puppet] - https://gerrit.wikimedia.org/r/955412 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[07:55:46] <wikibugs>	 (CR) Elukey: [V: +1 C: +1] cassandra: remove cassandra/twcs deployment [puppet] - https://gerrit.wikimedia.org/r/955412 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[07:58:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T343198)', diff saved to https://phabricator.wikimedia.org/P52327 and previous config saved to /var/cache/conftool/dbconfig/20230908-075840-arnaudb.json
[07:58:42] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[07:58:44] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[07:58:56] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[07:59:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52328 and previous config saved to /var/cache/conftool/dbconfig/20230908-075901-arnaudb.json
[08:01:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main'.
[08:03:04] <wikibugs>	 (PS1) Hashar: tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695)
[08:04:03] <wikibugs>	 (CR) CI reject: [V: -1] tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) (owner: Hashar)
[08:04:44] <wikibugs>	 (PS1) Hashar: envoyproxy: tox.ini: whitelist_externals -> allowlist_externals [puppet] - https://gerrit.wikimedia.org/r/955876 (https://phabricator.wikimedia.org/T345695)
[08:06:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main'.
[08:09:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main'.
[08:10:14] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (phuedx) >>! In T345696#9150637, @Fabfur wrote: > I think you should have access now, please let me know if it's not the case and I'll investigate further!  Confirmed. Thanks!
[08:13:48] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (Fabfur) Stalled→Resolved
[08:17:23] <wikibugs>	 (CR) Btullis: datahub: add oidc production settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[08:18:25] <wikibugs>	 (PS1) Hashar: tox.ini: whitelist_externals -> allowlist_externals [software] - https://gerrit.wikimedia.org/r/955880 (https://phabricator.wikimedia.org/T345695)
[08:23:21] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[08:23:35] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[08:25:49] <wikibugs>	 (CR) Slyngshede: [V: +2] LDAPBACKEND: Add validator for checking CommonName [software/bitu] - https://gerrit.wikimedia.org/r/955713 (owner: Slyngshede)
[08:25:51] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] LDAPBACKEND: Add validator for checking CommonName [software/bitu] - https://gerrit.wikimedia.org/r/955713 (owner: Slyngshede)
[08:26:44] <wikibugs>	 (Abandoned) AikoChou: changeprop: allow retries for liftwing streams with 500 responses [deployment-charts] - https://gerrit.wikimedia.org/r/954969 (owner: AikoChou)
[08:34:52] <wikibugs>	 (PS1) Elukey: role::deployment_server::kubernets: add config for rec-api-ng [labs/private] - https://gerrit.wikimedia.org/r/955882
[08:35:26] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] role::deployment_server::kubernets: add config for rec-api-ng [labs/private] - https://gerrit.wikimedia.org/r/955882 (owner: Elukey)
[08:41:39] <wikibugs>	 (PS1) Slyngshede: P:IDM Enable LDAP validators for usernames. [puppet] - https://gerrit.wikimedia.org/r/955884
[08:48:46] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/955884 (owner: Slyngshede)
[08:52:37] <wikibugs>	 (PS1) Elukey: admin_ng: add the rec-api-ng's namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/955885
[08:53:22] <wikibugs>	 (PS2) Elukey: admin_ng: add the rec-api-ng's namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/955885
[08:53:37] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Use a single ensure for managing the nftables state [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[08:54:48] <wikibugs>	 (PS1) AikoChou: ml-services: add annotations for inference_services [deployment-charts] - https://gerrit.wikimedia.org/r/955886 (https://phabricator.wikimedia.org/T344058)
[08:55:14] <wikibugs>	 (PS1) Elukey: profile::k8s::deployment_server: add config for recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/955887
[08:55:34] <wikibugs>	 (PS4) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[08:58:19] <wikibugs>	 (CR) AikoChou: ml-services: add annotations for inference_services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955582 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[08:58:45] <wikibugs>	 (CR) Kevin Bazira: [C: +1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/955885 (owner: Elukey)
[08:59:14] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] admin_ng: add the rec-api-ng's namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/955885 (owner: Elukey)
[08:59:56] <wikibugs>	 (Abandoned) AikoChou: ml-services: add annotations for inference_services [deployment-charts] - https://gerrit.wikimedia.org/r/955582 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[09:00:14] <wikibugs>	 (CR) Kevin Bazira: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/955887 (owner: Elukey)
[09:00:36] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[09:01:59] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] profile::k8s::deployment_server: add config for recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/955887 (owner: Elukey)
[09:02:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) I almost forgot, for `analytics-privatedata-users` I'm assuming @acooper needs a kerberos principal as well, details available on https...
[09:02:37] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43181/console" [puppet] - https://gerrit.wikimedia.org/r/955887 (owner: Elukey)
[09:03:42] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] profile::k8s::deployment_server: add config for recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/955887 (owner: Elukey)
[09:04:57] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: add the rec-api-ng's namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/955885 (owner: Elukey)
[09:10:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:11:41] <logmsgbot>	 !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[09:11:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:13:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:15:10] <wikibugs>	 (Abandoned) Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[09:15:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:15:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:16:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:16:23] <wikibugs>	 (PS2) Muehlenhoff: Decom furud [puppet] - https://gerrit.wikimedia.org/r/955859 (https://phabricator.wikimedia.org/T347867)
[09:16:59] <wikibugs>	 (PS1) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890)
[09:17:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:19:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Decom furud [puppet] - https://gerrit.wikimedia.org/r/955859 (https://phabricator.wikimedia.org/T347867) (owner: Muehlenhoff)
[09:22:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts furud.codfw.wmnet
[09:25:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:26:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:28:25] <wikibugs>	 (CR) David Caro: P:wmcs: unify toolsdb profiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: Majavah)
[09:28:31] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: preserve cluster hostname when using ingress [deployment-charts] - https://gerrit.wikimedia.org/r/955320 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[09:28:58] <wikibugs>	 (PS2) FNegri: Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[09:29:21] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: preserve cluster hostname when using ingress [deployment-charts] - https://gerrit.wikimedia.org/r/955320 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[09:29:27] <wikibugs>	 (PS1) Majavah: P:puppetserver: fix reports location [puppet] - https://gerrit.wikimedia.org/r/955890
[09:29:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: furud.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:30:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:30:13] <wikibugs>	 SRE, ops-codfw, decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (Peachey88)
[09:31:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: furud.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:31:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:31:44] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts furud.codfw.wmnet
[09:31:53] <wikibugs>	 SRE, ops-codfw, decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `furud.codfw.wmnet` - furud.codfw.wmnet (**FAIL**)   - Downtimed host on Icinga/Alertmanager   - Found physi...
[09:34:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org
[09:36:50] <wikibugs>	 SRE, ops-codfw, decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (MoritzMuehlenhoff) p:Triage→Medium
[09:38:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org
[09:43:02] <wikibugs>	 (CR) Elukey: ml-services: deployment settings for the recommendation-api-ng (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[09:46:37] <vgutierrez>	 !log restart fifo-log-demux@notpurge.service in cp4052
[09:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:19] <wikibugs>	 (CR) FNegri: "PCC is failing because something is requiring "<= bullseye", I don't think is this file but I'm not finding where that requirement is comi" [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[09:47:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:49:06] <wikibugs>	 (PS5) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[09:50:14] <wikibugs>	 (CR) FNegri: "https://puppet-compiler.wmflabs.org/output/955841/43183/" [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[09:51:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:51:33] <wikibugs>	 (PS6) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[09:52:02] <wikibugs>	 SRE-tools, Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (brouberol)
[09:52:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:54:23] <wikibugs>	 SRE-tools, Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (brouberol)
[09:55:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-rw1001.wikimedia.org with OS bookworm
[09:56:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:56:08] <wikibugs>	 (CR) FNegri: P:wmcs: unify toolsdb profiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: Majavah)
[10:00:52] <wikibugs>	 (PS1) Joal: Update refinery-job jar version for analytics jobs [puppet] - https://gerrit.wikimedia.org/r/955893 (https://phabricator.wikimedia.org/T344616)
[10:03:07] <wikibugs>	 (CR) CI reject: [V: -1] Update refinery-job jar version for analytics jobs [puppet] - https://gerrit.wikimedia.org/r/955893 (https://phabricator.wikimedia.org/T344616) (owner: Joal)
[10:03:23] <wikibugs>	 (PS1) Filippo Giunchedi: citoid: update mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/955894 (https://phabricator.wikimedia.org/T320563)
[10:03:27] <wikibugs>	 (PS1) Filippo Giunchedi: citoid: enable mesh tracing in staging [deployment-charts] - https://gerrit.wikimedia.org/r/955895 (https://phabricator.wikimedia.org/T320563)
[10:04:10] <wikibugs>	 SRE, ops-codfw, decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→Papaul
[10:05:07] <wikibugs>	 (PS2) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890)
[10:05:43] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.netbox
[10:05:44] <wikibugs>	 (CR) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[10:06:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-rw1001.wikimedia.org with reason: host reimage
[10:07:00] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:10:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-rw1001.wikimedia.org with reason: host reimage
[10:13:20] <wikibugs>	 (PS3) FNegri: Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[10:15:32] <wikibugs>	 (CR) FNegri: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43184/console" [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[10:16:31] <wikibugs>	 (PS2) Joal: Update refinery-job jar version for analytics jobs [puppet] - https://gerrit.wikimedia.org/r/955893 (https://phabricator.wikimedia.org/T344616)
[10:17:07] <wikibugs>	 (CR) FNegri: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43185/console" [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[10:17:32] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Ensure standalone puppet works with puppet7 - https://phabricator.wikimedia.org/T345702 (jbond) Open→Resolved a:jbond This is working currently
[10:17:35] <wikibugs>	 (CR) FNegri: [V: +1] "Found the issue: ceph::common was enforcing" [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[10:17:40] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (jbond)
[10:17:43] <wikibugs>	 (CR) Majavah: [C: -1] "The Galera change LGTM. Not sure about the Ceph one, which a) preferrably would be a separate patch and b) might have issues with the newe" [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[10:24:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-rw1001.wikimedia.org with OS bookworm
[10:24:34] <wikibugs>	 (CR) FNegri: [V: +1] Galera: allow installing debian-hosted packages for Bookworm or later (1 comment) [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[10:26:40] <wikibugs>	 (CR) Slyngshede: [C: +2] P:IDM Enable LDAP validators for usernames. [puppet] - https://gerrit.wikimedia.org/r/955884 (owner: Slyngshede)
[10:27:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-rw2001.wikimedia.org with OS bookworm
[10:28:09] <wikibugs>	 (PS2) Hashar: update_version: tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695)
[10:28:11] <wikibugs>	 (PS1) Hashar: update_version: drop python 3.5, 3.6. Add 3.9 [deployment-charts] - https://gerrit.wikimedia.org/r/955901
[10:30:23] <wikibugs>	 (PS4) FNegri: Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[10:30:25] <wikibugs>	 (PS1) FNegri: ceph::common: allow bookworm and later versions [puppet] - https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810)
[10:31:23] <wikibugs>	 (CR) Hashar: "That is for upgrading tox to version 4 :)" [deployment-charts] - https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) (owner: Hashar)
[10:31:50] <wikibugs>	 (CR) Elukey: ml-services: deployment settings for the recommendation-api-ng (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[10:32:44] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:33:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:35:02] <wikibugs>	 (CR) Jbond: [C: +1] "thanks <3.  the fact that  `sudo puppet config --section server print reportdir ` dosen't show this dir is mildly frustrating" [puppet] - https://gerrit.wikimedia.org/r/955890 (owner: Majavah)
[10:35:23] <wikibugs>	 (CR) Majavah: [C: +2] P:puppetserver: fix reports location [puppet] - https://gerrit.wikimedia.org/r/955890 (owner: Majavah)
[10:36:44] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/955865 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[10:38:16] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm Q inline" [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[10:39:28] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (jbond) @colewhite thanks
[10:41:50] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:43:10] <wikibugs>	 (CR) Muehlenhoff: Use a single ensure for managing the nftables state (1 comment) [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[10:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:46:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-rw2001.wikimedia.org with reason: host reimage
[10:49:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-rw2001.wikimedia.org with reason: host reimage
[10:50:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:53:20] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:09] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (MoritzMuehlenhoff)
[10:55:29] <wikibugs>	 (PS1) Muehlenhoff: Add new LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/955904 (https://phabricator.wikimedia.org/T331699)
[11:01:07] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add new LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/955904 (https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff)
[11:03:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52333 and previous config saved to /var/cache/conftool/dbconfig/20230908-110331-arnaudb.json
[11:03:35] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[11:04:58] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:05:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:06:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-rw2001.wikimedia.org with OS bookworm
[11:07:07] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:07:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:09:21] <wikibugs>	 (CR) Kamila Součková: [C: +1] services: more changes in Lift Wing's proxy setting in the API Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/955754 (https://phabricator.wikimedia.org/T345850) (owner: Ilias Sarantopoulos)
[11:10:54] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (acooper) Stalled→Open
[11:11:47] <wikibugs>	 (PS3) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890)
[11:12:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (acooper) Thanks I added the SSH key.  I'll ask Mark to approve.
[11:13:12] <wikibugs>	 (CR) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[11:14:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[11:14:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[11:16:33] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (mark) Approved.
[11:17:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica1005.wikimedia.org
[11:17:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:18:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P52334 and previous config saved to /var/cache/conftool/dbconfig/20230908-111838-arnaudb.json
[11:19:31] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (MoritzMuehlenhoff)
[11:20:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1005.wikimedia.org - jmm@cumin2002"
[11:21:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1005.wikimedia.org - jmm@cumin2002"
[11:21:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:21:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica1005.wikimedia.org on all recursors
[11:21:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica1005.wikimedia.org on all recursors
[11:21:20] <wikibugs>	 (CR) Jbond: [C: +2] puppetmaster::servers: remove puppetservers from puppetmaster::servers hash [puppet] - https://gerrit.wikimedia.org/r/955731 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[11:21:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica1005.wikimedia.org - jmm@cumin2002"
[11:22:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica1005.wikimedia.org - jmm@cumin2002"
[11:23:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-replica1005.wikimedia.org with OS bookworm
[11:24:03] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica1005.wikimedia.org with OS bookworm
[11:33:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P52335 and previous config saved to /var/cache/conftool/dbconfig/20230908-113344-arnaudb.json
[11:34:50] <wikibugs>	 (PS1) Jbond: puppetserver: fix ssl permissions [puppet] - https://gerrit.wikimedia.org/r/955908
[11:36:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-replica1005.wikimedia.org with reason: host reimage
[11:38:38] <wikibugs>	 (PS1) Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - https://gerrit.wikimedia.org/r/955909
[11:39:27] <wikibugs>	 (PS2) Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - https://gerrit.wikimedia.org/r/955909
[11:39:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-replica1005.wikimedia.org with reason: host reimage
[11:42:14] <wikibugs>	 (CR) Jbond: [C: +2] puppetserver: fix ssl permissions [puppet] - https://gerrit.wikimedia.org/r/955908 (owner: Jbond)
[11:42:20] <wikibugs>	 (PS1) Jbond: puppetmaster: add pupetserveres back to git private [puppet] - https://gerrit.wikimedia.org/r/955911 (https://phabricator.wikimedia.org/T330490)
[11:45:03] <wikibugs>	 (PS1) Hnowlan: rest-gateway: set SNI when using ingress [deployment-charts] - https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400)
[11:45:28] <wikibugs>	 (PS1) Jbond: puppetserver: correct ssl dir permissions [puppet] - https://gerrit.wikimedia.org/r/955914
[11:45:41] <wikibugs>	 (CR) Jbond: [C: +2] puppetserver: correct ssl dir permissions [puppet] - https://gerrit.wikimedia.org/r/955914 (owner: Jbond)
[11:45:43] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] puppetserver: correct ssl dir permissions [puppet] - https://gerrit.wikimedia.org/r/955914 (owner: Jbond)
[11:46:34] <wikibugs>	 (PS2) Hnowlan: trafficserver: route requests for geo-analytics via rest-gateway [puppet] - https://gerrit.wikimedia.org/r/954890 (https://phabricator.wikimedia.org/T336400)
[11:48:36] <wikibugs>	 (PS2) Jbond: puppetmaster: add pupetserveres back to git private [puppet] - https://gerrit.wikimedia.org/r/955911 (https://phabricator.wikimedia.org/T330490)
[11:48:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52336 and previous config saved to /var/cache/conftool/dbconfig/20230908-114850-arnaudb.json
[11:48:53] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[11:48:56] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[11:49:06] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[11:49:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T343198)', diff saved to https://phabricator.wikimedia.org/P52337 and previous config saved to /var/cache/conftool/dbconfig/20230908-114911-arnaudb.json
[11:53:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-replica1005.wikimedia.org with OS bookworm
[11:53:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica1005.wikimedia.org
[11:54:04] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica1005.wikimedia.org with OS bookworm completed: - ldap-rep...
[11:54:54] <wikibugs>	 (CR) Jbond: [C: +2] puppetmaster: add pupetserveres back to git private [puppet] - https://gerrit.wikimedia.org/r/955911 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[11:55:17] <wikibugs>	 (PS1) Filippo Giunchedi: nagios: emit warnings from check_dsh_groups [puppet] - https://gerrit.wikimedia.org/r/955915 (https://phabricator.wikimedia.org/T314118)
[11:55:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica1006.wikimedia.org
[11:55:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:58:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002"
[11:58:25] <wikibugs>	 (PS1) Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/955916
[11:59:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002"
[11:59:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:59:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica1006.wikimedia.org on all recursors
[11:59:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica1006.wikimedia.org on all recursors
[11:59:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:01:38] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/955916 (owner: Muehlenhoff)
[12:04:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002"
[12:05:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002"
[12:05:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:05:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica1006.wikimedia.org on all recursors
[12:05:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica1006.wikimedia.org on all recursors
[12:05:50] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ldap-replica1006.wikimedia.org
[12:14:34] <wikibugs>	 Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (fgiunchedi)
[12:14:48] <wikibugs>	 Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (fgiunchedi)
[12:16:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica1006.wikimedia.org
[12:16:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:17:56] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:18:00] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ldap-replica1006.wikimedia.org
[12:18:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica1006.wikimedia.org
[12:18:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:18:59] <wikibugs>	 (CR) Elukey: [C: +1] "Great work!" [deployment-charts] - https://gerrit.wikimedia.org/r/955886 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[12:20:39] <wikibugs>	 (CR) Jbond: [C: +1] "change seems fine to me but I'm not the one to make the call on if its the right policy decision" [puppet] - https://gerrit.wikimedia.org/r/955915 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[12:20:45] <wikibugs>	 (PS5) TTO: Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245)
[12:21:01] <wikibugs>	 (CR) TTO: "Thanks for the comments, @Urbanecm!" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[12:21:23] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/955909 (owner: Muehlenhoff)
[12:21:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Thank you John, I'm ok with a sanity check only!" [puppet] - https://gerrit.wikimedia.org/r/955915 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[12:22:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002"
[12:23:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002"
[12:23:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:23:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica1006.wikimedia.org on all recursors
[12:23:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica1006.wikimedia.org on all recursors
[12:23:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica1006.wikimedia.org - jmm@cumin2002"
[12:23:52] <wikibugs>	 (CR) JMeybohm: [C: +1] rest-gateway: set SNI when using ingress (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[12:24:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica1006.wikimedia.org - jmm@cumin2002"
[12:26:23] <wikibugs>	 (CR) Elukey: ml-services: deployment settings for the recommendation-api-ng (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[12:27:51] <wikibugs>	 (CR) Elukey: ml-services: deployment settings for the recommendation-api-ng (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[12:27:53] <wikibugs>	 (CR) Ayounsi: Junos: Add more info on commit errors (3 comments) [software/homer] - https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: Ayounsi)
[12:29:12] <wikibugs>	 (PS1) Muehlenhoff: Fix cloudbackup alias [puppet] - https://gerrit.wikimedia.org/r/955923
[12:29:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-replica1006.wikimedia.org with OS bookworm
[12:29:31] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] Add SLO definition for the ORES Legacy service [grafana-grizzly] - https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) (owner: Elukey)
[12:29:50] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] slo_definitions: fix indentation and add missing descr for Lift Wing [grafana-grizzly] - https://gerrit.wikimedia.org/r/955769 (owner: Elukey)
[12:29:58] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica1006.wikimedia.org with OS bookworm
[12:31:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica1005.wikimedia.org
[12:31:50] <wikibugs>	 (PS1) Brouberol: Grant permissions on icinga to user Brouberol [puppet] - https://gerrit.wikimedia.org/r/955924
[12:32:46] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] services: more changes in Lift Wing's proxy setting in the API Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/955754 (https://phabricator.wikimedia.org/T345850) (owner: Ilias Sarantopoulos)
[12:32:51] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/955916 (owner: Muehlenhoff)
[12:33:36] <wikibugs>	 (Merged) jenkins-bot: services: more changes in Lift Wing's proxy setting in the API Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/955754 (https://phabricator.wikimedia.org/T345850) (owner: Ilias Sarantopoulos)
[12:35:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica1005.wikimedia.org
[12:36:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-replica1006.wikimedia.org with reason: host reimage
[12:39:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-replica1006.wikimedia.org with reason: host reimage
[12:40:15] <elukey>	 jouncebot: next
[12:40:15] <jouncebot>	 In 18 hour(s) and 19 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230909T0700)
[12:43:39] <wikibugs>	 (PS2) Hnowlan: rest-gateway: set SNI when using ingress [deployment-charts] - https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400)
[12:45:27] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: set SNI when using ingress (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[12:45:44] <wikibugs>	 (CR) JMeybohm: [C: -1] "Nice!" [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (owner: Ebernhardson)
[12:46:18] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: set SNI when using ingress [deployment-charts] - https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[12:48:40] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (BBlack) > to reduce load on LVS hosts  My recollection is that it wasn't really about raw load or PPS at the LVSes.  It was that our Linux kernel settings ha...
[12:48:48] <wikibugs>	 (PS1) Muehlenhoff: profile::piwik::database: Enforce type for port [puppet] - https://gerrit.wikimedia.org/r/955927
[12:49:17] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:18] <wikibugs>	 Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (jbond) Open→In progress p:Triage→Medium
[12:49:37] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[12:49:38] <wikibugs>	 Puppet, Infrastructure-Foundations, Puppet-Infrastructure: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (jbond)
[12:49:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[12:50:05] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/955927 (owner: Muehlenhoff)
[12:50:05] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:50:48] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[12:51:11] <wikibugs>	 (PS1) Jbond: puppetdb: migrate check to puppetdb [puppet] - https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909)
[12:51:13] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[12:51:33] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[12:51:40] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[12:51:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:52:07] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:52:38] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (BBlack) The current puppetized tuneables are at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/8ed59718c7a7603b61d7d42e05726fd11dae5eaa/...
[12:53:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-replica1006.wikimedia.org with OS bookworm
[12:53:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica1006.wikimedia.org
[12:53:29] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica1006.wikimedia.org with OS bookworm completed: - ldap-rep...
[12:53:31] <wikibugs>	 (CR) CI reject: [V: -1] puppetdb: migrate check to puppetdb [puppet] - https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909) (owner: Jbond)
[12:56:04] <wikibugs>	 (PS4) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890)
[12:57:34] <wikibugs>	 (CR) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[12:57:41] <icinga-wm>	 PROBLEM - puppet last run on kubestagemaster2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:57:51] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (BBlack) Reading into the code above and the history more and self-correcting: the ratelimiter doesn't apply to PTB packets, just some other informational pac...
[12:59:59] <logmsgbot>	 !log isaranto@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[13:00:11] <wikibugs>	 (PS1) Hnowlan: rest-gateway: route requests to media-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/955929 (https://phabricator.wikimedia.org/T336396)
[13:00:11] <logmsgbot>	 !log isaranto@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[13:01:31] <logmsgbot>	 !log isaranto@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[13:01:51] <logmsgbot>	 !log isaranto@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[13:03:43] <wikibugs>	 (PS1) Btullis: Add snapshot101[4-7] to the dsh group [puppet] - https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907)
[13:04:39] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: deployment settings for the recommendation-api-ng (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[13:05:07] <logmsgbot>	 !log isaranto@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[13:05:21] <logmsgbot>	 !log isaranto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[13:05:24] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: Btullis)
[13:05:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Grant permissions on icinga to user Brouberol [puppet] - https://gerrit.wikimedia.org/r/955924 (owner: Brouberol)
[13:06:22] <wikibugs>	 (CR) JMeybohm: [C: +1] citoid: update mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/955894 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[13:06:31] <wikibugs>	 (CR) JMeybohm: [C: +1] citoid: enable mesh tracing in staging [deployment-charts] - https://gerrit.wikimedia.org/r/955895 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[13:08:17] <wikibugs>	 (PS2) Btullis: Add snapshot101[4-7] to the mediawiki-installation dsh group [puppet] - https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907)
[13:11:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez)
[13:11:26] <wikibugs>	 (PS5) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890)
[13:12:33] <wikibugs>	 (CR) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[13:12:35] <wikibugs>	 (PS1) Ilias Sarantopoulos: fix: enwiktionary in API-Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850)
[13:15:33] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) thanks! @acooper RSA keys are being deprecated in some parts of our infrastructure already (T336769), so I'm wondering if you could pro...
[13:15:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[13:18:07] <wikibugs>	 (PS1) FNegri: [cluster::cloud_management] Don't install prod cookbooks [puppet] - https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894)
[13:18:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add snapshot101[4-7] to the mediawiki-installation dsh group [puppet] - https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: Btullis)
[13:18:39] <wikibugs>	 (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) (owner: FNegri)
[13:19:53] <icinga-wm>	 RECOVERY - puppet last run on kubestagemaster2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:20:41] <wikibugs>	 (CR) Elukey: [C: +1] "Added a nit for the commit msg, the rest looks good!" [deployment-charts] - https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) (owner: Ilias Sarantopoulos)
[13:20:45] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: route requests to media-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/955929 (https://phabricator.wikimedia.org/T336396) (owner: Hnowlan)
[13:20:58] <wikibugs>	 (PS2) Jbond: puppetdb: migrate check to puppetdb [puppet] - https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909)
[13:21:00] <wikibugs>	 (PS1) Jbond: check_puppet_run_changes: update to run on puppetdb [puppet] - https://gerrit.wikimedia.org/r/955939
[13:21:43] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: route requests to media-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/955929 (https://phabricator.wikimedia.org/T336396) (owner: Hnowlan)
[13:21:45] <wikibugs>	 (CR) AikoChou: [C: +2] "Thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/955886 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[13:21:49] <wikibugs>	 (CR) Kevin Bazira: [C: +2] "Thanks for the reviews :)" [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[13:22:15] <wikibugs>	 (CR) FNegri: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43189/console" [puppet] - https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) (owner: FNegri)
[13:23:20] <wikibugs>	 (Merged) jenkins-bot: ml-services: add annotations for inference_services [deployment-charts] - https://gerrit.wikimedia.org/r/955886 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[13:23:23] <icinga-wm>	 PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100%
[13:23:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (acooper) I followed these instructions already which requested rsa type (maybe worth updating the instructions if ed25519 is preferred now?) https:...
[13:23:37] <wikibugs>	 (Merged) jenkins-bot: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[13:23:47] <wikibugs>	 (CR) CI reject: [V: -1] puppetdb: migrate check to puppetdb [puppet] - https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909) (owner: Jbond)
[13:23:55] <wikibugs>	 (CR) CI reject: [V: -1] check_puppet_run_changes: update to run on puppetdb [puppet] - https://gerrit.wikimedia.org/r/955939 (owner: Jbond)
[13:23:55] <icinga-wm>	 PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:13] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[13:24:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[13:25:11] <wikibugs>	 (PS1) Vgutierrez: admin: Grant shell access to acooper [puppet] - https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877)
[13:25:58] <wikibugs>	 (CR) ArielGlenn: [C: +1] "Thanks, these are marked as spares but ought to get the updates since they won't be spare for long." [puppet] - https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: Btullis)
[13:28:47] <wikibugs>	 (PS2) Ilias Sarantopoulos: services: update Lift Wing's config in the API-Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850)
[13:29:06] <wikibugs>	 (PS3) Ilias Sarantopoulos: services: update Lift Wing's config in the API-Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850)
[13:29:26] <wikibugs>	 (CR) Ilias Sarantopoulos: services: update Lift Wing's config in the API-Gateway (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) (owner: Ilias Sarantopoulos)
[13:29:47] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (Papaul)
[13:30:08] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] services: update Lift Wing's config in the API-Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) (owner: Ilias Sarantopoulos)
[13:30:36] <wikibugs>	 (Abandoned) Bking: sshd_config: disable ssh-rsa public key signature algorithm [puppet] - https://gerrit.wikimedia.org/r/834340 (https://phabricator.wikimedia.org/T318345) (owner: Bking)
[13:30:55] <wikibugs>	 (Merged) jenkins-bot: services: update Lift Wing's config in the API-Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) (owner: Ilias Sarantopoulos)
[13:32:33] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) >>! In T345877#9152356, @acooper wrote: > I followed these instructions already which requested rsa type (maybe w...
[13:34:13] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:34:51] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[13:35:57] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) @papaul I've done some testing and I'm confident the IP GW moves for the row subnets to the Spines can be done gracefully.  I've yet to wo...
[13:37:48] <logmsgbot>	 !log isaranto@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[13:37:56] <logmsgbot>	 !log isaranto@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[13:38:47] <wikibugs>	 (PS3) Amire80: Add lucaswerkmeister.de to Planet [puppet] - https://gerrit.wikimedia.org/r/948203
[13:38:48] <logmsgbot>	 !log isaranto@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[13:39:08] <logmsgbot>	 !log isaranto@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[13:39:38] <logmsgbot>	 !log isaranto@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[13:39:54] <logmsgbot>	 !log isaranto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[13:43:22] <wikibugs>	 Puppet, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (jbond) This did get broken with the migration to the new puppetdbs as we migrated cumin to use...
[13:44:15] <wikibugs>	 (PS1) Amire80: Add Wikimedia Deutschland's tech news blog [puppet] - https://gerrit.wikimedia.org/r/955941
[13:44:16] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul) @cmooney thanks for the update. I think we can reuse those the MPO
[13:53:39] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) just to be the clear the RSA key is totally valid at this point, I just wanted to save @acooper more "pain" furth...
[13:54:32] <wikibugs>	 (CR) Vgutierrez: [C: -2] "blocked till we get all the required approvals" [puppet] - https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: Vgutierrez)
[13:55:41] <wikibugs>	 (PS1) Ssingh: 27.35.198.in-addr.arpa: update PTR for 198.35.27.27 [dns] - https://gerrit.wikimedia.org/r/955943 (https://phabricator.wikimedia.org/T329219)
[13:56:12] <wikibugs>	 (CR) Andrew Bogott: "bookworm will install ceph-common version 16.2.11+ds-2.  On Bullseye we're running 15.2.16-1" [puppet] - https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) (owner: FNegri)
[13:58:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T343198)', diff saved to https://phabricator.wikimedia.org/P52340 and previous config saved to /var/cache/conftool/dbconfig/20230908-135803-arnaudb.json
[13:58:07] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[13:58:26] <wikibugs>	 (CR) Ayounsi: [C: +1] Add static network defs and DHCP config for new codfw subnets [puppet] - https://gerrit.wikimedia.org/r/954896 (https://phabricator.wikimedia.org/T327938) (owner: Cathal Mooney)
[13:59:43] <wikibugs>	 (CR) Ayounsi: [C: +1] Homer YAML additions for new row A/B switches in Codfw [homer/public] - https://gerrit.wikimedia.org/r/954697 (https://phabricator.wikimedia.org/T327938) (owner: Cathal Mooney)
[14:06:31] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "David is not 100% sure that this will be backwards-compatible, but let's find out!" [puppet] - https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) (owner: FNegri)
[14:07:33] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:43] <wikibugs>	 (PS1) Jbond: puppet-agent: create a alertmanager check for changing puppet runs [alerts] - https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909)
[14:13:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P52341 and previous config saved to /var/cache/conftool/dbconfig/20230908-141309-arnaudb.json
[14:17:13] <wikibugs>	 (CR) Vgutierrez: [C: -2] "(SSH key verified OOB via Slack)" [puppet] - https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: Vgutierrez)
[14:17:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:20:00] <wikibugs>	 (PS1) Jbond: P:cumin::master: drop puppet constant change check [puppet] - https://gerrit.wikimedia.org/r/955949 (https://phabricator.wikimedia.org/T345909)
[14:20:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, see inline for criticality, not a blocker" [alerts] - https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) (owner: Jbond)
[14:20:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Jclark-ctr) Replaced optic and cable again  @cmooney @Eevans
[14:20:55] <icinga-wm>	 RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:43] <wikibugs>	 (PS2) Jbond: puppet-agent: create a alertmanager check for changing puppet runs [alerts] - https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909)
[14:21:50] <wikibugs>	 (CR) Jbond: puppet-agent: create a alertmanager check for changing puppet runs (1 comment) [alerts] - https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) (owner: Jbond)
[14:21:57] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (Jclark-ctr) We are monitoring this error it has been 12 days with no faults
[14:24:13] <wikibugs>	 (CR) Jbond: [C: +2] puppet-agent: create a alertmanager check for changing puppet runs [alerts] - https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) (owner: Jbond)
[14:24:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:25:26] <wikibugs>	 (Merged) jenkins-bot: puppet-agent: create a alertmanager check for changing puppet runs [alerts] - https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) (owner: Jbond)
[14:25:38] <wikibugs>	 (CR) Jbond: [C: +2] P:cumin::master: drop puppet constant change check [puppet] - https://gerrit.wikimedia.org/r/955949 (https://phabricator.wikimedia.org/T345909) (owner: Jbond)
[14:26:11] <wikibugs>	 (CR) Bking: "Adding ServiceOps teammates since this is related to 955032 ." [deployment-charts] - https://gerrit.wikimedia.org/r/955033 (owner: Ebernhardson)
[14:26:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  moss-be1003 - jclark@cumin1001"
[14:27:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  moss-be1003 - jclark@cumin1001"
[14:27:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:27:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host moss-be1003
[14:28:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host moss-be1003
[14:28:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host moss-be1003.mgmt.eqiad.wmnet with reboot policy FORCED
[14:28:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P52342 and previous config saved to /var/cache/conftool/dbconfig/20230908-142815-arnaudb.json
[14:29:07] <icinga-wm>	 RECOVERY - Host mw2444 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms
[14:29:08] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (Jclark-ctr)
[14:33:27] <wikibugs>	 SRE, ops-codfw, serviceops-radar: mw2444 down - https://phabricator.wikimedia.org/T345884 (Jhancock.wm) p:Triage→Medium a:Jhancock.wm error found in the lifecycle log `CPU 2 machine check error detected.` I powered down the server and drained the flea power. waited 5 minutes. server is ba...
[14:39:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:41:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  stat1011 - jclark@cumin1001"
[14:42:28] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  stat1011 - jclark@cumin1001"
[14:42:28] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:42:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host stat1011
[14:42:52] <wikibugs>	 (PS3) Bking: Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (owner: Ebernhardson)
[14:43:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T343198)', diff saved to https://phabricator.wikimedia.org/P52343 and previous config saved to /var/cache/conftool/dbconfig/20230908-144321-arnaudb.json
[14:43:27] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[14:43:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host stat1011
[14:43:55] <wikibugs>	 (CR) CI reject: [V: -1] Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (owner: Ebernhardson)
[14:44:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host stat1011.mgmt.eqiad.wmnet with reboot policy FORCED
[14:45:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:46:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr)
[14:48:18] <wikibugs>	 (PS14) Stevemunene: datahub: add oidc production settings [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874)
[14:48:30] <wikibugs>	 (CR) Bking: "I did the easy stuff (I think), skillfully avoiding Rakefile ;)" [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (owner: Ebernhardson)
[14:50:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:50:49] <wikibugs>	 (PS1) Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620)
[14:54:51] <wikibugs>	 (PS2) Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620)
[14:55:54] <wikibugs>	 (CR) Stevemunene: datahub: add oidc production settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[14:56:08] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: Majavah)
[14:58:02] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[14:58:15] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:01:17] <wikibugs>	 (PS3) Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620)
[15:02:46] <wikibugs>	 (PS2) Ssingh: 27.35.198.in-addr.arpa: update PTR for 198.35.27.27 [dns] - https://gerrit.wikimedia.org/r/955943 (https://phabricator.wikimedia.org/T329219)
[15:07:51] <wikibugs>	 (PS4) Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620)
[15:08:43] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host stat1011.mgmt.eqiad.wmnet with reboot policy FORCED
[15:11:00] <wikibugs>	 (CR) Brouberol: [C: +2] Grant permissions on icinga to user Brouberol [puppet] - https://gerrit.wikimedia.org/r/955924 (owner: Brouberol)
[15:11:44] <brouberol>	 ^ misclick. I removed my vote
[15:13:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1011.eqiad.wmne']
[15:13:58] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[15:15:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host moss-be1003.mgmt.eqiad.wmnet with reboot policy FORCED
[15:16:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be1003.eqiad.wmnet']
[15:16:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['moss-be1003.eqiad.wmnet']
[15:16:41] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (Jclark-ctr)
[15:17:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr)
[15:19:20] <wikibugs>	 (CR) BBlack: [C: +1] 27.35.198.in-addr.arpa: update PTR for 198.35.27.27 [dns] - https://gerrit.wikimedia.org/r/955943 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:19:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (VRiley-WMF) kubernetes1042 - D 8. U 31. port 21 CableID 1899 kubernetes1043 - D 8. U 32. port 19 CableID 1902 kubernetes1044 - D 8. U 33. port 34 CableID PUR-0023000004 kuber...
[15:21:09] <wikibugs>	 (PS1) Ssingh: hiera: remove references to nsa.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219)
[15:22:17] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['stat1011.eqiad.wmne']
[15:26:38] <wikibugs>	 (CR) Ssingh: "We should deploy this on Monday as renaming the configuration checks has proven to be a bit tricky historically." [puppet] - https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:27:32] <wikibugs>	 (CR) Ssingh: [C: +2] 27.35.198.in-addr.arpa: update PTR for 198.35.27.27 [dns] - https://gerrit.wikimedia.org/r/955943 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:27:41] <sukhe>	 !log running authdns-update for CR 955943
[15:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:15] <wikibugs>	 (PS1) Ssingh: wikimedia.org: remove nsa.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/955962 (https://phabricator.wikimedia.org/T329219)
[15:34:36] <wikibugs>	 (CR) Ssingh: "On the other hand, this is mostly a NOOP change other than the anycast-hc side of things, so I am fine with merging it today and finishing" [puppet] - https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:37:09] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye
[15:37:15] <wikibugs>	 (CR) Btullis: datahub: add oidc production settings (6 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[15:37:16] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye
[15:39:52] <wikibugs>	 (CR) David Caro: [C: +1] ceph::common: allow bookworm and later versions [puppet] - https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) (owner: FNegri)
[15:42:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[15:44:01] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye
[15:44:08] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -...
[15:44:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt kubernetes1027 - jclark@cumin1001"
[15:45:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt kubernetes1027 - jclark@cumin1001"
[15:45:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:48:57] <wikibugs>	 (CR) Majavah: [C: +1] Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[15:49:18] <wikibugs>	 (CR) FNegri: [C: +2] ceph::common: allow bookworm and later versions [puppet] - https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) (owner: FNegri)
[15:51:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1027
[15:51:56] <wikibugs>	 (CR) BBlack: [C: +1] hiera: remove references to nsa.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:52:36] <wikibugs>	 (CR) FNegri: [C: +2] Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[15:52:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jclark-ctr)
[15:52:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1027
[15:53:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1027.mgmt.eqiad.wmnet with reboot policy FORCED
[15:53:52] <wikibugs>	 (PS4) Andrea Denisse: superset: Move superset logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790)
[15:54:43] <wikibugs>	 (PS1) Cparle: Disable UploadWizard CTA for MachineVision [mediawiki-config] - https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187)
[15:54:53] <wikibugs>	 (PS5) Andrea Denisse: superset: Move superset metrics to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790)
[15:55:04] <wikibugs>	 (CR) Cparle: [C: -2] Disable UploadWizard CTA for MachineVision [mediawiki-config] - https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: Cparle)
[15:55:54] <wikibugs>	 (PS15) Stevemunene: datahub: add oidc production settings [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874)
[15:56:07] <wikibugs>	 (CR) Andrea Denisse: superset: Move superset metrics to statsd-exporter (3 comments) [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[15:57:36] <wikibugs>	 (CR) Stevemunene: datahub: add oidc production settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[15:59:05] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/955776/43190/" [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[16:01:49] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:15] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:09] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:05:55] <sukhe>	 hi folks
[16:06:13] <sukhe>	 was someone working on anything related to the DNS updates in netbox?
[16:06:35] <sukhe>	   File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 228, in _collect_device
[16:06:38] <sukhe>	     if self.addresses[primary.id].dns_name:
[16:06:41] <sukhe>	 KeyError: 14575
[16:10:32] <wikibugs>	 (PS1) Bking: dse-k8s-services: init rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/955969 (https://phabricator.wikimedia.org/T344614)
[16:12:18] <wikibugs>	 (CR) DCausse: [C: +1] dse-k8s-services: init rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/955969 (https://phabricator.wikimedia.org/T344614) (owner: Bking)
[16:12:21] <wikibugs>	 (CR) Bking: [C: +2] dse-k8s-services: init rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/955969 (https://phabricator.wikimedia.org/T344614) (owner: Bking)
[16:13:08] <wikibugs>	 (Merged) jenkins-bot: dse-k8s-services: init rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/955969 (https://phabricator.wikimedia.org/T344614) (owner: Bking)
[16:16:58] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[16:18:25] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[16:20:31] <wikibugs>	 SRE, ops-codfw, serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (Jhancock.wm) opened service request with Dell: 175561524
[16:27:34] <wikibugs>	 (PS6) Urbanecm: Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[16:28:58] <wikibugs>	 (CR) Urbanecm: [C: +1] Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[16:33:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (VRiley-WMF) kubernetes1032 - C 6. U 14. port 14 CableID 3220 kubernetes1033 - C 6. U 15. port 17 CableID 3223 kubernetes1034 - C 6. U 16. port 13 CableID 3219 kubernetes1035...
[16:37:44] <wikibugs>	 (PS1) Andrew Bogott: wmf_sink: don't assume project_name == project_id [puppet] - https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158)
[16:38:08] <wikibugs>	 (CR) CI reject: [V: -1] wmf_sink: don't assume project_name == project_id [puppet] - https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[16:39:09] <wikibugs>	 (PS1) Andrew Bogott: designatemakedomain: don't install for python2 [puppet] - https://gerrit.wikimedia.org/r/955974
[16:39:37] <wikibugs>	 (CR) FNegri: [C: +1] designatemakedomain: don't install for python2 [puppet] - https://gerrit.wikimedia.org/r/955974 (owner: Andrew Bogott)
[16:40:14] <wikibugs>	 (PS1) Andrew Bogott: wmfkeystonehooks: don't install python2 versions [puppet] - https://gerrit.wikimedia.org/r/955975
[16:40:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans) >>! In T344259#9152542, @Jclark-ctr wrote: > Replaced optic and cable again  @cmooney @Eevans   Thanks @Jclark-ctr.  Unfortunately it didn't work. :(    @cmooney,...
[16:41:31] <wikibugs>	 (PS1) Andrew Bogott: designate-sink: don't install python2 versions of our sink plugins [puppet] - https://gerrit.wikimedia.org/r/955976
[16:41:52] <wikibugs>	 (CR) Andrew Bogott: [C: +2] designatemakedomain: don't install for python2 [puppet] - https://gerrit.wikimedia.org/r/955974 (owner: Andrew Bogott)
[16:42:21] <wikibugs>	 (CR) FNegri: [C: +1] wmfkeystonehooks: don't install python2 versions [puppet] - https://gerrit.wikimedia.org/r/955975 (owner: Andrew Bogott)
[16:42:30] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmfkeystonehooks: don't install python2 versions [puppet] - https://gerrit.wikimedia.org/r/955975 (owner: Andrew Bogott)
[16:43:54] <wikibugs>	 (CR) FNegri: [C: +1] designate-sink: don't install python2 versions of our sink plugins [puppet] - https://gerrit.wikimedia.org/r/955976 (owner: Andrew Bogott)
[16:44:03] <wikibugs>	 (CR) Andrew Bogott: [C: +2] designate-sink: don't install python2 versions of our sink plugins [puppet] - https://gerrit.wikimedia.org/r/955976 (owner: Andrew Bogott)
[16:44:50] <wikibugs>	 SRE, ops-codfw, Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Jhancock.wm) I think this is fixed. I am seeing four disks in the idrac and bios. Can someone confirm?
[16:47:29] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:13:16] <taavi>	 !log reprepro copy bookworm-wikimedia bullseye-wikimedia prometheus-memcached-exporter # T345810
[17:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:21] <stashbot>	 T345810: [openstack] Upgrade codfw hosts to bookworm - https://phabricator.wikimedia.org/T345810
[17:15:05] <wikibugs>	 (PS1) Majavah: openstack: use stock mariadb on bookworm [puppet] - https://gerrit.wikimedia.org/r/955977
[17:15:20] <wikibugs>	 (PS2) Andrew Bogott: wmf_sink: don't assume project_name == project_id [puppet] - https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158)
[17:15:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[17:19:51] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:20:30] <wikibugs>	 (CR) Andrew Bogott: [C: +2] openstack: use stock mariadb on bookworm [puppet] - https://gerrit.wikimedia.org/r/955977 (owner: Majavah)
[17:20:46] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:24:58] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[17:44:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:49:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:49:55] <brett>	 hm, that was a bit of an ugly spike
[17:51:29] <brett>	 But I guess no worse than some yesterday
[17:54:47] <brett>	 > This alert means something is currently very wrong
[17:55:06] <brett>	 Seems this alert fires frequently enough that perhaps that isn't the case?
[18:09:09] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:14:09] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:14:17] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:19:09] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:25:39] <wikibugs>	 (PS1) Kimberly Sarabia: Remove zebra from beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/955980 (https://phabricator.wikimedia.org/T333180)
[18:28:01] <wikibugs>	 (CR) Jdlrobson: [C: +1] Remove zebra from beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/955980 (https://phabricator.wikimedia.org/T333180) (owner: Kimberly Sarabia)
[18:33:19] <kimberly_sarabia>	 hello. is there anyone around today to deploy a beta cluster only patch? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/955980
[18:45:38] <RhinosF1>	 kimberly_sarabia: is there any urgency?
[18:46:08] <kimberly_sarabia>	 no urgency, it can wait
[18:46:41] <RhinosF1>	 kimberly_sarabia: it might be worth posting in #wikimedia-releng but it is Friday evening / afternoon
[18:49:05] <kimberly_sarabia>	 no worries. ill schedule it for mon
[19:12:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:17:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:30:02] <wikibugs>	 (PS2) Milimetric: Map Jade content handler to UnknownContentHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874)
[19:57:12] <wikibugs>	 (PS4) Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032
[19:57:45] <wikibugs>	 (CR) CI reject: [V: -1] Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (owner: Ebernhardson)
[20:06:34] <wikibugs>	 (PS5) Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032
[20:07:08] <wikibugs>	 (CR) CI reject: [V: -1] Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (owner: Ebernhardson)
[20:13:03] <wikibugs>	 (PS6) Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032
[20:14:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:19:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:24:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[20:26:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1027.mgmt.eqiad.wmnet with reboot policy FORCED
[20:27:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt kubernetes102 - jclark@cumin1001"
[20:28:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt kubernetes102 - jclark@cumin1001"
[20:28:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:40:43] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:35] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:49:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[20:52:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt kubernetes102 - jclark@cumin1001"
[20:53:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt kubernetes102 - jclark@cumin1001"
[20:53:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:56:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jclark-ctr)
[20:56:57] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:57:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1029
[20:57:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1030
[20:57:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1028
[20:58:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1029
[20:58:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1028
[20:58:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1031
[20:58:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1032
[20:58:29] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1030
[20:58:38] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1031
[20:58:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1033
[20:58:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1034
[20:59:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1034
[20:59:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1033
[20:59:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1032
[20:59:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1035
[20:59:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1036
[20:59:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1037
[21:00:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1036
[21:00:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1038
[21:00:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1035
[21:00:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1039
[21:00:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1037
[21:01:08] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1038
[21:02:00] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host kubernetes1039
[21:02:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1037
[21:02:10] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1037
[21:02:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1039
[21:03:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1038
[21:03:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1039
[21:03:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1040
[21:04:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1040
[21:04:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1038
[21:04:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1042
[21:05:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1043
[21:06:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1041
[21:06:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1042
[21:06:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1043
[21:07:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1044
[21:07:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1045
[21:07:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1041
[21:08:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1045
[21:08:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1044
[21:08:26] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[21:08:39] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[21:08:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52345 and previous config saved to /var/cache/conftool/dbconfig/20230908-210844-arnaudb.json
[21:08:48] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[21:09:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1046
[21:09:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1047
[21:09:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1047
[21:09:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1048
[21:09:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1048
[21:09:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1047
[21:09:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1047
[21:09:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1048
[21:09:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1048
[21:09:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1049
[21:09:43] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1049
[21:09:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1050
[21:09:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1051
[21:09:52] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1050
[21:10:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1050
[21:10:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1050
[21:10:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1051
[21:10:15] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1046
[21:10:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1052
[21:10:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1053
[21:10:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1052
[21:10:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1053
[21:10:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1054
[21:10:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1055
[21:10:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1056
[21:10:48] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1054
[21:10:53] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1055
[21:10:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1056
[21:14:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1028.mgmt.eqiad.wmnet with reboot policy FORCED
[21:14:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[21:14:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1030.mgmt.eqiad.wmnet with reboot policy FORCED
[21:14:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1031.mgmt.eqiad.wmnet with reboot policy FORCED
[21:14:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1032.mgmt.eqiad.wmnet with reboot policy FORCED
[21:15:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1033.mgmt.eqiad.wmnet with reboot policy FORCED
[21:15:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1034.mgmt.eqiad.wmnet with reboot policy FORCED
[21:15:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1035.mgmt.eqiad.wmnet with reboot policy FORCED
[21:15:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1036.mgmt.eqiad.wmnet with reboot policy FORCED
[21:15:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[21:34:40] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1028.mgmt.eqiad.wmnet with reboot policy FORCED
[21:34:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1030.mgmt.eqiad.wmnet with reboot policy FORCED
[21:34:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1032.mgmt.eqiad.wmnet with reboot policy FORCED
[21:34:50] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1033.mgmt.eqiad.wmnet with reboot policy FORCED
[21:34:53] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1034.mgmt.eqiad.wmnet with reboot policy FORCED
[21:34:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1036.mgmt.eqiad.wmnet with reboot policy FORCED
[21:34:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1035.mgmt.eqiad.wmnet with reboot policy FORCED
[21:40:03] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (Ahoelzl)
[21:40:30] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[21:41:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (odimitrijevic) Approved!
[21:44:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[22:36:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:36:24] <rzl>	 looking
[22:36:55] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3008 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb6_443: Servers cp3068.esams.wmnet, cp3070.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:37:07] <jinxer-wm>	 (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:37:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1079.eqiad.wmnet, cp1085.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1081.eqiad.wmnet, cp1085.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmne
[22:37:23] <icinga-wm>	 7.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:37:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs6003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled: textlb_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp601
[22:37:33] <icinga-wm>	 wmnet, cp6016.drmrs.wmnet are marked down but pooled: testlb6_443: Servers cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled: textlb6_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled https://wikit
[22:37:34] <icinga-wm>	 media.org/wiki/PyBal
[22:37:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp6010.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled: textlb_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp601
[22:37:35] <icinga-wm>	 wmnet are marked down but pooled: testlb6_443: Servers cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled: textlb6_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled https://wikit
[22:37:35] <icinga-wm>	 media.org/wiki/PyBal
[22:38:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:38:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:38:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs6003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:39:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs6001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:39:40] <jinxer-wm>	 (LVSHighRX) firing: Excessive RX traffic on lvs3008:9100 (eno12399np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[22:41:07] <jinxer-wm>	 (ProbeDown) resolved: (10) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:41:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:42:07] <jinxer-wm>	 (ProbeDown) resolved: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:44:41] <jinxer-wm>	 (LVSHighRX) resolved: Excessive RX traffic on lvs3008:9100 (eno12399np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[22:46:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[23:10:01] <jinxer-wm>	 (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:15:01] <jinxer-wm>	 (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown