[00:07:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T343198)', diff saved to https://phabricator.wikimedia.org/P52314 and previous config saved to /var/cache/conftool/dbconfig/20230908-000742-arnaudb.json
[00:07:46] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:22:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P52315 and previous config saved to /var/cache/conftool/dbconfig/20230908-002248-arnaudb.json
[00:37:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P52316 and previous config saved to /var/cache/conftool/dbconfig/20230908-003755-arnaudb.json
[00:38:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/955017
[00:38:30] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/955017 (owner: TrainBranchBot)
[00:46:51] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T343198)', diff saved to https://phabricator.wikimedia.org/P52317 and previous config saved to /var/cache/conftool/dbconfig/20230908-005301-arnaudb.json
[00:53:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[00:53:05] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:53:16] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[00:53:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/955017 (owner: TrainBranchBot)
[00:53:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52318 and previous config saved to /var/cache/conftool/dbconfig/20230908-005323-arnaudb.json
[01:24:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:29:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:38:13] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) (owner: Herron)
[01:55:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:00:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:29] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:05] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:16:27] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:03] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:42:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52319 and previous config saved to /var/cache/conftool/dbconfig/20230908-034241-arnaudb.json
[03:42:45] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[03:45:39] (PS2) Andrea Denisse: superset: Move superset logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790)
[03:49:52] (CR) Andrea Denisse: superset: Move superset logs to statsd-exporter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[03:57:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P52320 and previous config saved to /var/cache/conftool/dbconfig/20230908-035747-arnaudb.json
[04:12:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P52321 and previous config saved to /var/cache/conftool/dbconfig/20230908-041254-arnaudb.json
[04:23:09] (CR) Ryan Kemper: [C: +2] wdqs-internal: switch wdqs1016 from public to internal role [puppet] - https://gerrit.wikimedia.org/r/955396 (https://phabricator.wikimedia.org/T314890) (owner: Bking)
[04:28:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52322 and previous config saved to /var/cache/conftool/dbconfig/20230908-042800-arnaudb.json
[04:28:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[04:28:04] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[04:28:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[04:28:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T343198)', diff saved to https://phabricator.wikimedia.org/P52323 and previous config saved to /var/cache/conftool/dbconfig/20230908-042821-arnaudb.json
[04:29:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS bullseye
[04:54:59] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage
[04:57:29] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage
[05:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:14:43] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/955699 (owner: Majavah)
[05:22:09] (PS1) Muehlenhoff: Decom furud [puppet] - https://gerrit.wikimedia.org/r/955859 (https://phabricator.wikimedia.org/T347867)
[05:23:44] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1016.eqiad.wmnet with OS bullseye
[05:24:16] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[05:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:44:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT endpointslices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:49:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT endpointslices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230908T0600)
[06:20:14] (PS3) Andrea Denisse: superset: Move superset logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790)
[06:20:59] (CR) Andrea Denisse: superset: Move superset logs to statsd-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[06:21:25] (CR) EoghanGaffney: "One small question, then I think we're ready to go!" [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth)
[06:22:57] (PS4) Muehlenhoff: Use a single ensure for managing the nftables state [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497)
[06:23:17] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:24:12] (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/955776/43179/" [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[06:24:26] (CR) Andrea Denisse: superset: Move superset logs to statsd-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[06:29:54] (CR) Muehlenhoff: [C: +2] Adapt transition code for ferm -> nftables [puppet] - https://gerrit.wikimedia.org/r/955774 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:32:20] (PS1) JMeybohm: eventrouter: Update to 0.4.0-1 [deployment-charts] - https://gerrit.wikimedia.org/r/955863 (https://phabricator.wikimedia.org/T329826)
[06:33:57] (CR) JMeybohm: [C: +1] charts: add the python-webapp chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955584 (owner: Elukey)
[06:34:27] (CR) JMeybohm: [C: +1] python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725 (owner: Elukey)
[06:35:59] (CR) JMeybohm: [C: +2] eventrouter: Update to 0.4.0-1 [deployment-charts] - https://gerrit.wikimedia.org/r/955863 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[06:38:27] (Merged) jenkins-bot: eventrouter: Update to 0.4.0-1 [deployment-charts] - https://gerrit.wikimedia.org/r/955863 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[06:45:58] PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:18] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST clusterroles) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:49:18] (CR) Ilias Sarantopoulos: [C: +1] python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725 (owner: Elukey)
[06:49:42] (CR) Ilias Sarantopoulos: [C: +1] charts: add the python-webapp chart [deployment-charts] - https://gerrit.wikimedia.org/r/955584 (owner: Elukey)
[06:49:56] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:19] (PS1) Muehlenhoff: Pass down the ensure to the requestctl settings [puppet] - https://gerrit.wikimedia.org/r/955865 (https://phabricator.wikimedia.org/T336497)
[06:53:18] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST clusterroles) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:56:38] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/955865 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:58:37] (PS3) Slyngshede: LDAPBACKEND: Add validator for checking CommonName [software/bitu] - https://gerrit.wikimedia.org/r/955713
[06:58:49] (CR) Slyngshede: LDAPBACKEND: Add validator for checking CommonName (2 comments) [software/bitu] - https://gerrit.wikimedia.org/r/955713 (owner: Slyngshede)
[06:59:15] (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/955713 (owner: Slyngshede)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230908T0700)
[07:02:03] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[07:13:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T343198)', diff saved to https://phabricator.wikimedia.org/P52324 and previous config saved to /var/cache/conftool/dbconfig/20230908-071322-arnaudb.json
[07:13:25] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[07:16:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:20:55] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[07:21:07] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[07:21:16] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:21:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:21:29] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[07:21:39] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[07:22:06] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[07:22:12] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[07:22:34] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[07:23:13] !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[07:23:36] !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[07:23:46] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[07:24:13] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[07:24:23] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[07:24:44] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[07:25:18] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[07:25:32] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[07:25:48] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:26:07] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:26:18] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:27:01] (CR) Filippo Giunchedi: [C: +1] aptrepo: amend pin to allow grafana 9.4.x [puppet] - https://gerrit.wikimedia.org/r/955014 (https://phabricator.wikimedia.org/T345362) (owner: Cwhite)
[07:27:31] (CR) Filippo Giunchedi: [C: +1] profile::prometheus::statsd_exporter: add support for empty mappings [puppet] - https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) (owner: Herron)
[07:27:50] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:28:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P52325 and previous config saved to /var/cache/conftool/dbconfig/20230908-072828-arnaudb.json
[07:31:17] (CR) Filippo Giunchedi: [C: -1] "It seems the change to install statsd_exporter class got lost between patchsets, other than that LGTM" [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse)
[07:31:52] SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) Open→Stalled p:Triage→Medium `deployment` membership requires the approval of @thcipriani and `analytics-privatedata-us...
[07:32:13] SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez)
[07:32:15] SRE, Traffic: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743 (ayounsi)
[07:32:19] SRE, Infrastructure-Foundations, Traffic, netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (ayounsi)
[07:34:36] SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) we are also pending on @acooper submitting their public SSH key
[07:40:33] (CR) Elukey: [C: +2] charts: add the python-webapp chart [deployment-charts] - https://gerrit.wikimedia.org/r/955584 (owner: Elukey)
[07:40:47] (CR) Elukey: [C: +2] python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725 (owner: Elukey)
[07:40:53] (PS2) Elukey: python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725
[07:40:55] (CR) CI reject: [V: -1] python-webapp: update mesh module to 1.4.x [deployment-charts] - https://gerrit.wikimedia.org/r/955725 (owner: Elukey)
[07:43:08] (PS1) Elukey: ml-services: move ores-legacy to the python-webapp chart [deployment-charts] - https://gerrit.wikimedia.org/r/955869
[07:43:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P52326 and previous config saved to /var/cache/conftool/dbconfig/20230908-074334-arnaudb.json
[07:52:11] (CR) Elukey: [C: +2] ml-services: move ores-legacy to the python-webapp chart [deployment-charts] - https://gerrit.wikimedia.org/r/955869 (owner: Elukey)
[07:53:15] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43180/console" [puppet] - https://gerrit.wikimedia.org/r/955412 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[07:55:46] (CR) Elukey: [V: +1 C: +1] cassandra: remove cassandra/twcs deployment [puppet] - https://gerrit.wikimedia.org/r/955412 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[07:58:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T343198)', diff saved to https://phabricator.wikimedia.org/P52327 and previous config saved to /var/cache/conftool/dbconfig/20230908-075840-arnaudb.json
[07:58:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[07:58:44] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[07:58:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[07:59:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52328 and previous config saved to /var/cache/conftool/dbconfig/20230908-075901-arnaudb.json
[08:01:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[08:03:04] (PS1) Hashar: tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695)
[08:04:03] (CR) CI reject: [V: -1] tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) (owner: Hashar)
[08:04:44] (PS1) Hashar: envoyproxy: tox.ini: whitelist_externals -> allowlist_externals [puppet] - https://gerrit.wikimedia.org/r/955876 (https://phabricator.wikimedia.org/T345695)
[08:06:12] !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[08:09:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[08:10:14] SRE, SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (phuedx) >>! In T345696#9150637, @Fabfur wrote: > I think you should have access now, please let me know if it's not the case and I'll investigate further! Confirmed. Thanks!
[08:13:48] SRE, SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (Fabfur) Stalled→Resolved
[08:17:23] (CR) Btullis: datahub: add oidc production settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[08:18:25] (PS1) Hashar: tox.ini: whitelist_externals -> allowlist_externals [software] - https://gerrit.wikimedia.org/r/955880 (https://phabricator.wikimedia.org/T345695)
[08:23:21] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[08:23:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[08:25:49] (CR) Slyngshede: [V: +2] LDAPBACKEND: Add validator for checking CommonName [software/bitu] - https://gerrit.wikimedia.org/r/955713 (owner: Slyngshede)
[08:25:51] (CR) Slyngshede: [V: +2 C: +2] LDAPBACKEND: Add validator for checking CommonName [software/bitu] - https://gerrit.wikimedia.org/r/955713 (owner: Slyngshede)
[08:26:44] (Abandoned) AikoChou: changeprop: allow retries for liftwing streams with 500 responses [deployment-charts] - https://gerrit.wikimedia.org/r/954969 (owner: AikoChou)
[08:34:52] (PS1) Elukey: role::deployment_server::kubernets: add config for rec-api-ng [labs/private] - https://gerrit.wikimedia.org/r/955882
[08:35:26] (CR) Elukey: [V: +2 C: +2] role::deployment_server::kubernets: add config for rec-api-ng [labs/private] - https://gerrit.wikimedia.org/r/955882 (owner: Elukey)
[08:41:39] (PS1) Slyngshede: P:IDM Enable LDAP validators for usernames. [puppet] - https://gerrit.wikimedia.org/r/955884
[08:48:46] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/955884 (owner: Slyngshede)
[08:52:37] (PS1) Elukey: admin_ng: add the rec-api-ng's namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/955885
[08:53:22] (PS2) Elukey: admin_ng: add the rec-api-ng's namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/955885
[08:53:37] (CR) Arturo Borrero Gonzalez: [C: +1] Use a single ensure for managing the nftables state [puppet] - https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[08:54:48] (PS1) AikoChou: ml-services: add annotations for inference_services [deployment-charts] - https://gerrit.wikimedia.org/r/955886 (https://phabricator.wikimedia.org/T344058)
[08:55:14] (PS1) Elukey: profile::k8s::deployment_server: add config for recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/955887
[08:55:34] (PS4) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160
[08:58:19] (CR) AikoChou: ml-services: add annotations for inference_services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/955582 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[08:58:45] (CR) Kevin Bazira: [C: +1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/955885 (owner: Elukey)
[08:59:14] (CR) Ilias Sarantopoulos: [C: +1] admin_ng: add the rec-api-ng's namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/955885 (owner: Elukey)
[08:59:56] (Abandoned) AikoChou: ml-services: add annotations for inference_services [deployment-charts] - https://gerrit.wikimedia.org/r/955582 (https://phabricator.wikimedia.org/T344058) (owner: AikoChou)
[09:00:14] (CR) Kevin Bazira: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/955887 (owner: Elukey)
[09:00:36] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[09:01:59] (CR) Ilias Sarantopoulos: [C: +1] profile::k8s::deployment_server: add config for recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/955887 (owner: Elukey)
[09:02:19] SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) I almost forgot, for `analytics-privatedata-users` I'm assuming @acooper needs a kerberos principal as well, details available on https...
[09:02:37] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43181/console" [puppet] - https://gerrit.wikimedia.org/r/955887 (owner: Elukey)
[09:03:42] (CR) Elukey: [V: +1 C: +2] profile::k8s::deployment_server: add config for recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/955887 (owner: Elukey)
[09:04:57] (CR) Elukey: [C: +2] admin_ng: add the rec-api-ng's namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/955885 (owner: Elukey)
[09:10:36] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:11:41] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[09:11:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:13:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:15:10] (Abandoned) Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[09:15:16] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:15:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:16:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:16:23] (PS2) Muehlenhoff: Decom furud [puppet] - https://gerrit.wikimedia.org/r/955859 (https://phabricator.wikimedia.org/T347867)
[09:16:59] (PS1) Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890)
[09:17:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:19:25] (CR) Muehlenhoff: [C: +2] Decom furud [puppet] - https://gerrit.wikimedia.org/r/955859 (https://phabricator.wikimedia.org/T347867) (owner: Muehlenhoff)
[09:22:59] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts furud.codfw.wmnet
[09:25:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:26:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:28:25] (CR) David Caro: P:wmcs: unify toolsdb profiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: Majavah)
[09:28:31] (CR) Hnowlan: [C: +2] rest-gateway: preserve cluster hostname when using ingress [deployment-charts] - https://gerrit.wikimedia.org/r/955320 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[09:28:58] (PS2) FNegri: Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: Andrew Bogott)
[09:29:21] (Merged) jenkins-bot: rest-gateway: preserve cluster hostname when using ingress [deployment-charts] - https://gerrit.wikimedia.org/r/955320 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[09:29:27] (PS1) Majavah: P:puppetserver: fix reports location [puppet] - https://gerrit.wikimedia.org/r/955890
[09:29:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: furud.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:30:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:30:13] SRE, ops-codfw, decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (Peachey88)
[09:31:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: furud.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:31:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:31:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts furud.codfw.wmnet
[09:31:53] SRE, ops-codfw, decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `furud.codfw.wmnet` - furud.codfw.wmnet (**FAIL**) - Downtimed host on Icinga/Alertmanager - Found physi...
[09:34:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org [09:36:50] 10SRE, 10ops-codfw, 10decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:38:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org [09:43:02] (03CR) 10Elukey: ml-services: deployment settings for the recommendation-api-ng (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [09:46:37] !log restart fifo-log-demux@notpurge.service in cp4052 [09:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:19] (03CR) 10FNegri: "PCC is failing because something is requiring "<= bullseye", I don't think is this file but I'm not finding where that requirement is comi" [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [09:47:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:49:06] (03PS5) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [09:50:14] (03CR) 10FNegri: "https://puppet-compiler.wmflabs.org/output/955841/43183/" [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [09:51:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All 
- https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:51:33] (03PS6) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [09:52:02] 10SRE-tools, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) [09:52:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:54:23] 10SRE-tools, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) [09:55:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-rw1001.wikimedia.org with OS bookworm [09:56:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:56:08] (03CR) 10FNegri: P:wmcs: unify toolsdb profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [10:00:52] (03PS1) 10Joal: Update refinery-job jar version for analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/955893 (https://phabricator.wikimedia.org/T344616) [10:03:07] (03CR) 10CI reject: [V: 04-1] Update refinery-job jar version for analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/955893 (https://phabricator.wikimedia.org/T344616) (owner: 10Joal) [10:03:23] (03PS1) 10Filippo Giunchedi: citoid: update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/955894 (https://phabricator.wikimedia.org/T320563) [10:03:27] (03PS1) 
10Filippo Giunchedi: citoid: enable mesh tracing in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/955895 (https://phabricator.wikimedia.org/T320563) [10:04:10] 10SRE, 10ops-codfw, 10decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Papaul [10:05:07] (03PS2) 10Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) [10:05:43] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [10:05:44] (03CR) 10Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [10:06:51] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-rw1001.wikimedia.org with reason: host reimage [10:07:00] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:10:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-rw1001.wikimedia.org with reason: host reimage [10:13:20] (03PS3) 10FNegri: Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [10:15:32] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43184/console" [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [10:16:31] (03PS2) 10Joal: Update refinery-job jar version for analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/955893 (https://phabricator.wikimedia.org/T344616) [10:17:07] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (NOOP 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43185/console" [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [10:17:32] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Ensure standalone puppet works with puppet7 - https://phabricator.wikimedia.org/T345702 (10jbond) 05Open→03Resolved a:03jbond This is working currently [10:17:35] (03CR) 10FNegri: [V: 03+1] "Found the issue: ceph::common was enforcing" [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [10:17:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [10:17:43] (03CR) 10Majavah: [C: 04-1] "The Galera change LGTM. Not sure about the Ceph one, which a) preferrably would be a separate patch and b) might have issues with the newe" [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [10:24:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-rw1001.wikimedia.org with OS bookworm [10:24:34] (03CR) 10FNegri: [V: 03+1] Galera: allow installing debian-hosted packages for Bookworm or later (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [10:26:40] (03CR) 10Slyngshede: [C: 03+2] P:IDM Enable LDAP validators for usernames. 
[puppet] - 10https://gerrit.wikimedia.org/r/955884 (owner: 10Slyngshede) [10:27:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-rw2001.wikimedia.org with OS bookworm [10:28:09] (03PS2) 10Hashar: update_version: tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - 10https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) [10:28:11] (03PS1) 10Hashar: update_version: drop python 3.5, 3.6. Add 3.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/955901 [10:30:23] (03PS4) 10FNegri: Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [10:30:25] (03PS1) 10FNegri: ceph::common: allow bookworm and later versions [puppet] - 10https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) [10:31:23] (03CR) 10Hashar: "That is for upgrading tox to version 4 :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar) [10:31:50] (03CR) 10Elukey: ml-services: deployment settings for the recommendation-api-ng (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [10:32:44] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:33:00] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:35:02] (03CR) 10Jbond: [C: 03+1] "thanks <3. 
the fact that `sudo puppet config --section server print reportdir ` dosen't show this dir is mildly frustrating" [puppet] - 10https://gerrit.wikimedia.org/r/955890 (owner: 10Majavah) [10:35:23] (03CR) 10Majavah: [C: 03+2] P:puppetserver: fix reports location [puppet] - 10https://gerrit.wikimedia.org/r/955890 (owner: 10Majavah) [10:36:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/955865 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:38:16] (03CR) 10Jbond: [C: 03+1] "lgtm Q inline" [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:39:28] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10jbond) @colewhite thanks [10:41:50] RECOVERY - Check systemd state on puppetserver2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:10] (03CR) 10Muehlenhoff: Use a single ensure for managing the nftables state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:46:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-rw2001.wikimedia.org with reason: host reimage [10:49:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-rw2001.wikimedia.org with reason: host reimage [10:50:16] 
(MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:53:20] RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:09] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) [10:55:29] (03PS1) 10Muehlenhoff: Add new LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/955904 (https://phabricator.wikimedia.org/T331699) [11:01:07] (03CR) 10Muehlenhoff: [C: 03+2] Add new LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/955904 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [11:03:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52333 and previous config saved to /var/cache/conftool/dbconfig/20230908-110331-arnaudb.json [11:03:35] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:04:58] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:05:24] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:06:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-rw2001.wikimedia.org with OS bookworm [11:07:07] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:07:24] !log hnowlan@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/rest-gateway: apply [11:09:21] (03CR) 10Kamila Součková: [C: 03+1] services: more changes in Lift Wing's proxy setting in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955754 (https://phabricator.wikimedia.org/T345850) (owner: 10Ilias Sarantopoulos) [11:10:54] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10acooper) 05Stalled→03Open [11:11:47] (03PS3) 10Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) [11:12:58] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10acooper) Thanks I added the SSH key. I'll ask Mark to approve. [11:13:12] (03CR) 10Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [11:14:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [11:14:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [11:16:33] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10mark) Approved. 
[11:17:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica1005.wikimedia.org [11:17:55] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:18:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P52334 and previous config saved to /var/cache/conftool/dbconfig/20230908-111838-arnaudb.json [11:19:31] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) [11:20:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1005.wikimedia.org - jmm@cumin2002" [11:21:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1005.wikimedia.org - jmm@cumin2002" [11:21:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:21:01] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica1005.wikimedia.org on all recursors [11:21:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica1005.wikimedia.org on all recursors [11:21:20] (03CR) 10Jbond: [C: 03+2] puppetmaster::servers: remove puppetservers from puppetmaster::servers hash [puppet] - 10https://gerrit.wikimedia.org/r/955731 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:21:31] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica1005.wikimedia.org - jmm@cumin2002" [11:22:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
ldap-replica1005.wikimedia.org - jmm@cumin2002" [11:23:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-replica1005.wikimedia.org with OS bookworm [11:24:03] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica1005.wikimedia.org with OS bookworm [11:33:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P52335 and previous config saved to /var/cache/conftool/dbconfig/20230908-113344-arnaudb.json [11:34:50] (03PS1) 10Jbond: puppetserver: fix ssl permissions [puppet] - 10https://gerrit.wikimedia.org/r/955908 [11:36:42] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-replica1005.wikimedia.org with reason: host reimage [11:38:38] (03PS1) 10Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - 10https://gerrit.wikimedia.org/r/955909 [11:39:27] (03PS2) 10Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - 10https://gerrit.wikimedia.org/r/955909 [11:39:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-replica1005.wikimedia.org with reason: host reimage [11:42:14] (03CR) 10Jbond: [C: 03+2] puppetserver: fix ssl permissions [puppet] - 10https://gerrit.wikimedia.org/r/955908 (owner: 10Jbond) [11:42:20] (03PS1) 10Jbond: puppetmaster: add pupetserveres back to git private [puppet] - 10https://gerrit.wikimedia.org/r/955911 (https://phabricator.wikimedia.org/T330490) [11:45:03] (03PS1) 10Hnowlan: rest-gateway: set SNI when using ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400) [11:45:28] (03PS1) 10Jbond: puppetserver: correct ssl dir permissions [puppet] - 
10https://gerrit.wikimedia.org/r/955914 [11:45:41] (03CR) 10Jbond: [C: 03+2] puppetserver: correct ssl dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/955914 (owner: 10Jbond) [11:45:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetserver: correct ssl dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/955914 (owner: 10Jbond) [11:46:34] (03PS2) 10Hnowlan: trafficserver: route requests for geo-analytics via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/954890 (https://phabricator.wikimedia.org/T336400) [11:48:36] (03PS2) 10Jbond: puppetmaster: add pupetserveres back to git private [puppet] - 10https://gerrit.wikimedia.org/r/955911 (https://phabricator.wikimedia.org/T330490) [11:48:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52336 and previous config saved to /var/cache/conftool/dbconfig/20230908-114850-arnaudb.json [11:48:53] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:48:56] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:49:06] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:49:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T343198)', diff saved to https://phabricator.wikimedia.org/P52337 and previous config saved to /var/cache/conftool/dbconfig/20230908-114911-arnaudb.json [11:53:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-replica1005.wikimedia.org with OS bookworm [11:53:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica1005.wikimedia.org [11:54:04] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - 
https://phabricator.wikimedia.org/T331699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica1005.wikimedia.org with OS bookworm completed: - ldap-rep... [11:54:54] (03CR) 10Jbond: [C: 03+2] puppetmaster: add pupetserveres back to git private [puppet] - 10https://gerrit.wikimedia.org/r/955911 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:55:17] (03PS1) 10Filippo Giunchedi: nagios: emit warnings from check_dsh_groups [puppet] - 10https://gerrit.wikimedia.org/r/955915 (https://phabricator.wikimedia.org/T314118) [11:55:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica1006.wikimedia.org [11:55:26] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:58:01] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002" [11:58:25] (03PS1) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/955916 [11:59:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002" [11:59:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:59:28] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica1006.wikimedia.org on all recursors [11:59:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica1006.wikimedia.org on all recursors [11:59:40] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:01:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955916 (owner: 10Muehlenhoff) [12:04:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records 
for VM ldap-replica1006.wikimedia.org - jmm@cumin2002" [12:05:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002" [12:05:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:05:41] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica1006.wikimedia.org on all recursors [12:05:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica1006.wikimedia.org on all recursors [12:05:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ldap-replica1006.wikimedia.org [12:14:34] 10Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10fgiunchedi) [12:14:48] 10Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10fgiunchedi) [12:16:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica1006.wikimedia.org [12:16:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:17:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:18:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ldap-replica1006.wikimedia.org [12:18:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica1006.wikimedia.org [12:18:15] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:18:59] (03CR) 10Elukey: [C: 03+1] "Great work!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/955886 (https://phabricator.wikimedia.org/T344058) (owner: 10AikoChou) [12:20:39] (03CR) 10Jbond: [C: 03+1] "change seems fine to me but I'm not the one to make the call on if its the right policy decision" [puppet] - 10https://gerrit.wikimedia.org/r/955915 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [12:20:45] (03PS5) 10TTO: Enable PageNotice on enwiktionary beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) [12:21:01] (03CR) 10TTO: "Thanks for the comments, @Urbanecm!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [12:21:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/955909 (owner: 10Muehlenhoff) [12:21:32] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you John, I'm ok with a sanity check only!" [puppet] - 10https://gerrit.wikimedia.org/r/955915 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [12:22:34] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002" [12:23:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica1006.wikimedia.org - jmm@cumin2002" [12:23:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:23:23] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica1006.wikimedia.org on all recursors [12:23:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica1006.wikimedia.org on all recursors [12:23:51] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: 
created new VM ldap-replica1006.wikimedia.org - jmm@cumin2002" [12:23:52] (03CR) 10JMeybohm: [C: 03+1] rest-gateway: set SNI when using ingress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [12:24:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica1006.wikimedia.org - jmm@cumin2002" [12:26:23] (03CR) 10Elukey: ml-services: deployment settings for the recommendation-api-ng (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [12:27:51] (03CR) 10Elukey: ml-services: deployment settings for the recommendation-api-ng (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [12:27:53] (03CR) 10Ayounsi: Junos: Add more info on commit errors (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [12:29:12] (03PS1) 10Muehlenhoff: Fix cloudbackup alias [puppet] - 10https://gerrit.wikimedia.org/r/955923 [12:29:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-replica1006.wikimedia.org with OS bookworm [12:29:31] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add SLO definition for the ORES Legacy service [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [12:29:50] (03CR) 10Elukey: [V: 03+2 C: 03+2] slo_definitions: fix indentation and add missing descr for Lift Wing [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955769 (owner: 10Elukey) [12:29:58] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica1006.wikimedia.org with OS bookworm [12:31:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica1005.wikimedia.org [12:31:50] (03PS1) 10Brouberol: Grant permissions on icinga to user Brouberol [puppet] - 10https://gerrit.wikimedia.org/r/955924 [12:32:46] (03CR) 10Ilias Sarantopoulos: [C: 03+2] services: more changes in Lift Wing's proxy setting in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955754 (https://phabricator.wikimedia.org/T345850) (owner: 10Ilias Sarantopoulos) [12:32:51] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/955916 (owner: 10Muehlenhoff) [12:33:36] (03Merged) 10jenkins-bot: services: more changes in Lift Wing's proxy setting in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955754 (https://phabricator.wikimedia.org/T345850) (owner: 10Ilias Sarantopoulos) [12:35:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica1005.wikimedia.org [12:36:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-replica1006.wikimedia.org with reason: host reimage [12:39:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-replica1006.wikimedia.org with reason: host reimage [12:40:15] jouncebot: next [12:40:15] In 18 hour(s) and 19 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230909T0700) [12:43:39] (03PS2) 10Hnowlan: rest-gateway: set SNI when using ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400) [12:45:27] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: set SNI when using ingress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [12:45:44] (03CR) 10JMeybohm: [C: 04-1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson) [12:46:18] (03Merged) 10jenkins-bot: rest-gateway: set SNI when using ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/955912 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [12:48:40] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10BBlack) > to reduce load on LVS hosts My recollection is that it wasn't really about raw load or PPS at the LVSes. It was that our Linux kernel settings ha... 
[12:48:48] (03PS1) 10Muehlenhoff: profile::piwik::database: Enforce type for port [puppet] - 10https://gerrit.wikimedia.org/r/955927 [12:49:17] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:18] 10Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10jbond) 05Open→03In progress p:05Triage→03Medium [12:49:37] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:49:38] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10jbond) [12:49:50] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:50:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955927 (owner: 10Muehlenhoff) [12:50:05] RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:50:48] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:51:11] (03PS1) 10Jbond: puppetdb: migrate check to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909) [12:51:13] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:51:33] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:51:40] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:51:49] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:52:07] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:52:38] 10SRE, 
10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10BBlack) The current puppetized tuneables are at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/8ed59718c7a7603b61d7d42e05726fd11dae5eaa/... [12:53:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-replica1006.wikimedia.org with OS bookworm [12:53:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica1006.wikimedia.org [12:53:29] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica1006.wikimedia.org with OS bookworm completed: - ldap-rep... [12:53:31] (03CR) 10CI reject: [V: 04-1] puppetdb: migrate check to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond) [12:56:04] (03PS4) 10Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) [12:57:34] (03CR) 10Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [12:57:41] PROBLEM - puppet last run on kubestagemaster2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:57:51] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? 
- https://phabricator.wikimedia.org/T345809 (10BBlack) Reading into the code above and the history more and self-correcting: the ratelimiter doesn't apply to PTB packets, just some other informational pac... [12:59:59] !log isaranto@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [13:00:11] (03PS1) 10Hnowlan: rest-gateway: route requests to media-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/955929 (https://phabricator.wikimedia.org/T336396) [13:00:11] !log isaranto@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [13:01:31] !log isaranto@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [13:01:51] !log isaranto@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [13:03:43] (03PS1) 10Btullis: Add snapshot101[4-7] to the dsh group [puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) [13:04:39] (03CR) 10Elukey: [C: 03+1] ml-services: deployment settings for the recommendation-api-ng (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [13:05:07] !log isaranto@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [13:05:21] !log isaranto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [13:05:24] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: 10Btullis) [13:05:48] (03CR) 10Filippo Giunchedi: [C: 03+1] Grant permissions on icinga to user Brouberol [puppet] - 10https://gerrit.wikimedia.org/r/955924 (owner: 10Brouberol) [13:06:22] (03CR) 10JMeybohm: [C: 03+1] citoid: update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/955894 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [13:06:31] (03CR) 10JMeybohm: [C: 03+1] citoid: enable mesh tracing in staging 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/955895 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [13:08:17] (03PS2) 10Btullis: Add snapshot101[4-7] to the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) [13:11:12] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Vgutierrez) [13:11:26] (03PS5) 10Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) [13:12:33] (03CR) 10Kevin Bazira: ml-services: deployment settings for the recommendation-api-ng (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [13:12:35] (03PS1) 10Ilias Sarantopoulos: fix: enwiktionary in API-Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) [13:15:33] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Vgutierrez) thanks! @acooper RSA keys are being deprecated in some parts of our infrastructure already (T336769), so I'm wondering if you could pro... 
[13:15:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:18:07] (03PS1) 10FNegri: [cluster::cloud_management] Don't install prod cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) [13:18:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Add snapshot101[4-7] to the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: 10Btullis) [13:18:39] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) (owner: 10FNegri) [13:19:53] RECOVERY - puppet last run on kubestagemaster2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:20:41] (03CR) 10Elukey: [C: 03+1] "Added a nit for the commit msg, the rest looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) (owner: 10Ilias Sarantopoulos) [13:20:45] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: route requests to media-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/955929 (https://phabricator.wikimedia.org/T336396) (owner: 10Hnowlan) [13:20:58] (03PS2) 10Jbond: puppetdb: migrate check to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909) [13:21:00] (03PS1) 10Jbond: check_puppet_run_changes: update to run on puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955939 [13:21:43] (03Merged) 10jenkins-bot: rest-gateway: route requests to media-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/955929 (https://phabricator.wikimedia.org/T336396) (owner: 10Hnowlan) [13:21:45] (03CR) 10AikoChou: [C: 03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955886 (https://phabricator.wikimedia.org/T344058) (owner: 10AikoChou) [13:21:49] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the reviews :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [13:22:15] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43189/console" [puppet] - 10https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) (owner: 10FNegri) [13:23:20] (03Merged) 10jenkins-bot: ml-services: add annotations for inference_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/955886 (https://phabricator.wikimedia.org/T344058) (owner: 10AikoChou) [13:23:23] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [13:23:35] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10acooper) I followed 
these instructions already which requested rsa type (maybe worth updating the instructions if ed25519 is preferred now?) https:... [13:23:37] (03Merged) 10jenkins-bot: ml-services: deployment settings for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/955018 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [13:23:47] (03CR) 10CI reject: [V: 04-1] puppetdb: migrate check to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond) [13:23:55] (03CR) 10CI reject: [V: 04-1] check_puppet_run_changes: update to run on puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955939 (owner: 10Jbond) [13:23:55] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:13] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:24:29] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:25:11] (03PS1) 10Vgutierrez: admin: Grant shell access to acooper [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) [13:25:58] (03CR) 10ArielGlenn: [C: 03+1] "Thanks, these are marked as spares but ought to get the updates since they won't be spare for long." 
[puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: 10Btullis) [13:28:47] (03PS2) 10Ilias Sarantopoulos: services: update Lift Wing's config in the API-Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) [13:29:06] (03PS3) 10Ilias Sarantopoulos: services: update Lift Wing's config in the API-Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) [13:29:26] (03CR) 10Ilias Sarantopoulos: services: update Lift Wing's config in the API-Gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) (owner: 10Ilias Sarantopoulos) [13:29:47] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10Papaul) [13:30:08] (03CR) 10Ilias Sarantopoulos: [C: 03+2] services: update Lift Wing's config in the API-Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) (owner: 10Ilias Sarantopoulos) [13:30:36] (03Abandoned) 10Bking: sshd_config: disable ssh-rsa public key signature algorithm [puppet] - 10https://gerrit.wikimedia.org/r/834340 (https://phabricator.wikimedia.org/T318345) (owner: 10Bking) [13:30:55] (03Merged) 10jenkins-bot: services: update Lift Wing's config in the API-Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955933 (https://phabricator.wikimedia.org/T345850) (owner: 10Ilias Sarantopoulos) [13:32:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Vgutierrez) >>! In T345877#9152356, @acooper wrote: > I followed these instructions already which requested rsa type (maybe w... 
[13:34:13] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:34:51] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:35:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) @papaul I've done some testing and I'm confident the IP GW moves for the row subnets to the Spines can be done gracefully. I've yet to wo... [13:37:48] !log isaranto@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [13:37:56] !log isaranto@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [13:38:47] (03PS3) 10Amire80: Add lucaswerkmeister.de to Planet [puppet] - 10https://gerrit.wikimedia.org/r/948203 [13:38:48] !log isaranto@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [13:39:08] !log isaranto@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [13:39:38] !log isaranto@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [13:39:54] !log isaranto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [13:43:22] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10jbond) This did get broken with the migration to the new puppetdbs as we migrated cumin to use... [13:44:15] (03PS1) 10Amire80: Add Wikimedia Deutschland's tech news blog [puppet] - 10https://gerrit.wikimedia.org/r/955941 [13:44:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) @cmooney thanks for the update. 
I think we can reuse those the MPO [13:53:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Vgutierrez) just to be the clear the RSA key is totally valid at this point, I just wanted to save @acooper more "pain" furth... [13:54:32] (03CR) 10Vgutierrez: [C: 04-2] "blocked till we get all the required approvals" [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: 10Vgutierrez) [13:55:41] (03PS1) 10Ssingh: 27.35.198.in-addr.arpa: update PTR for 198.35.27.27 [dns] - 10https://gerrit.wikimedia.org/r/955943 (https://phabricator.wikimedia.org/T329219) [13:56:12] (03CR) 10Andrew Bogott: "bookworm will install ceph-common version 16.2.11+ds-2. On Bullseye we're running 15.2.16-1" [puppet] - 10https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [13:58:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T343198)', diff saved to https://phabricator.wikimedia.org/P52340 and previous config saved to /var/cache/conftool/dbconfig/20230908-135803-arnaudb.json [13:58:07] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [13:58:26] (03CR) 10Ayounsi: [C: 03+1] Add static network defs and DHCP config for new codfw subnets [puppet] - 10https://gerrit.wikimedia.org/r/954896 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [13:59:43] (03CR) 10Ayounsi: [C: 03+1] Homer YAML additions for new row A/B switches in Codfw [homer/public] - 10https://gerrit.wikimedia.org/r/954697 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [14:06:31] (03CR) 10Andrew Bogott: [C: 03+1] "David is not 100% sure that this will be backwards-compatible, but let's find out!" 
[puppet] - 10https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [14:07:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:43] (03PS1) 10Jbond: puppet-agent: create a alertmanager check for changing puppet runs [alerts] - 10https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) [14:13:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P52341 and previous config saved to /var/cache/conftool/dbconfig/20230908-141309-arnaudb.json [14:17:13] (03CR) 10Vgutierrez: [C: 04-2] "(SSH key verified OOB via Slack)" [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: 10Vgutierrez) [14:17:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:00] (03PS1) 10Jbond: P:cumin::master: drop puppet constant change check [puppet] - 10https://gerrit.wikimedia.org/r/955949 (https://phabricator.wikimedia.org/T345909) [14:20:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline for criticality, not a blocker" [alerts] - 10https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond) [14:20:30] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Jclark-ctr) Replaced optic and cable again @cmooney @Eevans [14:20:55] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:43] (03PS2) 10Jbond: puppet-agent: create a alertmanager check for changing puppet runs [alerts] - 10https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) [14:21:50] (03CR) 10Jbond: puppet-agent: create a alertmanager check for changing puppet runs (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond) [14:21:57] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10Jclark-ctr) We are monitoring this error it has been 12 days with no faults [14:24:13] (03CR) 10Jbond: [C: 03+2] puppet-agent: create a alertmanager check for changing puppet runs [alerts] - 10https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond) [14:24:49] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:25:26] (03Merged) 10jenkins-bot: puppet-agent: create a alertmanager check for changing puppet runs [alerts] - 10https://gerrit.wikimedia.org/r/955945 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond) [14:25:38] (03CR) 10Jbond: [C: 03+2] P:cumin::master: drop puppet constant change check [puppet] - 10https://gerrit.wikimedia.org/r/955949 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond) [14:26:11] (03CR) 10Bking: "Adding ServiceOps teammates since this is related to 955032 ." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/955033 (owner: 10Ebernhardson) [14:26:46] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt moss-be1003 - jclark@cumin1001" [14:27:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt moss-be1003 - jclark@cumin1001" [14:27:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:27:51] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host moss-be1003 [14:28:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host moss-be1003 [14:28:15] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host moss-be1003.mgmt.eqiad.wmnet with reboot policy FORCED [14:28:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P52342 and previous config saved to /var/cache/conftool/dbconfig/20230908-142815-arnaudb.json [14:29:07] RECOVERY - Host mw2444 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms [14:29:08] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) [14:33:27] 10SRE, 10ops-codfw, 10serviceops-radar: mw2444 down - https://phabricator.wikimedia.org/T345884 (10Jhancock.wm) p:05Triage→03Medium a:03Jhancock.wm error found in the lifecycle log `CPU 2 machine check error detected.` I powered down the server and drained the flea power. waited 5 minutes. server is ba... 
[14:39:37] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:41:39] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt stat1011 - jclark@cumin1001" [14:42:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt stat1011 - jclark@cumin1001" [14:42:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:42:37] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host stat1011 [14:42:52] (03PS3) 10Bking: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson) [14:43:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T343198)', diff saved to https://phabricator.wikimedia.org/P52343 and previous config saved to /var/cache/conftool/dbconfig/20230908-144321-arnaudb.json [14:43:27] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [14:43:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host stat1011 [14:43:55] (03CR) 10CI reject: [V: 04-1] Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson) [14:44:24] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host stat1011.mgmt.eqiad.wmnet with reboot policy FORCED [14:45:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:46:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: 
Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) [14:48:18] (03PS14) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [14:48:30] (03CR) 10Bking: "I did the easy stuff (I think), skillfully avoiding Rakefile ;)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson) [14:50:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:50:49] (03PS1) 10Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620) [14:54:51] (03PS2) 10Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620) [14:55:54] (03CR) 10Stevemunene: datahub: add oidc production settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [14:56:08] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [14:58:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:58:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:01:17] (03PS3) 10Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620) [15:02:46] (03PS2) 10Ssingh: 
27.35.198.in-addr.arpa: update PTR for 198.35.27.27 [dns] - 10https://gerrit.wikimedia.org/r/955943 (https://phabricator.wikimedia.org/T329219) [15:07:51] (03PS4) 10Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620) [15:08:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host stat1011.mgmt.eqiad.wmnet with reboot policy FORCED [15:11:00] (03CR) 10Brouberol: [C: 03+2] Grant permissions on icinga to user Brouberol [puppet] - 10https://gerrit.wikimedia.org/r/955924 (owner: 10Brouberol) [15:11:44] ^ misclick. I removed my vote [15:13:36] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1011.eqiad.wmne'] [15:13:58] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:15:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host moss-be1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:16:00] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be1003.eqiad.wmnet'] [15:16:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['moss-be1003.eqiad.wmnet'] [15:16:41] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) [15:17:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) [15:19:20] (03CR) 10BBlack: [C: 03+1] 27.35.198.in-addr.arpa: update PTR for 198.35.27.27 [dns] - 10https://gerrit.wikimedia.org/r/955943 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:19:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: 
Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10VRiley-WMF) kubernetes1042 - D 8. U 31. port 21 CableID 1899 kubernetes1043 - D 8. U 32. port 19 CableID 1902 kubernetes1044 - D 8. U 33. port 34 CableID PUR-0023000004 kuber... [15:21:09] (03PS1) 10Ssingh: hiera: remove references to nsa.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) [15:22:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['stat1011.eqiad.wmne'] [15:26:38] (03CR) 10Ssingh: "We should deploy this on Monday as renaming the configuration checks has proven to be a bit tricky historically." [puppet] - 10https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:27:32] (03CR) 10Ssingh: [C: 03+2] 27.35.198.in-addr.arpa: update PTR for 198.35.27.27 [dns] - 10https://gerrit.wikimedia.org/r/955943 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:27:41] !log running authdns-update for CR 955943 [15:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:15] (03PS1) 10Ssingh: wikimedia.org: remove nsa.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/955962 (https://phabricator.wikimedia.org/T329219) [15:34:36] (03CR) 10Ssingh: "On the other hand, this is mostly a NOOP change other than the anycast-hc side of things, so I am fine with merging it today and finishing" [puppet] - 10https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:37:09] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [15:37:15] (03CR) 10Btullis: datahub: add oidc production settings (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [15:37:16] 10SRE, 10Cassandra, 
10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye [15:39:52] (03CR) 10David Caro: [C: 03+1] ceph::common: allow bookworm and later versions [puppet] - 10https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [15:42:33] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:44:01] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye [15:44:08] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -... [15:44:36] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kubernetes1027 - jclark@cumin1001" [15:45:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kubernetes1027 - jclark@cumin1001" [15:45:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:57] (03CR) 10Majavah: [C: 03+1] Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [15:49:18] (03CR) 10FNegri: [C: 03+2] ceph::common: allow bookworm and later versions [puppet] - 10https://gerrit.wikimedia.org/r/955902 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [15:51:35] !log jclark@cumin1001 START - Cookbook 
sre.network.configure-switch-interfaces for host kubernetes1027 [15:51:56] (03CR) 10BBlack: [C: 03+1] hiera: remove references to nsa.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:52:36] (03CR) 10FNegri: [C: 03+2] Galera: allow installing debian-hosted packages for Bookworm or later [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [15:52:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [15:52:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1027 [15:53:38] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1027.mgmt.eqiad.wmnet with reboot policy FORCED [15:53:52] (03PS4) 10Andrea Denisse: superset: Move superset logs to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) [15:54:43] (03PS1) 10Cparle: Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) [15:54:53] (03PS5) 10Andrea Denisse: superset: Move superset metrics to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) [15:55:04] (03CR) 10Cparle: [C: 04-2] Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: 10Cparle) [15:55:54] (03PS15) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [15:56:07] (03CR) 10Andrea Denisse: superset: Move superset metrics to statsd-exporter (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/955776 
(https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse) [15:57:36] (03CR) 10Stevemunene: datahub: add oidc production settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [15:59:05] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/955776/43190/" [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse) [16:01:49] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:15] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:09] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:05:55] hi folks [16:06:13] was someone working on anything related to the DNS updates in netbox? 
[16:06:35] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 228, in _collect_device [16:06:38] if self.addresses[primary.id].dns_name: [16:06:41] KeyError: 14575 [16:10:32] (03PS1) 10Bking: dse-k8s-services: init rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/955969 (https://phabricator.wikimedia.org/T344614) [16:12:18] (03CR) 10DCausse: [C: 03+1] dse-k8s-services: init rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/955969 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [16:12:21] (03CR) 10Bking: [C: 03+2] dse-k8s-services: init rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/955969 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [16:13:08] (03Merged) 10jenkins-bot: dse-k8s-services: init rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/955969 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [16:16:58] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:18:25] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:20:31] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) opened service request with Dell: 175561524 [16:27:34] (03PS6) 10Urbanecm: Enable PageNotice on enwiktionary beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [16:28:58] (03CR) 10Urbanecm: [C: 03+1] Enable PageNotice on enwiktionary beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [16:33:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10VRiley-WMF) kubernetes1032 - C 6. U 14. 
port 14 CableID 3220 kubernetes1033 - C 6. U 15. port 17 CableID 3223 kubernetes1034 - C 6. U 16. port 13 CableID 3219 kubernetes1035... [16:37:44] (03PS1) 10Andrew Bogott: wmf_sink: don't assume project_name == project_id [puppet] - 10https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) [16:38:08] (03CR) 10CI reject: [V: 04-1] wmf_sink: don't assume project_name == project_id [puppet] - 10https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [16:39:09] (03PS1) 10Andrew Bogott: designatemakedomain: don't install for python2 [puppet] - 10https://gerrit.wikimedia.org/r/955974 [16:39:37] (03CR) 10FNegri: [C: 03+1] designatemakedomain: don't install for python2 [puppet] - 10https://gerrit.wikimedia.org/r/955974 (owner: 10Andrew Bogott) [16:40:14] (03PS1) 10Andrew Bogott: wmfkeystonehooks: don't install python2 versions [puppet] - 10https://gerrit.wikimedia.org/r/955975 [16:40:57] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) >>! In T344259#9152542, @Jclark-ctr wrote: > Replaced optic and cable again @cmooney @Eevans Thanks @Jclark-ctr. Unfortunately it didn't work. :( @cmooney,... 
[16:41:31] (03PS1) 10Andrew Bogott: designate-sink: don't install python2 versions of our sink plugins [puppet] - 10https://gerrit.wikimedia.org/r/955976 [16:41:52] (03CR) 10Andrew Bogott: [C: 03+2] designatemakedomain: don't install for python2 [puppet] - 10https://gerrit.wikimedia.org/r/955974 (owner: 10Andrew Bogott) [16:42:21] (03CR) 10FNegri: [C: 03+1] wmfkeystonehooks: don't install python2 versions [puppet] - 10https://gerrit.wikimedia.org/r/955975 (owner: 10Andrew Bogott) [16:42:30] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: don't install python2 versions [puppet] - 10https://gerrit.wikimedia.org/r/955975 (owner: 10Andrew Bogott) [16:43:54] (03CR) 10FNegri: [C: 03+1] designate-sink: don't install python2 versions of our sink plugins [puppet] - 10https://gerrit.wikimedia.org/r/955976 (owner: 10Andrew Bogott) [16:44:03] (03CR) 10Andrew Bogott: [C: 03+2] designate-sink: don't install python2 versions of our sink plugins [puppet] - 10https://gerrit.wikimedia.org/r/955976 (owner: 10Andrew Bogott) [16:44:50] 10SRE, 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Jhancock.wm) I think this is fixed. I am seeing four disks in the idrac and bios. Can someone confirm? 
[16:47:29] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:13:16] !log reprepro copy bookworm-wikimedia bullseye-wikimedia prometheus-memcached-exporter # T345810 [17:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:21] T345810: [openstack] Upgrade codfw hosts to bookworm - https://phabricator.wikimedia.org/T345810 [17:15:05] (03PS1) 10Majavah: openstack: use stock mariadb on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/955977 [17:15:20] (03PS2) 10Andrew Bogott: wmf_sink: don't assume project_name == project_id [puppet] - 10https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) [17:15:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:19:51] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:20:30] (03CR) 10Andrew Bogott: [C: 03+2] openstack: use stock mariadb on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/955977 (owner: 10Majavah) [17:20:46] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:24:58] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm [17:44:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:49:16] 
(MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:49:55] hm, that was a bit of an ugly spike [17:51:29] But I guess no worse than some yesterday [17:54:47] > This alert means something is currently very wrong [17:55:06] Seems this alert fires frequently enough that perhaps that isn't the case? [18:09:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:14:09] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:14:17] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:19:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:25:39] (03PS1) 10Kimberly Sarabia: Remove zebra from beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955980 (https://phabricator.wikimedia.org/T333180) [18:28:01] (03CR) 10Jdlrobson: [C: 03+1] Remove zebra from beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955980 
(https://phabricator.wikimedia.org/T333180) (owner: 10Kimberly Sarabia) [18:33:19] hello. is there anyone around today to deploy a beta cluster only patch? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/955980 [18:45:38] kimberly_sarabia: is there any urgency? [18:46:08] no urgency, it can wait [18:46:41] kimberly_sarabia: it might be worth posting in #wikimedia-releng but it is Friday evening / afternoon [18:49:05] no worries. ill schedule it for mon [19:12:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:17:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:30:02] (03PS2) 10Milimetric: Map Jade content handler to UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) [19:57:12] (03PS4) 10Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 [19:57:45] (03CR) 10CI reject: [V: 04-1] Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson) [20:06:34] (03PS5) 10Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 [20:07:08] (03CR) 10CI reject: [V: 04-1] Add a networkpolicy template for zookeeper [deployment-charts] - 
10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson) [20:13:03] (03PS6) 10Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 [20:14:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:19:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:24:42] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [20:26:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1027.mgmt.eqiad.wmnet with reboot policy FORCED [20:27:26] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kubernetes102 - jclark@cumin1001" [20:28:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kubernetes102 - jclark@cumin1001" [20:28:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:40:43] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:35] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:13] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [20:52:30] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kubernetes102 - jclark@cumin1001" [20:53:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kubernetes102 - jclark@cumin1001" [20:53:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:56:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [20:56:57] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:57:00] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1029 [20:57:05] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1030 [20:57:07] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1028 [20:58:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1029 [20:58:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1028 [20:58:07] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1031 [20:58:08] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1032 [20:58:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1030 [20:58:38] !log 
jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1031 [20:58:42] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1033 [20:58:46] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1034 [20:59:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1034 [20:59:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1033 [20:59:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1032 [20:59:29] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1035 [20:59:32] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1036 [20:59:38] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1037 [21:00:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1036 [21:00:31] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1038 [21:00:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1035 [21:00:49] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1039 [21:00:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1037 [21:01:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1038 [21:02:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host kubernetes1039 [21:02:07] !log 
jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1037 [21:02:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1037 [21:02:15] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1039 [21:03:29] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1038 [21:03:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1039 [21:03:55] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1040 [21:04:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1040 [21:04:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1038 [21:04:51] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1042 [21:05:49] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1043 [21:06:18] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1041 [21:06:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1042 [21:06:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1043 [21:07:11] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1044 [21:07:14] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1045 [21:07:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1041 [21:08:04] !log jclark@cumin1001 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1045 [21:08:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1044 [21:08:26] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [21:08:39] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [21:08:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52345 and previous config saved to /var/cache/conftool/dbconfig/20230908-210844-arnaudb.json [21:08:48] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:09:02] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1046 [21:09:05] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1047 [21:09:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1047 [21:09:10] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1048 [21:09:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1048 [21:09:23] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1047 [21:09:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1047 [21:09:31] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1048 [21:09:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1048 [21:09:34] !log jclark@cumin1001 START - Cookbook 
sre.network.configure-switch-interfaces for host kubernetes1049 [21:09:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1049 [21:09:44] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1050 [21:09:48] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1051 [21:09:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1050 [21:10:00] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1050 [21:10:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1050 [21:10:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1051 [21:10:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1046 [21:10:17] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1052 [21:10:23] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1053 [21:10:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1052 [21:10:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1053 [21:10:36] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1054 [21:10:40] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1055 [21:10:43] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1056 [21:10:48] !log jclark@cumin1001 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1054 [21:10:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1055 [21:10:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1056 [21:14:49] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1028.mgmt.eqiad.wmnet with reboot policy FORCED [21:14:53] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [21:14:55] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1030.mgmt.eqiad.wmnet with reboot policy FORCED [21:14:58] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1031.mgmt.eqiad.wmnet with reboot policy FORCED [21:14:59] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1032.mgmt.eqiad.wmnet with reboot policy FORCED [21:15:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1033.mgmt.eqiad.wmnet with reboot policy FORCED [21:15:03] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1034.mgmt.eqiad.wmnet with reboot policy FORCED [21:15:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1035.mgmt.eqiad.wmnet with reboot policy FORCED [21:15:07] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1036.mgmt.eqiad.wmnet with reboot policy FORCED [21:15:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:34:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1028.mgmt.eqiad.wmnet with reboot policy FORCED [21:34:44] !log jclark@cumin1001 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host kubernetes1030.mgmt.eqiad.wmnet with reboot policy FORCED [21:34:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1032.mgmt.eqiad.wmnet with reboot policy FORCED [21:34:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1033.mgmt.eqiad.wmnet with reboot policy FORCED [21:34:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1034.mgmt.eqiad.wmnet with reboot policy FORCED [21:34:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1036.mgmt.eqiad.wmnet with reboot policy FORCED [21:34:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1035.mgmt.eqiad.wmnet with reboot policy FORCED [21:40:03] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10Ahoelzl) [21:40:30] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [21:41:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10odimitrijevic) Approved! 
[21:44:44] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [22:36:07] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:36:24] looking [22:36:55] PROBLEM - PyBal backends health check on lvs3008 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb6_443: Servers cp3068.esams.wmnet, cp3070.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:37:07] (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:37:23] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1079.eqiad.wmnet, cp1085.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1081.eqiad.wmnet, cp1085.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmne [22:37:23] 7.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:37:33] PROBLEM - PyBal backends health check on lvs6003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled: 
textlb_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp601 [22:37:33] wmnet, cp6016.drmrs.wmnet are marked down but pooled: testlb6_443: Servers cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled: textlb6_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled https://wikit [22:37:34] media.org/wiki/PyBal [22:37:35] PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp6010.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled: textlb_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp601 [22:37:35] wmnet are marked down but pooled: testlb6_443: Servers cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled: textlb6_443: Servers cp6011.drmrs.wmnet, cp6010.drmrs.wmnet, cp6013.drmrs.wmnet, cp6012.drmrs.wmnet, cp6009.drmrs.wmnet, cp6015.drmrs.wmnet, cp6014.drmrs.wmnet, cp6016.drmrs.wmnet are marked down but pooled https://wikit [22:37:35] media.org/wiki/PyBal [22:38:21] RECOVERY - PyBal backends health check on lvs3008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:38:49] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:38:59] RECOVERY - PyBal backends health check on lvs6003 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [22:39:01] RECOVERY - PyBal backends health check on lvs6001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:39:40] (LVSHighRX) firing: Excessive RX traffic on lvs3008:9100 (eno12399np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [22:41:07] (ProbeDown) resolved: (10) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:41:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:42:07] (ProbeDown) resolved: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:44:41] (LVSHighRX) resolved: Excessive RX traffic on lvs3008:9100 (eno12399np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [22:46:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:10:01] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:15:01] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown