[01:35:01] !incidents
[01:35:02] You're not allowed to perform this action.
[01:38:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[01:51:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:57:58] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9997452 (Papaul) I did more testing today again - I downloand the lpxelinux file we have on apt.wikimedia and copy it to my tftp node - modify dhcpd...
[02:04:07] SRE, Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T370494 (Davidalexander529) NEW
[02:08:10] SRE, Maps: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T370494#9997465 (Bugreporter) Open→Invalid Incomplete request.
[02:27:38] (CR) Eileen: [C:+1] "I agree with this change" [puppet] - https://gerrit.wikimedia.org/r/1054952 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[02:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:53:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:50] (CR) Krinkle: [C:+1] MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy)
[03:01:28] (CR) Krinkle: [C:+1] "Haven't tested myself, but given now validation is specific to as-of-yet unused code paths, this is low risk. LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy)
[03:01:42] (CR) Dwisehaupt: "Thanks @emcnaughton@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1054952 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[03:18:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[03:37:50] !issync
[03:37:51] Syncing #wikimedia-operations (requested by legoktm)
[03:37:52] Set /cs flags #wikimedia-operations JJMC89 +Aiotv
[03:37:54] Set /cs flags #wikimedia-operations wmopbot -o
[03:38:01] oops?
[03:39:24] 23:39:03 [ChanServ] Flags +o were set on wmopbot in #wikimedia-operations.
[03:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[05:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:10:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[05:10:44] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ...
[05:10:47] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240719T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:23:33] (CR) Dzahn: [C:+1] ci: keep python2 packages on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1055147 (https://phabricator.wikimedia.org/T367544) (owner: Hashar)
[06:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240719T0700)
[07:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:05:33] !log elukey@cumin1002 START - Cookbook sre.hosts.dhcp for host sretest2001.codfw.wmnet
[08:08:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2438.mgmt.codfw.wmnet with reboot policy GRACEFUL
[08:08:27] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2438
[08:15:07] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[08:15:21] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:15:38] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[08:15:47] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:16:13] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[08:16:36] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:27:32] (CR) Filippo Giunchedi: [C:+1] Thanos: use new-style swift storage layout for forthcoming backends [puppet] - https://gerrit.wikimedia.org/r/1055255 (https://phabricator.wikimedia.org/T368445) (owner: MVernon)
[08:34:04] (PS1) Ilias Sarantopoulos: ml-services: test dummy predictor_host in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055384 (https://phabricator.wikimedia.org/T370408)
[08:34:12] (CR) CI reject: [V:-1] ml-services: test dummy predictor_host in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055384 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[08:35:33] (PS2) Ilias Sarantopoulos: ml-services: test dummy predictor_host in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055384 (https://phabricator.wikimedia.org/T370408)
[08:38:52] (CR) Elukey: admin: add dcops to the system adm POSIX group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[08:40:36] (PS1) Btullis: Add a PostgreSQL database for Growthbook [puppet] - https://gerrit.wikimedia.org/r/1055385 (https://phabricator.wikimedia.org/T365839)
[08:41:16] (PS3) Ilias Sarantopoulos: ml-services: test dummy predictor_host in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055384 (https://phabricator.wikimedia.org/T370408)
[08:44:38] (CR) Kevin Bazira: [C:+1] ml-services: test dummy predictor_host in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055384 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[08:51:35] (PS1) Btullis: Remove superfluous superset postgresql databases from an-db1001 [puppet] - https://gerrit.wikimedia.org/r/1055388 (https://phabricator.wikimedia.org/T347710)
[09:02:28] (PS1) Ilias Sarantopoulos: kserve-inference: add container config in transformers [deployment-charts] - https://gerrit.wikimedia.org/r/1055389
[09:03:16] (CR) Brouberol: [C:+1] Add a PostgreSQL database for Growthbook [puppet] - https://gerrit.wikimedia.org/r/1055385 (https://phabricator.wikimedia.org/T365839) (owner: Btullis)
[09:03:24] (CR) Btullis: [C:+2] Add a PostgreSQL database for Growthbook [puppet] - https://gerrit.wikimedia.org/r/1055385 (https://phabricator.wikimedia.org/T365839) (owner: Btullis)
[09:03:34] (CR) Brouberol: [C:+1] Remove superfluous superset postgresql databases from an-db1001 [puppet] - https://gerrit.wikimedia.org/r/1055388 (https://phabricator.wikimedia.org/T347710) (owner: Btullis)
[09:03:43] (CR) Btullis: [C:+2] Remove superfluous superset postgresql databases from an-db1001 [puppet] - https://gerrit.wikimedia.org/r/1055388 (https://phabricator.wikimedia.org/T347710) (owner: Btullis)
[09:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:34] (PS2) Ilias Sarantopoulos: kserve-inference: add container config in transformers [deployment-charts] - https://gerrit.wikimedia.org/r/1055389
[09:10:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:10:59] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ...
[09:10:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:21:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[09:21:52] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[09:22:12] (PS4) Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895)
[09:23:16] (CR) Tchanders: Enable temporary accounts on testwiki and loginwiki (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: Tchanders)
[09:28:40] (PS1) Btullis: Add kubeconfig files for growthbook on dse-k8s [puppet] - https://gerrit.wikimedia.org/r/1055390 (https://phabricator.wikimedia.org/T365839)
[09:29:41] (PS2) Btullis: Add kubeconfig files for growthbook on dse-k8s [puppet] - https://gerrit.wikimedia.org/r/1055390 (https://phabricator.wikimedia.org/T365839)
[09:31:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[09:31:58] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2439
[09:33:02] (CR) Kevin Bazira: [C:+1] "seems good to me. we might need another opinion before merging." [deployment-charts] - https://gerrit.wikimedia.org/r/1055389 (owner: Ilias Sarantopoulos)
[09:35:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[09:35:13] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[09:35:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[09:35:22] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2439
[09:36:52] (PS1) Btullis: Add a growthbook namespace to the dse-k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1055391 (https://phabricator.wikimedia.org/T365839)
[09:41:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[09:41:53] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[09:42:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[09:42:06] (CR) Klausman: [C:+1] kserve-inference: add container config in transformers [deployment-charts] - https://gerrit.wikimedia.org/r/1055389 (owner: Ilias Sarantopoulos)
[09:46:09] (CR) Brouberol: [C:+1] Add a growthbook namespace to the dse-k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1055391 (https://phabricator.wikimedia.org/T365839) (owner: Btullis)
[09:46:35] (CR) Brouberol: [C:+1] Add kubeconfig files for growthbook on dse-k8s [puppet] - https://gerrit.wikimedia.org/r/1055390 (https://phabricator.wikimedia.org/T365839) (owner: Btullis)
[09:46:41] (PS1) Southparkfan: changeprop beta: replace jobrunner with Bullseye instance [deployment-charts] - https://gerrit.wikimedia.org/r/1055394 (https://phabricator.wikimedia.org/T370487)
[09:47:08] (CR) Southparkfan: [C:-1] "Do not merge yet" [deployment-charts] - https://gerrit.wikimedia.org/r/1055394 (https://phabricator.wikimedia.org/T370487) (owner: Southparkfan)
[09:52:22] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9997915 (elukey) This is awesome Papaul! I tried various configs for sretest2001 (this is a Supermicro node, not Dell): ` host sretest2001 { hard...
[09:54:51] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest2001.codfw.wmnet
[09:56:30] (Abandoned) Ilias Sarantopoulos: ml-services: test dummy predictor_host in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055384 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[09:58:01] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.convert-disks (exit_code=97) for host mw2439
[10:00:16] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[10:00:28] (CR) EoghanGaffney: [C:+1] vrts: use curl with -x flag for proxy [cookbooks] - https://gerrit.wikimedia.org/r/1055297 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[10:00:35] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:05:23] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:06:50] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:08:02] (PS1) Physikerwelt: Enable MathJax rendering in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1055395 (https://phabricator.wikimedia.org/T370507)
[10:09:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055395 (https://phabricator.wikimedia.org/T370507) (owner: Physikerwelt)
[10:12:48] (PS1) Physikerwelt: Enable optional MathJax rendering in everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1055397 (https://phabricator.wikimedia.org/T370507)
[10:13:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[10:13:23] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[10:13:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055397 (https://phabricator.wikimedia.org/T370507) (owner: Physikerwelt)
[10:13:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[10:21:08] (CR) Jelto: [C:+1] ci: keep python2 packages on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1055147 (https://phabricator.wikimedia.org/T367544) (owner: Hashar)
[10:27:07] (PS1) Kamila Součková: Revert "benthos/mw_accesslog_metrics: Add buffer" [puppet] - https://gerrit.wikimedia.org/r/1055399
[10:28:14] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2439
[10:30:45] (CR) CI reject: [V:-1] Revert "benthos/mw_accesslog_metrics: Add buffer" [puppet] - https://gerrit.wikimedia.org/r/1055399 (owner: Kamila Součková)
[10:33:21] (PS2) Kamila Součková: Revert "benthos/mw_accesslog_metrics: Add buffer" [puppet] - https://gerrit.wikimedia.org/r/1055399
[10:37:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[10:37:42] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[10:37:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[10:38:49] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2439
[10:41:06] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[10:41:09] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[10:41:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[10:41:29] (CR) Btullis: [C:+2] Add kubeconfig files for growthbook on dse-k8s [puppet] - https://gerrit.wikimedia.org/r/1055390 (https://phabricator.wikimedia.org/T365839) (owner: Btullis)
[10:49:03] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.convert-disks (exit_code=97) for host mw2439
[10:53:31] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9998084 (elukey) Some questions raised on IRC's dcops chan: * Is it a problem with `lpxelinux.0`, the NIC firmwares interacting with it (say using HT...
[10:54:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[10:54:06] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[10:54:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[10:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:59:31] (CR) Btullis: [C:+2] Add a growthbook namespace to the dse-k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1055391 (https://phabricator.wikimedia.org/T365839) (owner: Btullis)
[11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240719T0700)
[11:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab version upgrades . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240719T1100).
[11:00:23] (PS1) Ilias Sarantopoulos: ml-services: update image in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055411 (https://phabricator.wikimedia.org/T370408)
[11:02:26] (CR) Kevin Bazira: [C:+1] ml-services: update image in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055411 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[11:02:34] (Merged) jenkins-bot: Add a growthbook namespace to the dse-k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1055391 (https://phabricator.wikimedia.org/T365839) (owner: Btullis)
[11:05:01] (PS1) Southparkfan: changeprop beta: replace jobrunner with Bullseye instance [deployment-charts] - https://gerrit.wikimedia.org/r/1055394 (https://phabricator.wikimedia.org/T370487)
[11:05:09] (CR) Southparkfan: changeprop beta: replace jobrunner with Bullseye instance [deployment-charts] - https://gerrit.wikimedia.org/r/1055394 (https://phabricator.wikimedia.org/T370487) (owner: Southparkfan)
[11:05:13] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:05:41] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:06:03] (CR) Ilias Sarantopoulos: [C:+2] ml-services: update image in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055411 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[11:06:58] (Merged) jenkins-bot: ml-services: update image in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1055411 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[11:07:17] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[11:07:46] (PS1) Southparkfan: Beta: replace jobrunner04 with jobrunner05 [puppet] - https://gerrit.wikimedia.org/r/1055412 (https://phabricator.wikimedia.org/T370487)
[11:10:54] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2439
[11:17:41] (PS4) CDobbins: purged: revert use_pki flag [puppet] - https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506)
[11:33:01] SRE, Infrastructure-Foundations, netops: Adjust Icinga VRRP check to return OK if OID not found - https://phabricator.wikimedia.org/T370516 (cmooney) NEW p:Triage→Medium
[11:34:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367856)', diff saved to https://phabricator.wikimedia.org/P66839 and previous config saved to /var/cache/conftool/dbconfig/20240719-113412-marostegui.json
[11:34:17] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[11:41:51] (Abandoned) CDobbins: varnish: add better error page when HTTP status code 429 is returned [puppet] - https://gerrit.wikimedia.org/r/1035011 (https://phabricator.wikimedia.org/T354718) (owner: CDobbins)
[11:49:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P66840 and previous config saved to /var/cache/conftool/dbconfig/20240719-114919-marostegui.json
[11:51:12] (PS19) CDobbins: varnish: show better error for 429s [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718)
[11:55:39] (PS1) Cathal Mooney: Disable VRRP Icinga check for cr1-codfw [puppet] - https://gerrit.wikimedia.org/r/1055424 (https://phabricator.wikimedia.org/T370516)
[11:55:55] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Disable VRRP for cr1-codfw - https://phabricator.wikimedia.org/T370516#9998234 (cmooney)
[11:56:25] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Disable VRRP for cr1-codfw - https://phabricator.wikimedia.org/T370516#9998237 (cmooney)
[11:56:28] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Disable VRRP for cr1-codfw - https://phabricator.wikimedia.org/T370516#9998242 (cmooney)
[11:59:40] (PS2) Cathal Mooney: Disable VRRP Icinga check for cr1-codfw [puppet] - https://gerrit.wikimedia.org/r/1055424 (https://phabricator.wikimedia.org/T370516)
[11:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:01:03] (PS6) Slyngshede: Permissions: Allow users to request new permissions [software/bitu] - https://gerrit.wikimedia.org/r/1052924
[12:01:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL
[12:03:39] (CR) Ilias Sarantopoulos: [C:+2] kserve-inference: add container config in transformers [deployment-charts] - https://gerrit.wikimedia.org/r/1055389 (owner: Ilias Sarantopoulos)
[12:04:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P66841 and previous config saved to /var/cache/conftool/dbconfig/20240719-120426-marostegui.json
[12:06:29] (Merged) jenkins-bot: kserve-inference: add container config in transformers [deployment-charts] - https://gerrit.wikimedia.org/r/1055389 (owner: Ilias Sarantopoulos)
[12:06:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL
[12:09:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[12:09:40] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[12:09:46] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[12:10:18] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2439
[12:12:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[12:12:49] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[12:12:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[12:13:26] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2439
[12:18:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2439
[12:18:52] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2439.codfw.wmnet
[12:19:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2439.codfw.wmnet
[12:19:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367856)', diff saved to https://phabricator.wikimedia.org/P66842 and previous config saved to /var/cache/conftool/dbconfig/20240719-121933-marostegui.json
[12:19:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
[12:19:38] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[12:19:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
[12:20:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL
[12:23:11] (CR) Filippo Giunchedi: [C:+1] Disable VRRP Icinga check for cr1-codfw [puppet] - https://gerrit.wikimedia.org/r/1055424 (https://phabricator.wikimedia.org/T370516) (owner: Cathal Mooney)
[12:23:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T365998 - depooling db1195 - s1 db1202 - s7 db1203 - s8', diff saved to https://phabricator.wikimedia.org/P66843 and previous config saved to /var/cache/conftool/dbconfig/20240719-122320-arnaudb.json
[12:23:25] T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998
[12:24:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL
[12:24:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.convert-disks (exit_code=0) for host mw2439
[12:25:23] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9998293 (ABran-WMF) @Marostegui and I will be absent on tuesday, hosts have been depooled and are ready.
[12:39:10] (CR) Filippo Giunchedi: [C:+1] "Sth for next week for sure" [puppet] - https://gerrit.wikimedia.org/r/1055399 (owner: Kamila Součková)
[12:47:02] (PS12) Arnaudb: mariadb: tweaks monitoring thresholds for replication lag [alerts] - https://gerrit.wikimedia.org/r/1054893 (https://phabricator.wikimedia.org/T367279)
[12:47:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL
[12:47:37] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL
[12:47:54] SRE, Infrastructure-Foundations, netops: Evalute effect of inbound MEDs set by transites - https://phabricator.wikimedia.org/T370520 (cmooney) NEW p:Triage→Low
[12:49:26] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL
[12:53:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL
[13:00:56] (CR) Vgutierrez: [C:-1] varnish: show better error for 429s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: CDobbins)
[13:04:25] (PS1) JMeybohm: Prometheus: Add recording rules computing commonly used envoy histograms [puppet] - https://gerrit.wikimedia.org/r/1055432 (https://phabricator.wikimedia.org/T369607)
[13:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:08:04] SRE, Infrastructure-Foundations, netops: Evalute effect of inbound MEDs set by transites - https://phabricator.wikimedia.org/T370520#9998409 (cmooney)
[13:10:13] (PS1) Lucas Werkmeister (WMDE): Enable mul language code on Wikidata (limited mode) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281)
[13:10:47] (CR) Lucas Werkmeister (WMDE): [C:-2] "Do not deploy before 29 July – but it can be reviewed already." [mediawiki-config] - https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: Lucas Werkmeister (WMDE))
[13:10:57] (CR) Cathal Mooney: [C:+2] Disable VRRP Icinga check for cr1-codfw [puppet] - https://gerrit.wikimedia.org/r/1055424 (https://phabricator.wikimedia.org/T370516) (owner: Cathal Mooney)
[13:10:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:10:59] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ...
[13:10:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:12:44] (03CR) 10Ottomata: Produce a limited set of event streams on private wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [13:13:50] (03CR) 10JMeybohm: "This is slow stuff from https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s if you want to double check (which would be nice " [puppet] - 10https://gerrit.wikimedia.org/r/1055432 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [13:21:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL [13:31:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2439.mgmt.codfw.wmnet with reboot policy GRACEFUL [13:33:36] (03CR) 10Ebernhardson: Produce a limited set of event streams on private wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [13:36:39] (03CR) 10Clément Goubert: [C:03+2] kubernetes: rename 4 appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1055237 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [13:38:12] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9998494 (10Clement_Goubert) I *tried* very hard to automate it [[ https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1054531 | with a cookbook ]], but the... 
[13:39:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2432 to wikikube-worker2035 [13:39:38] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:40:41] 06SRE, 06collaboration-services, 06Traffic, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#9998507 (10Jelto) [13:41:11] 06SRE, 06Infrastructure-Foundations, 10netops: Disable VRRP for cr1-codfw - https://phabricator.wikimedia.org/T370516#9998505 (10cmooney) 05Open→03Resolved Check is now removed. [13:42:07] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2432 to wikikube-worker2035 - cgoubert@cumin1002" [13:45:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2432 to wikikube-worker2035 - cgoubert@cumin1002" [13:45:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:50] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2035 [13:46:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2035 [13:46:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2432 to wikikube-worker2035 [13:48:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2433 to wikikube-worker2036 [13:48:27] (03PS1) 10Ottomata: Remove docroot/mediawiki.org/beacon/event/index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055443 (https://phabricator.wikimedia.org/T353817) [13:48:31] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:51:19] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:51:19] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2433 to wikikube-worker2036 - cgoubert@cumin1002" [13:52:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2433 to wikikube-worker2036 - cgoubert@cumin1002" [13:52:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:34] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2036 [13:53:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2036 [13:53:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2433 to wikikube-worker2036 [13:54:19] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2438 to wikikube-worker2037 [13:55:10] 06SRE, 06Infrastructure-Foundations, 10netops: Evalute effect of inbound MEDs set by transites - https://phabricator.wikimedia.org/T370520#9998532 (10cmooney) [13:55:21] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:56:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:57:31] 06SRE, 06Infrastructure-Foundations, 10netops: Evalute effect of inbound MEDs set by transites 
- https://phabricator.wikimedia.org/T370520#9998533 (10cmooney) [13:57:49] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2438 to wikikube-worker2037 - cgoubert@cumin1002" [13:59:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2438 to wikikube-worker2037 - cgoubert@cumin1002" [13:59:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:59:49] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2037 [14:00:42] 06SRE, 06Infrastructure-Foundations, 10netops: Evalute effect of inbound MEDs set by transits - https://phabricator.wikimedia.org/T370520#9998543 (10cmooney) [14:01:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:01:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2037 [14:01:42] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: service=thanos-web,name=titan1002.eqiad.wmnet [14:01:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2438 to wikikube-worker2037 [14:02:18] !log herron@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thanos-web,name=titan1002.eqiad.wmnet [14:02:29] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2439 to wikikube-worker2038 [14:02:33] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: service=thanos-web,name=titan1001.eqiad.wmnet [14:02:46] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:03:27] !log herron@puppetmaster1001 conftool action : 
set/pooled=yes; selector: service=thanos-web,name=titan1001.eqiad.wmnet [14:03:50] 10SRE-swift-storage, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Port to Prometheus load_average check - https://phabricator.wikimedia.org/T370526 (10fgiunchedi) 03NEW [14:05:15] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2439 to wikikube-worker2038 - cgoubert@cumin1002" [14:06:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2439 to wikikube-worker2038 - cgoubert@cumin1002" [14:06:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:06:52] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2038 [14:07:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2038 [14:08:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2439 to wikikube-worker2038 [14:08:10] (03CR) 10JHathaway: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1054952 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [14:09:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2035.codfw.wmnet with OS bullseye [14:09:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2036.codfw.wmnet with OS bullseye [14:10:08] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2037.codfw.wmnet with OS bullseye [14:10:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2038.codfw.wmnet with OS bullseye [14:13:33] (03PS2) 10Elukey: Release version 0.5.0-1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 
(https://phabricator.wikimedia.org/T368744) [14:13:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#9998583 (10Jhancock.wm) hey, moved one of the connections to port 13 on asw-df-codfw [14:13:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#9998584 (10Jhancock.wm) a:03Jhancock.wm [14:16:47] (03PS3) 10Elukey: Release version 0.5.0-1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) [14:25:37] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:27:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2036.codfw.wmnet with reason: host reimage [14:28:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2035.codfw.wmnet with reason: host reimage [14:29:06] 07Puppet, 10MW-on-K8s, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Port or delete "git repo needs merge" icinga check - https://phabricator.wikimedia.org/T370530 (10fgiunchedi) 03NEW [14:29:19] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:31:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2036.codfw.wmnet with reason: host reimage [14:34:09] !log cgoubert@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2035.codfw.wmnet with reason: host reimage [14:34:19] (03CR) 10Andrew Bogott: [C:03+2] Beta: replace jobrunner04 with jobrunner05 [puppet] - 10https://gerrit.wikimedia.org/r/1055412 (https://phabricator.wikimedia.org/T370487) (owner: 10Southparkfan) [14:35:14] (03CR) 10Andrew Bogott: [C:03+2] changeprop beta: replace jobrunner with Bullseye instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055394 (https://phabricator.wikimedia.org/T370487) (owner: 10Southparkfan) [14:36:22] (03Merged) 10jenkins-bot: changeprop beta: replace jobrunner with Bullseye instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055394 (https://phabricator.wikimedia.org/T370487) (owner: 10Southparkfan) [14:36:47] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2037.codfw.wmnet with OS bullseye [14:36:57] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2038.codfw.wmnet with OS bullseye [14:37:32] (03CR) 10Andrew Bogott: [C:03+2] LabsServices: update domain name for IRC RC feed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055306 (https://phabricator.wikimedia.org/T369919) (owner: 10Southparkfan) [14:37:33] 06SRE, 10ConfirmEdit (CAPTCHA extension), 10WMF-General-or-Unknown: Remove words with apostrophes from captcha wordlist - https://phabricator.wikimedia.org/T370531#9998689 (10Reedy) [14:37:50] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:38:11] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on centrallog1002.eqiad.wmnet with reason: network upgrade [14:38:15] (03Merged) 10jenkins-bot: LabsServices: update domain name for IRC RC feed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055306 (https://phabricator.wikimedia.org/T369919) (owner: 10Southparkfan) [14:38:25] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on centrallog1002.eqiad.wmnet with reason: network upgrade [14:38:29] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9998693 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a4ef3cbd-d61a-4ca8-9633-ad28b8df65c1) set by filippo@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: ne... [14:38:36] 06SRE, 10ConfirmEdit (CAPTCHA extension), 10WMF-General-or-Unknown: Remove words with apostrophes from captcha wordlist - https://phabricator.wikimedia.org/T370531#9998686 (10Reedy) p:05Triage→03Low [15:00:23] (03CR) 10JHathaway: [C:03+1] Styling: Allow the use of normal Codex tables. [software/bitu] - 10https://gerrit.wikimedia.org/r/1052923 (owner: 10Slyngshede) [15:00:37] FIRING: [4x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:13] (03CR) 10JHathaway: [C:03+1] Permissions: Allow users to request new permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 (owner: 10Slyngshede) [15:05:45] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [15:05:45] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ... 
[15:05:45] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:07:48] ^ deleted a persistently slow pod [15:08:46] (03PS1) 10Cathal Mooney: Support SSW performing DHCP relay for hosts connected to ASW [homer/public] - 10https://gerrit.wikimedia.org/r/1055458 (https://phabricator.wikimedia.org/T369274) [15:09:21] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2038.codfw.wmnet with OS bullseye [15:10:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2038.codfw.wmnet with OS bullseye [15:10:37] FIRING: [4x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:29] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:12:24] (03CR) 10Cathal Mooney: [C:03+2] Support SSW performing DHCP relay for hosts connected to ASW [homer/public] - 10https://gerrit.wikimedia.org/r/1055458 (https://phabricator.wikimedia.org/T369274) (owner: 10Cathal Mooney) [15:12:57] (03Merged) 10jenkins-bot: Support SSW performing DHCP relay for hosts connected to ASW [homer/public] - 10https://gerrit.wikimedia.org/r/1055458 (https://phabricator.wikimedia.org/T369274) (owner: 10Cathal Mooney) [15:15:07] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding gerrit2003 to codfw - jhancock@cumin2002" [15:16:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding gerrit2003 to codfw - jhancock@cumin2002" [15:16:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:54] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.provision for host gerrit2003.mgmt.codfw.wmnet with reboot policy FORCED [15:17:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2037.codfw.wmnet with reason: host reimage [15:19:19] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2037.codfw.wmnet with reason: host reimage [15:23:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#9998809 (10cmooney) 05Open→03Resolved woot! thanks Jenn :) [15:24:19] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest2002.codfw.wmnet [15:27:29] (03CR) 10Elukey: admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:27:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gerrit2003.mgmt.codfw.wmnet with reboot policy FORCED [15:28:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2038.codfw.wmnet with reason: host reimage [15:29:19] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:30:37] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:32:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2038.codfw.wmnet with reason: host reimage [15:34:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:24] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit2003'] [15:34:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['gerrit2003'] [15:34:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit2003'] [15:34:59] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9998858 (10VRiley-WMF) 05Open→03Resolved Relocated the server and physically relabeled the cable. This is now completed. 
[15:35:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['gerrit2003'] [15:35:38] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit2003'] [15:37:38] (03PS3) 10Bking: knative-serving: Switch activator to use Calico NP/k8s services (1/9) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054538 (https://phabricator.wikimedia.org/T365479) (owner: 10Klausman) [15:37:38] (03CR) 10Bking: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054538 (https://phabricator.wikimedia.org/T365479) (owner: 10Klausman) [15:41:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2037.codfw.wmnet with OS bullseye [15:43:03] (03CR) 10Urbanecm: Enable temporary accounts on testwiki and loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [15:43:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['gerrit2003'] [15:44:17] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit2003'] [15:44:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['gerrit2003'] [15:52:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2038.codfw.wmnet with OS bullseye [15:52:57] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#9998924 (10Jhancock.wm) [15:54:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:43] (03PS1) 10Southparkfan: deployment-prep: add two new appservers [puppet] - 10https://gerrit.wikimedia.org/r/1055464 (https://phabricator.wikimedia.org/T361387) [15:57:15] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep: add two new appservers [puppet] - 10https://gerrit.wikimedia.org/r/1055464 (https://phabricator.wikimedia.org/T361387) (owner: 10Southparkfan) [15:59:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:00:40] (03PS1) 10Brouberol: superset-next: upgrade to v4.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055468 (https://phabricator.wikimedia.org/T370152) [16:01:41] (03CR) 10Btullis: [C:03+1] "Great, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1055468 (https://phabricator.wikimedia.org/T370152) (owner: 10Brouberol) [16:01:50] (03CR) 10Brouberol: [C:03+2] superset-next: upgrade to v4.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055468 (https://phabricator.wikimedia.org/T370152) (owner: 10Brouberol) [16:11:50] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [16:13:32] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [16:14:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:34] (03CR) 10BCornwall: [V:03+1 C:03+1] "Validated using https://gitlab.wikimedia.org/-/snippets/146" [dns] - 10https://gerrit.wikimedia.org/r/1055230 (owner: 10Ncmonitor) [16:20:21] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [16:20:23] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [16:40:42] (03PS20) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [16:43:26] 06SRE, 10Observability-Alerting: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135#9999026 (10lmata) [16:46:47] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:48:36] !log 
cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:49:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:56] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:06:20] (03PS21) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [17:12:45] !log adding irb ints for row c/d vlans to codfw leaf switches in those rows T364095 [17:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:52] T364095: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 [17:13:16] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:13:18] (03PS22) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [17:14:43] (03PS4) 10Ebernhardson: Produce a limited set of event streams on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) [17:17:41] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:20:54] !log 
cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for new irb ints codfw row c and d - cmooney@cumin1002" [17:21:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for new irb ints codfw row c and d - cmooney@cumin1002" [17:21:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:24:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:24:55] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543 (10RobH) 03NEW [17:25:32] (03PS23) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [17:27:39] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#9999216 (10Jhancock.wm) franio2001 eth0 <-> FASW-C8A eth-0/0/25 eth1 <-> FASW-C8B eth-1/0/25 franio2002 eth0 <-> FASW-C8A eth-0/0/26 eth1 <-> FASW-C8B eth-1/0/26 franio20... [17:29:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:30:10] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9999221 (10cmooney) >>! 
In T365998#9998292, @ABran-WMF wrote: > @Marostegui and I will be absent on tuesday, hosts have been dep... [17:33:52] (03PS24) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [17:40:24] (03PS25) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [17:42:54] (03PS1) 10Catrope: beta: Work around T370517 by remapping the affected i18n message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055479 (https://phabricator.wikimedia.org/T370517) [17:42:56] (03PS1) 10Catrope: Work around T370517 by remapping the affected i18n message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517) [17:43:32] (03CR) 10CI reject: [V:04-1] beta: Work around T370517 by remapping the affected i18n message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055479 (https://phabricator.wikimedia.org/T370517) (owner: 10Catrope) [17:43:34] (03CR) 10CI reject: [V:04-1] Work around T370517 by remapping the affected i18n message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517) (owner: 10Catrope) [17:44:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:51] (03PS2) 10Catrope: beta: Work around T370517 by remapping the affected i18n message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055479 (https://phabricator.wikimedia.org/T370517) [17:45:30] (03CR) 10CI reject: [V:04-1] beta: Work around T370517 by remapping the affected i18n message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055479 (https://phabricator.wikimedia.org/T370517) 
(owner: Catrope)
[17:46:01] (PS2) Catrope: Work around T370517 by remapping the affected i18n message [mediawiki-config] - https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517)
[17:46:39] (CR) CI reject: [V:-1] Work around T370517 by remapping the affected i18n message [mediawiki-config] - https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517) (owner: Catrope)
[17:46:54] (PS3) Catrope: beta: Work around T370517 by remapping the affected i18n message [mediawiki-config] - https://gerrit.wikimedia.org/r/1055479 (https://phabricator.wikimedia.org/T370517)
[17:46:54] (PS3) Catrope: Work around T370517 by remapping the affected i18n message [mediawiki-config] - https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517)
[17:49:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:35] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:58:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:02:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook -
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:03:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:07:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:07:27] (CR) Dzahn: [C:+2] crm: switch civicrm to use smarty4 and don't pull extensions [puppet] - https://gerrit.wikimedia.org/r/1054952 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[18:08:25] (CR) Dzahn: [C:+2] ci: keep python2 packages on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1055147 (https://phabricator.wikimedia.org/T367544) (owner: Hashar)
[18:19:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:24:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:37:52] (PS26) CDobbins: varnish: show better error for 429s [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718)
[18:46:14]
ops-codfw, DC-Ops, observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545 (RobH) NEW
[18:47:25] ops-codfw, DC-Ops, observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#9999340 (RobH)
[18:49:34] ops-eqiad, DC-Ops, observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546 (RobH) NEW
[18:50:08] ops-eqiad, DC-Ops, observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#9999361 (RobH)
[18:53:06] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#9999368 (Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/25; - member ge-0/0/26; - member ge-0/0/27; -...
[18:54:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:59:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:14:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:14:24] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:16:36] (PS1) Papaul: Add franio200[2-3] to DNS file [dns] - https://gerrit.wikimedia.org/r/1055481
[19:18:07] (CR) Dzahn: [C:+1] Add franio200[2-3] to DNS file [dns] - https://gerrit.wikimedia.org/r/1055481 (owner: Papaul)
[19:18:38] (PS27) CDobbins: varnish: show better error for 429s [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718)
[19:19:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:29:56] ops-eqiad, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T370552 (phaultfinder) NEW
[19:34:11] (PS28) CDobbins: varnish: show better error for 429s [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718)
[19:38:11] papaul: there seems to be a duplicate IP on mgmt network.. new
[19:39:56] mutante: 10.65 looks like eqiad
[19:40:02] will look later
[19:40:23] ack, ty
[19:44:17] 2 different Dell MACs. IP kafka-main1007.mgmt
[19:44:43] mutante: yes https://netbox.wikimedia.org/ipam/ip-addresses/16934/
[19:45:04] VRiley: see above
[19:45:44] VRiley: reference is https://phabricator.wikimedia.org/T370552
[19:47:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:47:32] (PS1) Andrew Bogott: git-puppet-sync-upstream: perform the final deploy as git_user [puppet] - https://gerrit.wikimedia.org/r/1055482 (https://phabricator.wikimedia.org/T364492)
[19:48:39] mutante: Thanks.
Currently, I'm testing out a MB swap, and it's not looking successful. I apologize for the errors it was throwing.
[19:49:08] (PS2) Andrew Bogott: git-puppet-sync-upstream: perform the final deploy as git_user [puppet] - https://gerrit.wikimedia.org/r/1055482 (https://phabricator.wikimedia.org/T364492)
[19:49:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:49:42] VRiley: ah, that makes sense. thanks, no problem. if we know it's that, good
[19:49:55] (CR) Andrew Bogott: [C:+2] git-puppet-sync-upstream: perform the final deploy as git_user [puppet] - https://gerrit.wikimedia.org/r/1055482 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott)
[19:51:58] ops-eqiad, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T370552#9999506 (Dzahn) @VRiley-WMF was working on this, swapping the mainboard.
that explains the 2 different MACs
[19:52:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:54:09] (PS1) Ebernhardson: beta: Enable NetworkSession extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1055484 (https://phabricator.wikimedia.org/T355267)
[19:54:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:57:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[20:02:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:02:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:04:51] ops-eqiad, SRE, DBA, DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#9999540 (VRiley-WMF) After swapping
out the MB, it'll boot. However, it's consistently throwing errors with a few DIMM slots (even after replacing the memory in those slots). Since this serve...
[20:05:54] ops-eqiad, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T370552#9999542 (phaultfinder)
[20:05:55] (PS13) Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234)
[20:07:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:09:19] FIRING: [33x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:11:19] ops-eqiad, SRE, DC-Ops: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T370556 (ops-monitoring-bot) NEW
[20:15:37] FIRING: [33x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:19] FIRING: [33x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:54] (CR) Andrew Bogott: [C:+2] openstack: nova: Ensure libvirt is running when declaring secrets [puppet] - https://gerrit.wikimedia.org/r/1043058 (owner: Majavah)
[20:20:37] FIRING: [33x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:24:15] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055479 (https://phabricator.wikimedia.org/T370517) (owner: Catrope)
[20:24:19] FIRING: [34x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:24:53] (Merged) jenkins-bot: beta: Work around T370517 by remapping the affected i18n message [mediawiki-config] - https://gerrit.wikimedia.org/r/1055479 (https://phabricator.wikimedia.org/T370517) (owner: Catrope)
[20:25:37] FIRING: [34x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:29:19] FIRING: [34x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:33:19] (PS1) Andrew Bogott: cloudvirt1061 -> ovs [puppet] - https://gerrit.wikimedia.org/r/1055486 (https://phabricator.wikimedia.org/T364457)
[20:34:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1061.eqiad.wmnet with OS bookworm
[20:36:08] (CR) Andrew Bogott: [C:+2] cloudvirt1061 -> ovs [puppet] - https://gerrit.wikimedia.org/r/1055486 (https://phabricator.wikimedia.org/T364457) (owner: Andrew Bogott)
[20:39:19] FIRING: [35x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:44:19] FIRING: [35x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:49:19] FIRING: [35x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:49:48] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[20:52:00] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[20:53:17] (PS1) Dzahn: doc: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055488
[20:53:17] (PS1) Dzahn: aphlict: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055489
[20:53:17] (PS1) Dzahn: planet: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055490
[20:53:18] (PS1) Dzahn: ci: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055491
[20:53:19] (PS1) Dzahn: lists: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055492
[20:53:21] (PS1) Dzahn: phabricator: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055493
[20:53:25] (PS1) Dzahn: releases: switch firewall
provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055494
[20:53:29] (PS1) Dzahn: vrts: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055495
[20:53:33] (PS1) Dzahn: etherpad: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055496
[20:53:41] (CR) CI reject: [V:-1] doc: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055488 (owner: Dzahn)
[20:55:47] (PS1) JHathaway: pcc: update flask code [puppet] - https://gerrit.wikimedia.org/r/1055497 (https://phabricator.wikimedia.org/T367547)
[20:56:41] (PS2) Dzahn: doc: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055488
[20:57:51] (CR) JHathaway: [C:+2] pcc: update flask code [puppet] - https://gerrit.wikimedia.org/r/1055497 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[20:57:52] (PS1) BCornwall: haproxy: Calculate rate of haproxy restarts [alerts] - https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833)
[20:58:58] (PS29) CDobbins: varnish: show better error for 429s [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718)
[21:04:27] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:09:27] RESOLVED: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:10:37] FIRING: [34x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1061.eqiad.wmnet with OS bookworm
[21:15:16] (PS1) JHathaway: pcc: bump max body size [puppet] - https://gerrit.wikimedia.org/r/1055499 (https://phabricator.wikimedia.org/T367547)
[21:18:50] (CR) CI reject: [V:-1] pcc: bump max body size [puppet] - https://gerrit.wikimedia.org/r/1055499 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[21:19:19] FIRING: [34x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:20:37] FIRING: [34x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:19] FIRING: [34x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:26:08] (PS1) JHathaway: pcc: update default for facts upload script [puppet] - https://gerrit.wikimedia.org/r/1055500 (https://phabricator.wikimedia.org/T367547)
[21:33:21] (PS1) JHathaway: Revert "openstack:
nova: Ensure libvirt is running when declaring secrets" [puppet] - https://gerrit.wikimedia.org/r/1055501
[21:33:46] (CR) JHathaway: [C:+2] pcc: update default for facts upload script [puppet] - https://gerrit.wikimedia.org/r/1055500 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[21:34:03] (CR) JHathaway: [C:+2] Revert "openstack: nova: Ensure libvirt is running when declaring secrets" [puppet] - https://gerrit.wikimedia.org/r/1055501 (owner: JHathaway)
[21:34:52] (CR) CDobbins: varnish: show better error for 429s (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: CDobbins)
[21:35:14] (CR) CDobbins: "Done" [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: CDobbins)
[21:35:59] (CR) JHathaway: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1055499 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[21:42:52] (PS1) Andrew Bogott: git-sync-upstream: rip out uid juggling [puppet] - https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492)
[21:43:15] (CR) CI reject: [V:-1] git-sync-upstream: rip out uid juggling [puppet] - https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott)
[21:44:21] (PS2) Andrew Bogott: git-sync-upstream: rip out uid juggling [puppet] - https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492)
[21:44:57] (CR) CI reject: [V:-1] git-sync-upstream: rip out uid juggling [puppet] - https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott)
[21:51:04] (PS3) Andrew Bogott: git-sync-upstream: rip out uid juggling [puppet] - https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492)
[21:52:35] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 -
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:53:31] (CR) Andrew Bogott: [C:-1] "This isn't ready for merge yet because it presumes arbitrary users are writing to prometheus." [puppet] - https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott)
[21:55:37] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:14:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:19:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:23] (CR) JHathaway: [C:+2] pcc: bump max body size [puppet] - https://gerrit.wikimedia.org/r/1055499 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[22:30:16] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037571 (owner: JHathaway)
[22:35:03] (PS4) JHathaway: logstash: add postfix filters &
patterns [puppet] - https://gerrit.wikimedia.org/r/1037571
[22:35:08] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037571 (owner: JHathaway)
[22:49:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:14:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:15:37] FIRING: [32x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:19:19] FIRING: [32x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:24:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:29:19] FIRING: [33x]
SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:48] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1055508
[23:38:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1055508 (owner: TrainBranchBot)
[23:44:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:45:37] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:49:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:50:37] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:59:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions