[00:09:16] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:16:02] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:38:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/954378 [00:39:02] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/954378 (owner: 10TrainBranchBot) [00:54:46] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/954378 (owner: 10TrainBranchBot) [01:04:50] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T345592 (10phaultfinder) [01:07:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:12:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T0200) [02:07:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.25 [core] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/954379 (https://phabricator.wikimedia.org/T343727) 
[02:07:38] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.25 [core] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/954379 (https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [02:08:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:55] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.25 [core] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/954379 (https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [02:33:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T0300) [03:00:56] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:25] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954764 (https://phabricator.wikimedia.org/T343727) [03:01:27] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954764 (https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [03:02:06] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954764 (https://phabricator.wikimedia.org/T343727) (owner: 
10TrainBranchBot) [03:02:39] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.25 refs T343727 [03:02:42] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [03:28:30] 10SRE-swift-storage, 10Commons: File not found on commons - https://phabricator.wikimedia.org/T345522 (10Shizhao) 05Open→03Invalid It can be shown today [03:36:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:41:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:59:08] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.25 refs T343727 (duration: 56m 29s) [03:59:11] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [04:02:24] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:26] (03PS3) 10KartikMistry: Update MinT to 2023-09-04-051105-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954005 (https://phabricator.wikimedia.org/T336683) [05:04:31] * kart_ updating MinT and cxserver [05:05:08] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 129 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:10:47] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-09-04-051105-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954005 (https://phabricator.wikimedia.org/T336683) (owner: 10KartikMistry) 
[05:12:26] (03Merged) 10jenkins-bot: Update MinT to 2023-09-04-051105-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954005 (https://phabricator.wikimedia.org/T336683) (owner: 10KartikMistry) [05:22:32] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [05:23:11] (03PS3) 10KartikMistry: Update cxserver to 2023-08-29-191442-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952568 (https://phabricator.wikimedia.org/T345170) [05:25:03] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [05:30:26] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [05:36:54] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [05:41:56] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [05:46:26] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [05:55:09] !log Updated MinT to 2023-09-04-051105-production (T336683) [05:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:13] T336683: Enable MinT support for languages with no Wikipedia yet - https://phabricator.wikimedia.org/T336683 [05:55:25] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-08-29-191442-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952568 (https://phabricator.wikimedia.org/T345170) (owner: 10KartikMistry) [05:56:11] (03Merged) 10jenkins-bot: Update cxserver to 2023-08-29-191442-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952568 (https://phabricator.wikimedia.org/T345170) (owner: 10KartikMistry) [05:57:57] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:58:19] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T0600) [06:00:05] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T0600). [06:00:36] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:01:00] ^^ I'm finishing cxserver deployment in a few minutes [06:01:11] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:04:11] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:04:47] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:05:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 130 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:06:27] !log Updated cxserver to 2023-08-29-191442-production (T345170, T343450) [06:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:31] T343450: Enable MinT for closely-related languages based on community input - https://phabricator.wikimedia.org/T343450 [06:06:31] T345170: Post-creation work for tlywiki - https://phabricator.wikimedia.org/T345170 [06:10:31] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1130.eqiad.wmnet with OS bullseye [06:24:03] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1130.eqiad.wmnet with reason: host reimage [06:26:33] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1130.eqiad.wmnet with reason: host reimage [06:28:25] (03PS1) 10Tim Starling: Enable source maps on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954878 [06:29:41] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1131.eqiad.wmnet with OS bullseye [06:30:42] (03CR) 10Tim Starling: [C: 03+2] 
Customise $wgSitename on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952563 (https://phabricator.wikimedia.org/T181908) (owner: 10Tim Starling) [06:31:22] (03Merged) 10jenkins-bot: Customise $wgSitename on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952563 (https://phabricator.wikimedia.org/T181908) (owner: 10Tim Starling) [06:33:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:40:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:43:20] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1131.eqiad.wmnet with reason: host reimage [06:45:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet [06:46:19] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1131.eqiad.wmnet with reason: host reimage [06:47:34] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:47:50] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:49:00] RECOVERY - CirrusSearch eqiad 95th 
percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:49:00] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Labs only change, just avoiding undeployed changes (duration: 09m 25s) [06:50:12] (03PS1) 10Vgutierrez: admin: Add new SSH key for ppenloglou [puppet] - 10https://gerrit.wikimedia.org/r/954879 (https://phabricator.wikimedia.org/T345132) [06:51:51] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1130.eqiad.wmnet with OS bullseye [06:52:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:55:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet [06:59:10] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS bullseye [07:00:05] Amir1, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T0700) [07:00:05] Aca, phuedx, aanzx, and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:13] * Aca waves [07:03:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet [07:03:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet [07:07:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3008.esams.wmnet [07:07:28] o/ [07:07:37] Sorry I'm late [07:07:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) 05Open→03Stalled [07:08:13] 10SRE-swift-storage, 10collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10MatthewVernon) Thanks for letting me know; given you're planning to come back to this credential later in the FY, I'm going to leave it in place (since the process for adding/remo... [07:08:23] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1131.eqiad.wmnet with OS bullseye [07:09:59] I'm late too. [07:11:40] (03CR) 10Elukey: [C: 03+1] Update the maximum message size in kafka for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/954690 (https://phabricator.wikimedia.org/T344688) (owner: 10Btullis) [07:13:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for cjming - https://phabricator.wikimedia.org/T345455 (10Vgutierrez) [07:13:40] Anyone deploying today? :) [07:14:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [07:14:42] No deployer listed seems available today? I'll start with my patch then. [07:15:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953756 (https://phabricator.wikimedia.org/T343211) (owner: 10KartikMistry) [07:15:24] I'll go with other patches if time permits. 
[07:15:44] (03Merged) 10jenkins-bot: Enable Section and Content Translation in 7 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953756 (https://phabricator.wikimedia.org/T343211) (owner: 10KartikMistry) [07:16:35] !log kartik@deploy1002 Started scap: Backport for [[gerrit:953756|Enable Section and Content Translation in 7 Wikipedias (T343211)]] [07:16:40] T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211 [07:18:15] !log kartik@deploy1002 kartik: Backport for [[gerrit:953756|Enable Section and Content Translation in 7 Wikipedias (T343211)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:20:10] !log kartik@deploy1002 kartik: Continuing with sync [07:22:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [07:22:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3008.esams.wmnet [07:23:50] !log failover ganeti masters in esams to ganeti3007/ganeti3008 [07:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:39] 10SRE-swift-storage, 10collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10eoghan) 05Open→03Resolved Sounds good to us. Thanks for your help! 
Closing this in favour of a more detailed rollout plan to come later this year [07:26:40] `07:24:59 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw1420.eqiad.wmnet', 'mw1366.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw2300.codfw.wmnet', 'mw1404.eqiad.wmnet', 'mw2259.codfw.wmnet', 'mw2289.codfw.wmnet', 'mw1398.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw1486.eqiad.wmnet'] (ran as mwdeploy@mw2448.codfw.wmnet) returned [255]: ssh: connect to host mw2448.codfw.wmnet [07:26:40] port 22: Connection timed out` [07:27:22] PROBLEM - ganeti-wconfd running on ganeti3006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [07:27:25] mw2448.codfw.wmnet seems down? [07:32:21] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:953756|Enable Section and Content Translation in 7 Wikipedias (T343211)]] (duration: 15m 45s) [07:32:24] T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211 [07:33:25] Aca: I can go ahead with your patches if you're around to test. [07:33:35] Yess, please [07:33:52] Cool. Deploying first. Will ping once to test on mwdebug. 
[07:35:16] (03PS4) 10Filippo Giunchedi: otel-collector: export traces to jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) [07:35:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954240 (https://phabricator.wikimedia.org/T345513) (owner: 10Acamicamacaraca) [07:36:12] (03Merged) 10jenkins-bot: Enable AbuseFilter blocks on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954240 (https://phabricator.wikimedia.org/T345513) (owner: 10Acamicamacaraca) [07:36:38] !log kartik@deploy1002 Started scap: Backport for [[gerrit:954240|Enable AbuseFilter blocks on shwiki (T345513)]] [07:36:41] T345513: Enable AbuseFilter blocks on shwiki - https://phabricator.wikimedia.org/T345513 [07:37:02] (03CR) 10Filippo Giunchedi: "The host:port now is correct, although I'm not sure what's the best way to get tls certs via cert-manager (without "mesh" module? or easie" [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [07:38:15] !log kartik@deploy1002 kartik and aleksandar: Backport for [[gerrit:954240|Enable AbuseFilter blocks on shwiki (T345513)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:38:31] Aca: Please test patch on shwiki and let me know if it is OK. 
[07:38:33] testing it now [07:38:48] (I mean using mwdebug :)) [07:39:08] yeah, that's what I was referring to [07:39:40] (03PS10) 10KartikMistry: Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [07:40:14] block option is now available in AbuseFilter, seems fine [07:40:21] kart_: probably best to file a task about mw2448 or use -sre. Here it'll get lost. [07:41:54] Aca: OK. Going ahead. [07:42:03] !log kartik@deploy1002 kartik and aleksandar: Continuing with sync [07:42:06] (03PS6) 10Anzx: tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) [07:42:14] (03PS3) 10Anzx: tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) [07:43:04] RhinosF1|Away: msged in -sre. [07:45:25] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1132.eqiad.wmnet with OS bullseye [07:46:55] !log depool mw2448 (unreachable) [07:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:50:31] Aca: For 2nd patch there is still -2 from taavi? 
[07:51:33] 10SRE, 10ops-codfw, 10serviceops: mw2448 is unreachable again - https://phabricator.wikimedia.org/T345597 (10MoritzMuehlenhoff) [07:51:50] (03CR) 10JMeybohm: otel-collector: export traces to jaeger (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [07:53:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:53:43] That patch was -2'ed because of lack of consensus. I started an RfC later on shwiki Village Pump, and consensus was reached. I tried contacting taavi via Phabricator, but I didn't get a response. [07:54:21] jouncebot: next [07:54:21] In 2 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1000) [07:55:06] Aca: no problem. I was reading task/discussion. [07:55:08] kart_: o/ [07:55:48] I'd need to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/954593, not urgent, can I get into the queue? (I can deploy it myself with scap backport later on) [07:56:07] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:954240|Enable AbuseFilter blocks on shwiki (T345513)]] (duration: 19m 29s) [07:56:08] elukey: sure. We still have 4 more patches to go ;) [07:56:10] T345513: Enable AbuseFilter blocks on shwiki - https://phabricator.wikimedia.org/T345513 [07:56:28] elukey: [07:56:31] I can probably go with Aca's second patch, but can't deploy more than that. 
[07:57:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [07:57:34] elukey: Please go ahead after I finish ^^ [07:58:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:58:35] phuedx: aanzx - I'm sorry but we have to reschedule your patches to the next available window. I have a meeting to attend in a few minutes (will be here till aca's patch is deployed) [07:58:50] ok i will schedule for later [08:00:17] (03CR) 10KartikMistry: [C: 03+2] Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [08:00:42] (03PS2) 10Ayounsi: Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) [08:00:51] (03CR) 10KartikMistry: [C: 03+2] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [08:00:59] (03Merged) 10jenkins-bot: Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [08:01:23] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:01:28] !log kartik@deploy1002 Started scap: Backport for [[gerrit:949171|Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki (T344306)]] [08:01:31] T344306: Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki - 
https://phabricator.wikimedia.org/T344306 [08:01:50] ah. We needed to remove -2 before approval by scap. [08:02:19] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:02:20] (03CR) 10CI reject: [V: 04-1] Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [08:03:07] !log kartik@deploy1002 aleksandar and kartik: Backport for [[gerrit:949171|Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki (T344306)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:03:11] Aca: Can you test https://gerrit.wikimedia.org/r/c/949171/ on mwdebug? [08:04:37] yepp [08:05:02] Let me know once it is all good to go. [08:05:29] (03PS1) 10Volans: tox.ini: use sphinx-build instead of setup.py [software/homer] - 10https://gerrit.wikimedia.org/r/954882 [08:05:53] kart_ good to go ^.^ [08:06:07] New protection options are now available in dropdown menu [08:06:20] Awesome. 
[08:06:24] !log kartik@deploy1002 aleksandar and kartik: Continuing with sync [08:06:39] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:08:26] (03CR) 10Volans: [C: 03+2] tox.ini: use sphinx-build instead of setup.py [software/homer] - 10https://gerrit.wikimedia.org/r/954882 (owner: 10Volans) [08:08:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:10:11] (03Merged) 10jenkins-bot: tox.ini: use sphinx-build instead of setup.py [software/homer] - 10https://gerrit.wikimedia.org/r/954882 (owner: 10Volans) [08:10:33] (03PS1) 10JMeybohm: CI: Run scaffold tests on module changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/954883 [08:10:57] (03PS3) 10Volans: Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [08:12:15] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:949171|Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki (T344306)]] (duration: 10m 47s) [08:12:18] T344306: Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki - https://phabricator.wikimedia.org/T344306 [08:12:41] OK. We are done, Aca :) [08:12:50] Nicee, thank you! 
[08:12:53] elukey: you can go ahead with self deploy :) [08:13:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:13:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:18:40] kart_: thanks! [08:18:43] (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [08:20:04] (03PS5) 10Filippo Giunchedi: otel-collector: export traces to jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) [08:20:56] (03CR) 10Filippo Giunchedi: otel-collector: export traces to jaeger (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [08:22:29] (03PS2) 10Majavah: hieradata: add cloudservices1006 to all designate fw rules [puppet] - 10https://gerrit.wikimedia.org/r/954726 (https://phabricator.wikimedia.org/T345240) [08:23:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: add cloudservices1006 to all designate fw rules [puppet] - 10https://gerrit.wikimedia.org/r/954726 (https://phabricator.wikimedia.org/T345240) (owner: 10Majavah) [08:23:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43142/console" [puppet] - 10https://gerrit.wikimedia.org/r/954726 (https://phabricator.wikimedia.org/T345240) (owner: 10Majavah) [08:23:57] (03PS8) 10JMeybohm: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 
(https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:24:12] (03CR) 10JMeybohm: [C: 04-1] mesh: add tracing support (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:25:35] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: add cloudservices1006 to all designate fw rules [puppet] - 10https://gerrit.wikimedia.org/r/954726 (https://phabricator.wikimedia.org/T345240) (owner: 10Majavah) [08:25:45] (03CR) 10Volans: "LGTM, small nit inline" [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [08:27:20] (03PS1) 10Majavah: hieradata: use eqiad.wmnet for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/954884 [08:27:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: use eqiad.wmnet for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/954884 (owner: 10Majavah) [08:27:50] (03CR) 10Majavah: [C: 03+2] hieradata: use eqiad.wmnet for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/954884 (owner: 10Majavah) [08:28:46] elukey: 👋 are you still going to deploy a patch? I need to re-run the train presync [08:28:55] (03PS2) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [08:28:58] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [08:29:35] jnuche: please go ahead! 
[08:29:37] (03CR) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:29:39] (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:29:57] elukey: thanks [08:31:02] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.25 refs T343727 [08:31:05] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [08:31:49] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:32:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:33:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.977 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:34:35] (03CR) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:34:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:36:24] (03PS2) 10Terasail: Add 'confirmed' to Wikifunctions sysop add and remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954363 (https://phabricator.wikimedia.org/T344261) [08:38:28] (03PS3) 
Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890)
[08:39:10] (CR) CI reject: [V: -1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[08:39:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:40:44] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[08:42:32] SRE, Infrastructure-Foundations, netops: Maintain ROAs for currently unannounced BGP assignments - https://phabricator.wikimedia.org/T345601 (cmooney) p:Triage→Low
[08:43:22] SRE, Infrastructure-Foundations, netops: Maintain ROAs for currently unannounced BGP assignments - https://phabricator.wikimedia.org/T345601 (ayounsi) Sounds good to me!
[08:47:29] (03PS1) 10Muehlenhoff: Reinstate the fix to filter duplicate Netty War blobs [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/954885 [08:48:02] (03CR) 10Muehlenhoff: "Tested on idp2002 (where it was failing before) with a local build" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/954885 (owner: 10Muehlenhoff) [08:48:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet [08:51:39] !log jnuche@deploy1002 sync-world aborted: testwikis wikis to 1.41.0-wmf.25 refs T343727 (duration: 20m 37s) [08:51:42] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [08:55:16] (03PS13) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [08:57:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [08:59:07] (03PS14) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [09:00:50] (03PS15) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [09:01:35] PROBLEM - Host mw1356 is DOWN: PING CRITICAL - Packet loss = 100% [09:04:46] !log powercycle mw1356.eqiad.wmnet [09:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [09:05:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet [09:07:07] RECOVERY - Host mw1356 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:11:47] bouncebot: now [09:12:09] 
jouncebot: now [09:12:09] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [09:13:25] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:13:51] (03CR) 10JMeybohm: [C: 03+2] CI: Run scaffold tests on module changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/954883 (owner: 10JMeybohm) [09:14:26] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:14:47] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:15:09] (03CR) 10CI reject: [V: 04-1] CI: Run scaffold tests on module changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/954883 (owner: 10JMeybohm) [09:16:31] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:44] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqiad ganeti-test - ayounsi@cumin1001" [09:17:25] (03PS1) 10Arnaudb: T343198: update pagelinks to add pl_target_id [software/schema-changes] - 10https://gerrit.wikimedia.org/r/954381 [09:17:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqiad ganeti-test - ayounsi@cumin1001" [09:17:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:17:46] (03CR) 10CI reject: [V: 04-1] T343198: update pagelinks to add pl_target_id 
[software/schema-changes] - 10https://gerrit.wikimedia.org/r/954381 (owner: 10Arnaudb) [09:17:57] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:06] !log ayounsi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti-test1001 [09:20:23] (03PS2) 10Arnaudb: Update pagelinks to add pl_target_id [software/schema-changes] - 10https://gerrit.wikimedia.org/r/954381 (https://phabricator.wikimedia.org/T343198) [09:20:42] (03CR) 10CI reject: [V: 04-1] Update pagelinks to add pl_target_id [software/schema-changes] - 10https://gerrit.wikimedia.org/r/954381 (https://phabricator.wikimedia.org/T343198) (owner: 10Arnaudb) [09:21:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti-test1001 [09:21:23] !log ayounsi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti-test1002 [09:22:33] (03PS3) 10Arnaudb: Update pagelinks to add pl_target_id [software/schema-changes] - 10https://gerrit.wikimedia.org/r/954381 (https://phabricator.wikimedia.org/T343198) [09:22:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti-test1002 [09:24:49] (03CR) 10Ladsgroup: [C: 03+1] Update pagelinks to add pl_target_id [software/schema-changes] - 10https://gerrit.wikimedia.org/r/954381 (https://phabricator.wikimedia.org/T343198) (owner: 10Arnaudb) [09:25:19] (03CR) 10Arnaudb: [C: 03+2] Update pagelinks to add pl_target_id [software/schema-changes] - 
https://gerrit.wikimedia.org/r/954381 (https://phabricator.wikimedia.org/T343198) (owner: Arnaudb)
[09:25:42] (Merged) jenkins-bot: Update pagelinks to add pl_target_id [software/schema-changes] - https://gerrit.wikimedia.org/r/954381 (https://phabricator.wikimedia.org/T343198) (owner: Arnaudb)
[09:26:33] !log ayounsi@cumin1001 START - Cookbook sre.hosts.provision for host ganeti-test1001.mgmt.eqiad.wmnet with reboot policy FORCED
[09:29:57] (CR) Ilias Sarantopoulos: [C: +1] Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: Elukey)
[09:30:50] jouncebot: next
[09:30:50] In 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1000)
[09:31:01] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:31:11] jnuche: o/ clear to deploy or are you still working on it?
[09:31:22] elukey: FYI if you're doing a scap deployment I'm rebooting parsoid servers at the moment
[09:31:25] So some hosts may fail
[09:31:36] claime: I can wait then, I was about to ping you next
[09:32:09] elukey: good to go from my side
[09:32:17] elukey: It's gonna take a while though, so might as well proceed and re-run if it fails
[09:32:35] I'm at like 4 of 24
[09:33:08] claime: not in a hurry, I can do it tomorrow
[09:33:11] ack
[09:33:29] sorry for the delay
[09:34:16] (PS1) Ilias Sarantopoulos: ores-extension: enable lift wing for most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/954886 (https://phabricator.wikimedia.org/T342115)
[09:34:28] (CR) Elukey: LiftWing: add latency/availability SLO dashboards (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman)
[09:36:42] (CR) Elukey: [C: +1] ores-extension: enable lift wing for most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/954886 (https://phabricator.wikimedia.org/T342115) (owner: Ilias Sarantopoulos)
[09:39:38] SRE, Infrastructure-Foundations, netops: Maintain ROAs for currently unannounced BGP assignments - https://phabricator.wikimedia.org/T345601 (cmooney) Open→Resolved a:cmooney I've added ROAs for our newer RIPE /24 range and the old esams one now to help protect against hi-jack / misuse....
[09:41:08] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test1001.mgmt.eqiad.wmnet with reboot policy FORCED [09:43:20] !log ayounsi@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test1001'] [09:43:36] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti-test1001'] [09:45:50] (03CR) 10Vgutierrez: [C: 03+1] "SSH key validated via Slack" [puppet] - 10https://gerrit.wikimedia.org/r/954879 (https://phabricator.wikimedia.org/T345132) (owner: 10Vgutierrez) [09:46:40] (03PS1) 10Hnowlan: rest-gateway: route requests to geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/954888 (https://phabricator.wikimedia.org/T336400) [09:47:19] (03CR) 10Fabfur: [C: 03+1] "Key seems correct" [puppet] - 10https://gerrit.wikimedia.org/r/954879 (https://phabricator.wikimedia.org/T345132) (owner: 10Vgutierrez) [09:47:50] (03PS1) 10Ladsgroup: auto_schema: Fix removal of skip [software] - 10https://gerrit.wikimedia.org/r/954889 [09:48:31] (03PS1) 10Hnowlan: trafficserver: route requests for geo-analytics via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/954890 (https://phabricator.wikimedia.org/T336400) [09:48:36] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Fix removal of skip [software] - 10https://gerrit.wikimedia.org/r/954889 (owner: 10Ladsgroup) [09:49:07] (03Merged) 10jenkins-bot: auto_schema: Fix removal of skip [software] - 10https://gerrit.wikimedia.org/r/954889 (owner: 10Ladsgroup) [09:49:28] !log failover ganeti master in esams/BY27 to ganeti3007 [09:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:36] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [09:52:49] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet 
with reason: Maintenance
[09:52:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T343198)', diff saved to https://phabricator.wikimedia.org/P52247 and previous config saved to /var/cache/conftool/dbconfig/20230905-095254-arnaudb.json
[09:52:58] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[09:53:09] PROBLEM - ganeti-wconfd running on ganeti3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:55:55] ^ expected
[09:57:00] (CR) Vgutierrez: [C: +2] admin: Add new SSH key for ppenloglou [puppet] - https://gerrit.wikimedia.org/r/954879 (https://phabricator.wikimedia.org/T345132) (owner: Vgutierrez)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1000)
[10:06:29] SRE, SRE-Access-Requests, Patch-For-Review: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (Vgutierrez) Stalled→Resolved a:Vgutierrez your new key should be deployed in the next ~30 minutes. Please do not upload it to gitlab/wikitech to prevent...
[10:06:59] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: rabbitmq: allow access from new designate node [puppet] - 10https://gerrit.wikimedia.org/r/954891 (https://phabricator.wikimedia.org/T345240) [10:07:07] (03PS16) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [10:07:09] (03PS1) 10Elukey: slo_template: allow spaces in dashboard names [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/954892 [10:07:20] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954891 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [10:07:59] (03CR) 10Ladsgroup: [C: 03+1] "when should we get the party started?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954886 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [10:08:35] (03PS17) 10Elukey: Lift Wing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [10:09:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: rabbitmq: allow access from new designate node [puppet] - 10https://gerrit.wikimedia.org/r/954891 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [10:10:54] (03CR) 10Elukey: "Tested on grafana1002 with a local grizzly repo + grr preview (using the next patch as testbed)." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/954892 (owner: 10Elukey) [10:13:41] 10SRE, 10Cumin, 10Infrastructure-Foundations, 10observability: how to deal with cumin alias alerts - https://phabricator.wikimedia.org/T268369 (10Vgutierrez) @Volans any idea on how could we potentially reduce the "false positives" of this alert? we got 7 occurrences in the last 30 days that apparently wer... 
[10:16:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:21:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:21:58] (03PS1) 10Cathal Mooney: Add includes for Netbox generated dns for new per-rack codfw subnets [dns] - 10https://gerrit.wikimedia.org/r/954893 (https://phabricator.wikimedia.org/T327938) [10:26:48] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:23] (03CR) 10Elukey: [C: 03+1] ores-extension: enable lift wing for most wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954886 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [10:28:33] (03CR) 10Btullis: Grant analytics-admins rights to run some git cmds as analytics-deploy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/950194 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [10:28:36] (03CR) 10Hnowlan: [C: 03+2] device-analytics: use global AQS configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/953982 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [10:29:19] (03PS9) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [10:29:21] (03Merged) 10jenkins-bot: device-analytics: use global AQS configuration files [deployment-charts] - 
10https://gerrit.wikimedia.org/r/953982 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [10:29:59] (03CR) 10CI reject: [V: 04-1] mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [10:30:43] (03CR) 10JMeybohm: [C: 03+1] otel-collector: export traces to jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [10:31:48] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:58] (03CR) 10JMeybohm: [C: 03+1] hieradata: set jaeger components services to production [puppet] - 10https://gerrit.wikimedia.org/r/954705 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [10:33:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:59] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [10:34:14] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [10:34:19] (03CR) 10JMeybohm: [C: 03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/954883 (owner: 10JMeybohm) [10:36:36] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [10:36:52] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [10:41:17] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [10:41:27] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply 
[10:42:59] (03Merged) 10jenkins-bot: CI: Run scaffold tests on module changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/954883 (owner: 10JMeybohm) [10:46:17] (03CR) 10Jbond: "see inline i dont think we need the same fix but maybe something similar" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/954885 (owner: 10Muehlenhoff) [10:47:27] (03CR) 10Jbond: Reinstate the fix to filter duplicate Netty War blobs (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/954885 (owner: 10Muehlenhoff) [10:48:07] (03PS4) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [10:48:53] (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [10:54:01] (03PS1) 10Cathal Mooney: Add static network defs and DHCP config for new codfw subnets [puppet] - 10https://gerrit.wikimedia.org/r/954896 (https://phabricator.wikimedia.org/T327938) [10:55:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] [toolsdb] Enable parallel replication [puppet] - 10https://gerrit.wikimedia.org/r/954722 (https://phabricator.wikimedia.org/T345450) (owner: 10FNegri) [10:56:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet [11:00:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [11:04:21] (03PS1) 10Ayounsi: Add temp ganeti-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/954898 (https://phabricator.wikimedia.org/T345602) [11:07:03] (03CR) 10Abijeet Patro: "This change is ready for review." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/954634 (owner: Abijeet Patro)
[11:07:46] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/954898 (https://phabricator.wikimedia.org/T345602) (owner: Ayounsi)
[11:08:03] (CR) Ayounsi: [C: +2] Add temp ganeti-test hosts [puppet] - https://gerrit.wikimedia.org/r/954898 (https://phabricator.wikimedia.org/T345602) (owner: Ayounsi)
[11:08:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet
[11:08:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3005.esams.wmnet
[11:09:11] !log kamila@cumin1001 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[11:09:14] !log kamila@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[11:14:38] SRE, Cumin, Infrastructure-Foundations, observability: how to deal with cumin alias alerts - https://phabricator.wikimedia.org/T268369 (Volans) @Vgutierrez could you please elaborate on the non-actionable part? The original statement about the `insetup` role is not correct, insetup hosts are man...
[11:15:38] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet [11:18:51] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti-test1001.eqiad.wmnet with OS bullseye [11:22:34] (03CR) 10Hnowlan: [C: 03+1] Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: 10Elukey) [11:22:47] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet [11:23:00] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet [11:24:44] !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=puppetboard-next [11:26:42] (03CR) 10Nikerabbit: [C: 03+1] Enable MinT translation service in more wikis - rollout #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954634 (owner: 10Abijeet Patro) [11:29:49] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet [11:30:02] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet [11:33:12] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet [11:36:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet [11:38:04] (03CR) 10KartikMistry: [C: 03+1] Enable MinT translation service in more wikis - rollout #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954634 (owner: 10Abijeet Patro) [11:38:59] (03PS4) 10KartikMistry: Enable MinT translation service in more wikis - rollout #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954634 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [11:40:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet [11:40:35] !log jiji@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet
[11:42:40] (PS1) Hnowlan: device-analytics: correct replica definition [deployment-charts] - https://gerrit.wikimedia.org/r/954902
[11:45:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet
[11:45:34] (PS1) Clément Goubert: sre.discovery.datacenter: Add services to EXCLUDED_SERVICES [cookbooks] - https://gerrit.wikimedia.org/r/954904
[11:45:40] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet
[11:47:11] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[11:48:07] elukey: You can deploy if you want, parsoid hosts are done rebooting
[11:49:43] SRE, Cumin, Infrastructure-Foundations, observability: how to deal with cumin alias alerts - https://phabricator.wikimedia.org/T268369 (Volans) From the last email: ` Alias cloudbackup matched 0 hosts Alias durum-esams matched 0 hosts DC aliases do not cover all hosts: flink-zk2001.codfw.wmnet,k...
[11:50:20] (CR) Volans: [C: +1] "LGTM for netbox" [cookbooks] - https://gerrit.wikimedia.org/r/954904 (owner: Clément Goubert)
[11:50:55] (CR) Clément Goubert: "Adding Moritz for apt" [cookbooks] - https://gerrit.wikimedia.org/r/954904 (owner: Clément Goubert)
[11:51:33] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti-test1001.eqiad.wmnet with OS bullseye
[11:52:10] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti-test1001.eqiad.wmnet with OS bullseye
[11:52:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet
[11:52:49] SRE, Wikimedia-Etherpad, collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.2 - https://phabricator.wikimedia.org/T316421 (LSobanski)
[11:52:54] SRE, Wikimedia-Etherpad, collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.2 - https://phabricator.wikimedia.org/T316421 (LSobanski) I updated the description to reflect the new Etherpad release (1.9.2). See below for a list of changes: * Security ** Enable session key rotation: This se...
[11:53:09] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.2 - https://phabricator.wikimedia.org/T316421 (10LSobanski) [11:54:03] (03PS10) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [11:54:38] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set jaeger components services to production [puppet] - 10https://gerrit.wikimedia.org/r/954705 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [11:54:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:55:07] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:56:34] /17 [11:57:40] (03CR) 10Muehlenhoff: sre.discovery.datacenter: Add services to EXCLUDED_SERVICES (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 (owner: 10Clément Goubert) [11:58:51] (03PS2) 10Clément Goubert: sre.discovery.datacenter: Add services to EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 [11:58:53] (03CR) 10Clément Goubert: sre.discovery.datacenter: Add services to EXCLUDED_SERVICES (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 (owner: 10Clément Goubert) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1200) [12:00:27] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:00:53] RECOVERY - Router interfaces on 
cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:06:43] (03PS3) 10Clément Goubert: sre.discovery.datacenter: Add services to EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 [12:07:12] (03Abandoned) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [12:10:14] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet [12:10:27] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet [12:10:30] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10LSobanski) [12:10:50] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10LSobanski) [12:11:20] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10LSobanski) [12:11:42] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10LSobanski) [12:14:34] (03CR) 10Filippo Giunchedi: [C: 03+2] otel-collector: export traces to jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [12:14:43] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test1001.eqiad.wmnet with reason: host reimage [12:16:33] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [12:16:47] !log jiji@cumin1001 END (PASS) 
- Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet [12:16:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 (owner: 10Clément Goubert) [12:16:56] (03PS2) 10Jbond: confd::file: drop relative prefix [puppet] - 10https://gerrit.wikimedia.org/r/893471 (https://phabricator.wikimedia.org/T330849) [12:16:58] (03PS2) 10Jbond: P:confd: Add support for discovery facts [puppet] - 10https://gerrit.wikimedia.org/r/893496 (https://phabricator.wikimedia.org/T330849) [12:17:02] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [12:17:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet [12:17:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-test1001.eqiad.wmnet with reason: host reimage [12:18:49] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [12:18:59] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [12:19:56] (03CR) 10CI reject: [V: 04-1] P:confd: Add support for discovery facts [puppet] - 10https://gerrit.wikimedia.org/r/893496 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [12:20:55] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti-test1002.eqiad.wmnet with OS bullseye [12:23:08] (03CR) 10Jbond: [C: 04-1] "Still needs work" [puppet] - 10https://gerrit.wikimedia.org/r/893496 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [12:29:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 (owner: 10Clément Goubert) [12:30:02] claime: ack thanks! 
[12:30:05] jouncebot: next [12:30:05] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1300) [12:30:38] looks like everything is clear, I'll deploy my patch [12:31:59] 10SRE, 10Wikimedia-Mailing-lists: Appoint new moderators for Italian Wikipedia mailing list - https://phabricator.wikimedia.org/T345604 (10Ruthven) I would like to add that the consensus on it.wiki is to have all appointed admins of the ML to have an **NDA signed**. WMF ones will do. The 3 elected admins all h... [12:33:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by elukey@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: 10Elukey) [12:33:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 10 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43144/console" [puppet] - 10https://gerrit.wikimedia.org/r/893471 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [12:35:02] (03Merged) 10jenkins-bot: Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: 10Elukey) [12:35:29] !log elukey@deploy1002 Started scap: Backport for [[gerrit:954593|Add new OAuth Rate Limiter tier for Wiki Education (T345394)]] [12:35:32] T345394: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 [12:35:46] (03PS1) 10Filippo Giunchedi: otel-coll: allow outbound 30443/tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/954909 (https://phabricator.wikimedia.org/T343302) [12:36:24] (03CR) 10Filippo Giunchedi: mesh: add tracing support (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [12:37:02] !log ayounsi@cumin1001 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - ayounsi@cumin1001" [12:37:07] !log elukey@deploy1002 elukey: Backport for [[gerrit:954593|Add new OAuth Rate Limiter tier for Wiki Education (T345394)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:37:21] !log elukey@deploy1002 elukey: Continuing with sync [12:38:43] (03CR) 10JMeybohm: [C: 04-1] "This will effectively allow talking to everything that is behind some ingress on some cluster - which I think is not a good idea. I'm afra" [deployment-charts] - 10https://gerrit.wikimedia.org/r/954909 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [12:39:19] 10SRE-swift-storage, 10Commons, 10media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) [12:42:24] (03PS2) 10Filippo Giunchedi: otel-coll: allow outbound 30443/tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/954909 (https://phabricator.wikimedia.org/T343302) [12:42:32] (03CR) 10CI reject: [V: 04-1] otel-coll: allow outbound 30443/tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/954909 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [12:42:52] (03CR) 10Filippo Giunchedi: otel-coll: allow outbound 30443/tcp (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954909 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [12:43:16] (03PS3) 10Filippo Giunchedi: otel-coll: allow outbound 30443/tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/954909 (https://phabricator.wikimedia.org/T343302) [12:43:18] !log elukey@deploy1002 Finished scap: Backport for [[gerrit:954593|Add new OAuth Rate Limiter tier for Wiki Education (T345394)]] 
(duration: 07m 49s) [12:43:21] T345394: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 [12:43:55] 10SRE-swift-storage, 10Commons, 10media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) @AntiCompositeNumber will you upload the original file attached here? Not being an admin on commons, I cannot delete older... [12:43:55] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test1002.eqiad.wmnet with reason: host reimage [12:44:55] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS bullseye [12:45:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T343198)', diff saved to https://phabricator.wikimedia.org/P52252 and previous config saved to /var/cache/conftool/dbconfig/20230905-124528-arnaudb.json [12:45:31] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [12:45:43] (03CR) 10JMeybohm: mesh: add tracing support (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [12:47:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-test1002.eqiad.wmnet with reason: host reimage [12:49:18] (03CR) 10JMeybohm: [C: 03+1] otel-coll: allow outbound 30443/tcp (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954909 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [12:50:39] (03PS1) 10Slyngshede: C:bigtop::hadoop switch to new topology script. 
[puppet] - 10https://gerrit.wikimedia.org/r/954911 [12:52:35] (03CR) 10Filippo Giunchedi: [C: 03+2] otel-coll: allow outbound 30443/tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/954909 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [12:54:23] (03PS2) 10Slyngshede: C:bigtop::hadoop switch to new topology script. [puppet] - 10https://gerrit.wikimedia.org/r/954911 [12:54:47] (03PS4) 10Clément Goubert: sre.discovery.datacenter: Add services to EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 [12:54:54] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [12:54:58] (03CR) 10Clément Goubert: sre.discovery.datacenter: Add services to EXCLUDED_SERVICES (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 (owner: 10Clément Goubert) [12:55:05] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [12:55:13] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [12:55:28] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [12:55:47] 10SRE, 10Cumin, 10Infrastructure-Foundations, 10observability: how to deal with cumin alias alerts - https://phabricator.wikimedia.org/T268369 (10Vgutierrez) 05Open→03Declined yeah.. clearly I didn't phrase that properly, I was saying it from the PoV of Clinic Duty. Considering your feedback about the... [12:56:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43146/console" [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1300). [13:00:05] Dreamy_Jazz, phuedx, and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] \o [13:00:19] o/ [13:00:28] o/ [13:00:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P52254 and previous config saved to /var/cache/conftool/dbconfig/20230905-130034-arnaudb.json [13:00:46] o/ [13:01:06] (03PS2) 10Majavah: Disable EchoMail and EchoInteraction instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950180 (https://phabricator.wikimedia.org/T344167) (owner: 10Phuedx) [13:01:15] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet [13:01:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950180 (https://phabricator.wikimedia.org/T344167) (owner: 10Phuedx) [13:01:41] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet [13:02:16] (03Merged) 10jenkins-bot: Disable EchoMail and EchoInteraction instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950180 (https://phabricator.wikimedia.org/T344167) (owner: 10Phuedx) [13:02:22] Dreamy_Jazz: yours isn't strictly speaking a backport or a config change. but let's see anyways [13:02:43] !log taavi@deploy1002 Started scap: Backport for [[gerrit:950180|Disable EchoMail and EchoInteraction instruments (T344167)]] [13:02:45] Okay, [13:02:48] *Okay. [13:02:49] T344167: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 [13:02:59] Wasn't sure how else to get the maintenance script run. 
[13:03:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - ayounsi@cumin1001" [13:03:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test1001.eqiad.wmnet with OS bullseye [13:03:39] yeah we don't have a good process for that [13:03:52] your command is missing the /maintenance/ in the script path [13:04:16] Dreamy_Jazz: https://phabricator.wikimedia.org/P52255 [13:04:17] Oh yeah [13:04:18] !log taavi@deploy1002 taavi and phuedx: Backport for [[gerrit:950180|Disable EchoMail and EchoInteraction instruments (T344167)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:04:22] phuedx: please test [13:04:34] Thanks for finding and fixing that :) [13:05:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye [13:05:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye [13:05:58] (03PS7) 10Majavah: tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [13:06:03] (03PS4) 10Majavah: tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [13:06:22] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [13:06:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2026.codfw.wmnet with OS bullseye [13:06:25] 
!log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [13:06:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2026.codfw.wmnet with OS bullseye [13:06:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2025.codfw.wmnet with reason: host reimage [13:06:48] taavi: Notifications UIs (flyout and Special:Notifications) are still working. LGTM [13:07:05] !log taavi@deploy1002 taavi and phuedx: Continuing with sync [13:07:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet [13:07:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet [13:07:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2027.codfw.wmnet with OS bullseye [13:08:05] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2027.codfw.wmnet with OS bullseye [13:08:11] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) Thanks. Is there any way that TTL cap could be raised for thumbnails? 
[13:08:38] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - ayounsi@cumin1001" [13:09:01] (03CR) 10Majavah: [C: 03+2] tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [13:09:08] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs1005 - https://phabricator.wikimedia.org/T345391 (10Papaul) @RKemper hello any reason why this task is assigned to me? [13:09:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2027.codfw.wmnet with reason: host reimage [13:09:09] (03CR) 10Majavah: [C: 03+2] tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [13:09:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2025.codfw.wmnet with reason: host reimage [13:09:43] (03Merged) 10jenkins-bot: tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [13:09:49] (03Merged) 10jenkins-bot: tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [13:10:36] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) @JMeybohm @jbond @Volans thanks. 
[13:11:00] jouncebot: nowandnext [13:11:00] For the next 0 hour(s) and 48 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1300) [13:11:00] In 0 hour(s) and 48 minute(s): DC switchover live test (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1400) [13:11:22] hello [13:11:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2027.codfw.wmnet with reason: host reimage [13:11:52] (03CR) 10Jbond: [C: 03+1] sre.discovery.datacenter: Add services to EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 (owner: 10Clément Goubert) [13:12:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2028.codfw.wmnet with OS bullseye [13:12:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2028.codfw.wmnet with OS bullseye [13:12:41] Amir1: we should have plenty of time before the switchover [13:12:43] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10Vgutierrez) Cache revalidation can further extend this period. After the initial 24-hour limit has passed, ATS will issue a conditional request to the backe... 
[13:12:57] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:950180|Disable EchoMail and EchoInteraction instruments (T344167)]] (duration: 10m 14s) [13:13:00] T344167: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 [13:13:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2028.codfw.wmnet with reason: host reimage [13:13:29] ah wait there are already backports in progress [13:13:39] yeah [13:13:48] !log taavi@deploy1002 Started scap: Backport for [[gerrit:953652|tlywiki: add metanamespace , timezone, sitename (T345316)]], [[gerrit:954050|tlywiki: Add logos (T345316)]] [13:13:51] (03PS2) 10Muehlenhoff: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [13:13:51] T345316: Initial configurations for tlywiki - https://phabricator.wikimedia.org/T345316 [13:15:12] (03CR) 10Majavah: "May I ask why my -2 on this was forcibly removed? 
It had been waiting for my re-review for a quite short time beforehand (plus pinging me " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [13:15:26] !log taavi@deploy1002 taavi and anzx: Backport for [[gerrit:953652|tlywiki: add metanamespace , timezone, sitename (T345316)]], [[gerrit:954050|tlywiki: Add logos (T345316)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:15:30] testing [13:15:33] aanzx: please test [13:15:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P52257 and previous config saved to /var/cache/conftool/dbconfig/20230905-131540-arnaudb.json [13:15:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2028.codfw.wmnet with reason: host reimage [13:16:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2026.codfw.wmnet with reason: host reimage [13:16:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - ayounsi@cumin1001" [13:16:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test1002.eqiad.wmnet with OS bullseye [13:17:55] taavi: tested , looks good [13:18:04] !log taavi@deploy1002 taavi and anzx: Continuing with sync [13:18:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:18:43] Amir1: elukey: my last patch is currently syncing, if you want to deploy something afterwards that's fine by me [13:18:55] !log pt1979@cumin2002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2026.codfw.wmnet with reason: host reimage [13:20:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:20:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2025.codfw.wmnet with OS bullseye [13:20:35] (03PS3) 10Muehlenhoff: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [13:20:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye completed: - kubernetes2025 (**PASS*... [13:21:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2029.codfw.wmnet with OS bullseye [13:21:16] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:21:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2029.codfw.wmnet with OS bullseye [13:21:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [13:22:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [13:22:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:22:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2027.codfw.wmnet with OS bullseye [13:22:34] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2027.codfw.wmnet with OS bullseye completed: - kubernetes2027 (**PASS*... [13:23:02] taavi: can you run namespaceDupes.php on tlywiki after sync [13:23:12] sure [13:23:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [13:23:50] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:24:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2030.codfw.wmnet with OS bullseye [13:24:07] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:953652|tlywiki: add metanamespace , timezone, sitename (T345316)]], [[gerrit:954050|tlywiki: Add logos (T345316)]] (duration: 10m 18s) [13:24:10] T345316: Initial configurations for tlywiki - https://phabricator.wikimedia.org/T345316 [13:24:22] aanzx: and that's a no-op :) [13:24:29] Amir1: elukey: floor is yours [13:24:39] taavi: thanks [13:24:46] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2030.codfw.wmnet with OS bullseye [13:25:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:25:12] !log 
pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2028.codfw.wmnet with OS bullseye [13:25:16] RECOVERY - Host mw2448 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [13:25:21] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2028.codfw.wmnet with OS bullseye completed: - kubernetes2028 (**PASS*... [13:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:25:44] 10SRE, 10ops-codfw, 10serviceops: mw2448 is unreachable again - https://phabricator.wikimedia.org/T345597 (10Jhancock.wm) a:03Jhancock.wm [13:26:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [13:26:21] (03PS1) 10EoghanGaffney: gitlab: Add unlock command to gitlab-backup script [puppet] - 10https://gerrit.wikimedia.org/r/954916 [13:26:48] PROBLEM - puppet last run on mw2448 is CRITICAL: CRITICAL: Puppet last ran 19 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:27:01] thanks [13:27:29] awesome [13:28:05] (03CR) 10Ladsgroup: [C: 03+2] ores-extension: enable lift wing for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954886 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [13:28:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for cjming - https://phabricator.wikimedia.org/T345455 (10Vgutierrez) 05Stalled→03In progress key validated via Slack [13:28:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for cjming - 
https://phabricator.wikimedia.org/T345455 (10Vgutierrez) [13:29:03] (03Merged) 10jenkins-bot: ores-extension: enable lift wing for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954886 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [13:30:16] (03CR) 10Slyngshede: [V: 03+1] "I'm a little sure about being able to have a option to the script within the configuration file. If that's not allowed we'll add a small w" [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [13:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:30:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T343198)', diff saved to https://phabricator.wikimedia.org/P52258 and previous config saved to /var/cache/conftool/dbconfig/20230905-133046-arnaudb.json [13:30:47] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1132.eqiad.wmnet with OS bullseye [13:30:51] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [13:31:08] (03PS1) 10Vgutierrez: admin: Add cjming to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/954917 (https://phabricator.wikimedia.org/T345455) [13:31:50] RECOVERY - puppet last run on mw2448 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:32:34] 10SRE, 10Cumin, 10Infrastructure-Foundations, 10observability: how to deal with cumin alias alerts - https://phabricator.wikimedia.org/T268369 (10Volans) >>! In T268369#9142380, @Vgutierrez wrote: > yeah.. clearly I didn't phrase that properly, I was saying it from the PoV of Clinic Duty. 
From the PoV of... [13:33:40] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:954886|ores-extension: enable lift wing for most wikis (T342115)]] [13:33:43] T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 [13:33:55] * elukey watches traffic on Lift Wing [13:34:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:35:17] !log ladsgroup@deploy1002 ladsgroup and isaranto: Backport for [[gerrit:954886|ores-extension: enable lift wing for most wikis (T342115)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:35:31] (03PS1) 10Ssingh: site: set proper role for durum300[34] [puppet] - 10https://gerrit.wikimedia.org/r/954918 [13:35:51] (03CR) 10DCausse: Draft: cirrus streaming updater service (0312 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [13:36:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:36:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2026.codfw.wmnet with OS bullseye [13:36:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2026.codfw.wmnet with OS bullseye completed: - kubernetes2026 (**PASS*... 
[13:37:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2031.codfw.wmnet with OS bullseye [13:37:46] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2031.codfw.wmnet with OS bullseye [13:39:12] (03PS1) 10Esanders: Turn of DiscussionTools A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) [13:40:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2029.codfw.wmnet with reason: host reimage [13:41:58] (03CR) 10Ssingh: [C: 03+2] site: set proper role for durum300[34] [puppet] - 10https://gerrit.wikimedia.org/r/954918 (owner: 10Ssingh) [13:42:04] (03PS3) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [13:42:44] (03CR) 10CI reject: [V: 04-1] datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [13:42:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum3003.esams.wmnet with OS bookworm [13:43:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum3004.esams.wmnet with OS bookworm [13:43:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2029.codfw.wmnet with reason: host reimage [13:44:35] 10SRE, 10Cumin, 10Infrastructure-Foundations, 10observability: how to deal with cumin alias alerts - https://phabricator.wikimedia.org/T268369 (10ssingh) > durum-esams The hosts were not provisioned in esams but I am fixing that by provisioning them so the durum ones should go away. Thanks for the task up... 
[13:45:04] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) Right, and swift does say 304 quite a lot; but that isn't very helpful for thumbs - swift can only say 304 because it's storing all these thu... [13:45:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2032.codfw.wmnet with OS bullseye [13:45:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2032.codfw.wmnet with OS bullseye [13:46:10] !log ladsgroup@deploy1002 ladsgroup and isaranto: Continuing with sync [13:46:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954917 (https://phabricator.wikimedia.org/T345455) (owner: 10Vgutierrez) [13:47:31] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: eno1 has interface errors - https://phabricator.wikimedia.org/T345430 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [13:48:13] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:48:26] 10SRE, 10Cumin, 10Infrastructure-Foundations, 10observability: how to deal with cumin alias alerts - https://phabricator.wikimedia.org/T268369 (10MoritzMuehlenhoff) > From the PoV of Clinic Duty I think that the action should be to ping the host/alias owner and ask them to fix it unless is super trivial th... 
[13:49:28] (03PS5) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [13:49:34] (03PS4) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [13:50:01] (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [13:50:11] (03CR) 10CI reject: [V: 04-1] datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [13:50:33] (03CR) 10FNegri: [C: 03+2] [toolsdb] Enable parallel replication [puppet] - 10https://gerrit.wikimedia.org/r/954722 (https://phabricator.wikimedia.org/T345450) (owner: 10FNegri) [13:51:13] (03CR) 10Muehlenhoff: [C: 03+1] Grant analytics-admins rights to run some git cmds as analytics-deploy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/950194 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [13:52:14] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:954886|ores-extension: enable lift wing for most wikis (T342115)]] (duration: 18m 33s) [13:52:19] T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 [13:52:35] (03CR) 10Muehlenhoff: [C: 03+2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/954707 (owner: 10Muehlenhoff) [13:53:15] (03Abandoned) 10David Caro: cloudlb: move to wmcs prometheus [puppet] - 10https://gerrit.wikimedia.org/r/948104 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [13:53:16] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - 
https://phabricator.wikimedia.org/T345633 (10brouberol) [13:53:25] (03CR) 10Muehlenhoff: [C: 03+2] Remove visualdiff client/server from testreduce role [puppet] - 10https://gerrit.wikimedia.org/r/954682 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff) [13:53:48] (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [13:54:57] (03PS5) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [13:54:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testreduce1002.eqiad.wmnet with OS bookworm [13:55:48] (03CR) 10CI reject: [V: 04-1] datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [13:57:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:59:33] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Jclark-ctr) Server is out of warranty @MoritzMuehlenhoff I have some out of recently decom server i can replace it with. can this be done at any time? DIMM B 6 BankLabel: B CurrentOpe... [14:00:05] kamila_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) DC switchover live test deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1400). 
[14:01:14] (03CR) 10Btullis: [C: 03+2] Grant analytics-admins rights to run some git cmds as analytics-deploy [puppet] - 10https://gerrit.wikimedia.org/r/950194 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [14:01:14] !log kamila@cumin1001 START - Cookbook sre.discovery.datacenter depool all services in codfw: Datacenter Switchover Live test - T345588 [14:01:19] T345588: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 [14:01:20] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Ladsgroup) Yeah, let me know a bit beforehand to shut down the server just in case. [14:01:29] 10SRE, 10serviceops, 10Datacenter-Switchover: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Switchover Live test - T345588 started. [14:02:10] (03CR) 10Vgutierrez: [C: 03+2] admin: Add cjming to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/954917 (https://phabricator.wikimedia.org/T345455) (owner: 10Vgutierrez) [14:02:50] 10SRE, 10ops-codfw, 10serviceops: mw2448 is unreachable again - https://phabricator.wikimedia.org/T345597 (10Jhancock.wm) @MoritzMuehlenhoff I got into the idrac gui and there's nothing of interest to report. There's no BIOS or CPU errors this time. I do see something in the system lifecycle logs about seve... [14:02:54] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Jclark-ctr) @Eevans we do not have that optic at our site in eqiad and never have wave2wave is a newer distributer we are using for optics and cables. it might be a m... 
[14:02:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3003.esams.wmnet with reason: host reimage [14:03:53] (03CR) 10Btullis: [C: 03+2] Update the maximum message size in kafka for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/954690 (https://phabricator.wikimedia.org/T344688) (owner: 10Btullis) [14:04:38] (03Merged) 10jenkins-bot: Update the maximum message size in kafka for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/954690 (https://phabricator.wikimedia.org/T344688) (owner: 10Btullis) [14:04:42] (03PS1) 10Ayounsi: Add mock TLS key for ganeti-test01.svc.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/954924 (https://phabricator.wikimedia.org/T345602) [14:04:54] (03CR) 10JMeybohm: mesh: add tracing support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:05:13] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add mock TLS key for ganeti-test01.svc.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/954924 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [14:05:42] (03PS6) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [14:05:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-admins for cjming - https://phabricator.wikimedia.org/T345455 (10Vgutierrez) 05In progress→03Resolved change should be effective in ~30 minutes after puppet runs on the impacted hosts. 
[14:06:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: host reimage [14:06:17] (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [14:06:49] PROBLEM - mediawiki-installation DSH group on mw2448 is CRITICAL: Host mw2448 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:06:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testreduce1002.eqiad.wmnet with reason: host reimage [14:08:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:44] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Jclark-ctr) @Eevans disposed of old optic and replaced cable can you verify if error is still present? Previous eqiad onsite staff did not always dispose of defect... [14:09:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testreduce1002.eqiad.wmnet with reason: host reimage [14:10:37] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet [14:10:39] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T345480 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm related to T344110. 
idrac is back up [14:10:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [14:11:44] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T345592 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact. [14:12:25] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Jclark-ctr) @Ladsgroup i am available now if you would like to shut down server [14:13:00] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:13:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3004.esams.wmnet with reason: host reimage [14:13:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:13:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2029.codfw.wmnet with OS bullseye [14:13:37] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2029.codfw.wmnet with OS bullseye completed: - kubernetes2029 (**PASS*... [14:13:57] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Ladsgroup) sure, give me a bit. 
[14:14:59] (03PS11) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [14:15:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for brouberol - https://phabricator.wikimedia.org/T345633 (10Aklapper) [14:15:22] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:15:36] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [14:15:39] (03CR) 10CI reject: [V: 04-1] mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:16:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3004.esams.wmnet with reason: host reimage [14:16:15] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [14:16:25] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Ladsgroup) it should be shut down now. [14:16:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet [14:17:01] (03CR) 10Filippo Giunchedi: mesh: add tracing support (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:17:30] (03CR) 10Herron: "nice idea! 
I'm thinking it may be a bit more intuitive to follow this approach but do essentially the opposite -- accept spaces in the sl" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/954892 (owner: 10Elukey) [14:17:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2030.codfw.wmnet with OS bullseye [14:17:50] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2030.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [14:18:05] (03PS1) 10Ayounsi: Add ganeti-test01.svc.eqiad.wmnet public cert [puppet] - 10https://gerrit.wikimedia.org/r/954946 (https://phabricator.wikimedia.org/T345602) [14:18:05] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [14:18:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:21:18] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [14:21:30] (03CR) 10Muehlenhoff: [C: 03+1] Add ganeti-test01.svc.eqiad.wmnet public cert [puppet] - 10https://gerrit.wikimedia.org/r/954946 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [14:21:38] (03CR) 10Ayounsi: [C: 03+2] Add ganeti-test01.svc.eqiad.wmnet public cert [puppet] - 10https://gerrit.wikimedia.org/r/954946 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [14:21:44] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [14:22:01] (03PS1) 10Alexandros Kosiaris: PHPFPMTooBusy: Point to public available runbook [alerts] 
- 10https://gerrit.wikimedia.org/r/954947 [14:22:52] 10SRE, 10ops-codfw, 10serviceops: mw2448 is unreachable again - https://phabricator.wikimedia.org/T345597 (10MoritzMuehlenhoff) Thanks, let's retry with latest firmware, if anything happens again we can still open a case with Dell. [14:23:10] (03PS7) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [14:23:17] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:23:39] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [14:23:43] (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [14:23:46] (03CR) 10CI reject: [V: 04-1] PHPFPMTooBusy: Point to public available runbook [alerts] - 10https://gerrit.wikimedia.org/r/954947 (owner: 10Alexandros Kosiaris) [14:24:11] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [14:24:30] (03PS12) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [14:24:36] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Jclark-ctr) Dimm Replaced @Ladsgroup it is booting up now Thanks for your assistance [14:24:41] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [14:25:10] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Ladsgroup) Thanks! 
[14:25:30] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs1005 - https://phabricator.wikimedia.org/T345391 (10Jclark-ctr) a:05Papaul→03Jclark-ctr [14:25:38] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [14:25:47] !log kamila@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in codfw: Datacenter Switchover Live test - T345588 [14:25:50] T345588: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 [14:25:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2031.codfw.wmnet with OS bullseye [14:25:53] 10SRE, 10serviceops, 10Datacenter-Switchover: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Switchover Live test - T345588 completed. [14:26:00] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2031.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [14:26:04] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [14:26:08] !log kamila@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: Datacenter Switchover Live test - T345588 [14:26:14] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [14:26:15] 10SRE, 10serviceops, 10Datacenter-Switchover: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: Datacenter Switchover Live test - T345588 started. 
[14:28:45] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqiad ganeti-test vip - ayounsi@cumin1001" [14:29:01] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:29:03] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs1005 - https://phabricator.wikimedia.org/T345391 (10Jclark-ctr) [14:29:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqiad ganeti-test vip - ayounsi@cumin1001" [14:29:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:30:13] (03PS8) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [14:30:42] (03CR) 10Herron: "LGTM overall, although I'll hold off on +1 while the patch below this one is sorted" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [14:30:46] PROBLEM - Host mc2040 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:48] (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [14:31:08] (03PS1) 10Ayounsi: ganeti-test100[12]: assign ganeti-test role [puppet] - 10https://gerrit.wikimedia.org/r/954950 (https://phabricator.wikimedia.org/T345602) [14:31:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testreduce1002.eqiad.wmnet with OS bookworm [14:31:21] (03Abandoned) 10Muehlenhoff: Reinstate the fix to filter duplicate Netty War blobs [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/954885 
(owner: 10Muehlenhoff) [14:32:52] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply security updates - bking@cumin1001 - T344587 [14:33:09] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2032.codfw.wmnet with OS bullseye [14:33:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2032.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [14:34:06] (03PS4) 10Muehlenhoff: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [14:34:53] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs1005 - https://phabricator.wikimedia.org/T345391 (10Jclark-ctr) 05Open→03Resolved [14:35:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954950 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [14:36:04] RECOVERY - MariaDB read only x1 on db1137 is OK: Version 10.6.12-MariaDB-log, Uptime 35s, read_only: True, event_scheduler: True, 13.34 QPS, connection latency: 0.007972s, query latency: 0.015232s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:36:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [14:36:13] RECOVERY - mysqld processes #page on db1137 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:36:22] 10SRE, 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 
(10Jhancock.wm) ` 2023-08-27 00:34:10 Disk 3 in Backplane 1 of Storage Controller in SL 3 is removed.` I can assure you that it wasn't physically removed. Local time, that was a Sa... [14:37:25] RECOVERY - MariaDB Replica IO: x1 #page on db1137 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:38:11] RECOVERY - MariaDB Replica SQL: x1 #page on db1137 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:38:20] 10SRE, 10ops-codfw, 10serviceops: mw2448 is unreachable again - https://phabricator.wikimedia.org/T345597 (10Jhancock.wm) 05Open→03Resolved [14:38:44] 10SRE, 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Jhancock.wm) a:03Jhancock.wm [14:39:28] (03CR) 10Ayounsi: [C: 03+2] ganeti-test100[12]: assign ganeti-test role [puppet] - 10https://gerrit.wikimedia.org/r/954950 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [14:40:22] PROBLEM - Check systemd state on elastic1097 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:24] The live-test for the datacenter switchover will be running past the deployment window we'd set for it. 
There's nothing scheduled deployment-wise until 1800UTC, but if you have something to deploy outside of it, please tell us and we'll delay a bit [14:40:36] PROBLEM - Check systemd state on elastic1101 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:46] RECOVERY - Check systemd state on elastic1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:58] RECOVERY - Check systemd state on elastic1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:09] (03PS1) 10Ayounsi: Revert "ganeti-test100[12]: assign ganeti-test role" [puppet] - 10https://gerrit.wikimedia.org/r/954940 [14:43:47] (03CR) 10Ayounsi: [C: 03+2] Revert "ganeti-test100[12]: assign ganeti-test role" [puppet] - 10https://gerrit.wikimedia.org/r/954940 (owner: 10Ayounsi) [14:48:01] (03PS1) 10Ssingh: hiera/acme-chief: add durum300[34] to authorized_hosts [puppet] - 10https://gerrit.wikimedia.org/r/954954 [14:48:33] 10SRE, 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10Jclark-ctr) [14:48:55] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: Decommission thumbor100[12] - https://phabricator.wikimedia.org/T344598 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:49:03] (03CR) 10Ssingh: [C: 03+2] hiera/acme-chief: add durum300[34] to authorized_hosts [puppet] - 10https://gerrit.wikimedia.org/r/954954 (owner: 10Ssingh) [14:49:13] (SystemdUnitFailed) firing: (8) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1096:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:50:09] !log kamila@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: Datacenter Switchover Live test - T345588 [14:50:13] T345588: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 [14:50:15] 10SRE, 10serviceops, 10Datacenter-Switchover: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: Datacenter Switchover Live test - T345588 completed. [14:50:43] (03CR) 10Btullis: Increase the kafka-jumbo maximum message size to 10 MB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [14:51:33] (03Abandoned) 10Btullis: Add a dummy datahub_encryption_key value [labs/private] - 10https://gerrit.wikimedia.org/r/777752 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:53:25] (03CR) 10Ladsgroup: [C: 03+1] mariadb::packages_client: Remove obsolete OS check [puppet] - 10https://gerrit.wikimedia.org/r/954594 (owner: 10Muehlenhoff) [14:54:12] (SystemdUnitFailed) resolved: (8) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1096:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:08] PROBLEM - Check systemd state on elastic1092 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:18] PROBLEM - Check systemd state on elastic1090 is CRITICAL: CRITICAL - degraded: The following 
units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:44] RECOVERY - Check systemd state on elastic1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:04] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:57:44] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:58:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:58:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3004.esams.wmnet with OS bookworm [14:59:08] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:59:12] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:22] (03PS1) 10Bking: rdf-streaming-updater: Resolve contradictory configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954957 (https://phabricator.wikimedia.org/T344614) [14:59:46] 10SRE, 10ops-codfw, 10Data-Platform-SRE: DegradedArray event 
on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Vgutierrez) I don't think so as it's still using role `insetup::search_platform` but @bking and @RKemper should have more context about it [15:00:31] (03CR) 10Muehlenhoff: [C: 03+2] mariadb::packages_client: Remove obsolete OS check [puppet] - 10https://gerrit.wikimedia.org/r/954594 (owner: 10Muehlenhoff) [15:00:39] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: Resolve contradictory configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954957 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [15:00:41] (03PS2) 10Muehlenhoff: mariadb::packages_client: Remove obsolete OS check [puppet] - 10https://gerrit.wikimedia.org/r/954594 [15:01:09] (03PS2) 10Bking: rdf-streaming-updater: Resolve contradictory configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954957 (https://phabricator.wikimedia.org/T344614) [15:01:13] (03PS5) 10Arturo Borrero Gonzalez: cloudservices: enable ns0-next.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) [15:01:47] (03PS1) 10Andrew Bogott: cloudservices1006: include all three (for now) pdns auth endpoints [puppet] - 10https://gerrit.wikimedia.org/r/954959 [15:03:16] PROBLEM - Check systemd state on elastic1089 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:54] (03CR) 10Elukey: slo_template: allow spaces in dashboard names (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/954892 (owner: 10Elukey) [15:04:09] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices1006: include all three (for now) pdns auth endpoints [puppet] - 10https://gerrit.wikimedia.org/r/954959 (owner: 10Andrew Bogott) [15:04:12] (SystemdUnitFailed) firing: (11) 
prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:34] !log kamila@deploy1002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover Live Test - T345588 [15:04:40] RECOVERY - Check systemd state on elastic1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:42] T345588: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 [15:05:34] (03PS1) 10Papaul: Fix typo on kubernetes203[0-9] missing s at the end [puppet] - 10https://gerrit.wikimedia.org/r/954960 (https://phabricator.wikimedia.org/T342534) [15:06:08] (03PS9) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [15:06:38] (03CR) 10Papaul: [C: 03+2] Fix typo on kubernetes203[0-9] missing s at the end [puppet] - 10https://gerrit.wikimedia.org/r/954960 (https://phabricator.wikimedia.org/T342534) (owner: 10Papaul) [15:06:46] RECOVERY - Host mc2040 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [15:06:46] (03CR) 10CI reject: [V: 04-1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [15:07:07] 10ops-codfw, 10Content-Transform-Team, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) @jijiki is there a day this week that I can update the firmware on this server? 
[15:09:13] (SystemdUnitFailed) firing: (11) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:21] (03PS3) 10Muehlenhoff: debian: Remove support for Stretch and update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/952222 [15:10:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [15:11:24] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:50] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:51] (03CR) 10CI reject: [V: 04-1] debian: Remove support for Stretch and update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/952222 (owner: 10Muehlenhoff) [15:13:06] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [15:13:08] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [15:13:16] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks [15:13:21] Live test for mediawiki actually starting ^ [15:13:25] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) [15:13:29] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [15:14:13] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1080:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:15:09] (PS4) Muehlenhoff: debian: Remove support for Stretch and update spec tests [puppet] - https://gerrit.wikimedia.org/r/952222 [15:15:41] (PS2) Elukey: slo_template: allow spaces in dashboard names [grafana-grizzly] - https://gerrit.wikimedia.org/r/954892 [15:15:43] (PS18) Elukey: Lift Wing: add latency/availability SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [15:17:54] (CR) CI reject: [V: -1] debian: Remove support for Stretch and update spec tests [puppet] - https://gerrit.wikimedia.org/r/952222 (owner: Muehlenhoff) [15:18:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2030.codfw.wmnet with OS bullseye [15:18:17] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2030.codfw.wmnet with OS bullseye [15:18:59] SRE, ops-codfw, serviceops: mw2448 is unreachable again - https://phabricator.wikimedia.org/T345597 (MoritzMuehlenhoff) p:Triage→Medium a:Jhancock.wm→Clement_Goubert [15:19:09] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [15:19:09] (CR) Elukey: "New version ready for a review!"
[grafana-grizzly] - https://gerrit.wikimedia.org/r/954892 (owner: Elukey) [15:19:20] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [15:19:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2031.codfw.wmnet with OS bullseye [15:19:31] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2031.codfw.wmnet with OS bullseye [15:19:35] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [15:19:50] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [15:19:50] !log kamila@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2023-09-05 15:19:50.101327 [15:20:10] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [15:20:31] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [15:20:46] PROBLEM - Check systemd state on elastic1087 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:47] (PS10) Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [15:20:56] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:57] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh -
https://phabricator.wikimedia.org/T327938 (Papaul) netbox cable id update for ssw1-a8 to lsw1-a1 and lsw-a8 [15:21:02] !log kamila@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=99) [15:21:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2032.codfw.wmnet with OS bullseye [15:21:32] (CR) CI reject: [V: -1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira) [15:21:40] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2032.codfw.wmnet with OS bullseye [15:22:12] RECOVERY - Check systemd state on elastic1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:22] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:23] (PS5) Muehlenhoff: debian: Remove support for Stretch and update spec tests [puppet] - https://gerrit.wikimedia.org/r/952222 [15:24:13] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:24:26] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [15:24:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:30] (CR) Herron: [C: +1] "LGTM thanks!"
[grafana-grizzly] - https://gerrit.wikimedia.org/r/954892 (owner: Elukey) [15:24:39] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [15:24:48] (CR) Gmodena: Increase the kafka-jumbo maximum message size to 10 MB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: Btullis) [15:24:52] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [15:24:54] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [15:25:06] (CR) Herron: [C: +1] Lift Wing: add latency/availability SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [15:25:08] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [15:25:16] !log kamila@cumin1001 [DRY-RUN] MediaWiki read-only period ends at: 2023-09-05 15:25:15.979250 [15:25:16] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [15:25:21] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [15:25:23] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0) [15:25:28] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [15:27:31] (CR) Elukey: [V: +2 C: +2] slo_template: allow spaces in dashboard names [grafana-grizzly] - https://gerrit.wikimedia.org/r/954892 (owner: Elukey) [15:27:40] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [15:27:50] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl [15:28:22] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) [15:28:28] !log kamila@cumin1001 START - Cookbook
sre.switchdc.mediawiki.09-run-puppet-on-db-masters [15:29:13] (SystemdUnitFailed) firing: (19) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:18] PROBLEM - Check systemd state on elastic1071 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:24] (CR) Vgutierrez: mtail: Record bad requests for varnish SLI metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall) [15:30:15] (CR) Elukey: [V: +2 C: +2] Lift Wing: add latency/availability SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [15:30:42] RECOVERY - Check systemd state on elastic1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:56] (PS11) Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [15:31:31] (CR) CI reject: [V: -1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira) [15:33:57] (CR) Bking: [C: +2] rdf-streaming-updater: Resolve contradictory configs [deployment-charts] - https://gerrit.wikimedia.org/r/954957 (https://phabricator.wikimedia.org/T344614) (owner: Bking) [15:34:10] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters
(exit_code=0) [15:34:12] (SystemdUnitFailed) firing: (13) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:41] (Merged) jenkins-bot: rdf-streaming-updater: Resolve contradictory configs [deployment-charts] - https://gerrit.wikimedia.org/r/954957 (https://phabricator.wikimedia.org/T344614) (owner: Bking) [15:35:19] !log kamila@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover Live Test - T345588 (duration: 30m 45s) [15:35:22] T345588: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 [15:35:40] (CR) Muehlenhoff: "This removes the ensure handling for nftables. Bullseye and Bookworm install the nftables package as part of the d-i bootstrapping and the" [puppet] - https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: Jbond) [15:36:01] !log Datacenter switchover live test completed (T345588) [15:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:11] SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (jhathaway) >>! In T331699#9136475, @MoritzMuehlenhoff wrote: > One other option would be to simply start with a fresh, parallel setup and skip Bullseye entirely: I like this o...
[15:36:12] \o/ [15:37:01] gg kamila_ [15:37:42] +1 [15:37:43] gg y'all for writing cookbooks that actually work :D [15:37:50] <3 [15:37:50] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:30] PROBLEM - Check systemd state on elastic1084 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:13] (SystemdUnitFailed) firing: (19) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:39:14] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:16] (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi) [15:39:54] RECOVERY - Check systemd state on elastic1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2031.codfw.wmnet with reason: host reimage [15:42:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2032.codfw.wmnet with reason: host reimage [15:44:13] (SystemdUnitFailed) firing: (13) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1053:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2031.codfw.wmnet with reason: host reimage [15:45:11] SRE, ops-eqiad, cloud-services-team (Hardware): cloudservices1004: eno1 has interface errors - https://phabricator.wikimedia.org/T345430 (aborrero) Thanks! [15:45:37] !log Repooling mw2448.eqiad.wmnet [15:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:14] (PS12) Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [15:47:14] (CR) CI reject: [V: -1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira) [15:47:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2032.codfw.wmnet with reason: host reimage [15:47:28] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:34] (PS1) Elukey: slo_definitions: shorten the name of Revert Risk dashboard name/ids [grafana-grizzly] - https://gerrit.wikimedia.org/r/954964 [15:48:16] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:24] (CR) Herron: [C: +1]
slo_definitions: shorten the name of Revert Risk dashboard name/ids [grafana-grizzly] - https://gerrit.wikimedia.org/r/954964 (owner: Elukey) [15:48:28] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:13] (SystemdUnitFailed) firing: (19) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:16] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:19] (CR) Volans: "The approach looks sane, some minor comments inline" [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond) [15:49:39] !log Repooled mw2448.eqiad.wmnet - T345597 [15:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:41] T345597: mw2448 is unreachable again - https://phabricator.wikimedia.org/T345597 [15:52:13] (CR) Elukey: [V: +2 C: +2] slo_definitions: shorten the name of Revert Risk dashboard name/ids [grafana-grizzly] - https://gerrit.wikimedia.org/r/954964 (owner: Elukey) [15:53:03] (PS13) Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [15:53:19] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/952222 (owner: Muehlenhoff) [15:53:36] (CR) CI reject: [V: -1] [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira) [15:54:13] (SystemdUnitFailed) firing: (13)
prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:41] (PS14) Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) [15:56:18] PROBLEM - Check systemd state on elastic1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:22] PROBLEM - Check systemd state on elastic1086 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:20] RECOVERY - Check systemd state on elastic1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2033.codfw.wmnet with OS bullseye [15:57:24] RECOVERY - Check systemd state on elastic1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:31] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2033.codfw.wmnet with OS bullseye [15:58:13] (PS1) Ssingh: asw1-b*27-esams: add durum300[34] [homer/public] - https://gerrit.wikimedia.org/r/954965 (https://phabricator.wikimedia.org/T329219) [15:58:16] RECOVERY -
mediawiki-installation DSH group on mw2448 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:59:13] (SystemdUnitFailed) firing: (19) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:00] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:00:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:04] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:19] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:00:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:01:05] SRE, serviceops, Datacenter-Switchover: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 (kamila) Open→Resolved [16:03:31] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:03:36] PROBLEM - Check systemd state on elastic1085 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:13] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:50] RECOVERY - Check systemd state on elastic1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:06:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:06:59] !log pt1979@cumin2002 END (PASS) - Cookbook
sre.hosts.reimage (exit_code=0) for host kubernetes2032.codfw.wmnet with OS bullseye [16:07:03] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:07:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2031.codfw.wmnet with OS bullseye [16:07:06] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2032.codfw.wmnet with OS bullseye completed: - kubernetes2032 (**PASS*... [16:07:16] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2031.codfw.wmnet with OS bullseye completed: - kubernetes2031 (**WARN*...
[16:07:49] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Papaul) [16:07:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2030.codfw.wmnet with reason: host reimage [16:08:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2034.codfw.wmnet with OS bullseye [16:08:19] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2034.codfw.wmnet with OS bullseye [16:08:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2035.codfw.wmnet with OS bullseye [16:08:50] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2035.codfw.wmnet with OS bullseye [16:09:13] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:11:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2030.codfw.wmnet with reason: host reimage [16:13:44] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:46] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed:
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:13] (SystemdUnitFailed) firing: (14) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:14] (CR) Andrew Bogott: [openstack] upgrade codfw1dev to Antelope (2023.1) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/954056 (https://phabricator.wikimedia.org/T341285) (owner: FNegri) [16:14:48] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:50] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:48] (CR) Arturo Borrero Gonzalez: [C: +1] ferm: add ensure support to the ferm class [puppet] - https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: Jbond) [16:18:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2033.codfw.wmnet with reason: host reimage [16:19:12] (SystemdUnitFailed) firing: (8) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:21:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2033.codfw.wmnet with reason: host reimage [16:23:06] PROBLEM - Check systemd state on elastic1067 is CRITICAL: CRITICAL - degraded: The following units failed:
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:06] RECOVERY - Check systemd state on elastic1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:13] (SystemdUnitFailed) firing: (14) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:25:23] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet [16:25:27] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [16:27:09] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:28:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:28:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2030.codfw.wmnet with OS bullseye [16:28:53] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2030.codfw.wmnet with OS bullseye completed: - kubernetes2030 (**PASS*...
[16:29:13] (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:29:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2036.codfw.wmnet with OS bullseye [16:29:55] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2036.codfw.wmnet with OS bullseye [16:30:24] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [16:30:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2034.codfw.wmnet with reason: host reimage [16:30:31] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Papaul) [16:30:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2035.codfw.wmnet with reason: host reimage [16:31:33] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [16:31:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet [16:32:38] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0]
https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [16:33:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2034.codfw.wmnet with reason: host reimage [16:34:12] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:35:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2035.codfw.wmnet with reason: host reimage [16:37:41] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:38:27] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for brouberol - https://phabricator.wikimedia.org/T345633 (Vgutierrez) Open→Stalled the provided ssh key is already used in WMCS, please provide a new one: ` vgutierrez@mwmaint1002:~$ sudo -i cross-validate-accounts --user...
[16:38:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for brouberol - https://phabricator.wikimedia.org/T345633 (10Vgutierrez) [16:38:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:38:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2033.codfw.wmnet with OS bullseye [16:38:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2033.codfw.wmnet with OS bullseye completed: - kubernetes2033 (**PASS*... [16:39:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for brouberol - https://phabricator.wikimedia.org/T345633 (10Vgutierrez) a:03Vgutierrez [16:39:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2037.codfw.wmnet with OS bullseye [16:39:12] (SystemdUnitFailed) firing: (11) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:39:20] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye [16:41:54] PROBLEM - Check systemd state on elastic1076 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:06] RECOVERY - Check systemd state on elastic1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:16] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [16:43:18] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet [16:44:12] (SystemdUnitFailed) firing: (11) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P52259 and previous config saved to /var/cache/conftool/dbconfig/20230905-164618-ladsgroup.json [16:46:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2036.codfw.wmnet with reason: host reimage [16:46:39] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (10Ladsgroup) Repooling. [16:47:35] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh1002.wikimedia.org with OS bookworm [16:47:45] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh1002.wikimedia.org with OS bookworm [16:48:02] (03PS1) 10AikoChou: changeprop: allow retries for liftwing streams with 500 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/954969 [16:49:13] (03CR) 10Krinkle: "I've updated https://gerrit.wikimedia.org/r/939790 to avoid needing the rewrite rule here. Does that look right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/939285 (https://phabricator.wikimedia.org/T339137) (owner: 10Cwhite) [16:49:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet [16:49:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2036.codfw.wmnet with reason: host reimage [16:50:45] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:50:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:51:41] PROBLEM - Check systemd state on elastic1081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:00] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:52:41] RECOVERY - Check systemd state on elastic1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:12] (03PS1) 10Andrew Bogott: cloudservices: add the new -next resolver to config for designate hosts [puppet] - 10https://gerrit.wikimedia.org/r/954970 [16:54:13] (SystemdUnitFailed) firing: (13) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:35] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host mc2042.codfw.wmnet [16:55:02] (JobUnavailable) firing: (2) Reduced 
availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:55:16] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices: add the new -next resolver to config for designate hosts [puppet] - 10https://gerrit.wikimedia.org/r/954970 (owner: 10Andrew Bogott) [16:59:09] PROBLEM - Check systemd state on elastic1093 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:13] (SystemdUnitFailed) firing: (19) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:53] PROBLEM - Check systemd state on elastic1094 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1700) [17:00:23] RECOVERY - Check systemd state on elastic1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:53] RECOVERY - Check systemd state on elastic1094 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P52260 and previous config saved
to /var/cache/conftool/dbconfig/20230905-170122-ladsgroup.json [17:02:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2037.codfw.wmnet with reason: host reimage [17:02:12] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh1002.wikimedia.org with reason: host reimage [17:02:36] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10RobH) [17:02:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:02:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2034.codfw.wmnet with OS bullseye [17:02:51] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2034.codfw.wmnet with OS bullseye completed: - kubernetes2034 (**PASS*... [17:03:00] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:03:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2035.codfw.wmnet with OS bullseye [17:03:09] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2035.codfw.wmnet with OS bullseye completed: - kubernetes2035 (**WARN*... 
[17:03:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [17:04:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye [17:04:10] (03CR) 10Andrew Bogott: [C: 03+2] "arturo, fyi: this fixed several things" [puppet] - 10https://gerrit.wikimedia.org/r/954970 (owner: 10Andrew Bogott) [17:04:13] (SystemdUnitFailed) firing: (13) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye [17:04:15] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2038.codfw.wmnet with OS bullseye [17:04:21] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... 
[17:04:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye [17:04:55] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye [17:04:57] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2038.codfw.wmnet with OS bullseye [17:05:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [17:05:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2037.codfw.wmnet with reason: host reimage [17:07:18] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:07:35] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh1002.wikimedia.org with reason: host reimage [17:07:40] PROBLEM - Check systemd state on elastic1100 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:44] PROBLEM - Check systemd state on elastic1102 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:06] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye [17:08:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye [17:08:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:08:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2036.codfw.wmnet with OS bullseye [17:08:26] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2036.codfw.wmnet with OS bullseye completed: - kubernetes2036 (**PASS*... [17:08:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye [17:08:41] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye [17:08:42] RECOVERY - Check systemd state on elastic1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:46] RECOVERY - Check systemd state on elastic1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:58] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. 
ATS) - https://phabricator.wikimedia.org/T345334 (10Ladsgroup) Okay, I might not know what's going on so my apologies if I'm misunderstanding something. The front of CDN is not the target here, we were told... [17:09:10] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [17:09:13] (SystemdUnitFailed) firing: (19) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:10:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:11:19] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:11:24] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:11:42] (03CR) 10Jbond: "thanks, answering questions" [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [17:12:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:12:14] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:12:19] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:14:13] (SystemdUnitFailed) firing: (13) wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service Failed on elastic1092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:16:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P52262 and previous config saved to /var/cache/conftool/dbconfig/20230905-171627-ladsgroup.json [17:16:59] 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Papaul) [17:17:59] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply security updates - bking@cumin1001 - T344587 [17:18:00] 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Papaul) p:05Triage→03Medium [17:18:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10Papaul) [17:18:58] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:19:12] (SystemdUnitFailed) firing: (19) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:14] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[17:21:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2037.codfw.wmnet with OS bullseye [17:21:23] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye completed: - kubernetes2037 (**PASS*... [17:21:34] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:30:35] (03CR) 10Peter Fischer: Draft: cirrus streaming updater service (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [17:31:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2038.codfw.wmnet with reason: host reimage [17:31:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2039.codfw.wmnet with reason: host reimage [17:31:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P52263 and previous config saved to /var/cache/conftool/dbconfig/20230905-173132-ladsgroup.json [17:34:03] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [17:34:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2038.codfw.wmnet with reason: host reimage [17:36:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2039.codfw.wmnet with reason: host reimage [17:40:47] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:44:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host doh1002.wikimedia.org with OS bookworm [17:44:53] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh1002.wikimedia.org with OS bookworm completed: - doh1002 (**PASS**) - Downtimed on Icinga/Al... [17:47:15] !log dcausse@deploy1002 Started deploy [airflow-dags/search@b3d43bb]: T345545: search: generalize image_suggestions_manual [17:47:18] T345545: Search indices image suggestion tags differ from the dataset used to update - https://phabricator.wikimedia.org/T345545 [17:47:42] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@b3d43bb]: T345545: search: generalize image_suggestions_manual (duration: 00m 26s) [17:47:43] (03PS2) 10Hnowlan: rest-gateway: route requests to geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/954888 (https://phabricator.wikimedia.org/T336400) [17:48:09] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:02] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) Not directly related to bookworm but observed on the dnsdist 1.8.0 upgrade (part of bookworm) that results in a broken `latency_bucket` metric for the Wikimedia DNS hosts. Reported u... 
[17:50:31] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:52:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:52:12] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:55:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:55:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2038.codfw.wmnet with OS bullseye [17:55:40] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye completed: - kubernetes2038 (**PASS*... 
[17:56:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:56:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2039.codfw.wmnet with OS bullseye [17:57:05] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye completed: - kubernetes2039 (**WARN*... [17:57:25] !log T345545: triggered a manual dag run to import analytics_platform_eng.image_suggestions_search_index_full/snapshot=2023-08-21 [17:57:25] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10Papaul) [17:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:27] T345545: Search indices image suggestion tags differ from the dataset used to update - https://phabricator.wikimedia.org/T345545 [17:57:57] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10BBlack) This topic probably deserves a ~hour meeting w/ Traffic to hash out some of the potential solutions and tradeoffs, but I'm gonna try to bullet-point... [17:58:54] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10Papaul) [18:00:05] brennen and jeena: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T1800). 
[18:02:51] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:03:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2029.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:11:22] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10Papaul) This is all done, but I have to fix a DNS mgmt issue with 2029 and 2030: during the provisioning of those servers the serial numbers were swapped. When you log in to 2029 mgmt... [18:11:36] (03CR) 10Ssingh: [C: 03+1] "Verified the subnets (and the descriptions in the comments)." [dns] - 10https://gerrit.wikimedia.org/r/954893 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [18:17:48] (03CR) 10Cathal Mooney: [C: 03+2] Add includes for Netbox generated dns for new per-rack codfw subnets [dns] - 10https://gerrit.wikimedia.org/r/954893 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [18:18:12] !log Running authdns-update to add includes for newly assigned codfw subnets [18:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:26] (03CR) 10Acamicamacaraca: Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [18:23:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2029.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:28:39] (03PS1) 10Cathal Mooney: Add includes for new /24s used in EVPN underlay network codfw [dns] - 10https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) [18:29:37] (03CR) 10CI reject: [V: 04-1] Add includes for new /24s used in EVPN underlay network codfw
[dns] - 10https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [18:31:28] 10SRE, 10Traffic, 10Epic: Deploy Wikimedia DNS: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) https://meta.wikimedia.org/wiki/Wikimedia_DNS is a detailed introduction of the project, including an FAQ. [18:33:54] (03PS2) 10Cathal Mooney: Add includes for new /24s used in EVPN underlay network codfw [dns] - 10https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) [18:36:16] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:36:22] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:38:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:38:53] (03PS1) 10Majavah: P:cyberbot: add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/954984 [18:39:28] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:39:47] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[18:41:28] (03CR) 10Majavah: [C: 03+2] P:cyberbot: add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/954984 (owner: 10Majavah) [18:43:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:52:38] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh1001.wikimedia.org with OS bookworm [18:52:48] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh1001.wikimedia.org with OS bookworm [18:56:51] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:59:47] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [18:59:49] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:00:02] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:01:59] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh1001.wikimedia.org with reason: host reimage [19:05:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh1001.wikimedia.org with reason: host reimage [19:08:57] (03PS1) 10Bking: rdf-streaming-updater: reduce job managers from 3 to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/954985 (https://phabricator.wikimedia.org/T344614) 
[19:10:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:10:23] (03PS1) 10Jdlrobson: Fix unseen notifications icon [skins/MinervaNeue] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954942 (https://phabricator.wikimedia.org/T345483) [19:13:58] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:15:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:15:44] (03CR) 10Ryan Kemper: [C: 03+1] rdf-streaming-updater: reduce job managers from 3 to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/954985 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [19:15:55] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: reduce job managers from 3 to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/954985 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [19:16:41] (03Merged) 10jenkins-bot: rdf-streaming-updater: reduce job managers from 3 to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/954985 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [19:16:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:10] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: 
apply [19:18:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:16] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:19:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.485 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:19:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:22:37] (PS1) Bartosz Dziewoński: Fix temp user popup appearing on every new page creation [extensions/DiscussionTools] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/954943 (https://phabricator.wikimedia.org/T345569) [19:25:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:26:02] (CR) CI reject: [V: -1] Fix unseen notifications icon [skins/MinervaNeue] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/954942 (https://phabricator.wikimedia.org/T345483) (owner: Jdlrobson) [19:27:56] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:28:58] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:29:00] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:32:13] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh1001.wikimedia.org with OS bookworm [19:32:23] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to 
bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh1001.wikimedia.org with OS bookworm completed: - doh1001 (**PASS**) - Downtimed on Icinga/Al... [19:40:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:42:42] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall) [19:43:14] SRE-swift-storage, Commons: Some or all of the undeletion failed: The file "mwstore://local-multiwrite/local-public/d/d7/Elizabeth_Sombart,_February,_2023.jpg" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T331800 (mdaniels5757) Open→Resolved... [19:50:12] PROBLEM - Host mc2040 is DOWN: PING CRITICAL - Packet loss = 100% [19:58:55] (PS2) DDesouza: Deploy Campaigns Event Discovery survey [mediawiki-config] - https://gerrit.wikimedia.org/r/954720 (https://phabricator.wikimedia.org/T345158) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230905T2000). [20:00:04] danisztls, Jdlrobson, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:13] hiii [20:00:50] o/ [20:01:44] (PS2) DDesouza: Pre-deploy Reader Demographics 2 pilot survey [mediawiki-config] - https://gerrit.wikimedia.org/r/954724 (https://phabricator.wikimedia.org/T344393) [20:01:53] present [20:05:18] hi - i can deploy - pardon lateness [20:05:26] danisztls: i'll start with yours [20:06:00] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/954720 (https://phabricator.wikimedia.org/T345158) (owner: DDesouza) [20:06:28] cjming: :) [20:06:41] (Merged) jenkins-bot: Deploy Campaigns Event Discovery survey [mediawiki-config] - https://gerrit.wikimedia.org/r/954720 (https://phabricator.wikimedia.org/T345158) (owner: DDesouza) [20:07:11] !log cjming@deploy1002 Started scap: Backport for [[gerrit:954720|Deploy Campaigns Event Discovery survey (T345158)]] [20:07:15] hi Jdlrobson -- your patch didn't pass CI -- does it just need a rebase? [20:07:16] T345158: Deploy QuickSurvey for Campaigns Event Discovery project - https://phabricator.wikimedia.org/T345158 [20:08:57] !log cjming@deploy1002 cjming and dani: Backport for [[gerrit:954720|Deploy Campaigns Event Discovery survey (T345158)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:09:16] danisztls: can you test? 
[20:09:31] (CR) Ssingh: [C: +1] Add includes for new /24s used in EVPN underlay network codfw [dns] - https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) (owner: Cathal Mooney) [20:09:41] !log fab@deploy1002 Started deploy [airflow-dags/research@90f280e]: (no justification provided) [20:09:59] !log fab@deploy1002 Finished deploy [airflow-dags/research@90f280e]: (no justification provided) (duration: 00m 17s) [20:10:09] cjming: sure [20:10:15] cjming: you might need to force merge it. Is that possible? [20:10:29] there's a CI failure on master that's unrelated [20:10:49] Seems like an issue in Wikibase [20:11:02] Jdlrobson: i'll give it a whirl [20:11:07] cjming: looks good [20:11:07] SRE-swift-storage, Commons, media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (AntiCompositeNumber) F37653248 was not [[ https://www.mediawiki.org/wiki/Phabricator/Help#File_visibility | attached ]] to this tick... [20:11:12] I can try and find a way to work around it [20:11:26] danisztls: syncing [20:11:30] !log cjming@deploy1002 cjming and dani: Continuing with sync [20:13:25] Jdlrobson: is there a task for the CI fail? 
[20:13:34] RhinosF1: https://phabricator.wikimedia.org/T345660 [20:14:06] (CR) Clare Ming: [C: +2] Fix temp user popup appearing on every new page creation [extensions/DiscussionTools] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/954943 (https://phabricator.wikimedia.org/T345569) (owner: Bartosz Dziewoński) [20:17:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:17:39] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:954720|Deploy Campaigns Event Discovery survey (T345158)]] (duration: 10m 27s) [20:17:42] (CR) Cwhite: [C: +1] "Thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/954675 (https://phabricator.wikimedia.org/T344952) (owner: Filippo Giunchedi) [20:17:45] T345158: Deploy QuickSurvey for Campaigns Event Discovery project - https://phabricator.wikimedia.org/T345158 [20:17:51] danisztls: live! [20:18:33] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/954942 (https://phabricator.wikimedia.org/T345483) (owner: Jdlrobson) [20:19:24] cjming: thanks! 
[20:19:29] (Merged) jenkins-bot: Fix temp user popup appearing on every new page creation [extensions/DiscussionTools] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/954943 (https://phabricator.wikimedia.org/T345569) (owner: Bartosz Dziewoński) [20:19:37] Jdlrobson: scap isn't letting me merge -- i can't force merge from gerrit UI - i can try to do it from cmd line but might need to poke someone about it [20:20:29] !log cjming@deploy1002 Started scap: Backport for [[gerrit:954943|Fix temp user popup appearing on every new page creation (T345569)]] [20:20:31] T345569: Notification that a temporary account has been created appears when already logged-in and creating a page with DiscussionTools - https://phabricator.wikimedia.org/T345569 [20:20:32] in the meantime, i'll continue with MatmaRex's patch [20:21:05] Jdlrobson: i found the cause of CI failure [20:21:12] i think [20:21:14] cjming: Jdlrobson: i wonder if you could try to backport a revert of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/946922 to fix that problem? (i'm just guessing, haven't tested) [20:21:15] \o/ [20:21:30] MatmaRex RhinosF1: <3 [20:21:43] there were a lot of recent changes on gerrit mentioning "@wdio/sync" (which is the bad package), one of them is probably the cause, but i don't know enough about this [20:21:45] MatmaRex: i've got it on master too [20:21:51] oh, maybe RhinosF1 knows more [20:21:55] so i think it's an existing issue [20:22:01] !log cjming@deploy1002 cjming and matmarex: Backport for [[gerrit:954943|Fix temp user popup appearing on every new page creation (T345569)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:22:02] MatmaRex: https://gerrit.wikimedia.org/r/c/integration/config/+/954332/ was merged today [20:22:15] MatmaRex: wanna test? 
[20:22:26] like 90 minutes ago, i was looking at recent changes [20:22:33] it's a guess [20:22:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:22:38] is James_F here? [20:22:54] I don't know what to do here as this bug is pretty urgent to fix in production [20:22:54] cjming: yeah, looking [20:23:09] could we revert the config patch temporarily ? [20:23:21] or disable selenium for Minerva? [20:23:25] SRE-swift-storage, Commons, media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (AntiCompositeNumber) Open→Resolved I restored the image by copying from the duplicate. I do not think it is necessary to att... [20:23:28] Jdlrobson: no idea, we can look for another relenger [20:23:59] Can look in 10. [20:24:05] you could probably comment out things in package.json somewhere [20:24:07] or he can show now [20:25:15] cjming: are you happy to wait for james in 10 minutes? [20:25:38] RhinosF1: np - i can wait [20:25:50] Merge and deploy it. [20:25:50] cjming: my change looks good on testwiki. sorry for the delay [20:26:02] MatmaRex: np - syncing [20:26:04] !log cjming@deploy1002 cjming and matmarex: Continuing with sync [20:27:24] James_F: sorry - merge and deploy which? Jon's patch? [20:29:41] Yes, CI is always advisory. [20:29:49] Am now back from dinner. [20:30:01] James_F: can i please bash that [20:30:08] RhinosF1: Go for it. [20:30:13] What's broken this time? [20:30:40] James_F: https://phabricator.wikimedia.org/T345660 [20:31:14] https://bash.toolforge.org/quip/8akKZ4oBxE1_1c7shyei [20:31:18] Hmm, did this happen today or previously? The switch to node16 was a few weeks ago. 
[20:31:36] James_F: today [20:31:40] (CR) Clare Ming: [V: +2] Fix unseen notifications icon [skins/MinervaNeue] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/954942 (https://phabricator.wikimedia.org/T345483) (owner: Jdlrobson) [20:32:02] Specifically it was https://gerrit.wikimedia.org/r/c/integration/config/+/946960 [20:32:03] or at least last 3, it's a master CI fail [20:32:06] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:954943|Fix temp user popup appearing on every new page creation (T345569)]] (duration: 11m 37s) [20:32:09] T345569: Notification that a temporary account has been created appears when already logged-in and creating a page with DiscussionTools - https://phabricator.wikimedia.org/T345569 [20:32:23] MatmaRex: your change should be live [20:32:28] thanks cjming! [20:32:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:32:38] Right, so it's breaking fallout from the npm 7->8 change? How odd, that shouldn't have broken anything. [20:33:02] !log cjming@deploy1002 Started scap: Backport for [[gerrit:954942|Fix unseen notifications icon (T345483)]] [20:33:04] T345483: Mobile notifications are nearly invisible - https://phabricator.wikimedia.org/T345483 [20:33:10] James_F: https://github.com/wikimedia/mediawiki-skins-MinervaNeue/commit/286920543a854310a2c81fc95b664e5b3957d788 was last change. 3 hours ago [20:33:15] yes i meant last 3 hours [20:33:54] maybe the next change is the real problem? 
https://gerrit.wikimedia.org/r/c/integration/config/+/954334 says "Drop node14 images, unused" [20:34:38] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:954942|Fix unseen notifications icon (T345483)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:34:40] Jdlrobson: looks like it went thru - can you test? [20:37:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:39:09] MatmaRex: No. [20:40:05] MatmaRex: That's a repo-only change. The problem (I think) is that Wikibase's code doesn't work on CI's versions of things, but CI was lying for the past four weeks and not actually running the code as written because of their package-lock.json file, or somesuch. [20:40:17] Jdlrobson: shall i sync? [20:40:31] yeah, sorry, i think i suggested some bad guesses [20:40:35] wmf-quibble jobs are now back to node16+npm7 so the previous behaviour should be restored for repos that aren't Wikibase themselves. 
[20:40:51] i'm now comparing the failing run: https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/56860/consoleFull to a recent successful run: https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/56817/consoleFull [20:41:21] and the EBADENGINE for @wdio/sync was already there, the new failure is the conflicting peer dependencies [20:42:06] (CR) Ebernhardson: [C: +1] flink-zk: Move codfw hosts back to insetup [puppet] - https://gerrit.wikimedia.org/r/954134 (https://phabricator.wikimedia.org/T341792) (owner: Bking) [20:42:09] looking [20:43:16] cjming: confirmed as fixed [20:43:21] woohoo - syncing [20:43:22] can I also backport it to wmf15? [20:43:40] *wmf25 [20:43:53] !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync [20:43:58] (PS1) Jdlrobson: Fix unseen notifications icon [skins/MinervaNeue] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/954945 (https://phabricator.wikimedia.org/T345483) [20:44:13] Wait, how is wdio/sync in Wikibase at all? [20:44:39] Jdlrobson: i think it does automatically -- do you want to confirm? [20:45:26] otherwise happy to do wmf25 too [20:46:08] (CR) Bking: [C: +2] flink-zk: Move codfw hosts back to insetup [puppet] - https://gerrit.wikimedia.org/r/954134 (https://phabricator.wikimedia.org/T341792) (owner: Bking) [20:46:59] i think the branch got cut already? [20:47:13] Yes, it's Tuesday. [20:47:18] oh - then sure -- can you give me a patch number? 
[20:47:29] cjming: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/954945 [20:47:52] PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:47] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:954942|Fix unseen notifications icon (T345483)]] (duration: 16m 45s) [20:49:51] T345483: Mobile notifications are nearly invisible - https://phabricator.wikimedia.org/T345483 [20:49:56] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/954945 (https://phabricator.wikimedia.org/T345483) (owner: Jdlrobson) [20:56:26] RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:20] OK, found the bugs in two Wikibase extensions, merges appreciated on them tagged against T345660 so I can cherry-pick to deployed and release branches. [20:59:24] T345660: wikibase@0.1.0 install:bridge fails with webpack dependency error - https://phabricator.wikimedia.org/T345660 [21:03:28] cjming: is this live now? [21:03:31] Hey all - I’d like to deploy an updated security mitigation for T336027. Has the backport window concluded? [21:04:06] Jdlrobson: still waiting for it to merge -- almost there [21:04:13] sbassett: almost done [21:04:14] ack [21:04:19] ok tx [21:04:40] cjming: btw I can't verify this one as the wmf25 branch has not been rolled out [21:04:47] so feel free to sync as soon as it merges [21:05:15] wmf.25 is live on test.wikipedia.org [21:06:07] James_F: so it is :) [21:06:12] James_F: so it is! [21:06:32] But that doesn't mean you can test everything there. 
:-) [21:07:25] (Merged) jenkins-bot: Fix unseen notifications icon [skins/MinervaNeue] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/954945 (https://phabricator.wikimedia.org/T345483) (owner: Jdlrobson) [21:07:56] !log cjming@deploy1002 Started scap: Backport for [[gerrit:954945|Fix unseen notifications icon (T345483)]] [21:07:59] T345483: Mobile notifications are nearly invisible - https://phabricator.wikimedia.org/T345483 [21:08:43] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2001.codfw.wmnet with OS bookworm [21:09:33] !log cjming@deploy1002 jdlrobson and cjming: Backport for [[gerrit:954945|Fix unseen notifications icon (T345483)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [21:09:42] Jdlrobson: did you want to verify? [21:10:08] cjming: yep might as well [21:12:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [21:13:00] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye [21:15:43] cjming: lgtm [21:15:48] yay! syncing [21:15:50] !log cjming@deploy1002 jdlrobson and cjming: Continuing with sync [21:15:52] thanks for the help and sorry for the overruning! [21:16:06] no worries! glad it all worked out [21:16:22] Ok, going to deploy my security mitigation now... 
[21:16:33] sounds good - thanks for your patience [21:16:35] !log end of UTC late backport window [21:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:43] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye [21:16:51] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -... [21:17:31] I will take this time to roll the train to group0 [21:17:53] or are you doing a deployment sbassett ? [21:18:06] Was just going to say [21:18:16] I can wait [21:18:38] Was going to real quick, to PS.php, if that’s ok. cjming needs to release their scap lock tho... [21:18:51] yeah go ahead [21:18:56] it's almost finished syncing [21:19:00] Ah ok [21:19:27] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service Failed on elastic1092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:43] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:954945|Fix unseen notifications icon (T345483)]] (duration: 13m 46s) [21:21:50] Jdlrobson: changes should be live [21:21:51] T345483: Mobile notifications are nearly invisible - https://phabricator.wikimedia.org/T345483 [21:21:57] sbassett: all yours [21:22:08] tx! 
[21:22:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:27:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:28:10] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [21:28:19] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye [21:28:26] !log Deployed updated security mitigation for T336027 [21:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:42] jeena: should be good, my deploy seems stable [21:28:55] thanks! 
[21:29:41] (PS1) TrainBranchBot: group0 wikis to 1.41.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/955005 (https://phabricator.wikimedia.org/T343727) [21:29:43] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.41.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/955005 (https://phabricator.wikimedia.org/T343727) (owner: TrainBranchBot) [21:30:52] (Merged) jenkins-bot: group0 wikis to 1.41.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/955005 (https://phabricator.wikimedia.org/T343727) (owner: TrainBranchBot) [21:34:12] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye [21:34:20] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -... 
[21:38:09] !log mwmaint1002: `/usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue` (trying to reproduce T344428) [21:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:12] T344428: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 [21:38:34] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.25 refs T343727 [21:38:37] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [21:41:20] PROBLEM - Check systemd state on mw2442 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:18] SRE, ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans) >>! In T344259#9142803, @Jclark-ctr wrote: > @Eevans disposed of old optic and replaced cable can you verify if error is still present? Previous eqiad onsit... 
[21:53:49] SRE-swift-storage, Commons, media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (jcrespo) [22:11:09] !log mwmaint1002: `/usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --batch-size=20 --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue ` (debugging T344428, lowered batch size [100 -> 20]) [22:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:12] T344428: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 [22:22:31] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2001.codfw.wmnet with OS bookworm [22:31:04] (ProbeDown) firing: (5) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:31:07] (ProbeDown) firing: (8) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:31:28] PROBLEM - PyBal backends health check on lvs6003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp6015.drmrs.wmnet are marked down but pooled: testlb6_443: Servers cp6012.drmrs.wmnet, cp6009.drmrs.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:31:35] * brett here [22:32:10] * rzl too [22:32:22] ddos? 
[22:32:31] 200k+ req/s [22:32:52] RECOVERY - PyBal backends health check on lvs6003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:34:54] (PS8) Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 [22:34:56] (CR) Ebernhardson: Draft: cirrus streaming updater service (10 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (owner: Ebernhardson) [22:34:58] (PS1) Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032 [22:35:00] (PS1) Ebernhardson: flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - https://gerrit.wikimedia.org/r/955033 [22:36:04] (ProbeDown) resolved: (10) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:36:07] (ProbeDown) resolved: (16) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:46] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:57:56] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update DNS entries for kubernetes2029 and 2030 - pt1979@cumin2002" [22:59:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update DNS entries for kubernetes2029 and 2030 - pt1979@cumin2002" [22:59:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:12:16] (CR) Jdlrobson: "Given https://phabricator.wikimedia.org/T345414 and T345672 perhaps this should be temporarily reverted, at least on enwiki?" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) (owner: Samtar) [23:13:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:22:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:22:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:24:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 5.386 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:24:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.454 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:30:28] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk2001.codfw.wmnet [23:31:09] (PS1) Srishakatux: Add Akan language [mediawiki-config] - https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) [23:34:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:34:16] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:34:32] !log bking@cumin1001 START - Cookbook sre.dns.netbox [23:36:44] (CR) Jon Harald Søby: "@Nikki, could you run a Commons query to see if it would be necessary to add `ak` there as well, or if adding it to the Wikidata setting s" [mediawiki-config] - https://gerrit.wikimedia.org/r/955007 
(https://phabricator.wikimedia.org/T333765) (owner: Srishakatux) [23:37:14] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [23:41:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:41:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.538 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:44:31] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [23:44:31] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:44:32] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk2001.codfw.wmnet