[00:14:54] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [00:15:24] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [00:24:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:08] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/954378 (owner: 10TrainBranchBot) [00:31:40] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [00:39:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/955008 [00:39:08] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/955008 (owner: 10TrainBranchBot) [00:43:32] RECOVERY - Host mr1-eqiad.oob is UP: PING WARNING - Packet loss = 75%, RTA = 0.60 ms [00:45:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10Papaul) 05Open→03Resolved mgmt DNS for k8s2029 and 2030 fixed. 
@akosiaris all yours; the last node will be tracked @T345650 [00:54:21] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/955008 (owner: 10TrainBranchBot) [01:06:52] PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:16] RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:27] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service Failed on elastic1092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:08:59] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:04] (03PS1) 10Ssingh: Release 1.5.3-5 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) [02:14:09] 10SRE, 10Traffic, 10Patch-For-Review: Package libvmod-re2 for Debian 12/Bookworm - https://phabricator.wikimedia.org/T345663 (10ssingh) The above patch fixes the issue with CI passing. Once reviewed, we can merge and close this task. 
[02:28:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [02:33:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [02:33:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:02] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:38] (03PS1) 10Tim Starling: Disable parser tests [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955047 (https://phabricator.wikimedia.org/T345515) [05:11:17] (03PS1) 10Tim Starling: Avoid linking to invalid filenames in error message [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955048 (https://phabricator.wikimedia.org/T345672) [05:11:42] (03CR) 10Tim Starling: [C: 03+2] Disable parser tests [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955047 (https://phabricator.wikimedia.org/T345515) (owner: 10Tim Starling) [05:11:52] (03CR) 10Tim Starling: [C: 03+2] Avoid linking to invalid filenames in error message [extensions/Phonos] 
(wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955048 (https://phabricator.wikimedia.org/T345672) (owner: 10Tim Starling) [05:13:31] (03Merged) 10jenkins-bot: Disable parser tests [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955047 (https://phabricator.wikimedia.org/T345515) (owner: 10Tim Starling) [05:13:47] (03Merged) 10jenkins-bot: Avoid linking to invalid filenames in error message [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955048 (https://phabricator.wikimedia.org/T345672) (owner: 10Tim Starling) [05:19:27] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service Failed on elastic1092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:28:11] !log tstarling@deploy1002 Synchronized php-1.41.0-wmf.25/extensions/Phonos: Fix UBN client-side error from malformed Phonos tags T345672 (duration: 06m 51s) [05:28:14] T345672: Error: Unable to parse title - https://phabricator.wikimedia.org/T345672 [05:46:34] (03Abandoned) 10Muehlenhoff: Add some ferm->nft migration steps to the firewall class [puppet] - 10https://gerrit.wikimedia.org/r/952862 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T0600) [06:02:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:07:06] (03PS1) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [06:07:34] (KubernetesAPILatency) resolved: High Kubernetes API latency 
(PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:08:23] (03PS2) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [06:09:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:30] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:18:02] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:18:14] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:31:14] (03PS2) 10Muehlenhoff: Failover irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/954703 [06:34:05] (03CR) 10Muehlenhoff: [C: 03+2] Failover irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/954703 (owner: 10Muehlenhoff) [06:38:38] (03CR) 10Tim Starling: IS: Enable Phonos on all projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951042 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [06:38:59] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:57:14] (03CR) 10Fabfur: [C: 03+1] "LGTM (do we want to specify in the patch name that this one is from WMF?)" [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) (owner: 10Ssingh) [07:00:06] Amir1, Urbanecm, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T0700). [07:00:06] abi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:02:09] OK. I can deploy patch from abi. He seems not around though. Let me check. [07:06:44] abijeet: Let's start deployment. [07:07:04] (03PS5) 10KartikMistry: Enable MinT translation service in more wikis - rollout #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954634 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [07:08:21] kart_, OK [07:08:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954634 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [07:09:31] (03Merged) 10jenkins-bot: Enable MinT translation service in more wikis - rollout #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954634 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [07:10:09] !log kartik@deploy1002 Started scap: Backport for [[gerrit:954634|Enable MinT translation service in more wikis - rollout #2 (T341445)]] [07:10:15] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [07:11:47] !log kartik@deploy1002 abi and kartik: Backport for [[gerrit:954634|Enable MinT translation service in more wikis - rollout #2 (T341445)]] synced to the 
testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:12:13] abijeet: can you test patch using mwdebug? [07:12:20] kart_, ok, testing [07:12:29] Let me know if everything looks good to go ahead. [07:14:41] kart_, looks good on all the wikis. I see translations from MinT appearing. [07:14:59] Cool [07:15:12] Going ahead with deployment [07:15:16] !log kartik@deploy1002 abi and kartik: Continuing with sync [07:15:48] kart_, thanks! [07:16:21] (03PS1) 10Ayounsi: eqiad ganeti test setup [puppet] - 10https://gerrit.wikimedia.org/r/955284 (https://phabricator.wikimedia.org/T345602) [07:18:29] (03PS2) 10Ayounsi: eqiad ganeti test setup [puppet] - 10https://gerrit.wikimedia.org/r/955284 (https://phabricator.wikimedia.org/T345602) [07:20:04] good morning [07:20:26] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) [07:21:15] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:954634|Enable MinT translation service in more wikis - rollout #2 (T341445)]] (duration: 11m 05s) [07:21:18] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [07:25:41] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955284 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [07:31:48] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:32:00] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:40:23] (03PS2) 10Alexandros 
Kosiaris: PHPFPMTooBusy: Point to public available runbook [alerts] - 10https://gerrit.wikimedia.org/r/954947 [07:44:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:56] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:49:04] (03CR) 10Filippo Giunchedi: [C: 03+2] jaeger: match production opensearch replica/shard settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954675 (https://phabricator.wikimedia.org/T344952) (owner: 10Filippo Giunchedi) [07:51:07] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [07:51:28] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [08:04:39] jouncebot: now [08:04:39] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [08:04:50] Ah cause the deployment calendar is outdated [08:06:37] jouncebot: refresh [08:06:38] I refreshed my knowledge about deployments. 
[08:06:43] jouncebot: now [08:06:43] For the next 1 hour(s) and 53 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T0800) [08:06:47] :) [08:08:22] I am going to run the MediaWiki train after it got reassigned to me for this week [08:10:31] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955288 (https://phabricator.wikimedia.org/T343727) [08:10:35] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955288 (https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [08:11:14] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955288 (https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [08:16:04] (03PS3) 10Cathal Mooney: Add includes for new /24s used in EVPN underlay network codfw [dns] - 10https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) [08:16:16] (03CR) 10Muehlenhoff: eqiad ganeti test setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955284 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [08:17:19] (03PS4) 10Cathal Mooney: Add includes for new /24s used in EVPN underlay network codfw [dns] - 10https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) [08:18:27] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.25 refs T343727 [08:18:30] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [08:19:24] (03PS3) 10Ayounsi: eqiad ganeti test setup [puppet] - 10https://gerrit.wikimedia.org/r/955284 (https://phabricator.wikimedia.org/T345602) [08:19:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:20:30] (03PS5) 10Cathal Mooney: Add includes for new /24s used in EVPN underlay network codfw [dns] - 10https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) [08:20:44] (03CR) 10Muehlenhoff: PHPFPMTooBusy: Point to public available runbook (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/954947 (owner: 10Alexandros Kosiaris) [08:22:00] (03CR) 10Ayounsi: eqiad ganeti test setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955284 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [08:22:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/955284 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [08:22:09] (03CR) 10Ayounsi: [C: 03+2] eqiad ganeti test setup [puppet] - 10https://gerrit.wikimedia.org/r/955284 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [08:22:13] (03PS6) 10Cathal Mooney: Add includes for new /24s used in EVPN underlay network codfw [dns] - 10https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) [08:23:19] (03CR) 10Cathal Mooney: [C: 03+2] Add includes for new /24s used in EVPN underlay network codfw [dns] - 10https://gerrit.wikimedia.org/r/954980 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [08:23:34] php-fpm restarting [08:24:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:24:44] (03CR) 10Filippo Giunchedi: "LGTM, I'll leave it to mw folks to comment on the buckets themselves" [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) 
[08:24:59] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.25 refs T343727 (duration: 06m 31s) [08:25:02] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [08:29:44] PROBLEM - Check systemd state on ganeti-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:50] (03PS1) 10Urbanecm: Revert "Growth: Disable Add an image on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955049 (https://phabricator.wikimedia.org/T345188) [08:36:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:37:12] (03CR) 10Vgutierrez: [C: 03+1] Release 1.5.3-5 (032 comments) [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) (owner: 10Ssingh) [08:38:37] (03PS2) 10Urbanecm: Revert "Growth: Disable Add an image on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955049 (https://phabricator.wikimedia.org/T345188) [08:39:53] jouncebot: nowandnext [08:39:53] For the next 1 hour(s) and 20 minute(s): MediaWiki train - Utc Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T0800) [08:39:53] In 1 hour(s) and 20 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T1000) [08:40:16] hashar: Can you ping me when you're done? 
I have appservers to reboot :) [08:40:23] (03PS6) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [08:40:57] (03CR) 10CI reject: [V: 04-1] datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:41:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10phuedx) [08:42:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2001.wikimedia.org [08:43:28] (03PS13) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [08:43:30] (03PS1) 10Filippo Giunchedi: otel-col: enable grpc-http [deployment-charts] - 10https://gerrit.wikimedia.org/r/955290 (https://phabricator.wikimedia.org/T320563) [08:44:18] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [08:46:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:46:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2001.wikimedia.org [08:48:37] claime: I am digging the logs [08:48:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt2001.wikimedia.org [08:49:48] 10SRE, 10Infrastructure-Foundations, 10netops: Automate Netbox additions for new spine/leaf L3 networks. 
- https://phabricator.wikimedia.org/T333441 (10cmooney) 05Open→03Resolved I'm gonna close this for now. I used the following tooling to create the necessary in eqiad/codfw for recent expansion. An i... [08:49:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [08:50:45] (03PS7) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [08:51:24] (03CR) 10CI reject: [V: 04-1] datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:51:59] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10cmooney) Indeed blocked on the optics arriving, but to clarify the cable runs have been done we just need the optics to slot in and connect. @Jclark-ctr correct... 
[08:52:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2001.wikimedia.org [08:53:33] (03PS8) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [08:54:13] (03CR) 10CI reject: [V: 04-1] datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:56:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:56:46] claime: you can do the reboots, logs look quiet so far and I will continue monitoring them [08:57:30] hashar: Awesome, thanks for the ping <3 [08:58:00] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [08:58:46] (03CR) 10Muehlenhoff: [C: 03+2] autoinstall: Remove obsolete files [puppet] - 10https://gerrit.wikimedia.org/r/954277 (owner: 10Muehlenhoff) [08:59:44] (03CR) 10Muehlenhoff: [C: 03+2] ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [09:00:20] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Ensure standalone puppet works with puppet7 - https://phabricator.wikimedia.org/T345702 (10jbond) [09:00:33] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Ensure standalone puppet works with puppet7 - https://phabricator.wikimedia.org/T345702 (10jbond) p:05Triage→03Medium [09:01:15] (03PS1) 10Jbond: git-sync-upstream: Fix environment when setting 
gitusers [puppet] - 10https://gerrit.wikimedia.org/r/955291 (https://phabricator.wikimedia.org/T345702) [09:01:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:02:18] 10SRE, 10Data-Engineering, 10Data-Platform-SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10BTullis) Should we resolve this ticket now? The change has been applied to all clusters other than kafka-logging, and it's been noted... [09:05:04] !log Draining transport circuits landing on cr1-codfw card 1/1 prior to reset (T345583) [09:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [09:05:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [09:05:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for brouberol - https://phabricator.wikimedia.org/T345633 (10brouberol) Sorry about the confusion. I have deleted `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPd+Ekept47K0yIJ91ByVo4q6TAbgVzzxIqfq6k1X0L8 brouberol@wikimedia.org` from toolfo... 
[09:05:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T343198)', diff saved to https://phabricator.wikimedia.org/P52265 and previous config saved to /var/cache/conftool/dbconfig/20230906-090541-arnaudb.json [09:05:44] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:06:12] (03PS1) 10Muehlenhoff: Switch sretest1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/955292 [09:07:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [09:08:32] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for brouberol - https://phabricator.wikimedia.org/T345633 (10Vgutierrez) 05Stalled→03In progress Thanks, ` vgutierrez@mwmaint1002:~$ sudo -i cross-validate-accounts --username brouberol --uid 45143 --email brouberol@wikimedia.o... [09:08:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for brouberol - https://phabricator.wikimedia.org/T345633 (10Vgutierrez) [09:08:57] (03CR) 10Muehlenhoff: [C: 03+2] debian: Remove support for Stretch and update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/952222 (owner: 10Muehlenhoff) [09:09:56] I have filed 4 new errors so far but none seems worth rolling back ( https://phabricator.wikimedia.org/T343727#9145336 ) [09:10:12] (03PS1) 10Jbond: cloud: puppetmaster7 update enc uri [puppet] - 10https://gerrit.wikimedia.org/r/955293 (https://phabricator.wikimedia.org/T345702) [09:10:35] (03CR) 10Jbond: [C: 03+2] cloud: puppetmaster7 update enc uri [puppet] - 10https://gerrit.wikimedia.org/r/955293 (https://phabricator.wikimedia.org/T345702) (owner: 10Jbond) [09:11:03] (03PS2) 10Jbond: cloud: puppetmaster7 update enc uri [puppet] - 10https://gerrit.wikimedia.org/r/955293 (https://phabricator.wikimedia.org/T345702) [09:14:01] (03PS3) 10Hnowlan: rest-gateway: route requests to geo-analytics [deployment-charts] - 
10https://gerrit.wikimedia.org/r/954888 (https://phabricator.wikimedia.org/T336400) [09:15:35] !log Shutting cr1-codfw port xe-1/1/1:1 to cr2-codfw before card 1/1 reset (T345583) [09:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:33] (03PS2) 10Muehlenhoff: Switch sretest1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/955292 [09:19:27] (03CR) 10JMeybohm: [C: 03+1] otel-col: enable grpc-http [deployment-charts] - 10https://gerrit.wikimedia.org/r/955290 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:19:27] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service Failed on elastic1092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:20:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [09:21:52] (03CR) 10Filippo Giunchedi: [C: 03+2] otel-col: enable grpc-http [deployment-charts] - 10https://gerrit.wikimedia.org/r/955290 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:21:57] (03PS1) 10Vgutierrez: admin: Add brouberol user [puppet] - 10https://gerrit.wikimedia.org/r/955294 (https://phabricator.wikimedia.org/T345633) [09:21:59] (03PS2) 10Filippo Giunchedi: otel-col: enable grpc-http [deployment-charts] - 10https://gerrit.wikimedia.org/r/955290 (https://phabricator.wikimedia.org/T320563) [09:22:01] (03CR) 10Filippo Giunchedi: [V: 03+2] otel-col: enable grpc-http [deployment-charts] - 10https://gerrit.wikimedia.org/r/955290 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:22:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for brouberol - https://phabricator.wikimedia.org/T345633 (10Vgutierrez) 05In progress→03Stalled 
`analytics_privatedata_users` membership requires approval from @odimitrijevic or @Milimetric per https://ge... [09:22:42] (03CR) 10Vgutierrez: [C: 04-2] "blocked till we get the required approvals" [puppet] - 10https://gerrit.wikimedia.org/r/955294 (https://phabricator.wikimedia.org/T345633) (owner: 10Vgutierrez) [09:22:52] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: route requests to geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/954888 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [09:23:19] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [09:23:21] !log Resetting PIC 1/1 on cr1-codfw to enable port et-1/1/5 at 100G (T345583) [09:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:34] (03Merged) 10jenkins-bot: rest-gateway: route requests to geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/954888 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [09:23:35] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [09:23:39] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [09:23:48] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [09:26:09] !log disable puppet to switch puppetdbs gerrit:954622 [09:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:43] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:28:02] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:28:30] (03CR) 10Hnowlan: [C: 03+2] device-analytics: correct replica definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/954902 (owner: 10Hnowlan) [09:29:40] (03Merged) 10jenkins-bot: device-analytics: correct replica definition [deployment-charts] - 
10https://gerrit.wikimedia.org/r/954902 (owner: 10Hnowlan) [09:29:42] PROBLEM - Check systemd state on ganeti-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) [09:31:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:36:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:36:19] (03CR) 10Jbond: [C: 03+2] puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [09:37:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) I've updated the request title and description because @brouberol will also require access to the `ops` group as an SRE on the #data... 
[09:37:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) [09:37:52] (03PS1) 10Muehlenhoff: nft base sets: Read additional host groups from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) [09:38:18] (03CR) 10CI reject: [V: 04-1] nft base sets: Read additional host groups from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:41:11] 10SRE, 10Infrastructure-Foundations, 10netops: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (10cmooney) p:05Triage→03Low [09:42:56] (03CR) 10Filippo Giunchedi: [C: 03+2] mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:43:45] (03Merged) 10jenkins-bot: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:44:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) @brouberol - You can feel free to submit your own patch to Gerrit if you wish, adding yourself to the `data.yaml` file. We can then... 
[09:44:31] (03PS1) 10Ayounsi: DNS add A for ganeti-test01.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/955299 (https://phabricator.wikimedia.org/T345602) [09:45:03] 10SRE, 10Infrastructure-Foundations, 10netops: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (10Volans) SGTM, I would even consider a shorter time span :) [09:46:00] (03CR) 10Ayounsi: [C: 03+2] DNS add A for ganeti-test01.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/955299 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [09:48:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) >>! In T345633#9145470, @BTullis wrote: > @brouberol - You can feel free to submit your own patch to Gerrit if you wish, adding your... [09:49:31] !log enable puppet post switch puppetdbs gerrit:954622 [09:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:57] !log Making cr1-codfw VRRP primary for connections to row C and D prior to card 1/1 reset (T345583) [09:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:23] !log ayounsi@cumin1001 START - Cookbook sre.dns.wipe-cache ganeti-test01.svc.eqiad.wmnet on all recursors [09:51:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti-test01.svc.eqiad.wmnet on all recursors [09:52:13] (03CR) 10Btullis: "I believe that we also want to add brouberol to the ops group." 
[puppet] - 10https://gerrit.wikimedia.org/r/955294 (https://phabricator.wikimedia.org/T345633) (owner: 10Vgutierrez) [09:54:19] (03CR) 10Vgutierrez: [C: 04-2] admin: Add brouberol user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955294 (https://phabricator.wikimedia.org/T345633) (owner: 10Vgutierrez) [09:57:15] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:57:27] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:57:45] !log de-activating peering sessions at DE-CIX Dallas on cr2-codfw prior to card 1/1 reset (T345583) [09:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:36] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:59:40] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:59:54] (03PS1) 10Jbond: puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T1000) [10:00:18] (03CR) 10CI reject: [V: 04-1] puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:00:36] (03CR) 10Jbond: [V: 03+2] puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:02:54] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 118, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:02:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43150/console" [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:04:51] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:05:12] (03PS2) 10Jbond: puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) [10:05:15] (03CR) 10CI reject: [V: 04-1] puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:05:36] (03CR) 10CI reject: [V: 04-1] puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:05:55] (03CR) 10Jbond: [C: 03+2] "updated to exlcude wmcs" [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:06:18] (03CR) 10Jbond: [V: 03+2] puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:07:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43151/console" [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:08:46] !log Draining cr2-codfw transport cct's to eqdfw and eqiad prior to card 1/1 reset (T345583) [10:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:04] PROBLEM - Check systemd state on puppetserver2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
remove_old_puppet_reports.service,sync-puppet-ca.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:56] (03PS3) 10Jbond: puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) [10:13:19] (03CR) 10CI reject: [V: 04-1] puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:14:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43152/console" [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:15:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:15:44] !log shut cr2-codfw xe-1/1/1:3 interface to cr1-codfw ahead of card 1/1 reset (T345583) [10:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:20:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:20:54] RECOVERY - Check systemd state on ganeti-test1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:21:59] (03CR) 10Btullis: datahub: add oidc production settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [10:22:02] RECOVERY - Check systemd state on ganeti-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:30] (03PS2) 10Muehlenhoff: nft base sets: Read additional host groups from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) [10:22:39] (03PS3) 10Muehlenhoff: nft base sets: Read additional host groups from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) [10:23:50] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetdb: add a motd to inform users that theses serveres are no longer live [puppet] - 10https://gerrit.wikimedia.org/r/955303 (https://phabricator.wikimedia.org/T280353) (owner: 10Jbond) [10:25:16] (03CR) 10Btullis: "I wouldn't use the extra `.set.` structure in the values, personally" [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [10:27:09] 10ops-codfw, 10Content-Transform-Team, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10jijiki) >>! In T344110#9143064, @Jhancock.wm wrote: > @jijiki is there a day this week that I can update the firmware on this server? Hey @Jhancock.wm we ne... 
[10:27:51] !log Resetting PIC 1/1 on cr2-codfw to enable et-1/1/5 at 100G (T345583) [10:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:10] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet [10:28:17] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [10:28:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10joanna_borun) Approved [10:30:10] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:30:24] (03PS1) 10Jbond: prometheus: convert and to lower case [puppet] - 10https://gerrit.wikimedia.org/r/955307 [10:30:49] (03CR) 10Jbond: [V: 03+2 C: 03+2] prometheus: convert and to lower case [puppet] - 10https://gerrit.wikimedia.org/r/955307 (owner: 10Jbond) [10:32:37] (03PS2) 10Vgutierrez: admin: Add brouberol user [puppet] - 10https://gerrit.wikimedia.org/r/955294 (https://phabricator.wikimedia.org/T345633) [10:34:04] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet [10:35:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [10:37:26] (03PS1) 10Muehlenhoff: firewall::service Check for presence of srange/drange in the nftables path [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) [10:38:59] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:39:02] 10SRE, 10SRE-Access-Requests, 
10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10Vgutierrez) Cheers, I've amended the patch to include the ops membership (already approved by @joanna_borun). CR still blocked till we get @o... [10:39:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [10:40:15] 10SRE, 10Wikimedia-Mailing-lists: Cross post to multiple mailling lists is only received once by recipient - https://phabricator.wikimedia.org/T345691 (10Ladsgroup) The first email didn't go through multiple mailing lists because each one have different acceptance criteria (different size limit, membership cri... [10:42:36] (03CR) 10Btullis: datahub: add oidc production settings (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [10:43:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:46:37] (03PS1) 10Jbond: prometheus: drop monitoring::openapi_service and prometheus::blackbox_check_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) [10:47:11] (03CR) 10Stevemunene: datahub: add oidc production settings (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [10:48:58] (03CR) 10CI reject: [V: 04-1] prometheus: drop monitoring::openapi_service and prometheus::blackbox_check_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:51:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43153/console" [puppet] - 
10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:55:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. I guess it would be beneficial to have spec tests to check the different cases. They can be introduced at a later patch, let me know" [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:56:44] (03CR) 10Muehlenhoff: firewall::service Check for presence of srange/drange in the nftables path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:58:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet [10:58:47] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet [10:59:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43154/console" [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:02:08] (03PS9) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [11:03:11] (03CR) 10CI reject: [V: 04-1] datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [11:04:27] (03PS2) 10Jbond: prometheus: drop monitoring::openapi_service and prometheus::blackbox_check_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) [11:04:29] (03PS1) 10Jbond: prometheus: sort data to reduce puppet diffs [puppet] - 10https://gerrit.wikimedia.org/r/955312 (https://phabricator.wikimedia.org/T345717) [11:05:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
mc1044.eqiad.wmnet [11:05:10] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/955312 (https://phabricator.wikimedia.org/T345717) (owner: 10Jbond) [11:05:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet [11:05:31] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:06:53] (03CR) 10CI reject: [V: 04-1] prometheus: drop monitoring::openapi_service and prometheus::blackbox_check_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:08:37] (03PS6) 10FNegri: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:09:01] (03CR) 10CI reject: [V: 04-1] P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:10:06] (03CR) 10FNegri: P:wmcs: unify toolsdb profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:10:49] (03PS7) 10FNegri: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:11:12] (03PS2) 10Jbond: prometheus: sort data to reduce puppet diffs [puppet] - 10https://gerrit.wikimedia.org/r/955312 (https://phabricator.wikimedia.org/T345717) [11:11:14] (03PS3) 10Jbond: prometheus: drop monitoring::openapi_service and prometheus::blackbox_check_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) [11:11:22] (03CR) 10CI reject: [V: 04-1] P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:12:32] (03PS8) 
10FNegri: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:12:56] (03CR) 10CI reject: [V: 04-1] P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:13:33] (03PS9) 10FNegri: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:13:42] (03CR) 10CI reject: [V: 04-1] prometheus: drop monitoring::openapi_service and prometheus::blackbox_check_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:13:58] (03CR) 10jenkins-bot: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:14:35] (03PS10) 10FNegri: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:16:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43156/console" [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:18:07] (03PS4) 10Jbond: prometheus: drop unused resources [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) [11:18:22] (03PS5) 10Jbond: prometheus: drop unused resources [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) [11:19:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T343198)', diff saved to https://phabricator.wikimedia.org/P52266 and previous config saved to /var/cache/conftool/dbconfig/20230906-111908-arnaudb.json [11:19:13] T343198: Add pl_target_id column to pagelinks 
in production - https://phabricator.wikimedia.org/T343198 [11:27:02] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet [11:27:04] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet [11:29:38] (03CR) 10Btullis: datahub: add oidc production settings (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [11:32:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet [11:33:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet [11:34:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P52267 and previous config saved to /var/cache/conftool/dbconfig/20230906-113414-arnaudb.json [11:39:55] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (10MoritzMuehlenhoff) [11:41:51] (03PS1) 10Muehlenhoff: Add Phabricator reference [puppet] - 10https://gerrit.wikimedia.org/r/955317 [11:44:32] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T345726 (10Stevemunene) [11:45:07] (03CR) 10Jbond: "lgtm however you are not using them anywhere?" 
[puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:49:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P52268 and previous config saved to /var/cache/conftool/dbconfig/20230906-114921-arnaudb.json [11:50:22] (03CR) 10Muehlenhoff: [C: 03+2] Add Phabricator reference [puppet] - 10https://gerrit.wikimedia.org/r/955317 (owner: 10Muehlenhoff) [11:55:34] (03PS4) 10Muehlenhoff: nft base sets: Read additional host groups from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) [11:55:59] (03CR) 10Muehlenhoff: nft base sets: Read additional host groups from Hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:01:25] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (10Stevemunene) [12:01:53] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. 
- https://phabricator.wikimedia.org/T345726 (10Stevemunene) [12:03:39] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet [12:03:44] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet [12:04:15] (03PS1) 10Urbanecm: changeprop: Rule for refreshUserImpactJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/955319 (https://phabricator.wikimedia.org/T344428) [12:04:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T343198)', diff saved to https://phabricator.wikimedia.org/P52269 and previous config saved to /var/cache/conftool/dbconfig/20230906-120427-arnaudb.json [12:04:29] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:04:30] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [12:04:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:04:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52270 and previous config saved to /var/cache/conftool/dbconfig/20230906-120448-arnaudb.json [12:08:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet [12:10:09] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet [12:10:18] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: sort data to reduce puppet diffs [puppet] - 10https://gerrit.wikimedia.org/r/955312 (https://phabricator.wikimedia.org/T345717) (owner: 10Jbond) [12:10:24] (03PS1) 10Hnowlan: rest-gateway: preserve cluster hostname when using ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/955320 (https://phabricator.wikimedia.org/T336400) [12:10:45] (03PS2) 10Jbond: 
firewall::service Check for presence of srange/drange in the nftables path [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:10:47] (03PS1) 10Jbond: wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 [12:10:51] (03PS2) 10Urbanecm: changeprop: Rule for refreshUserImpactJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/955319 (https://phabricator.wikimedia.org/T344428) [12:11:14] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [12:11:24] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:11:27] (03CR) 10CI reject: [V: 04-1] wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [12:11:29] (03CR) 10Jbond: "lgtm but see inline i think its time to genralise the host to ips conversion" [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:12:44] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:13:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:14:08] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [12:15:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "I forgot about I48e93ac8f21be of course, which is basically the same, up to you if you'd like to merge this or that" [puppet] - 10https://gerrit.wikimedia.org/r/955312 (https://phabricator.wikimedia.org/T345717) (owner: 10Jbond) [12:18:41] (03PS2) 10Jbond: wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 [12:18:43] (03PS3) 10Jbond: firewall::service Check for presence of srange/drange in the nftables path [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:21:06] (03CR) 10CI 
reject: [V: 04-1] wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [12:22:30] (03PS3) 10Jbond: wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 [12:22:32] (03PS4) 10Jbond: firewall::service Check for presence of srange/drange in the nftables path [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:24:56] (03CR) 10CI reject: [V: 04-1] wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [12:26:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:26:25] (03CR) 10Muehlenhoff: wmflib::hosts2ips: Add new function (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [12:27:17] (03PS4) 10Jbond: wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 [12:27:19] (03PS5) 10Jbond: firewall::service Check for presence of srange/drange in the nftables path [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:27:53] (03CR) 10CI reject: [V: 04-1] wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [12:31:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:32:09] (03CR) 10Paladox: [C: 04-1] C:gerrit Link account creation to IDM. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) (owner: 10Slyngshede) [12:33:12] (03PS1) 10Muehlenhoff: testreduce: Setup rsync for data transfer [puppet] - 10https://gerrit.wikimedia.org/r/955325 (https://phabricator.wikimedia.org/T345220) [12:34:56] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:56] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:06] 10SRE, 10Infrastructure-Foundations, 10netops: New IP and Vlan allocations for esams knams move - https://phabricator.wikimedia.org/T343214 (10cmooney) 05Open→03Resolved [12:37:15] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10cmooney) [12:38:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nft base sets: Read additional host groups from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:42:46] (03PS1) 10Anzx: bnwikisource: update legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955327 (https://phabricator.wikimedia.org/T345666) [12:49:10] (03PS10) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [12:49:28] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43157/console" [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [12:50:01] (03CR) 10Muehlenhoff: [C: 03+2] testreduce: Setup rsync for data transfer [puppet] - 10https://gerrit.wikimedia.org/r/955325 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff) 
[12:50:26] (03CR) 10Stevemunene: datahub: add oidc production settings (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T1300). [13:00:05] aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] o/ [13:02:02] i can deploy [13:02:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955327 (https://phabricator.wikimedia.org/T345666) (owner: 10Anzx) [13:03:08] (03Merged) 10jenkins-bot: bnwikisource: update legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955327 (https://phabricator.wikimedia.org/T345666) (owner: 10Anzx) [13:03:37] !log taavi@deploy1002 Started scap: Backport for [[gerrit:955327|bnwikisource: update legacy vector logo (T345666)]] [13:03:41] T345666: Update logo for Bangla Wikisource - https://phabricator.wikimedia.org/T345666 [13:05:20] !log taavi@deploy1002 taavi and anzx: Backport for [[gerrit:955327|bnwikisource: update legacy vector logo (T345666)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:05:27] aanzx: please test [13:05:36] Ok [13:07:18] taavi: looks good [13:07:24] thanks, syncing [13:07:26] !log taavi@deploy1002 taavi and anzx: Continuing with sync [13:07:57] and I think I need to be purging the caches for those logos once the sync is done [13:14:13] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [13:14:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye [13:14:34] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [13:14:48] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with... [13:16:21] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2001.codfw.wmnet [13:16:22] !log bking@cumin1001 START - Cookbook sre.dns.netbox [13:18:52] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" [13:19:17] (03CR) 10Ayounsi: [C: 03+1] asw1-b*27-esams: add durum300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/954965 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:19:27] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service Failed on elastic1092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:39] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" 
[13:19:39] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:19:39] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2001.codfw.wmnet on all recursors [13:19:43] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk2001.codfw.wmnet on all recursors [13:20:06] (03CR) 10Ssingh: [C: 03+2] asw1-b*27-esams: add durum300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/954965 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:20:10] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2001.codfw.wmnet - bking@cumin1001" [13:20:54] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2001.codfw.wmnet - bking@cumin1001" [13:21:13] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:955327|bnwikisource: update legacy vector logo (T345666)]] (duration: 17m 35s) [13:21:16] T345666: Update logo for Bangla Wikisource - https://phabricator.wikimedia.org/T345666 [13:21:29] !log taavi@mwmaint1002 ~ $ cat logos-to-purge.txt | mwscript purgeList.php --wiki enwiki # T345666 [13:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:50] !log homer "asw1-b*27-esams*" commit "add durum300[34]" [13:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:11] aanzx: that's all done I think [13:22:50] taavi: thanks [13:26:18] (03PS1) 10Ladsgroup: Pin pagelinks normalization stage to old in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955331 (https://phabricator.wikimedia.org/T345732) [13:27:24] 10SRE, 10Infrastructure-Foundations, 10netops: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (10ayounsi) I thought that was not possible but it got 
introduced recently (in 16.1). +1 [13:33:12] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2001.codfw.wmnet with OS bookworm [13:33:22] (03CR) 10Ssingh: Release 1.5.3-5 (031 comment) [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) (owner: 10Ssingh) [13:35:55] (03PS2) 10Ssingh: Release 1.5.3-5 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) [13:36:17] (03PS1) 10Filippo Giunchedi: cxserver: update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/955333 (https://phabricator.wikimedia.org/T320563) [13:36:19] (03PS1) 10Filippo Giunchedi: cxserver: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/955334 (https://phabricator.wikimedia.org/T320563) [13:36:21] (03CR) 10Ssingh: Release 1.5.3-5 (031 comment) [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) (owner: 10Ssingh) [13:36:22] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet [13:36:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet [13:36:29] (03PS1) 10Muehlenhoff: Fix rsync config [puppet] - 10https://gerrit.wikimedia.org/r/955335 [13:38:11] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [13:38:32] !log sudo ethtool -G eno1 rx 1000 on conf2004 T345738 [13:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:35] T345738: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 [13:39:09] (03CR) 10Muehlenhoff: [C: 03+2] Fix rsync config [puppet] - 10https://gerrit.wikimedia.org/r/955335 (owner: 10Muehlenhoff) [13:40:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
mc2047.codfw.wmnet [13:42:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet [13:42:52] (03CR) 10Sergio Gimeno: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955319 (https://phabricator.wikimedia.org/T344428) (owner: 10Urbanecm) [13:45:51] (03PS4) 10Stevemunene: [WIP] admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) [13:46:22] (03CR) 10CI reject: [V: 04-1] [WIP] admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [13:47:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) (owner: 10Ssingh) [13:47:28] (03CR) 10Vgutierrez: [C: 03+1] Release 1.5.3-5 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) (owner: 10Ssingh) [13:49:10] (03CR) 10Ssingh: [C: 03+2] Release 1.5.3-5 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/955043 (https://phabricator.wikimedia.org/T345663) (owner: 10Ssingh) [13:51:11] (03PS1) 10ArielGlenn: move dumps-related workers and nfs shares from core platform to data engineering [puppet] - 10https://gerrit.wikimedia.org/r/955338 [13:53:09] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw1349.eqiad.wmnet [13:53:23] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw1351.eqiad.wmnet [13:54:51] !log powercycling mw1351.eqiad.wmnet [13:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:07] !log powercycling mw1349.eqiad.wmnet [13:57:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:10] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [13:59:55] !log repooling mw1351.eqiad.wmnet [13:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T1400) [14:00:42] (03CR) 10Muehlenhoff: "Nice, this is going in the right direction! Couple of comments inline." [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [14:01:35] taavi: mw1349 is not coming back up, despite trying to hard reset through management interface, so it'll stay pooled=invalid [14:01:45] I'll open a DCops task [14:02:37] (03CR) 10Muehlenhoff: "Looks good, one question inline." [puppet] - 10https://gerrit.wikimedia.org/r/955338 (owner: 10ArielGlenn) [14:04:04] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [14:04:54] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 4.08 ms [14:05:06] PROBLEM - Check systemd state on mw1349 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:09] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet - https://phabricator.wikimedia.org/T345741 (10Clement_Goubert) [14:06:11] What the hell [14:06:12] ok [14:06:32] RECOVERY - Check systemd state on mw1349 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:37] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 
(10Trizek-WMF) [14:07:28] I'll try and reboot it again through mgmt interface to see if it still warrants dcops investigation [14:08:10] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [14:08:59] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:24] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet - https://phabricator.wikimedia.org/T345741 (10Clement_Goubert) Server came back up as I pressed submit, however there still is an issue with the management interface. It does powercycle the server when asked, but states `Unable to perf... 
[14:09:44] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:13] ^expected [14:10:15] (03CR) 10Btullis: datahub: add oidc production settings (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [14:11:26] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:14:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:35] (03CR) 10Jbond: [C: 03+2] prometheus: sort data to reduce puppet diffs [puppet] - 10https://gerrit.wikimedia.org/r/955312 (https://phabricator.wikimedia.org/T345717) (owner: 10Jbond) [14:15:39] (03CR) 10Jbond: [C: 03+2] prometheus: drop unused resources [puppet] - 10https://gerrit.wikimedia.org/r/955309 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [14:16:01] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet - https://phabricator.wikimedia.org/T345741 (10MoritzMuehlenhoff) You could try running "racadm racreset" over the serial console, it solved similar cases for me in the past. It will kick you off the current SSH connection to the mgmt,... 
[14:16:24] (03CR) 10Muehlenhoff: [C: 03+2] nft base sets: Read additional host groups from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/955297 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:16:52] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [14:17:26] (03PS1) 10Filippo Giunchedi: prometheus: fix arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/955341 [14:17:28] (03CR) 10CI reject: [V: 04-1] wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [14:17:32] (03PS5) 10Jbond: wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 [14:17:41] (03CR) 10Jbond: wmflib::hosts2ips: Add new function (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [14:18:18] (03CR) 10CI reject: [V: 04-1] wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [14:18:40] Restarting appserver reboots [14:18:50] !log Restarting appserver reboots [14:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [14:18:59] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:12] (03PS6) 10Jbond: wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 [14:19:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) kubernetes1047 - E 3. U 39. port 36 Cableid 502576 kubernetes1048 - E 3. U 40. port 37 Cableid 502577 kubernetes1049 - E 3. U 41. port 45 Cableid 502578 kubernet... 
[14:20:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/955341 (owner: 10Filippo Giunchedi) [14:22:31] !log Leaving mw1349.eqiad.wmnet pooled=invalid until management interface investigation - T345741 [14:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:33] T345741: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet - https://phabricator.wikimedia.org/T345741 [14:22:49] (03PS1) 10Cathal Mooney: Set system console user timeout for Juniper devices [homer/public] - 10https://gerrit.wikimedia.org/r/955342 (https://phabricator.wikimedia.org/T345710) [14:24:12] RECOVERY - Check systemd state on elastic1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:13] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service Failed on elastic1092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:27:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [14:28:52] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet - https://phabricator.wikimedia.org/T345741 (10Clement_Goubert) 05Open→03Resolved >>! In T345741#9146405, @MoritzMuehlenhoff wrote: > You could try running "racadm racreset" over the serial console, it solved similar cases for me in... 
[14:30:20] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:48] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:30:57] ^expected [14:31:01] (03CR) 10Jbond: [C: 03+2] wmflib::hosts2ips: Add new function [puppet] - 10https://gerrit.wikimedia.org/r/955321 (owner: 10Jbond) [14:31:40] (03PS6) 10Jbond: firewall::service Check for presence of srange/drange in the nftables path [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:31:45] !log Repooling mw1349.eqiad.wmnet - T345741 [14:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:48] T345741: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet - https://phabricator.wikimedia.org/T345741 [14:31:51] (03CR) 10Jbond: "child patch merged now" [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:32:19] (03PS7) 10Jbond: firewall::service Check for presence of srange/drange in the nftables path [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:33:45] jouncebot: nowandnext [14:33:45] For the next 0 hour(s) and 26 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T1400) [14:33:45] In 2 hour(s) and 26 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T1700) [14:34:08] Amir1: Be advised I'm rebooting appservers, so deployment to some hosts may fail [14:34:43] yeah, decided to do that in two hours instead [14:34:59] 10SRE, 10Wikimedia-Mailing-lists: Cross post to multiple mailling lists is only received once by recipient - https://phabricator.wikimedia.org/T345691 (10Trizek-WMF) I posted the message in two batches: # to all mailing lists, CC-ed - it was rejected (not held) by some lists, as Ladsgroupe 
explained - yo... [14:39:44] PROBLEM - Host mw1354 is DOWN: PING CRITICAL - Packet loss = 100% [14:40:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Jclark-ctr) cloudvirt1062. E 4 U 41 Port 44 CableID 230304500288 cloudvirt1063. E 4 U 42 Port 45 CableID 230304500134 cloudvirt1064. E 4 U 43 Port 46... [14:40:19] mw1354 down expected, currently being powercycled [14:41:02] RECOVERY - Host mw1354 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [14:41:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:42:51] (03CR) 10Muehlenhoff: [C: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:43:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:43:48] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T345744 (10phaultfinder) [14:48:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host moss-be2003.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:52:38] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2001.codfw.wmnet with OS bookworm [14:52:38] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2001.codfw.wmnet [14:54:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52272 and previous config saved to /var/cache/conftool/dbconfig/20230906-145439-arnaudb.json [14:56:23] (03CR) 10Jbond: [C: 03+2] firewall::service Check for presence of srange/drange in the nftables path [puppet] - 
10https://gerrit.wikimedia.org/r/955308 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:56:43] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (10Vgutierrez) having this in place would have prevented a ncredir related page already. I'm happy to have this opt-in per cookbook (personally I'... [14:57:46] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [14:57:56] PROBLEM - Thanos swift https on thanos-fe1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.047 second response time https://wikitech.wikimedia.org/wiki/Thanos [14:59:15] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet [14:59:16] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [14:59:18] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet [14:59:22] RECOVERY - Thanos swift https on thanos-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 1.167 second response time https://wikitech.wikimedia.org/wiki/Thanos [15:00:21] (03PS3) 10Muehlenhoff: Switch sretest1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/955292 [15:01:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [15:05:13] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet [15:05:30] PROBLEM - Host mw1368 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:30] PROBLEM - Host mw1366 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:34] PROBLEM - Host mw1365 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:44] PROBLEM - Host mw1367 is DOWN: PING CRITICAL - 
Packet loss = 100% [15:05:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet [15:06:08] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003'] [15:06:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:07:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['moss-be2003'] [15:08:31] On it [15:08:34] (mw hosts down) [15:09:18] (03CR) 10ArielGlenn: move dumps-related workers and nfs shares from core platform to data engineering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955338 (owner: 10ArielGlenn) [15:09:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P52273 and previous config saved to /var/cache/conftool/dbconfig/20230906-150945-arnaudb.json [15:11:04] RECOVERY - Host mw1365 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:11:08] (03PS1) 10Muehlenhoff: nftables_base_sets: Skip NETWORK_INFRA for now [puppet] - 10https://gerrit.wikimedia.org/r/955350 [15:11:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/955338 (owner: 10ArielGlenn) [15:12:14] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:12:58] RECOVERY - Host mw1367 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms 
[15:13:40] RECOVERY - Host mw1368 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:16:32] (03PS2) 10Hnowlan: rest-gateway: preserve cluster hostname when using ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/955320 (https://phabricator.wikimedia.org/T336400) [15:17:21] (03CR) 10Will Doran: [C: 03+1] "I approve this change" [puppet] - 10https://gerrit.wikimedia.org/r/955338 (owner: 10ArielGlenn) [15:17:42] RECOVERY - Host mw1366 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [15:21:07] (03CR) 10Jbond: [C: 03+1] nftables_base_sets: Skip NETWORK_INFRA for now [puppet] - 10https://gerrit.wikimedia.org/r/955350 (owner: 10Muehlenhoff) [15:22:53] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [15:23:01] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Support cookbooks resume after user interruption - https://phabricator.wikimedia.org/T345402 (10Volans) In terms of feasibility the only way to "resume" is to install a signal handler for SIGINT that asks the user to either resume or continue **but ther... 
[15:23:36] (03CR) 10Muehlenhoff: [C: 03+2] nftables_base_sets: Skip NETWORK_INFRA for now [puppet] - 10https://gerrit.wikimedia.org/r/955350 (owner: 10Muehlenhoff) [15:23:55] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond) 05Open→03Resolved a:03jbond All systems now using the new puppetdb's [15:24:33] (03PS7) 10Sergio Gimeno: GrowthExperiments: enable add a link in 12 and 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) [15:24:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P52274 and previous config saved to /var/cache/conftool/dbconfig/20230906-152451-arnaudb.json [15:24:54] (03PS4) 10Muehlenhoff: Switch sretest1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/955292 [15:26:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [15:27:26] (03Abandoned) 10Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [15:32:49] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1049.eqiad.wmnet [15:33:05] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [15:35:06] (03PS1) 10Elukey: Add SLO definition for the ORES Legacy service [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) [15:35:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) So, `ethtool -G eno1 rx 1000` apparently did the 
[trick](https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=con... [15:37:17] (03PS2) 10Elukey: Add SLO definition for the ORES Legacy service [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) [15:38:07] !log sudo ethtool -G eno1 rx 1000 on conf2005, conf2006 to test out the theory. T345738 [15:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:11] T345738: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 [15:38:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1049.eqiad.wmnet [15:39:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet [15:39:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52275 and previous config saved to /var/cache/conftool/dbconfig/20230906-153957-arnaudb.json [15:40:00] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [15:40:01] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:40:13] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [15:41:16] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [15:42:01] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1384.eqiad.wmnet [15:42:08] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1364.eqiad.wmnet [15:42:13] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1373.eqiad.wmnet 
[15:42:18] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1385.eqiad.wmnet [15:42:22] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1387.eqiad.wmnet [15:50:01] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:54:51] (03PS2) 10FNegri: [openstack] upgrade codfw1dev to Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/954056 (https://phabricator.wikimedia.org/T341285) [15:55:04] (03CR) 10FNegri: [openstack] upgrade codfw1dev to Antelope (2023.1) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954056 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [15:55:56] !bash MatmaRex: i want to make a joke about C++ version adoption but i need your help to workshop it [15:55:56] Amir1: Stored quip at https://bash.toolforge.org/quip/9kU1a4oBhuQtenzvGGZh [15:56:32] not that I have anything smart to say but I never knew you could call bash from here, that's a TIL for me [16:04:52] (03CR) 10Andrew Bogott: [C: 03+1] "let's try this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/954056 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [16:11:09] (03PS1) 10Majavah: P:prometheus::ops: add ensure filter for envoy [puppet] - 10https://gerrit.wikimedia.org/r/955363 [16:14:48] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [16:14:55] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye [16:16:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43158/console" [puppet] - 10https://gerrit.wikimedia.org/r/955363 (owner: 10Majavah) [16:17:24] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:18:47] (03CR) 10FNegri: [C: 03+2] [openstack] upgrade codfw1dev to Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/954056 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [16:20:00] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (10BCornwall) TBH I think that cookbooks should be *less* verbose, which will help punctuate the more important information. IMO the current output of co... [16:24:18] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10cmooney) We removed switch asw-b1-codfw as it no longer had any servers connected (they were moved to cloudsw1-b1-codfw). The correlation between th... 
[16:25:34] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove entries for cloudweb2002-dev - cmooney@cumin1001" [16:28:47] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) From install1004, I can see my attempts on Aug 31 failing with `no free leases`, which would seem to suggest that (at least for some subset of attempts), that it w... [16:33:36] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:33:59] (JobUnavailable) resolved: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:03] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Support cookbooks resume after user interruption - https://phabricator.wikimedia.org/T345402 (10BCornwall) The cookbooks should already stop/prompt the user when built with `confirm_on_failure()`. Anything more interactive is probably not a good UX and... 
[16:35:02] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:38:59] (JobUnavailable) resolved: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:40:40] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS bullseye [16:42:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove entries for cloudweb2002-dev - cmooney@cumin1001" [16:42:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:43:50] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye [16:43:55] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -... 
[16:51:28] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [16:51:35] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye [16:57:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10VRiley-WMF) db1229 - B 6. U 29. Port 23 CableID 1896 db1230 - C 3. U 23. Port 8 CableID 3310 db1231 - C 6. U 11. Port 15 CableID 3221 db1232 - D 3. U 14. Port 22 CableID 3687 d... [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T1700) [17:05:43] !log Upload libvmod-re2_1.5.3-5_amd64 to bookworm-wikimedia [17:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:57] 10SRE, 10Traffic: Package libvmod-re2 for Debian 12/Bookworm - https://phabricator.wikimedia.org/T345663 (10BCornwall) 05In progress→03Resolved Thanks, sukhe! 
[17:07:06] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [17:09:10] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [17:10:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [17:15:24] (03PS2) 10Esanders: Turn off DiscussionTools A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) [17:19:44] (03PS1) 10Esanders: Enable edit check on en/fr beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955368 (https://phabricator.wikimedia.org/T345658) [17:20:20] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:21:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:22:46] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:22:54] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1132.eqiad.wmnet with reason: host reimage [17:24:00] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10ppenloglou) As per the [[ https://wikitech.wikimedia.org/wiki/SRE/Production_access#Debugging | debugging instructions here ]], I'm unfortunately stuck and can't seem to connect via SSH.... 
[17:25:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1132.eqiad.wmnet with reason: host reimage [17:29:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:29:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:33:29] (03PS1) 10Jbond: sre.hosts.provision: Preform a check for the serial number [cookbooks] - 10https://gerrit.wikimedia.org/r/955370 [17:38:32] (03CR) 10Volans: sre.hosts.provision: Preform a check for the serial number (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955370 (owner: 10Jbond) [17:38:35] (03CR) 10Jeena Huneidi: "Hi! If you would like this patch to be backported please add it to a backport window: https://wikitech.wikimedia.org/wiki/Backport_windows" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: 10Caenus) [17:39:56] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10cmooney) I'm not 100% sure what's going on here. What I can say: * During PXEboot or from Linux booted from debian ISO the eno1 interface show's "UP" on the host side *... 
[17:41:07] (03CR) 10Jbond: "Looking at the code i think that the issue can only happen if you:" [cookbooks] - 10https://gerrit.wikimedia.org/r/955370 (owner: 10Jbond) [17:41:59] (03CR) 10Jbond: [C: 04-1] "i see now why this wont work" [cookbooks] - 10https://gerrit.wikimedia.org/r/955370 (owner: 10Jbond) [17:48:50] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1132.eqiad.wmnet with OS bullseye [17:50:47] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Aklapper) @ppenloglou: Could you please try `bast6003` instead of `bast6002` per https://gerrit.wikimedia.org/r/c/operations/puppet/+/954597/ ? Thanks! :) [17:55:17] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye [17:55:24] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors:... [17:56:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:58:15] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030'] [17:58:31] (03PS2) 10Majavah: taskgen: log number of unknown python files [puppet] - 10https://gerrit.wikimedia.org/r/954285 [17:58:36] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10ppenloglou) Oh my god, it worked immediately. Thank you so much!! 
[17:58:40] (03PS3) 10Majavah: taskgen: also match pytest files [puppet] - 10https://gerrit.wikimedia.org/r/954286 [17:58:50] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase1030'] [18:00:05] hashar and jeena: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T1800). [18:00:10] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030'] [18:00:17] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['restbase1030'] [18:00:19] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030'] [18:00:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1030'] [18:00:35] jeena: I rolled it this morning. I have filed a few tasks but nothing worth being a blocker as far as I understood it :] [18:01:05] hashar: 👍 [18:02:23] (03CR) 10Majavah: [C: 03+2] taskgen: log number of unknown python files [puppet] - 10https://gerrit.wikimedia.org/r/954285 (owner: 10Majavah) [18:02:31] (03CR) 10Majavah: [C: 03+2] taskgen: also match pytest files [puppet] - 10https://gerrit.wikimedia.org/r/954286 (owner: 10Majavah) [18:12:08] (03CR) 10Jbond: "did you mean to merge this one? i assumed it was just for debugging?" 
[puppet] - 10https://gerrit.wikimedia.org/r/954285 (owner: 10Majavah) [18:13:16] (03CR) 10Jbond: "FTR i have no objection" [puppet] - 10https://gerrit.wikimedia.org/r/954285 (owner: 10Majavah) [18:13:55] (03PS1) 10Bartosz Dziewoński: Article: Check permissions before showing link to view deleted revision [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/955054 (https://phabricator.wikimedia.org/T264765) [18:14:06] (03PS1) 10Bartosz Dziewoński: Article: Check permissions before showing link to view deleted revision [core] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955055 (https://phabricator.wikimedia.org/T264765) [18:14:17] (03PS1) 10Bartosz Dziewoński: TopicSubscriptionsPager: Handle invalid titles [extensions/DiscussionTools] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/955056 (https://phabricator.wikimedia.org/T345648) [18:14:22] (03PS1) 10MusikAnimal: Delay loading ext.phonos module until user clicks [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955057 (https://phabricator.wikimedia.org/T345414) [18:14:28] (03PS1) 10Bartosz Dziewoński: TopicSubscriptionsPager: Handle invalid titles [extensions/DiscussionTools] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955058 (https://phabricator.wikimedia.org/T345648) [18:15:26] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [18:15:39] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [18:15:41] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:15:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:16:02] !log arnaudb@cumin1001 dbctl 
commit (dc=all): 'Depooling db1161 (T343198)', diff saved to https://phabricator.wikimedia.org/P52276 and previous config saved to /var/cache/conftool/dbconfig/20230906-181602-arnaudb.json [18:16:07] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:17:18] i hate it when i'm doing backports and gerrit skips a change number [18:17:41] 955054, 955055, 955056, and… 955058??? whyyyy [18:17:46] (03PS1) 10Bking: flink-zk: Use insetup::serach_platform role [puppet] - 10https://gerrit.wikimedia.org/r/955376 (https://phabricator.wikimedia.org/T345754) [18:18:09] oh, i see why. haha :D but it happens for no reason sometimes too [18:19:18] (03CR) 10Jbond: [C: 04-2] sre.hosts.provision: Preform a check for the serial number (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955370 (owner: 10Jbond) [18:19:30] (03Abandoned) 10Jbond: sre.hosts.provision: Preform a check for the serial number [cookbooks] - 10https://gerrit.wikimedia.org/r/955370 (owner: 10Jbond) [18:24:58] MatmaRex: are you sure it's for no reason [18:34:21] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10cmooney) Ok so I downgraded the NIC firmware to 21.80.9 but the pattern is the exact same. I'd possibly try another SFP just in case, and another cable. If it was an amp... 
[18:35:38] (03PS2) 10Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 [18:56:19] (03PS1) 10Eevans: thanos: Create replacement user for wdqs savepoints [puppet] - 10https://gerrit.wikimedia.org/r/955382 (https://phabricator.wikimedia.org/T345765) [18:58:58] (03PS1) 10Eevans: Add mock password for wdqs_savepoints [labs/private] - 10https://gerrit.wikimedia.org/r/955383 (https://phabricator.wikimedia.org/T345765) [18:59:54] (03PS2) 10Bking: flink-zk: Use insetup::search_platform role [puppet] - 10https://gerrit.wikimedia.org/r/955376 (https://phabricator.wikimedia.org/T345754) [19:04:28] (03PS1) 10DLynch: Edit check: Turn on when ecenable=1 is set [extensions/VisualEditor] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955059 (https://phabricator.wikimedia.org/T345297) [19:06:41] (03CR) 10Bking: [C: 03+1] thanos: Create replacement user for wdqs savepoints [puppet] - 10https://gerrit.wikimedia.org/r/955382 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:06:49] (03CR) 10Bking: [C: 03+1] Add mock password for wdqs_savepoints [labs/private] - 10https://gerrit.wikimedia.org/r/955383 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:08:14] (03CR) 10HMonroy: [C: 03+2] Delay loading ext.phonos module until user clicks [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955057 (https://phabricator.wikimedia.org/T345414) (owner: 10MusikAnimal) [19:10:07] (03Merged) 10jenkins-bot: Delay loading ext.phonos module until user clicks [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955057 (https://phabricator.wikimedia.org/T345414) (owner: 10MusikAnimal) [19:10:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hmonroy@deploy1002 using scap backport" [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955057 (https://phabricator.wikimedia.org/T345414) (owner: 10MusikAnimal) [19:10:54] !log 
hmonroy@deploy1002 Started scap: Backport for [[gerrit:955057|Delay loading ext.phonos module until user clicks (T345414)]] [19:11:01] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414 [19:12:37] !log hmonroy@deploy1002 hmonroy and musikanimal: Backport for [[gerrit:955057|Delay loading ext.phonos module until user clicks (T345414)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [19:12:45] !log hmonroy@deploy1002 hmonroy and musikanimal: Continuing with sync [19:13:42] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:17] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (10BCornwall) @Vgutierrez Is this something that should be addressed in the cookbook? Your idea of automatically including it in cookbooks with d... 
[19:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:18:53] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:955057|Delay loading ext.phonos module until user clicks (T345414)]] (duration: 07m 58s) [19:18:59] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414 [19:19:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:21:22] (03CR) 10Eevans: [V: 03+2 C: 03+2] Add mock password for wdqs_savepoints [labs/private] - 10https://gerrit.wikimedia.org/r/955383 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:21:51] (03CR) 10Eevans: [C: 03+2] thanos: Create replacement user for wdqs savepoints [puppet] - 10https://gerrit.wikimedia.org/r/955382 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:26:02] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (10BCornwall) It looks like confirmation is already shown in the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/hea... 
[19:26:59] (03PS1) 10Jdlrobson: Add wikispecies logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955385 (https://phabricator.wikimedia.org/T341252) [19:27:01] (03PS1) 10Jdlrobson: Disable wordmark on Gothic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955386 (https://phabricator.wikimedia.org/T341253) [19:27:03] (03PS1) 10Jdlrobson: Wikimania logos and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955387 (https://phabricator.wikimedia.org/T341254) [19:28:43] (03PS1) 10Eevans: Revert "thanos: Create replacement user for wdqs savepoints" [puppet] - 10https://gerrit.wikimedia.org/r/955060 [19:29:21] (03CR) 10Eevans: [C: 03+2] Revert "thanos: Create replacement user for wdqs savepoints" [puppet] - 10https://gerrit.wikimedia.org/r/955060 (owner: 10Eevans) [19:30:50] (03PS1) 10DDesouza: Undeploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955388 (https://phabricator.wikimedia.org/T345158) [19:32:05] (03PS3) 10DDesouza: Pre-deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954724 (https://phabricator.wikimedia.org/T344393) [19:34:53] (03CR) 10Ebernhardson: [C: 03+1] flink-zk: Use insetup::search_platform role [puppet] - 10https://gerrit.wikimedia.org/r/955376 (https://phabricator.wikimedia.org/T345754) (owner: 10Bking) [19:36:23] (03CR) 10Bking: [C: 03+2] flink-zk: Use insetup::search_platform role [puppet] - 10https://gerrit.wikimedia.org/r/955376 (https://phabricator.wikimedia.org/T345754) (owner: 10Bking) [19:39:46] (03PS1) 10Clare Ming: Bump mediawiki_history_snapshot to 2023-08 [puppet] - 10https://gerrit.wikimedia.org/r/955389 [19:42:33] (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/955389 (owner: 10Clare Ming) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T2000). [20:00:05] sergi0, MatmaRex, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:17] here [20:00:21] hi [20:00:37] hi [20:00:50] o/ i can deploy [20:01:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: 10Sergio Gimeno) [20:02:33] (03Merged) 10jenkins-bot: GrowthExperiments: enable add a link in 12 and 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: 10Sergio Gimeno) [20:02:53] (03CR) 10Majavah: [C: 03+2] Article: Check permissions before showing link to view deleted revision [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/955054 (https://phabricator.wikimedia.org/T264765) (owner: 10Bartosz Dziewoński) [20:02:59] (03CR) 10Majavah: [C: 03+2] Article: Check permissions before showing link to view deleted revision [core] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955055 (https://phabricator.wikimedia.org/T264765) (owner: 10Bartosz Dziewoński) [20:03:04] !log taavi@deploy1002 Started scap: Backport for [[gerrit:948144|GrowthExperiments: enable add a link in 12 and 13th round of wikis (T308137 T308138)]] [20:03:05] (03CR) 10Majavah: [C: 03+2] TopicSubscriptionsPager: Handle invalid titles [extensions/DiscussionTools] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/955056 (https://phabricator.wikimedia.org/T345648) (owner: 10Bartosz Dziewoński) [20:03:08] T308137: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137 [20:03:09] T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 [20:03:09] (03CR) 
10Majavah: [C: 03+2] TopicSubscriptionsPager: Handle invalid titles [extensions/DiscussionTools] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955058 (https://phabricator.wikimedia.org/T345648) (owner: 10Bartosz Dziewoński) [20:04:40] !log taavi@deploy1002 taavi and sgimeno: Backport for [[gerrit:948144|GrowthExperiments: enable add a link in 12 and 13th round of wikis (T308137 T308138)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:04:48] sergi0: please test [20:04:54] on it [20:06:24] taavi: things looking good on my end [20:07:14] !log taavi@deploy1002 taavi and sgimeno: Continuing with sync [20:10:06] Jdlrobson: looking at your patches, logos/README.md says `local_wordmark` and `local_tagline` should only be used 'if you are in a hurry'. why are those being used here? [20:10:41] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:10:49] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2002.codfw.wmnet [20:10:51] !log bking@cumin1001 START - Cookbook sre.dns.netbox [20:11:02] taavi: that README looks outdated [20:11:18] we have 200+ projects to update [20:11:35] it's unrealistic to have these in commons from the offset until they are minimized and confirmed to be working [20:12:52] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2002.codfw.wmnet - bking@cumin1001" [20:13:07] we've had to do quite a few follow ups due to bad SVGS e.g. 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946623 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/944318 [20:13:17] We can programmatically upload these to commons once all the projects have logos [20:13:20] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:948144|GrowthExperiments: enable add a link in 12 and 13th round of wikis (T308137 T308138)]] (duration: 10m 16s) [20:13:24] T308137: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137 [20:13:24] T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 [20:13:37] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2002.codfw.wmnet - bking@cumin1001" [20:13:37] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:13:37] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2002.codfw.wmnet on all recursors [20:13:41] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk2002.codfw.wmnet on all recursors [20:13:59] (03CR) 10Majavah: [C: 03+2] Add wikispecies logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955385 (https://phabricator.wikimedia.org/T341252) (owner: 10Jdlrobson) [20:14:06] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2002.codfw.wmnet - bking@cumin1001" [20:14:08] (03CR) 10Majavah: [C: 03+2] Disable wordmark on Gothic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955386 (https://phabricator.wikimedia.org/T341253) (owner: 10Jdlrobson) [20:14:23] (03CR) 10Majavah: [C: 03+2] Wikimania logos and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955387 (https://phabricator.wikimedia.org/T341254) (owner: 10Jdlrobson) [20:14:38] Jdlrobson: ah. 
could you ensure the README gets updated please? [20:14:50] (03Merged) 10jenkins-bot: Add wikispecies logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955385 (https://phabricator.wikimedia.org/T341252) (owner: 10Jdlrobson) [20:14:53] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2002.codfw.wmnet - bking@cumin1001" [20:14:56] (03Merged) 10jenkins-bot: Disable wordmark on Gothic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955386 (https://phabricator.wikimedia.org/T341253) (owner: 10Jdlrobson) [20:14:59] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2002.codfw.wmnet with OS bookworm [20:15:07] (03Merged) 10jenkins-bot: Wikimania logos and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955387 (https://phabricator.wikimedia.org/T341254) (owner: 10Jdlrobson) [20:15:40] !log taavi@deploy1002 Started scap: Backport for [[gerrit:955385|Add wikispecies logo (T341252)]], [[gerrit:955386|Disable wordmark on Gothic Wikipedia (T341253)]], [[gerrit:955387|Wikimania logos and taglines (T341254)]] [20:15:46] T341254: Provide taglines for Wikimania projects - https://phabricator.wikimedia.org/T341254 [20:15:46] T341252: Provide wordmark and icon for Species Wiki - https://phabricator.wikimedia.org/T341252 [20:15:46] T341253: Provide wordmark and tagline for Gothic Wikipedia - https://phabricator.wikimedia.org/T341253 [20:16:30] taavi: sure [20:17:10] (03Merged) 10jenkins-bot: Article: Check permissions before showing link to view deleted revision [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/955054 (https://phabricator.wikimedia.org/T264765) (owner: 10Bartosz Dziewoński) [20:17:18] !log taavi@deploy1002 jdlrobson and taavi: Backport for [[gerrit:955385|Add wikispecies logo (T341252)]], [[gerrit:955386|Disable wordmark on Gothic Wikipedia (T341253)]], 
[[gerrit:955387|Wikimania logos and taglines (T341254)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:19:39] (03Merged) 10jenkins-bot: Article: Check permissions before showing link to view deleted revision [core] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955055 (https://phabricator.wikimedia.org/T264765) (owner: 10Bartosz Dziewoński) [20:19:45] (03Merged) 10jenkins-bot: TopicSubscriptionsPager: Handle invalid titles [extensions/DiscussionTools] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/955056 (https://phabricator.wikimedia.org/T345648) (owner: 10Bartosz Dziewoński) [20:19:47] (03Merged) 10jenkins-bot: TopicSubscriptionsPager: Handle invalid titles [extensions/DiscussionTools] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955058 (https://phabricator.wikimedia.org/T345648) (owner: 10Bartosz Dziewoński) [20:20:57] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:58] Jdlrobson: please test [20:22:44] taavi: looking thanks [20:24:00] LGTM taavi please sync [20:24:04] !log taavi@deploy1002 jdlrobson and taavi: Continuing with sync [20:24:08] thanks, syncing [20:27:30] thanks taavi [20:30:05] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:955385|Add wikispecies logo (T341252)]], [[gerrit:955386|Disable wordmark on Gothic Wikipedia (T341253)]], [[gerrit:955387|Wikimania logos and taglines (T341254)]] (duration: 14m 25s) [20:30:12] T341254: Provide taglines for Wikimania projects - https://phabricator.wikimedia.org/T341254 [20:30:12] T341252: Provide wordmark and icon for Species Wiki - https://phabricator.wikimedia.org/T341252 [20:30:13] T341253: Provide wordmark and tagline for Gothic Wikipedia -
https://phabricator.wikimedia.org/T341253 [20:30:22] Jdlrobson: all done [20:30:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:30:47] !log taavi@deploy1002 Started scap: Backport for [[gerrit:955054|Article: Check permissions before showing link to view deleted revision (T264765)]], [[gerrit:955055|Article: Check permissions before showing link to view deleted revision (T264765)]], [[gerrit:955056|TopicSubscriptionsPager: Handle invalid titles (T345648)]], [[gerrit:955058|TopicSubscriptionsPager: Handle invalid titles (T345648)]] [20:30:50] T345648: Wikimedia\Assert\ParameterTypeException: Bad value for parameter $target: must be a LinkTarget|PageReference when viewing Special:TopicSubscriptions - https://phabricator.wikimedia.org/T345648 [20:31:02] MatmaRex: doing yours now [20:31:26] thanks [20:32:29] !log taavi@deploy1002 matmarex and taavi: Backport for [[gerrit:955054|Article: Check permissions before showing link to view deleted revision (T264765)]], [[gerrit:955055|Article: Check permissions before showing link to view deleted revision (T264765)]], [[gerrit:955056|TopicSubscriptionsPager: Handle invalid titles (T345648)]], [[gerrit:955058|TopicSubscriptionsPager: Handle invalid titles (T345648)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:32:56] MatmaRex: please test [20:34:04] taavi: looks good [20:34:24] !log taavi@deploy1002 matmarex and taavi: Continuing with sync [20:34:27] syncing [20:34:44] yay thanks taavi for your help today! I'll follow up with the readme improvements now. 
[20:35:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:39:27] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flink-zk2002.codfw.wmnet with reason: host reimage [20:40:30] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:955054|Article: Check permissions before showing link to view deleted revision (T264765)]], [[gerrit:955055|Article: Check permissions before showing link to view deleted revision (T264765)]], [[gerrit:955056|TopicSubscriptionsPager: Handle invalid titles (T345648)]], [[gerrit:955058|TopicSubscriptionsPager: Handle invalid titles (T345648)]] (duration: 09m 42s) [20:40:33] T345648: Wikimedia\Assert\ParameterTypeException: Bad value for parameter $target: must be a LinkTarget|PageReference when viewing Special:TopicSubscriptions - https://phabricator.wikimedia.org/T345648 [20:40:45] aand done [20:41:10] (03PS1) 10Jdlrobson: Update README clarifying the use of local images. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955393 [20:41:28] taavi: what do you think? [20:42:26] i would prefer if someone familiar with the system +1'd that patch before merging it [20:42:34] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flink-zk2002.codfw.wmnet with reason: host reimage [20:43:43] thanks taavi [20:43:57] 👍 [20:45:30] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [20:48:53] (03CR) 10Bking: "This change is ready for review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [20:52:42] (SystemdUnitFailed) firing: nginx.service Failed on wcqs2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:14] (03CR) 10Ryan Kemper: [C: 03+1] wdqs.data-transfer: Keep downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [20:53:27] (03CR) 10Bking: [C: 03+2] wdqs.data-transfer: Keep downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [20:56:00] (03Merged) 10jenkins-bot: wdqs.data-transfer: Keep downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [20:56:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T343198)', diff saved to https://phabricator.wikimedia.org/P52277 and previous config saved to /var/cache/conftool/dbconfig/20230906-205626-arnaudb.json [20:56:29] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [20:56:34] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host flink-zk2002.codfw.wmnet with OS bookworm [20:56:34] !log bking@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host flink-zk2002.codfw.wmnet [20:57:42] (SystemdUnitFailed) resolved: nginx.service Failed on wcqs2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:58:03] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1006.eqiad.wmnet with OS bullseye [20:58:30] !log bking@cumin1001 START - 
Cookbook sre.hosts.reimage for host wdqs1007.eqiad.wmnet with OS bullseye [20:58:42] (SystemdUnitFailed) firing: nginx.service Failed on wcqs2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230906T2100) [21:03:42] (SystemdUnitFailed) resolved: (2) nginx.service Failed on wcqs2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:42] (SystemdUnitFailed) firing: nginx.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:16] (The WCQS alerts are related to ongoing reboots, they should resolve soon. 
I'll need to check our cookbook to see why we're still getting alerts; they should be downtimed, but I think the cookbook is lifting the downtime too early) [21:09:42] (SystemdUnitFailed) resolved: (3) nginx.service Failed on wcqs2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10RKemper) [21:10:29] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1006.eqiad.wmnet with reason: host reimage [21:10:42] (SystemdUnitFailed) firing: nginx.service Failed on wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:43] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1007.eqiad.wmnet with reason: host reimage [21:11:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P52278 and previous config saved to /var/cache/conftool/dbconfig/20230906-211132-arnaudb.json [21:11:48] (03PS1) 10Bking: wdqs-internal: switch wdqs1016 from public to internal role [puppet] - 10https://gerrit.wikimedia.org/r/955396 (https://phabricator.wikimedia.org/T314890) [21:12:56] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1006.eqiad.wmnet with reason: host reimage [21:13:09] (03CR) 10Ryan Kemper: [C: 03+1] wdqs-internal: switch wdqs1016 from public to internal role [puppet] - 10https://gerrit.wikimedia.org/r/955396 (https://phabricator.wikimedia.org/T314890) (owner: 10Bking) [21:15:14] PROBLEM - Check systemd state on 
an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:25] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1007.eqiad.wmnet with reason: host reimage [21:15:42] (SystemdUnitFailed) firing: (4) nginx.service Failed on wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:46] (03CR) 10Ryan Kemper: [C: 03+1] "We'll hold off on merging this, and the corresponding reimage of wdqs1016, until 2 unrelated existing wdqs eqiad reimages have completed (" [puppet] - 10https://gerrit.wikimedia.org/r/955396 (https://phabricator.wikimedia.org/T314890) (owner: 10Bking) [21:18:38] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2003.codfw.wmnet [21:18:39] !log bking@cumin1001 START - Cookbook sre.dns.netbox [21:20:28] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [21:20:42] (SystemdUnitFailed) resolved: (5) nginx.service Failed on wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:54] RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:21:02] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2003.codfw.wmnet - bking@cumin1001" [21:22:12] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
flink-zk2003.codfw.wmnet - bking@cumin1001" [21:22:12] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:22:12] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2003.codfw.wmnet on all recursors [21:22:16] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk2003.codfw.wmnet on all recursors [21:22:41] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2003.codfw.wmnet - bking@cumin1001" [21:23:09] (03Abandoned) 10Ryan Kemper: query_service: move allowlist file resource [puppet] - 10https://gerrit.wikimedia.org/r/951566 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [21:23:28] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2003.codfw.wmnet - bking@cumin1001" [21:26:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P52279 and previous config saved to /var/cache/conftool/dbconfig/20230906-212638-arnaudb.json [21:27:28] (03CR) 10Krinkle: [C: 03+1] mw-cli-wrapper: fix own dc reference in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/935448 (owner: 10Krinkle) [21:29:13] (03PS1) 10Eevans: thanos: (re)create replacement user for wdqs savepoints [puppet] - 10https://gerrit.wikimedia.org/r/955400 (https://phabricator.wikimedia.org/T345765) [21:30:04] (03PS2) 10Eevans: thanos: (re)create replacement user for wdqs savepoints [puppet] - 10https://gerrit.wikimedia.org/r/955400 (https://phabricator.wikimedia.org/T345765) [21:30:42] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955400 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [21:31:03] (03PS2) 10Ryan Kemper: wdqs.data-transfer: fix usage [cookbooks] - 
10https://gerrit.wikimedia.org/r/949146 [21:32:27] (03PS1) 10Andrew Bogott: Horizon: use ldaps for the ldap uri [puppet] - 10https://gerrit.wikimedia.org/r/955401 (https://phabricator.wikimedia.org/T345779) [21:32:29] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2003.codfw.wmnet with OS bookworm [21:35:09] (03PS2) 10Andrew Bogott: Horizon sudo panel: use ldaps for the ldap uri [puppet] - 10https://gerrit.wikimedia.org/r/955401 (https://phabricator.wikimedia.org/T345779) [21:35:12] (03CR) 10Eevans: [C: 03+2] thanos: (re)create replacement user for wdqs savepoints [puppet] - 10https://gerrit.wikimedia.org/r/955400 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [21:38:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1006.eqiad.wmnet with OS bullseye [21:38:46] RECOVERY - Check systemd state on mw2442 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:53] !log eevans@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [21:40:27] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1007.eqiad.wmnet with OS bullseye [21:41:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T343198)', diff saved to https://phabricator.wikimedia.org/P52280 and previous config saved to /var/cache/conftool/dbconfig/20230906-214145-arnaudb.json [21:41:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [21:41:48] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:42:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [21:42:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 
'Depooling db1185 (T343198)', diff saved to https://phabricator.wikimedia.org/P52281 and previous config saved to /var/cache/conftool/dbconfig/20230906-214205-arnaudb.json [21:44:30] !log eevans@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [21:45:32] 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10BCornwall) a:05BCornwall→03None [21:51:11] (03PS3) 10Andrew Bogott: Horizon sudo panel: use ldaps for the ldap uri [puppet] - 10https://gerrit.wikimedia.org/r/955401 (https://phabricator.wikimedia.org/T345779) [21:52:13] (03CR) 10Andrew Bogott: [C: 03+2] Horizon sudo panel: use ldaps for the ldap uri [puppet] - 10https://gerrit.wikimedia.org/r/955401 (https://phabricator.wikimedia.org/T345779) (owner: 10Andrew Bogott) [21:53:03] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flink-zk2003.codfw.wmnet with reason: host reimage [21:56:14] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flink-zk2003.codfw.wmnet with reason: host reimage [22:00:34] (03PS5) 10BCornwall: mtail: Record bad requests for varnish SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) [22:00:40] (03CR) 10BCornwall: mtail: Record bad requests for varnish SLI metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [22:10:18] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host flink-zk2003.codfw.wmnet with OS bookworm [22:10:18] !log bking@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host flink-zk2003.codfw.wmnet [22:17:09] 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - 
https://phabricator.wikimedia.org/T308467 (10BCornwall) [22:37:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Papaul) Thank you for putting the summary together. Another scenario I was thinking about while reading the document is up... [22:39:13] (03CR) 10Jdlrobson: "<3 This is great. Looking forward to seeing the impact on enwiki tomorrow when wmf25 rolls out there. Thanks for prioritizing this!" [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955057 (https://phabricator.wikimedia.org/T345414) (owner: 10MusikAnimal) [22:58:18] (03PS1) 10Andrew Bogott: update horizon docker version [puppet] - 10https://gerrit.wikimedia.org/r/955407 (https://phabricator.wikimedia.org/T345779) [22:59:20] (03CR) 10Andrew Bogott: [C: 03+2] update horizon docker version [puppet] - 10https://gerrit.wikimedia.org/r/955407 (https://phabricator.wikimedia.org/T345779) (owner: 10Andrew Bogott) [23:20:38] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [23:27:28] (03PS1) 10Andrew Bogott: Revert "Horizon sudo panel: use ldaps for the ldap uri" [puppet] - 10https://gerrit.wikimedia.org/r/955409 [23:28:00] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon sudo panel: use ldaps for the ldap uri" [puppet] - 10https://gerrit.wikimedia.org/r/955409 (owner: 10Andrew Bogott) [23:44:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T343198)', diff saved to https://phabricator.wikimedia.org/P52282 and previous config saved to /var/cache/conftool/dbconfig/20230906-234458-arnaudb.json [23:45:03] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198