[00:00:02] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9578667 (BTullis) Open→Resolved @Jclark-ctr - This is all done now, I believe. I had to change one BIOS setting t...
[00:13:33] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1035.eqiad.wmnet with reason: host reimage
[00:16:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1035.eqiad.wmnet with reason: host reimage
[00:18:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T357189)', diff saved to https://phabricator.wikimedia.org/P57969 and previous config saved to /var/cache/conftool/dbconfig/20240227-001802-arnaudb.json
[00:18:10] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[00:30:19] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:33:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P57970 and previous config saved to /var/cache/conftool/dbconfig/20240227-003309-arnaudb.json
[00:39:12] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006203
[00:39:18] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006203 (owner: TrainBranchBot)
[00:48:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P57971 and previous config saved to /var/cache/conftool/dbconfig/20240227-004815-arnaudb.json
[00:54:45] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[01:03:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T357189)', diff saved to https://phabricator.wikimedia.org/P57972 and previous config saved to /var/cache/conftool/dbconfig/20240227-010321-arnaudb.json
[01:03:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[01:03:28] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[01:03:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006203 (owner: TrainBranchBot)
[01:03:35] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9578746 (phaultfinder)
[01:03:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[01:03:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T357189)', diff saved to https://phabricator.wikimedia.org/P57973 and previous config saved to /var/cache/conftool/dbconfig/20240227-010344-arnaudb.json
[01:07:43] SRE, SRE Observability (FY2023/2024-Q3): Icinga Log Permission Conflict with Puppet Configuration - https://phabricator.wikimedia.org/T358539#9578752 (andrea.denisse)
[01:08:02] SRE, SRE Observability (FY2023/2024-Q3): Icinga Log Permission Conflict with Puppet Configuration - https://phabricator.wikimedia.org/T358539#9578765 (andrea.denisse) a:andrea.denisse
[01:20:12] SRE, SRE Observability (FY2023/2024-Q3): Icinga Fails to Start Due to Missing Hostgroup 'swift' - https://phabricator.wikimedia.org/T358540#9578912 (andrea.denisse)
[01:20:38] SRE, SRE Observability (FY2023/2024-Q3): Icinga Fails to Start Due to Missing Hostgroup 'swift' - https://phabricator.wikimedia.org/T358540#9578924 (andrea.denisse) a:andrea.denisse
[01:23:57] (PS1) RLazarus: k8s-controller-sidecars: Add the other missing namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1006606 (https://phabricator.wikimedia.org/T348284)
[01:24:36] (CR) CI reject: [V: -1] k8s-controller-sidecars: Add the other missing namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1006606 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[01:28:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T357189)', diff saved to https://phabricator.wikimedia.org/P57974 and previous config saved to /var/cache/conftool/dbconfig/20240227-012814-arnaudb.json
[01:28:21] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[01:28:48] (PS2) RLazarus: k8s-controller-sidecars: Add the other missing namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1006606 (https://phabricator.wikimedia.org/T348284)
[01:37:36] (PS1) RLazarus: deployment_server: Add missing env variables to mwscript_k8s [puppet] - https://gerrit.wikimedia.org/r/1006607 (https://phabricator.wikimedia.org/T341553)
[01:41:15] SRE, ops-codfw, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9578949 (Jhancock.wm)
[01:43:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P57975 and previous config saved to /var/cache/conftool/dbconfig/20240227-014321-arnaudb.json
[01:44:32] SRE, ops-codfw, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9578951 (Jhancock.wm) This server is in codfw. I'll get a report sent to Dell asap to get a replacement cpu
[01:46:34] SRE, ops-codfw, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9578953 (Jhancock.wm) actually, this server is not in warranty. I will try to find a viable replacement from the decommissioned inventory in the morning.
[01:47:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:52:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:53:29] (ProbeDown) firing: (2) Service urldownloader1003:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:54:07] (CR) RLazarus: [C: +2] deployment_server: Add missing env variables to mwscript_k8s [puppet] - https://gerrit.wikimedia.org/r/1006607 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[01:58:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P57976 and previous config saved to /var/cache/conftool/dbconfig/20240227-015827-arnaudb.json
[02:10:58] SRE, ops-codfw, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9578961 (wiki_willy) Thanks for picking this up @Jhancock.wm. @Marostegui - since this host looks like it's close to being refreshed in T355350, do you want to just wait for...
[02:13:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T357189)', diff saved to https://phabricator.wikimedia.org/P57977 and previous config saved to /var/cache/conftool/dbconfig/20240227-021333-arnaudb.json
[02:13:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance
[02:13:41] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[02:13:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance
[02:13:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T357189)', diff saved to https://phabricator.wikimedia.org/P57978 and previous config saved to /var/cache/conftool/dbconfig/20240227-021357-arnaudb.json
[02:31:06] ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9578969 (Jhancock.wm)
[02:32:01] SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9578979 (Jhancock.wm)
[02:34:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T357189)', diff saved to https://phabricator.wikimedia.org/P57979 and previous config saved to /var/cache/conftool/dbconfig/20240227-023456-arnaudb.json
[02:35:11] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[02:38:02] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:50:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P57980 and previous config saved to /var/cache/conftool/dbconfig/20240227-025002-arnaudb.json
[02:53:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:58:02] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T0300)
[03:05:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P57981 and previous config saved to /var/cache/conftool/dbconfig/20240227-030508-arnaudb.json
[03:07:39] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.20 [core] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1006204 (https://phabricator.wikimedia.org/T354438)
[03:07:45] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.20 [core] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1006204 (https://phabricator.wikimedia.org/T354438) (owner: TrainBranchBot)
[03:11:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:19:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:20:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:20:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T357189)', diff saved to https://phabricator.wikimedia.org/P57982 and previous config saved to /var/cache/conftool/dbconfig/20240227-032015-arnaudb.json
[03:20:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance
[03:20:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance
[03:20:32] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[03:20:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57983 and previous config saved to /var/cache/conftool/dbconfig/20240227-032037-arnaudb.json
[03:21:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51453 bytes in 8.724 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:26:44] (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.20 [core] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1006204 (https://phabricator.wikimedia.org/T354438) (owner: TrainBranchBot)
[03:31:16] (PS1) Tim Starling: In RequestContext::setUser() also reset $this->skinName [core] (wmf/1.42.0-wmf.19) - https://gerrit.wikimedia.org/r/1006313 (https://phabricator.wikimedia.org/T336504)
[03:41:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57984 and previous config saved to /var/cache/conftool/dbconfig/20240227-034144-arnaudb.json
[03:41:51] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[03:51:00] (CR) KartikMistry: "Yes. We can't automate that :/" [mediawiki-config] - https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) (owner: KartikMistry)
[03:56:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P57985 and previous config saved to /var/cache/conftool/dbconfig/20240227-035650-arnaudb.json
[04:00:04] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T0400)
[04:02:03] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.17 (duration: 02m 00s)
[04:02:26] (PS13) KartikMistry: WIP: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235)
[04:02:28] (PS2) KartikMistry: Enable SectionTranslation for Wikipedias where ContentTranslation is in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1004613 (https://phabricator.wikimedia.org/T353734)
[04:03:21] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1006612 (https://phabricator.wikimedia.org/T354438)
[04:03:23] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1006612 (https://phabricator.wikimedia.org/T354438) (owner: TrainBranchBot)
[04:03:57] (PS14) KartikMistry: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235)
[04:04:09] (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1006612 (https://phabricator.wikimedia.org/T354438) (owner: TrainBranchBot)
[04:04:36] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.20 refs T354438
[04:04:42] T354438: 1.42.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T354438
[04:11:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P57986 and previous config saved to /var/cache/conftool/dbconfig/20240227-041156-arnaudb.json
[04:27:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57987 and previous config saved to /var/cache/conftool/dbconfig/20240227-042703-arnaudb.json
[04:27:09] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[04:40:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:41:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:56:54] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.20 refs T354438 (duration: 52m 18s)
[04:57:00] T354438: 1.42.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T354438
[04:58:59] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott)
[05:35:09] * kart_ deploying cxserver
[05:35:18] (CR) KartikMistry: [C: +2] cxserver: Remove dictionary support [deployment-charts] - https://gerrit.wikimedia.org/r/1006182 (owner: KartikMistry)
[05:36:11] (Merged) jenkins-bot: cxserver: Remove dictionary support [deployment-charts] - https://gerrit.wikimedia.org/r/1006182 (owner: KartikMistry)
[05:41:28] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:41:59] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:46:12] (ProbeDown) firing: (2) Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:46:16] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:46:48] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:47:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:48:41] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:49:15] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:51:12] (ProbeDown) resolved: (2) Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:52:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:53:02] (JobUnavailable) firing: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:53:29] (ProbeDown) firing: (2) Service urldownloader1003:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:03:59] SRE, ops-codfw, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9579105 (Marostegui) >>! In T358421#9578961, @wiki_willy wrote: > Thanks for picking this up @Jhancock.wm. @Marostegui - since this host looks like it's close to being refre...
[06:06:58] !log cxserver: Removed dictionary support
[06:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Master upgrade x2 T353499
[06:22:47] T353499: Upgrade x2 to Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T353499
[06:22:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Master upgrade x2 T353499
[06:24:14] (PS1) Marostegui: db1152: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1006699 (https://phabricator.wikimedia.org/T353499)
[06:31:28] SRE, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9579132 (Marostegui)
[06:31:38] (CR) Marostegui: [C: +2] db1152: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1006699 (https://phabricator.wikimedia.org/T353499) (owner: Marostegui)
[06:31:40] SRE, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574314 (Marostegui) Started data consistency check
[06:34:57] (PS1) Marostegui: clouddb1014: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1006748 (https://phabricator.wikimedia.org/T356838)
[06:35:12] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s7
[06:35:17] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s2
[06:35:35] (CR) Marostegui: "Host is depooled" [puppet] - https://gerrit.wikimedia.org/r/1006748 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[06:37:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2029 T358180', diff saved to https://phabricator.wikimedia.org/P57988 and previous config saved to /var/cache/conftool/dbconfig/20240227-063707-root.json
[06:37:13] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[06:38:56] (PS1) Marostegui: es2029: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1006749 (https://phabricator.wikimedia.org/T358180)
[06:40:15] (CR) Marostegui: [C: +2] es2029: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1006749 (https://phabricator.wikimedia.org/T358180) (owner: Marostegui)
[06:41:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2029.codfw.wmnet with OS bookworm
[06:41:46] (CR) Ayounsi: [C: +2] Netbox: set ENFORCE_GLOBAL_UNIQUE to True [puppet] - https://gerrit.wikimedia.org/r/1006001 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[06:42:04] !log Netbox: set ENFORCE_GLOBAL_UNIQUE to True - T336275
[06:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:09] T336275: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275
[06:42:39] (PS1) Marostegui: db2118: Notes about its crash [puppet] - https://gerrit.wikimedia.org/r/1006750 (https://phabricator.wikimedia.org/T358421)
[06:43:03] (CR) Marostegui: [C: +1] Switch mariadb::core_multiinstance to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1006477 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[06:43:20] (PS1) Marostegui: Revert "es2029: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1006322
[06:44:48] (CR) Marostegui: [C: +2] db2118: Notes about its crash [puppet] - https://gerrit.wikimedia.org/r/1006750 (https://phabricator.wikimedia.org/T358421) (owner: Marostegui)
[06:47:58] (PS1) Marostegui: db1151: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1006751
[06:49:21] (CR) Marostegui: [C: +2] db1151: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1006751 (owner: Marostegui)
[06:50:19] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[06:51:11] (PS1) Marostegui: s6-pager.sql: Remove [software] - https://gerrit.wikimedia.org/r/1006752
[06:51:48] (CR) Marostegui: [C: +2] s6-pager.sql: Remove [software] - https://gerrit.wikimedia.org/r/1006752 (owner: Marostegui)
[06:52:18] (Merged) jenkins-bot: s6-pager.sql: Remove [software] - https://gerrit.wikimedia.org/r/1006752 (owner: Marostegui)
[06:53:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:54:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:54:51] (CR) Ayounsi: [C: +2] Routed Ganeti: move the tap v4 IP to Hiera [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[06:56:39] (PS1) Marostegui: db2143: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1006754
[06:58:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2029.codfw.wmnet with reason: host reimage
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T0700)
[07:00:05] kormat, marostegui, Amir1, and arnaudb: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T0700). Please do the needful.
[07:02:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2029.codfw.wmnet with reason: host reimage
[07:02:44] SRE, Continuous-Integration-Infrastructure, collaboration-services, vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9579177 (ayounsi) > cloud VPS doesn't really seem feasible to me I'm curious to know more why it doesn't ? Maybe if there are limitations t...
[07:09:29] SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T357379#9579181 (ayounsi) Open→Resolved a:ayounsi Seeing the very sporadic nature of the issue, I'd say it's a provider issue and not an optic issue. https://librenms.wikimedia.org/graphs/to=1709017500/id=11592/type=...
[07:11:33] (CR) Ayounsi: [C: +1] Use loopback for DHCP relay on single-ip EVPN anycast GWs [homer/public] - https://gerrit.wikimedia.org/r/1006568 (https://phabricator.wikimedia.org/T358488) (owner: Cathal Mooney)
[07:13:02] (CR) Marostegui: [C: +2] db2143: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1006754 (owner: Marostegui)
[07:20:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2029.codfw.wmnet with OS bookworm
[07:20:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 1%: After migration to 10.6 T358180', diff saved to https://phabricator.wikimedia.org/P57989 and previous config saved to /var/cache/conftool/dbconfig/20240227-072044-root.json
[07:20:47] (CR) Marostegui: [C: +2] Revert "es2029: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1006322 (owner: Marostegui)
[07:20:50] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[07:22:53] PROBLEM - MariaDB read only x2 #page on db1152 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.6.16-MariaDB-log, Uptime 2951s, event_scheduler: True, 2117.79 QPS, connection latency: 0.004454s, query latency: 0.000439s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[07:23:17] that's me
[07:23:18] fixing
[07:23:53] RECOVERY - MariaDB read only x2 #page on db1152 is OK: Version 10.6.16-MariaDB-log, Uptime 3012s, read_only: False, event_scheduler: True, 2349.51 QPS, connection latency: 0.004444s, query latency: 0.000463s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[07:34:28] (CR) Muehlenhoff: [C: +2] Switch mariadb::core_multiinstance to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1006477 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:35:15] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9579221 (MoritzMuehlenhoff)
[07:35:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 5%: After migration to 10.6 T358180', diff saved to https://phabricator.wikimedia.org/P57990 and previous config saved to /var/cache/conftool/dbconfig/20240227-073549-root.json
[07:35:56] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[07:46:10] (CR) Dom Walden: [C: +1] beta: Switch block schema to read-new/write-new mode [mediawiki-config] - https://gerrit.wikimedia.org/r/998626 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[07:46:45] (CR) Dom Walden: [C: +1] "It would be good to monitor any divergence between the new and old tables." [mediawiki-config] - https://gerrit.wikimedia.org/r/1006179 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[07:50:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 10%: After migration to 10.6 T358180', diff saved to https://phabricator.wikimedia.org/P57991 and previous config saved to /var/cache/conftool/dbconfig/20240227-075054-root.json
[07:51:01] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[07:59:39] (CR) Majavah: [C: +1] clouddb1014: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1006748 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[08:00:05] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:00:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host dbproxy2001.codfw.wmnet
[08:02:05] (CR) Marostegui: [C: +2] clouddb1014: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1006748 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[08:05:17] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s2
[08:05:21] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s7
[08:05:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: After migration to 10.6 T358180', diff saved to https://phabricator.wikimedia.org/P57992 and previous config saved to /var/cache/conftool/dbconfig/20240227-080559-root.json
[08:06:12] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[08:15:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host dbproxy2001.codfw.wmnet
[08:21:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: After migration to 10.6 T358180', diff saved to https://phabricator.wikimedia.org/P57993 and previous config saved to /var/cache/conftool/dbconfig/20240227-082103-root.json
[08:26:14] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host alert2001.wikimedia.org with OS bookworm
[08:36:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2132.codfw.wmnet
[08:36:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: After migration to 10.6 T358180', diff saved to https://phabricator.wikimedia.org/P57994 and previous config saved to /var/cache/conftool/dbconfig/20240227-083608-root.json
[08:36:19] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[08:38:59] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9579360 (taavi)
[08:47:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2132.codfw.wmnet
[08:51:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: After migration to 10.6 T358180', diff saved to https://phabricator.wikimedia.org/P57995 and previous config saved to /var/cache/conftool/dbconfig/20240227-085113-root.json
[08:51:15] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9579405 (Marostegui) @Jhancock.wm when installing these hosts, when asked about which puppet version you want, go for 5 instead of 7
[08:51:22] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[08:51:38] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9579413 (Marostegui) @Jhancock.wm when installing these hosts, when asked about which puppet version you want, go for 5 instead of 7. You can try to reimage db2196 and go for...
[08:52:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2134.codfw.wmnet
[08:53:20] SRE, SRE Observability: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9579417 (LSobanski)
[08:54:47] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579419 (Marostegui) @Jclark-ctr we've fixed the puppet issues, can you try to reimage the hosts again but when asked about which puppet version, go for 5 instead of 7.
[08:56:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1035.eqiad.wmnet with OS bookworm
[08:56:51] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579422 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1035.eqiad.wmnet with OS bookworm
[09:06:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2134.codfw.wmnet
[09:09:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1035.eqiad.wmnet with reason: host reimage
[09:11:20] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9579490 (MoritzMuehlenhoff)
[09:12:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1035.eqiad.wmnet with reason: host reimage
[09:14:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1036.eqiad.wmnet with OS bookworm
[09:14:39] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579496 (ops-monitoring-bot)
Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1036.eqiad.wmnet with OS bookworm [09:15:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1151.eqiad.wmnet [09:17:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9579500 (10cmooney) 05Open→03Resolved Patch tested again and still working consistently, I think the initial prob... [09:18:35] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9579525 (10phaultfinder) [09:25:22] hi, I'm going to backport this in a few minutes: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1006313 [09:26:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1035.eqiad.wmnet with OS bookworm [09:26:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1035.eqiad.wmnet with OS bookworm completed: - es1035 (**PASS**) -... [09:27:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579571 (10Marostegui) [09:27:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1036.eqiad.wmnet with reason: host reimage [09:31:42] (03PS2) 10Slyngshede: P:IDP New Bookworm IDP servers. [puppet] - 10https://gerrit.wikimedia.org/r/1006840 (https://phabricator.wikimedia.org/T357748) [09:31:50] (03CR) 10Slyngshede: P:IDP New Bookworm IDP servers. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006840 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [09:31:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1036.eqiad.wmnet with reason: host reimage [09:32:18] (03PS3) 10Slyngshede: P:IDP New Bookworm IDP servers. [puppet] - 10https://gerrit.wikimedia.org/r/1006840 (https://phabricator.wikimedia.org/T357748) [09:33:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jnuche@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1006313 (https://phabricator.wikimedia.org/T336504) (owner: 10Tim Starling) [09:37:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1037.eqiad.wmnet with OS bookworm [09:38:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579605 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1037.eqiad.wmnet with OS bookworm [09:38:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1038.eqiad.wmnet with OS bookworm [09:38:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579607 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1038.eqiad.wmnet with OS bookworm [09:39:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1039.eqiad.wmnet with OS bookworm [09:39:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579611 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1039.eqiad.wmnet with OS bookworm [09:39:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1040.eqiad.wmnet 
with OS bookworm [09:40:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579615 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1040.eqiad.wmnet with OS bookworm [09:43:24] (03CR) 10Jelto: [C: 04-1] "see in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [09:44:13] (03CR) 10Gmodena: [C: 03+2] eventstreams: update page redaction list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006544 (https://phabricator.wikimedia.org/T354456) (owner: 10Htriedman) [09:44:54] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - marostegui@cumin1002" [09:45:33] (03PS1) 10Klausman: admin_ng/ml-services: raise request maximums for art-desc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006855 (https://phabricator.wikimedia.org/T358467) [09:45:35] (03Merged) 10jenkins-bot: eventstreams: update page redaction list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006544 (https://phabricator.wikimedia.org/T354456) (owner: 10Htriedman) [09:45:59] (03PS2) 10Klausman: admin_ng/ml-services: raise request maximums for art-desc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006855 (https://phabricator.wikimedia.org/T358467) [09:46:01] (03PS1) 10Brouberol: superset: fix the memcached service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006856 (https://phabricator.wikimedia.org/T352166) [09:46:55] PROBLEM - Disk space on mw2278 is CRITICAL: DISK CRITICAL - free space: / 2079 MB (2% inode=98%): /tmp 2079 MB (2% inode=98%): /var/tmp 2079 MB (2% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops 
[09:47:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:47:57] (03PS1) 10Slyngshede: Silence PKI alerts until we have better data. [alerts] - 10https://gerrit.wikimedia.org/r/1006857 [09:50:17] (03CR) 10Kevin Bazira: [C: 03+1] admin_ng/ml-services: raise request maximums for art-desc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006855 (https://phabricator.wikimedia.org/T358467) (owner: 10Klausman) [09:51:45] (03PS1) 10Muehlenhoff: Switch db1151 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006858 (https://phabricator.wikimedia.org/T349619) [09:52:03] (03Merged) 10jenkins-bot: In RequestContext::setUser() also reset $this->skinName [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1006313 (https://phabricator.wikimedia.org/T336504) (owner: 10Tim Starling) [09:52:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:53:06] !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1006313|In RequestContext::setUser() also reset $this->skinName (T336504)]] [09:53:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1040.eqiad.wmnet with reason: host reimage [09:53:12] T336504: Transcluding Special:Prefixindex can force the default skin - https://phabricator.wikimedia.org/T336504 [09:53:29] (ProbeDown) firing: (2) Service urldownloader1003:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:53:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - marostegui@cumin1002" [09:54:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1036.eqiad.wmnet with OS bookworm [09:54:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1036.eqiad.wmnet with OS bookworm completed: - es1036 (**PASS**) -... [09:54:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579652 (10Marostegui) [09:54:46] !log jnuche@deploy2002 jnuche and tstarling: Backport for [[gerrit:1006313|In RequestContext::setUser() also reset $this->skinName (T336504)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:55:10] !log jnuche@deploy2002 jnuche and tstarling: Continuing with sync [09:55:16] (03CR) 10Klausman: [C: 03+2] admin_ng/ml-services: raise request maximums for art-desc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006855 (https://phabricator.wikimedia.org/T358467) (owner: 10Klausman) [09:55:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: host reimage [09:57:56] (03Merged) 10jenkins-bot: admin_ng/ml-services: raise request maximums for art-desc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006855 (https://phabricator.wikimedia.org/T358467) (owner: 10Klausman) [09:58:50] (03PS6) 10Jelto: site: apply etherpad role on both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [10:00:47] !log 
klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:01:15] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:01:27] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:01:43] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:01:53] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [10:01:53] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:02:13] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:02:45] (03CR) 10Muehlenhoff: [C: 03+2] Switch db1151 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006858 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:02:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [10:02:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1036.eqiad.wmnet with OS bookworm [10:02:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579672 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1036.eqiad.wmnet with OS bookworm completed: - es1036 (**PASS**) - Rem... 
[10:03:19] !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1006313|In RequestContext::setUser() also reset $this->skinName (T336504)]] (duration: 10m 12s) [10:03:25] T336504: Transcluding Special:Prefixindex can force the default skin - https://phabricator.wikimedia.org/T336504 [10:06:11] !log klausman@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [10:06:56] RECOVERY - Disk space on mw2278 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [10:07:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1151.eqiad.wmnet [10:09:57] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1038.eqiad.wmnet with OS bookworm [10:10:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1038.eqiad.wmnet with OS bookworm executed with errors: - es1038 (**... 
[10:10:03] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - marostegui@cumin1002" [10:10:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1038.eqiad.wmnet with OS bookworm [10:10:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1038.eqiad.wmnet with OS bookworm [10:10:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2173.codfw.wmnet [10:11:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - marostegui@cumin1002" [10:11:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579687 (10Marostegui) [10:11:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1040.eqiad.wmnet with OS bookworm [10:11:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1040.eqiad.wmnet with OS bookworm completed: - es1040 (**PASS**) -... [10:11:28] !log klausman@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[10:13:30] (03PS1) 10Muehlenhoff: Switch db2173 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006861 (https://phabricator.wikimedia.org/T349619) [10:13:40] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1037.eqiad.wmnet with OS bookworm [10:13:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579697 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1037.eqiad.wmnet with OS bookworm executed with errors: - es1037 (**... [10:14:34] (03CR) 10Btullis: [C: 03+1] superset: set root logging level to INFO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006841 (https://phabricator.wikimedia.org/T358510) (owner: 10Brouberol) [10:14:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1037.eqiad.wmnet with OS bookworm [10:15:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579702 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1037.eqiad.wmnet with OS bookworm [10:15:50] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1039.eqiad.wmnet with OS bookworm [10:15:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579715 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1039.eqiad.wmnet with OS bookworm executed with errors: - es1039 (**... 
[10:15:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1039.eqiad.wmnet with OS bookworm [10:16:00] (03CR) 10Muehlenhoff: [C: 03+2] Switch db2173 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006861 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:16:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1039.eqiad.wmnet with OS bookworm [10:17:34] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1038.eqiad.wmnet with OS bookworm [10:17:37] (03CR) 10Hnowlan: [C: 03+1] ferm: Check ferm.service status in ferm_status.py [puppet] - 10https://gerrit.wikimedia.org/r/1005978 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [10:17:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1038.eqiad.wmnet with OS bookworm executed with errors: - es1038 (**... [10:19:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579721 (10Marostegui) @Jclark-ctr es1035, es1036 and es1040 have been installed. No need to touch them anymore. The following errors were found when I tried with the other ones:... 
[10:20:25] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1039.eqiad.wmnet with OS bookworm [10:20:28] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1037.eqiad.wmnet with OS bookworm [10:20:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579722 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1039.eqiad.wmnet with OS bookworm executed with errors: - es1039 (**... [10:20:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9579723 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1037.eqiad.wmnet with OS bookworm executed with errors: - es1037 (**... [10:22:48] (03CR) 10Btullis: [C: 03+1] superset: fix the memcached service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006856 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [10:22:53] (03PS1) 10Kevin Bazira: ml-services: increase article-descriptions CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006866 (https://phabricator.wikimedia.org/T358467) [10:26:01] (03CR) 10Klausman: [C: 03+1] ml-services: increase article-descriptions CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006866 (https://phabricator.wikimedia.org/T358467) (owner: 10Kevin Bazira) [10:28:46] (03CR) 10Btullis: [C: 03+1] "Thanks for doing this." 
[puppet] - 10https://gerrit.wikimedia.org/r/995180 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:30:08] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006866 (https://phabricator.wikimedia.org/T358467) (owner: 10Kevin Bazira) [10:30:46] PROBLEM - Disk space on mw2281 is CRITICAL: DISK CRITICAL - free space: / 1580 MB (1% inode=98%): /tmp 1580 MB (1% inode=98%): /var/tmp 1580 MB (1% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [10:31:18] (03Merged) 10jenkins-bot: ml-services: increase article-descriptions CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006866 (https://phabricator.wikimedia.org/T358467) (owner: 10Kevin Bazira) [10:31:42] (03CR) 10Filippo Giunchedi: "Technically LGTM, though why not silence the alert instead?" [alerts] - 10https://gerrit.wikimedia.org/r/1006857 (owner: 10Slyngshede) [10:34:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2173.codfw.wmnet [10:34:40] (03CR) 10Brouberol: [C: 03+2] superset: fix the memcached service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006856 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [10:35:45] (03PS2) 10Brouberol: superset: set root logging level to INFO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006841 (https://phabricator.wikimedia.org/T358510) [10:40:29] (03CR) 10Brouberol: [C: 03+2] superset: set root logging level to INFO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006841 (https://phabricator.wikimedia.org/T358510) (owner: 10Brouberol) [10:41:16] (03PS2) 10Brouberol: superset: disable rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006521 (https://phabricator.wikimedia.org/T352166) [10:41:53] !log jmm@cumin2002 START - Cookbook 
sre.puppet.migrate-host for host db2185.codfw.wmnet [10:43:44] (03CR) 10Btullis: [C: 03+1] "This seems fine, but could the weird errors be related to the memcached selector issue that we had? Is it worth seeing if we still get the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006521 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [10:43:51] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006840 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:44:22] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T358566#9579799 (10phaultfinder) [10:45:30] (03PS1) 10Muehlenhoff: Switch db2185 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006887 (https://phabricator.wikimedia.org/T349619) [10:48:54] (03CR) 10Muehlenhoff: [C: 03+2] Switch db2185 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006887 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:52:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2185.codfw.wmnet [10:53:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:53:36] (03CR) 10Btullis: "Would it be possible elaborate a little on what changes you think might be required for haproxy to deal more gracefully with the existing " [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [10:54:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2186.codfw.wmnet [10:55:23] (03PS1) 10Muehlenhoff: Switch db2186 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006889 (https://phabricator.wikimedia.org/T349619) [10:56:56] PROBLEM - Disk space on mw2278 is CRITICAL: DISK CRITICAL 
- free space: / 676 MB (0% inode=98%): /tmp 676 MB (0% inode=98%): /var/tmp 676 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [10:57:35] (03PS3) 10Brouberol: superset: assign Superset roles from LDAP groups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006547 (https://phabricator.wikimedia.org/T297120) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T1100) [11:00:10] (03PS4) 10Brouberol: superset: assign Superset roles from LDAP groups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006547 (https://phabricator.wikimedia.org/T297120) [11:08:27] (03PS5) 10Brouberol: superset: assign Superset roles from LDAP groups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006547 (https://phabricator.wikimedia.org/T297120) [11:08:28] !log jynus@cumin1002 dbctl commit (dc=all): 'Depool db2117', diff saved to https://phabricator.wikimedia.org/P57996 and previous config saved to /var/cache/conftool/dbconfig/20240227-110828-jynus.json [11:08:52] (03CR) 10JMeybohm: [C: 04-1] "Needs a chart version bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006606 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [11:09:15] (03PS1) 10Alexandros Kosiaris: Revert "conftool: Add some kubernetes hosts to parsoid" [puppet] - 10https://gerrit.wikimedia.org/r/1006890 (https://phabricator.wikimedia.org/T357392) [11:09:17] (03PS1) 10Alexandros Kosiaris: Remove old restbase hosts hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1006891 [11:09:19] (03PS1) 10Alexandros Kosiaris: services_proxy: Add mw-parsoid in the mesh [puppet] - 10https://gerrit.wikimedia.org/r/1006892 (https://phabricator.wikimedia.org/T357392) [11:09:22] (03PS1) 10Alexandros Kosiaris: Switch restbase1019, restbase2021 to mw-parsoid [puppet] - 
10https://gerrit.wikimedia.org/r/1006893 (https://phabricator.wikimedia.org/T357392) [11:09:23] (03PS1) 10Alexandros Kosiaris: Switch restbase102[01], restbase202[23] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006894 (https://phabricator.wikimedia.org/T357392) [11:09:25] (03PS1) 10Alexandros Kosiaris: Switch restbase102[2345], restbase202[4567] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006895 (https://phabricator.wikimedia.org/T357392) [11:09:27] (03PS1) 10Alexandros Kosiaris: Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) [11:09:29] (03PS1) 10Alexandros Kosiaris: Switch the remaining parsoid hosts to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006897 (https://phabricator.wikimedia.org/T357392) [11:09:31] (03PS1) 10Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392) [11:09:33] (03PS1) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392) [11:09:36] (03PS1) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392) [11:09:53] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool db2117', diff saved to https://phabricator.wikimedia.org/P57997 and previous config saved to /var/cache/conftool/dbconfig/20240227-110952-jynus.json [11:10:33] (03CR) 10Btullis: [C: 03+1] "Nice work!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1006547 (https://phabricator.wikimedia.org/T297120) (owner: 10Brouberol) [11:13:03] (03CR) 10CI reject: [V: 04-1] Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:13:35] (03PS6) 10Brouberol: superset: assign Superset roles from LDAP groups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006547 (https://phabricator.wikimedia.org/T297120) [11:14:32] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9579883 (10akosiaris) [11:16:29] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9537132 (10akosiaris) The LVS traffic approach was doomed to fail, since scap utilizes the same data s... [11:17:50] (03CR) 10Muehlenhoff: [C: 03+2] Switch db2186 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006889 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:19:18] (03CR) 10Majavah: "I still need to test it, but at least according to the docs `on-marked-down shutdown-sessions` should do the trick. I sent https://gerrit." 
[puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [11:21:15] (03CR) 10Brouberol: [C: 03+2] superset: assign Superset roles from LDAP groups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006547 (https://phabricator.wikimedia.org/T297120) (owner: 10Brouberol) [11:22:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2186.codfw.wmnet [11:23:02] (03CR) 10Slyngshede: [C: 03+2] P:IDP New Bookworm IDP servers. [puppet] - 10https://gerrit.wikimedia.org/r/1006840 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [11:23:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [11:23:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [11:24:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [11:24:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [11:26:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "conftool: Add some kubernetes hosts to parsoid" [puppet] - 10https://gerrit.wikimedia.org/r/1006890 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:26:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] services_proxy: Add mw-parsoid in the mesh [puppet] - 10https://gerrit.wikimedia.org/r/1006892 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:27:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove old restbase hosts hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1006891 (owner: 10Alexandros Kosiaris) [11:29:56] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [11:30:46] RECOVERY - Disk space on mw2281 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [11:32:19] (03CR) 10Volans: "Sorry for the random and out of context question:" [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [11:32:50] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26920 bytes in 1.084 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [11:35:06] (03CR) 10Filippo Giunchedi: "Thank you for reaching out -- off the top of my head if the active host is in hiera then we can pass it in to this class and use that vari" [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [11:35:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] ClusterConfig: Add kube-wiki-parsoid test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005723 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:36:04] (03Merged) 10jenkins-bot: ClusterConfig: Add kube-wiki-parsoid test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005723 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:36:32] (03CR) 10Kamila Součková: [C: 03+2] shellbox: bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006007 (owner: 10Kamila Součková) [11:37:11] (03PS4) 10Filippo Giunchedi: thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) [11:37:34] (03CR) 10Filippo Giunchedi: thanos: ship tool to analyze query apache access logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [11:37:36] (03Merged) 10jenkins-bot: shellbox: bump mesh.configuration to 1.7 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1006007 (owner: 10Kamila Součková) [11:37:53] (03CR) 10Majavah: "Given `profile::etherpad::service_ensure` is already a thing, why not use that as `class_parameters` here?" [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [11:38:19] (03CR) 10CI reject: [V: 04-1] thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [11:38:51] (03CR) 10Filippo Giunchedi: "Indeed that's even better, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [11:39:54] (03CR) 10Effie Mouzeli: [C: 03+2] php: add env[MCROUTER_SERVER] variable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:39:56] (03PS5) 10Filippo Giunchedi: thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) [11:42:15] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [11:42:24] (03PS1) 10Brouberol: idp: collapse superset_next and superset_next_k8s into a single service [puppet] - 10https://gerrit.wikimedia.org/r/1006904 (https://phabricator.wikimedia.org/T358569) [11:42:26] (03PS1) 10Brouberol: ATS: redirect superset-next.wikimedia.org traffic to the Kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/1006905 (https://phabricator.wikimedia.org/T358569) [11:43:09] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [11:43:25] (03Abandoned) 10Brouberol: superset: disable rate limiting [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1006521 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [11:44:06] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [11:44:12] (03CR) 10Jelto: "profile::etherpad::service_ensure is set to "stopped" or "running" for the individual etherpad hosts at the moment. I'm not sure how that " [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [11:44:14] (03PS2) 10Brouberol: idp: collapse superset_next and superset_next_k8s into a single service [puppet] - 10https://gerrit.wikimedia.org/r/1006904 (https://phabricator.wikimedia.org/T358569) [11:44:18] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1006904 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [11:44:40] (03PS2) 10Brouberol: ATS: redirect superset-next.wikimedia.org traffic to the Kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/1006905 (https://phabricator.wikimedia.org/T358569) [11:44:46] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1006905 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [11:45:13] (03PS1) 10Slyngshede: P:pki::multirootca::monitoring Collect metrics from intermediate. [puppet] - 10https://gerrit.wikimedia.org/r/1006907 (https://phabricator.wikimedia.org/T350694) [11:46:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host es1025.eqiad.wmnet [11:46:31] (03CR) 10Majavah: "No, I mean the `class_parameters` parameter of the `prometheus::class_config` resource." 
[puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [11:46:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch restbase1019, restbase2021 to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006893 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:47:04] !log akosiaris@deploy2002 Synchronized tests/src/ClusterConfigTest.php: (no justification provided) (duration: 09m 36s) [11:49:06] (03PS2) 10EoghanGaffney: [gitlab] Pause/Prompt before restarting gitlab during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1006562 [11:49:41] (03PS1) 10Muehlenhoff: Switch es1025 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006910 (https://phabricator.wikimedia.org/T349619) [11:49:43] (03CR) 10Btullis: [C: 03+1] idp: collapse superset_next and superset_next_k8s into a single service [puppet] - 10https://gerrit.wikimedia.org/r/1006904 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [11:50:15] (03CR) 10Btullis: [C: 03+1] ATS: redirect superset-next.wikimedia.org traffic to the Kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/1006905 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [11:51:15] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:51:19] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1477/co" [puppet] - 10https://gerrit.wikimedia.org/r/1006907 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:52:15] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [11:52:17] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9580018 (10akosiaris) [11:52:37] 10SRE, 
10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9537132 (10akosiaris) Migration started, we are batch 1 for the next few days. [11:53:38] (03CR) 10CI reject: [V: 04-1] [gitlab] Pause/Prompt before restarting gitlab during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1006562 (owner: 10EoghanGaffney) [11:54:34] (03CR) 10Slyngshede: [V: 03+1] "This does create a lot of systemd timers, but the script doesn't support wildcards." [puppet] - 10https://gerrit.wikimedia.org/r/1006907 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:56:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:56:56] RECOVERY - Disk space on mw2278 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [11:57:31] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp-test1003.wikimedia.org [11:57:32] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [11:58:20] !log Expanding root lv on mw2281,mw2278 by 20G [11:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:30] (03CR) 10FNegri: elasticsearch: move to opensearch client (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [11:59:34] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
idp-test1003.wikimedia.org - slyngshede@cumin1002" [12:00:28] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test1003.wikimedia.org - slyngshede@cumin1002" [12:00:28] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:00:28] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp-test1003.wikimedia.org on all recursors [12:00:31] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp-test1003.wikimedia.org on all recursors [12:00:42] (03CR) 10Stevemunene: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1006904 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [12:01:00] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test1003.wikimedia.org - slyngshede@cumin1002" [12:01:26] !nowandnext [12:01:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch es1025 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1006910 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:01:51] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test1003.wikimedia.org - slyngshede@cumin1002" [12:02:11] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1003.wikimedia.org with OS bookworm [12:02:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove debmonitor1002/2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1006531 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [12:03:05] (03PS2) 10Muehlenhoff: Remove debmonitor1002/2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1006531 (https://phabricator.wikimedia.org/T241049) [12:04:09] !log 
kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [12:04:27] (03CR) 10Volans: [C: 03+1] elasticsearch: move to opensearch client (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [12:05:11] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [12:05:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host es1025.eqiad.wmnet [12:08:26] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [12:08:52] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [12:08:58] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [12:09:18] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [12:09:24] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:09:57] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:10:04] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [12:10:28] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [12:11:12] (03PS3) 10Stevemunene: superset: add availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484) [12:11:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [12:11:46] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1003.wikimedia.org with reason: host reimage [12:13:40] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [12:14:18] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [12:14:24] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [12:14:35] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1003.wikimedia.org with reason: host reimage [12:15:28] (03CR) 10Muehlenhoff: [C: 03+2] udp2log::instance: Use Stdlib::Port for the port [puppet] - 10https://gerrit.wikimedia.org/r/1006482 (owner: 10Muehlenhoff) [12:15:39] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [12:15:45] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:16:27] (03PS3) 10Muehlenhoff: Remove debmonitor1002/2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1006531 (https://phabricator.wikimedia.org/T241049) [12:17:00] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:17:06] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [12:18:18] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [12:18:44] (03PS1) 10Muehlenhoff: udp2log::instance: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1006915 [12:20:07] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove debmonitor1002/2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1006531 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [12:20:15] (AppserversUnreachable) firing: 
Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:20:47] (03PS3) 10EoghanGaffney: [gitlab] Pause/Prompt before restarting gitlab during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1006562 [12:20:56] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [12:21:10] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1006492 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:21:48] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [12:21:54] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [12:22:20] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [12:22:26] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:23:00] (03CR) 10Btullis: [C: 03+1] wikireplicas: maintain-views: try depooling host on lock failure [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:23:06] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:23:12] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [12:23:54] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [12:25:06] (03CR) 10CI reject: [V: 04-1] [gitlab] Pause/Prompt before restarting gitlab during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1006562 (owner: 10EoghanGaffney) [12:28:46] !log installing perl security updates on bullseye [12:28:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:52] (03CR) 10Btullis: [C: 03+1] "Thanks for the suggestion. At the moment, the script is already /called/ by a cookbook with some specific options here: https://github.com" [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:29:45] !log cgoubert@cumin2002 conftool action : set/weight=20; selector: name=mw2281.codfw.wmnet,cluster=videoscaler,dc=codfw [12:30:12] !log cgoubert@cumin2002 conftool action : set/pooled=no; selector: name=mw2281.codfw.wmnet,cluster=videoscaler,dc=codfw [12:30:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:30:26] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=mw2281.codfw.wmnet,cluster=videoscaler,dc=codfw [12:31:53] !log Lowered weight and restarted apache on mw2281.codfw.wmnet [12:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:39:27] !log rebalancing videoscaler cluster: all E5-2650 to weight 25 [12:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:49] !log cgoubert@cumin2002 conftool action : set/weight=25; selector: cluster=videoscaler,dc=codfw,name=mw22(59|63|64|65|66|78|79|81).* [12:40:07] !log cgoubert@cumin2002 conftool action : set/weight=25; selector: 
cluster=jobrunner,dc=codfw,name=mw22(59|63|64|65|66|78|79|81).* [12:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [12:42:28] !log restarting apache2 on mw2281 [12:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:04] (03PS1) 10Muehlenhoff: profile::base: Allow running without cron installed [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) [12:48:55] !log restarting apache2 on mw2278 [12:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [12:53:08] (03PS4) 10EoghanGaffney: [gitlab] Pause/Prompt before restarting gitlab during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1006562 [12:54:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T1300) [13:00:41] Citoid is essentially non responsive. All the requests are timing out. [13:03:00] effie, moritzm: ideas? [13:09:07] Weirdly it doesn't seem to be showing up in grafana though ... [13:10:40] https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-30m&to=now is it possible we aren't tracking timeouts ? 
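Note on the timeout question at 13:10:40: a request that never completes produces no HTTP status, so a dashboard built only on status codes can stay flat while clients time out. The following is a minimal Python sketch of a client-side probe that counts timeouts as their own outcome. The names `classify_probe`, `http_fetch` and `tally` are hypothetical and not part of any Wikimedia tooling; how the citoid dashboard is actually fed is not stated in this log.

```python
import socket
import urllib.error
import urllib.request
from collections import Counter


def classify_probe(fetch):
    """Run one probe callable and bucket its outcome.

    `fetch` is any zero-argument callable that returns an HTTP status code
    or raises; this indirection is an assumption made so the classifier can
    be exercised without network access.
    """
    try:
        status = fetch()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        status = exc.code
    except (socket.timeout, TimeoutError):
        # Timeout raised while reading the response body.
        return "timeout"
    except urllib.error.URLError as exc:
        # Timeout while connecting is wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    if 200 <= status < 400:
        return "ok"
    if status >= 500:
        return "http_5xx"
    return "http_other"


def http_fetch(url, timeout=10.0):
    """Build a fetch callable for a real URL (hypothetical helper)."""
    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            return resp.status
    return fetch


def tally(fetchers):
    """Count outcomes across many probes, keeping timeouts separate from 5xx."""
    return Counter(classify_probe(f) for f in fetchers)
```

A caller would pass something like `tally([http_fetch(url) for url in urls])` and export the `timeout` bucket as its own metric next to `http_5xx`.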
[13:12:15] !log cmooney@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [13:12:17] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [13:13:41] (ConfdResourceFailed) firing: confd resource _etc_haproxy_conf.d_wiki-replica-backends.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:14:13] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2001.codfw.wmnet - cmooney@cumin1002" [13:14:50] !log taavi@cumin1002 conftool action : set/pooled=inactive; selector: name=clouddb1018.eqiad.wmnet,service=s2 [13:15:47] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2 [13:16:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2001.codfw.wmnet - cmooney@cumin1002" [13:16:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:16:09] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2001.codfw.wmnet on all recursors [13:16:12] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2001.codfw.wmnet on all recursors [13:16:31] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2001.codfw.wmnet - cmooney@cumin1002" [13:17:11] (03CR) 10Vivian Rook: [C: 03+1] Remove unused PAWS classes [puppet] - 10https://gerrit.wikimedia.org/r/1006852 (owner: 10Majavah) [13:17:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
testvm2001.codfw.wmnet - cmooney@cumin1002" [13:18:41] (ConfdResourceFailed) resolved: confd resource _etc_haproxy_conf.d_wiki-replica-backends.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:19:00] !log remove unused 208.80.154.143/32 - 208.80.153.47/32 - 208.80.153.50/32 from Netbox [13:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:26] mvolz: I am sorry, my laptop was in the other room [13:19:43] can you please be more specific? [13:20:45] I'm logged into the deployment server right now and requests to both zotero and citoid staging servers are timing out too [13:21:04] i.e. curl -d '9791029801297' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/search [13:21:14] for zotero [13:21:58] https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/10.1038%2Fs41586-021-03470-x is an example of a request that times out [13:22:28] (03PS1) 10Majavah: cloudlb: fix default-server spelling [puppet] - 10https://gerrit.wikimedia.org/r/1006921 [13:23:23] on logstash I see 2 things, one is No Zotero response available from request for https: [13:23:28] [13:23:52] and a RangeError: Maximum call stack size exceeded [13:23:57] yeah that doesn't seem super relevant [13:24:01] (03CR) 10Majavah: [C: 03+2] cloudlb: fix default-server spelling [puppet] - 10https://gerrit.wikimedia.org/r/1006921 (owner: 10Majavah) [13:24:24] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:pki::multirootca::monitoring Collect metrics from intermediate. [puppet] - 10https://gerrit.wikimedia.org/r/1006907 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:24:49] effie: those are expected/normal errors, zotero can't interpret pdf. [13:25:40] ok let me keep digging then [13:28:17] I did a shellbox deployment about an hour ago, could it be related? 
[13:29:03] 10SRE, 10SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9580276 (10MoritzMuehlenhoff) I had a look and what is failing is the naggen2 calls in the icinga::naggen class in L12 and L21. These were moved to the Pu... [13:29:07] Probably not, I got a user report at like 7 UTC [13:29:16] it should not be related to my understanding [13:29:54] ok, just wanted to make sure, since the shellbox dashboards don't give me a ton of confidence [13:30:30] mvolz: it seems like something started going south at around 6:30 UTC [13:31:25] and I only see a cxserver deployment before that [13:32:33] ah, how can you tell? the 500 signal on grafana isn't affected [13:34:57] https://grafana.wikimedia.org/d/F7rttgqmz/cxserver?orgId=1&refresh=30s&from=now-12h&to=now&viewPanel=15 [13:35:02] 500s [13:36:02] ah, cool, I was looking at the citoid specific logging. maybe because the requests aren't completing it can't actually log those since the logging is coming *from* the service? Dunno. [13:37:05] 10SRE, 10LDAP-Access-Requests: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584#9580308 (10WMDE-leszek) [13:38:24] so maybe cxserver has better logging and the same issue is affecting it too? [13:38:46] (03PS1) 10Brouberol: superset: serve the requestctl-geneerator static page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006923 (https://phabricator.wikimedia.org/T356490) [13:42:01] (03PS1) 10Muehlenhoff: Set acmechief_host for idp-test[12]003 [puppet] - 10https://gerrit.wikimedia.org/r/1006924 (https://phabricator.wikimedia.org/T357748) [13:42:15] I am taking a look at cxserver too [13:42:55] 10SRE, 10SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9580343 (10Volans) Is the 400 because of a missing cert? 
From a cumin host I get: ` $ curl -G "https://puppetdb1003.eqiad.wmnet/pdb/query/v4/resources/Nag... [13:44:20] 10SRE, 10LDAP-Access-Requests: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584#9580350 (10WMDE-leszek) I approve the request on WMDE's behalf. While the account has been around for a while it seems we have failed to request account holder to sign the NDA with the WMF. He... [13:45:04] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006924 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [13:46:07] (03CR) 10Brouberol: [C: 03+2] idp: collapse superset_next and superset_next_k8s into a single service [puppet] - 10https://gerrit.wikimedia.org/r/1006904 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [13:46:18] (03CR) 10Brouberol: [C: 03+2] ATS: redirect superset-next.wikimedia.org traffic to the Kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/1006905 (https://phabricator.wikimedia.org/T358569) (owner: 10Brouberol) [13:47:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:48:54] denisse: FYI ^^^ [13:49:36] On it, thank you. 
[13:50:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2001.codfw.wmnet with OS bookworm [13:50:29] mvolz: those are the only two errors I see generated by the app, plus some ,"message":"worker died, restarting [13:51:23] (03CR) 10Muehlenhoff: [C: 03+2] Set acmechief_host for idp-test[12]003 [puppet] - 10https://gerrit.wikimedia.org/r/1006924 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [13:51:51] mvolz: I also see on zotero some FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory [13:52:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:53:29] (ProbeDown) firing: (2) Service urldownloader1003:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:52] mvolz: but also, both citoid and zotero seem to respond to requests [13:54:10] (03CR) 10Brouberol: [C: 03+1] "Looks good!" [alerts] - 10https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484) (owner: 10Stevemunene) [13:56:52] effie: aha. it's only one server. 
codfw is fine [13:56:56] eqiad is not [13:57:54] try: curl -k --header 'Accept: application/json; charset=utf-8' 'https://citoid.svc.codfw.wmnet:4003/api?format=mediawiki&search=979%201029801297' vs curl -k --header 'Accept: application/json; charset=utf-8' 'https://citoid.svc.eqiad.wmnet:4003/api?format=mediawiki&search=9791029801297' [13:58:01] yes eqiad is the one, but they seem to be getting traffic still [13:58:26] what continent are you on? :P [13:58:29] the graph I posted before with the elevated 500s, was the eqiad one [13:58:34] yeah [13:58:56] mvolz: europe, but I was referring to the application logs directly as I saw them from k8s :) [13:59:00] :) [13:59:11] which didn't help much may I add [13:59:36] is this maybe some sort of isp thing, that affects me and the user report? [13:59:45] the Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory error for zotero is normal? [14:00:01] mvolz: that wouldn't generate 500s on our end I reckon [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:06] we basically do not log zotero at all [14:00:08] so something is definitely up [14:00:11] yeah i guess not [14:00:18] * Lucas_WMDE can’t deploy [14:00:25] so basically no idea what's normal for zotero, it's unfortunately a disaster. 
[14:00:39] lol, it does have a reputation [14:01:14] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1006925 [14:02:28] mvolz: we could do a shot in the dark, and just restart things on eqiad for starters [14:02:41] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2001.codfw.wmnet with reason: host reimage [14:02:52] sounds good to me [14:03:06] (03PS1) 10Slyngshede: C:tomcat Allow users to specify which version of Tomcat to install. [puppet] - 10https://gerrit.wikimedia.org/r/1006926 (https://phabricator.wikimedia.org/T357748) [14:04:01] PROBLEM - Disk space on mw2279 is CRITICAL: DISK CRITICAL - free space: / 3735 MB (3% inode=98%): /tmp 3735 MB (3% inode=98%): /var/tmp 3735 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2279&var-datasource=codfw+prometheus/ops [14:04:24] (03CR) 10Btullis: [C: 03+1] "Great, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006923 (https://phabricator.wikimedia.org/T356490) (owner: 10Brouberol) [14:05:07] (03CR) 10Brouberol: [C: 03+2] superset: serve the requestctl-geneerator static page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006923 (https://phabricator.wikimedia.org/T356490) (owner: 10Brouberol) [14:05:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2001.codfw.wmnet with reason: host reimage [14:05:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9580397 (10klausman) [14:06:32] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: sync [14:06:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9537731 (10klausman) I've updated the partman lines. 
I will update `modules/profile/data/profile/installserver/preseed.yaml` to include the new host in a moment, so standa... [14:06:50] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: sync [14:06:51] (03CR) 10CI reject: [V: 04-1] C:tomcat Allow users to specify which version of Tomcat to install. [puppet] - 10https://gerrit.wikimedia.org/r/1006926 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [14:07:28] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:07:36] !log force restarted all zotero pods in eqiad [14:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [14:07:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [14:08:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [14:08:44] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: sync [14:08:59] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: sync [14:09:07] !log force restarted all citoid pods in eqiad [14:09:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [14:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [14:09:51] (03PS1) 10Muehlenhoff: acmechief: Add idp-test[12]003 [puppet] - 10https://gerrit.wikimedia.org/r/1006928 (https://phabricator.wikimedia.org/T357748) [14:10:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [14:10:54] (03PS1) 10Klausman: partman/preseed: 
Add ml-staging2003 to standard LW worker recipe [puppet] - 10https://gerrit.wikimedia.org/r/1006927 (https://phabricator.wikimedia.org/T357415) [14:10:56] (03CR) 10Klausman: "Adding btullis as a reviewer since I couldn't think of anyone else." [puppet] - 10https://gerrit.wikimedia.org/r/1006927 (https://phabricator.wikimedia.org/T357415) (owner: 10Klausman) [14:11:06] !log pyrra upgraded to 0.7.4-2 T351111 [14:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:13] T351111: Add footer including privacy policy to slo.wikimedia.org (pyrra) - https://phabricator.wikimedia.org/T351111 [14:12:49] (03PS2) 10Slyngshede: C:tomcat Allow users to specify which version of Tomcat to install. [puppet] - 10https://gerrit.wikimedia.org/r/1006926 (https://phabricator.wikimedia.org/T357748) [14:13:31] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1006925 (owner: 10Volans) [14:14:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9580435 (10klausman) [14:17:41] (03CR) 10Jelto: "A thanks for the clarification. But I still don't get how class_parameters could be used here. 
So service_ensure can be set in class_param" [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:17:43] (03PS2) 10Jelto: prometheus::ops: monitor active etherpad instance only [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) [14:18:10] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:18:40] doesn't seem to have helped :/ [14:19:11] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2001.codfw.wmnet with OS bookworm [14:19:11] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2001.codfw.wmnet [14:19:43] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1006925 (owner: 10Volans) [14:20:19] (03PS1) 10Ssingh: depool codfw: emergency depool patch (do not merge unless required) [dns] - 10https://gerrit.wikimedia.org/r/1006929 [14:20:21] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2004 - cmooney@cumin1002" [14:20:54] (03CR) 10Fabfur: [C: 03+2] codfw lvs::balancer: Switch config_host to conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) (owner: 10Clément Goubert) [14:21:37] (03PS1) 10Volans: Upstream release v8.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1006930 [14:21:48] (03PS3) 10Jelto: prometheus::ops: monitor active etherpad instance only [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) [14:22:06] mvolz: not at all [14:22:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2004 - cmooney@cumin1002" [14:22:06] 
!log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:23:34] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [14:23:37] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [14:24:15] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 2 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9580452 (10akosiaris) [14:24:22] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: sync [14:24:30] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 2 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9580465 (10akosiaris) p:05Triage→03High [14:24:36] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: sync [14:25:05] (03PS4) 10Ayounsi: Update brion to bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) [14:26:06] 10SRE, 10ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9580471 (10Volans) @Jhancock.wm @wiki_willy few considerations here: * Netbox has space only for one asset tag, if we swap the MB and the asset tag change I think we should update Netbox accordingl... 
[14:26:41] (03CR) 10CI reject: [V: 04-1] Update brion to bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [14:27:01] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:27:08] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [14:27:08] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1478/" [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:27:24] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2004.wikimedia.org on all recursors [14:27:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2004.wikimedia.org on all recursors [14:27:50] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [14:28:13] oh hmm [14:28:16] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:28:46] (03PS5) 10Ayounsi: Update brion to bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) [14:28:51] !log depool citoid eqiad [14:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:02] !log jiji@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=citoid,name=eqiad [14:29:04] PROBLEM - PyBal connections to etcd on lvs2011 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:29:20] PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=99) https://wikitech.wikimedia.org/wiki/PyBal [14:29:32] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections 
established with conf2006.codfw.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [14:29:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:29:48] mvolz: I depooled eqiad until we further investigate [14:30:32] (03CR) 10Volans: [C: 03+2] Upstream release v8.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1006930 (owner: 10Volans) [14:31:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:31:26] sukhe: cgoubert@conf2006:~$ sudo iptables-save | grep 10.192.0.29 [14:31:28] cgoubert@conf2006:~$ [14:31:47] Unless we have the confd port open, haven't checked yet [14:31:59] !log restarting pybal on lvs2014,lvs2011,lvs2012 and lvs2013 for T355544 [14:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:06] T355544: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 [14:32:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [14:32:44] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:32:50] RECOVERY - PyBal connections to etcd on lvs2012 is OK: OK: 6 connections established with conf2006.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [14:32:55] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2004.codfw.wmnet on all 
recursors [14:32:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2004.codfw.wmnet on all recursors [14:33:01] ah no it just needed a pybal restart [14:33:06] should be recovering yeah [14:33:08] fabfur is on it [14:33:41] yep [14:33:51] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [14:33:56] 10SRE, 10Infrastructure-Foundations, 10netops: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9580482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2004.codfw.wmnet with... [14:34:02] RECOVERY - PyBal connections to etcd on lvs2011 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:34:20] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 99 connections established with conf2006.codfw.wmnet:4001 (min=99) https://wikitech.wikimedia.org/wiki/PyBal [14:34:32] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 81 connections established with conf2006.codfw.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [14:34:38] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 2 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9580484 (10ssastry) It doesn't show up in production because the logging... 
[14:35:16] (03CR) 10Ayounsi: [C: 03+2] Update brion to bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [14:35:41] !log Adding 20G to root lv on mw2279 [14:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:03] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:36:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1035.eqiad.wmnet with OS bookworm [14:36:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1035.eqiad.wmnet with OS bookworm completed: - es1035 (**WARN**) - Dow... [14:36:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:37:28] (03Merged) 10jenkins-bot: Upstream release v8.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1006930 (owner: 10Volans) [14:38:02] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:07] !log depooling mw2325.codfw.wmnet,mw2326.codfw.wmnet,mw2327.codfw.wmnet,mw2328.codfw.wmnet,mw2329.codfw.wmnet,mw2330.codfw.wmnet,mw2331.codfw.wmnet,mw2332.codfw.wmnet,mw2333.codfw.wmnet,mw2334.codfw.wmnet for T355544 [14:39:12] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:13] T355544: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 [14:39:40] (03CR) 10Muehlenhoff: "We only use the Tomcat Puppet module for the IDPs, I think changing it alongside for the update with some temporary OS conditionals is fin" [puppet] - 10https://gerrit.wikimedia.org/r/1006926 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [14:39:55] ah crap, not b6 today, b3 [14:40:03] lucky I did not actually depool them :) [14:41:06] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [14:41:35] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [14:41:40] !log uploaded spicerack_8.4.0 to apt.wikimedia.org bullseye-wikimedia [14:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:45] (WidespreadPuppetFailure) firing: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:44:02] RECOVERY - Disk space on mw2279 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2279&var-datasource=codfw+prometheus/ops [14:44:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1037.eqiad.wmnet with OS bookworm [14:44:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1037.eqiad.wmnet with OS bookworm [14:44:41] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab] Pause/Prompt before restarting gitlab during upgrade (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/1006562 (owner: 10EoghanGaffney) [14:45:26] (03PS1) 10Ayounsi: set bvibber gid to 500 [puppet] - 10https://gerrit.wikimedia.org/r/1006935 (https://phabricator.wikimedia.org/T358044) [14:45:36] 10SRE, 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T358566#9580571 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm Looks like I accidentally disconnected it while prepping for the switch migration. reconnected and tested. sshable [14:45:50] !log disregard previous depooling message for T355544 [14:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:56] T355544: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 [14:46:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006935 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [14:47:05] !log Depooling mw2324.codfw.wmnet,mw2323.codfw.wmnet,mw2259.codfw.wmnet,mw2261.codfw.wmnet,mw2262.codfw.wmnet,mw2263.codfw.wmnet,mw2264.codfw.wmnet,mw2265.codfw.wmnet,mw2266.codfw.wmnet,mw2268.codfw.wmnet,mw2269.codfw.wmnet,mw2270.codfw.wmnet,mw2314.codfw.wmnet,mw2315.codfw.wmnet,mw2316.codfw.wmnet,mw2320.codfw.wmnet,mw2321.codfw.wmnet,mw2322.codfw.wmnet for T355870 [14:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:12] (03CR) 10Ayounsi: [C: 03+2] set bvibber gid to 500 [puppet] - 10https://gerrit.wikimedia.org/r/1006935 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [14:47:13] T355870: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 [14:48:45] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:49:30] (03Merged) 
10jenkins-bot: [gitlab] Pause/Prompt before restarting gitlab during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1006562 (owner: 10EoghanGaffney) [14:50:08] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest2004.codfw.wmnet with OS bookworm [14:50:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1038.eqiad.wmnet with OS bookworm [14:50:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580618 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1038.eqiad.wmnet with OS bookworm [14:51:20] (03CR) 10CDanis: P:pki::multirootca::monitoring Collect metrics from intermediate. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006907 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:51:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moscovium.eqiad.wmnet [14:52:23] !log cmooney@cumin1002 START - Cookbook sre.hosts.decommission for hosts sretest2004.codfw.wmnet [14:52:44] !log Drainining mw2260.codfw.wmnet mw2267.codfw.wmnet mw2310.codfw.wmnet mw2311.codfw.wmnet mw2312.codfw.wmnet mw2313.codfw.wmnet mw2317.codfw.wmnet mw2318.codfw.wmnet mw2319.codfw.wmnet kubernetes2030.codfw.wmnet kubernetes2029.codfw.wmnet kubernetes2057.codfw.wmnet for T355870 [14:52:46] PROBLEM - statsv process on webperf2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [14:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:51] T355870: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 [14:53:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:30] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9580655 (10cmooney) >>! In T345803#9479281, @Papaul wrote: > @cmooney can we get those 2 hosts back in decom? Thanks I'm done with sretes... [14:53:46] RECOVERY - statsv process on webperf2003 is OK: PROCS OK: 2 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [14:55:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moscovium.eqiad.wmnet [14:56:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1039.eqiad.wmnet with OS bookworm [14:56:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580683 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1039.eqiad.wmnet with OS bookworm [14:57:04] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:57:26] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1037.eqiad.wmnet with reason: host reimage [14:58:02] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:01] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cmooney@cumin1002" [14:59:33] topranks: serviceops nodes depooled or drained, good to go from our end [14:59:59] 10SRE, 10SRE-Access-Requests, 
10LDAP-Access-Requests, 10Phabricator, 10Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9580692 (10ayounsi) 05Open→03Resolved Puppet and LDAP updated. > I'll just need the contents on ~brion in mwmaint2002.codfw.wmnet... [15:00:20] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cmooney@cumin1002" [15:00:20] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:00:20] claime: super that's great :) [15:00:21] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts sretest2004.codfw.wmnet [15:00:26] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9580694 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1002 for hosts: `sretest2004.codfw.wmnet` - sretes... 
[15:00:27] I think fabfur is done with moving over pybal from conf2004 as well [15:01:12] hi claime, yeah our activity is completed [15:02:06] awesome ty <3 [15:02:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1037.eqiad.wmnet with reason: host reimage [15:03:41] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1038.eqiad.wmnet with reason: host reimage [15:06:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1038.eqiad.wmnet with reason: host reimage [15:08:38] !log volans@cumin1002 START - Cookbook sre.dns.netbox [15:10:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host es1039.eqiad.wmnet with OS bookworm [15:10:43] !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Deleted AAAA records from new DBs - volans@cumin1002" [15:11:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1039.eqiad.wmnet with OS bookworm [15:11:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580719 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1039.eqiad.wmnet with OS bookworm [15:11:35] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Deleted AAAA records from new DBs - volans@cumin1002" [15:11:35] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:12:08] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9580720 (10cmooney) [15:13:14] !log cmooney@cumin1002 START - Cookbook sre.hosts.decommission for hosts testvm2001.codfw.wmnet [15:13:45] 
(WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:16:11] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2196.codfw.wmnet on all recursors [15:16:15] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2196.codfw.wmnet on all recursors [15:16:17] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2197.codfw.wmnet on all recursors [15:16:20] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2197.codfw.wmnet on all recursors [15:16:22] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2198.codfw.wmnet on all recursors [15:16:25] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2198.codfw.wmnet on all recursors [15:16:27] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2199.codfw.wmnet on all recursors [15:16:28] !log Cleaning up old tmp media files on codfw jobrunners [15:16:29] sorry for the spam ;) [15:16:30] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2199.codfw.wmnet on all recursors [15:16:32] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2200.codfw.wmnet on all recursors [15:16:35] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2200.codfw.wmnet on all recursors [15:16:37] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2201.codfw.wmnet on all recursors [15:16:40] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2201.codfw.wmnet on all recursors [15:16:43] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2202.codfw.wmnet on all recursors [15:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:46] !log volans@cumin1002 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) db2202.codfw.wmnet on all recursors [15:16:48] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2203.codfw.wmnet on all recursors [15:16:51] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2203.codfw.wmnet on all recursors [15:16:53] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2204.codfw.wmnet on all recursors [15:16:56] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2204.codfw.wmnet on all recursors [15:16:58] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2205.codfw.wmnet on all recursors [15:17:01] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2205.codfw.wmnet on all recursors [15:17:03] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2206.codfw.wmnet on all recursors [15:17:06] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2206.codfw.wmnet on all recursors [15:17:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:17:09] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2207.codfw.wmnet on all recursors [15:17:12] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2207.codfw.wmnet on all recursors [15:17:14] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2208.codfw.wmnet on all recursors [15:17:17] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2208.codfw.wmnet on all recursors [15:17:19] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2209.codfw.wmnet on all recursors [15:17:22] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2209.codfw.wmnet on all recursors [15:17:24] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2210.codfw.wmnet on all recursors [15:17:27] !log volans@cumin1002 END 
(PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2210.codfw.wmnet on all recursors [15:17:29] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2211.codfw.wmnet on all recursors [15:17:32] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2211.codfw.wmnet on all recursors [15:17:34] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2212.codfw.wmnet on all recursors [15:17:37] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2212.codfw.wmnet on all recursors [15:17:39] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2213.codfw.wmnet on all recursors [15:17:43] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2213.codfw.wmnet on all recursors [15:17:43] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9580727 (10Jhancock.wm) a:03Jhancock.wm [15:17:45] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2214.codfw.wmnet on all recursors [15:17:48] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2214.codfw.wmnet on all recursors [15:17:50] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2215.codfw.wmnet on all recursors [15:17:50] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:17:51] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9580725 (10Jhancock.wm) jiggled the handle. 
checking back later [15:17:53] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2215.codfw.wmnet on all recursors [15:17:55] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2216.codfw.wmnet on all recursors [15:17:58] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2216.codfw.wmnet on all recursors [15:18:00] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2217.codfw.wmnet on all recursors [15:18:03] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2217.codfw.wmnet on all recursors [15:18:05] you'll get lgmsgbot banned :p [15:18:05] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2218.codfw.wmnet on all recursors [15:18:09] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2218.codfw.wmnet on all recursors [15:18:11] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2219.codfw.wmnet on all recursors [15:18:11] possibly [15:18:14] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2219.codfw.wmnet on all recursors [15:18:16] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache db2220.codfw.wmnet on all recursors [15:18:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:18:19] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2220.codfw.wmnet on all recursors [15:18:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1037.eqiad.wmnet with OS bookworm [15:18:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580730 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1037.eqiad.wmnet with OS bookworm completed: - es1037 (**PASS**) - 
Rem... [15:18:45] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:20:01] 5 more and I'm done, sorry [15:20:04] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache es1035.eqiad.wmnet on all recursors [15:20:07] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es1035.eqiad.wmnet on all recursors [15:20:08] the price of cleaning up things [15:20:10] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache es1036.eqiad.wmnet on all recursors [15:20:13] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es1036.eqiad.wmnet on all recursors [15:20:14] tsk [15:20:15] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache es1037.eqiad.wmnet on all recursors [15:20:18] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es1037.eqiad.wmnet on all recursors [15:20:20] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache es1038.eqiad.wmnet on all recursors [15:20:23] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es1038.eqiad.wmnet on all recursors [15:20:26] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache es1039.eqiad.wmnet on all recursors [15:20:29] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es1039.eqiad.wmnet on all recursors [15:20:31] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache es1040.eqiad.wmnet on all recursors [15:20:34] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es1040.eqiad.wmnet on all recursors [15:21:00] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:21:28] (03PS1) 10Brouberol: idp_test: remove superset-next 
IDP service [puppet] - 10https://gerrit.wikimedia.org/r/1006942 (https://phabricator.wikimedia.org/T358570) [15:21:33] !log Extending vg-root on remaining small disk codfw jobrunners [15:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:01] !log copy prometheus-mcrouter-exporter from bullseye-wikimedia to bookworm-wikimedia T357748 [15:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:07] T357748: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748 [15:22:15] I'm done claime, sorry again ;) [15:22:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:22:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1038.eqiad.wmnet with OS bookworm [15:22:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580750 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1038.eqiad.wmnet with OS bookworm completed: - es1038 (**PASS**) - Rem... 
[15:22:44] heh, I don't really care, I don't want the poor bot to burnout :P [15:22:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1006942 (https://phabricator.wikimedia.org/T358570) (owner: 10Brouberol) [15:23:37] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cmooney@cumin1002" [15:24:24] (03CR) 10Brouberol: [C: 03+2] idp_test: remove superset-next IDP service [puppet] - 10https://gerrit.wikimedia.org/r/1006942 (https://phabricator.wikimedia.org/T358570) (owner: 10Brouberol) [15:24:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cmooney@cumin1002" [15:24:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:24:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2001.codfw.wmnet [15:24:37] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9580752 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1002 for hosts: `testvm2001.codfw.wmnet` - testvm2001.codf... [15:26:04] effie, mvolz I saw you mentioned cxserver earlier? did you figure out anything? It seems it's unable to make external connections, but could be something else. [15:27:58] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1006927 (https://phabricator.wikimedia.org/T357415) (owner: 10Klausman) [15:30:43] (03PS1) 10Kamila Součková: shellbox: fix missing annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006943 [15:35:19] (03CR) 10Dzahn: "gotcha, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1006844 (https://phabricator.wikimedia.org/T353298) (owner: 10Filippo Giunchedi) [15:35:22] (03PS1) 10Muehlenhoff: idp::build: Remove explicit dependency on openjdk-11-jdk-headless [puppet] - 10https://gerrit.wikimedia.org/r/1006945 (https://phabricator.wikimedia.org/T357748) [15:37:16] Nikerabbit: not as far as I know. effie depooled eqiad, so citoid is functioning by running just off codfw, but presumably that doesn't help cxserver [15:37:45] Nikerabbit: I was just writing a task, it seems that urldownloader1003 is down [15:39:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on db[2108,2123].codfw.wmnet,es2021.codfw.wmnet with reason: Silence for network maintenance T355870 [15:39:34] T355870: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 [15:39:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on db[2108,2123].codfw.wmnet,es2021.codfw.wmnet with reason: Silence for network maintenance T355870 [15:39:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355870 - depooling es2021 db2108 db2123', diff saved to https://phabricator.wikimedia.org/P57999 and previous config saved to /var/cache/conftool/dbconfig/20240227-153951-arnaudb.json [15:41:26] !log configuring lsw1-b3-codfw in advance of server migration T355870 [15:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:46] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 53268 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [15:45:52] !log reboot urldownloader1003 - T358597 [15:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:58] T358597: urldownloader1003's network is unresponsive - https://phabricator.wikimedia.org/T358597 [15:46:04] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9580810 (10Lucas_Werkmeister_WMDE) [15:46:34] !log jiji@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader1003.wikimedia.org [15:47:52] RECOVERY - Host urldownloader1003 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [15:49:02] effie: urldownloader can impact like, https://phabricator.wikimedia.org/T358595 ? Cause external services have more 500s starting from midnight today. [15:49:23] (03PS2) 10Clément Goubert: mathoid: Upgrade all vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006950 [15:50:27] kart_: yes, can you please check that things have improved now? [15:51:07] !log jiji@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader1003.wikimedia.org [15:52:44] effie: Thanks. Checking. 
[15:52:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:53:02] (ProbeDown) resolved: (2) Service urldownloader1003:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:13] kart_: mvolz: I highly suggest you add a check to your applications and dashboards that connectivity with external services is working alright [15:53:59] while this time it was our end, in the future, problems with external services can lead us on a goose chase [15:55:05] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b3-codfw.mgmt with reason: prepping for server uplink migration codfw rack b3 [15:55:05] kart_, effie: https://phabricator.wikimedia.org/T358597 too as a possible impact [15:55:15] Wrong copy [15:55:32] https://phabricator.wikimedia.org/T358595 [15:55:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b3-codfw.mgmt with reason: prepping for server uplink migration codfw rack b3 [15:55:39] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9580837 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94e7352f-26c7-48ff-b2c5-61b1faed7b5a) set by cmooney@cumin1002 fo... 
[15:55:43] Ye that's what kart_ linked [15:56:12] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 36 hosts with reason: Migrating servers in codfw rack B3 to lsw1-b3-codfw [15:56:46] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 36 hosts with reason: Migrating servers in codfw rack B3 to lsw1-b3-codfw [15:56:52] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9580845 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4a16f229-e545-4883-81ab-3b2ddd2d7636) set by cmooney@cumin1002 fo... [15:57:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2196.codfw.wmnet with OS bookworm [15:57:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9580850 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2196.codfw.wmnet with OS bookworm [15:57:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2196.codfw.wmnet with OS bookworm [15:57:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9580851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2196.codfw.wmnet with OS bookworm executed with errors: - db2196 (**... [15:58:16] effie: agree. Are there any examples other services dashboard for similar setup? 
[15:58:31] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9580852 (10Dzahn) 05In progress→03Stalled a:05Dzahn→03None [15:58:36] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9580855 (10phaultfinder) [15:58:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2196.codfw.wmnet with OS bookworm [15:58:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9580856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2196.codfw.wmnet with OS bookworm [16:00:04] eoghan, jelto, and arnoldokoth: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T1600). [16:00:32] effie: sorry it was hard to debug. 
we're essentially a webscraper so connecting to external services is basically what we do, but the external services are arbitrary [16:01:03] mvolz: something similar to https://grafana.wikimedia.org/d/F7rttgqmz/cxserver?orgId=1&refresh=30s&from=now-2d&to=now&viewPanel=43 [16:01:07] i did as a first pass check that crossref was up because we do connect to them a lot, but unfortunately the connection is via Zotero so logging that might be hairy :/ [16:01:23] effie: cool, thanks [16:01:32] no need to log connections rather than ensure that connectivity is achievable [16:01:36] ah ok [16:02:02] the above graph from cxserver yields that they were not able to connect to any external providers [16:02:52] additionally, I would suggest to add the ability to view graphs here https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid?orgId=1 [16:02:54] per DC [16:03:10] as it would have been easier to pinpoint in which DC the problem was [16:04:51] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1039.eqiad.wmnet with reason: host reimage [16:06:32] I think it used to have that [16:06:41] I did notice it was missing today [16:06:49] not sure how/why where it went! 
[16:07:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1039.eqiad.wmnet with reason: host reimage [16:13:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2196.codfw.wmnet with reason: host reimage [16:15:48] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9580894 (10cmooney) All moves complete, everything looking good and back responding to ping :) [16:16:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2196.codfw.wmnet with reason: host reimage [16:17:07] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9580897 (10ABran-WMF) thanks! will repool! [16:17:23] (03PS1) 10Ssingh: P:dns::auth::update: move authdns-update state to confd [puppet] - 10https://gerrit.wikimedia.org/r/1006955 (https://phabricator.wikimedia.org/T347054) [16:17:34] topranks: it's ok to revert to conf2004 then? [16:17:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58000 and previous config saved to /var/cache/conftool/dbconfig/20240227-161758-arnaudb.json [16:18:05] fabfur: hi yes it's back pinging so should be good [16:18:06] thanks! [16:18:11] thanks to you! 
[16:18:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58001 and previous config saved to /var/cache/conftool/dbconfig/20240227-161815-arnaudb.json [16:18:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58002 and previous config saved to /var/cache/conftool/dbconfig/20240227-161827-arnaudb.json [16:18:39] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1006955 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:19:05] (03PS1) 10Fabfur: Revert "codfw lvs::balancer: Switch config_host to conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/1006772 [16:19:34] (03CR) 10Ssingh: [C: 03+1] Revert "codfw lvs::balancer: Switch config_host to conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/1006772 (owner: 10Fabfur) [16:20:35] (03CR) 10Fabfur: [C: 03+2] Revert "codfw lvs::balancer: Switch config_host to conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/1006772 (owner: 10Fabfur) [16:22:05] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:23:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:23:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1039.eqiad.wmnet with OS bookworm [16:23:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jclark@cumin1002 for host es1039.eqiad.wmnet with OS bookworm completed: - es1039 (**PASS**) - Rem... [16:23:39] !log restarting pybal on lvs2014,lvs2011,lvs2012 and lvs2013 for T355544 [16:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:45] T355544: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 [16:25:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580922 (10Jclark-ctr) [16:25:51] (03PS3) 10RLazarus: k8s-controller-sidecars: Add the other missing namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006606 (https://phabricator.wikimedia.org/T348284) [16:26:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580925 (10Jclark-ctr) 05Open→03Resolved [16:27:59] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9580929 (10fnegri) 05Open→03Stalled [16:29:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9580934 (10Marostegui) [16:29:39] (03CR) 10RLazarus: "Whoops, sure does" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006606 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [16:30:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-hd1001.eqiad.wmnet with OS bookworm [16:30:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9580936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-hd1001.eqiad.wmnet with OS bookworm [16:30:44] !log 
jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:32:19] (03CR) 10Ssingh: "After discussing in Traffic, we are going to abandon alerting on the cp hosts for this and instead focus on the analytics side. We will pu" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [16:32:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-hd1002.eqiad.wmnet with OS bookworm [16:32:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9580951 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-hd1002.eqiad.wmnet with OS bookworm [16:32:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:33:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58003 and previous config saved to /var/cache/conftool/dbconfig/20240227-163303-arnaudb.json [16:33:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2196.codfw.wmnet with OS bookworm [16:33:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9580954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2196.codfw.wmnet with OS bookworm completed: - db2196 (**PASS**) -... 
[16:33:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58004 and previous config saved to /var/cache/conftool/dbconfig/20240227-163320-arnaudb.json [16:33:27] (03CR) 10Ssingh: ""Majavah -1 on introducing a new Icinga check, anything new should be in Prometheus" I asked Observability and they confirmed that there i" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [16:33:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58005 and previous config saved to /var/cache/conftool/dbconfig/20240227-163332-arnaudb.json [16:33:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9580955 (10Jhancock.wm) @Marostegui thanks for the tip! [16:35:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9580957 (10Jhancock.wm) [16:38:16] (03CR) 10Ssingh: "volans: Thanks for the feedback. We might end up doing something like this but not sure where. While the commit doesn't make it fully clea" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [16:39:14] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9580969 (10LSobanski) @ayounsi we're on a tight schedule here as we're trying to get contint off of Buster by EOL. The approach of using a Gan... [16:39:31] (03CR) 10Ssingh: "I have addressed all comments above; if not please let me know. Abandoning this as mentioned." 
[puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [16:39:50] (03Abandoned) 10Ssingh: P:cache::base: add script to check versions of varnish and varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [16:42:31] (03CR) 10Klausman: [C: 03+2] partman/preseed: Add ml-staging2003 to standard LW worker recipe [puppet] - 10https://gerrit.wikimedia.org/r/1006927 (https://phabricator.wikimedia.org/T357415) (owner: 10Klausman) [16:47:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logging-hd1002.eqiad.wmnet with OS bookworm [16:47:39] (03PS1) 10Marostegui: clouddb1013: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1006963 (https://phabricator.wikimedia.org/T356838) [16:47:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581047 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-hd1002.eqiad.wmnet with OS bookworm executed with errors:... [16:48:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58006 and previous config saved to /var/cache/conftool/dbconfig/20240227-164808-arnaudb.json [16:48:19] (03CR) 10Marostegui: "Host not depooled. The migration will happen on Wed and I will depool beforehand." 
[puppet] - 10https://gerrit.wikimedia.org/r/1006963 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [16:48:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58007 and previous config saved to /var/cache/conftool/dbconfig/20240227-164825-arnaudb.json [16:48:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58008 and previous config saved to /var/cache/conftool/dbconfig/20240227-164837-arnaudb.json [16:49:25] (03CR) 10Majavah: [C: 03+1] clouddb1013: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1006963 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [16:49:43] !log Uncordoning mw2260.codfw.wmnet mw2267.codfw.wmnet mw2310.codfw.wmnet mw2311.codfw.wmnet mw2312.codfw.wmnet mw2313.codfw.wmnet mw2317.codfw.wmnet mw2318.codfw.wmnet mw2319.codfw.wmnet kubernetes2030.codfw.wmnet kubernetes2029.codfw.wmnet kubernetes2057.codfw.wmnet for T355870 [16:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:49] T355870: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 [16:51:35] !log Repooling mw2324.codfw.wmnet,mw2323.codfw.wmnet,mw2259.codfw.wmnet,mw2261.codfw.wmnet,mw2262.codfw.wmnet,mw2263.codfw.wmnet,mw2264.codfw.wmnet,mw2265.codfw.wmnet,mw2266.codfw.wmnet,mw2268.codfw.wmnet,mw2269.codfw.wmnet,mw2270.codfw.wmnet,mw2314.codfw.wmnet,mw2315.codfw.wmnet,mw2316.codfw.wmnet,mw2320.codfw.wmnet,mw2321.codfw.wmnet,mw2322.codfw.wmnet for T355870 [16:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:48] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9581075 (10ayounsi) To clarify, there was no 
blocker in any of my comments. On the last one, I was genuinely wondering why a CloudVPS was not... [17:00:04] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T1700) [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:01:15] !log jiji@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=citoid,name=eqiad [17:01:26] !log dzahn@cumin1002 START - Cookbook sre.hosts.decommission for hosts contint1003.eqiad.wmnet [17:01:36] !log pool citoid eqiad back [17:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58009 and previous config saved to /var/cache/conftool/dbconfig/20240227-170312-arnaudb.json [17:03:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58010 and previous config saved to /var/cache/conftool/dbconfig/20240227-170330-arnaudb.json [17:03:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58011 and previous config saved to /var/cache/conftool/dbconfig/20240227-170342-arnaudb.json [17:05:25] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [17:07:21] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1002" [17:08:26] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1002" [17:08:26] !log 
dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:08:26] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts contint1003.eqiad.wmnet [17:08:32] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9581125 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1002 for hosts: `contint1003.eqiad.wmnet` - contint1003... [17:09:39] !log dzahn@cumin1002 START - Cookbook sre.hosts.decommission for hosts contint1004.eqiad.wmnet [17:13:52] (03PS1) 10Cwhite: profile: expand logging-hd preseed filter to include eqiad nodes [puppet] - 10https://gerrit.wikimedia.org/r/1006868 (https://phabricator.wikimedia.org/T355700) [17:14:32] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [17:17:53] (03CR) 10Dzahn: site: apply etherpad role on both eqiad and codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [17:19:19] (03CR) 10Dzahn: [C: 03+1] "I think PS5 should have also worked as "any 3 digits after 1 or 2" but I just copied this and haven't used it normally. Your version is ce" [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [17:19:47] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1002" [17:22:48] (03CR) 10Dzahn: "I like the part that it adds a global "active_server" in common.yaml next to the exiting other servers there. 
I was thinking about this ye" [puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: Jelto) [17:22:58] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1002" [17:22:59] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:22:59] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts contint1004.eqiad.wmnet [17:23:05] SRE, Continuous-Integration-Infrastructure, collaboration-services, vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9581169 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1002 for hosts: `contint1004.eqiad.wmnet` - contint1004... [17:24:22] PROBLEM - Juniper alarms on lsw1-b7-codfw.mgmt is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:25:40] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9581175 (Jhancock.wm) I think the SFP may need to be replaced. please let me know when it is safe to do so. [17:26:00] (CR) Cwhite: [C: +2] "PCC OK https://puppet-compiler.wmflabs.org/output/1006868/1481/" [puppet] - https://gerrit.wikimedia.org/r/1006868 (https://phabricator.wikimedia.org/T355700) (owner: Cwhite) [17:29:28] (CR) Dzahn: [C: +1] "I like the general approach. 
I guess I just didn't expect that observability would want us to add a parameter specific to etherpad (or any" [puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: Jelto) [17:31:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-hd1002.eqiad.wmnet with OS bookworm [17:31:17] SRE, ops-eqiad, DC-Ops, SRE Observability, Patch-For-Review: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581198 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-hd1002.eqiad.wmnet with OS bookworm [17:32:43] (CR) Majavah: "`service_ensure` works like this:" [puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: Jelto) [17:37:17] (CR) Btullis: [C: +1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [17:37:23] (PS11) FNegri: elasticsearch: move to opensearch client [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro) [17:38:56] SRE, Continuous-Integration-Infrastructure, collaboration-services, vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9581213 (Dzahn) Stalled→Open [17:41:29] (PS12) FNegri: elasticsearch: move to opensearch client [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro) [17:43:37] (CR) FNegri: elasticsearch: move to opensearch client (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro) [17:49:51] SRE, ops-codfw: Netbox errors caused by system board replacement - 
https://phabricator.wikimedia.org/T358542#9581257 (Jhancock.wm) If updating the Accounting sheet is acceptable, I can do that. I will also update the servers with journal notes to keep track of what has been changed with which device. A... [17:50:46] (PS1) Dzahn: phabricator: add scap::user setup to migration profile [puppet] - https://gerrit.wikimedia.org/r/1006967 (https://phabricator.wikimedia.org/T357572) [17:53:08] (CR) Andrew Bogott: [C: +2] OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [17:53:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-hd1003.eqiad.wmnet with OS bookworm [17:54:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-hd1001.eqiad.wmnet with OS bookworm [17:54:03] SRE, ops-eqiad, DC-Ops, SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581283 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-hd1003.eqiad.wmnet with OS bookworm [17:54:06] SRE, ops-eqiad, DC-Ops, SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581284 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-hd1001.eqiad.wmnet with OS bookworm [17:54:15] (CR) Volans: [C: +1] "LGTM, let's get approval from search and observability too for their use cases. Thanks a lot for fixing this!" [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro) [17:54:26] (CR) Dzahn: "This has no effect on production servers. 
The class would be used on a new or reimaged server before the first scap deploy of phabricator/" [puppet] - https://gerrit.wikimedia.org/r/1006967 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn) [17:56:44] SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9581299 (Volans) >>! In T358542#9581257, @Jhancock.wm wrote: > If updating the Accounting sheet is acceptable, I can do that. I will also update the servers with journal notes to keep track of wha... [17:56:52] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd1002.eqiad.wmnet with reason: host reimage [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T1800) [18:01:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd1002.eqiad.wmnet with reason: host reimage [18:15:14] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [18:15:40] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [18:17:23] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [18:18:11] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [18:19:43] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd1001.eqiad.wmnet with reason: host reimage [18:19:53] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd1003.eqiad.wmnet with reason: host reimage [18:21:23] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [18:22:00] (PS1) Dzahn: phabricator: setup scap bin link in migration class [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) [18:22:08] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply 
[18:22:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd1001.eqiad.wmnet with reason: host reimage [18:23:17] (CR) CI reject: [V: -1] phabricator: setup scap bin link in migration class [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn) [18:23:48] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:24:22] (PS2) Dzahn: phabricator: setup scap bin link in migration class [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) [18:24:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd1003.eqiad.wmnet with reason: host reimage [18:25:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:25:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd1002.eqiad.wmnet with OS bookworm [18:25:06] SRE, ops-eqiad, DC-Ops, SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581359 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-hd1002.eqiad.wmnet with OS bookworm completed: - logging-h... 
[18:25:24] !log deploying refinery [18:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:58] (PS3) Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) [18:27:19] !log tchin@deploy2002 Started deploy [analytics/refinery@ac9fd7b]: Regular analytics weekly train [analytics/refinery@ac9fd7b4] [18:29:15] (CR) Dzahn: "let's get back to this. one option could be to close the linked ticket as declined if it's considered more important to have those hosts i" [puppet] - https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) (owner: EoghanGaffney) [18:37:11] !log tchin@deploy2002 Finished deploy [analytics/refinery@ac9fd7b]: Regular analytics weekly train [analytics/refinery@ac9fd7b4] (duration: 09m 51s) [18:37:47] * tchin !log rollbacked refinery deployment, failed on stat1010 and stat1011 [18:38:31] !log rollbacked refinery deployment, failed on stat1010 and stat1011 [18:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:09] (PS1) Fabfur: cache: start using benthos on single host for haproxy log parsing [puppet] - https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) [18:40:19] (CR) CI reject: [V: -1] cache: start using benthos on single host for haproxy log parsing [puppet] - https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [18:41:26] (PS1) Dzahn: cache::text: remove git.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) [18:42:39] (PS2) Fabfur: cache: start using benthos on single host for haproxy log parsing [puppet] - https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) [18:44:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:46:24] (CR) CI reject: [V: -1] cache: start using benthos on single host for haproxy log parsing [puppet] - https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [18:46:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:46:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd1001.eqiad.wmnet with OS bookworm [18:46:56] SRE, ops-eqiad, DC-Ops, SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581408 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-hd1001.eqiad.wmnet with OS bookworm completed: - logging-h... [18:46:59] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:47:31] (PS3) Fabfur: cache: start using benthos on single host for haproxy log parsing [puppet] - https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) [18:48:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-hd1001.eqiad.wmnet with OS bookworm [18:48:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logging-hd1001.eqiad.wmnet with OS bookworm [18:48:13] SRE, ops-eqiad, DC-Ops, SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581415 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-hd1001.eqiad.wmnet with OS bookworm [18:48:15] SRE, ops-eqiad, DC-Ops, SRE Observability: 
Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581416 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-hd1001.eqiad.wmnet with OS bookworm executed with errors:... [18:48:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:48:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd1003.eqiad.wmnet with OS bookworm [18:48:47] SRE, ops-eqiad, DC-Ops, SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581417 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-hd1003.eqiad.wmnet with OS bookworm completed: - logging-h... [18:49:09] !log tchin@deploy2002 Started deploy [analytics/refinery@ac9fd7b]: Regular analytics weekly train [analytics/refinery@ac9fd7b4] [18:49:28] !log tchin@deploy2002 Finished deploy [analytics/refinery@ac9fd7b]: Regular analytics weekly train [analytics/refinery@ac9fd7b4] (duration: 00m 18s) [18:49:58] !log tchin@deploy2002 Started deploy [analytics/refinery@ac9fd7b] (thin): Regular analytics weekly train THIN [analytics/refinery@ac9fd7b4] [18:50:04] !log tchin@deploy2002 Finished deploy [analytics/refinery@ac9fd7b] (thin): Regular analytics weekly train THIN [analytics/refinery@ac9fd7b4] (duration: 00m 06s) [18:50:15] !log tchin@deploy2002 Started deploy [analytics/refinery@ac9fd7b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ac9fd7b4] [18:53:30] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[18:53:41] (PS1) Dzahn: phabricator: remove git.wikimedia.org vhost, rewrites and tests [puppet] - https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) [18:53:57] !log tchin@deploy2002 Finished deploy [analytics/refinery@ac9fd7b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ac9fd7b4] (duration: 03m 42s) [18:57:40] !log finished deploying refinery successfully [18:57:41] (CR) Dzahn: "about gitblit also see: https://phabricator.wikimedia.org/T111465" [puppet] - https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: Dzahn) [18:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:09] SRE, ops-eqiad, DC-Ops, SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9581424 (Jclark-ctr) Open→Resolved a:VRiley-WMF→Jclark-ctr [18:58:40] (CR) Andrew Bogott: [C: +2] Remove unused PAWS classes [puppet] - https://gerrit.wikimedia.org/r/1006852 (owner: Majavah) [19:00:05] dduvall and jeena: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T1900). [19:01:55] o/ [19:04:58] o/ [19:06:12] (PS2) Dzahn: phabricator: remove git.wikimedia.org vhost, rewrites and tests [puppet] - https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) [19:07:03] (CR) Andrew Bogott: "This seems like an improvement, but I think there's still the risk that the backup job is still running by the time the cleanup starts. 
Ma" [puppet] - https://gerrit.wikimedia.org/r/1006066 (https://phabricator.wikimedia.org/T356904) (owner: FNegri) [19:07:16] (CR) Andrew Bogott: [C: +1] P:wmcs::backup_cinder_volumes: avoid race condition [puppet] - https://gerrit.wikimedia.org/r/1006066 (https://phabricator.wikimedia.org/T356904) (owner: FNegri) [19:09:24] SRE, LDAP-Access-Requests: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584#9581442 (Dzahn) @Fring please send an email to Katie Francis (@KFrancis ) (https://meta.wikimedia.org/wiki/User:KFrancis_(WMF)) and she will get back to you about the NDA. [19:10:24] SRE, LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9581445 (Dzahn) a:Clement_Goubert→None [19:12:48] SRE, LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9581448 (Dzahn) @Ifeatu_Nnaobi_WMDE please send an email to Katie Francis (@KFrancis ) (https://meta.wikimedia.org/wiki/User:KFrancis_(WMF)) and she will get back to you about signing an... 
[19:15:53] (PS1) TrainBranchBot: group0 wikis to 1.42.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1006984 (https://phabricator.wikimedia.org/T354438) [19:15:57] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.42.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1006984 (https://phabricator.wikimedia.org/T354438) (owner: TrainBranchBot) [19:16:43] (Merged) jenkins-bot: group0 wikis to 1.42.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1006984 (https://phabricator.wikimedia.org/T354438) (owner: TrainBranchBot) [19:26:00] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.20 refs T354438 [19:26:06] T354438: 1.42.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T354438 [19:26:09] jouncebot: nowandnext [19:26:10] For the next 1 hour(s) and 33 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T1900) [19:26:10] In 1 hour(s) and 33 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T2100) [19:36:10] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T347624, testing 961878 patch) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [19:36:16] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [19:40:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:40:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:40:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T352010)', diff saved to https://phabricator.wikimedia.org/P58012 and previous config saved to 
/var/cache/conftool/dbconfig/20240227-194021-ladsgroup.json [19:40:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:44:47] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [19:47:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T347624, testing 961878 patch) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [19:47:28] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [19:48:31] (CR) Brennen Bearnes: [C: +1] "I don't have strong feelings here, but I think I'm basically fine with the removal inasmuch as I don't have a good idea for where it shoul" [puppet] - https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: Dzahn) [19:51:13] (CR) Ryan Kemper: [C: +2] wdqs.data_transfer: refactor spicerack class api [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper) [20:03:36] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9581652 (phaultfinder) [20:05:16] (PS1) Dzahn: delete passwords::etherpad [labs/private] - https://gerrit.wikimedia.org/r/1006988 [20:10:18] (PS2) Dzahn: delete passwords::etherpad [labs/private] - https://gerrit.wikimedia.org/r/1006988 [20:12:15] (PS1) Ladsgroup: beta: Remove more mentions of the old replicas [mediawiki-config] - https://gerrit.wikimedia.org/r/1006990 (https://phabricator.wikimedia.org/T358329) [20:13:14] (CR) Ladsgroup: [C: +2] beta: Remove more mentions of the old replicas [mediawiki-config] - https://gerrit.wikimedia.org/r/1006990 
(https://phabricator.wikimedia.org/T358329) (owner: Ladsgroup) [20:13:58] (Merged) jenkins-bot: beta: Remove more mentions of the old replicas [mediawiki-config] - https://gerrit.wikimedia.org/r/1006990 (https://phabricator.wikimedia.org/T358329) (owner: Ladsgroup) [20:13:59] jeena and dduvall I'm pushing a beta cluster patch, no deploy, just rebase on deploy1002, if you see anything, my fault :P [20:14:58] rebased on deploy2002 [20:15:01] 👍 thanks for the info [20:17:13] thanks, Amir1 [20:20:20] (PS1) Bking: wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) [20:21:48] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) (owner: Bking) [20:29:20] (CR) Jforrester: "Oops, thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/1006990 (https://phabricator.wikimedia.org/T358329) (owner: Ladsgroup) [20:36:48] (PS2) Bking: wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) [20:37:57] (CR) CI reject: [V: -1] wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) (owner: Bking) [20:40:29] (PS3) Bking: wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) [20:40:42] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:41:22] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:42:41] (PS1) Dzahn: Revert "site: replace contint1003 with contint1004" [puppet] - https://gerrit.wikimedia.org/r/1006773 [20:43:07] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START 
helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:44:05] (CR) Dzahn: [C: +2] Revert "site: replace contint1003 with contint1004" [puppet] - https://gerrit.wikimedia.org/r/1006773 (owner: Dzahn) [20:45:44] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:45:50] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) (owner: Bking) [20:45:51] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:47:33] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:47:39] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:48:13] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync [20:48:17] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync [20:48:35] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync [20:50:37] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:50:43] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:50:49] (PS1) Dzahn: add 'bew' (Betawi) to list of project languages [dns] - https://gerrit.wikimedia.org/r/1006994 (https://phabricator.wikimedia.org/T357866) [20:51:15] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync [20:51:27] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync [20:51:37] (CR) Dzahn: [C: +1] "approved by langcom in https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Betawi" [dns] - https://gerrit.wikimedia.org/r/1006994 (https://phabricator.wikimedia.org/T357866) 
(owner: Dzahn) [20:52:44] (CR) Brennen Bearnes: [C: +1] phabricator: add scap::user setup to migration profile [puppet] - https://gerrit.wikimedia.org/r/1006967 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn) [20:53:02] (CR) Dzahn: [C: +1] "https://iso639-3.sil.org/code/bew | https://en.wikipedia.org/wiki/Betawi_language" [dns] - https://gerrit.wikimedia.org/r/1006994 (https://phabricator.wikimedia.org/T357866) (owner: Dzahn) [20:54:10] (CR) Brennen Bearnes: phabricator: setup scap bin link in migration class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn) [20:54:31] (PS4) Bking: wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) [20:56:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:56:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:57:40] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) (owner: Bking) [20:58:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51453 bytes in 4.262 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:59:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T2100) [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:00:30] (PS1) CDanis: [jaeger] oauth2-proxy doesn't need to authorize [deployment-charts] - https://gerrit.wikimedia.org/r/1006996 (https://phabricator.wikimedia.org/T320555) [21:01:26] (CR) RhinosF1: [C: +1] add 'bew' (Betawi) to list of project languages [dns] - https://gerrit.wikimedia.org/r/1006994 (https://phabricator.wikimedia.org/T357866) (owner: Dzahn) [21:12:13] (PS5) Bking: wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) [21:12:37] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) (owner: Bking) [21:12:44] (PS4) Fabfur: cache: start using benthos on single host for haproxy log parsing [puppet] - https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) [21:18:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:18:13] (PS5) Fabfur: cache: start using benthos on single host for haproxy log parsing [puppet] - https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) [21:19:33] (PS6) Bking: wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) [21:23:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:23:09] (PS6) Fabfur: cache: start using benthos on single host for haproxy log parsing [puppet] - 
https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) [21:35:29] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) (owner: Bking) [21:41:05] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9581849 (Izno) {T200517} has some work done on it that could maybe chan... [22:04:42] (PS1) Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - https://gerrit.wikimedia.org/r/1007000 (https://phabricator.wikimedia.org/T356986) [22:05:59] (CR) CI reject: [V: -1] cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - https://gerrit.wikimedia.org/r/1007000 (https://phabricator.wikimedia.org/T356986) (owner: Cathal Mooney) [22:07:46] (CR) Ryan Kemper: [C: +1] wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) (owner: Bking) [22:08:14] (CR) Bking: [C: +2] wdqs: Add blackbox http checks for SPARQL endpoint [puppet] - https://gerrit.wikimedia.org/r/1006992 (https://phabricator.wikimedia.org/T358029) (owner: Bking) [22:24:52] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host contint1003.eqiad.wmnet [22:24:53] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [22:29:13] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM contint1003.eqiad.wmnet - dzahn@cumin1002" [22:30:06] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM contint1003.eqiad.wmnet - dzahn@cumin1002" 
[22:30:06] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:30:06] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache contint1003.eqiad.wmnet on all recursors [22:30:09] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) contint1003.eqiad.wmnet on all recursors [22:30:35] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM contint1003.eqiad.wmnet - dzahn@cumin1002" [22:31:26] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM contint1003.eqiad.wmnet - dzahn@cumin1002" [22:32:10] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host contint1003.eqiad.wmnet with OS bullseye [22:32:16] SRE, Continuous-Integration-Infrastructure, collaboration-services, vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9581945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint1003.eqiad.wmnet with OS bu... 
[22:32:41] (03CR) 10Dzahn: [C: 03+2] add 'bew' (Betawi) to list of project languages [dns] - 10https://gerrit.wikimedia.org/r/1006994 (https://phabricator.wikimedia.org/T357866) (owner: 10Dzahn) [22:32:46] (03PS2) 10Dzahn: add 'bew' (Betawi) to list of project languages [dns] - 10https://gerrit.wikimedia.org/r/1006994 (https://phabricator.wikimedia.org/T357866) [22:41:32] (03CR) 10Dzahn: phabricator: setup scap bin link in migration class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: 10Dzahn) [22:41:35] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint1003.eqiad.wmnet with reason: host reimage [22:44:28] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint1003.eqiad.wmnet with reason: host reimage [22:45:04] (03CR) 10Dzahn: "where are you, jerkins?" [dns] - 10https://gerrit.wikimedia.org/r/1006994 (https://phabricator.wikimedia.org/T357866) (owner: 10Dzahn) [22:45:13] (03CR) 10Dzahn: [C: 03+2] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1006994 (https://phabricator.wikimedia.org/T357866) (owner: 10Dzahn) [22:46:56] PROBLEM - Disk space on mw2278 is CRITICAL: DISK CRITICAL - free space: / 1650 MB (1% inode=98%): /tmp 1650 MB (1% inode=98%): /var/tmp 1650 MB (1% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [22:47:31] !log DNS - added new project language "bew" - Betawi, also known as Betawi Malay, Jakartan Malay, or Batavian Malay is the spoken language of the Betawi people in Jakarta, Indonesia with an estimated 5 million native speakers. 
T357866 [22:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:37] T357866: Create Wikipedia Betawi - https://phabricator.wikimedia.org/T357866 [22:48:30] (ProbeDown) firing: (2) Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:49:56] (03CR) 10Dzahn: [C: 03+2] phabricator: add scap::user setup to migration profile [puppet] - 10https://gerrit.wikimedia.org/r/1006967 (https://phabricator.wikimedia.org/T357572) (owner: 10Dzahn) [22:50:38] 06SRE, 10ops-eqiad, 06DC-Ops, 06Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253#9581998 (10dr0ptp4kt) After setup, I would be interested in using it for 6 weeks if that's okay (hopefully things would only take 4 weeks, but there's some PTO and real life... [22:50:48] (03CR) 10Dzahn: "Do we still want this, Arnold?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) (owner: 10AOkoth) [22:53:30] (ProbeDown) firing: (4) Service wdqs1011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:57:53] (03PS4) 10Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) [22:57:55] (03PS2) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007000 (https://phabricator.wikimedia.org/T356986) [22:59:16] (03CR) 10CI reject: [V: 04-1] cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007000 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [23:03:10] (03PS1) 10Bking: wdqs: loosen up regex for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1007006 (https://phabricator.wikimedia.org/T358029) [23:04:04] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007006 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [23:05:31] (03PS1) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) [23:06:09] (03Abandoned) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007000 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [23:08:09] (03PS2) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) [23:09:54] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.reimage for host db2197.codfw.wmnet with OS bookworm [23:09:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2198.codfw.wmnet with OS bookworm [23:09:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2199.codfw.wmnet with OS bookworm [23:09:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2200.codfw.wmnet with OS bookworm [23:09:59] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582007 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2197.codfw.wmnet with OS bookworm [23:10:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2201.codfw.wmnet with OS bookworm [23:10:02] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2198.codfw.wmnet with OS bookworm [23:10:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2202.codfw.wmnet with OS bookworm [23:10:06] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2199.codfw.wmnet with OS bookworm [23:10:12] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2200.codfw.wmnet with OS bookworm [23:10:19] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582011 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
was started by jhancock@cumin2002 for host db2201.codfw.wmnet with OS bookworm [23:10:25] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582012 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2202.codfw.wmnet with OS bookworm [23:12:44] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: loosen up regex for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1007006 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [23:14:31] (03CR) 10Bking: [C: 03+2] wdqs: loosen up regex for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1007006 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [23:17:05] (03PS1) 10Dzahn: delete passwords::racktables [labs/private] - 10https://gerrit.wikimedia.org/r/1007008 (https://phabricator.wikimedia.org/T327405) [23:19:45] (03PS1) 10Dzahn: delete passwords::tendril and passwords::bugzilla [labs/private] - 10https://gerrit.wikimedia.org/r/1007009 [23:22:58] (03PS1) 10Dzahn: delete passwords::mysql::wikimania_scholarships and passwords::tor [labs/private] - 10https://gerrit.wikimedia.org/r/1007010 [23:24:26] (03PS1) 10Dzahn: delete grafana password classes [labs/private] - 10https://gerrit.wikimedia.org/r/1007011 [23:30:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2200.codfw.wmnet with reason: host reimage [23:33:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2201.codfw.wmnet with reason: host reimage [23:33:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2200.codfw.wmnet with reason: host reimage [23:33:30] (ProbeDown) firing: (4) Service wdqs1011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:33:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2202.codfw.wmnet with reason: host reimage [23:33:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2199.codfw.wmnet with reason: host reimage [23:36:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2201.codfw.wmnet with reason: host reimage [23:38:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: host reimage [23:40:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2199.codfw.wmnet with reason: host reimage [23:41:05] (03PS1) 10Bking: wdqs: remove failing blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1007014 (https://phabricator.wikimedia.org/T358029) [23:42:27] (03CR) 10Cwhite: [C: 03+1] elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [23:43:30] (ProbeDown) resolved: (2) Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:45:38] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:49:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:49:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2200.codfw.wmnet with OS 
bookworm [23:49:31] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582099 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2200.codfw.wmnet with OS bookworm completed: - db2200 (**PASS**) -... [23:50:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:52:26] !log T358237 - creating VM with cookbook fails because puppet runs have certificate issue, applied role is already migrated to puppet 7 though [23:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:33] T358237: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237 [23:54:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:54:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2201.codfw.wmnet with OS bookworm [23:54:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:55:12] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2201.codfw.wmnet with OS bookworm completed: - db2201 (**PASS**) -... 
[23:57:22] !log T358237 - manually went through "fix forward"-steps from T349619 (install puppet-agent package, delete old key material, create new CSR, sign on puppetserver, node clean on puppetmaster) to fix puppet failures while makevm cookbook still running (which couldn't find successful puppet run) [23:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:30] T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619