[00:07:28] (03PS1) 10Superpes15: Add 'sfsblock-bypass' flag to sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/916883 (https://phabricator.wikimedia.org/T336141) [00:39:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916863 [00:39:08] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916863 (owner: 10TrainBranchBot) [00:56:23] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916863 (owner: 10TrainBranchBot) [01:15:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:54] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [02:22:54] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:42:21] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:13:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:37:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:54:14] !log Deploy schema change on x1 eqiad wikishared with replication dbmaint T335834 [04:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:18] T335834: Update cx_section_translations table - https://phabricator.wikimedia.org/T335834 [04:57:54] Thanks @marostegui [05:05:53] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 5:00:00 on dbproxy1013.eqiad.wmnet with reason: Maintenance [05:06:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbproxy1013.eqiad.wmnet with reason: Maintenance [05:10:34] (03PS1) 10Marostegui: db1113: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/916894 (https://phabricator.wikimedia.org/T336029) [05:10:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113 (s5,s6) T336029', diff saved to https://phabricator.wikimedia.org/P47783 and previous config saved to /var/cache/conftool/dbconfig/20230508-051036-root.json [05:10:40] T336029: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 [05:11:22] (03CR) 10Marostegui: [C: 03+2] db1113: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/916894 (https://phabricator.wikimedia.org/T336029) (owner: 10Marostegui) [05:13:10] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/916895 [05:17:34] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo) @jijiki I am an employee of WMDE on a new team called "Wikibase Suite". I think my request above has been consolidated into a team-wide request and has been handled, so this can be closed. Tha... 
[05:18:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on pc1014.eqiad.wmnet with reason: Maintenance [05:18:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1014.eqiad.wmnet with reason: Maintenance [05:20:54] (03PS1) 10Marostegui: Revert "pc2011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/915724 [05:21:10] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915725 [05:22:35] (03CR) 10Marostegui: [C: 03+2] Revert "pc2011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/915724 (owner: 10Marostegui) [05:25:06] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915725 (owner: 10Marostegui) [05:26:05] (03PS1) 10Marostegui: pc2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/916896 [05:26:30] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915725 (owner: 10Marostegui) [05:26:47] (03CR) 10Marostegui: [C: 03+2] pc2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/916896 (owner: 10Marostegui) [05:27:59] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:915725|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] [05:40:21] (03PS1) 10Marostegui: pc2014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/916897 [05:42:17] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:915725|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [05:44:09] (03CR) 10Marostegui: [C: 03+2] pc2014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/916897 (owner: 
10Marostegui) [05:46:48] !log phedenskog@deploy1002 Started deploy [performance/navtiming@9b22d3b]: Measure largest contentful paint element type [05:46:54] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@9b22d3b]: Measure largest contentful paint element type (duration: 00m 05s) [05:55:46] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:915725|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] (duration: 27m 46s) [06:08:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [06:12:42] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:13:50] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/916900 [06:32:35] (03CR) 10Muehlenhoff: "The NDA has been updated." 
[puppet] - 10https://gerrit.wikimedia.org/r/914348 (owner: 10Addshore) [06:32:40] (03PS2) 10Muehlenhoff: admin: Remove self from some wmde groups & fix email [puppet] - 10https://gerrit.wikimedia.org/r/914348 (owner: 10Addshore) [06:34:13] (03CR) 10Muehlenhoff: [C: 03+2] admin: Remove self from some wmde groups & fix email [puppet] - 10https://gerrit.wikimedia.org/r/914348 (owner: 10Addshore) [06:40:49] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:42:07] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: improve failed first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/910461 (https://phabricator.wikimedia.org/T334880) (owner: 10Volans) [06:43:02] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [06:43:42] (03CR) 10Muehlenhoff: [C: 03+2] Move duplicity check for apt keyrings to !defined [puppet] - 10https://gerrit.wikimedia.org/r/916434 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [06:44:04] (03CR) 10Volans: [C: 03+2] decorators: fix dry_run detection [software/spicerack] - 10https://gerrit.wikimedia.org/r/915434 (https://phabricator.wikimedia.org/T335855) (owner: 10Volans) [06:44:18] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [06:45:16] (03PS1) 10DLynch: Enable DiscussionTools visual enhancements a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/916903 (https://phabricator.wikimedia.org/T302358) [06:45:19] (03Merged) 10jenkins-bot: sre.hosts.reimage: improve failed first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/910461 (https://phabricator.wikimedia.org/T334880) (owner: 10Volans) [06:47:06] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [06:47:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm [06:48:10] !log Deployed MinT to the production (T331505) [06:48:13] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:13] T331505: Self hosted machine translation service - https://phabricator.wikimedia.org/T331505 [06:48:14] (03Merged) 10jenkins-bot: decorators: fix dry_run detection [software/spicerack] - 10https://gerrit.wikimedia.org/r/915434 (https://phabricator.wikimedia.org/T335855) (owner: 10Volans) [06:48:19] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [06:48:30] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm [06:48:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host netflow2003.codfw.wmnet with OS bookworm [06:48:50] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - **The rei... 
[06:49:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm [06:49:29] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm [06:49:35] (03CR) 10Abijeet Patro: Add MinT support to cxserver (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (owner: 10KartikMistry) [06:50:27] !log bounce ferm on aux-k8s-ctrl1001 [06:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:36] RECOVERY - Check systemd state on aux-k8s-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:51:51] (03PS1) 10Volans: ganeti: split test line to avoid noqa [software/spicerack] - 10https://gerrit.wikimedia.org/r/916904 [06:55:58] (03PS12) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [06:56:09] (03CR) 10KartikMistry: "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (owner: 10KartikMistry) [06:58:05] (03CR) 10Volans: [C: 03+2] "trivial, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/916904 (owner: 10Volans) [07:00:05] Amir1, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:01:54] (03Merged) 10jenkins-bot: ganeti: split test line to avoid noqa [software/spicerack] - 10https://gerrit.wikimedia.org/r/916904 (owner: 10Volans) [07:02:01] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [07:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:05:17] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [07:06:01] (03PS2) 10Jelto: miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) [07:06:41] RECOVERY - Check whether ferm is active by checking the default input chain on aux-k8s-ctrl1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:07:14] (03PS4) 10Giuseppe Lavagetto: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 [07:08:31] (NodeTextfileStale) resolved: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:09:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus::server: fix target and rule purge rules [puppet] - 10https://gerrit.wikimedia.org/r/916422 (owner: 10Majavah) [07:09:30] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.4.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/916926 [07:10:11] (03CR) 10Volans: [C: 03+2] 
CHANGELOG: add changelogs for release v6.4.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/916926 (owner: 10Volans) [07:10:45] (03PS12) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [07:11:03] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:11:34] (03CR) 10Jelto: [C: 04-1] "I separated the configuration for every release in individual values files in I26940bf1d90bb1ca9e5ff0a2111f3ac810ec0172. So this change co" [deployment-charts] - 10https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [07:11:53] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host netflow2003.codfw.wmnet with OS bookworm [07:11:56] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Downtimed... 
[07:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:13:31] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_ [07:14:07] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v6.4.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/916926 (owner: 10Volans) [07:16:37] (03PS1) 10Volans: Upstream release v6.4.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/916929 [07:16:49] (03CR) 10Volans: [C: 03+2] Upstream release v6.4.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/916929 (owner: 10Volans) [07:18:01] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5025.eqsin.wmnet, cp5029.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5027.eqsin.wmnet, cp5028.eqsin.wmnet, cp5032.eqsin.wmnet, cp5025.eqsin.wmnet, cp5029.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:18:44] (HaproxyUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:19:05] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /_info (get the service info) is CRITICAL: Test get the service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:19:10] !log updated bookworm installer to RC2 T330495 [07:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:14] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [07:19:17] RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:19:46] <_joe_> we have an outage in eqsin [07:20:04] <_joe_> seems to be recovering [07:20:45] (03Merged) 10jenkins-bot: Upstream release v6.4.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/916929 (owner: 10Volans) [07:20:55] * volans here too [07:21:32] <_joe_> NOT recovering [07:21:39] I'm around [07:21:49] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5027.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:22:03] <_joe_> yeah again in eqsin [07:22:07] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [07:22:21] <_joe_> volans: do you understand what's wrong? 
[07:22:25] <_joe_> because I surely don't [07:22:32] not yet [07:22:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm [07:22:46] <_joe_> ah there was a traffic spike [07:22:48] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm [07:23:06] <_joe_> another hotlink maybe [07:23:07] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /{src}/info.json (Get service info for osm-intl) is CRITICAL: Test Get service info for osm-intl returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:23:17] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5027.eqsin.wmnet, cp5032.eqsin.wmnet, cp5028.eqsin.wmnet, cp5025.eqsin.wmnet, cp5030.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5028.eqsin.wmnet, cp5030.eqsin.wmnet, cp5025.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:23:28] <_joe_> yeah the load balancers are going down [07:23:51] this is upload in the last 15m https://superset.wikimedia.org/superset/dashboard/p/KawrLzdv29R/ [07:23:55] in eqsin only [07:24:07] (ProbeDown) firing: Service upload-https:443 has failed probes (http_upload-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:24:09] <_joe_> we have the data in grafana [07:24:54] <_joe_> https://grafana.wikimedia.org/d/000000093/cdn-frontend-network?orgId=1&viewPanel=31&from=now-3h&to=now [07:25:07] (ProbeDown) firing: Service upload-https:443 has failed probes (http_upload-https_ip6) #page 
- https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:25:19] <_joe_> I'm gonna try to restart varnish there [07:25:50] ack [07:25:51] <_joe_> !log running restart-cdn on cp5030 [07:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:40] <_joe_> can you ack the alert please? [07:26:47] sure [07:26:55] <_joe_> yeah varnish-frontend is failing to restart on cp5030 [07:26:57] <_joe_> sigh [07:27:06] * volans didn't get paged yet [07:27:08] acked [07:27:19] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /{src}/info.json (Get service info for osm-intl) is CRITICAL: Test Get service info for osm-intl returned the unexpected status 503 (expecting: 200): /_info (get the service info) is CRITICAL: Test get the service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:28:12] <_joe_> so varnish was in a really bad state [07:28:31] (JobUnavailable) firing: Reduced availability for job swagger_check_maps_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:28:44] <_joe_> yeah I'm restarting varnish on all hosts [07:29:07] (ProbeDown) firing: (2) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:29:28] <_joe_> !log restarting varnish-frontend on upload eqsin [07:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:42] <_joe_> volans: if you can 
try to pin what was the source of the traffic peak at 7:16 [07:30:07] (ProbeDown) resolved: (2) Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:31:19] _joe_: there was no spike in upload, it started with a drop, both globally and looking at eqsin only [07:31:38] <_joe_> the data I showed you from grafana disagrees [07:31:46] <_joe_> we had a spike in bytes transmitted [07:31:51] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /{src}/info.json (Get service info for osm-intl) is CRITICAL: Test Get service info for osm-intl returned the unexpected status 503 (expecting: 200): /_info (get the service info) is CRITICAL: Test get the service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:32:07] (ProbeDown) firing: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:32:33] acked [07:32:45] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp5026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [07:32:54] (JobUnavailable) resolved: Reduced availability for job swagger_check_maps_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:33:33] <_joe_> volans: if it doesn't recover in 2-3 minutes when I'm done restarting varnishes, I'd depool eqsin tbh [07:34:05] I see what 
grafana shows, but I can't find in superset right now... unless I'm still sleepy [07:34:07] (ProbeDown) firing: (2) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:34:16] <_joe_> why is this still failing [07:34:57] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:35:22] (ProbeDown) resolved: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:36:05] <_joe_> volans: looks to me like the usual "varnish can't handle traffic and goes in a bad state" situation [07:36:07] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp5026 is OK: HTTP OK: HTTP/1.1 200 OK - 431 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:36:30] agree [07:36:35] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:36:41] RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:37:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:39:07] (ProbeDown) resolved: (2) Service upload-https:443 has failed probes 
(http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:43:35] (03PS1) 10Muehlenhoff: Move to udebs from unstable, needed until next mirror sync [puppet] - 10https://gerrit.wikimedia.org/r/917170 [07:43:44] (HaproxyUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:43:49] (03PS2) 10Muehlenhoff: Move to udebs from unstable, needed until next mirror sync [puppet] - 10https://gerrit.wikimedia.org/r/917170 [07:44:29] (03CR) 10David Caro: [C: 03+2] P:toolforge: fix toolforge-cli config file location [puppet] - 10https://gerrit.wikimedia.org/r/916788 (owner: 10Majavah) [07:44:44] (03CR) 10David Caro: [C: 03+2] P:toolforge::k8s::client: remove toolforge-webservice link [puppet] - 10https://gerrit.wikimedia.org/r/916789 (owner: 10Majavah) [07:44:50] (03CR) 10David Caro: [C: 03+2] P:toolforge: actually install toolforge-cli package [puppet] - 10https://gerrit.wikimedia.org/r/916790 (owner: 10Majavah) [07:45:14] (03CR) 10David Caro: [C: 04-1] "I would like to discuss the need for this package" [puppet] - 10https://gerrit.wikimedia.org/r/916791 (https://phabricator.wikimedia.org/T336057) (owner: 10Majavah) [07:45:44] (HaproxyUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:45:44] (03CR) 10David Caro: [C: 04-1] P:toolforge: install toolforge-logs-cli (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/916791 (https://phabricator.wikimedia.org/T336057) (owner: 10Majavah) [07:45:54] (03CR) 10Muehlenhoff: [C: 03+2] Move to udebs from unstable, needed until next mirror sync [puppet] - 10https://gerrit.wikimedia.org/r/917170 (owner: 10Muehlenhoff) [07:46:16] <_joe_> !incidents [07:46:17] 3599 (UNACKED) HaproxyUnavailable cache_upload global sre () [07:46:17] 3596 (RESOLVED) HaproxyUnavailable cache_upload global sre () [07:46:17] 3598 (RESOLVED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [07:46:18] 3597 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin) [07:46:23] <_joe_> !ack 3599 [07:46:23] 3599 (ACKED) HaproxyUnavailable cache_upload global sre () [07:46:26] dcaro: shall I merge your patches along? [07:46:47] <_joe_> godog: no idea why, but that alert just fired and it has no reason to fire again [07:47:11] <_joe_> see the graph too [07:49:38] moritzm: yes please [07:50:14] _joe_: taking a look [07:50:44] (HaproxyUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:50:49] <_joe_> godog: nevermind, I just looked, the problem is that metrics seem not to be recovering to 100% but just 98.9% [07:50:54] <_joe_> and the limit is at 99% [07:51:01] <_joe_> the graph tricked me [07:51:05] dcaro: ack, done [07:51:14] _joe_: ah ok, yeah that makes sense, thanks for checking [07:51:14] thanks! 
[07:51:17] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host netflow2003.codfw.wmnet with OS bookworm [07:51:20] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f... [07:51:26] <_joe_> but the metric is flapping which doesn't make much sense [07:53:27] !log fetch HAProxy 2.6.13 on thirdparty/haproxy2.6 (apt.wm.o) - T334448 [07:53:30] <_joe_> godog: is there any way to protect against flapping alerts in alertmanager? [07:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:30] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448 [07:53:31] (03PS1) 10Filippo Giunchedi: Revert "prometheus::server: fix target and rule purge rules" [puppet] - 10https://gerrit.wikimedia.org/r/917148 [07:54:00] <_joe_> !log restarting varnish-frontend on cp5029, last host in eqsin/upload to be restarted [07:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:08] (03CR) 10CI reject: [V: 04-1] Revert "prometheus::server: fix target and rule purge rules" [puppet] - 10https://gerrit.wikimedia.org/r/917148 (owner: 10Filippo Giunchedi) [07:54:44] _joe_: silencing would be the easiest I'd say [07:55:23] (03PS2) 10Filippo Giunchedi: Revert "prometheus::server: fix target and rule purge rules" [puppet] - 10https://gerrit.wikimedia.org/r/917148 [07:55:29] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/916900 (owner: 10Marostegui) [07:55:32] <_joe_> godog: yeah I was asking if it had any algorithms to do so, like nagios did [07:55:32] (03PS1) 10Marostegui: pc1014: Give master role [puppet] - 
10https://gerrit.wikimedia.org/r/917174 [07:56:09] (03CR) 10Marostegui: [C: 03+2] pc1014: Give master role [puppet] - 10https://gerrit.wikimedia.org/r/917174 (owner: 10Marostegui) [07:56:54] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/916900 (owner: 10Marostegui) [07:57:33] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:916900|ProductionServices.php: Promote pc1014 to pc1 master]] [07:57:58] _joe_: ah got it now, I'm not aware of a similar mechanism no [07:59:19] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:916900|ProductionServices.php: Promote pc1014 to pc1 master]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:59:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm [07:59:37] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "prometheus::server: fix target and rule purge rules" [puppet] - 10https://gerrit.wikimedia.org/r/917148 (owner: 10Filippo Giunchedi) [07:59:38] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm [07:59:54] taavi: sorry I had to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/917148 :| see commit [08:00:06] ack :/ [08:01:31] taavi: I'm onboard with the idea though, maybe with rules dir only for now if that works for your use case? [08:04:06] (03PS1) 10Majavah: prometheus::server: properly purge rules_path [puppet] - 10https://gerrit.wikimedia.org/r/917176 [08:04:15] ^ something like that? 
[08:06:32] taavi: yeah exactly, thank you [08:06:40] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus::server: properly purge rules_path [puppet] - 10https://gerrit.wikimedia.org/r/917176 (owner: 10Majavah) [08:10:08] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10jijiki) @darthmon_wmde, I am afraid we will need one task for each request. Please close this one as invalid [08:13:01] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:916900|ProductionServices.php: Promote pc1014 to pc1 master]] (duration: 15m 27s) [08:13:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:13:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:13:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:13:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:13:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P47785 and previous config saved to /var/cache/conftool/dbconfig/20230508-081353-ladsgroup.json [08:13:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [08:14:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [08:14:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T335845)', diff saved to https://phabricator.wikimedia.org/P47786 and 
previous config saved to /var/cache/conftool/dbconfig/20230508-081415-ladsgroup.json [08:16:11] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/916895 (owner: 10Marostegui) [08:16:58] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/916895 (owner: 10Marostegui) [08:17:49] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41076/console" [puppet] - 10https://gerrit.wikimedia.org/r/914340 (https://phabricator.wikimedia.org/T335052) (owner: 10Arturo Borrero Gonzalez) [08:17:55] !log Failover m3-master from dbproxy1020 to dbproxy1016 [08:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:10] (03CR) 10Majavah: [V: 03+1 C: 03+1] kubeadm: install certificates before trying to use them [puppet] - 10https://gerrit.wikimedia.org/r/914340 (https://phabricator.wikimedia.org/T335052) (owner: 10Arturo Borrero Gonzalez) [08:18:14] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host netflow2003.codfw.wmnet with OS bookworm [08:18:16] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f... 
[08:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T335845)', diff saved to https://phabricator.wikimedia.org/P47787 and previous config saved to /var/cache/conftool/dbconfig/20230508-081937-ladsgroup.json [08:24:23] (03PS1) 10Marostegui: Revert "pc1014: Give master role" [puppet] - 10https://gerrit.wikimedia.org/r/917149 [08:24:39] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917150 [08:25:42] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917150 (owner: 10Marostegui) [08:27:01] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917150 (owner: 10Marostegui) [08:27:20] !log HAProxy updated to 2.6.13 on cp1077 and cp1085 - T334448 [08:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:24] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448 [08:27:36] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:917150|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] [08:28:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm [08:28:06] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm [08:29:12] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:917150|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [08:30:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] 
kubeadm: install certificates before trying to use them [puppet] - 10https://gerrit.wikimedia.org/r/914340 (https://phabricator.wikimedia.org/T335052) (owner: 10Arturo Borrero Gonzalez) [08:34:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P47788 and previous config saved to /var/cache/conftool/dbconfig/20230508-083444-ladsgroup.json [08:35:55] 10SRE, 10Wikidata, 10wdwb-tech, 10Patch-For-Review, and 4 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (10Michael) [08:36:01] (03CR) 10Marostegui: [C: 03+2] Revert "pc1014: Give master role" [puppet] - 10https://gerrit.wikimedia.org/r/917149 (owner: 10Marostegui) [08:40:54] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:917150|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] (duration: 13m 18s) [08:43:30] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host netflow2003.codfw.wmnet with OS bookworm [08:43:33] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f... 
[08:45:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [08:46:55] (03PS1) 10Marostegui: pc1014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/917279 [08:47:49] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/917279 (owner: 10Marostegui) [08:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P47789 and previous config saved to /var/cache/conftool/dbconfig/20230508-084950-ladsgroup.json [08:51:01] (03PS1) 10Ladsgroup: Set externallinks migration to read new on mediawiki.org and fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917280 (https://phabricator.wikimedia.org/T326251) [08:52:18] (03PS2) 10Ladsgroup: Set externallinks migration to read new on mediawiki.org and fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917280 (https://phabricator.wikimedia.org/T335343) [08:53:58] (03PS1) 10Marostegui: es102[25],es202[25]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/917282 [08:54:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 es1025 es2025 es2022 for reboots', diff saved to https://phabricator.wikimedia.org/P47790 and previous config saved to /var/cache/conftool/dbconfig/20230508-085435-root.json [08:54:38] jouncebot: nowandnext [08:54:38] No deployments scheduled for the next 1 hour(s) and 5 minute(s) [08:54:39] In 1 hour(s) and 5 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1000) [08:54:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917280 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [08:54:54] (03CR) 10Marostegui: [C: 03+2] es102[25],es202[25]: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/917282 (owner: 10Marostegui) [08:55:35] (03Merged) 10jenkins-bot: Set externallinks migration to read new on mediawiki.org and fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917280 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [08:56:04] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:917280|Set externallinks migration to read new on mediawiki.org and fawikiquote (T335343)]] [08:56:07] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [08:57:30] Gitlab maintenance will be starting in 5 minutes, expect downtime of a few hours. You can follow along with today's maintenance here: https://phabricator.wikimedia.org/T335504 [08:57:45] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:917280|Set externallinks migration to read new on mediawiki.org and fawikiquote (T335343)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:59:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [09:02:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [09:04:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:04:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T335845)', diff saved to https://phabricator.wikimedia.org/P47791 and previous config saved to /var/cache/conftool/dbconfig/20230508-090456-ladsgroup.json [09:05:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db2154.codfw.wmnet with reason: Maintenance [09:05:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [09:05:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T335845)', diff saved to https://phabricator.wikimedia.org/P47792 and previous config saved to /var/cache/conftool/dbconfig/20230508-090521-ladsgroup.json [09:05:54] !log eoghan@cumin1001 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab2002.wikimedia.org to gitlab1004.wikimedia.org [09:09:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:10:09] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:917280|Set externallinks migration to read new on mediawiki.org and fawikiquote (T335343)]] (duration: 14m 04s) [09:10:13] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [09:11:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T335845)', diff saved to https://phabricator.wikimedia.org/P47793 and previous config saved to /var/cache/conftool/dbconfig/20230508-091140-ladsgroup.json [09:12:54] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:14:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P47794 and previous config saved to /var/cache/conftool/dbconfig/20230508-091408-ladsgroup.json [09:15:10] (03PS5) 
10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [09:16:15] (03PS1) 10Marostegui: Revert "es102[25],es202[25]: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/917151 [09:17:13] (03PS4) 10Volans: json-webrequests-stats: add -t/--time-range [puppet] - 10https://gerrit.wikimedia.org/r/854521 [09:18:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm [09:18:46] PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [09:22:05] (03CR) 10Marostegui: [C: 03+2] Revert "es102[25],es202[25]: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/917151 (owner: 10Marostegui) [09:22:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47796 and previous config saved to /var/cache/conftool/dbconfig/20230508-092221-root.json [09:22:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47797 and previous config saved to /var/cache/conftool/dbconfig/20230508-092223-root.json [09:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47798 and previous config saved to /var/cache/conftool/dbconfig/20230508-092232-root.json [09:22:45] (03PS2) 10EoghanGaffney: [gitlab/failover] Swap DNS entries for gitlab [dns] - 10https://gerrit.wikimedia.org/r/912972 (https://phabricator.wikimedia.org/T335504) [09:23:57] (03CR) 10Jbond: "lgtm just some left over code" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 
(https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [09:24:32] (03PS3) 10EoghanGaffney: [gitlab/failover] Switch primary from codfw->eqiad [puppet] - 10https://gerrit.wikimedia.org/r/912881 (https://phabricator.wikimedia.org/T335504) [09:24:35] (03PS4) 10Jbond: add tunnelencabulator [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [09:25:01] (03CR) 10Jbond: [C: 03+1] add tunnelencabulator (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [09:26:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P47799 and previous config saved to /var/cache/conftool/dbconfig/20230508-092647-ladsgroup.json [09:26:51] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Overall looks good to me, after discussion with Janis. I think this is ok as a transitional patch but I'd like us to stop assigning every " [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:29:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P47800 and previous config saved to /var/cache/conftool/dbconfig/20230508-092916-ladsgroup.json [09:31:31] (03CR) 10Muehlenhoff: [C: 03+2] os-updates: Generate an additional overview page with a breakdown per SRE team [puppet] - 10https://gerrit.wikimedia.org/r/916493 (owner: 10Muehlenhoff) [09:33:48] (03CR) 10JMeybohm: [C: 03+2] Add new url downloaders to ACLs [deployment-charts] - 10https://gerrit.wikimedia.org/r/916512 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [09:34:56] (03PS2) 10Effie Mouzeli: data.yaml: Add Surbhi Gupta [puppet] - 10https://gerrit.wikimedia.org/r/916491 (https://phabricator.wikimedia.org/T335657) [09:35:54] (03CR) 10Effie Mouzeli: 
data.yaml: Add Surbhi Gupta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916491 (https://phabricator.wikimedia.org/T335657) (owner: 10Effie Mouzeli) [09:36:21] (03Merged) 10jenkins-bot: Add new url downloaders to ACLs [deployment-charts] - 10https://gerrit.wikimedia.org/r/916512 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [09:37:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 3%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47801 and previous config saved to /var/cache/conftool/dbconfig/20230508-093726-root.json [09:37:27] (03CR) 10Volans: [C: 03+2] "Let's test it! (finally)" [puppet] - 10https://gerrit.wikimedia.org/r/854521 (owner: 10Volans) [09:37:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 3%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47802 and previous config saved to /var/cache/conftool/dbconfig/20230508-093728-root.json [09:37:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 3%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47803 and previous config saved to /var/cache/conftool/dbconfig/20230508-093735-root.json [09:37:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 3%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47804 and previous config saved to /var/cache/conftool/dbconfig/20230508-093743-root.json [09:38:10] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:38:22] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:39:36] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:39:45] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:39:59] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[09:40:11] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:13] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:40:22] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:40:30] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:40:44] moritzm: updated ACLs for urldownloaders have been deployed now [09:41:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/916736 (owner: 10Majavah) [09:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P47805 and previous config saved to /var/cache/conftool/dbconfig/20230508-094153-ladsgroup.json [09:44:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P47806 and previous config saved to /var/cache/conftool/dbconfig/20230508-094422-ladsgroup.json [09:44:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:45:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] apt: unattendedupgrades: upgrade osbpo packages too [puppet] - 10https://gerrit.wikimedia.org/r/916736 (owner: 10Majavah) [09:48:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm [09:48:09] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm [09:52:32] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'es2025 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47807 and previous config saved to /var/cache/conftool/dbconfig/20230508-095231-root.json [09:52:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47808 and previous config saved to /var/cache/conftool/dbconfig/20230508-095233-root.json [09:52:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47809 and previous config saved to /var/cache/conftool/dbconfig/20230508-095240-root.json [09:52:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47810 and previous config saved to /var/cache/conftool/dbconfig/20230508-095248-root.json [09:53:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki) 05Open→03Resolved a:03jijiki Hey @SGupta-WMF, after https://gerrit.wikimedia.org/r/916491 is merged, you can go ahead ask for kerberos access: https://wi... [09:53:43] (03PS3) 10Arturo Borrero Gonzalez: cloud_private_subnet: add support for return traffic to public VIPs if using BGP [puppet] - 10https://gerrit.wikimedia.org/r/916528 (https://phabricator.wikimedia.org/T336071) [09:54:22] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10jijiki) Hey @KFrancis! Could you please set up @lojo with their NDA? Thank you! [09:55:10] (03CR) 10Jcrespo: "Thank you a lot for this. <3 May I ask to update https://wikitech.wikimedia.org/wiki/Logs/Runbook#Webrequests_Sampled ? 
I don't think that" [puppet] - 10https://gerrit.wikimedia.org/r/854521 (owner: 10Volans) [09:55:54] (03CR) 10CI reject: [V: 04-1] cloud_private_subnet: add support for return traffic to public VIPs if using BGP [puppet] - 10https://gerrit.wikimedia.org/r/916528 (https://phabricator.wikimedia.org/T336071) (owner: 10Arturo Borrero Gonzalez) [09:57:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T335845)', diff saved to https://phabricator.wikimedia.org/P47811 and previous config saved to /var/cache/conftool/dbconfig/20230508-095659-ladsgroup.json [09:57:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [09:57:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [09:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47812 and previous config saved to /var/cache/conftool/dbconfig/20230508-095724-ladsgroup.json [09:57:30] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10jijiki) Dear #Infrastructure-Foundations, please choose a name for th... 
[09:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P47813 and previous config saved to /var/cache/conftool/dbconfig/20230508-095928-ladsgroup.json [09:59:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [09:59:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:00:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T335845)', diff saved to https://phabricator.wikimedia.org/P47814 and previous config saved to /var/cache/conftool/dbconfig/20230508-100003-ladsgroup.json [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1000) [10:00:48] (03PS4) 10Arturo Borrero Gonzalez: cloud_private_subnet: add support for return traffic to public VIPs if using BGP [puppet] - 10https://gerrit.wikimedia.org/r/916528 (https://phabricator.wikimedia.org/T336071) [10:02:49] (03PS5) 10Arturo Borrero Gonzalez: cloud_private_subnet: add support for return traffic to public VIPs if using BGP [puppet] - 10https://gerrit.wikimedia.org/r/916528 (https://phabricator.wikimedia.org/T336071) [10:03:33] (03PS2) 10Hnowlan: service: move device-analytics to production [puppet] - 10https://gerrit.wikimedia.org/r/899608 (https://phabricator.wikimedia.org/T320967) [10:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47815 and previous config saved to /var/cache/conftool/dbconfig/20230508-100449-ladsgroup.json [10:04:57] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [10:05:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/916491 (https://phabricator.wikimedia.org/T335657) (owner: 10Effie Mouzeli) [10:05:34] (03CR) 10Effie Mouzeli: [C: 03+2] data.yaml: Add Surbhi Gupta [puppet] - 10https://gerrit.wikimedia.org/r/916491 (https://phabricator.wikimedia.org/T335657) (owner: 10Effie Mouzeli) [10:06:05] RECOVERY - puppet last run on irc2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T335845)', diff saved to https://phabricator.wikimedia.org/P47816 and previous config saved to /var/cache/conftool/dbconfig/20230508-100622-ladsgroup.json [10:07:05] (03PS13) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [10:07:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47817 and previous config saved to /var/cache/conftool/dbconfig/20230508-100736-root.json [10:07:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47818 and previous config saved to /var/cache/conftool/dbconfig/20230508-100737-root.json [10:07:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47819 and previous config saved to /var/cache/conftool/dbconfig/20230508-100744-root.json [10:07:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47820 and previous config saved to /var/cache/conftool/dbconfig/20230508-100753-root.json [10:09:13] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 
10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [10:10:21] (03PS1) 10Effie Mouzeli: data.yaml: add fjoseph to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/917289 (https://phabricator.wikimedia.org/T336009) [10:10:45] (03PS1) 10Volans: spicerack: refactor IRC logging [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/917290 [10:12:44] (03PS1) 10Volans: IRC logging: renamed irc_logger to sal_logger [cookbooks] - 10https://gerrit.wikimedia.org/r/917291 [10:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:13:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [10:15:54] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Switch primary from codfw->eqiad [puppet] - 10https://gerrit.wikimedia.org/r/912881 (https://phabricator.wikimedia.org/T335504) (owner: 10EoghanGaffney) [10:18:25] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [10:19:43] (03CR) 10Vgutierrez: [C: 03+1] service: move device-analytics to production [puppet] - 10https://gerrit.wikimedia.org/r/899608 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [10:19:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P47821 and previous config saved to /var/cache/conftool/dbconfig/20230508-101955-ladsgroup.json [10:20:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/917290 (owner: 10Volans) [10:21:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 
10https://gerrit.wikimedia.org/r/917291 (owner: 10Volans) [10:21:17] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [10:21:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P47822 and previous config saved to /var/cache/conftool/dbconfig/20230508-102128-ladsgroup.json [10:21:41] (03CR) 10Jbond: [C: 03+1] data.yaml: add fjoseph to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/917289 (https://phabricator.wikimedia.org/T336009) (owner: 10Effie Mouzeli) [10:21:59] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye [10:22:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 25%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47823 and previous config saved to /var/cache/conftool/dbconfig/20230508-102240-root.json [10:22:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47824 and previous config saved to /var/cache/conftool/dbconfig/20230508-102242-root.json [10:22:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47825 and previous config saved to /var/cache/conftool/dbconfig/20230508-102249-root.json [10:22:55] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1397 is CRITICAL: etcd last index (1923336) is outdated compared to the master one (1923339) https://wikitech.wikimedia.org/wiki/Etcd [10:22:57] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1482 is CRITICAL: etcd last index (1923336) is outdated compared to the master one (1923339) https://wikitech.wikimedia.org/wiki/Etcd [10:22:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 
(re)pooling @ 25%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47826 and previous config saved to /var/cache/conftool/dbconfig/20230508-102258-root.json [10:23:11] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Hadn't considered doing it for the whole range rather than individual VIP but that should make it simpler across the different LBs." [puppet] - 10https://gerrit.wikimedia.org/r/916528 (https://phabricator.wikimedia.org/T336071) (owner: 10Arturo Borrero Gonzalez) [10:24:21] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1397 is OK: etcd last index (1923342) matches the master one (1923342) https://wikitech.wikimedia.org/wiki/Etcd [10:24:23] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1482 is OK: etcd last index (1923342) matches the master one (1923342) https://wikitech.wikimedia.org/wiki/Etcd [10:24:29] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [10:27:34] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/917293 [10:27:49] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host netflow2003.codfw.wmnet with OS bookworm [10:27:52] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f... 
[10:27:54] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:28:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2005.codfw.wmnet with OS bookworm [10:28:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: add support for return traffic to public VIPs if using BGP [puppet] - 10https://gerrit.wikimedia.org/r/916528 (https://phabricator.wikimedia.org/T336071) (owner: 10Arturo Borrero Gonzalez) [10:31:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet [10:33:51] PROBLEM - Host cloudlb2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [10:34:45] RECOVERY - Host cloudlb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [10:35:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet [10:35:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P47827 and previous config saved to /var/cache/conftool/dbconfig/20230508-103501-ladsgroup.json [10:35:53] PROBLEM - Check systemd state on cloudlb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P47828 and previous config saved to /var/cache/conftool/dbconfig/20230508-103634-ladsgroup.json [10:36:37] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:36:41] PROBLEM - Host 
cloudlb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [10:36:41] PROBLEM - Host cloudlb2003-dev is DOWN: PING CRITICAL - Packet loss = 100% [10:37:01] RECOVERY - Check systemd state on cloudlb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:11] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:45] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47829 and previous config saved to /var/cache/conftool/dbconfig/20230508-103745-root.json [10:37:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47830 and previous config saved to /var/cache/conftool/dbconfig/20230508-103747-root.json [10:37:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47831 and previous config saved to /var/cache/conftool/dbconfig/20230508-103754-root.json [10:37:54] (JobUnavailable) firing: (3) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:37:57] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [10:38:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: Repooling after reboot', diff saved to 
https://phabricator.wikimedia.org/P47832 and previous config saved to /var/cache/conftool/dbconfig/20230508-103802-root.json [10:38:17] RECOVERY - Host cloudlb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 31.70 ms [10:38:27] RECOVERY - Host cloudlb2003-dev is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms [10:39:41] (03PS14) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [10:41:01] (03CR) 10Effie Mouzeli: [C: 03+1] Enable parser cache warming jobs for parsoid on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [10:41:11] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [10:42:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [10:42:30] (03CR) 10Hnowlan: [C: 03+2] service: move device-analytics to production [puppet] - 10https://gerrit.wikimedia.org/r/899608 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [10:43:16] (03PS4) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) [10:43:29] (03CR) 10TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [10:43:31] (JobUnavailable) firing: (3) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:43:53] (03CR) 
10Muehlenhoff: "Two comments inline" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [10:44:19] (03Merged) 10jenkins-bot: Enable parser cache warming jobs for parsoid on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [10:44:45] !log daniel@deploy1002 Started scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] [10:44:49] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [10:44:51] !log daniel@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release' returned non-zero exit status 1. (duration: 00m 05s) [10:45:00] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Swap DNS entries for gitlab [dns] - 10https://gerrit.wikimedia.org/r/912972 (https://phabricator.wikimedia.org/T335504) (owner: 10EoghanGaffney) [10:45:52] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:45:55] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T320967) [10:45:56] (03CR) 10Jbond: "lgtm but some minor nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/916514 (owner: 10Hashar) [10:45:58] T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967 [10:47:03] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T320967) [10:47:39] !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors 
[10:47:43] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [10:48:26] RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [10:50:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47833 and previous config saved to /var/cache/conftool/dbconfig/20230508-105007-ladsgroup.json [10:50:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [10:50:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [10:50:27] !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache gitlab.wikimedia.org on all recursors [10:50:30] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gitlab.wikimedia.org on all recursors [10:50:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T335845)', diff saved to https://phabricator.wikimedia.org/P47834 and previous config saved to /var/cache/conftool/dbconfig/20230508-105032-ladsgroup.json [10:50:38] (03PS6) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [10:50:40] (03PS1) 10Arturo Borrero Gonzalez: wikimedia.cloud: refresh cloud-private vlan subdomain [dns] - 10https://gerrit.wikimedia.org/r/917296 (https://phabricator.wikimedia.org/T335759) [10:50:42] !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache gitlab-replica.wikimedia.org on all recursors [10:50:45] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gitlab-replica.wikimedia.org on all recursors 
[10:51:02] !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [10:51:06] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [10:51:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [10:51:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T335845)', diff saved to https://phabricator.wikimedia.org/P47835 and previous config saved to /var/cache/conftool/dbconfig/20230508-105141-ladsgroup.json [10:51:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [10:52:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [10:52:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T335845)', diff saved to https://phabricator.wikimedia.org/P47836 and previous config saved to /var/cache/conftool/dbconfig/20230508-105215-ladsgroup.json [10:52:23] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host testvm2005.codfw.wmnet with OS bookworm [10:52:37] (03PS1) 10Arturo Borrero Gonzalez: cloud-private: refresh domain [puppet] - 10https://gerrit.wikimedia.org/r/917297 (https://phabricator.wikimedia.org/T335759) [10:52:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47837 and previous config saved to /var/cache/conftool/dbconfig/20230508-105250-root.json [10:52:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47838 and previous config saved to 
/var/cache/conftool/dbconfig/20230508-105252-root.json [10:52:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47839 and previous config saved to /var/cache/conftool/dbconfig/20230508-105258-root.json [10:53:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2005.codfw.wmnet with OS bookworm [10:53:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47840 and previous config saved to /var/cache/conftool/dbconfig/20230508-105307-root.json [10:53:08] (03PS2) 10Arturo Borrero Gonzalez: wikimedia.cloud: refresh cloud-private vlan subdomain [dns] - 10https://gerrit.wikimedia.org/r/917296 (https://phabricator.wikimedia.org/T335759) [10:53:10] (03PS7) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [10:54:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimedia.cloud: refresh cloud-private vlan subdomain [dns] - 10https://gerrit.wikimedia.org/r/917296 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [10:54:55] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T320967) [10:54:58] T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967 [10:55:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [10:55:56] (03PS1) 10Daniel Kinzler: Revert "Enable parser cache warming jobs for parsoid on small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917153 [10:56:06] !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab2002.wikimedia.org to gitlab1004.wikimedia.org [10:56:13] 
!log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T320967) [10:56:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud-private: refresh domain [puppet] - 10https://gerrit.wikimedia.org/r/917297 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [10:57:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T335845)', diff saved to https://phabricator.wikimedia.org/P47841 and previous config saved to /var/cache/conftool/dbconfig/20230508-105753-ladsgroup.json [10:57:54] (JobUnavailable) resolved: (3) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:58:31] (JobUnavailable) firing: (4) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:58:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T335845)', diff saved to https://phabricator.wikimedia.org/P47842 and previous config saved to /var/cache/conftool/dbconfig/20230508-105835-ladsgroup.json [10:59:00] (03PS8) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [10:59:09] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [11:01:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: cookbooks.sre.hosts.reimage should not fail if the first Puppet run failed and if the user was prompted - https://phabricator.wikimedia.org/T334880 
(10Volans) 05Open→03Resolved The above patch has been merged and tested, it now will output:... [11:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:07:54] (JobUnavailable) resolved: (2) Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47843 and previous config saved to /var/cache/conftool/dbconfig/20230508-110755-root.json [11:07:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47844 and previous config saved to /var/cache/conftool/dbconfig/20230508-110756-root.json [11:08:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47845 and previous config saved to /var/cache/conftool/dbconfig/20230508-110803-root.json [11:08:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P47846 and previous config saved to /var/cache/conftool/dbconfig/20230508-110812-root.json [11:09:24] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:09] (03PS1) 10Marostegui: instances.yaml: Remove db1113 (s5,s6) [puppet] - 10https://gerrit.wikimedia.org/r/917301 (https://phabricator.wikimedia.org/T336029) [11:10:48] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1113 (s5,s6) [puppet] - 10https://gerrit.wikimedia.org/r/917301 (https://phabricator.wikimedia.org/T336029) (owner: 10Marostegui) [11:11:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1113 from dbctl T336029', diff saved to https://phabricator.wikimedia.org/P47847 and previous config saved to /var/cache/conftool/dbconfig/20230508-111113-marostegui.json [11:11:17] T336029: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 [11:12:39] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: introduce haproxy check for the BGP VIP [puppet] - 10https://gerrit.wikimedia.org/r/917302 (https://phabricator.wikimedia.org/T324992) [11:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P47848 and previous config saved to /var/cache/conftool/dbconfig/20230508-111259-ladsgroup.json [11:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:13:33] (03PS1) 10Marostegui: db1215: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/917303 (https://phabricator.wikimedia.org/T335014) [11:13:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P47849 and previous config 
saved to /var/cache/conftool/dbconfig/20230508-111342-ladsgroup.json [11:14:24] (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:17:11] (03CR) 10Marostegui: [C: 03+2] db1215: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/917303 (https://phabricator.wikimedia.org/T335014) (owner: 10Marostegui) [11:17:23] (03PS1) 10Hnowlan: Add discovery records for device-analytics [dns] - 10https://gerrit.wikimedia.org/r/917306 (https://phabricator.wikimedia.org/T335505) [11:20:17] !log daniel@deploy1002 Started scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] [11:20:22] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [11:21:41] !log daniel@deploy1002 daniel: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [11:24:10] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gitlab1004), Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:24:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: introduce haproxy check for the BGP VIP [puppet] - 10https://gerrit.wikimedia.org/r/917302 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:27:40] ^ backup alert for gitlab1004 is expected, host was switched and backup will be created later this night [11:28:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P47850 and previous config saved to 
/var/cache/conftool/dbconfig/20230508-112805-ladsgroup.json [11:28:08] (03PS2) 10Marostegui: .bashrc: Change alias location [puppet] - 10https://gerrit.wikimedia.org/r/909324 (https://phabricator.wikimedia.org/T334455) (owner: 10Jcrespo) [11:28:10] (03CR) 10Jbond: "Ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [11:28:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P47851 and previous config saved to /var/cache/conftool/dbconfig/20230508-112848-ladsgroup.json [11:29:37] (03PS1) 10Marostegui: orchestrator: Change database [puppet] - 10https://gerrit.wikimedia.org/r/917313 (https://phabricator.wikimedia.org/T334455) [11:29:53] (03CR) 10Marostegui: [C: 04-2] "Wait for the switchover to happen" [puppet] - 10https://gerrit.wikimedia.org/r/917313 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [11:31:39] (03PS1) 10Marostegui: switchover.py: Replace zarcillo host [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917320 (https://phabricator.wikimedia.org/T334455) [11:31:55] (03CR) 10Marostegui: [C: 04-2] "Wait for the switchover to happen" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917320 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [11:32:53] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host testvm2005.codfw.wmnet with OS bookworm [11:33:08] (03CR) 10CI reject: [V: 04-1] switchover.py: Replace zarcillo host [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917320 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [11:35:01] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/917293 (owner: 10Muehlenhoff) [11:35:43] (03PS1) 10Marostegui: mariadb: Promote db1215 to zarcillo master [puppet] - 10https://gerrit.wikimedia.org/r/917323 
(https://phabricator.wikimedia.org/T335014) [11:35:44] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] (duration: 15m 26s) [11:35:47] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [11:35:51] arturo: ok to merge your "cloudlb: introduce haproxy check for the BGP VIP" change along? [11:36:02] moritzm: sorry, yes [11:36:05] (03CR) 10Marostegui: [C: 04-2] "Wait for the switchover day" [puppet] - 10https://gerrit.wikimedia.org/r/917323 (https://phabricator.wikimedia.org/T335014) (owner: 10Marostegui) [11:36:29] moritzm: not sure what happened [11:36:49] https://www.irccloud.com/pastebin/GJNCd23p/ [11:37:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/917289 (https://phabricator.wikimedia.org/T336009) (owner: 10Effie Mouzeli) [11:37:31] arturo: ack, done :-) [11:37:36] thanks [11:37:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:41:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2005.codfw.wmnet with OS bullseye [11:43:02] (03PS2) 10Jbond: team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) [11:43:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T335845)', diff saved to https://phabricator.wikimedia.org/P47853 and previous config saved to /var/cache/conftool/dbconfig/20230508-114312-ladsgroup.json [11:43:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with 
reason: Maintenance [11:43:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [11:43:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T335845)', diff saved to https://phabricator.wikimedia.org/P47854 and previous config saved to /var/cache/conftool/dbconfig/20230508-114336-ladsgroup.json [11:43:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T335845)', diff saved to https://phabricator.wikimedia.org/P47855 and previous config saved to /var/cache/conftool/dbconfig/20230508-114354-ladsgroup.json [11:43:55] (03PS3) 10Jbond: team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) [11:43:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [11:44:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [11:44:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T335845)', diff saved to https://phabricator.wikimedia.org/P47856 and previous config saved to /var/cache/conftool/dbconfig/20230508-114417-ladsgroup.json [11:50:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T335845)', diff saved to https://phabricator.wikimedia.org/P47857 and previous config saved to /var/cache/conftool/dbconfig/20230508-115036-ladsgroup.json [11:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T335845)', diff saved to https://phabricator.wikimedia.org/P47858 and previous config saved to /var/cache/conftool/dbconfig/20230508-115056-ladsgroup.json [11:51:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2005.codfw.wmnet with 
reason: host reimage [11:54:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [11:59:58] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [12:00:58] (03CR) 10Filippo Giunchedi: sre.hardware.sel: add simple cookbook for querying the SEL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [12:05:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P47859 and previous config saved to /var/cache/conftool/dbconfig/20230508-120542-ladsgroup.json [12:06:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P47860 and previous config saved to /var/cache/conftool/dbconfig/20230508-120602-ladsgroup.json [12:06:03] !log jiji@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2448.codfw.wmnet [12:06:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2005.codfw.wmnet with OS bullseye [12:08:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10jijiki) 05Resolved→03Open Hello, I am afraid `mw2448` was not feeling any better today, so for the time being it is marked again as `inactive`. I am terribly... 
[12:11:05] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:12:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:16:09] (03CR) 10EoghanGaffney: [gitlab/runner] Add basic pool/depool commands (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/913199 (owner: 10EoghanGaffney) [12:20:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P47861 and previous config saved to /var/cache/conftool/dbconfig/20230508-122048-ladsgroup.json [12:21:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P47862 and previous config saved to /var/cache/conftool/dbconfig/20230508-122108-ladsgroup.json [12:22:48] (03PS1) 10Arturo Borrero Gonzalez: haproxy: check_haproxy: don't use a tempfile [puppet] - 10https://gerrit.wikimedia.org/r/917326 [12:22:50] (03PS1) 10Arturo Borrero Gonzalez: haproxy: check_haproxy: remove unused variables [puppet] - 10https://gerrit.wikimedia.org/r/917327 [12:22:52] (03PS1) 10Arturo Borrero Gonzalez: haproxy: check_haproxy: prevent globbing and word splitting [puppet] - 10https://gerrit.wikimedia.org/r/917328 [12:22:54] (03PS1) 10Arturo Borrero Gonzalez: haproxy: check_haproxy: introduce new check mode --check=someup [puppet] - 10https://gerrit.wikimedia.org/r/917329 (https://phabricator.wikimedia.org/T324992) [12:23:13] (03CR) 10CI reject: [V: 04-1] haproxy: check_haproxy: don't use a tempfile [puppet] - 10https://gerrit.wikimedia.org/r/917326 (owner: 10Arturo Borrero Gonzalez) [12:23:54] (03PS1) 10Ladsgroup: Fixes to CI 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) [12:25:26] (03CR) 10CI reject: [V: 04-1] Fixes to CI [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) (owner: 10Ladsgroup) [12:26:03] (03PS2) 10Arturo Borrero Gonzalez: haproxy: check_haproxy: don't use a tempfile [puppet] - 10https://gerrit.wikimedia.org/r/917326 [12:26:05] (03PS2) 10Arturo Borrero Gonzalez: haproxy: check_haproxy: remove unused variables [puppet] - 10https://gerrit.wikimedia.org/r/917327 [12:26:07] (03PS2) 10Arturo Borrero Gonzalez: haproxy: check_haproxy: prevent globbing and word splitting [puppet] - 10https://gerrit.wikimedia.org/r/917328 [12:26:09] (03PS2) 10Arturo Borrero Gonzalez: haproxy: check_haproxy: introduce new check mode --check=someup [puppet] - 10https://gerrit.wikimedia.org/r/917329 (https://phabricator.wikimedia.org/T324992) [12:26:43] (03PS2) 10Ladsgroup: Fixes to CI [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) [12:28:37] (03CR) 10CI reject: [V: 04-1] Fixes to CI [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) (owner: 10Ladsgroup) [12:28:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [12:30:34] (03PS3) 10Ladsgroup: Fixes to CI [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) [12:32:07] (03CR) 10CI reject: [V: 04-1] Fixes to CI [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) (owner: 10Ladsgroup) [12:32:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [12:35:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P47863 and previous config saved to /var/cache/conftool/dbconfig/20230508-123554-ladsgroup.json [12:35:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [12:36:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [12:36:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T335845)', diff saved to https://phabricator.wikimedia.org/P47864 and previous config saved to /var/cache/conftool/dbconfig/20230508-123614-ladsgroup.json [12:36:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [12:36:21] (03CR) 10David Caro: [C: 03+2] toolforge: add tekton metrics to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/915771 (https://phabricator.wikimedia.org/T325163) (owner: 10Raymond Ndibe) [12:36:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T335845)', diff saved to https://phabricator.wikimedia.org/P47865 and previous config saved to /var/cache/conftool/dbconfig/20230508-123624-ladsgroup.json [12:36:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [12:36:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:36:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:36:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T335845)', diff saved to https://phabricator.wikimedia.org/P47866 and previous config saved to /var/cache/conftool/dbconfig/20230508-123654-ladsgroup.json [12:37:01] (03CR) 10EoghanGaffney: [gitlab/failover] Rename 
host flags (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/911951 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [12:38:57] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt with reason: cloudsw1-b1-codfw OS upgrade [12:39:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt with reason: cloudsw1-b1-codfw OS upgrade [12:39:21] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=744d6bf2-4472-4a4c-b0a2-ebf0e4e9d466) set by cmooney@cu... [12:39:44] (03PS4) 10Ladsgroup: Drop wmfmariadbpy/cli_admin/osc_host.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) [12:40:53] !log rebooting cloudsw1-b1-codfw for OS upgrade T333316 [12:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:57] T333316: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 [12:41:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [12:42:12] (03CR) 10Marostegui: [C: 03+1] Drop wmfmariadbpy/cli_admin/osc_host.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) (owner: 10Ladsgroup) [12:42:24] (03PS5) 10Ladsgroup: Drop wmfmariadbpy/cli_admin/osc_host.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) [12:44:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P47867 and previous config saved to /var/cache/conftool/dbconfig/20230508-124414-ladsgroup.json [12:44:29] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 139, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:44:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T335845)', diff saved to https://phabricator.wikimedia.org/P47868 and previous config saved to /var/cache/conftool/dbconfig/20230508-124452-ladsgroup.json [12:45:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [12:45:02] !log installing openvswitch security updates [12:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:15] (03CR) 10Ladsgroup: [C: 03+2] Drop wmfmariadbpy/cli_admin/osc_host.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) (owner: 10Ladsgroup) [12:48:51] (03Merged) 10jenkins-bot: Drop wmfmariadbpy/cli_admin/osc_host.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917330 (https://phabricator.wikimedia.org/T336166) (owner: 10Ladsgroup) [12:50:00] (03CR) 10Ladsgroup: "recheck" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917320 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [12:51:10] !log installing python-django security updates on stretch [12:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:52] ^^^ cr2-codfw router alert is due to reboot of cloudsw1-b1, expected [12:53:51] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:55:19] !log cmooney@cumin1001 START - Cookbook 
sre.hosts.remove-downtime for cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt [12:55:20] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt [12:56:25] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1215 to zarcillo master [puppet] - 10https://gerrit.wikimedia.org/r/917323 (https://phabricator.wikimedia.org/T335014) (owner: 10Marostegui) [12:56:48] (03PS12) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [12:56:53] (03CR) 10Ladsgroup: [C: 03+1] orchestrator: Change database [puppet] - 10https://gerrit.wikimedia.org/r/917313 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [12:56:55] !log installing ruby-rack security updates [12:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P47869 and previous config saved to /var/cache/conftool/dbconfig/20230508-125920-ladsgroup.json [12:59:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P47870 and previous config saved to /var/cache/conftool/dbconfig/20230508-125958-ladsgroup.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1300). nyaa~ [13:00:04] No Gerrit patches in the queue for this window AFAICS. 
[13:09:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [13:14:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P47871 and previous config saved to /var/cache/conftool/dbconfig/20230508-131426-ladsgroup.json [13:14:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [13:15:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P47872 and previous config saved to /var/cache/conftool/dbconfig/20230508-131504-ladsgroup.json [13:16:46] (03CR) 10Jelto: [C: 03+1] "looks good now thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/913199 (owner: 10EoghanGaffney) [13:16:56] 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) Upgraded to 22.2R3.15, which is now the recommended version for this platform, hoping it might make some difference, but the issue pers... [13:20:08] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Join ARIN waiting list to request additional IPv4 resources. - https://phabricator.wikimedia.org/T288342 (10cmooney) 05Open→03Declined I'm going to close this task for now. We should have sufficient IPs from the RIPE waiting list fr... 
[13:21:46] (03PS4) 10Jbond: team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) [13:22:08] (03CR) 10Jbond: team-sre/puppet-agent: Add alertmanager based check for disabled puppet (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [13:22:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet [13:22:33] (03PS5) 10Jbond: team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) [13:23:16] (03PS6) 10Jbond: team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) [13:24:39] (03CR) 10CI reject: [V: 04-1] team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [13:27:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet [13:27:55] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41078/console" [puppet] - 10https://gerrit.wikimedia.org/r/916516 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [13:29:22] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/916516 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [13:29:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T335845)', diff saved to https://phabricator.wikimedia.org/P47873 and previous config saved to /var/cache/conftool/dbconfig/20230508-132932-ladsgroup.json [13:29:38] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [13:29:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [13:29:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T335845)', diff saved to https://phabricator.wikimedia.org/P47874 and previous config saved to /var/cache/conftool/dbconfig/20230508-132957-ladsgroup.json [13:30:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T335845)', diff saved to https://phabricator.wikimedia.org/P47875 and previous config saved to /var/cache/conftool/dbconfig/20230508-133011-ladsgroup.json [13:30:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [13:30:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [13:30:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T335845)', diff saved to https://phabricator.wikimedia.org/P47876 and previous config saved to /var/cache/conftool/dbconfig/20230508-133034-ladsgroup.json [13:35:07] (03PS1) 10Muehlenhoff: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 [13:37:09] (03CR) 10CDanis: Set DoProbe cookie to initiate a probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:37:14] (03CR) 10CI reject: [V: 04-1] Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [13:37:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T335845)', diff saved to https://phabricator.wikimedia.org/P47877 and previous config 
saved to /var/cache/conftool/dbconfig/20230508-133718-ladsgroup.json [13:39:05] jouncebot: nowandnext [13:39:05] For the next 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1300) [13:39:05] In 1 hour(s) and 50 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1530) [13:40:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T335845)', diff saved to https://phabricator.wikimedia.org/P47878 and previous config saved to /var/cache/conftool/dbconfig/20230508-134002-ladsgroup.json [13:40:44] (03PS2) 10Muehlenhoff: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 [13:42:47] (03PS6) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [13:42:54] (03CR) 10CI reject: [V: 04-1] Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [13:43:16] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41079/console" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [13:43:39] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:44:08] (03CR) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:44:38] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 
(https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:44:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2002.codfw.wmnet [13:46:27] (03PS3) 10Muehlenhoff: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 [13:47:23] 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) Looking further at the logs I honed in on this message: ` Mar 28 09:28:53 cloudsw1-b1-codfw sshd[11344]: subsystem request for netconf... [13:48:37] (03PS1) 10KartikMistry: Update cxserver to 2023-05-08-134152-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/917338 (https://phabricator.wikimedia.org/T336115) [13:48:46] 10SRE, 10Wikidata, 10wdwb-tech, 10Patch-For-Review, and 4 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (10Michael) a:05Michael→03HasanAkgun_WMDE @HasanAkgun_WMDE is managing the deployment tomorrow. 
[13:50:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2002.codfw.wmnet [13:50:55] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) [13:51:27] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) [13:51:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [13:52:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P47879 and previous config saved to /var/cache/conftool/dbconfig/20230508-135224-ladsgroup.json [13:55:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P47880 and previous config saved to /var/cache/conftool/dbconfig/20230508-135508-ladsgroup.json [13:55:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [13:55:51] (03PS11) 10Jbond: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) [13:55:53] (03CR) 10Jelto: [V: 03+1] "one question about cas args in line. Otherwise change and diff looks good as far as I can tell." 
[puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [13:57:55] (03CR) 10CI reject: [V: 04-1] sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [13:58:40] (03CR) 10JMeybohm: scaffold: add support for periodic jobs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 (owner: 10Giuseppe Lavagetto) [13:58:56] (03CR) 10Ottomata: [C: 03+2] Add mediawiki.page_outlink_topic_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915789 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:59:48] (03Merged) 10jenkins-bot: Add mediawiki.page_outlink_topic_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915789 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:00:51] (03PS6) 10Ottomata: Install flink operator in wikikube staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) [14:02:15] (03PS5) 10JMeybohm: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 (owner: 10Giuseppe Lavagetto) [14:02:17] (03PS1) 10JMeybohm: CI: Diff scaffold changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/917339 [14:02:19] (03PS1) 10Hnowlan: rest-gateway: don't append when setting headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/917340 (https://phabricator.wikimedia.org/T329074) [14:04:07] (03PS7) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) [14:04:09] (03PS7) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [14:05:40] (03CR) 10CI 
reject: [V: 04-1] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:06:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:41] (03Abandoned) 10Effie Mouzeli: Revert "Enable parser cache warming jobs for parsoid on small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917153 (owner: 10Daniel Kinzler) [14:07:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P47881 and previous config saved to /var/cache/conftool/dbconfig/20230508-140731-ladsgroup.json [14:08:31] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: wgEventStreams - Add mediawiki.page_outlink_topic_prediction_change stream - T328899 (duration: 06m 54s) [14:08:35] T328899: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 [14:08:40] (03PS1) 10Ssingh: sites.yaml: add new dns host dns2004 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/917341 (https://phabricator.wikimedia.org/T326688) [14:08:43] (03PS8) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [14:09:00] (03PS1) 10Ssingh: dns2004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/917342 (https://phabricator.wikimedia.org/T326688) [14:09:24] (03CR) 10Ottomata: [C: 03+2] "Deployed!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/915789 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:09:45] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts an-airflow1001.eqiad.wmnet [14:10:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P47882 and previous config saved to /var/cache/conftool/dbconfig/20230508-141014-ladsgroup.json [14:11:34] (03CR) 10CI reject: [V: 04-1] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:11:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:55] (03PS2) 10Ssingh: dns2004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/917342 (https://phabricator.wikimedia.org/T326688) [14:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:13:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [14:14:21] (03PS1) 10Bking: airflow: decommission an-airflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/917343 (https://phabricator.wikimedia.org/T333697) [14:14:37] (03CR) 10Effie Mouzeli: [C: 03+2] data.yaml: add fjoseph to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/917289 (https://phabricator.wikimedia.org/T336009) (owner: 10Effie 
Mouzeli) [14:14:39] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:15:31] (03CR) 10Ssingh: [C: 03+2] dns2004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/917342 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh) [14:16:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2004.wikimedia.org with OS bullseye [14:16:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2004.wikimedia.org with OS bullseye [14:18:02] (03PS12) 10Jbond: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) [14:19:17] (03CR) 10Jbond: "thanks, update" [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [14:20:14] (03CR) 10CI reject: [V: 04-1] sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [14:20:33] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf and Turnilo for Fjoseph - https://phabricator.wikimedia.org/T336009 (10jijiki) 05Open→03Resolved a:03jijiki [14:21:26] (03CR) 10JMeybohm: [C: 03+1] scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 (owner: 10Giuseppe Lavagetto) [14:22:02] (03PS8) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) [14:22:04] (03PS9) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [14:22:06] (03PS1) 
10Andrew Bogott: Rearrange py2/py3 versions of mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/917345 [14:22:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T335845)', diff saved to https://phabricator.wikimedia.org/P47883 and previous config saved to /var/cache/conftool/dbconfig/20230508-142237-ladsgroup.json [14:22:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [14:22:47] (03PS1) 10Ssingh: varnish: bump size of varnish shared memory log to 160M (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/917346 (https://phabricator.wikimedia.org/T253093) [14:22:54] (03CR) 10CI reject: [V: 04-1] Rearrange py2/py3 versions of mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/917345 (owner: 10Andrew Bogott) [14:22:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [14:23:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T335845)', diff saved to https://phabricator.wikimedia.org/P47884 and previous config saved to /var/cache/conftool/dbconfig/20230508-142302-ladsgroup.json [14:24:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T335845)', diff saved to https://phabricator.wikimedia.org/P47885 and previous config saved to /var/cache/conftool/dbconfig/20230508-142427-ladsgroup.json [14:24:32] (03PS2) 10Andrew Bogott: Rearrange py2/py3 versions of mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/917345 [14:24:34] (03PS9) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) [14:24:36] (03PS10) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 
(https://phabricator.wikimedia.org/T330759) [14:24:55] (03CR) 10JMeybohm: CI: Diff scaffold changes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917339 (owner: 10JMeybohm) [14:25:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T335845)', diff saved to https://phabricator.wikimedia.org/P47886 and previous config saved to /var/cache/conftool/dbconfig/20230508-142520-ladsgroup.json [14:25:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [14:25:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [14:25:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T335845)', diff saved to https://phabricator.wikimedia.org/P47887 and previous config saved to /var/cache/conftool/dbconfig/20230508-142543-ladsgroup.json [14:27:12] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41081/console" [puppet] - 10https://gerrit.wikimedia.org/r/917346 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:27:27] (03CR) 10CI reject: [V: 04-1] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:29:07] (03PS11) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [14:30:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T335845)', diff saved to https://phabricator.wikimedia.org/P47888 and previous config saved to /var/cache/conftool/dbconfig/20230508-143038-ladsgroup.json [14:31:12] (03CR) 10Vgutierrez: [C: 03+1] varnish: bump size of varnish shared memory log to 160M (ulsfo) [puppet] - 
10https://gerrit.wikimedia.org/r/917346 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:31:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2004.wikimedia.org with reason: host reimage [14:31:36] (03CR) 10CI reject: [V: 04-1] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:32:38] (03CR) 10Jobo: [V: 03+1] Add analytics_product admin group for airflow [puppet] - 10https://gerrit.wikimedia.org/r/914788 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [14:32:48] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish: bump size of varnish shared memory log to 160M (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/917346 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:34:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T335845)', diff saved to https://phabricator.wikimedia.org/P47889 and previous config saved to /var/cache/conftool/dbconfig/20230508-143410-ladsgroup.json [14:34:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2004.wikimedia.org with reason: host reimage [14:36:42] (03PS3) 10EoghanGaffney: [gitlab/failover] Add rollback method [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 [14:37:03] (03CR) 10EoghanGaffney: [gitlab/failover] Add rollback method (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney) [14:37:05] !log sudo cumin -b1 -s1200 'A:cp and A:ulsfo' 'varnish-frontend-restart': T253093 [14:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:09] T253093: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 [14:38:26] jouncebot nowandnext [14:38:27] No deployments scheduled for the next 0 hour(s) and 51 minute(s) [14:38:27] 
In 0 hour(s) and 51 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1530) [14:38:56] (CR) CI reject: [V: -1] [gitlab/failover] Add rollback method [cookbooks] - https://gerrit.wikimedia.org/r/914748 (owner: EoghanGaffney) [14:40:02] !log train 1.41.0-wmf.7 (T330213): proceeding to all wikis [14:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:06] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [14:40:25] (PS4) EoghanGaffney: [gitlab/failover] Add rollback method [cookbooks] - https://gerrit.wikimedia.org/r/914748 [14:40:29] (CR) Jelto: [C: +1] "lgtm, renaming current and new primary makes sense!" [cookbooks] - https://gerrit.wikimedia.org/r/911951 (https://phabricator.wikimedia.org/T330771) (owner: EoghanGaffney) [14:41:41] PROBLEM - Recursive DNS on 208.80.153.48 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:41:56] ^ expected, provisioning [14:42:45] RECOVERY - Recursive DNS on 208.80.153.48 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:43:38] (PS1) TrainBranchBot: group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/917350 (https://phabricator.wikimedia.org/T330214) [14:43:40] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/917350 (https://phabricator.wikimedia.org/T330214) (owner: TrainBranchBot) [14:44:36] (Merged) jenkins-bot: group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/917350 (https://phabricator.wikimedia.org/T330214) (owner: TrainBranchBot) [14:45:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P47890 and previous config saved to
/var/cache/conftool/dbconfig/20230508-144544-ladsgroup.json [14:49:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P47891 and previous config saved to /var/cache/conftool/dbconfig/20230508-144916-ladsgroup.json [14:50:47] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:51:01] (PS1) DLynch: Update a/b test code for visual enhancements a/b test [extensions/DiscussionTools] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/917160 (https://phabricator.wikimedia.org/T333715) [14:51:43] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.7 refs T330214 [14:51:46] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 [14:52:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:52:53] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T335722 (Papaul) Open→Resolved Same server again so resolving [14:53:19] RECOVERY - Host mw2448 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:54:01] PROBLEM - Check systemd state on mw2448 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:37] SRE, ops-codfw, DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (Papaul) @Marostegui hello do you have time to take this server down so we can work on the firmware upgrade?
Thanks [14:55:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2004.wikimedia.org with OS bullseye [14:55:36] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2004.wikimedia.org with OS bullseye completed: - dns2004 (**WARN**)... [14:56:49] RECOVERY - Check systemd state on mw2448 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for dns2004.wikimedia.org [14:57:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2004.wikimedia.org [14:57:32] (PS12) Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [14:57:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:57:44] SRE, SRE-Access-Requests, Infrastructure Security, Infrastructure-Foundations, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (MoritzMuehlenhoff) >>! In T334154#8832825, @jijiki wrote: > Dear #Inf... [14:59:15] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (Papaul) @Marostegui sorry just getting backup with you on this in the main time we can power the server down and swap DIMM A7 with DIMM B7 and see if we se...
[14:59:36] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336082 (Papaul) a:Jhancock.wm [15:00:29] (CR) CI reject: [V: -1] grid_configurator: use mwopenstackclients library [puppet] - https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [15:00:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P47892 and previous config saved to /var/cache/conftool/dbconfig/20230508-150050-ladsgroup.json [15:03:26] (PS13) Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [15:04:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P47893 and previous config saved to /var/cache/conftool/dbconfig/20230508-150423-ladsgroup.json [15:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:06:16] (CR) CI reject: [V: -1] grid_configurator: use mwopenstackclients library [puppet] - https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [15:07:09] (PS13) Jbond: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) [15:08:23] (CR) Ssingh: [C: +2] sites.yaml: add new dns host dns2004 (codfw hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/917341 (https://phabricator.wikimedia.org/T326688) (owner: Ssingh) [15:09:52]
!log homer "cr*-codfw*" commit "Gerrit: 917341 add new DNS host dns2004" [15:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:06] SRE, ops-codfw, DC-Ops, serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Jhancock.wm) The recommended fix for this one (according to Dell) is a reboot and see if the error comes back. I've done a full power cycle. Right now there's n... [15:12:47] !log [done] homer "cr*-codfw*" commit "Gerrit: 917341 add new DNS host dns2004": T326688 [15:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:50] T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 [15:13:04] (PS1) Jbond: admin: updatre permisdsions for fr-tech-admins [puppet] - https://gerrit.wikimedia.org/r/917358 (https://phabricator.wikimedia.org/T334154) [15:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:15:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T335845)', diff saved to https://phabricator.wikimedia.org/P47894 and previous config saved to /var/cache/conftool/dbconfig/20230508-151556-ladsgroup.json [15:16:46] SRE, SRE-Access-Requests, Infrastructure Security, Infrastructure-Foundations, and 2 others: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (jbond) >>! In T334154#8763604, @Dzahn wrote: > ` > 'ALL = (ALL)...
[15:17:24] jouncebot: next [15:17:24] In 0 hour(s) and 12 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1530) [15:19:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T335845)', diff saved to https://phabricator.wikimedia.org/P47895 and previous config saved to /var/cache/conftool/dbconfig/20230508-151929-ladsgroup.json [15:19:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [15:19:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [15:19:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1209 (T335845)', diff saved to https://phabricator.wikimedia.org/P47896 and previous config saved to /var/cache/conftool/dbconfig/20230508-151952-ladsgroup.json [15:20:14] (PS2) Jbond: admin: updatre permisdsions for fr-tech-admins [puppet] - https://gerrit.wikimedia.org/r/917358 (https://phabricator.wikimedia.org/T334154) [15:22:22] (PS1) Muehlenhoff: Failover the kadminserver to krb2002 [puppet] - https://gerrit.wikimedia.org/r/917359 (https://phabricator.wikimedia.org/T331695) [15:22:31] PROBLEM - mediawiki-installation DSH group on mw2448 is CRITICAL: Host mw2448 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:22:48] (CR) CI reject: [V: -1] Failover the kadminserver to krb2002 [puppet] - https://gerrit.wikimedia.org/r/917359 (https://phabricator.wikimedia.org/T331695) (owner: Muehlenhoff) [15:23:31] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff) [15:24:54] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396
(jcrespo) Let me do it @Papaul. I should be the first point of contact for this ticket. [15:25:34] !log installing grep updates from Bullseye 11.7 point release [15:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:42] (PS2) Muehlenhoff: Failover the kadminserver to krb2002 [puppet] - https://gerrit.wikimedia.org/r/917359 (https://phabricator.wikimedia.org/T331695) [15:27:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T335845)', diff saved to https://phabricator.wikimedia.org/P47897 and previous config saved to /var/cache/conftool/dbconfig/20230508-152716-ladsgroup.json [15:27:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T335845)', diff saved to https://phabricator.wikimedia.org/P47898 and previous config saved to /var/cache/conftool/dbconfig/20230508-152716-ladsgroup.json [15:27:54] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (jcrespo) Server should be down now. [15:28:41] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336082 (Jhancock.wm) Open→Resolved The input power for power supply 1 has been restored. Mon 08 May 2023 15:24:56 The power supplies are redundant. Mon 08 May 2023 15:24:56 Power supply redundancy... [15:30:04] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1530).
[15:30:18] (PS1) Jbond: cookbooks: sre.pki.restart-reboot [cookbooks] - https://gerrit.wikimedia.org/r/917360 [15:32:10] !log ns1: remove dns2001, add dns2004 next-hop [ 208.80.153.48 208.80.153.111 208.80.153.10 ]: T335777 [15:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:14] T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 [15:32:53] (CR) Ottomata: [C: +2] Install flink operator in wikikube staging-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata) [15:33:22] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff) [15:34:47] SRE, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (LSobanski) [15:35:21] (Merged) jenkins-bot: Install flink operator in wikikube staging-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata) [15:35:37] SRE, ops-codfw, DBA, Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (Jhancock.wm) The additional troubleshooting Dell wants us to do is swap DIMM and see if the error travels. that's already been done and the error has not come back. Can we try putting it under...
[15:36:15] (CR) Muehlenhoff: cookbooks: sre.pki.restart-reboot (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/917360 (owner: Jbond) [15:36:44] (PS2) Jbond: cookbooks: sre.pki.restart-reboot [cookbooks] - https://gerrit.wikimedia.org/r/917360 [15:36:50] (PS3) Majavah: toolforge: wmcs-package-build: support .git suffix in URLs [puppet] - https://gerrit.wikimedia.org/r/916787 [15:36:52] (PS1) Majavah: toolforge: wmcs-package-build: support backports and -tools packages [puppet] - https://gerrit.wikimedia.org/r/917361 [15:37:22] (CR) CI reject: [V: -1] toolforge: wmcs-package-build: support backports and -tools packages [puppet] - https://gerrit.wikimedia.org/r/917361 (owner: Majavah) [15:37:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:39:15] (PS1) Muehlenhoff: Add partman config for ldap-rw* hosts [puppet] - https://gerrit.wikimedia.org/r/917363 (https://phabricator.wikimedia.org/T331699) [15:39:52] !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:40:23] PROBLEM - Host pki-root1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:03] RECOVERY - Host pki-root1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:41:39] !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:42:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P47899 and previous config saved to /var/cache/conftool/dbconfig/20230508-154222-ladsgroup.json [15:42:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P47900 and previous config saved to /var/cache/conftool/dbconfig/20230508-154222-ladsgroup.json [15:43:17] (PS3) Jbond: cookbooks: sre.pki.restart-reboot [cookbooks] - https://gerrit.wikimedia.org/r/917360 [15:46:03] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:46:51] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/917363 (https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff) [15:47:01] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:47:48] (PS1) Ssingh: sites.yaml: remove dns2001 from anycast_neighbors (host decom) [homer/public] - https://gerrit.wikimedia.org/r/917364 (https://phabricator.wikimedia.org/T335777) [15:49:00] (PS4) Jbond: cookbooks: sre.pki.restart-reboot [cookbooks] - https://gerrit.wikimedia.org/r/917360 [15:49:05] (CR) Jbond: "updated thanks" [cookbooks] - https://gerrit.wikimedia.org/r/917360 (owner: Jbond) [15:50:50] SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (MoritzMuehlenhoff) The current VMs are quite overdimensioned in terms of CPU core: I'd go with 4G RAM, 4 CPUs and 20G disk space instead for ldap-rw1001/2001 [15:52:29] (PS1) Ssingh: hiera: decommission dns2001 [puppet] - https://gerrit.wikimedia.org/r/917365 (https://phabricator.wikimedia.org/T335777) [15:53:56] (PS1) Volans: spicerack: refactor IRC logging [software/spicerack] - https://gerrit.wikimedia.org/r/917366 [15:53:58] (PS1)
Volans: doc: do not load UI fix when building the manpage [software/spicerack] - https://gerrit.wikimedia.org/r/917367 [15:54:55] (Abandoned) Volans: spicerack: refactor IRC logging [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/917290 (owner: Volans) [15:55:42] (CR) Muehlenhoff: [C: +1] "Looks good" [cookbooks] - https://gerrit.wikimedia.org/r/917360 (owner: Jbond) [15:55:48] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (Jhancock.wm) @jcrespo I swapped DIMM A7 with DIMM B6. (Their server's DIMM is asymmetrical for some reason. There's no b7 so I used that instead). It's bee... [15:55:54] (CR) Muehlenhoff: [C: +2] Add partman config for ldap-rw* hosts [puppet] - https://gerrit.wikimedia.org/r/917363 (https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff) [15:56:23] (PS2) Majavah: toolforge: wmcs-package-build: support backports and -tools packages [puppet] - https://gerrit.wikimedia.org/r/917361 [15:56:49] (CR) CI reject: [V: -1] toolforge: wmcs-package-build: support backports and -tools packages [puppet] - https://gerrit.wikimedia.org/r/917361 (owner: Majavah) [15:57:09] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (jcrespo) Can I put it back to work and test it's memory under regular usage?
[15:57:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P47901 and previous config saved to /var/cache/conftool/dbconfig/20230508-155728-ladsgroup.json [15:57:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P47902 and previous config saved to /var/cache/conftool/dbconfig/20230508-155729-ladsgroup.json [15:57:31] (PS3) Majavah: toolforge: wmcs-package-build: support backports and -tools packages [puppet] - https://gerrit.wikimedia.org/r/917361 [15:57:57] (CR) Volans: "Original code review at https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/917290/" [software/spicerack] - https://gerrit.wikimedia.org/r/917366 (owner: Volans) [15:58:01] (CR) CI reject: [V: -1] doc: do not load UI fix when building the manpage [software/spicerack] - https://gerrit.wikimedia.org/r/917367 (owner: Volans) [15:59:17] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (Jhancock.wm) Yes. Let me know if any errors pop up. Thanks! [16:00:05] sukhe: OwO what's this, a deployment window?? LVS maintenance. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1600).
nyaa~ [16:00:56] not exactly but thanks jouncebot [16:01:03] (PS14) Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [16:02:32] !log bking@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw [16:03:04] jouncebot: next [16:03:04] In 0 hour(s) and 56 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1700) [16:03:04] In 0 hour(s) and 56 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1700) [16:03:49] (CR) CI reject: [V: -1] grid_configurator: use mwopenstackclients library [puppet] - https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [16:03:56] (PS1) Ssingh: Revert "Revert "lvs2011: commission new LVS host (codfw hardware refresh)"" [puppet] - https://gerrit.wikimedia.org/r/917161 [16:07:01] SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (jhathaway) >>! In T331699#8833877, @MoritzMuehlenhoff wrote: > The current VMs are quite overdimensioned in terms of CPU core: I'd go with 4G RAM, 4 CPUs... [16:07:36] (PS2) Volans: doc: do not load UI fix when building the manpage [software/spicerack] - https://gerrit.wikimedia.org/r/917367 [16:09:47] (CR) Volans: [C: +2] "Merging as this was reviewed in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/917290/ (against the wrong branch).
No c" [software/spicerack] - https://gerrit.wikimedia.org/r/917366 (owner: Volans) [16:09:57] (PS1) Cathal Mooney: Add policy for cloudsw BGP peering to cloudlb and other cloud servers [homer/public] - https://gerrit.wikimedia.org/r/917369 (https://phabricator.wikimedia.org/T324992) [16:10:02] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (jcrespo) Expected post messages: ` Message PR1: Replaced part detected for device: DDR4 DIMM(DIMM A7). Message PR1: Replaced part detected for device: DDR4... [16:11:39] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 [16:11:42] (CR) Muehlenhoff: [C: +1] "Looks good" [software/spicerack] - https://gerrit.wikimedia.org/r/917367 (owner: Volans) [16:11:43] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [16:12:22] (CR) Volans: [C: +2] doc: do not load UI fix when building the manpage [software/spicerack] - https://gerrit.wikimedia.org/r/917367 (owner: Volans) [16:12:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T335845)', diff saved to https://phabricator.wikimedia.org/P47903 and previous config saved to /var/cache/conftool/dbconfig/20230508-161234-ladsgroup.json [16:12:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T335845)', diff saved to https://phabricator.wikimedia.org/P47904 and previous config saved to /var/cache/conftool/dbconfig/20230508-161235-ladsgroup.json [16:12:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [16:12:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [16:12:52] !log ladsgroup@cumin1001 END (PASS) -
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [16:12:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1211 (T335845)', diff saved to https://phabricator.wikimedia.org/P47905 and previous config saved to /var/cache/conftool/dbconfig/20230508-161258-ladsgroup.json [16:13:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [16:13:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T335845)', diff saved to https://phabricator.wikimedia.org/P47906 and previous config saved to /var/cache/conftool/dbconfig/20230508-161313-ladsgroup.json [16:13:28] (CR) Ssingh: [C: +2] Revert "Revert "lvs2011: commission new LVS host (codfw hardware refresh)"" [puppet] - https://gerrit.wikimedia.org/r/917161 (owner: Ssingh) [16:13:30] (Merged) jenkins-bot: spicerack: refactor IRC logging [software/spicerack] - https://gerrit.wikimedia.org/r/917366 (owner: Volans) [16:14:08] (CR) Cathal Mooney: [C: +2] Add policy for cloudsw BGP peering to cloudlb and other cloud servers [homer/public] - https://gerrit.wikimedia.org/r/917369 (https://phabricator.wikimedia.org/T324992) (owner: Cathal Mooney) [16:14:38] (PS2) Hnowlan: thumbor: haproxy timeout changes, block /metrics [deployment-charts] - https://gerrit.wikimedia.org/r/916506 (https://phabricator.wikimedia.org/T334488) [16:14:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:14:52] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:15:17] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update -
https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff) [16:16:18] (Merged) jenkins-bot: doc: do not load UI fix when building the manpage [software/spicerack] - https://gerrit.wikimedia.org/r/917367 (owner: Volans) [16:16:20] (Merged) jenkins-bot: Add policy for cloudsw BGP peering to cloudlb and other cloud servers [homer/public] - https://gerrit.wikimedia.org/r/917369 (https://phabricator.wikimedia.org/T324992) (owner: Cathal Mooney) [16:18:18] (PS1) BBlack: timesyncd ntp servers: add 3rd core dns node [puppet] - https://gerrit.wikimedia.org/r/917371 [16:18:59] (Abandoned) Dzahn: wdqs/wcqs: change discovery name of backends for GUIs [puppet] - https://gerrit.wikimedia.org/r/915737 (owner: Dzahn) [16:20:00] (CR) Dzahn: [C: +2] trafficserver: change name of the miscweb backend to new discovery name [puppet] - https://gerrit.wikimedia.org/r/914881 (owner: Dzahn) [16:20:07] (PS4) Dzahn: trafficserver: change name of the miscweb backend to new discovery name [puppet] - https://gerrit.wikimedia.org/r/914881 [16:20:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T335845)', diff saved to https://phabricator.wikimedia.org/P47907 and previous config saved to /var/cache/conftool/dbconfig/20230508-162024-ladsgroup.json [16:20:26] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2011.codfw.wmnet with OS bullseye [16:20:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:21:10] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w...
[16:21:22] (PS1) Volans: CHANGELOG: add changelogs for release v7.0.0 [software/spicerack] - https://gerrit.wikimedia.org/r/917372 [16:21:23] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:22:30] (PS1) Ottomata: admin_ng/flink-operator - set default rbac.create: false [deployment-charts] - https://gerrit.wikimedia.org/r/917373 (https://phabricator.wikimedia.org/T333464) [16:22:52] (CR) Ssingh: timesyncd ntp servers: add 3rd core dns node (2 comments) [puppet] - https://gerrit.wikimedia.org/r/917371 (owner: BBlack) [16:23:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T335845)', diff saved to https://phabricator.wikimedia.org/P47908 and previous config saved to /var/cache/conftool/dbconfig/20230508-162309-ladsgroup.json [16:24:20] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v7.0.0 [software/spicerack] - https://gerrit.wikimedia.org/r/917372 (owner: Volans) [16:26:08] (CR) JMeybohm: [C: +1] admin_ng/flink-operator - set default rbac.create: false [deployment-charts] - https://gerrit.wikimedia.org/r/917373 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata) [16:26:10] (PS2) Ottomata: admin_ng/flink-operator - set default rbac.create: false [deployment-charts] - https://gerrit.wikimedia.org/r/917373 (https://phabricator.wikimedia.org/T333464) [16:27:47] (PS3) Ottomata: admin_ng/flink-operator - set default rbac.create: false [deployment-charts] - https://gerrit.wikimedia.org/r/917373 (https://phabricator.wikimedia.org/T333464) [16:28:22] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v7.0.0 [software/spicerack] - https://gerrit.wikimedia.org/r/917372 (owner: Volans) [16:30:20]
(PS1) Volans: Upstream release v7.0.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/917374 [16:30:57] SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (CBogen) [16:31:20] (CR) Ottomata: [C: +2] admin_ng/flink-operator - set default rbac.create: false [deployment-charts] - https://gerrit.wikimedia.org/r/917373 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata) [16:32:41] !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:32:51] !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:33:12] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:33:18] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:33:28] SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Eevans) After discussions with serviceops about use of Persistent Volume Clai... [16:35:27] SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata) Okay thank you!
[16:35:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P47909 and previous config saved to /var/cache/conftool/dbconfig/20230508-163530-ladsgroup.json [16:36:11] (03CR) 10Jbond: [C: 03+2] cookbooks: sre.pki.restart-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/917360 (owner: 10Jbond) [16:37:04] (03CR) 10Volans: [C: 03+2] Upstream release v7.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/917374 (owner: 10Volans) [16:38:02] (03CR) 10BBlack: [C: 03+2] "I'm avoiding adding any new 2004 references in this commit, as it's a retro-fixup from when we renamed+re-roled the authdns* boxes." [puppet] - 10https://gerrit.wikimedia.org/r/917371 (owner: 10BBlack) [16:38:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P47910 and previous config saved to /var/cache/conftool/dbconfig/20230508-163816-ladsgroup.json [16:38:39] (03Merged) 10jenkins-bot: cookbooks: sre.pki.restart-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/917360 (owner: 10Jbond) [16:38:44] 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) Double checking the only config that seems to be needed to allow Homer to commit is: ` system { services { netconf {... [16:38:44] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2011.codfw.wmnet with OS bullseye [16:38:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w... 
[16:39:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:39:04] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:39:09] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [16:39:10] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:39:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w... 
[16:39:28] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:39:44] RECOVERY - Check systemd state on aux-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:47] (03CR) 10Ebernhardson: "perhaps in another patch, but we should also remove all the airflow 1 puppet code" [puppet] - 10https://gerrit.wikimedia.org/r/917343 (https://phabricator.wikimedia.org/T333697) (owner: 10Bking) [16:39:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:40:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:40:52] (03PS1) 10Jelto: trafficserver: switch non-critical miscweb sites back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/917376 (https://phabricator.wikimedia.org/T335797) [16:40:54] (03Merged) 10jenkins-bot: Upstream release v7.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/917374 (owner: 10Volans) [16:41:28] (03CR) 10Dzahn: [C: 03+1] trafficserver: switch non-critical miscweb sites back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/917376 (https://phabricator.wikimedia.org/T335797) (owner: 10Jelto) [16:42:37] (03PS15) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [16:43:01] (03CR) 10Jelto: [C: 03+2] trafficserver: switch non-critical miscweb sites back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/917376 
(https://phabricator.wikimedia.org/T335797) (owner: 10Jelto) [16:45:19] (03CR) 10CI reject: [V: 04-1] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:46:34] !log uploaded spicerack_7.0.0 to apt.wikimedia.org bullseye-wikimedia [16:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:20] (03PS16) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [16:50:06] (03CR) 10CI reject: [V: 04-1] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:50:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P47912 and previous config saved to /var/cache/conftool/dbconfig/20230508-165036-ladsgroup.json [16:53:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P47913 and previous config saved to /var/cache/conftool/dbconfig/20230508-165322-ladsgroup.json [16:55:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [16:58:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1700) [17:00:05] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T1700). 
[17:02:37] RECOVERY - Check whether ferm is active by checking the default input chain on aux-k8s-ctrl1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:05:31] (03PS1) 10Dzahn: miscweb: switch design.wikimedia.org back from codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/917379 (https://phabricator.wikimedia.org/T335797) [17:05:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T335845)', diff saved to https://phabricator.wikimedia.org/P47914 and previous config saved to /var/cache/conftool/dbconfig/20230508-170542-ladsgroup.json [17:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T335845)', diff saved to https://phabricator.wikimedia.org/P47915 and previous config saved to /var/cache/conftool/dbconfig/20230508-170828-ladsgroup.json [17:08:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [17:08:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [17:09:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1214 (T335845)', diff saved to https://phabricator.wikimedia.org/P47916 and previous config saved to /var/cache/conftool/dbconfig/20230508-170902-ladsgroup.json [17:16:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2011.codfw.wmnet with OS bullseye [17:17:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye completed:... 
[17:17:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T335845)', diff saved to https://phabricator.wikimedia.org/P47917 and previous config saved to /var/cache/conftool/dbconfig/20230508-171720-ladsgroup.json [17:18:43] (03PS17) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [17:27:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:27:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudswift1001.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:28:08] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudswift1002.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:50] !log installed spicerack 7.0.0 on cumin2002 [17:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:09] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v7.0.0 [17:29:24] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v7.0.0 [17:31:07] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1132.eqiad.wmnet with OS buster [17:31:16] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors: - an-wo... 
[17:31:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:31:20] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [17:31:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:31:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:31:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:31:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T335845)', diff saved to https://phabricator.wikimedia.org/P47918 and previous config saved to /var/cache/conftool/dbconfig/20230508-173152-ladsgroup.json [17:32:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P47919 and previous config saved to /var/cache/conftool/dbconfig/20230508-173226-ladsgroup.json [17:33:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs2011.codfw.wmnet [17:33:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2011.codfw.wmnet [17:35:04] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2011 [17:35:11] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2011 [17:36:32] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: an-airflow1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [17:36:32] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:33] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-airflow1001.eqiad.wmnet [17:37:00] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new LVS host lvs2011 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/914871 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [17:38:02] !log installed spicerack 7.0.0 on cumin1001 [17:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:05] (03CR) 10Volans: [C: 03+2] IRC logging: renamed irc_logger to sal_logger [cookbooks] - 10https://gerrit.wikimedia.org/r/917291 (owner: 10Volans) [17:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T335845)', diff saved to https://phabricator.wikimedia.org/P47920 and previous config saved to /var/cache/conftool/dbconfig/20230508-173808-ladsgroup.json [17:39:06] !log homer "cr*-codfw*" commit "Gerrit: 914871 add new LVS host lvs2011": T326767 [17:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:10] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [17:40:28] (03PS2) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/914343 (https://phabricator.wikimedia.org/T335777) [17:43:34] (03PS1) 10Ssingh: hiera: remove BGP MED override for lvs2011 [puppet] - 10https://gerrit.wikimedia.org/r/917386 (https://phabricator.wikimedia.org/T326767) [17:45:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Dzahn) >>! 
In T334429#8833688, @Jhancock.wm wrote: > The recommended fix for this one (according to Dell) is a reboot and see if the error comes back. For the r... [17:46:06] (03CR) 10Ssingh: [C: 03+2] hiera: remove BGP MED override for lvs2011 [puppet] - 10https://gerrit.wikimedia.org/r/917386 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [17:47:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P47922 and previous config saved to /var/cache/conftool/dbconfig/20230508-174732-ladsgroup.json [17:48:15] !log restart pybal on lvs2011 to pick up bgp med change: T326767 [17:48:16] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging2001.codfw.wmnet [17:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:19] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [17:49:12] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:51:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:51:44] !log set routing-options static route 208.80.153.224/28 [high-traffic1, codfw] next-hop 10.192.0.29: T326767 [17:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P47923 and 
previous config saved to /var/cache/conftool/dbconfig/20230508-175314-ladsgroup.json [17:54:48] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-logging2001.codfw.wmnet [17:55:13] (03CR) 10Andrew Bogott: [C: 03+2] Rearrange py2/py3 versions of mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/917345 (owner: 10Andrew Bogott) [17:56:10] (03PS1) 10Ottomata: flink-operator - enable HA leader election and set replicas: 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/917389 (https://phabricator.wikimedia.org/T336185) [17:56:43] jouncebot: next [17:56:43] In 2 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T2000) [17:56:54] (03PS10) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) [17:56:56] (03PS18) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [17:56:58] (03PS1) 10Andrew Bogott: codfw1dev: enforce scope and default policies [puppet] - 10https://gerrit.wikimedia.org/r/917390 (https://phabricator.wikimedia.org/T330759) [17:57:22] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging2002.codfw.wmnet [17:57:30] (03PS2) 10Ottomata: flink-operator - enable HA leader election and set replicas: 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/917389 (https://phabricator.wikimedia.org/T336185) [17:59:21] volans@cumin2002 test message [18:01:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:02:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T335845)', diff saved to https://phabricator.wikimedia.org/P47925 and previous config saved to /var/cache/conftool/dbconfig/20230508-180239-ladsgroup.json [18:02:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:02:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:03:07] (03CR) 10Ottomata: [C: 03+2] flink-operator - enable HA leader election and set replicas: 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/917389 (https://phabricator.wikimedia.org/T336185) (owner: 10Ottomata) [18:03:49] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-logging2002.codfw.wmnet [18:04:43] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 (duration: 113m 03s) [18:04:46] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [18:05:34] (03Merged) 10jenkins-bot: flink-operator - enable HA leader election and set replicas: 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/917389 (https://phabricator.wikimedia.org/T336185) (owner: 10Ottomata) [18:06:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:07:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [18:07:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [18:08:12] (03CR) 10Dzahn: [C: 03+2] "To double check this, I: determined the directory that scap deploys to (it's symlinked to the apache docroot). -> /srv/deployment/design/" [puppet] - 10https://gerrit.wikimedia.org/r/917379 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn) [18:08:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P47926 and previous config saved to /var/cache/conftool/dbconfig/20230508-180820-ladsgroup.json [18:08:31] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:09:04] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: enforce scope and default policies [puppet] - 10https://gerrit.wikimedia.org/r/917390 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:09:08] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:13:06] (03CR) 10Herron: [C: 03+1] "Nice one this will be useful, LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [18:13:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [18:18:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (7) wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:23:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T335845)', diff saved to https://phabricator.wikimedia.org/P47927 and previous config saved to /var/cache/conftool/dbconfig/20230508-182327-ladsgroup.json [18:23:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (14) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:23:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [18:23:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [18:23:47] oh, I'm on clinic duty, found out from topic again [18:23:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T335845)', diff saved to https://phabricator.wikimedia.org/P47928 and previous config saved to /var/cache/conftool/dbconfig/20230508-182350-ladsgroup.json [18:24:55] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/917358 (https://phabricator.wikimedia.org/T334154) (owner: 10Jbond) [18:25:23] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, and 2 others: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dzahn) thanks John, patch looks good to me, +1ed [18:28:20] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) @darthmon_wmde What jijiki said please, but meanwhile all 3 of you can already contact @KFrancis. Please send her your email address... [18:28:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (15) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:29:49] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Dzahn) @lojo Please send an email to @KFrancis (https://meta.wikimedia.org/wiki/User:KFrancis_(WMF)) to continue with the NDA signing process. 
[18:30:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T335845)', diff saved to https://phabricator.wikimedia.org/P47929 and previous config saved to /var/cache/conftool/dbconfig/20230508-183048-ladsgroup.json [18:35:31] (03PS1) 10Dzahn: switch webserver-misc-sites from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/917393 (https://phabricator.wikimedia.org/T335797) [18:36:26] (03PS1) 10Jdlrobson: Ensure page load popupNotification is closed when the toggle button is clicked [skins/Vector] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/917162 (https://phabricator.wikimedia.org/T335153) [18:41:23] (03CR) 10Krinkle: "> https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/63533/console : FAILURE" [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [18:45:23] (03PS2) 10Majavah: toolserver_legacy: Remove exim4 service [puppet] - 10https://gerrit.wikimedia.org/r/916877 (https://phabricator.wikimedia.org/T136225) [18:45:25] (03PS1) 10Majavah: P:wmcs::toolserver_legacy: convert icinga checks to blackbox probes [puppet] - 10https://gerrit.wikimedia.org/r/917394 (https://phabricator.wikimedia.org/T94022) [18:45:55] (03CR) 10CI reject: [V: 04-1] P:wmcs::toolserver_legacy: convert icinga checks to blackbox probes [puppet] - 10https://gerrit.wikimedia.org/r/917394 (https://phabricator.wikimedia.org/T94022) (owner: 10Majavah) [18:45:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P47930 and previous config saved to /var/cache/conftool/dbconfig/20230508-184554-ladsgroup.json [18:47:18] (03PS2) 10Majavah: P:wmcs::toolserver_legacy: convert icinga checks to blackbox probes [puppet] - 10https://gerrit.wikimedia.org/r/917394 (https://phabricator.wikimedia.org/T94022) [18:48:17] (03PS1) 10Jdlrobson: Deploy fixed width indicator to wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/917395 (https://phabricator.wikimedia.org/T335307) [18:48:24] (03CR) 10CI reject: [V: 04-1] Deploy fixed width indicator to wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917395 (https://phabricator.wikimedia.org/T335307) (owner: 10Jdlrobson) [18:48:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (15) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:52:11] (03PS11) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) [18:52:13] (03PS19) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [18:52:15] (03PS1) 10Andrew Bogott: neutron policy.yaml: don't override network_device policy [puppet] - 10https://gerrit.wikimedia.org/r/917396 (https://phabricator.wikimedia.org/T330759) [18:52:53] (03PS20) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [18:53:14] (03CR) 10Andrew Bogott: [C: 03+2] neutron policy.yaml: don't override network_device policy [puppet] - 10https://gerrit.wikimedia.org/r/917396 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:53:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (15) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:53:59] (03CR) 10Cathal Mooney: "Updated to use count_ipaddresses now." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [18:56:08] (03CR) 10Dzahn: "@RyanKemper: we are switching back to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/917393 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn) [18:56:52] (03PS2) 10Ryan Kemper: switch webserver-misc-sites from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/917393 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn) [18:57:05] (03CR) 10Ryan Kemper: [C: 03+1] switch webserver-misc-sites from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/917393 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn) [18:58:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (15) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:59:59] (03CR) 10Dzahn: [C: 03+2] "thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/917393 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn) [19:00:11] (03PS3) 10Dzahn: switch webserver-misc-sites from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/917393 (https://phabricator.wikimedia.org/T335797) [19:01:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P47931 and previous config saved to /var/cache/conftool/dbconfig/20230508-190100-ladsgroup.json [19:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:12:07] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 19 hosts with reason: rebooting to help with lag [19:12:21] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 19 hosts with reason: rebooting to help with lag [19:16:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T335845)', diff saved to https://phabricator.wikimedia.org/P47932 and previous config saved to /var/cache/conftool/dbconfig/20230508-191607-ladsgroup.json [19:16:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [19:16:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [19:16:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T335845)', diff saved to https://phabricator.wikimedia.org/P47933 and previous config saved to /var/cache/conftool/dbconfig/20230508-191630-ladsgroup.json [19:17:36] (03PS1) 10BCornwall: pybal: 
Switch esams LVS to use Maglev scheduler [puppet] - https://gerrit.wikimedia.org/r/917399 (https://phabricator.wikimedia.org/T263797)
[19:18:05] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs1004.eqiad.wmnet with reason: rebooting to help with lag
[19:18:30] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs1004.eqiad.wmnet with reason: rebooting to help with lag
[19:20:13] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs2006.codfw.wmnet
[19:20:13] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs2006.codfw.wmnet
[19:20:36] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41084/console" [puppet] - https://gerrit.wikimedia.org/r/917399 (https://phabricator.wikimedia.org/T263797) (owner: BCornwall)
[19:20:42] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs2006.codfw.wmnet with reason: rebooting to help with lag
[19:20:56] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs2006.codfw.wmnet with reason: rebooting to help with lag
[19:22:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T335845)', diff saved to https://phabricator.wikimedia.org/P47934 and previous config saved to /var/cache/conftool/dbconfig/20230508-192243-ladsgroup.json
[19:24:35] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging2003.codfw.wmnet
[19:25:34] (CR) JHathaway: [C: +2] ssh: clamp lifetime_remaining_seconds to a value JRuby can accept [puppet] - https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) (owner: JHathaway)
[19:30:22] (PS1) AOkoth: site|install: add vrts1001 insetup [puppet] - https://gerrit.wikimedia.org/r/917400
[19:31:52] !log herron@cumin1001 END (PASS) - Cookbook
sre.hosts.reboot-single (exit_code=0) for host kafka-logging2003.codfw.wmnet
[19:32:20] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging2005.codfw.wmnet
[19:34:57] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2006.codfw.wmnet
[19:37:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:37:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P47935 and previous config saved to /var/cache/conftool/dbconfig/20230508-193750-ladsgroup.json
[19:38:42] (CR) Dzahn: [C: +1] site|install: add vrts1001 insetup [puppet] - https://gerrit.wikimedia.org/r/917400 (owner: AOkoth)
[19:38:45] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-logging2005.codfw.wmnet
[19:39:06] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging2004.codfw.wmnet
[19:41:52] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2006.codfw.wmnet
[19:45:20] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-logging2004.codfw.wmnet
[19:45:43] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2004.codfw.wmnet
[19:46:41] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging1005.eqiad.wmnet
[19:51:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2004.codfw.wmnet
[19:52:08] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2007.codfw.wmnet
[19:52:56] !log ladsgroup@cumin1001
dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P47936 and previous config saved to /var/cache/conftool/dbconfig/20230508-195256-ladsgroup.json
[19:54:02] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-logging1005.eqiad.wmnet
[19:57:55] (CR) AOkoth: [C: +2] site|install: add vrts1001 insetup [puppet] - https://gerrit.wikimedia.org/r/917400 (owner: AOkoth)
[19:59:26] !log aokoth@cumin1001 START - Cookbook sre.ganeti.makevm for new host vrts1001.eqiad.wmnet
[19:59:27] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T2000).
[20:00:04] kemayo and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:06] 👋🏻
[20:00:51] howdy
[20:00:55] I have a backport and a config patch --- there's not much I can do to test the former until the latter is also deployed.
[20:01:11] o/ I can deploy
[20:01:53] Kemayo: is it fine to pull both to mwdebug at the same time?
[20:01:59] Yup
[20:02:05] (CR) Majavah: [C: +2] Update a/b test code for visual enhancements a/b test [extensions/DiscussionTools] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/917160 (https://phabricator.wikimedia.org/T333715) (owner: DLynch)
[20:02:12] (PS2) Majavah: Enable DiscussionTools visual enhancements a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/916903 (https://phabricator.wikimedia.org/T302358) (owner: DLynch)
[20:02:21] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/917160 (https://phabricator.wikimedia.org/T333715) (owner: DLynch)
[20:02:23] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/916903 (https://phabricator.wikimedia.org/T302358) (owner: DLynch)
[20:03:05] (Merged) jenkins-bot: Enable DiscussionTools visual enhancements a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/916903 (https://phabricator.wikimedia.org/T302358) (owner: DLynch)
[20:03:14] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts1001.eqiad.wmnet - aokoth@cumin1001"
[20:04:17] !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts1001.eqiad.wmnet - aokoth@cumin1001"
[20:04:17] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:04:17] !log aokoth@cumin1001 START - Cookbook sre.dns.wipe-cache vrts1001.eqiad.wmnet on all recursors
[20:04:20] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) vrts1001.eqiad.wmnet on all recursors
[20:04:45] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera
data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM vrts1001.eqiad.wmnet - aokoth@cumin1001"
[20:05:31] Jdlrobson: does your backport depend on the config change or vice versa?
[20:05:42] backport needs to happen first
[20:05:47] !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM vrts1001.eqiad.wmnet - aokoth@cumin1001"
[20:05:47] !log aokoth@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host vrts1001.eqiad.wmnet
[20:05:49] config can't go out until the backport has happened
[20:05:59] ack. going to +2 it now to save time later
[20:06:22] (CR) Majavah: [C: +2] Ensure page load popupNotification is closed when the toggle button is clicked [skins/Vector] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/917162 (https://phabricator.wikimedia.org/T335153) (owner: Jdlrobson)
[20:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T335845)', diff saved to https://phabricator.wikimedia.org/P47937 and previous config saved to /var/cache/conftool/dbconfig/20230508-200802-ladsgroup.json
[20:08:03] (Merged) jenkins-bot: Update a/b test code for visual enhancements a/b test [extensions/DiscussionTools] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/917160 (https://phabricator.wikimedia.org/T333715) (owner: DLynch)
[20:08:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[20:08:19] !log taavi@deploy1002 Started scap: Backport for [[gerrit:917160|Update a/b test code for visual enhancements a/b test (T333715)]], [[gerrit:916903|Enable DiscussionTools visual enhancements a/b test (T302358)]]
[20:08:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[20:08:24] T302358: [A/B Test] Run
an A/B test to evaluate impact of Usability Improvements - https://phabricator.wikimedia.org/T302358
[20:08:24] T333715: Implement bucketing for Usability Improvements A/B test - https://phabricator.wikimedia.org/T333715
[20:08:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T335845)', diff saved to https://phabricator.wikimedia.org/P47938 and previous config saved to /var/cache/conftool/dbconfig/20230508-200825-ladsgroup.json
[20:09:49] !log taavi@deploy1002 kemayo and taavi: Backport for [[gerrit:917160|Update a/b test code for visual enhancements a/b test (T333715)]], [[gerrit:916903|Enable DiscussionTools visual enhancements a/b test (T302358)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:09:54] Kemayo: please test
[20:10:08] It'll take me just a sec
[20:11:02] !log aokoth@cumin1001 START - Cookbook sre.ganeti.reimage for host vrts1001.eqiad.wmnet with OS bullseye
[20:13:26] (CR) JHathaway: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[20:13:40] taavi: Okay, looks good
[20:14:24] It took me longer than I'd expected to get an account on one of those wikis with an even userid so the test would actually trigger. I should have prearranged that.
😅
[20:14:35] ok, syncing
[20:14:41] (PS2) Jdlrobson: Deploy fixed width indicator to wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/917395 (https://phabricator.wikimedia.org/T335307)
[20:15:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T335845)', diff saved to https://phabricator.wikimedia.org/P47939 and previous config saved to /var/cache/conftool/dbconfig/20230508-201537-ladsgroup.json
[20:20:13] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:917160|Update a/b test code for visual enhancements a/b test (T333715)]], [[gerrit:916903|Enable DiscussionTools visual enhancements a/b test (T302358)]] (duration: 11m 54s)
[20:20:18] T302358: [A/B Test] Run an A/B test to evaluate impact of Usability Improvements - https://phabricator.wikimedia.org/T302358
[20:20:18] T333715: Implement bucketing for Usability Improvements A/B test - https://phabricator.wikimedia.org/T333715
[20:20:19] Kemayo: yours is live
[20:20:27] taavi: Thanks!
[20:20:32] Jdlrobson: yours is up next, starting from the backport
[20:20:39] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/917162 (https://phabricator.wikimedia.org/T335153) (owner: Jdlrobson)
[20:20:46] sounds good!
[20:21:35] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts1001.eqiad.wmnet with reason: host reimage
[20:21:36] (Merged) jenkins-bot: Ensure page load popupNotification is closed when the toggle button is clicked [skins/Vector] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/917162 (https://phabricator.wikimedia.org/T335153) (owner: Jdlrobson)
[20:21:55] !log taavi@deploy1002 Started scap: Backport for [[gerrit:917162|Ensure page load popupNotification is closed when the toggle button is clicked (T335153)]]
[20:21:59] T335153: popupNotification Fix memory leaks - https://phabricator.wikimedia.org/T335153
[20:22:15] (CR) BryanDavis: "> consensus was that we will make a buildpack based image available" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/916795 (owner: Majavah)
[20:23:11] !log taavi@deploy1002 jdlrobson and taavi: Backport for [[gerrit:917162|Ensure page load popupNotification is closed when the toggle button is clicked (T335153)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[20:23:17] Jdlrobson: please test
[20:23:22] looking
[20:24:28] taavi: yep you can sync that.
[20:24:33] doing that
[20:24:44] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts1001.eqiad.wmnet with reason: host reimage
[20:25:13] (CR) Majavah: [C: +2] Deploy fixed width indicator to wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/917395 (https://phabricator.wikimedia.org/T335307) (owner: Jdlrobson)
[20:25:25] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2007.codfw.wmnet
[20:26:00] (Merged) jenkins-bot: Deploy fixed width indicator to wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/917395 (https://phabricator.wikimedia.org/T335307) (owner: Jdlrobson)
[20:29:53] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:917162|Ensure page load popupNotification is closed when the toggle button is clicked (T335153)]] (duration: 07m 58s)
[20:29:57] T335153: popupNotification Fix memory leaks - https://phabricator.wikimedia.org/T335153
[20:30:31] !log taavi@deploy1002 Started scap: Backport for [[gerrit:917395|Deploy fixed width indicator to wikis (T335307)]]
[20:30:35] T335307: Deploy fixed width indicator to English Wikipedia - https://phabricator.wikimedia.org/T335307
[20:30:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P47941 and previous config saved to /var/cache/conftool/dbconfig/20230508-203043-ladsgroup.json
[20:30:52] (CR) Majavah: Remove -pwb images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/916795 (owner: Majavah)
[20:31:54] !log taavi@deploy1002 jdlrobson and taavi: Backport for [[gerrit:917395|Deploy fixed width indicator to wikis (T335307)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:32:04] Jdlrobson: config patch is now available for testing
[20:32:09] !log herron@cumin1001 START - Cookbook
sre.hosts.reboot-single for host kafka-logging1003.eqiad.wmnet
[20:32:27] looking. Thanks taavi
[20:32:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on miscweb2003.codfw.wmnet with reason: reboot
[20:33:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on miscweb2003.codfw.wmnet with reason: reboot
[20:33:15] taavi: looks great
[20:33:26] syncing
[20:36:48] !log aokoth@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host vrts1001.eqiad.wmnet with OS bullseye
[20:38:54] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:917395|Deploy fixed width indicator to wikis (T335307)]] (duration: 08m 22s)
[20:38:58] T335307: Deploy fixed width indicator to English Wikipedia - https://phabricator.wikimedia.org/T335307
[20:39:00] ok, all done
[20:39:33] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-logging1003.eqiad.wmnet
[20:41:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on miscweb2003.codfw.wmnet with reason: reboot
[20:41:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on miscweb2003.codfw.wmnet with reason: reboot
[20:41:52] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging1004.eqiad.wmnet
[20:42:33] (CR) BryanDavis: [C: +1] Remove -pwb images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/916795 (owner: Majavah)
[20:43:10] !log miscweb2003 - rebooting
[20:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P47942 and previous config saved to /var/cache/conftool/dbconfig/20230508-204549-ladsgroup.json
[20:49:26] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
kafka-logging1004.eqiad.wmnet
[20:50:27] Thanks taavi for your help today!
[20:51:56] (CR) Krinkle: Define dummy pass for passwords::excimer_ui_server (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[20:53:11] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging1002.eqiad.wmnet
[20:56:30] (PS3) Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303)
[20:58:20] (PS4) Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303)
[20:59:50] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-logging1002.eqiad.wmnet
[21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230508T2100).
[21:00:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T335845)', diff saved to https://phabricator.wikimedia.org/P47943 and previous config saved to /var/cache/conftool/dbconfig/20230508-210056-ladsgroup.json
[21:01:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[21:01:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[21:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T335845)', diff saved to https://phabricator.wikimedia.org/P47944 and previous config saved to /var/cache/conftool/dbconfig/20230508-210119-ladsgroup.json
[21:02:18] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-logging1001.eqiad.wmnet
[21:05:46] (PS1) Eevans: ml-cache: upgrade Cassandra to 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/917407 (https://phabricator.wikimedia.org/T335383)
[21:06:24] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/917407 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[21:06:29] (PS7) Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637)
[21:07:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T335845)', diff saved to https://phabricator.wikimedia.org/P47945 and previous config saved to /var/cache/conftool/dbconfig/20230508-210742-ladsgroup.json
[21:08:14] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar)
[21:09:07] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-logging1001.eqiad.wmnet
[21:12:42] (PS1) SBassett: Disable translation
memory on collabwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/916913 (https://phabricator.wikimedia.org/T313241)
[21:13:21] (PS8) Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637)
[21:15:11] security team preparing to do config deploy during the security deploy window
[21:15:38] (CR) Mstyles: [C: +2] Disable translation memory on collabwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/916913 (https://phabricator.wikimedia.org/T313241) (owner: SBassett)
[21:16:24] (Merged) jenkins-bot: Disable translation memory on collabwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/916913 (https://phabricator.wikimedia.org/T313241) (owner: SBassett)
[21:17:32] !log mstyles@deploy1002 Started scap: Backport for [[gerrit:916913|Disable translation memory on collabwiki (T313241)]]
[21:18:53] !log mstyles@deploy1002 mstyles and sbassett: Backport for [[gerrit:916913|Disable translation memory on collabwiki (T313241)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:21:44] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@a6a3ceb]: (no justification provided)
[21:21:54] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@a6a3ceb]: (no justification provided) (duration: 00m 09s)
[21:22:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P47946 and previous config saved to /var/cache/conftool/dbconfig/20230508-212248-ladsgroup.json
[21:24:18] !log mstyles@deploy1002 Finished scap: Backport for [[gerrit:916913|Disable translation memory on collabwiki (T313241)]] (duration: 06m 45s)
[21:24:24] (PS9) Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637)
[21:25:31] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar)
[21:33:28] (PS10) Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637)
[21:34:19] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar)
[21:35:35] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:37:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P47947 and previous config saved to /var/cache/conftool/dbconfig/20230508-213754-ladsgroup.json
[21:40:44] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (KFrancis) Hello all, if you prefer to keep your email address off Phabricator, please send them to kfrancis@wikimedia.org. Thanks!
[21:43:37] (PS11) Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637)
[21:44:15] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar)
[21:46:59] (PS12) Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637)
[21:47:34] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar)
[21:53:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T335845)', diff saved to https://phabricator.wikimedia.org/P47948 and previous config saved to /var/cache/conftool/dbconfig/20230508-215300-ladsgroup.json
[21:53:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[21:53:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[21:53:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T335845)', diff saved to https://phabricator.wikimedia.org/P47949 and previous config saved to /var/cache/conftool/dbconfig/20230508-215323-ladsgroup.json
[22:01:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T335845)', diff saved to https://phabricator.wikimedia.org/P47950 and previous config saved to /var/cache/conftool/dbconfig/20230508-220103-ladsgroup.json
[22:06:57] (PS1) Brennen Bearnes: phabricator: update scap deployment repo to gitlab remote [puppet] - https://gerrit.wikimedia.org/r/917409 (https://phabricator.wikimedia.org/T336210)
[22:07:08] SRE-swift-storage, Observability-Metrics, User-fgiunchedi:
Investigate HW requirements for Thanos frontend - https://phabricator.wikimedia.org/T312201 (lmata) p:Triage→Medium This has been requested in next FY's budget.
[22:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:13:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[22:14:19] (PS13) Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637)
[22:14:44] (CR) CI reject: [V: -1] Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar)
[22:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P47951 and previous config saved to /var/cache/conftool/dbconfig/20230508-221609-ladsgroup.json
[22:16:26] (PS14) Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637)
[22:18:23] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar)
[22:31:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P47952 and previous config saved to /var/cache/conftool/dbconfig/20230508-223115-ladsgroup.json
[22:33:41] (PS2) Brennen Bearnes: phabricator: update scap deployment repo to gitlab remote [puppet] -
https://gerrit.wikimedia.org/r/917409 (https://phabricator.wikimedia.org/T336210)
[22:34:54] (PS1) Cwhite: prometheus: generate swagger targets from service catalog [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620)
[22:36:32] (CR) CI reject: [V: -1] prometheus: generate swagger targets from service catalog [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[22:42:33] (PS2) Cwhite: prometheus: generate swagger targets from service catalog [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620)
[22:44:31] (CR) CI reject: [V: -1] prometheus: generate swagger targets from service catalog [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[22:46:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T335845)', diff saved to https://phabricator.wikimedia.org/P47953 and previous config saved to /var/cache/conftool/dbconfig/20230508-224622-ladsgroup.json
[22:46:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[22:46:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[22:46:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T335845)', diff saved to https://phabricator.wikimedia.org/P47954 and previous config saved to /var/cache/conftool/dbconfig/20230508-224657-ladsgroup.json
[22:50:29] (PS3) Cwhite: prometheus: generate swagger targets from service catalog [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620)
[22:53:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T335845)', diff saved to https://phabricator.wikimedia.org/P47955 and
previous config saved to /var/cache/conftool/dbconfig/20230508-225313-ladsgroup.json
[23:08:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P47956 and previous config saved to /var/cache/conftool/dbconfig/20230508-230819-ladsgroup.json
[23:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:14:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:20:54] (CR) Dzahn: [C: +2] phabricator: update scap deployment repo to gitlab remote [puppet] - https://gerrit.wikimedia.org/r/917409 (https://phabricator.wikimedia.org/T336210) (owner: Brennen Bearnes)
[23:21:08] (CR) Eevans: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/917337 (owner: Muehlenhoff)
[23:23:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P47957 and previous config saved to /var/cache/conftool/dbconfig/20230508-232325-ladsgroup.json
[23:24:33] (PS3) Dzahn: lower TTL for gerrit.wikimedia.org reverse lookups [dns] - https://gerrit.wikimedia.org/r/916637 (https://phabricator.wikimedia.org/T326368)
[23:24:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen
behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:30:08] (CR) Dzahn: [C: +2] lower TTL for gerrit.wikimedia.org reverse lookups [dns] - https://gerrit.wikimedia.org/r/916637 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[23:31:43] (CR) Brennen Bearnes: "Thanks! Looks good:" [puppet] - https://gerrit.wikimedia.org/r/917409 (https://phabricator.wikimedia.org/T336210) (owner: Brennen Bearnes)
[23:37:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:38:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T335845)', diff saved to https://phabricator.wikimedia.org/P47958 and previous config saved to /var/cache/conftool/dbconfig/20230508-233832-ladsgroup.json
[23:49:51] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:55:50] (PS1) Superpes15: [arwikisource] Replace the current logo with an identical HD version [mediawiki-config] - https://gerrit.wikimedia.org/r/917415 (https://phabricator.wikimedia.org/T336193)
[23:56:40] (PS1) Zabe: Start writing to af_actor/afh_actor everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/917416 (https://phabricator.wikimedia.org/T334295)
[23:57:53] (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/917416
(https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[23:58:36] (Merged) jenkins-bot: Start writing to af_actor/afh_actor everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/917416 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[23:58:52] !log zabe@deploy1002 Started scap: Backport for [[gerrit:917416|Start writing to af_actor/afh_actor everywhere (T334295)]]
[23:58:56] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295