[09:13:33] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1050.eqiad.wmnet with O...
[09:17:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[09:58:36] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1050.eqiad.wmnet with OS bu...
[10:11:18] Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Vgutierrez)
[10:11:34] Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Vgutierrez) p:Triage→Medium
[10:11:59] Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Vgutierrez)
[10:21:44] re: the confd-template discussion from yesterday I have https://gerrit.wikimedia.org/r/c/operations/puppet/+/859102 out vgutierrez
[10:33:36] I stared at that one a little bit
[10:33:50] wondering if instead of a File resource it shouldn't be a systemd::tempfile one
[10:35:50] yeah I started with a tmpfile but then realized that the spam is really from tidy/puppet so creating the directory just before is fine
[10:47:04] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[11:01:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1049.eqiad.wmnet with O...
[11:08:21] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[11:18:11] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[11:20:56] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[11:45:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1049.eqiad.wmnet with OS bu...
[11:54:36] Traffic, SRE, serviceops, Patch-For-Review: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) Open→Resolved a:Joe ` vgutierrez@lvs6001:~$ ./liberica etcd --config /home/vgutierrez/config.yaml Using config file: /home/vgutier...
[12:04:53] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[12:25:57] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1048.eqiad.wmnet with O...
[13:07:46] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1048.eqiad.wmnet with OS bu...
[13:28:16] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[13:59:56] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (Aklapper) >>! In T316337#8216814, @jcrespo wrote: > I am waiting for a 1 paragraph from @Vgutierrez to understand what actually happened to varnish...
[14:06:36] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (Vgutierrez) > I am going to do it, but I am waiting for a 1 paragraph from @Vgutierrez to understand what actually happened to varnish (not just th...
[14:13:07] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (jcrespo) Thanks, that is all I needed to understand the context! I will create a draft doc on Wikitech and link it here for review.
[14:15:08] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[14:35:00] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (jcrespo) I am filling in: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues (Still WIP)
[15:41:46] vgutierrez: I was in meeting re: escaping '.', does \\. work ?
[15:50:08] godog: checking
[15:52:58] ack, FWIW you can also run CI locally with 'tox'
[15:54:29] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (Clement_Goubert)
[15:58:28] godog: apparently yep, https://gerrit.wikimedia.org/r/c/operations/alerts/+/858658
[15:59:03] vgutierrez: *nod* a little counter intuitive I guess but 'ok'
[16:00:49] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (Clement_Goubert) Open→In progress
[16:01:05] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert)
[17:07:56] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs4009 are failing - TODO - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=ulsfo%20prometheus/ops&var-server=lvs4009 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[17:20:19] sukhe: ^^
[17:21:08] yeah, see -private
[17:21:14] now fixing, was making lunch
[17:24:53] 10.128.0.9 64600 9 9 0 0 1:03 Establ
[17:27:56] (PyBalBGPUnstable) resolved: (2) PyBal BGP sessions on instance lvs4009 are failing - TODO - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=ulsfo%20prometheus/ops&var-server=lvs4009 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[17:28:19] nice, I can enjoy my rice now!
[17:29:34] ack, thx
[17:38:19] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye
[18:34:10] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**) - Downtimed on Ic...
[18:36:11] brett: ^ check the logs to see what happened? kinda unexpected!
[18:39:01] tried to schedule downtime in Icinga but failed and then it considers all of it failed
[18:39:04] is my guess
[18:39:29] mutante: I think those result in a warning. this one seems more severe :P
[18:42:30] ACK, I don't even see this in /var/log/spicerack/sre/hosts/reimage-extended.log but maybe Brett used the other cumin host
[18:42:46] yeah
[18:43:12] brett: ^ that's where to check
[19:01:49] Traffic, SRE, ops-eqiad: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (wiki_willy) a:Jclark-ctr
[19:19:00] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye
[19:19:51] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs4006.ulsfo.wmnet` - lvs4006.ulsfo.wmnet (**WARN**) - D...
[19:27:15] the cookbook failure above was due to the NIC firmware not being updated, which then fails the d-i: T286722
[19:27:16] T286722: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722
[19:28:13] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**) - Removed from Pu...
[19:52:39] brett: please try again now
[19:53:11] ay ay, cap'n
[19:53:12] I upgraded the firmware manually for now so that we can test it out but we should use the cookbook
[19:53:21] which you can see I tried but failed so for later :)
[19:54:03] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye
[19:58:10] hmm not cool
[19:58:27] it should have installed the firmware at next reboot before proceeding to the d-i
[19:58:50] brett: please cancel and let's try again
[20:00:06] it says one pending job as expected
[20:00:14] Message
[20:00:15] JCP001: Task successfully scheduled.
[20:00:16] weird
[20:05:06] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**) - Removed from Pu...