[06:45:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org [06:50:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org [06:51:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org [06:56:11] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org [07:00:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org [07:05:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org [07:25:18] 10Traffic, 
10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster [08:04:39] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster completed: - cp6002 (**WARN**)... [08:14:49] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster [08:54:42] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster completed: - cp6003 (**WARN**)... [09:09:46] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster [09:49:04] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster completed: - cp6004 (**WARN**)... 
[09:51:14] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster [10:08:56] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ayounsi) @cmooney, I agree with your take on the security aspect. We're not in a typical service provider (ISP)/customer relationship... [10:11:51] /win 14 [10:11:55] almost! [10:29:05] 10Traffic, 10SRE, 10ops-drmrs: Degraded RAID on cp6002 - https://phabricator.wikimedia.org/T295747 (10Peachey88) [10:31:07] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster completed: - cp6005 (**WARN**)... [10:35:48] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) From a certain point of view what we're doing here is validating [[https://wikitech.wikimedia.org/wiki/Cross-Realm_traffic_g... [10:37:54] mmandere, bblack: I noticed that the ganeti servers for drmrs have been installed with insetup/buster. Can we directly install these with bullseye and Ganeti 3? 
given each Ganeti cluster is isolated by itself, this saves us a migration later on, and we've followed that approach before [10:38:29] when ganeti was new in the edges we also directly started with Buster even though the eqiad/codfw clusters were/are on Stretch [11:20:51] moritzm: that's right, we had them set up that way (insetup/buster) in order to test that hardware and connectivity are correctly set up and behaving as expected. Our next step after completing the cp instance test reimaging is to reimage the ganeti servers. I am not sure we were meant to install them as you've suggested above, but I'll let bblack comment more on that once he's available [11:25:33] ack, let's wait for Brandon then :-) [11:32:49] great :) [11:34:11] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster [11:43:11] here they experiment with pretty much all the http proxies we've used / are using :) -> T-Reqs: HTTP Request Smuggling with Differential Fuzzing https://bahruz.me/papers/ccs2021treqs.pdf [11:47:47] 10netops, 10Infrastructure-Foundations, 10SRE: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) a:03cmooney So of course there is a complication. Currently we have a single BGP session between adjacent CR routers, peered over the loopback IPv4 addresses either si... [12:13:39] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster completed: - cp6006 (**WARN**)... 
[12:22:21] 10netops, 10Infrastructure-Foundations, 10SRE-tools: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10Volans) 05Open→03In progress p:05Triage→03Medium [12:24:38] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster [12:31:01] 10netops, 10Analytics, 10Infrastructure-Foundations, 10SRE-tools: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) [12:31:15] 10netops, 10Analytics, 10Infrastructure-Foundations, 10SRE-tools: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) p:05Triage→03Medium [13:04:40] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster completed: - cp6007 (**WARN**)... 
[13:18:26] 10netops, 10Infrastructure-Foundations, 10SRE: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10MoritzMuehlenhoff) [14:05:18] 10netops, 10Cloud-Services, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10Volans) [14:15:01] 10netops, 10Cloud-Services, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: cloudnet VLAN Netbox discrepancies - https://phabricator.wikimedia.org/T295776 (10Volans) [14:18:08] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin2002 for hosts: `rpki2001.codfw.wmnet` - rpki2001.codfw.wmnet (**PAS... [14:19:04] 10netops, 10Analytics, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10elukey) The only recent thing that I recall is T276239, but not for all workers mentioned. I checked quickly the dry-run for... [15:04:35] 10netops, 10Infrastructure-Foundations, 10SRE: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10ayounsi) For the sake of completeness, another option could be to add the fffff: IP to the loopback address, but that would be more of a workaround than a long term solution.... [15:11:55] moritzm: re ganeti - if you think it won't be a major problem, we can certainly go bullseye-first for ganeti6. I've been reading through some of https://wikitech.wikimedia.org/wiki/Ganeti trying to make sense of the manual parts of bringing up a new cluster in general. I imagine going to ganeti3 will change some/all of that? 
[15:12:32] moritzm: the other change in drmrs-ganeti, is we're doing L3 at the ToR switches, so each of the 2x drmrs physical racks is basically like a row in the core DCs (separate vlans). [15:13:03] my understanding (in ganeti2 terms) is that we'd still build one "cluster", but we'd set up node groups called e.g. Rack_B12 and Rack_B13 [15:13:23] (and then do our instance layouts in a way that's still cognizant of failover issues if we lose one rack/group) [15:13:42] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10Volans) [15:13:50] 10netops, 10Cloud-Services, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: cloudnet VLAN Netbox discrepancies - https://phabricator.wikimedia.org/T295776 (10Volans) 05Open→03Resolved a:03Volans After verifying that the changes were all expected and the VLAN bits were actually an artifact of how... [15:15:02] (and for now, we only have 4x ganeti nodes in drmrs, so it's 2 nodes per group/rack. Later, if/when we ever complete the necessary software tasks blocking it, we'll fold dns600[12] hardware into the ganeti cluster and have 3 nodes per group, but that won't be the case initially. [15:15:17] ) [15:15:49] 10netops, 10Infrastructure-Foundations, 10SRE, 10fundraising-tech-ops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Jgreen) @ayounsi I think it would be fine to do the codfw pfw's this year. Please ping me on IRC when you have some time to discuss. [15:16:51] fwiw we also have a cookbook, sre.ganeti.addnode: Add a new node to a Ganeti cluster ;) not sure if mentioned in the docs [15:17:31] do we have an sre.ganeti.createcluster? 
:) [15:17:47] either way, I imagine any automation will need revalidation/patching for ganeti3 [15:18:47] I guess not, I was not involved in either :D [15:19:15] I think m.oritz has already used those for the test ganeti cluster in codfw in the last weeks [15:19:20] ok [15:21:11] I also had some broader design questions about how our ganeti clusters work in failure scenarios [15:21:42] I think the biggest one was about the primary node and the floating cluster service IP [15:21:47] ganeti01.svc.whatever [15:21:51] 10netops, 10Analytics, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) Yes I thought this was a bit odd. I saw there was a bit of re-imaging here: T231067#6891049 but that was before my t... [15:22:09] 10netops, 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) a:03BTullis [15:22:33] it seems to describe that this is a "floating" VIP so that it can be moved to another node, and that basically ganeti-level operations (e.g. migrating service instances around, etc) won't work unless it's working. [15:22:55] but these floating IPs are currently (looking at core sites) allocated from per-row VLANs. [15:23:19] so it seems like if we lose the row containing the floating master IP, there's no easy recourse to continue normal operations on the remaining row groups in the cluster? [15:24:10] (this becomes more acute in the edge sites with L3@ToR - as we have just two physical racks, which are separate vlans/networks and hardware failure domains at the power/switch/etc level) [15:24:56] it makes me wonder if we shouldn't try a different layout there for resiliency in this scenario, like building two separate entire clusters ganeti01 and ganeti02. 
[15:25:18] then we know each can operate independently in case of whole-rack/vlan/switch failure. [15:26:07] (and for our actual service instances, where they have more than one, place one in each cluster - e.g. doh6001 is in the ganeti01 cluster in rack b12, and doh6002 is in the ganeti02 cluster in rack b13 [15:26:11] ) [15:47:05] bblack: sounds good! I wouldn't expect any changes for ganeti 3 wrt the bringup, Ganeti 3 mostly bumped the version since it migrated to Python 3, there's no breaking change that I'm aware of (and even if, we'll just fix it) [15:48:10] the addvm cookbook will need a patch for the new racks, but just a small one liner [15:48:45] let me know when you plan to bring up the new cluster and we can go through the necessary steps together [15:49:43] moritzm: do you have any thoughts about the whole 2 groups vs 2 cluster things and rack failure scenarios? [15:50:17] moritzm: in general, I was planning to start tackling it today and/or tomorrow (the ganeti cluster software setup stuff) [15:50:55] I worry that two clusters in one site will break some puppet/netbox/automation assumptions, but it seems like the better option for resiliency. [15:51:38] today won't work, but I'd be around tomorrow for the setup, then I'll also follow up on the failure mode thing, ok? (need to make a change to the LDAP config with Andrew now) [15:51:52] moritzm: ok sounds good :) [15:52:15] I'll try to get up a little earlier than normal so we have more overlap tomorrow, but no guarantees :) [15:52:53] sounds good, the gist of it doesn't take long and adding additional nodes can follow up piece by piece [16:07:06] 10netops, 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) ==an-worker1104== ====Current interfaces snapshot: {F34750374,width=600} ====Current interfaces: * eno1 - S... 
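The placement rule floated at 15:26 (one replica of each service per cluster/rack) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `service_name`, `place` and `survivors` are hypothetical and not part of any Ganeti or cookbook API, and the failure domains are simply the two cluster names mentioned in the discussion.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence


def service_name(instance: str) -> str:
    """Strip the trailing host number, e.g. 'doh6001' -> 'doh' (naming assumption)."""
    return re.sub(r"\d+$", "", instance)


def place(instances: Sequence[str], domains: Sequence[str]) -> Dict[str, str]:
    """Give each replica of a service its own failure domain, in sorted order.

    Hypothetical helper: raises if a service has more replicas than domains,
    because two replicas would then share a rack/cluster.
    """
    placement: Dict[str, str] = {}
    used: Dict[str, int] = defaultdict(int)
    for inst in sorted(instances):
        svc = service_name(inst)
        if used[svc] >= len(domains):
            raise ValueError(f"{svc}: more replicas than failure domains")
        placement[inst] = domains[used[svc]]
        used[svc] += 1
    return placement


def survivors(placement: Dict[str, str], failed_domain: str) -> List[str]:
    """Instances still running after losing one whole rack/cluster."""
    return sorted(i for i, d in placement.items() if d != failed_domain)


if __name__ == "__main__":
    # ganeti01 (rack b12) and ganeti02 (rack b13), as named in the chat.
    layout = place(["doh6001", "doh6002"], ["ganeti01", "ganeti02"])
    print(layout)
    print(survivors(layout, "ganeti01"))
```

The same rule would apply with node groups inside a single cluster; the sketch says nothing about the floating master IP problem, which is the reason two clusters were suggested.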
[17:38:16] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Papaul) I downgrade Junos on QFX5100 at https://netbox.wikimedia.org/dcim/rack-elevations/ and did a request system zeroize on it . This is the one we will be using to repl... [17:44:43] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-tools, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10Volans) [17:46:20] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) That works for me, thanks, can you send a calendar invite? Note that the link in your comment doesn't point to any specific device. [17:46:29] volans: I did try to fix the legacy-recdns IPs (I gave them DNS names in netbox). let me know if something didn't work out with that and came up in audit! :) [17:48:46] bblack: with the fix merged they didn't come out at all! [17:48:53] so should be all good there [18:00:24] 10netops, 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) So at first glance, this looks like the Netbox script will do the right thing. It will delete and recreate... [18:08:04] 10netops, 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) @BTullis fwiw +1 from my end, thanks for having a look. 
[18:31:47] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye [18:31:50] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6001.drmrs.wmnet with OS bullseye [18:31:52] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6004.drmrs.wmnet with OS bullseye [18:31:56] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye [18:36:24] volans: thought you might appreciate me trying to push on race-condition buttons :) https://phabricator.wikimedia.org/F34750529 [18:36:40] (everything working fine so far as I can tell, this early!) [18:36:51] I launched all 4 at the exact same moment :P [18:51:28] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye executed with errors: - gan... [18:54:06] one interesting bit was the alert1001 icinga runs [18:54:08] root@alert1001:~# ps -ef|grep pup[p] [18:54:08] root 5970 5770 0 18:51 ? 
00:00:00 /bin/bash /usr/local/sbin/run-puppet-agent --quiet --attempts 30 [18:54:11] root 6613 6583 0 18:51 ? 00:00:00 /bin/bash /usr/local/sbin/run-puppet-agent --quiet --attempts 30 [18:54:14] root 20548 5970 29 18:53 ? 00:00:01 /usr/bin/ruby /usr/bin/puppet agent --onetime --no-daemonize --no-splay --show_diff --ignorecache --no-usecacheonfailure [18:54:17] (currently with two still active, there) [18:54:28] one of the four failed outright pretty quickly at that step, so there might be some race/issue there [18:54:44] (and in general, they're all serially waiting to re-execute the same thing, which has already been applied in theory) [18:56:31] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye executed with errors: - gan... [18:57:51] ah now all 4 are done with that stage and it's a little clearer [18:58:15] two of them made it through the alert1001-agent-run gate and moved on to their initial host puppet runs, but two of them failed the alert1001-agent-run step [18:58:31] not exactly sure why yet [19:10:36] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye [19:11:17] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye [19:11:21] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6004.drmrs.wmnet with OS bullseye completed: - ganeti6004 (**... [19:14:27] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6001.drmrs.wmnet with OS bullseye completed: - ganeti6001 (**... [19:15:01] 10Traffic, 10netops, 10Infrastructure-Foundations, 10SRE: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10Majavah) [19:51:38] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye completed: - ganeti6003 (**... [19:51:41] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye completed: - ganeti6002 (**... [20:13:24] bblack: yes, the puppet run on the alert hosts for icinga goes through run-puppet-agent, which times out if it can't get the lock and hence run within some timeout. The earlier run is not guaranteed to have already picked up the changes from all the catalogs, and it is known that it can hit this issue. It's better to splay them at least a couple of minutes apart from each other. [20:14:26] on a side note one of the planned improvements to spicerack is to add distributed locks for entire cookbooks and partial steps allowing to run at most X at the same time... 
in a near future in a nearby shell ;) [20:52:23] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10ops-ulsfo: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH) [21:29:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/739353 [21:29:19] ^ A suggestion re: cronspam from acme_chief-test machine [21:29:33] that we currently get at root@ [23:13:59] 10Traffic, 10SRE: Image requests sending neither "Last-Modified" nor "ETag" HTTP headers. - https://phabricator.wikimedia.org/T295556 (10Ade56facc) OK, I have seen again responses from server Thumbor without headers named in bug title. I have reloaded web page a few times using key F5 in Chrome browser (which...
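On the 20:13 advice to splay reimages a couple of minutes apart rather than launching all four at once: a minimal sketch of a staggered launcher, assuming a caller-supplied `launch` callable. Nothing here is the spicerack or cookbook API; `staggered_schedule`, `run_staggered` and the 120-second default are hypothetical, chosen only to match "a couple of minutes".

```python
import time
from typing import Callable, List, Sequence, Tuple


def staggered_schedule(hosts: Sequence[str], gap_seconds: float = 120.0) -> List[Tuple[str, float]]:
    """Return (host, start offset in seconds) pairs spaced gap_seconds apart."""
    return [(host, index * gap_seconds) for index, host in enumerate(hosts)]


def run_staggered(
    hosts: Sequence[str],
    launch: Callable[[str], None],
    gap_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call launch(host) for each host, waiting gap_seconds between launches.

    `launch` is whatever starts the work for one host (an assumption here);
    `sleep` is injectable so the pacing can be checked without really waiting.
    """
    started: List[str] = []
    for index, host in enumerate(hosts):
        if index > 0:
            sleep(gap_seconds)  # no wait before the first host
        launch(host)
        started.append(host)
    return started


if __name__ == "__main__":
    hosts = [f"ganeti600{n}.drmrs.wmnet" for n in range(1, 5)]
    for host, offset in staggered_schedule(hosts):
        print(f"{host}: start at +{offset:.0f}s")
```

This only spaces out the start times; it does not replace the distributed locks mentioned at 20:14, which would bound concurrency for the shared steps regardless of how the runs are launched.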