[12:00:32] I've tried to reimage 2 ms-be nodes (Dell) in codfw to move them to a new VLAN, and in both cases the systems are failing to DHCP. Is there a known issue here?
[12:01:09] CLIENT MAC ADDR 00 62 0B 74 EA 40 for ms-be2074 and 00 62 0B 75 4A 80 for ms-be2076
[12:02:28] These systems were originally installed in October 2023 (T349839), this is the first reimage since.
[12:02:29] T349839: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839
[12:30:58] phabricator just gave me an error "Unable to establish a connection to any database host (while trying "phabricator_policy"). All masters and replicas are completely unreachable. AphrontConnectionLostQueryException: #2006: MySQL server has gone away This error may occur if your configured MySQL "wait_timeout" or "max_allowed_packet" values are too small. This may also indicate that something used the MySQL "KILL " command to
[12:30:58] kill the connection running the query."
[12:32:06] is this expected?
[12:32:45] it shouldn't be
[12:32:47] let me check
[12:33:27] Something hit the DB
[12:33:45] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-3h&to=now&timezone=utc&var-job=$__all&var-server=db1250&var-port=9104&refresh=1m&viewPanel=panel-2
[12:35:31] It seems the values are back to normal
[12:36:01] Thanks, phab seems happier now
[12:36:10] andre: is there anything similar to "recentchanges" in phabricator? I can inspect the binlogs, but that's going to be messy, so maybe there's a better way to scan for that weird activity
[12:39:30] https://phabricator.wikimedia.org/feed/ ?
[12:40:26] jynus: yeah, I was just checking that
[12:40:36] but nothing stands out there
[12:41:35] Feed is just user-visible activity
[12:42:25] For DB connectivity, not really I'd say. There are generally things like https://phabricator.wikimedia.org/daemon/ or https://phabricator.wikimedia.org/config/cluster/databases/ but they are not helpful at all in this case
[12:43:00] marostegui: I think the only place to see DB issues listed is https://logstash.wikimedia.org/app/dashboards#/view/AWt2XRVF0jm7pOHZjNIV (may be filtered out by default) or the error log on phab1004
[12:43:14] in the past high activity came from repo imports, maybe have a look at Diffusion
[12:43:34] andre: I don't have access to those first two links :(
[12:44:09] it seems to have started around Jan 12 https://grafana.wikimedia.org/goto/6u7xzaSvg?orgId=1
[12:45:02] https://phabricator.wikimedia.org/config/cluster/repositories/ lists only some Diffusion pulling errors, and still not very helpful
[12:45:17] maybe /var/log/phd/daemons.log on phab1004 comes closest? but it's always been quite noisy
[12:46:15] arnaudb: I cannot correlate that increase with a database activity increase. Also, from that graph only throughput seems to increase, but not requests?
[12:47:26] perhaps the database latency is a side effect?
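A rough sketch of what checking the two settings named in that error, and narrowing a binlog window around the spike, could look like. It assumes shell access to the database host; db1250 is taken from the Grafana link above, while the binlog file name and the datetime window are placeholders, not values from this conversation:

    # Check the server-side values named in the Phabricator error
    mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('wait_timeout', 'max_allowed_packet')"

    # Decode only the window around the spike, with row events rendered as readable
    # pseudo-SQL, and rank which tables receive the most writes.
    # Binlog file name and datetime window below are placeholders.
    mysqlbinlog --base64-output=DECODE-ROWS --verbose \
      --start-datetime="2026-01-20 12:00:00" --stop-datetime="2026-01-20 12:40:00" \
      db1250-bin.000123 \
      | grep -E '^### (INSERT INTO|UPDATE|DELETE FROM)' \
      | awk '{print $2, $3, $4}' | sort | uniq -c | sort -rn | head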
I don't see any significant increase activity-wise on the db graphs
[12:48:00] On the database I do see an increase in writes, so that must be coming from somewhere
[12:48:16] I just wanted to check whether it is a legit increase
[12:56:14] fwiw: https://grafana.wikimedia.org/goto/Y8B7iaSDg?orgId=1 all write activity is added up here; there seems to be an increase, but zooming out to 90 days shows that October was more intense
[12:56:16] I don't have a good explanation for that increase
[12:57:25] Maybe then it was just a spike, let's see if it happens again
[12:57:27] Thank you all
[12:57:35] there were a few phabricator updates in between, so that might explain the November rate variation
[12:58:25] marostegui: I'm not sure this is a blip; yesterday we had a similar, yet a bit different, blip: https://wikimedia.slack.com/archives/C05H0JYT85V/p1768906413997449
[12:58:40] (which also resolved itself)
[12:59:07] arnaudb: mmm, then that's a thing indeed
[12:59:29] I didn't find anything obvious yesterday and thought it might be nothing, but I'm doubting that now
[12:59:34] I don't know if we get some of those "normally" or if we are simply seeing more reports?
[13:00:10] we had an alert about http probes being laggy at the same time, so that looks fairly new to me
[13:00:19] we usually don't have those frequently
[13:01:42] I can try to narrow down the writes from that spike I posted and check the binlogs, but I probably won't be able to tell whether it is normal traffic or not unless it is very obvious
[13:03:12] lmk if you see something obvious, I'll try and see if I find something in the logs
[13:03:18] roger, thanks
[13:07:36] I've opened T415189 about the DHCP/PXE failures
[13:07:37] T415189: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189
[13:09:12] arnaudb, andre: there are lots of deletes on `phabricator_cache`.`cache_markupcache`, `phabricator_cache`.`cache_general`, and `repository_statusmessage` WHERE repositoryID = 3533 AND statusType = 'needs-update', and similar inserts like: INSERT INTO `cache_general`
[13:09:17] But I have no idea if this is normal or not
[13:12:25] That doesn't sound suspicious to me, but I'd still have no idea why that's suddenly a higher rate than usual :-/
[13:13:55] there is _something_ that makes some mysql queries slower or fail: `rg mysql -i *error.log | rg -i AH01071 -c` returns `229`, with timestamps starting at 0:05 and running up to now
[13:15:43] maybe not all but a fair number: `rg "#2006: MySQL server has gone away" -c *error.log` → `187` (out of 229)
[13:16:16] I think that's probably because the query dies
[13:16:37] because of the query exec time on mariadb's side?
[13:16:57] Probably times out, because we don't have any query killers there
[13:17:13] And I guess the thread tries to get reused, and that's why the connection gets "MySQL server has gone away"
[13:17:58] some might be very broad queries by crawlers; Phab itself is supposed to time out with "Maximum execution time of 30 seconds exceeded"
[13:18:22] that could explain the httpd volume increase
[13:18:41] andre: are those supposed to be logged to the dashboard you pasted earlier in logstash?
[13:18:47] yes
[13:19:06] I don't see any in the last 3 hours if that's the case
[13:19:25] you may want to re-enable some more filters on that Logstash dashboard, there are a few "DB" ones
[13:29:54] I don't see anything obvious in the error logs, httpd access rate looks steady over time (https://grafana.wikimedia.org/goto/lhSZGaIvg?orgId=1=)
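As a small follow-on to the rg counts above, one way to line those errors up against the Grafana latency blip would be to bucket them per hour. This assumes the matches live in Apache-style error logs (as the AH01071 hits suggest) with timestamps like [Tue Jan 20 13:05:12.123456 2026] at the start of each line:

    # Count "MySQL server has gone away" occurrences per hour so they can be
    # compared against the latency graph; assumes Apache error-log timestamps.
    rg --no-filename "#2006: MySQL server has gone away" *error.log \
      | awk '{print $2, $3, substr($4, 1, 2) ":00"}' \
      | sort | uniq -c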