[12:00:32] I've tried to reimage 2 ms-be nodes (Dell) in codfw to move them to a new VLAN, and in both cases the systems are failing to DHCP. Is there a known issue here?
[12:01:09] CLIENT MAC ADDR 00 62 0B 74 EA 40 for ms-be2074 and 00 62 0B 75 4A 80 for ms-be2076
[12:02:28] These systems were originally installed in October 2023 (T349839), this is the first reimage since.
[12:02:29] T349839: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839
[12:30:58] phabricator just gave me an error "Unable to establish a connection to any database host (while trying "phabricator_policy"). All masters and replicas are completely unreachable. AphrontConnectionLostQueryException: #2006: MySQL server has gone away This error may occur if your configured MySQL "wait_timeout" or "max_allowed_packet" values are too small. This may also indicate that something used the MySQL "KILL " command to
[12:30:58] kill the connection running the query."
[12:32:06] is this expected?
[12:32:45] it shouldn't be
[12:32:47] let me check
[12:33:27] Something hit the DB
[12:33:45] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-3h&to=now&timezone=utc&var-job=$__all&var-server=db1250&var-port=9104&refresh=1m&viewPanel=panel-2
[12:35:31] It seems the values are back to normal
[12:36:01] Thanks, phab seems happier now
[12:36:10] andre: is there anything similar to "recentchanges" in phabricator? I can inspect the binlogs, but that's going to be messy, so maybe there's a better way to scan for that weird activity
[12:39:30] https://phabricator.wikimedia.org/feed/ ?
[12:40:26] jynus: yeah, I was just checking that
[12:40:36] but nothing stands out there
[12:41:35] Feed is just user-visible activity
[12:42:25] For DB connectivity, not really I'd say. There are generally things like https://phabricator.wikimedia.org/daemon/ or https://phabricator.wikimedia.org/config/cluster/databases/ but they are not helpful at all in this case
[12:43:00] marostegui: I think the only place to see DB issues listed is https://logstash.wikimedia.org/app/dashboards#/view/AWt2XRVF0jm7pOHZjNIV (may be filtered out by default) or the error log on phab1004
[12:43:14] in the past high activity came from repo imports, maybe have a look at Diffusion
[12:43:34] andre: I don't have access to those first two links :(
[12:44:09] it seems to have started around Jan 12 https://grafana.wikimedia.org/goto/6u7xzaSvg?orgId=1
[12:45:02] https://phabricator.wikimedia.org/config/cluster/repositories/ lists only some Diffusion pulling errors, and still not very helpful
[12:45:17] maybe /var/log/phd/daemons.log on phab1004 comes closest? but it's always been quite noisy
[12:46:15] arnaudb: I cannot correlate that increase with a database activity increase. Also, from that graph only throughput seems to increase, but not requests?
[12:47:26] perhaps the database latency is a side effect?
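A rough sketch of what checking the two settings named in that error, and narrowing a binlog window around the spike, could look like. It assumes shell access to the database host; db1250 is taken from the Grafana link above, while the binlog file name and the datetime window are placeholders, not values from this conversation:

    # Check the server-side values named in the Phabricator error
    mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('wait_timeout', 'max_allowed_packet')"

    # Decode only the window around the spike, with row events rendered as readable
    # pseudo-SQL, and rank which tables receive the most writes.
    # Binlog file name and datetime window below are placeholders.
    mysqlbinlog --base64-output=DECODE-ROWS --verbose \
      --start-datetime="2026-01-20 12:00:00" --stop-datetime="2026-01-20 12:40:00" \
      db1250-bin.000123 \
      | grep -E '^### (INSERT INTO|UPDATE|DELETE FROM)' \
      | awk '{print $2, $3, $4}' | sort | uniq -c | sort -rn | head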
I don't see any significant increase activity-wise on the db graphs
[12:48:00] On the database I do see an increase in writes, so that must be coming from somewhere
[12:48:16] I just wanted to check whether it is a legit increase
[12:56:14] fwiw: https://grafana.wikimedia.org/goto/Y8B7iaSDg?orgId=1 all write activity is added up here; there seems to be an increase, but zooming out to 90 days shows that October was more intense
[12:56:16] I don't have a good explanation for that increase
[12:57:25] Maybe then it was just a spike, let's see if it happens again
[12:57:27] Thank you all
[12:57:35] there were a few phabricator updates in between, so that might explain the November rate variation
[12:58:25] marostegui: I'm not sure this is a blip; yesterday we had a similar, yet a bit different, blip: https://wikimedia.slack.com/archives/C05H0JYT85V/p1768906413997449
[12:58:40] (which also resolved itself)
[12:59:07] arnaudb: mmm, then that's a thing indeed
[12:59:29] I didn't find anything obvious yesterday and thought it might be nothing, but I'm doubting that now
[12:59:34] I don't know if we get some of those "normally" or if we are simply seeing more reports?
[13:00:10] we had an alert about http probes being laggy at the same time, so that looks fairly new to me
[13:00:19] we usually don't have those frequently
[13:01:42] I can try to narrow down the writes from that spike I posted and check the binlogs, but I probably won't be able to tell whether it is normal traffic or not unless it is very obvious
[13:03:12] lmk if you see something obvious, I'll try and see if I find something in the logs
[13:03:18] roger, thanks
[13:07:36] I've opened T415189 about the DHCP/PXE failures
[13:07:37] T415189: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189
[13:09:12] arnaudb, andre: there are lots of deletes on `phabricator_cache`.`cache_markupcache`, `phabricator_cache`.`cache_general`, and `repository_statusmessage` WHERE repositoryID = 3533 AND statusType = 'needs-update', and similar inserts like: INSERT INTO `cache_general`
[13:09:17] But I have no idea if this is normal or not
[13:12:25] That doesn't sound suspicious to me, but I'd still have no idea why that's suddenly a higher rate than usual :-/
[13:13:55] there is _something_ that makes some mysql queries slower or fail: `rg mysql -i *error.log | rg -i AH01071 -c` returns `229`, with timestamps starting at 0:05 and running up to now
[13:15:43] maybe not all but a fair number: `rg "#2006: MySQL server has gone away" -c *error.log` → `187` (out of 229)
[13:16:16] I think that's probably because the query dies
[13:16:37] because of the query exec time on mariadb's side?
[13:16:57] Probably times out, because we don't have any query killers there
[13:17:13] And I guess the thread tries to get reused, and that's why the connection gets "MySQL server has gone away"
[13:17:58] some might be very broad queries by crawlers; Phab itself is supposed to time out with "Maximum execution time of 30 seconds exceeded"
[13:18:22] that could explain the httpd volume increase
[13:18:41] andre: are those supposed to be logged to the dashboard you pasted earlier in logstash?
[13:18:47] yes
[13:19:06] I don't see any in the last 3 hours if that's the case
[13:19:25] you may want to re-enable some more filters on that Logstash dashboard, there are a few "DB" ones
[13:29:54] I don't see anything obvious in the error logs, httpd access rate looks steady over time (https://grafana.wikimedia.org/goto/lhSZGaIvg?orgId=1=)
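As a small follow-on to the rg counts above, one way to line those errors up against the Grafana latency blip would be to bucket them per hour. This assumes the matches live in Apache-style error logs (as the AH01071 hits suggest) with timestamps like [Tue Jan 20 13:05:12.123456 2026] at the start of each line:

    # Count "MySQL server has gone away" occurrences per hour so they can be
    # compared against the latency graph; assumes Apache error-log timestamps.
    rg --no-filename "#2006: MySQL server has gone away" *error.log \
      | awk '{print $2, $3, substr($4, 1, 2) ":00"}' \
      | sort | uniq -c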