[06:26:59] Can someone quickly review this for me? Mostly whether the IPs are correct :) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/915150/1/wmf-config/ProductionServices.php#197
[06:34:31] looking
[06:34:43] thanks moritzm
[06:35:45] +1d :-)
[06:37:31] thanks :***
[09:13:31] I'm looking for reviewers for this ops/dns.git patch, preferably someone from traffic (or who knows more about DNS) https://gerrit.wikimedia.org/r/c/operations/dns/+/914751
[09:24:56] apergos: hi! dumps (clouddumps1001 specifically) is having a peak of io usage, there's some wm_enterprise_downloader script also going on, can you help figure out what's going on? (we are mostly on #wikimedia-cloud-admin)
[09:25:50] dcaro: the monthly download script runs on the 1st and 2nd, it's now doing backfill because the upstream servers had some issues. (this downloads enterprise HTML entity dumps.)
[09:26:25] that script runs twice a month every month, on the primary (i.e. web) server
[09:28:57] a side note: this is a single process downloading serially.
[09:39:56] interesting
[09:41:01] apergos: so the issue is that it's reaching the io limit of the disks, so things start slowing down (and we get paged), is there anything we should do? or do you think we should just wait it out? (we might consider removing the alert if so)
[09:41:28] is clouddumps1001 still the primary? and why would this be the triggering factor this time and never previously?
[09:41:37] (primary i.e. the web server)
[09:44:18] yes, clouddumps1001 is the web server, clouddumps1002 the NFS
[09:44:36] ok, so it's running on the right host as it has been for many months now
[09:44:55] not sure about why now, there's some rsync processes, the enterprise script, and a few nginx (mirrors pulling)
[09:45:10] that seems like it would be good to know
[09:49:58] the most relevant difference I see from a month ago is that there's a bit more write load to disk (caused by the enterprise script)
[09:50:27] the enterprise script seems to be flapping from read to write
[09:50:29] it runs for longer now because there are more files to download; that's the only difference
[09:52:28] hmm, so maybe it just passed the threshold xd
[09:53:26] how? it's the length of the run that changed, not the amount written at once
[09:53:58] the alert complains about sustained high iowait, so if it takes longer now it might trigger it
[09:54:10] did it run before the 1st of April?
[09:54:14] it used to take 18 hours or something
[09:54:16] (I see a big change there)
[09:54:19] now it takes over 24
[09:54:33] I'm guessing that an alert would trigger in either case, no?
[09:54:45] https://usercontent.irccloud-cdn.com/file/i1do2H8g/image.png
[09:54:48] the script runs on the 1st and 20th of every month
[09:54:55] that's iowait
[09:55:30] graph here https://grafana-rw.wikimedia.org/d/000000568/wmcs-dumps-general-view?orgId=1&from=now-90d&to=now
[09:56:47] sorry, that's load xd
[09:56:50] https://phabricator.wikimedia.org/T273585 here's the original task for this script (with wmcs consultation of course), for background in case it's useful.
[10:00:05] so it's been running for a while then
[10:00:16] (>1 year)
[10:00:19] right
[10:00:32] hmm, what happened a month ago then?
[10:01:02] a month ago? nothing on our end.
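Narrowing down which process is actually driving the iowait discussed above usually comes down to per-process IO accounting. A minimal sketch, assuming the sysstat tools are installed on the host and that the downloader shows up under the wm_enterprise_downloader name (neither is confirmed in the discussion):

    # Per-device utilisation and average wait times, 3 samples 5 seconds apart
    iostat -x 5 3
    # Per-process read/write throughput, to see whether the downloader,
    # the rsyncs or nginx dominate the disk load
    pidstat -d 5 3
    # Nice level of the downloader (process name assumed for illustration)
    ps -o pid,ni,cmd -C wm_enterprise_downloader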
[10:05:02] puppet seems to be broken, at least on puppetmasters
[10:05:13] May 4 09:58:04 puppetmaster1001 puppet-agent[7553]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /etc/puppet/modules/ssh/templates/publish_fingerprints/known_hosts.epp, line: 8, column: 18) on node puppetmaster1001.eqiad.wmnet
[10:07:24] vgutierrez: AFAIK it's known by mor.itz and seems related to the bookworm install that now works
[10:10:14] cc moritzm
[10:11:12] it could also be something entirely different, it was just a coincidence timing-wise
[10:11:19] apergos: is there any problem if the script is deprioritized?
[10:11:42] sretest1002 seems to have the correct values in facter (I didn't check puppetdb)
[10:11:46] so it could also be another host
[10:11:59] nobody is having errors yet though, but it could be an option to give user downloads priority over the sync cron
[10:12:17] it's fine as long as it's permitted to continue to run
[10:14:47] volans: yeah, I checked the networking and ssh facts and it looks fine in general
[10:30:56] moritzm: according to the error (line 8 column 18) it should be $config['networking']['ip'] at fault
[10:31:51] but AFAICT all hosts have that set (I queried puppetdb for the networking facts)
[10:33:07] jbond might have additional insights
[10:33:22] * jbond looking
[10:36:10] godog: volans:
[10:36:13] ignore sorry
[10:36:26] :)
[10:36:48] * godog ignores
[10:37:48] * elukey got a ping because of "ores" lol
[10:38:29] elukey: clearly your highlights are wrongly configured :D
[10:40:04] lolz
[10:40:11] volans: I thought you would have commented on my work choices, but yes highlights as well :D
[10:40:14] :D
[10:40:49] I know I can't speak about highlight choices, it would only backfire :D
[11:04:16] dcaro: if you want the script to continue running with a lower priority in the future, you can submit a patch to this file: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/dumps/manifests/web/enterprise.pp
[11:06:11] vgutierrez volans moritzm the error is because pc2011 has no networking fact
[11:06:14] curl -vsX GET http://localhost:8080/pdb/query/v4 --data-urlencode "query=inventory{nodes{certname = 'pc2011.codfw.wmnet'}}" | jq .
[11:06:18] * jbond from puppetdb2001
[11:06:52] i can add a fix to the epp file but want to dig a bit deeper to see why it's missing (FTR pc2011 is not reachable, possibly decommissioned?)
[11:07:37] I can't even ssh
[11:07:37] T334722
[11:07:38] T334722: ManagementSSHDown - https://phabricator.wikimedia.org/T334722
[11:07:40] * jbond used puppetdb1002:~jbond/pql/fingerprints.py to track it down
[11:07:41] it is active in netbox
[11:07:44] apergos: thanks!
[11:08:16] jbond: thanks, nice find; that's why I didn't find it in the first place: I asked for the ssh and networking facts, which of course didn't return hosts without them
[11:09:05] ahh yes :)
[11:09:36] fyi this page is also useful for pql stuff https://voxpupuli.org/docs/pql_queries/#get-a-list-of-nodes-for-which-a-fact-is-not-set
[11:12:03] nice!
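The voxpupuli page linked above documents the general "fact is not set" pattern; applied here, a query along these lines would list every node missing the networking fact instead of checking a single certname. A sketch only, using the same local PuppetDB endpoint as the curl above and the syntax from that page (not verified against this PuppetDB version):

    # List certnames for which the networking fact is not set
    curl -sX GET http://localhost:8080/pdb/query/v4 \
        --data-urlencode "query=inventory[certname] { facts.networking is null }" | jq .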
[11:13:59] jbond: thanks for tracking that down, in fact pc2011 was depooled earlier for hw issues: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/915150
[11:20:18] ack, cheers, just testing a CR now
[11:43:18] not sure what happened, but pc2011 is alerting about its switch port being at 100Mbps https://librenms.wikimedia.org/device/device=95/tab=port/port=9745/
[11:43:49] XioNoX: that host was powered off earlier today
[11:44:25] interesting
[11:44:41] maybe when a host is off its switch port falls back to 100M (and stays up)
[12:58:42] dcaro: just an fyi that I get 'access denied to this dashboard' for the grafana link you posted earlier (though not for others), and that's as a logged-in user
[12:59:48] oh, that's not nice
[13:00:22] I got it too, then hit sign-in again, and it appeared
[13:00:32] Q/
[13:00:35] :/
[13:01:02] now it does not happen
[13:32:18] I signed in again and it did not help
[13:32:33] I did that before sending the message to you, just to be sure
[16:06:56] FYI, puppetboard.w.o will be unavailable for a few minutes
[16:07:11] k
[16:12:27] and it's back
[16:46:52] same for etherpad, it was down for a minute and is back
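On the switch-port speed question above (pc2011 showing 100Mbps while powered off), one way to double-check what the port has negotiated, besides LibreNMS, is to query the switch directly over SNMP. A hypothetical example only; the community string, switch hostname and ifIndex below are placeholders, not values from the discussion:

    # IF-MIB::ifHighSpeed reports the negotiated speed in Mbps
    snmpget -v2c -c <community> <switch-fqdn> IF-MIB::ifHighSpeed.<ifIndex>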