[09:33:44] something's strange: wdqs1022, wdqs1023 and wdqs1024 appear to fail some systemd unit and/or ssh every day around 8 or 9 am
[10:57:34] lunch
[11:54:24] so, i'm off for a half day, but dcausse can you repro this? i could be dreaming and maybe this is already arranged in prod (so it's just an artifact in my dev box), but it seems like this brought the load time of one munged .ttl.gz file (from back in september) down to 2.5 minutes of processing time. seriously:
[11:54:47] add this to logback.xml in the same dir as runBlazegraph.sh: https://phabricator.wikimedia.org/P54265
[11:55:40] update runBlazegraph.sh with this (or set your own environment variable)
[11:55:45] LOG_CONFIG=${LOG_CONFIG:-"./logback.xml"}
[11:58:18] (that is, _create_ a logback.xml in the dist/service* dir for your local build where runBlazegraph.sh is; there's no file there by default)
[11:59:26] * dr0ptp4kt thinks it would require more config, but temporarily disabling logging might be worth exploring. had verbalized this, but needed to find the config flag staring right at me after looking at a bunch of wrong info on disabling log4j, slf4j, and non-classic logback :P
[12:00:26] okay, i think i'm not dreaming - i just did it on another wikidump file and it took 134 seconds.
[12:01:27] no telling how badly blazegraph's rebalancing might behave trying to catch up (not sure how much of it is async post-import reconciliation versus on the hot path of the dataloader servlet, but i have glanced over the code and maybe it's okay)
[12:04:06] i'm going to exercise and get to that half day off, but i'll be back in the afternoon US CT time to talk with i.nflatador and r.yankemper about bringing in the flat files on those test servers and such. curious if anyone can repro if they have a few minutes
[13:19:17] dr0ptp4kt: interesting, fwiw we do set com.bigdata to "warn" by default on prod machines
[13:24:43] got the push notification, so quick reply: I wonder if the trick is setting it to error like in that paste. also, my i7 8700 box's target disk is that U.2 NVMe, non-raided. so the plot slightly thickens!
[14:15:35] gehel I declined our mtg since we talked about skipping it yesterday
[14:16:25] dcausse we don't install wdqs-updater at all on those hosts, maybe that job has a dependency on it?
[14:16:40] o/
[14:17:01] inflatador: I thought we would have a first pass on the alerting strategy to review?
[14:17:03] inflatador: perhaps? it's also surprising to see ssh having issues?
[14:17:19] inflatador: let's try to do that on Friday instead?
[14:18:07] gehel OK, sounds good
[14:20:27] dcausse oh yeah, that doesn't explain SSH. Maybe we can look at it later today or whenever
[14:26:22] sure
[14:34:15] looking at historical data it seems that the sanitizer never fixed anything on cloudelastic until yesterday, when the checkerJob moved to using jobrunners in k8s
[14:36:01] Oh yeah, I saw that alert... is there anything we need to do for that?
[14:42:18] I think we need to understand why that changed, but there are multiple things going on: the test of the cirrusCheckerJob on k8s and the SUP
[16:00:22] Search Platform office hours are starting https://meet.google.com/vgj-bbeb-uyi
[16:04:17] dcausse: ^
[16:36:04] I can't seem to find a link to docs about the WDQS Recent Changes updater
[16:48:02] gehel I have a patch up for the IRC changes if/when you get a chance https://gerrit.wikimedia.org/r/c/wikimedia/irc/ircservserv-config/+/980898
[17:38:21] found the problem with the compare-clusters.py script.
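For anyone trying to repro the per-file timing claims above, a minimal harness along these lines records elapsed seconds per munged .ttl.gz file so runs with and without the quieter logback config can be compared. This is only a sketch: the load command is a placeholder, since the exact loadData.sh / dataloader invocation depends on the local setup.

```python
#!/usr/bin/env python3
# Sketch only: times an arbitrary per-file load command so runs with and
# without the quieter logback config can be compared. The load command is a
# placeholder -- substitute your local loadData.sh / dataloader invocation.
import glob
import subprocess
import sys
import time

def time_load(cmd_template: str, path: str) -> float:
    """Run the load command for one file and return elapsed wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(cmd_template.format(file=path), shell=True, check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    # usage: time_loads.py '<load command with {file}>' 'munged/wikidump-*.ttl.gz'
    cmd_template, pattern = sys.argv[1], sys.argv[2]
    for path in sorted(glob.glob(pattern)):
        print(f"{path}\t{time_load(cmd_template, path):.1f}s", flush=True)
```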
Now it reports ~23k variances between relforge and prod frwiki_content :(
[17:38:38] but 0 between eqiad and cloudelastic
[17:41:18] mostly, that means probably not shipping the new updater->cloudelastic this week; need to figure out where these come from, and perhaps bring in a new snapshot afterwards to get them aligned
[17:42:04] could probably deploy testwiki if we just want to see something running, testwiki is tiny and can be replaced easily enough
[17:48:30] back
[18:44:04] #wikimedia-data-platform and #wikimedia-data-platform-alerts are up. wikimedia-data-platform will eventually replace this channel, so please join it if you want
[19:04:13] the frwiki_content incorrect rev id count has gone down, from 22861 to 22637. I have it running the check every 30 minutes (just in a tmux session), will see how it changes over the day
[19:05:58] and the cloudelastic answer is: "message" => "upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED",
[19:06:57] i'm not sure who to poke about that, but we should switch the jobs back until it's fixed
[19:11:24] randomly curious: the php debug shell in k8s can't see any files, `exec('ls -l /')` shows a single workdir directory, and that directory is empty
[19:11:38] * ebernhardson was hoping for bare curl to compare php vs bare queries
[19:15:18] curiously, curl from php land can query https://cloudelastic.wikimedia.org:9243, but http://localhost:6105 fails
[19:39:34] ahh, and the reason Checker doesn't blow up on a bad get is a traditional Elastica failure: the response is a 503 but Elastica doesn't care and Response::hasError() is false
[19:40:12] and get was a bit special-cased and doesn't run through our normal msearch stuff
[20:02:04] lunch, back in ~30
[20:26:51] back
[20:39:33] on the upside, i've been re-running compare-clusters every 30m against frwiki_content on relforge, and the count has gone down by a few hundred every time it runs. Hopefully that means the pipeline itself is fine, and that either the lack of rerenders or an error aligning the snapshot with the updater caused the discrepancy
[20:39:39] school run, back in a bit
[21:16:58] thx ryankemper and inflatador for the expertise getting loadData.sh running on wdqs1024 on the n-triples files for nt_wd_main! see you tomorrow.
[21:17:31] dr0ptp4kt np, excited to see what happens!
[21:21:17] back
[21:59:35] randomly interesting: they updated bard today. Now if you ask questions that involve math there is a "show me the code" button, which (at least in this case) printed out a python function that sets variables with specific values and then does the actual math
[22:19:56] d.causse i applied the prod logback.xml on my gaming rig (commenting out the udp logstash parts), and that seems to have similar throughput, at least in these earlier stages - it's about 270 files through the september dump .ttl.gz munged files, and except for pausing and restarting with the log config it has been running for i guess about 10 hours. i'll let it keep running with this more production-like configuration.
[22:21:49] maybe i'll be able to see if there's degradation at some point and, if so, try the simpler log hack. but i'm figuring it doesn't hurt to try it out this way... i can always do another run with the even more minimalist logback
[22:34:48] ebernhardson: I'm trying to follow along with what's causing problems with cloudelastic, just so I know where I may pick up tomorrow.
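For readers without access to compare-clusters.py, the check being re-run above presumably boils down to something like the sketch below: pull page id → indexed revision from the same index on two clusters and count the documents that disagree. This is a guess at the idea, not the real script; the endpoints are placeholders, and the assumption that CirrusSearch documents expose the revision id in a `version` source field should be verified before relying on it.

```python
#!/usr/bin/env python3
# Rough sketch of a compare-clusters-style check (NOT the real script):
# fetch doc_id -> indexed revision from two clusters and count mismatches.
# Endpoints and the `version` source field are assumptions.
import requests

def fetch_versions(base_url: str, index: str, size: int = 10000) -> dict[str, int]:
    """Return {doc_id: indexed_revision} for up to `size` docs from one cluster."""
    resp = requests.get(
        f"{base_url}/{index}/_search",
        json={"size": size, "_source": ["version"], "sort": ["_doc"]},
        timeout=60,
    )
    resp.raise_for_status()
    hits = resp.json()["hits"]["hits"]
    return {h["_id"]: h["_source"].get("version") for h in hits}

def count_variances(a: dict[str, int], b: dict[str, int]) -> int:
    """Count ids present in both clusters whose indexed revisions differ."""
    return sum(1 for doc_id in a.keys() & b.keys() if a[doc_id] != b[doc_id])

if __name__ == "__main__":
    # Placeholder endpoints; a real run would also page past 10k docs with
    # search_after or the scroll API rather than sampling the first batch.
    prod = fetch_versions("https://prod-cluster.example:9243", "frwiki_content")
    relforge = fetch_versions("https://relforge.example:9243", "frwiki_content")
    print("variances:", count_variances(prod, relforge))
```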
[22:47:02] pfischer: it's not directly related to the updater; rather, some lower-impact mediawiki jobs were migrated to k8s, and the envoy instance fronting cloudelastic in k8s gets a TLS failure talking to cloudelastic. We suspect it's because most certs are self-signed internal puppet certs, while cloudelastic uses acmechief because it is (in theory) a "public" service
[22:48:00] pfischer: i have a patch in gerrit for cirrus that will "fix" the fact that cirrus doesn't manage to fail on those bad responses. But it will simply turn the existing deployment into 400 exceptions per minute
[22:48:03] s/minute/second/
[22:49:37] i don't think we fix the TLS side; i think we get them to put the CheckerJob back on the old job runners, and provide a way to validate whether they will work the next time (i have a repro in the ticket for that)
[22:53:40] ebernhardson can you link me to that ticket if you don't mind?
[22:54:03] inflatador: https://phabricator.wikimedia.org/T352906
[22:58:53] ebernhardson ACK, subscribed... hopefully we'll get a fix soon.
[23:02:49] ebernhardson: thanks! But the SUP would hit the same envoy and therefore run into the same 5xx responses?
[23:03:12] pfischer: quite possibly, yes. It should be the sample impl
[23:03:17] s/sample/same/
[23:08:56] I'm heading out, but if anyone feels like reviewing https://gerrit.wikimedia.org/r/c/operations/puppet/+/980499 LMK. The PCC failure can be ignored, apparently it's due to PCC's cloud environment not supporting IPv6 or something
[23:12:44] ebernhardson: okay, I wonder why this has not been noticed sooner, or did we hit cloudelastic without envoy until now?
[23:13:58] I'm calling it a day.
[23:13:59] pfischer I think it's because the jobrunners moved to k8s, so the TLS settings are different there
[23:14:55] I'm out too
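As an illustration of the repro idea mentioned above (the actual repro lives in T352906), a sketch like the following contrasts a direct request to cloudelastic with one through the local envoy listener the k8s jobrunners use. The localhost:6105 port comes straight from the conversation; everything else is an assumption. A 503 carrying CERTIFICATE_VERIFY_FAILED on the envoy path next to a clean response on the direct path would point at the CA mismatch rather than at the updater or the jobs themselves.

```python
#!/usr/bin/env python3
# Hedged sketch contrasting a direct request to cloudelastic with one through
# the local envoy listener the k8s jobrunners use. Port 6105 comes from the
# conversation; this is not the repro from T352906, just an illustration.
import requests

TARGETS = [
    # Direct: cloudelastic presents a publicly trusted (acmechief) certificate.
    "https://cloudelastic.wikimedia.org:9243",
    # Via envoy: envoy accepts plain HTTP locally and does the TLS handshake to
    # cloudelastic itself; if its CA bundle only trusts the internal puppet CA,
    # it answers with a 503 "CERTIFICATE_VERIFY_FAILED" upstream error.
    "http://localhost:6105",
]

for url in TARGETS:
    try:
        resp = requests.get(url, timeout=10)
        print(f"{url} -> HTTP {resp.status_code}: {resp.text[:120]!r}")
    except requests.exceptions.RequestException as exc:
        print(f"{url} -> request failed: {exc}")
```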