[08:45:18] o/ dcausse: Good morning! Are you around for another spark 3 deployment/DAG update round? Would 10:00 suit you?
[08:46:41] pfischer: hey! I might be distracted by the k8s upgrade in codfw but we can try
[08:47:23] could also be something interesting for you to follow
[09:01:15] Sure, I’m happy to do so.
[09:01:26] pfischer: I'll be in https://meet.google.com/fad-kypz-inn for the next hour, feel free to join
[09:08:30] ryankemper: is there something more on T325324 ? Or should we move it to needs reporting?
[09:08:30] T325324: Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324
[10:01:58] dcausse: do you want to cancel our 1:1 due to the k8s upgrade?
[10:02:10] gehel: oh yes sorry about that
[10:02:22] no problem, that sounds like the right priority!
[10:48:04] lunch
[10:58:06] lunch
[11:16:12] lunch
[14:06:46] o/
[14:09:58] gehel thanks for your help with the switch maint, sorry I missed it!
[14:42:29] inflatador: no problem!
[14:43:13] note that I did not ban the nodes from the cluster, given the short timeframe. And that I was fairly confident that this would work just fine without banning them
[14:43:43] inflatador: can you repool all servers? Looks like the maintenance is ove
[14:43:49] yeah, that's what we did for the last switch maintenance. I forgot about this one though ;(
[14:43:50] s/ove/over/
[14:43:55] ACK, will do
[14:44:59] inflatador: note that we depooled the whole wdqs service from codfw, as the updater was down for the k8s upgrade
[14:45:29] repooling the individual wdqs nodes should be fine, as long as the service is still depooled (and depooling the nodes was probably not necessary)
[14:45:42] dcausse might have more context on the wdqs side
[14:46:13] and now that I think about it, did we also depool wcqs? It should be affected in the same way by the updater upgrade?
[14:46:14] we're waiting for the k8s upgrade to be done before restarting the rdf-streaming-updater job
[14:46:20] Cool, I'll work on Elastic for now
[14:46:33] quick errand (need to get food for this evening)
[14:46:44] dcausse: can we do our 1:1 in ~20' when I'm back?
[14:46:54] sure
[14:46:59] I'll ping you
[14:47:05] see you!
[14:47:10] see you!
[14:47:29] dcausse that might affect our meeting but if k8s is being updated, maybe we can't do anything. Anyway we can always meet tomorrow if need be
[14:48:21] oh right :/
[14:48:47] I think we can still meet if you have time
[14:49:06] we'll have to restart the rdf-streaming-updater in codfw anyways today
[14:52:08] OK, whenever you're done with g-ehel hit me up
[14:52:58] sure we can quickly meet before as well if you have time (to sync up on the wdqs reload and k8s@codfw)
[14:53:26] dcausse sure, I'm up at https://meet.google.com/qve-fycn-vpw now
[14:53:36] cool, joining
[15:12:05] dcausse: I'm in https://meet.google.com/meg-wrep-dru for when you're done with Brian
[15:27:28] inflatador: I joined the same meet to do the rdf-streaming-updater restart if you want
[15:58:22] \o
[15:58:30] o/
[16:21:39] looks like about 6 failures reindexing across all clusters, not that bad. Mostly on cloudelastic. Started up a few, will start up the rest once some of these have finished
[16:46:44] o/ ebernhardson: I’m migrating import_ttl.py to analytics-dags but my IDE still complains about missing imports (the build/test is running, however). What setup do you use for editing those dags?
[16:47:37] pfischer: for editing python code i typically use vim, to check for missing imports i typically run the pytest suite
[16:48:18] mypy also does the static analysis bit, i suppose it would complain about missing imports as well (pytest and mypy should both be triggered by tox)
[16:48:47] Dinner, back later
[16:55:45] pfischer: which deps is it not finding? It might be that we are depending on something that's default-installed on debian instances but not listed in the airflow/setup.py
[17:00:26] wdqs/wdqs-internal/wcqs CODFW are now repooled
[17:01:00] ebernhardson: do you use miniconda?
[17:01:24] dcausse: hmm, not sure. For the new repo i use the blubber image
[17:02:14] it installs the conda-analytics package from wmf which has a variety of things i didn't look into directly
[17:02:15] oh ok and you run tox from that image?
[17:02:17] ya
[17:03:12] or i just run the pytest/flake8/etc. commands directly in the image, since the tox configuration here isn't set up to allow running specific things and runs it all
[17:03:56] oh nevermind, that's the airflow-dags repo that has them all bundled together. the discolytics can use `tox -e flake8` or whatever
[17:05:20] I think we're trying to set this up in intellij and thus having all the deps in a conda env
[17:05:29] that intellij can point to
[17:07:46] workout, back in ~40
[17:08:54] dcausse: for discolytics it should have most of them, but it's possible we are depending on something that the conda-analytics package provided. The package is almost a GB so seems plausible something in there that we need to add to our conda-environment.yaml to be explicit
[17:10:19] can get a package list from /opt/conda-analytics/conda-meta/*.json on one of the stat hosts i think
[17:10:34] that's where I'm struggling a bit
[17:15:24] for instance I don't get why tox is running fine but the conda env it's running in does not have the deps in the end
[17:15:59] ah tox installs stuff in .tox/py37/lib/python3.7/site-packages/ hm...
[17:16:50] hmm, tox also installs requirements-test.txt, those are test-only requirements like pytest and mypy
[17:22:46] python's so mysterious to me
[17:23:33] finally ran "pip install ." in the right conda env (found that in "run_dev_instance.sh")
[17:23:49] and intellij is now happy
[17:23:52] python packaging is certainly a mess
[17:25:29] I guess it's sourcing the package from setup.cfg
[17:27:13] toying with intellij, it doesn't seem like adding a conda environment actually imported the conda-environment.yaml
[17:27:27] i suppose the pip install . might have done that for you
[17:27:32] where did you find that conda-env.yaml?
[17:29:34] i'm mucking about in discolytics. The airflow-dags repo probably depends on the conda environment installed by the airflow package andrew made, sec
[17:30:28] oh sorry was talking about airflow-dags
[17:31:17] see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/airflow/+/refs/heads/debian
[17:31:27] looks like they're adding such a yaml file in https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/200
[17:32:27] setup.cfg in the airflow-dags repo might cover it all though, since it manages to run in the tests
[17:33:11] ok
[17:34:04] making a quick mr to see if I'm getting this right
[17:34:47] switching from airflow.operators.dummy_operator to airflow.operators.dummy (former is deprecated)
[17:35:00] people love renaming and moving things around :)
[17:35:33] :)
[17:36:30] i guess it's nice execution_date got renamed to data_interval_start though, less foot-gun'y
[17:37:38] oh indeed
[17:47:36] dcausse for the data transfer, the arg '--blazegraph_instance' should be 'wikidata' for WDQS? Do we need to do a separate transfer for 'categories'?
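[Editor's note] The rename discussed at 17:34:47 (airflow.operators.dummy_operator → airflow.operators.dummy) follows the usual pattern of keeping the old module path as a deprecation shim that warns and re-exports from the new location. A minimal self-contained sketch of that shim pattern — illustrative only, not Airflow's actual code; the class and function names below are made up:

```python
import warnings


class DummyOperator:
    """Stand-in for a class that moved to a new module path.

    (Illustrative only -- in Airflow the real class lives under the new
    module, and the old module merely re-exports it.)
    """

    def execute(self, context):
        # A dummy/no-op operator does nothing by design.
        pass


def deprecated_lookup():
    """What the old module path effectively does: warn, then re-export."""
    warnings.warn(
        "dummy_operator is deprecated; import from the new module instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return DummyOperator
```

Importing through the old path keeps working, but any test suite that treats warnings as errors flags the deprecated import, which is how renames like this usually surface during a migration.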
[17:48:23] inflatador: I need to re-read the code, can't remember
[17:48:41] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/wdqs/data-transfer.py
[17:48:49] (we don't want to transfer the categories journal)
[17:49:53] inflatador: yes --blazegraph_instance wikidata should be what we need
[17:50:06] you can test transferring to wdqs1009 first
[17:50:18] dcausse ACK, starting first xfer (from 1010 to 1009) now
[17:50:23] (stopping the current reload if it's still running)
[17:52:16] dcausse oh crap, I reversed the source and dest. I stopped myself in time though ;)
[17:52:31] ouch :)
[17:53:11] OK, here goes
[17:56:40] instant failure...looks like encrypt/decrypt is the problem again
[17:57:16] :(
[17:59:27] we'll probably have to hack around that...if anyone knows of a python library that would handle that transparently LMK. In the meantime I'll remove the encryption part from the cookbook and give a heads-up when it's back
[17:59:34] I mean, when it's ready to run again
[18:00:35] should we run a checksum at the end to make sure that the data is sane?
[18:00:48] (if we remove encryption)
[18:03:33] we're doing some rudimentary sanity checking here https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/wdqs/data-transfer.py#L123
[18:03:44] but yeah, checksum could be good
[18:04:08] shasumming a 1.5T file might take a while, if there is a more efficient way to do that LMK
[18:04:15] do you have a particular error on the instant failure btw?
[18:04:51] bad decrypt
[18:04:51] 140321574442176:error:06065064:digital envelope routines:EVP_DecryptFinal_ex:bad decrypt:../crypto/evp/evp_enc.c:610:
[18:04:55] this transfer technique used to work a year ago, I wonder what broke it
[18:05:32] It wasn't reliable with encryption, probably due to network issues outside of our control. We tried to simplify the encryption command but I'm not sure we helped much ;(
[18:06:40] theoretically, we could copy the files to the NFS server and have the rest of the hosts pull from there, but we've been asked not to use NFS if possible
[18:07:39] hm... I think we want the source to catch up on lag before transferring to another host (so that the lag does not accumulate)
[18:08:50] ACK. Other SREs have also asked for a helper function or library for long-running transfers so I might check in with v-olans on that
[18:12:47] anyway, probably won't start anything for a while
[18:13:17] ryankemper gehel I won't make pairing today. I have to take my son to the doctor for a couple hours. I'll be online and working from there though
[18:13:35] hmm, it isn't super clear how to run refinery-drop-older-than from airflow 2. they didn't package up refinery/python as an installable package, so we can't really bundle it into discolytics without forking
[18:13:52] ryankemper we can probably touch base once I get back (~2 pm your time) and work on the transfer cookbook
[18:14:33] tempted to fork the relevant bits, a bit messy though and puts us on the hook for future maintenance
[18:14:34] ebernhardson are you using an-airflow1005 yet? I was under the impression we had to change some hieradata stuff so you could have a postgres DB, but haven't checked in yet this wk
[18:14:58] inflatador: nope, it's not usable yet afaik
[18:16:30] ebernhardson ACK, that confirms my assumption. I'll try and look at that today as well
[18:17:11] quick lunch, back in ~1h or so
[19:05:22] back
[20:30:38] re: an-airflow1005. Still confirming w/Andrew and Ben, but my current guess is that we would update https://gerrit.wikimedia.org/r/c/operations/puppet/+/883680/4/hieradata/role/common/analytics_cluster/airflow/search.yaml#45 to point to the new postgres DB shown in https://gerrit.wikimedia.org/r/c/operations/puppet/+/889572/2/hieradata/role/common/analytics_cluster/postgresql.yaml#2
[20:41:25] gehel: re T325324, I need to look into whether it's feasible to tune the pybal-related alerts as well. after that the ticket will be done
[20:41:26] T325324: Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324
[20:43:52] inflatador: next up then would be figuring out why the scap deployment of the dags failed, /srv/deployment/airflow-dags/search is missing on an-airflow1005 but defined in puppet. iirc we had some issue with a puppet run previously related to that
[20:44:08] i suppose i can try a deploy from deployment host and see if it ships
[20:45:12] ebernhardson cool, let me know how it goes. I'll be back home in 30m and can dig deeper if that doesn't do the trick
[20:45:13] from deployment host: sign_and_send_pubkey: signing failed: agent refused operation
[20:45:15] analytics-search@an-airflow1005.eqiad.wmnet: Permission denied (publickey,keyboard-interactive)
[20:45:26] Ah, OK
[20:46:18] I have no idea how scap works, but it sounds like an SSH problem?
[20:47:02] something like that. What should be happening here is it ssh's from the deploy host into the an-airflow1005 instance and runs a scap command there (that then pulls from the deployment host)
[20:48:43] maybe my scap config is wrong there, or some other puppet config is necessary. In the past we used `deploy-service` as the ssh_user, here we use `analytics-search`. The other airflow-dags deployments also use the destination user but maybe they have additional configuration
[20:49:51] yeah, seeing the same thing in /etc/passwd
[20:50:23] /srv/deployment/airflow-dags/search exists on 1005 FWiW
[20:50:59] oh indeed, it looks like even though scap bailed it managed to put some files in place
[20:52:57] If I manually run the same command (without sudo) as the error code (`/usr/bin/scap deploy-local --repo airflow-dags/search -D log_json:False`) ...
[20:54:39] ... I get an error `fatal: detected dubious ownership in repository`
[20:54:49] not sure if that actually tells us anything
[20:55:28] hmm, not sure either
[20:56:41] it probably wants to be run as the analytics-search user
[20:58:14] probably unrelated, but i have NOPASSWD: ALL for analytics-search sudo rights on an-airflow1001 but not 1005
[20:58:36] (and on stat100* hosts)
[21:03:24] and looks like the airflow webserver is trying to start, but is in a bootloop
[21:04:03] ERROR - DB Creation and initialization failed: (MySQLdb._exceptions.OperationalError) (1045, "Access denied for user 'airflow_search'@'2620:0:861:106:10:64:36:11' (using password: YES)")
[21:04:14] suggests it's not pointed at postgresql
[21:12:16] back
[22:41:42] ryankemper gave the data transfer c-ookbook a little push, and it looks like it's working this time. ~300G/1.2T copied so far
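[Editor's note] On the `fatal: detected dubious ownership in repository` error at 20:54:39: since git 2.35.2, git refuses to operate on a repository owned by a different user than the one invoking it, which is consistent with running `scap deploy-local` as the wrong user. Running it as the repo owner (analytics-search, as suggested at 20:56:41) is the clean fix; the alternative is allow-listing the directory via `safe.directory`. A sketch using a throwaway HOME so no real config is touched (the path is the one from the log):

```shell
# git >= 2.35.2 rejects repos owned by a different user unless the
# directory is allow-listed via safe.directory.
export HOME="$(mktemp -d)"   # throwaway HOME: don't modify real ~/.gitconfig

# Allow-list the scap target directory from the log above:
git config --global --add safe.directory /srv/deployment/airflow-dags/search

# Show what was recorded in the (throwaway) global config:
git config --global --get-all safe.directory
```

Allow-listing per-user is a workaround; having scap run git as the owning user avoids the check entirely.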