[07:32:48] errand (be back in ~2h)
[07:55:07] gehel: I think the import is done, when you have a sec please run the lexeme import
[07:55:23] on wdqs1009
[07:56:34] Will do
[07:56:47] thanks!
[08:32:17] lexeme reload started on wdqs1009
[08:32:51] wdqs2008 will probably catch up on lag in the next 4-5 h
[08:33:50] thanks, will monitor wdqs1009 and start the updater once complete
[08:49:39] Hello everyone
[08:49:59] I am on IRC cloud 🙃
[08:50:05] Emmanuel has a new nick and is now IRCCloud enabled
[09:16:01] I don't know who recommended this song but I love it: 'La Maison Tellier - Sur un Volcan'
[10:10:23] Lunch
[10:16:08] gehel: it seems I just missed you, can you ping me once you have the time for retesting the data-transfer cookbook?
[10:33:58] lunch
[10:34:23] bonjour ejoseph :-]
[12:17:14] ejoseph: glad you liked La Maison Tellier :), I love The Cavemen, thanks for sharing it!
[12:43:15] lexeme reload seems done, restart the updater on wdqs1009
[12:44:18] dcausse: ping me when you want me to clean up the downloads and munged files
[12:44:27] zpapierski: I'm around for another test
[12:44:39] gehel: you can clean them up when you want/can
[12:45:00] gehel: I was just going to get myself a coffee and I'm ready
[12:45:09] updater restarted
[12:45:37] ~10m
[12:45:44] cleanup completed
[12:45:50] ^ on wdqs1009
[12:47:04] thanks!
[12:48:34] wdqs2008 is back to real time
[12:53:00] \o/
[12:55:02] ~22h to catch up 2.2 weeks of lag vs 15 days for 2 weeks (according to T241128)
[12:55:03] T241128: EPIC: Reduce the time needed to do the initial WDQS import - https://phabricator.wikimedia.org/T241128
[12:57:24] can we say "significant performance improvement"?
[12:57:49] zpapierski: I'm in the open hangout https://meet.google.com/ugw-nsih-qyw
[12:57:54] :)
[12:58:04] if anyone else wants to join (ejoseph ?)
[12:59:20] I'll be there in a sec
[13:25:52] volans: want to have a last look at https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/727021/12 before we merge it?
[13:29:15] sure I can
[13:35:15] left a couple of comments, up to you
[13:35:22] non blocking
[14:34:09] volans: since we tested the change I'll change the commit message, but push the changes for the comments into the next patch, is that ok?
[14:34:39] sure, all up to you, as I said not a blocker nor mandatory
[14:42:45] errand
[14:54:24] zpapierski: cookbook merged
[14:54:30] yay
[14:54:33] after retro?
[14:54:54] zpapierski: maybe you can work with ryankemper to do the first run? I have another meeting after retro and then need to feed the kids
[14:55:34] since I assume that ryankemper knows how to run cookbooks, I'm absolutely fine with that :)
[14:55:51] of course, if he can participate
[14:57:39] he can dig in my bash_history to find the commands we used for testing
[14:58:06] reminder: update T288231 as you go and merge puppet configuration changes to the updater
[14:58:06] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231
[14:58:56] I'm assuming it needs to be merged before we start?
[14:58:59] codfw servers can go from this patch chain: https://gerrit.wikimedia.org/r/c/operations/puppet/+/730794
[14:59:26] it must be merged just after the start of the cookbook (once the updater has been stopped)
[14:59:37] ah, ok
[14:59:58] will puppet make it in time, though?
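For context on the figures in the [12:55:02] message: with the old updater, catching up ~2 weeks of lag took ~15 days (per T241128), while the streaming updater cleared ~2.2 weeks in ~22h. A rough back-of-the-envelope comparison of the two catch-up rates, using only the approximate numbers quoted above, which is what the "significant performance improvement" question at [12:57:24] is getting at:

```python
# Rough comparison of catch-up rates, using only the approximate figures quoted in the log.
old_lag_days, old_wall_days = 2 * 7, 15          # ~2 weeks of lag cleared in ~15 days
new_lag_days, new_wall_days = 2.2 * 7, 22 / 24   # ~2.2 weeks of lag cleared in ~22 hours

old_rate = old_lag_days / old_wall_days   # ~0.93 days of lag cleared per wall-clock day
new_rate = new_lag_days / new_wall_days   # ~16.8 days of lag cleared per wall-clock day

print(f"old: {old_rate:.2f}x real time, new: {new_rate:.1f}x real time, "
      f"~{new_rate / old_rate:.0f}x faster catch-up")
# old: 0.93x real time, new: 16.8x real time, ~18x faster catch-up
```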
[15:00:18] should be 30 mins for puppet so I hope so
[15:01:08] the puppet run can be done manually, so no issue there
[15:02:17] as a reminder https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Deployment ;)
[15:02:23] retrospective is starting: https://meet.google.com/ssh-zegc-cyw (cc: ebernhardson)
[15:28:38] So looks like first up is `wdqs2008 -> wdqs2005` per https://phabricator.wikimedia.org/T288231?
[15:29:16] ryankemper: yes, I also made the patch chain https://gerrit.wikimedia.org/r/c/operations/puppet/+/730794 which should have the proper ordering for codfw
[15:29:40] great, thanks
[15:29:49] there's a similar one for eqiad but the source (wdqs1009) is not ready yet
[15:29:55] ack
[15:30:27] won't wdqs2005 be repooled right after the data-transfer?
[15:30:52] I mean, because of the cookbook we will have a few hours of lag on wdqs2005?
[15:32:13] It will be automatically repooled, yeah
[15:32:20] There might be an option to not repool, lemme check
[15:32:22] do we actually want that?
[15:32:49] I mean, it will catch up very quickly, we might not care
[15:33:06] but for every cookbook run we will cause lag on both source and destination
[15:33:19] anyway, I'm ready when you are
[15:33:24] I thought we were waiting for the lag to catch up before triggering the new transfer but perhaps that's all manual
[15:34:03] I mean, even if we do, it still takes a couple of hours for a data transfer to complete
[15:34:23] during which I guess both consumers are off?
[15:34:43] yes, since blazegraph is off
[15:35:12] exactly, so each time we do a data transfer we cause two instances to lag a few hours
[15:35:35] so I'm thinking maybe we should pool them only when they catch up?
[15:35:50] Looks like `data-reload` has a flag to depool the host but not the data-transfer
[15:36:05] data-transfer depools them as well
[15:36:31] I think there's an opposite flag there
[15:36:39] Could just append a `&& ssh ryankemper@wdqs2005.codfw.wmnet 'sudo depool'` after the command
[15:36:40] there's a with_lvs flag apparently
[15:37:06] `with_lvs` will make it not depool though
[15:37:13] I guess we could manually depool beforehand and pass it that flag though
[15:37:21] that would work
[15:37:28] Okay, we'll do that for simplicity
[15:37:35] depool them, data-transfer, wait, pool them
[15:37:50] I love this: parser.add_argument('--without-lvs', action='store_false', dest='with_lvs', help='This cluster does not use LVS.')
[15:37:59] it's a double negative in English
[15:38:13] I understand nothing of this line :)
[15:38:21] where's the double negative?
[15:38:22] I, store_false
[15:38:32] on with_lvs, with the flag name without_lvs
[15:38:38] s/I/ahh
[15:38:45] ah yeah I see what you mean
[15:38:55] yeah, it defaults to false for --without-lvs haha
[15:39:14] it sort of makes sense in that without-lvs is a flag we flip on for the special case (test host as either source or dest)
[15:39:17] but does read a bit weird
[15:39:40] As far as the order things need to happen in: dispatch email to mailing list -> start data-transfer -> merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/730794 -> transfer finishes, wait for lag catchup -> repool?
[15:40:00] do we need to email anybody for depooled hosts?
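To unpack the `--without-lvs` exchange above: with argparse, `action='store_false'` defaults the destination to `True`, so `with_lvs` stays `True` unless the operator passes `--without-lvs`, and per the conversation the cookbook only does the LVS depool/repool when `with_lvs` is `True`. A minimal standalone illustration of that flag behaviour (this parser is a sketch reusing the quoted line, not the actual cookbook code):

```python
import argparse

# Same pattern as the line quoted at [15:37:50]: the flag is named "without", but the
# value it controls is "with" -- hence the double negative.
parser = argparse.ArgumentParser()
parser.add_argument('--without-lvs', action='store_false', dest='with_lvs',
                    help='This cluster does not use LVS.')

print(parser.parse_args([]).with_lvs)                 # True  -> cookbook handles LVS depool/repool
print(parser.parse_args(['--without-lvs']).with_lvs)  # False -> LVS handling skipped (manual depool case)
```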
[15:40:23] no but there was a note in the phab ticket about reminding people on the list that the streaming updater cutover is beginning
[15:40:32] https://phabricator.wikimedia.org/T288231
[15:40:34] `Send a quick reminder com to users`
[15:40:41] ah, right
[15:40:42] sure
[15:40:55] (I should have mentioned that's a special case for the first transfer)
[15:41:42] merge = puppet merge + puppet apply, just in case
[15:41:54] ack
[15:41:58] for the com I don't know if it's strictly needed
[15:42:32] we could wait till next week when we really start the transfers in earnest
[15:42:46] dcausse: https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1&var-site=codfw&var-k8sds=codfw%20prometheus%2Fk8s&var-opsds=codfw%20prometheus%2Fops this only shows latency for wdqs2008
[15:42:51] Okay, going to compose the transfer command real quick (won't take long), just to sanity check: we already merged the cookbook changes, right?
[15:42:52] is there something else available?
[15:42:54] mpham: do you think it's needed to send a quick reminder that we're about to affect users? I've put this in T288231 but not sure we really agreed that it was necessary
[15:42:55] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231
[15:44:25] zpapierski: it's using a dedicated streaming updater metric (wdqs_streaming_updater_kafka_stream_consumer_lag_Value)
[15:44:59] as new servers are migrated more series will appear
[15:45:05] ah, I see
[15:45:30] ok, so this is what we need to look at then
[15:45:36] ryankemper: the first servers are in the internal cluster so they won't affect users, so I believe it's fine to start now
[15:46:03] open hangout?
[15:46:19] sure
[15:46:22] `cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2005.codfw.wmnet --reason "streaming updater cutover for wdqs2005" --blazegraph_instance blazegraph --task-id T288231` look good?
[15:47:06] "blazegraph" really means "wdqs" in `--blazegraph-instance` btw
[15:47:19] s/wdqs/wikidata
[15:48:45] ryankemper: can you join the open hangout?
[15:55:01] `sudo cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2005.codfw.wmnet --reason "streaming updater cutover for wdqs2005" --blazegraph_instance blazegraph --task-id T288231 --without-lvs`
[15:55:02] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231
[15:57:28] https://config-master.wikimedia.org/pybal/codfw/wdqs-internal
[15:58:21] zpapierski: dcausse: now that the cookbook is starting, time for me to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/730794 and run puppet-agent on `wdqs2005`, right?
[15:59:51] ryankemper: in case of emergency - https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/727021/12 to revert, and run the cookbook with the other machine that doesn't have the new updater
[16:37:06] Transfer failed with a broken pipe on `wdqs2008`; it's prob just a one-off so starting the cookbook over again
[16:38:13] ryankemper: I see multiple nc and pigz commands on wdqs2005
[16:38:39] ah
[16:38:50] so it's possible the initial ctrl+c never stopped the pigz, I guess
[16:38:55] taking a look at the process list
[16:39:42] Okay yeah, I'll manually terminate the processes, one sec
[16:44:44] Okay, cleaned those up and kicked off the cookbook again
[17:03:14] ryankemper: wcqs meeting?
[17:04:16] dcausse: thx
[17:58:47] dcausse: I think you were asking about reminding WDQS users that we're starting the data transfer? I sent something on Monday, but they won't know we just started
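Pulling the steps above together ("depool them, data-transfer, wait, pool them"), the first cutover run could be scripted roughly as below. This is only a hedged sketch: the `depool`/`pool` commands and the cookbook invocation come straight from the log, the script is assumed to run on the host the cookbooks are normally run from, and the lag check is a placeholder, since in practice the wdqs_streaming_updater_kafka_stream_consumer_lag_Value metric on the Grafana dashboard linked above was watched by hand.

```python
import subprocess
import time

DEST = "wdqs2005.codfw.wmnet"
# Cookbook invocation as quoted at [15:55:01]; --without-lvs because we depool manually.
TRANSFER_CMD = (
    "sudo cookbook sre.wdqs.data-transfer "
    "--source wdqs2008.codfw.wmnet --dest wdqs2005.codfw.wmnet "
    '--reason "streaming updater cutover for wdqs2005" '
    "--blazegraph_instance blazegraph --task-id T288231 --without-lvs"
)


def run_on(host: str, command: str) -> None:
    """Run a command on a remote host over ssh (assumes working ssh and sudo access)."""
    subprocess.run(["ssh", host, command], check=True)


def lag_caught_up(host: str) -> bool:
    """Placeholder: check wdqs_streaming_updater_kafka_stream_consumer_lag_Value for host."""
    raise NotImplementedError("watch the wdqs-streaming-updater Grafana dashboard instead")


# 1. Manually depool the destination, since --without-lvs skips the cookbook's LVS handling.
run_on(DEST, "sudo depool")

# 2. Run the data transfer (blocks until it finishes or fails; Blazegraph is down meanwhile).
subprocess.run(TRANSFER_CMD, shell=True, check=True)

# 3. Wait for the streaming updater to catch up before serving traffic again.
while not lag_caught_up(DEST):
    time.sleep(60)

# 4. Repool once the host is back to (near) real time.
run_on(DEST, "sudo pool")
```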
[17:59:23] Do we know yet when we expect it to finish? Because the original timeline is tomorrow, but I want to let people know what the new timeline is
[18:00:24] mpham: 22 Oct I think
[18:00:42] did see that you already sent a quick reminder com, thanks!
[18:00:48] *not*
[18:01:15] thanks. I'll send a reminder for 22 Oct then
[18:17:08] ryankemper: I'll be a bit late for our meeting, still need to give Oscar his injections
[18:18:42] gehel: ack
[18:39:46] ryankemper: I'm around
[19:36:45] dcausse: (for tomorrow) is T105427 still relevant with the streaming updater?
[19:36:45] T105427: Need a way for WDQS updater to become aware of suppressed deletes - https://phabricator.wikimedia.org/T105427
[19:37:26] gehel: it's fixed by it, so it can be closed once we have it running in production
[20:24:11] dcausse: I wrote an implementation sketch for infobox filtering in https://phabricator.wikimedia.org/T292141#7429811 . If you think that's reasonable (both the implementation and choosing that high-level approach), we'd start working on it soon. Otherwise happy to go with what you think is best.
[22:14:48] has anyone taken a look at T293394 yet?
[22:14:49] T293394: CirrusSearch results include out of date pages - https://phabricator.wikimedia.org/T293394
[22:21:52] legoktm: hadn't seen it. Can look around, but we haven't changed anything in cirrus in a while, nothing clear to look into for regression candidates
[22:22:35] job backlogs don't show anything significant in the last few days
[22:23:18] * legoktm nods, I don't actually know whether it's related to the job queue issues or not
[22:34:07] hmm, something does look wrong. Or my script is broken. But I have a thing that picks a random page from rc feeds every 10 seconds and then monitors search for when the revision shows up. It's only adding things and hasn't found any updated docs yet :S
[22:38:32] it is finding things now, typical latency of about 2 minutes for wikidata.org; I need to add something to it to report if there are old entries still being checked
[22:53:08] hmm, no, it looks like everything it's picking out of the rc feeds is landing within a few minutes. I suppose this only checks revision ids; seems unlikely but possible it's getting content for an older revision
[22:54:00] could also be fairly intermittent... would be good to turn this into more useful stats
[23:01:44] ebernhardson: FWIW I haven't had time to look into it yet but rolling restarts of cloudelastic are failing on 133 items stuck in one of the partitions: `[25/60, retrying in 60.00s] Attempt to run 'spicerack.elasticsearch_cluster.ElasticsearchClusters.wait_for_all_write_queues_empty' raised: Write queue not empty (had value of 133) for partition 1 of topic codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.`
[23:02:25] So there might be some job queue stuff going on - had the exact same value when I tried restarting towards the end of last week, so there's definitely 133 events that are never draining
[23:02:47] I need to step out for a little bit but I'm gonna take a look when I get back and see what might be going on there
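A minimal sketch of the kind of staleness probe described at [22:34:07] (take a fresh edit, then poll search until the indexed revision matches), not ebernhardson's actual script. It assumes the public Wikimedia EventStreams recentchange feed and CirrusSearch's `prop=cirrusdoc` API; the exact field layout of the cirrusdoc response is an assumption and worth double-checking.

```python
import json
import time

import requests

# Assumed endpoints: the Wikimedia EventStreams recentchange feed and the MediaWiki API
# with CirrusSearch's prop=cirrusdoc (which exposes what is currently indexed for a page).
STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"
API = "https://www.wikidata.org/w/api.php"


def pick_recent_edit():
    """Take the first wikidata edit event seen on the recentchange stream."""
    with requests.get(STREAM, stream=True, timeout=60) as resp:
        for line in resp.iter_lines():
            if line.startswith(b"data: "):
                event = json.loads(line[len(b"data: "):])
                if event.get("type") == "edit" and event.get("wiki") == "wikidatawiki":
                    return event["title"], event["revision"]["new"]


def indexed_revision(title):
    """Return the revision id CirrusSearch currently has indexed for the page, if any."""
    resp = requests.get(API, params={
        "action": "query", "prop": "cirrusdoc", "titles": title, "format": "json",
    }, timeout=30).json()
    for page in resp.get("query", {}).get("pages", {}).values():
        for doc in page.get("cirrusdoc", []):
            # Assumption: the indexed revision is exposed as "version" (top level or in source).
            return doc.get("version") or doc.get("source", {}).get("version")
    return None


title, rev = pick_recent_edit()
start = time.time()
while indexed_revision(title) != rev:  # naive: a newer edit to the same page would stall this
    time.sleep(10)
print(f"{title}: revision {rev} visible in search after {time.time() - start:.0f}s")
```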