[10:46:57] While working on the Search Platform team interface page (https://office.wikimedia.org/wiki/ERC/Search), I'm wondering if team information should really be on mediawiki (https://www.mediawiki.org/wiki/Wikimedia_Search_Platform) or on wikitech.
[10:47:43] mw.o seems to really be about MediaWiki as software. Our team pages are more about the general work that we do.
[10:48:38] For example, all SRE teams seem to be on wikitech (https://wikitech.wikimedia.org/wiki/SRE), as is Data Engineering (https://wikitech.wikimedia.org/wiki/Data_Engineering)
[10:49:25] In the end, it doesn't change much. But if I'm creating new pages and dusting off the existing ones, I think it also makes sense to be consistent with what other teams are doing.
[10:49:27] Thoughts?
[10:52:32] other teams like Growth are using mw.org tho, not sure there are guidelines here (i.e. product teams are a better fit on mw.org and tech teams on wikitech?)
[10:53:02] ML is on mw.org
[10:55:41] our docs are spread across the two sites so it's really hard to pick one
[10:58:21] with mw.org you get the translate features, but I think that matters more for teams with more end-user interaction than we actually have
[11:10:49] Product teams also tend to be more focused on MediaWiki itself. For us, mw is only a small part of what we do.
[11:11:19] I think the CirrusSearch documentation makes a ton of sense on mw.o. The operational documentation of our services makes a lot more sense on wikitech.
[11:11:27] the team documentation is fuzzier
[11:12:24] We talked about standardizing on Wikitech as part of the team interfaces ERC, but in the end, we've been focused on the landing page on office wiki.
[11:15:46] if this becomes the standard for tech teams then we should probably migrate there, but until then perhaps it's not worth the effort?
[11:26:27] lunch
[14:03:00] o/
[14:05:38] Data transfer cookbook from 1010 to 1009 finished!
[14:06:47] starting an xfer from 1010 to 2009 now
[14:17:10] \o/
[15:24:41] ebernhardson: I've sent you an invite to talk about search dashboards with Connie and Mmikhail next week. It is outside of your usual working hours, so please skip it if you can't be available.
[15:25:10] I've also invited Trey and Mike. Let's not bring the full team yet (unless you really want to be there)
[15:28:39] ebernhardson, ryankemper, inflatador, Trey314159, mpham: could you fill in https://docs.google.com/spreadsheets/d/168pCszNxYBdYvMhiD_WHZwfoEDVZ00ug7O7N3lpkBK8/edit#gid=0 with your availability for an offsite?
[15:29:23] I see that some of the dates are already filled in, but not all. If you don't know yet whether you're available or not, could you take a guess? I'd like to fix a date sooner rather than later.
[15:31:16] I only put zeros for the worst days. I'll add some ones. Thanks for the reminder!
[16:01:18] inflatador: re your question yesterday in wikimedia-sre, yes, the new airflow instance and the old airflow instance should be using different profiles. Sadly the naming doesn't make this clear, but the airflow 1 and airflow 2 instances do use different profiles.
[18:22:16] Data transfers are ongoing for WDQS, myself/ryankemper will update the list at https://phabricator.wikimedia.org/T323096#8638317
[18:24:54] took a quick look at how poolcounter handles stats, essentially they have a global variable with some counters in it that get incremented, and then it can print those values out to be processed by the prometheus collector and shipped to prometheus
[18:25:09] per-pool stats would have to be done with more dynamic pieces
[18:30:21] should be possible i suppose, essentially we would take a key, hash it, look it up in the hashtable, and then report the count (=queue depth) / processing (=active workers). But that's a 10% project for another day i guess :)
[18:31:19] general idea i guess would be to add a command to report stats for n keys, and have the prometheus collector take a cli argument containing a set of keys to report stats for
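To make the 18:25–18:31 idea a bit more concrete, here is a minimal Python sketch of what the collector side could look like if poolcounterd grew a per-key stats command. Nothing below exists today: the `STATS KEY <key>` command, its reply format, the metric names, and the scrape port are invented for illustration, and the poolcounter port is assumed to be the default 7531. Only the general shape comes from the discussion above (take a set of keys as a cli argument, ask poolcounterd for each key's queue depth and active workers, expose them as gauges).

```python
#!/usr/bin/env python3
"""Hypothetical per-key poolcounter exporter, sketching the idea above.

Nothing here exists in poolcounterd today: the `STATS KEY <key>` command and
its reply format are assumptions, as is the default poolcounter port (7531).
"""
import argparse
import socket
import time

from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("poolcounter_key_queue_depth", "Clients queued on a key", ["key"])
PROCESSING = Gauge("poolcounter_key_processing", "Active workers holding a key", ["key"])


def query_key(host: str, port: int, key: str) -> dict:
    """Send the (hypothetical) per-key stats command and parse the one-line reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(f"STATS KEY {key}\n".encode())
        reply = sock.makefile().readline().strip()
    # Assumed reply format: "count: 3 processing: 2"
    fields = reply.replace(":", "").split()
    return {name: int(value) for name, value in zip(fields[0::2], fields[1::2])}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--keys", required=True, help="comma-separated keys to report on")
    parser.add_argument("--poolcounter-host", default="localhost")
    parser.add_argument("--poolcounter-port", type=int, default=7531)
    args = parser.parse_args()

    start_http_server(9117)  # arbitrary scrape port for this sketch
    while True:
        for key in args.keys.split(","):
            stats = query_key(args.poolcounter_host, args.poolcounter_port, key)
            QUEUE_DEPTH.labels(key=key).set(stats.get("count", 0))
            PROCESSING.labels(key=key).set(stats.get("processing", 0))
        time.sleep(60)


if __name__ == "__main__":
    main()
```

The harder half of the 10% project would be on the poolcounterd side: the new command that hashes the key, looks it up in the hashtable, and reports the queue depth and active workers the same way the existing global counters are printed out.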
[18:40:45] lunch, back in ~40
[19:10:05] back
[19:17:09] wrote a little python to use spark and report the partitions that are available, and then a little bash loop to use the airflow cli to mark things as success (airflow run --mark_success) to get some of the delayed dags moving. This is only a bit of a bandaid though
[19:17:42] basically i'm approving things that have eqiad but not codfw partitions. I suppose an alternative would have been to deploy an update to the dags that ignored codfw for a little while
[19:20:32] curiously it seems like you can mark a task as success even if the dag_run hasn't been created yet
[19:38:20] back
[19:38:38] ...or not
[19:53:17] not sure what to do with this... event.rdf_streaming_updater_lapsed_action doesn't have eqiad or codfw partitions starting at 2023-02-21T11
[19:57:19] looking in kafka-jumbo, i guess this mostly only gets canaries without actual lapsed actions, there is a gap in eqiad.rdf-streaming-updater.lapsed-action canaries from 2023-02-21T10:45 through 2023-02-22T18 which matches our outage... it's not clear though why eqiad is also affected
[20:08:48] ebernhardson: because this stream is pretty much idle usually (it gets a couple of events/day) so it relies on canary events for eqiad as well
[20:09:06] dcausse: ahh ok. i'll mark all those as success then as well
[20:09:11] the data transfer playbook always fails the first time I run it against a new host... but it seems to work after that
[20:09:17] thanks for taking care of this!
[20:09:23] dcausse: np
[20:09:31] inflatador: random guess, does it have to wait for the ferm rules to apply?
[20:09:42] * ebernhardson has no clue really :P
[20:10:05] I'm guessing that too
[20:11:04] it gives a 'bad decrypt' error from openssl, but I'm guessing that has something to do with the data transfer starting before the listener is ready and/or before the FW is open
[20:12:06] well, if it works the second time through i suppose we run with that :) Could probably track it down but it could take a day or two to properly understand
[20:12:32] I wonder if you shouldn't just leave a port open all the time, allowing the wdqs* hosts, instead of setting up a ferm exception on the fly?
[20:15:17] maybe, although the current 'book picks a random port, so we would have to account for that. I think I'm happy enough for now... especially considering we're xferring over 10G now
[20:16:06] hm... that's what we have tho (port 9876): /etc/ferm/conf.d/10_query_service_file_transfer populated from puppet
[20:16:42] but the cookbook is also opening that on the fly https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/wdqs/data-transfer.py#80
[20:19:13] oh, nice! /me needs to pay more attention to the xfer
[21:50:37] taking much longer than expected... but most of it is now cleaned up. Still have to figure out how the subgraph/query mapping is supposed to work
[21:51:44] would be easier if airflow represented ExternalTaskSensor a bit better in the ui, lots of going back and forth between the dag files and the ui to figure out what needs to be marked, why a dag is stalled, etc.
[23:09:25] ryankemper heading out, but FYI I have 2 data transfers running (one for each DC) on cumin1001
[23:10:00] ack
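For reference, a rough reconstruction of the bandaid described at 19:17 (list what is actually present in Hive via Spark, then push the stalled task instances forward with `airflow run --mark_success`), assuming the usual datacenter/year/month/day/hour partition layout for event tables. The table name and the CLI flag come from the log above; the DAG id, task id, and partition parsing are placeholders, so treat this as a sketch of the approach rather than the script that was actually run.

```python
#!/usr/bin/env python3
"""Sketch of the "check Hive partitions, then mark stalled tasks as success"
bandaid from 19:17. The table name and --mark_success flag come from the log;
the DAG id, task id, and partition layout are assumptions made for illustration.
"""
import subprocess

from pyspark.sql import SparkSession

TABLE = "event.rdf_streaming_updater_lapsed_action"  # mentioned at 19:53 above
DAG_ID = "example_hourly_dag"          # placeholder: the DAG waiting on this table
TASK_ID = "wait_for_codfw_partition"   # placeholder: the stuck sensor task

spark = SparkSession.builder.appName("partition-report").getOrCreate()
partitions = {row.partition for row in spark.sql(f"SHOW PARTITIONS {TABLE}").collect()}

# Keep the hours where the eqiad partition exists but the codfw one does not,
# assuming partition strings like "datacenter=eqiad/year=2023/month=2/day=21/hour=11".
eqiad_only = sorted(
    p.split("/", 1)[1]
    for p in partitions
    if p.startswith("datacenter=eqiad/")
    and "datacenter=codfw/" + p.split("/", 1)[1] not in partitions
)

for hour_spec in eqiad_only:
    # Turn "year=2023/month=2/day=21/hour=11" into an Airflow execution date.
    kv = dict(field.split("=") for field in hour_spec.split("/"))
    execution_date = "{year}-{month:0>2}-{day:0>2}T{hour:0>2}:00:00".format(**kv)
    # Airflow 1.x CLI, as in the log: mark the task instance successful
    # without actually running it.
    subprocess.run(
        ["airflow", "run", "--mark_success", DAG_ID, TASK_ID, execution_date],
        check=True,
    )
```

As noted at 19:17:42, the alternative would have been to temporarily deploy dags that ignore codfw, which avoids this kind of manual marking altogether.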