[01:38:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [01:43:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [02:12:48] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:22:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [02:32:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. 
- https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [03:09:22] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:19:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [04:24:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [06:54:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [06:55:34] mgerlach, leila: I've added a line in the "Changes and known problems" sections of both pageview and webrequest about the dataloss - Thank you for pointing it out! 
[06:58:42] 10Data-Engineering, 10Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (10JAllemandou) @Eevans : The AQS-loader is not datacenter-aware. It takes base hosts as a parameter and gets the cassandra cluster topology asking to the known host(s). How... [06:59:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [07:11:03] hi teammm! [07:11:12] joining early today [07:11:22] Good morning mforns :) [07:11:29] How are you? [07:11:56] heya joal! I'm good! And how's your day? [07:12:24] Just started mforns XD [07:12:38] 10Data-Engineering, 10Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (10JAllemandou) After a great talk with @Antoine_Quhen a wider discussion needs to happen: Spark3 offers the possibility to write to cassandra through SQL-like queries (see https://github.com/datastax/spark... [07:12:48] :) [07:15:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [07:19:50] 10Data-Engineering: Check home/HDFS leftovers of statwithlatte - https://phabricator.wikimedia.org/T307980 (10MoritzMuehlenhoff) [07:20:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. 
- https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [07:32:24] 10Data-Engineering: Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link - https://phabricator.wikimedia.org/T305591 (10JAllemandou) 05Open→03Resolved [07:36:16] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [07:39:33] 10Data-Engineering, 10Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (10mforns) I don't mind waiting for Spark3, but let's see what the team thinks! On the other hand, if we write a CassandraLoadOperator() that uses the existing Cassandra loading Spark job now, and migrate t... [07:41:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [07:49:35] 10Data-Engineering, 10Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (10JAllemandou) Actually there would some difference, as using Spark3 would make the related HQL queries in the form: ` INSERT INTO aqs.local_group_default_T_pageviews_per_project_v2.data SELECT ... ` inste... 
[07:57:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [08:12:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [08:13:00] (03CR) 10Joal: [C: 03+1] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/790373 (https://phabricator.wikimedia.org/T305843) (owner: 10Aqu) [08:15:12] (03CR) 10Joal: [C: 03+1] "Actually one comment" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/790373 (https://phabricator.wikimedia.org/T305843) (owner: 10Aqu) [08:24:25] mforns: Would you have a minute for me please? I looking at the anomaly detection error we had 3 days ago, and I have questions :) [08:24:37] yes! [08:24:47] joal: batcave? [08:24:56] OMW! [09:21:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [09:26:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. 
- https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [09:28:13] 10Analytics-Radar: [REQUEST] Extract search queries from HTTP_REFERER field for a Wikibook - https://phabricator.wikimedia.org/T144714 (10Aklapper) [09:38:41] hello, does the new stats.wikimedia.org have something about operating system or browser like the old did? [09:40:10] (03CR) 10David Caro: [C: 03+1] "LGTM, starts getting a bit weird though, user experience might not be the smoothest, but least effort I guess." [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/789196 (https://phabricator.wikimedia.org/T290146) (owner: 10Vivian Rook) [09:42:44] i can not find anything like https://stats.wikimedia.org/wikimedia/squids/SquidReportOperatingSystems.htm [09:42:44] Hi mei[m] - we have data about OSes and browsers for readers here: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os [09:42:59] oh, thank you! [09:43:10] you're welcome :) [11:06:32] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/789797 (https://phabricator.wikimedia.org/T307779) (owner: 10Joal) [11:28:23] (03PS1) 10NOkafor: Added the get Cassandra password function to process cassandra password file [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/790651 [11:28:25] (03PS1) 10NOkafor: Created a function to get cassandra password from file [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/790652 (https://phabricator.wikimedia.org/T306895) [11:29:51] (03PS2) 10NOkafor: Added the get Cassandra password function to process cassandra password file [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/790651 (https://phabricator.wikimedia.org/T306895) [11:33:12] (03CR) 10Btullis: [C: 03+1] Throttle heavy monthly jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789797 (https://phabricator.wikimedia.org/T307779) (owner: 10Joal) [11:55:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [12:00:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in codfw. 
- https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [12:22:42] (03CR) 10Joal: "A bunch of comments - let's talk about them when you wish" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/790651 (https://phabricator.wikimedia.org/T306895) (owner: 10NOkafor) [12:36:39] (03CR) 10Snwachukwu: [WIP] Create a Hive to Graphite job (0315 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/775376 (https://phabricator.wikimedia.org/T304623) (owner: 10Snwachukwu) [12:42:32] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor jobs to not use DAG factories - https://phabricator.wikimedia.org/T302391 (10mforns) Yes, they do. I didn't change them, because in this case the use of a factory is justified. However the factory should not be a DAG factory, but ra... [12:43:34] (03Abandoned) 10NOkafor: Created a function to get cassandra password from file [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/790652 (https://phabricator.wikimedia.org/T306895) (owner: 10NOkafor) [12:48:10] 10Data-Engineering, 10Airflow: [Airflow] Refactor anomaly detection DAG factory into a TaskGroup factory - https://phabricator.wikimedia.org/T308011 (10mforns) [12:48:28] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor jobs to not use DAG factories - https://phabricator.wikimedia.org/T302391 (10mforns) I created a new task for that: T308011 [13:02:04] ottomata: meeting? 
[13:02:09] ottomata: we could talk kafka [13:05:30] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10BTullis) Sorry, I don't quite get what you mean by this: > Zookeeper should be run in more than 3 DCs, e.g. '2.5' DC... [13:06:53] aqu: airflow docs meeting? [13:19:56] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [13:22:35] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) Well I'm not getting anywhere very fast with this. I now understand from @akosiaris... [13:29:29] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I identified the node process that was running eventgate-analytics-external, then ra... [13:32:20] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [13:35:19] btullis: I took a quick look at your flamegraph - the high up flames seem to be full of maps [13:58:22] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) @BTullis oopos typo! fixed. Should have said 'more than 2'. 
[13:58:50] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I've done more analysis of packet captures from eventgate-analytics-external and I s... [13:59:57] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10BTullis) Ah great, sorry I feel like a pedant now. :-) [14:14:39] btullis: sorry I haven't deployed the 10s eventgate bucket yet [14:14:44] npm got back to me and is helping me reset my 2FA [14:14:59] so i'll wait for that. i was going to hack around it buuuuut now they are helping me [14:23:24] ottomata: Cool, no worries. I was trying some more things in the meantime and writing up where I've got to ... which isn't very far from where I started, sadly. [14:33:54] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) I emailed the kafka mailing list with some of these questions and got a really nice response from Guoazhan... [14:49:01] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) >>! In T306181#7917390, @BTullis wrote: > > We have proposed creating a new bucke... [14:54:50] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) > all that is needed is to deploy a version of eventgate On it. I had issues with... 
[14:55:43] joal: and/or btullis if you have time today, i'd love a little kafka stretch brain bounce to think about # of partitions, replicas, and traffic size estimation [14:55:48] i can do on my own but it would be way more fun to do with you! [14:56:20] maybe after meetings? [14:56:45] also milimetric. [14:57:16] okay, let's start with an initial estimate on average and max message size of a mw wikitext event [14:57:28] wow i'm about to query mw history for the first time.... [14:58:38] :) [15:01:25] hi ottomata: talk here or hang in cave? [15:01:31] in airflow sync [15:02:19] k, I'm here for after [15:03:08] k [15:03:32] milimetric: i need some stats on average and maybe max(ish) p99 or p95 revision byte sizes [15:03:39] doesn't have to be for all time [15:03:42] but maybe in the last year or two [15:04:48] I'll do it by year [15:04:59] wow okay thank you [15:05:06] can you post results here: https://phabricator.wikimedia.org/T307944 ? [15:10:04] milimetric: basically, i need to estimate the differences in cross DC traffic this event stream would create in a kafka stretch cluster vs using just mirror maker [15:12:28] yep, I'm writing the query while trying to think of what would be different between this raw rev_len byte size and the actual blob, if any [15:23:55] this will be a very rough estimate [15:24:03] order of magnitude is okay [15:30:15] ottomata: Sorry, just seen this. Yes I can do after meetings today. [15:32:48] We don't currently use compression of messages in Kafka, right? 
I wonder whether it would be of benefit in this stretch cluster: https://developer.ibm.com/articles/benefits-compression-kafka-messaging/#supported-compression-types-in-kafka [15:41:25] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "merging for eventual deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/779952 (https://phabricator.wikimedia.org/T306136) (owner: 10Milimetric) [15:44:15] (EventgateLoggingExternalLatency) firing: Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [15:46:08] btullis: i think we do? [15:46:27] snappy compression [15:47:01] OK, thanks. [15:48:16] (EventgateLoggingExternalLatency) firing: Critical latency for GET events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [15:49:15] (EventgateLoggingExternalLatency) resolved: (2) Elevated latency for GET events on eventgate-logging-external in codfw. 
- https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [15:52:16] btullis: for brain bounce later, it may be helpful if you read https://cwiki.apache.org/confluence/display/KAFKA/KIP-36+Rack+aware+replica+assignment#KIP36Rackawarereplicaassignment-ProposedChanges at least briefly [15:52:35] i'm' trying to think of how to minimize cross DC leadership changes [15:52:37] i think its possible [15:53:01] or, read the part about the replica assignment algorithm under proposed changes [15:53:13] Will do. [15:53:16] (EventgateLoggingExternalLatency) resolved: Critical latency for GET events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency [15:53:43] i wonder if that has changed at all recently, it is 6 years old! [15:59:12] haha joal milimetric: welcome Gobblin for streams, from LinkedIn: https://github.com/linkedin/brooklin [15:59:53] hehehe ottomata :) [16:00:05] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Milimetric) Quick stats check on revision sizes and diff sizes: ` select year(parse_datetime(event_timestamp, 'YYY... [16:01:47] ottomata: I have meetings up to late, but can do after [16:01:51] or tomorrow [16:14:54] ottomata: got a link for that covid article? 
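Editorial sketch for the replica assignment discussion at 15:52 to 15:53: the snippet below is a simplified illustration of rack-alternating placement, loosely modeled on the idea described in KIP-36. It is not Kafka's implementation. The datacenter names are used as rack labels and the broker IDs are hypothetical. It only shows that interleaving brokers by rack before a round-robin assignment spreads each partition's replicas across both DCs; it does not address how to minimize cross DC leadership changes.

```python
from itertools import zip_longest


def interleave_by_rack(brokers_by_rack):
    """Order brokers so that neighbours in the list sit in different racks where possible."""
    per_rack = [sorted(brokers) for _, brokers in sorted(brokers_by_rack.items())]
    columns = zip_longest(*per_rack)
    return [broker for column in columns for broker in column if broker is not None]


def assign_replicas(brokers_by_rack, num_partitions, replication_factor):
    """Round-robin over the rack-interleaved broker list (simplified, no random shift)."""
    ordered = interleave_by_rack(brokers_by_rack)
    if replication_factor > len(ordered):
        raise ValueError("replication factor exceeds number of brokers")
    return {
        partition: [ordered[(partition + i) % len(ordered)] for i in range(replication_factor)]
        for partition in range(num_partitions)
    }


# Hypothetical stretch cluster: three brokers per datacenter.
BROKERS_BY_RACK = {
    "eqiad": [1001, 1002, 1003],
    "codfw": [2001, 2002, 2003],
}

assignment = assign_replicas(BROKERS_BY_RACK, num_partitions=6, replication_factor=3)
for partition, replicas in assignment.items():
    print(partition, replicas)
```

In this sketch the first replica of each partition alternates between the two rack labels, so preferred leaders would also alternate between DCs. Whether that is desirable for a stretch cluster is exactly the open question raised in the chat.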
[16:17:06] ottomata: this one's 40MB too: Eumat114/toolongtobetrue [16:20:29] heh: https://en.wikipedia.org/wiki/User:Eumat114/toolongtobetrue [16:22:40] no, just dcausse and i think ami r mentioned it [16:23:01] i am sorry I clicked on that link [16:23:37] milimetric: it's in https://www.wikidata.org/wiki/Special:LongPages ~4.2Mb [16:25:01] 40mb, sigh... [16:35:03] dcausse: https://en.wikipedia.org/wiki/Special:LongPages looks like it only shows main namespace pages, (running a query to verify, but it doesn't include that user page I linked above) [16:36:27] milimetric: indeed, how did you find the toolongtobetrue one? [16:37:33] dcausse: we haz dataz :) [16:37:38] :) [16:41:46] dcausse: `select max(page_len) from page`. And indeed, if I add `where page_namespace=0` then it shows me the same thing as the special page [16:42:12] thx! [17:21:25] ottomata: Do you want to chat Kafka now? [17:25:43] btullis: yes! [17:26:45] btullis: meet.google.com/jar-emdw-qdd [18:08:39] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster [18:36:52] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster exec... [18:42:04] milimetric: around? [18:42:09] need another number for this calc [18:42:22] i'm estimating cross DC throughput [18:42:29] so, i need # of revisions in e.g. 2021 [18:42:46] just tried to run the query myself but i've reverted to a n00b [18:42:53] where's parse_datetime from? [18:43:02] oh is that presto? no? yes? 
[18:50:14] ottomata: yes, presto, it'll be faster [18:50:44] but wait, wikistats [18:51:48] i got it! [18:51:51] from presto [18:51:54] 533764132 [18:51:55] in 2021 [18:52:00] select count(*) from wmf.mediawiki_history where snapshot='2022-04' and event_entity='revision' and event_type='create' and year(parse_datetime(event_timestamp, 'YYYY-MM-DD HH:mm:ss.s')) = '2021' [18:52:39] I'm not sure if this includes both sides of the interval but I think it does: https://stats.wikimedia.org/#/all-projects/contributing/edits/normal%7Cbar%7C2021-01-11~2021-12-18%7C~total%7Cmonthly [18:52:59] 437 million edits all projects [18:53:19] Hm... wonder what the discrepancy is :) [18:54:07] Ah, maybe reverted and deleted don't count and I'm wrong about intervals? Who knows, you need it more accurate? [18:58:20] naw super rough is good [18:58:24] i'm overestimating everythign anyway [19:04:34] ottomata: heya - hat? [19:04:37] chat sorry? [19:06:03] heya joal one second i will post on the phab ticket, you read and check my numbers, then we chat :) [19:06:11] ack [19:09:46] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [19:10:22] ok joal. ^ added stuff under Replica Placement and Cross DC throughput calculations. 
will join BC in 5 minutes (or when you ready) :) [19:10:41] reading in the cave ottomata [19:16:42] (03PS2) 10Mforns: Throttle heavy monthly jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789797 (https://phabricator.wikimedia.org/T307779) (owner: 10Joal) [19:16:51] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [19:16:54] (03CR) 10Mforns: [V: 03+2 C: 03+2] Throttle heavy monthly jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/789797 (https://phabricator.wikimedia.org/T307779) (owner: 10Joal) [19:17:56] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [19:34:04] !log starting refinery deploy (regular weekly train) [19:34:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:37:21] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [19:57:48] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [20:02:27] ottomata: that was p50, not average, I'll get the avg to validate your multiplication [20:03:04] ottomata: unrelated q: do we have any examples of running python logic from airflow? As opposed to spark / hql? 
[20:04:43] Data-Engineering, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata) [20:05:35] Data-Engineering, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata) [20:07:51] Data-Engineering, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata) [20:08:06] oh milimetric scuze u right [20:08:07] thank you. [20:08:54] something is weird with stat1005 [20:09:00] can't ssh, timing out [20:09:07] milimetric: hmmm not that i know of [20:09:08] hmmm [20:09:24] since we still want job logic separate from scheduler [20:09:29] you could do it with a skein operator [20:10:02] ok, I gotta figure out how to do that, thx [20:10:23] see SimpleSkeinOperator [20:10:32] in wmf_airflow_common [20:10:54] hmm [20:11:04] k [20:11:08] yes [20:11:09] that [20:11:10] with [20:11:49] archives='hdfs:///path/to/my/conda_env.tgz#environment', script='environment/bin/mypython_script.py' [20:12:42] milimetric: stat1005 seems okay to me [20:12:59] weird... I try to ssh and it just times out [20:13:26] just stat1005? [20:13:40] ok... 
now it's fine [20:13:43] weird, nvm I guess [20:14:01] ooh, bad news ottomata, avg rev is 20K [20:14:08] https://www.irccloud.com/pastebin/XTobCvTp/ [20:14:31] 20623, I guess it must be those crazy outliers beyond p99 [20:14:32] that's okay [20:14:43] hm [20:14:44] it's still accounted for by your 5M round-up but yeah [20:14:54] i'll just use that [20:17:14] Data-Engineering-Kanban, Data-Catalog: Spike: Evaluate datahub schema versioning support - https://phabricator.wikimedia.org/T307716 (Milimetric) p:Triage→High a:Milimetric I will work on this first, using my hive database, `milimetric`, and reporting findings here. [20:18:05] wait where did I get 5MB?! [20:18:10] i must have had a misplaced comma? [20:18:26] Data-Engineering-Kanban, Data-Catalog: Spike: Evaluate interaction of manual description edits and automatic description reimport - https://phabricator.wikimedia.org/T307717 (Milimetric) p:Triage→High a:Milimetric I will work on this in parallel with the schema spike since repeated ingestion... [20:18:41] hmm no kafka main is in MB per second [20:18:44] hang on... confused myself [20:24:11] Data-Engineering-Kanban, Data-Catalog: User Experience: Authentication - https://phabricator.wikimedia.org/T307711 (Milimetric) p:Triage→High a:BTullis @BTullis: I'm doling out these tasks per our grooming session today, just to expedite the process. We decided there's only a few of us and w... [20:32:54] !log finished refinery deploy (regular weekly train) [20:32:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:33:32] actually joal milimetric, unless i'm looking at this all wrong [20:33:59] i think adding a revision text stream e.g. 
to kafka main would not be a significant increase in traffic [20:34:19] it's pretty small, I mean 20K is nothing [20:34:33] i see kafka main input peaks at around 4MB / second [20:34:42] based just on average [20:34:47] revision byte size [20:34:48] 20623 [20:34:55] and 533764132 revs in 2021 [20:35:18] that's only 507767 uncompressed bytes / second. Okay we will add more bytes for event metadata [20:35:24] but still not that many more [20:36:01] ottomata: I guess that assumes even distribution, so that'd be the thing to look at, max bytes / second over 2021 [20:36:13] yeah joal was going to look at that [20:36:33] i suppose if there are prolonged moments of really huge max bytes/second or minute [20:36:34] it would matter [20:36:36] ok, cool, yeah, group by second and take avg [20:36:38] milimetric: Thanks for that ticket above. Feel free to sign me up for the upgrade process & docs ticket too. [20:37:03] but i somehow doubt that edits contribute to prolonged bandwidth usage [20:37:08] right? [20:37:14] good night btullis :P [20:37:26] unless some bot was allowed to do a huge number of huge content edits all at once [20:38:03] oh that definitely happens a lot [20:42:43] milimetric: it happens a lot that a bot makes a lot of huge edits? [20:42:53] or, i mean, specifically lots of edits to large pages? [20:42:58] i know lots of edits can happen [20:43:12] but would it specifically happen to huge pages? [20:43:28] like ~100 edits or more per second to huge pages? 
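Editorial sketch of the back-of-the-envelope estimate from 20:34 to 20:35, assuming revisions are spread evenly over a 365-day year. The revision count and average size are the numbers quoted in the chat. The 30000-byte figure is a hypothetical round-up, not stated in the chat, included only because it comes out within a byte of the quoted 507767 bytes/second; the measured 20623-byte average gives a lower number.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; 2021 was not a leap year


def avg_bytes_per_second(revisions_per_year, bytes_per_revision):
    """Average uncompressed bytes/second if revisions were evenly distributed."""
    return revisions_per_year * bytes_per_revision / SECONDS_PER_YEAR


REVISIONS_2021 = 533_764_132     # count quoted in the chat
AVG_REVISION_BYTES = 20_623      # average revision size quoted in the chat
PADDED_REVISION_BYTES = 30_000   # hypothetical round-up (assumption, see note above)

measured = avg_bytes_per_second(REVISIONS_2021, AVG_REVISION_BYTES)
padded = avg_bytes_per_second(REVISIONS_2021, PADDED_REVISION_BYTES)

print(f"{measured:,.0f} bytes/s using the measured average")  # roughly 349 kB/s
print(f"{padded:,.0f} bytes/s using the padded average")      # roughly 508 kB/s
```

Either way the result is well under the 4 to 5 MB/second that the chat reports for kafka main, which is the point being made. Event metadata and compression are not modeled here.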
[20:43:33] lots of edits, not sure about lots of big edits, but they could be small edits on big pages and it would still count since we're not sending diffs [20:43:43] yes thats what i mean [20:43:45] I think bots are throttled to less than that, but I have no proof [20:43:55] i guess, we already send events for every edit [20:44:01] I never looked at the distribution, it'll be interesting to see jo's numbers [20:44:26] adding text in a stream would only affect volume if those edit spikes were for huge edits [20:44:31] yeah, I don't think this significantly adds problems except in the top 0.1% of cases [20:44:34] yeah [20:45:10] looks like kafka main already handles around 5MB / second [20:45:20] and that is going cross DC [20:51:12] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [20:52:04] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) @Milimetric got me the avg revision byte size in 2021: ` presto:wmf> select avg(revision_text_bytes) from... [20:58:39] 10Data-Engineering, 10Event-Platform, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [21:18:03] 10Data-Engineering-Kanban, 10Data-Catalog: Spike: Evaluate interaction of manual description edits and automatic description reimport - https://phabricator.wikimedia.org/T307717 (10Milimetric) Ok, got a sense for how this works: * Initially, the table comment shows up as the documentation. The timeline API wi... 
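Editorial sketch for the "max bytes / second" question raised at 20:36: an average hides bursts, so one way to check is to bucket revision sizes by second and compare the busiest second with the mean. The function names and the sample events below are made up for illustration; real input would come from the revision data discussed above.

```python
from collections import defaultdict
from datetime import datetime


def bytes_per_second(events):
    """Sum revision sizes per whole second. `events` yields (timestamp, size_bytes)."""
    totals = defaultdict(int)
    for timestamp, size_bytes in events:
        totals[timestamp.replace(microsecond=0)] += size_bytes
    return dict(totals)


def peak_and_mean(events):
    """Return (peak, mean) bytes/second over seconds that saw at least one event."""
    totals = bytes_per_second(events)
    if not totals:
        return 0, 0.0
    values = list(totals.values())
    return max(values), sum(values) / len(values)


# Made-up sample: one 4 MB edit landing in the same second as a typical edit.
sample_events = [
    (datetime(2021, 1, 1, 0, 0, 0, 100_000), 20_000),
    (datetime(2021, 1, 1, 0, 0, 0, 900_000), 4_000_000),
    (datetime(2021, 1, 1, 0, 0, 1), 1_000),
]

peak, mean = peak_and_mean(sample_events)
print(f"peak {peak:,} bytes/s, mean {mean:,.0f} bytes/s")
```

Note that the mean here ignores idle seconds, so it overstates the true average rate. For the capacity question the peak, and how long it is sustained, is the figure that matters.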
[21:19:08] 10Data-Engineering-Kanban, 10Data-Catalog: Spike: Evaluate interaction of manual description edits and automatic description reimport - https://phabricator.wikimedia.org/T307717 (10Milimetric) TODO: validate with @EChetty that the description here is what we want to evaluate (it looks more like what we want to... [21:20:51] 10Data-Engineering-Kanban, 10Data-Catalog: Spike: Evaluate datahub schema versioning support - https://phabricator.wikimedia.org/T307716 (10Milimetric) [21:22:41] 10Data-Engineering-Kanban, 10Data-Catalog: Spike: Evaluate datahub schema versioning support - https://phabricator.wikimedia.org/T307716 (10Milimetric) [21:27:11] 10Data-Engineering-Kanban, 10Data-Catalog: Spike: Evaluate datahub schema versioning support - https://phabricator.wikimedia.org/T307716 (10Milimetric)