[06:28:25] Data-Engineering, Data-Engineering-Kanban, Data-Services, Patch-For-Review: Move wikireplicas dbproxy haproxy config to etcd - https://phabricator.wikimedia.org/T304478 (razzi) By declaring the host -> ip mappings using https://github.com/kelseyhightower/confd/blob/master/docs/templates.md#map ea...
[09:06:07] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.849 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[09:26:59] ah nice, this is probably due to the mw train, it matches with the rise of eventlogging_QuickSurveyInitiation errors
[09:28:35] '.event.performanceNow' should be integer
[09:31:31] elukey: I think I follow. What should we do about it? Update this? https://github.com/wikimedia/schemas-event-secondary/blob/master/jsonschema/analytics/legacy/quicksurveyinitiation/current.yaml
[09:32:19] btullis: o/
[09:32:23] Are we expecting a float now?
[09:33:50] I think that the js code or similar now sends something that is not an integer
[09:33:58] the task for the last train is https://phabricator.wikimedia.org/T300204
[09:34:23] we could follow up with hashar :)
[09:35:05] ah yeah I see, on logstash the events that fail validation have "performanceNow":1648718876207.3
[09:35:09] so a float
[09:35:29] there is probably a related change that went out with the train
[09:37:01] I think it is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/QuickSurveys/+/768194
[09:37:37] and probably https://phabricator.wikimedia.org/T303740
[09:37:41] I'll add a note
[09:39:16] Thank you. I don't know how you find these so quickly :-) I haven't had much exposure to the MW train yet.
[09:40:34] btullis: ah so I was super lucky, I tried to look for "performanceNow" in the search bar of gerrit, and picked the last change (that had an interesting commit msg)
[09:41:17] not entirely sure how to track down all the changes that went out with the train, but there should be a way
[09:41:41] I think that they will likely change the schema as you proposed earlier on
[09:41:56] are we dropping all the events now?
[09:42:25] I expect so.
[09:42:55] So probably 'unbreak now' ?
[09:43:58] let's verify if the msgs are completely down, I checked https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging_QuickSurveyInitiation but it looks ok, weird
[09:44:10] maybe it is not the right one
[09:45:29] the msgs rejected per second are relatively low, I suspect this could be an experiment or similar
[09:46:15] Yeah, or maybe the messages containing 'performanceNow' are only a subset of them. They don't appear to be a required element.
[09:46:31] yeah
[09:46:38] phuedx: o/ around ?
[11:25:35] (CR) Vivian Rook: [C: -1] "Marking -1 to not merge until flask error is investigated." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/773578 (https://phabricator.wikimedia.org/T100982) (owner: Vivian Rook)
[11:48:50] Data-Engineering, Data-Services, User-Ladsgroup: Make linktarget table visible on cloud wiki replicas - https://phabricator.wikimedia.org/T305064 (Bugreporter) >>! In T305064#7818619, @Ladsgroup wrote: > Yes. It should have a view similar to actor. Instead we may have a script to regularly purge unu...
[12:01:13] elukey: btullis: I am here
[12:02:03] should we roll back the train?
[12:14:20] hashar: I don't *think* so, thanks. I think that it's just a relatively small number of events being dropped, but someone probably needs to update a schema to match the events being generated.
[12:26:35] good :]
[12:27:10] I can't see how a javascript Date.now() or window.performance.timing.now() would suddenly become a float though
[12:59:39] o/ btullis elukey generally if an instrumentation change causes event validation errors, we just make a task and then tag the relevant people / teams, and then let them resolve it
[12:59:45] it should only be affecting their stream
[13:00:05] ack!
[13:00:14] I commented in their task, didn't open a new one
[13:00:22] it'd be really cool if we could automate that task creation somehow, but to do that we'd need ownership info about streams, which we don't have now, but mayyybe one day in data catalog!
[13:00:27] oh they had one, that's good
[13:00:28] thanks!
[13:01:46] ottomata: not sure if it was the right one, I found https://phabricator.wikimedia.org/T303740
[13:02:58] Big news - as per https://phabricator.wikimedia.org/T298087#7821434 the analytics firewall is no longer a thing. It has ceased to be.
[13:07:19] \o/
[13:08:19] ottomata: Thanks for the clarification. Makes sense.
[13:09:07] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:15:12] uhhh a-team: https://phabricator.wikimedia.org/T298087#7821434
[13:15:13] wow
[13:15:16] 99c wow
[13:15:18] also elukey ^
[13:16:04] Let's talk to elastic directly dcausse!
[13:16:14] wow
[13:17:07] all doors are now open? :P
[13:18:39] looks like so dcausse :)
[13:19:37] fresh air!
[13:20:02] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:20:54] wow indeed from stat machines: curl -I https://search.svc.eqiad.wmnet:9243 -> HTTP/1.1 200 OK :)
[13:26:22] milimetric is Hive2Graphite an actual suggestion or just a test comment?
[13:26:36] i was about to leave a 'WHY?!?!' comment but maybe you didn't mean it! :p
[13:36:50] ottomata: oh definitely didn't mean that, was just showing Sandra how commenting works in gerrit. Sorry for the accidental trolling
[13:40:26] haha okay phew you succeeded in accidentally trolling me
[13:40:35] XD
[13:51:01] Data-Engineering, Infrastructure-Foundations, Product-Analytics, Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (ayounsi) On that second part, we discussed it within Infrastructure Foundation. With the webproxies (and url-downloa...
[13:54:34] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of rhuang-ctr - https://phabricator.wikimedia.org/T302194 (mforns) a:JAllemandou→mforns
[13:54:54] (CR) Joal: "A bunch of comments - I'm nitpicky but open to discussion :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/774383 (https://phabricator.wikimedia.org/T300039) (owner: Aqu)
[13:56:14] Analytics, Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (mforns)
[14:05:29] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of rhuang-ctr - https://phabricator.wikimedia.org/T302194 (CMacholan) Apologies for the delay -- yes this data can be deleted. Thanks!
[14:20:41] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of clarakosi - https://phabricator.wikimedia.org/T304065 (WDoranWMF) @Snwachukwu Everything here can go. Thanks.
[14:20:46] Analytics, API Platform: AQS 2.0 local tests fail when mwcli is running - https://phabricator.wikimedia.org/T304735 (BPirkle) If we choose to tweak the README, also consider making the "Wait until you see the 'Startup complete' log message, then in another terminal, bootstrap the database schema and samp...
[14:25:10] Data-Engineering, Infrastructure-Foundations, Product-Analytics, Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Isaac) Chiming in as a heavy user of the stat boxes. It's difficult for me to follow this conversation so I'm mainly...
[14:57:16] (CR) Ottomata: Add archiving job for Airflow (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/774383 (https://phabricator.wikimedia.org/T300039) (owner: Aqu)
[15:01:05] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of clarakosi - https://phabricator.wikimedia.org/T304065 (mforns)
[15:09:17] Data-Engineering, Infrastructure-Foundations, Product-Analytics, Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Ottomata) > malware accidentally downloaded (compromised library dependency, infected executable, etc) could easily "...
[15:10:35] mforns: wanna talk a bit about the refactor? I think it's great, btw, I'm just not sure how to best interact with it
[15:11:19] Data-Engineering, Data-Services, User-Ladsgroup: Make linktarget table visible on cloud wiki replicas - https://phabricator.wikimedia.org/T305064 (Ladsgroup) That is definitely in the medium-term work (=in a couple of months) to avoid bloating the table but it has complexities (race condition between...
[15:13:41] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 3.889 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[15:14:19] Analytics, Data-Engineering, Data-Engineering-Kanban, Event-Platform, and 2 others: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata)
[15:14:28] Analytics, Data-Engineering, Data-Engineering-Kanban, Event-Platform, and 2 others: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata) Done! Thank you so much!
[15:20:52] milimetric: mforns just curious...what refactor?!
[15:26:56] ???
[15:27:05] ahhh!
[15:27:15] milimetric: yes! batcave if you want!
[15:27:36] ottomata: the refactor of Airflow DAGs that is in code review right now
[15:27:48] oh i haven't seen it can I?
[15:28:05] ottomata: ofc! https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/38
[15:28:09] got it
[15:28:22] please review!
[15:32:10] Analytics, Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (mforns) I deleted the HDFS data. The only remaining data to remove is the one in stat1006.
[15:34:53] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of clarakosi - https://phabricator.wikimedia.org/T304065 (mforns) I deleted HDFS data. Only remaining data to remove is the stat1004 and stat1005 one. Please, @razzi or @BTullis, can you delete that? :]
[15:36:03] Data-Engineering, Infrastructure-Foundations, Product-Analytics, Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (BTullis) >>! In T300977#7821926, @Ottomata wrote: > > I appreciate the intention here, but I'm not sure if the combo...
[15:36:11] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of rhuang-ctr - https://phabricator.wikimedia.org/T302194 (mforns) Please, @BTullis or @razzi, can you delete this stat1005 data? :]
[15:36:40] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of clarakosi - https://phabricator.wikimedia.org/T304065 (mforns) a:Snwachukwu→mforns
[15:40:18] Data-Engineering, Infrastructure-Foundations, Product-Analytics, Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Ottomata) > its primary goal is limiting the capability of any such malware to 'phone home' to a command & control en...
[15:40:37] mforns: added a couple of nits, looking forward to the task group refactor
[15:40:51] ok, thank you!
[15:40:57] hmm one q
[15:41:09] so, is variable_name meant to be distinct per dag?
[15:42:37] actually okay i do have a comment/q will just add on MR
[16:01:27] ottomata: standup!
[16:02:04] OooO
[16:59:37] Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation), Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (bd808) >>! In T304733#7809592, @Marostegui wr...
[17:29:21] am rerunning the failed el legacy job
[17:49:57] ottomata: there's still this alert for eventgate validation error rate, that I have not looked into yet... I searched for docs: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#Validation_Errors but the links are broken, can you help me? :]
[18:33:51] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of rhuang-ctr - https://phabricator.wikimedia.org/T302194 (razzi) Open→Resolved Ok, I have removed the data on each stat host and also did a hdfd -rmdir on the empty hive database as well.
[18:38:03] Analytics, Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (razzi) Open→Resolved a:razzi I removed the data in stat1006. Thanks everyone.
[18:52:13] (VarnishkafkaNoMessages) firing: varnishkafka for instance cp3054:9132 is not logging cache_text requests from statsv - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=statsv&var-cp_cluster=cache_text&var-instance=cp3054:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:54:05] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of clarakosi - https://phabricator.wikimedia.org/T304065 (razzi) Open→Resolved I removed stat100* directories. All done!
[19:04:30] Analytics-Wikistats, Data-Engineering, Data-Engineering-Kanban: Broken "tooltip-breakdown-automated" tooltip on Wikistats 2 - https://phabricator.wikimedia.org/T303990 (Milimetric) Open→Resolved
[19:04:48] Analytics, Analytics-Wikistats, Data-Engineering, Data-Engineering-Kanban: Confusing filtering on "Active editors by country" topic - https://phabricator.wikimedia.org/T300365 (Milimetric) Open→Resolved
[19:07:13] (VarnishkafkaNoMessages) firing: (2) varnishkafka for instance cp3050:9132 is not logging cache_text requests from statsv - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:24:04] (PS2) AGueyte: Add new event action [schemas/event/secondary] - https://gerrit.wikimedia.org/r/774980 (https://phabricator.wikimedia.org/T296428)
[19:26:29] (CR) jerkins-bot: [V: -1] Add new event action [schemas/event/secondary] - https://gerrit.wikimedia.org/r/774980 (https://phabricator.wikimedia.org/T296428) (owner: AGueyte)
[19:30:35] (PS3) AGueyte: Add new event action [schemas/event/secondary] - https://gerrit.wikimedia.org/r/774980 (https://phabricator.wikimedia.org/T296428)
[19:32:12] (VarnishkafkaNoMessages) firing: (5) varnishkafka for instance cp1075:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:43:59] mforns: both btullis and elukey were looking at that earlier today
[19:44:11] i do not know logstash links always break
[19:44:19] but
[19:44:58] mforns: https://phabricator.wikimedia.org/T303740#7820905
[19:47:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp1075:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:14:14] Analytics-Radar, Anti-Harassment, CheckUser, Privacy Engineering, and 5 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (Etonkovidova)
[20:26:10] thanks ottomata
[20:46:20] Analytics-Clusters, Data-Engineering, Superset, Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (razzi) I'm thinking about requiring superset-next to have 2fa for 2 reasons: - since it's the staging environment for superset...
[21:22:13] (VarnishkafkaNoMessages) resolved: (6) varnishkafka for instance cp1075:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[23:48:12] (VarnishkafkaNoMessages) firing: ...
[23:48:12] varnishkafka for instance cp3050:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3050:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[23:51:50] Analytics-Clusters, Data-Engineering, Superset, Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (MoritzMuehlenhoff) Have a look at what we do with Puppetboard, the same scheme would seem like a solution for Superset as well...
[23:53:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp3050:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[23:58:12] (VarnishkafkaNoMessages) resolved: (3) varnishkafka for instance cp3050:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages