[05:36:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:41:12] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:46:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:36] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:16:02] Data-Engineering-Kanban, Data Engineering Planning, Event-Platform Value Stream, Patch-For-Review: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (Ottomata) Great stuff! TY Sam! I think th...
[08:06:47] PROBLEM - SSH on analytics1073.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:37:16] !log cold-reset BMC device on analytics1073
[08:37:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:43:48] Hello a-team (sorry, I wrote this message on the weekend, which may not have been the best time to contact you). I'm currently working on a project to analyze and identify vandalism on frwiki, and in order to do this I'm trying to set up a Spark cluster with the revision history as a data source. What I currently do is download the latest XML dumps and convert them to Parquet files. This process is very memory- and bandwidth-intensive, so I was wondering whether one of you has already worked on this, or maybe the WMF has a Parquet dataset I haven't heard of?
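Since the dumps-to-Parquet conversion described above is the starting point of this thread, here is a minimal sketch of one way to do it with Spark. It assumes the third-party spark-xml package (com.databricks:spark-xml) and the standard pages-meta-history dump layout; the paths, package version, and selected fields are illustrative, not the asker's actual pipeline, and the nested field names depend on spark-xml's schema inference.

```python
# Minimal sketch: convert a MediaWiki XML history dump to Parquet with Spark.
# Assumes the spark-xml package is available; all paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = (
    SparkSession.builder
    .appName("frwiki-dump-to-parquet")
    # Pull in the XML data source; the version here is an example only.
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0")
    .getOrCreate()
)

# Each <page> element becomes one row; in a full-history dump the
# <revision> elements are inferred as an array nested under it.
pages = (
    spark.read.format("xml")
    .option("rowTag", "page")
    .load("/data/dumps/frwiki-latest-pages-meta-history.xml")
)

# Explode revisions so each row is one (page, revision) pair, keeping only
# the fields a vandalism analysis needs, to limit output size.
# Note: spark-xml exposes the text content of <text> (which carries
# attributes) under the default value tag `_VALUE`.
revisions = (
    pages.select(col("title"), col("id").alias("page_id"),
                 explode("revision").alias("rev"))
    .select("title", "page_id",
            col("rev.id").alias("rev_id"),
            col("rev.timestamp").alias("rev_timestamp"),
            col("rev.text._VALUE").alias("wikitext"))
)

revisions.write.mode("overwrite").parquet("/data/parquet/frwiki_revisions")
```

Even so, this stays memory- and bandwidth-heavy, which is the asker's complaint, and motivates the pre-built datasets suggested next.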
[08:50:56] Ywats0ns: Hi! Do you have access to our Hadoop Data Lake? We already have the mediawiki_history data set, which seems to be very suitable for what you're doing. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history
[08:56:24] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:56:59] hey @btullis thanks for the quick answer! No, I do not have access to the data lake, is there a way to request it? And the edit history you sent does not seem to contain the revision text, which I need for this project, or maybe I missed something?
[09:07:02] RECOVERY - SSH on analytics1073.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:17:37] Ywats0ns: Ah, sorry, I didn't twig that you needed the revision text. We have the Mediawiki_wikitext_current dataset available in Avro format on the Hadoop cluster, with details here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Mediawiki_wikitext_current
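For orientation, here is a minimal sketch of how that dataset could be queried with Spark SQL from an analytics client node, assuming the wmf.mediawiki_wikitext_current table layout documented on the wikitech page just linked (snapshot and wiki_db partitions, a revision_text column). The snapshot value and the target domain are placeholders; verify column names with DESCRIBE before relying on them.

```python
# Minimal sketch: scan current frwiki wikitext for a hypothetical spam domain.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("frwiki-external-links")
    .enableHiveSupport()
    .getOrCreate()
)

spam_hits = spark.sql("""
    SELECT page_id, page_title, revision_id, user_text
    FROM wmf.mediawiki_wikitext_current
    WHERE snapshot = '2022-08'   -- placeholder: use the latest available snapshot
      AND wiki_db = 'frwiki'
      AND revision_text RLIKE 'example-spam-site\\\\.com'  -- hypothetical domain
""")
spam_hits.show(20, truncate=False)
```

Note that this only sees the current revision of each page, so links already reverted would need the history-based data discussed later in the thread.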
[09:20:32] As far as access to the data lake is concerned, you'll need to have production shell access, and then you need to request access to the data lake as explained here: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Requesting_access
[09:22:32] If that's not appropriate, then perhaps we can help you to devise a more efficient pipeline for working with the monthly dumps, as you are currently doing. It depends a little on the nature of the project you're working on.
[09:23:26] btullis Thanks a lot, that seems to be what I need! However, because I do not work for the WMF and am "just" a frwiki volunteer, do you think my request has any chance of being accepted?
[09:25:14] What I want to do with this project is to identify users who insert links to specific external websites, as we've had several issues with promotional teams inserting spam on frwiki. The goal is to discover malicious paid contributors and take action on them.
[09:27:42] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:34:17] Data-Engineering-Kanban, Data Engineering Planning, Event-Platform Value Stream, Patch-For-Review: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (phuedx) >>! In T286344#8227782, @Ottomata w...
[09:34:39] Thanks for the context. I know that in order for your shell access request to be considered, you would have to be collaborating with someone within the WMF. As per: https://wikitech.wikimedia.org/wiki/SRE/Production_access#Add_a_volunteer_to_an_access_group
[09:34:39] Do you have someone that you're working with already?
[09:41:48] I have not had the opportunity to work with anybody at the WMF yet. Is there a list of people I could contact who would be interested in vandalism analysis?
[10:26:42] ywats0ns: I'm not 100% sure, but maybe the Trust & Safety team would be a good place to start? https://meta.wikimedia.org/wiki/Trust_and_Safety
[10:32:25] It seems that this team is more dedicated to user safety (harassment, censorship, ...) than content integrity. Is there a team dedicated to this?
[10:32:52] Or if I have the support of a member of the French Wikimedia chapter, would that be enough?
[10:47:50] ywats0ns: You may wish to reach out to some members of the research team, as they have a knowledge integrity programme that may be relevant: https://research.wikimedia.org/knowledge-integrity.html
[11:04:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp3057 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[11:09:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp3057 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[11:50:47] Thanks btullis, I'll contact them
[13:24:58] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:58:51] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:39:42] Data-Engineering, Equity-Landscape: Milestone: Dashboard Interaction Map Complete - https://phabricator.wikimedia.org/T305477 (KCVelaga_WMF) a:KCVelaga_WMF→okwiri_oduor @okwiri_oduor will be working on creating the initial wireframe.
[15:58:32] Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (Milimetric)
[16:24:58] ywats0ns: I may have something that helps. We have instrumented link changes since 2019-02. So for every edit that adds or removes links, we have a row in a table that says user X added/removed these links, along with the link text and a boolean on whether each link is internal or external
[16:25:12] ywats0ns: the schema of those events is https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/event/primary/+/refs/heads/master/jsonschema/mediawiki/page/links-change/1.0.0.yaml
[16:25:28] ywats0ns: and the reason this might be helpful is that we deemed this data safe to keep forever: https://github.com/wikimedia/analytics-refinery/blob/master/static_data/sanitization/event_sanitized_main_allowlist.yaml#L7
[16:26:53] so even though it's not released publicly, I think it might be a smooth process. And you're not the first to ask for it.
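To make the links-change suggestion concrete, here is a minimal sketch of the kind of query that data enables: finding users who added given external domains on frwiki. It assumes the sanitized events land in an event_sanitized.mediawiki_page_links_change Hive table whose rows match the linked schema (an added_links array with link and external fields, plus performer.user_text); the table name and the domain are assumptions to verify against the actual cluster.

```python
# Minimal sketch: who added links to a hypothetical spam domain on frwiki?
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("frwiki-spam-link-adders")
    .enableHiveSupport()
    .getOrCreate()
)

added = spark.sql("""
    SELECT performer.user_text AS user,
           page_title,
           meta.dt             AS event_time,
           added.link          AS added_link
    FROM event_sanitized.mediawiki_page_links_change
    -- Flatten the added_links array: one output row per added link.
    LATERAL VIEW explode(added_links) a AS added
    WHERE `database` = 'frwiki'
      AND added.external = true
      AND added.link LIKE '%example-spam-site.com%'  -- hypothetical target domain
""")

# Rank users by how often they inserted the domain.
added.groupBy("user").count().orderBy("count", ascending=False).show(20)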
[16:29:06] Thanks milimetric, that seems really interesting! Is this something that can only be accessed through the internal analytics data lake?
[16:32:06] Data-Engineering, SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (Milimetric)
[16:33:15] ywats0ns: right now, yes, that data is only internal. But we should be able to release it. I'm asking our privacy team to see how many other folks requested it and what they think.
[16:33:47] ywats0ns: in some cases we can do a one-off release. If this data would solve your use case, what date ranges are you interested in?
[16:35:35] My use case is to analyze the vandalism data, to try to understand whether there's any knowledge we can use to efficiently block some of it. So it would indeed help solve the spammer use case, but I would have loved to have access to the analytics cluster to also do some analysis in order to improve the edit filters (but hey, one problem at a time)
[16:36:15] milimetric I'm not sure about the date ranges, because I have a list of URLs for which I would like to know the users who inserted them, and I don't know when they were inserted. I guess the older the better
[16:36:37] Btw, do you know anybody in the research team who works on this kind of topic?
[16:38:00] ywats0ns: I think they're all generally interested in this, but they usually have pretty full workloads. I've been told in the past that setting up a formal collaboration can take a while.
[16:38:38] getting you access to the cluster seems like a good idea here, I'll bring it up more broadly with our team; we haven't had too many interested folks in the last few years.
[16:42:15] I would love to! If it helps, I already have access to the Toolforge k8s cluster, so I guess I'm already halfway in? '=D
[16:42:58] yep :) I hope to someday bring all this data to the toolforge cluster, I always figured that would spawn cool collaborations
[16:45:48] Data-Engineering, Data Pipelines: [airflow] Normalize the use of timeouts in Airflow DAGs - https://phabricator.wikimedia.org/T317549 (mforns)
[16:45:57] joal ^
[16:53:48] Clearly, but this may need a lot of resources
[16:56:37] Data-Engineering, Equity-Landscape: Load country data - https://phabricator.wikimedia.org/T310712 (JAnstee_WMF) @Mayakp.wiki Sorry I did not see these in my inbox. Q: Is there a difference between these 2 columns? or are they redundant as well ? iso2_country_code iso3166_1_alpha_2_code FYI, iso3166_1_...
[17:07:16] Oh btw milimetric, do you know how long getting this access could take, and whether I can be of any help in the process? Thanks a lot
[17:14:00] Data-Engineering, SRE, SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (odimitrijevic) Approved
[18:02:46] ywats0ns: usually this kind of request is denied because it gives access to a lot of data, and because it involves training the person. In your case I'm somewhat hopeful, as you know your way around Spark and you're already on Toolforge.
[18:08:43] Awesome, thanks a lot!
[18:10:34] Data-Engineering, SRE, SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall) p:Triage→Medium a:BCornwall
[18:12:08] Data-Engineering, Equity-Landscape: Load country data - https://phabricator.wikimedia.org/T310712 (JAnstee_WMF) @Mayakp.wiki Sorry I did not see these in my inbox. Q: Is there a difference between these 2 columns? or are they redundant as well ? iso2_country_code iso3166_1_alpha_2_code FYI, iso3166_1_...
[18:13:36] Data-Engineering, Equity-Landscape: Load country data - https://phabricator.wikimedia.org/T310712 (JAnstee_WMF) a:JAnstee_WMF→ntsako
[18:24:18] Quarry: Comment on phabricator task on github update - https://phabricator.wikimedia.org/T317566 (rook)
[18:46:58] Quarry: Comment on phabricator task on github update - https://phabricator.wikimedia.org/T317566 (rook)
[19:19:40] Data-Engineering-Kanban, Data Engineering Planning, SRE, serviceops, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (gmodena) Hi - what is the status of the linked CR? >>! In T303543#7768019, @gerritbot wrote: > Change 738578 had a related p...
[20:12:46] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (BCornwall) Stalled→Resolved I'm going to mark this as resolved since no verification has occurred. If there's any unfin...
[20:24:14] Data-Engineering, LDAP-Access-Requests, SRE, SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall)
[20:28:27] Data-Engineering, LDAP-Access-Requests, SRE, SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (Dzahn) > SSH: configured to access all our servers, including an-launcher1002 We can't be sure what the definition of "all our servers" is. In gener...
[20:39:18] Data-Engineering, LDAP-Access-Requests, SRE, SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall)
[20:42:21] Data-Engineering, LDAP-Access-Requests, SRE, SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall) Thanks for the clarification, @Dzahn! Unless there's dissent, I'll just add them to the analytics-admins group as was suggested. @Milimetri...
[20:53:59] Data-Engineering, SRE, SRE-Access-Requests, Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall)
[20:56:05] (CR) Mforns: [V: +2] "Thanks for the comments! I changed all the suggestions." [analytics/refinery] - https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: Mforns)
[20:56:18] (PS2) Mforns: Migrate unique devices queries to SparkSql and move to /hql [analytics/refinery] - https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841)
[20:56:34] (CR) Mforns: [V: +2] Migrate unique devices queries to SparkSql and move to /hql [analytics/refinery] - https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: Mforns)
[21:18:09] Data-Engineering, SRE, SRE-Access-Requests, Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall) From the CR which is currently not approved: > from a glance at hieradata this groups includes a LOT of things and the access request was for "...
[21:32:50] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[21:42:50] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength