[00:38:11] Data-Engineering (Sprint 9): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694#9562018 (Ahoelzl)
[00:51:49] Data-Engineering (Sprint 9): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694#9562039 (Ahoelzl) For sync meeting updates from 2/20, see project document. Traffic team is working on configuration management and will be ready for stream production...
[00:55:42] milimetric: hola hola, where did we move the jobs that do the quality checks on webrequest data?
[01:22:02] Analytics-Radar, Data-Engineering, Data-Platform-SRE, SRE, and 2 others: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#9562078 (Ottomata) Just came across https://www.jikkou.io/docs/tutorials/get_started/ . Worth a look! - https://www.jikkou.io/docs/prov...
[01:30:28] Data-Engineering, Epic: Dataset Config Store - https://phabricator.wikimedia.org/T354557#9562081 (Ottomata) Worth investigating? https://datacontract.com/ Have we looked around to see if there are existing 'dataset' config formats/specs we can already use?
[07:44:07] Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, netbox: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655#9562400 (jcrespo) an-db backups looking good: ` ✔️ r...
[08:08:09] Data-Platform-SRE (2024.02.12 - 2024.03.03), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9562410 (MoritzMuehlenhoff)
[09:50:52] Data-Platform-SRE (2024.02.12 - 2024.03.03): Decommission cloudelastic1001-1004 - https://phabricator.wikimedia.org/T357780#9562606 (Gehel) Open→Resolved a:Gehel DC ops steps are tracked in T358046, we can close this.
[09:51:04] Data-Platform-SRE (2024.02.12 - 2024.03.03): Decommission cloudelastic1001-1004 - https://phabricator.wikimedia.org/T357780#9562612 (Gehel) a:Gehel→RKemper
[09:55:12] Data-Platform-SRE (2024.02.12 - 2024.03.03): Set requests (not limits) for cirrus-streaming-updater in k8s - https://phabricator.wikimedia.org/T348350#9562623 (Gehel) a:brouberol→bking Moving back to in progress (this isn't really blocked, just waiting for an answer / discussion within our team) and...
[09:55:19] Data-Platform-SRE (2024.02.12 - 2024.03.03): RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T356313#9562627 (Gehel) a:bking
[09:57:00] Data-Platform-SRE (2024.02.12 - 2024.03.03), Discovery-Search (Current work), Documentation: Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303#9562634 (Gehel) a:EBernhardson
[09:59:21] brouberol: I think we talked about this (T320926) before, but I can't remember what the issue was, or if it is related to our current work on superset. My limited understanding from the comments is that the issue has been fixed, probably with one of the newer versions of superset we deployed since this was reported. Could you confirm and update the task or close it?
[09:59:21] T320926: wmf.webrequest: 'presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file ...' - https://phabricator.wikimedia.org/T320926
[10:07:43] o/ brouberol btullis are you available for the sync today?
[10:08:14] oops, sorry, missed notification. OMW
[10:10:13] gehel: I'll have a look after the sync
[11:22:19] (PS44) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[11:23:43] (CR) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[11:37:27] Hey peeps, looking for info on https://phabricator.wikimedia.org/T357785 is there any particular care to be taken when moving mw-page-content-change-enrich backend to mw-on-k8s, or can I just helmfile apply the changes?
[11:40:35] claime: Looking now. The most knowledgeable peeps in particular about this app are ottomata and gmodena, but neither of them are showing as around at the moment. I'm happy to have a look.
[11:43:15] btullis: as far as I can tell the checkpoint is in swift, and I'm not touching the app config, just the envoy listener backend, so it should be fine
[11:51:37] claime: Yes, I agree. So it's just the contents of the flink-app-main-envoy-config-volume that's being changed, but the whole flink-app-main pod will be restarted, right? I just need to be confident that the checkpoint based recovery is the right approach and that I shouldn't trigger a manual savepoint and then stop the cluster.
[11:52:59] btullis: yes, the pod will restart
[11:53:11] well technically it will be destroyed and recreated
[11:53:46] Yep yep. How urgent is it, from your perspective?
[11:57:24] It's not urgent, it's just one of the stragglers that we haven't moved to a mw-on-k8s backend. for reference, the flink-app-main pod currently has been up 14 days, last restart was to increase max.request.size
[11:58:09] I do hope you don't need to stop the cluster to redeploy the app, that'd kinda defeat the purpose of running on k8s
[11:58:54] Or when you say the cluster it's creating a savepoint inside of the mw-page-content-change-enrich and cluster refers to that app? I have no idea how flink apps work
[12:05:07] I'm not super-familiar, which is the only reason I'm hesitant. We have both checkpoints and savepoints. Checkpoints are created every 30 seconds and are generally used for recovery. Savepoints are like a named, point-in-time recovery. You can configure the startup behaviour to use a particular savepoint if that's required.
[12:07:26] ah right, both o.ttomata and g.modena are ooo, that would explain why I haven't had an answer on phab
[12:08:03] Gabriele is out this week and Andrew is out on parental leave. I guess I'm happy for you to go ahead. I can keep an eye on it.
[12:09:09] all right *cracks knuckles*
[12:09:22] I have this that I can monitor: https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-flink_job_name=mw_page_content_change_enrich&var-operator_name=All
[12:09:54] I have a quick question for you too, when we're done. If that's OK ;-)
[12:09:58] ofc
[12:12:41] btw, the max.request.size change hadn't been deployed to eqiad
[12:12:52] So it'll be deployed at the same time
[12:13:00] Data-Engineering, MW-on-K8s, SRE, serviceops, Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9563142 (BTullis) I'm happy for this change to go ahead. I'll keep an eye on the [[https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink...
[12:13:00] Ack. Thanks.
[12:14:40] staging done, proceeding with eqiad
[12:16:19] eqiad done, I'll give it a minute so we can see if anything happens on the dashboard
[12:18:38] Looks good to me. Successfully pulled checkpoints. `"INFO","message":"Successfully ran initialization on master in 0 ms.", "ecs.version": "1.2.0","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.jobmaster.JobMaster"`
[12:19:40] ok proceeding with codfw then
[12:27:11] * btullis gets logged out by grafana for no reason
[12:28:07] On my side I can confirm flink is now requesting from mw-api-int and not the bare metal cluster anymore
[12:28:18] https://grafana.wikimedia.org/goto/slq2bHoSk?orgId=1
[12:28:21] Great!
[12:32:03] Yep, looks good to me too.
[12:33:00] awesome, thank you :)
[12:33:10] Data-Engineering, MW-on-K8s, SRE, serviceops: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9563249 (Clement_Goubert) In progress→Resolved I can confirm that mw-page-content-enrich now requests from mw-api-int (blue) and not the appserver...
[14:26:50] Data-Engineering, Data Products, Observability-Logging, Traffic: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105#9563576 (Fabfur)
[14:29:01] Data-Engineering, Data Products, Observability-Logging, Traffic: Change mtail configuration to ignore new fields in HAProxy logs - https://phabricator.wikimedia.org/T358107#9563609 (Fabfur)
[14:38:52] Data-Engineering, Data Products, Observability-Logging, Traffic: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105#9563660 (Fabfur)
[14:41:57] Data-Engineering, Data Products, Observability-Logging, Traffic: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109#9563669 (Fabfur)
[15:58:48] (PS13) Joal: Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[16:03:45] (CR) Bernard Wang: [C: +2] Adds new field to webA11y schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1005201 (https://phabricator.wikimedia.org/T356335) (owner: Kimberly Sarabia)
[16:17:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[16:53:40] Data-Engineering, Data Products, Movement-Insights: Mediawiki_wikitext_history job often has long gaps between stages - https://phabricator.wikimedia.org/T357873#9564300 (Antoine_Quhen) Some research: * Each XML dumps snapshot may represent ~5.5TB (including ~1.8TB for wikidata and 1.4TB for enwiki)...
[17:13:01] Data-Engineering, Metrics Platform Backlog, Data Products (Data Products Sprint 09), Spike: [SPIKE] Remove mentions of MetricsClient#dispatch() and the monoschema from documentation - https://phabricator.wikimedia.org/T355046#9564391 (WDoranWMF) a:phuedx
[17:13:22] Data-Engineering, Metrics Platform Backlog, Data Products (Data Products Sprint 10), Spike: [SPIKE] Remove mentions of MetricsClient#dispatch() and the monoschema from documentation - https://phabricator.wikimedia.org/T355046#9459306 (WDoranWMF)
[17:18:46] Data-Engineering, Metrics Platform Backlog, Data Products (Data Products Sprint 10), Event-Platform (Sprint 09): [SPIKE] Draft of Mediawiki extension proposal for Metrics Platform Instrumentation (& Experimentation) - https://phabricator.wikimedia.org/T355599#9564429 (WDoranWMF)
[17:26:20] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 10), Technical-Debt: Fix public documentation for mw.eventLog.submit() and dispatch() - https://phabricator.wikimedia.org/T357003#9564475 (WDoranWMF)
[17:32:27] Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 09): Missing contributor stats for Singapore - https://phabricator.wikimedia.org/T344624#9564540 (WDoranWMF) @VirginiaPoundstone Would you mind saying more about why you pulled this into the sprint? (Note: Virginia is o...
[18:07:50] Data-Engineering (Sprint 9), Data Products, Movement-Insights: Skip Wikidata when loading XML dumps to the Data Lake - https://phabricator.wikimedia.org/T357859#9564718 (JAllemandou) a:JAllemandou
[18:12:58] Data-Engineering (Sprint 9), Data Products, Movement-Insights: Skip Wikidata when loading XML dumps to the Data Lake - https://phabricator.wikimedia.org/T357859#9564731 (JAllemandou) Implementation plan: * Add a new `skip` option in https://github.com/wikimedia/analytics-refinery/blob/master/bin/impo...
[18:19:15] Data-Engineering (Sprint 9): Turn off ReportUpdater jobs no longer used - https://phabricator.wikimedia.org/T357419#9564740 (JAllemandou) a:JAllemandou
[18:47:24] Data-Platform-SRE, Data-Platform: Review the Airflow instance security settings to ensure that they are still suitable - https://phabricator.wikimedia.org/T358137#9564827 (BTullis)
[18:47:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:15:00] Data-Engineering: NEW BUG REPORT - Pageviews Missing Hourly Partition - https://phabricator.wikimedia.org/T358142#9564913 (Mgalo1)
[19:26:43] Data-Engineering (Sprint 9), Data Products, Movement-Insights: Skip Wikidata when loading XML dumps to the Data Lake - https://phabricator.wikimedia.org/T357859#9564947 (nshahquinn-wmf) Excellent! Don't forget to announce the plan first, just in case there is someone unexpectedly using the data; I re...
[19:44:05] Data-Engineering, Data-Platform-SRE: [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466#9564990 (xcollazo) @lbowmaker will the "dataset state store' work be done under this ticket, or are we closing this and opening a se...
[19:51:48] !log Rerun pageview_actor_hourly for hour 2024-02-20T07:00
[19:51:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:09:00] (PS2) Aleksandar Mastilovic: Add HQL file for CX report [analytics/refinery] - https://gerrit.wikimedia.org/r/1003928
[21:21:55] !log Update airflow variable for pageview_actor-hourly leading to 64 written files instead of 32 - this should ease the job resource consumption and prevent failures
[21:21:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:52:54] Data-Engineering, Pageviews-API: Missed pageview data over API - https://phabricator.wikimedia.org/T358132#9565366 (Bugreporter)
[22:41:42] Data-Platform-SRE (2024.02.12 - 2024.03.03), Wikidata, Wikidata-Query-Service: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455#9565526 (RKemper) @Loz.ross Yes the change to allowed endpoints did not get properly deployed; we've fixed that no...
[22:43:21] Data-Platform-SRE (2024.02.12 - 2024.03.03), Wikidata, Wikidata-Query-Service: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347#9565563 (RKemper) @Hannah_Bast Okay, we figured out what was making the allowed endpoints not updated properly. http...
[22:45:10] Data-Platform-SRE (2024.02.12 - 2024.03.03), Wikidata, Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488#9565578 (RKemper) @HinMar Okay, I think we've got the endpoints properly allowed. Queries appear to be working for...
[23:01:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:48:47] Data-Platform-SRE (2024.02.12 - 2024.03.03), Wikidata, Wikidata-Query-Service: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347#9565775 (Hannah_Bast) @RKemper Is your point that the queries should return a result? Neither DBLP nor Wikidata have...