[05:13:00] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10SGupta-WMF) Upon investigation , we concluded that this is a bug in AQS 2.0 media analytics service . It's missing... [06:54:48] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10gmodena) [06:55:31] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) a:03gmodena [07:29:37] (03CR) 10Joal: "One nit" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) (owner: 10Aqu) [07:35:30] 10Analytics-Radar, 10Data-Engineering-Icebox, 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, and 2 others: ORES hook integration with EventBus - https://phabricator.wikimedia.org/T201869 (10Aklapper) [07:35:40] 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-EventLogging, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Aklapper) Does this task serve any purpose in itself that is not covered by the #event-platform project... [07:36:54] 10Analytics-Radar, 10Data-Engineering-Icebox, 10MediaWiki-Action-API, 10Event-Platform, and 2 others: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (10Aklapper) [07:43:02] stevemunene btullis: Hello! I've added you as reviewers on https://gerrit.wikimedia.org/r/c/operations/puppet/+/965162. I think that to phase out these brokers, we need to remove them from puppet and then wait. 
We should see the # of established TCP connections drop over time as services restart. [08:03:25] btullis: I just saw your second comment, but was a bit trigger happy with the merge. Should we rollback first? [08:09:48] Yeah, probably. Just got the revert button in Gerrit. Sorry I misled you. [08:10:13] s/got/hit/ [08:11:05] revert PR https://gerrit.wikimedia.org/r/c/operations/puppet/+/966233 [08:11:59] brouberol: one nit - this bit https://gerrit.wikimedia.org/r/c/operations/puppet/+/965162/4/modules/role/manifests/kafka/jumbo/broker.pp may remove kafka mirror at all (at least IIUC) [08:12:17] we should leave the include in theory [08:12:48] gotcha, good spot. I'm reverting as we speak, and will rework this pr [08:13:02] oh yes yes it was just for the next version :) [08:13:25] reverted and puppet-merged [08:15:23] brouberol: nothing major, just a follow up to help with the follow ups - varnishkafka instances were being restarted by puppet, see https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cp_cluster=cache_text&var-instance=All&var-source=webrequest&from=now-1h&to=now [08:16:00] right, this, we actually expected [08:16:00] with the extra revert they will be restarted again in a close range, that is not a big deal but let's monitor all DCs to get into a stable state during the next hour or so [08:16:19] but agreed, I won't merge the reworked PR before a while [08:16:29] thanks! [08:16:54] super :) [08:18:42] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Sfaci) Hi, In the description of this ticket there is a list with some items and the text "is correct" or "is not... [08:21:36] Reminder that I'm going to be performing Airflow maintenance on all instances in about 40 minutes from now. 
[08:22:24] Pausing all active DAGs, waiting some time, rebooting instances and the postgres database serving them, resuming all DAGs that were active. [08:22:37] I’ve lost both DSL links at home due to public construction work being too « enthusiastic ». I’ll reconnect when I can :/ [08:23:47] brouberol: elukey: Apologies for being a bit lax with my review. [08:25:11] btullis: please don't say that, it is so easy to miss a line with changes like this, happens to me all the time. I saw the change and spotted the kafka mirror thing simply because I put it in there, just luck [08:25:35] <3 [08:25:35] and it wouldn't have harmed anything, it was an easy fix [08:26:41] if I comment in here it is just to help out, I don't want you folks to feel pressured; if that happens, let me know and I'll shut up :) [08:27:38] Not at all. All help and pointers gratefully received. [08:46:30] 10Data-Engineering: Cleanup analytics/refinery/source pom.files - https://phabricator.wikimedia.org/T306193 (10Antoine_Quhen) 05Open→03Declined [08:48:02] (On 4G atm) my thinking was that, by removing the nodes from puppet itself, we wouldn’t need the mirror maker exception because puppet wouldn’t run at all, and hence the service wouldn’t be restarted. Which might be wrong or misguided [08:52:03] And I second what Ben was saying: thanks again for the feedback! 
[09:03:01] !log pausing all airflow dags on analytics instance [09:03:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:06:46] (03PS1) 10Joal: Add `forwarded` field to netflow druid realtime [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966495 (https://phabricator.wikimedia.org/T331707) [09:07:01] !log pausing all 28 active airflow dags on airflow-search instance [09:07:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:07:56] !log pausing all 3 active dags on airflow-research instance [09:07:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:09:09] (03PS2) 10Joal: Add `forwarded` field to netflow druid realtime [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966495 (https://phabricator.wikimedia.org/T331707) [09:09:15] !log pausing all 7 active dags on airflow-platform_eng airflow instance [09:09:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:10:08] !log pausing both active dags on the analytics_product airflow instance [09:10:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:11:29] (03CR) 10Ayounsi: [C: 03+1] Add `forwarded` field to netflow druid realtime [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966495 (https://phabricator.wikimedia.org/T331707) (owner: 10Joal) [09:25:14] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 (10Gehel) [09:31:52] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10CodeReviewBot) joal opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/519 Update analytics druid netfl... 
[09:46:48] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [09:47:20] ^ this is me, due to an expired downtime. Apologies. [09:50:26] !log restarting all airflow schedulers after rebooting an-db1001 [09:50:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:51:04] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [09:51:19] !log re-enabling all previously paused dags [09:51:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:54:56] 10Quarry, 10Toolforge, 10cloud-services-team (FY2023/2024-Q1): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (10fnegri) a:03fnegri I think the best solution here (both for security and performance) is to let Quarry connect to the read... [10:01:03] (03PS5) 10Aqu: Use canonical_data.countries when populating the referer tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) [10:01:18] (03CR) 10Aqu: "Thanks for the review." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) (owner: 10Aqu) [10:05:31] The airflow maintenance is all complete. 
[10:06:41] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10JAllemandou) There still exists stream config related errors though: https://l... [10:08:03] (03CR) 10Joal: Use canonical_data.countries when populating the referer tables (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) (owner: 10Aqu) [10:10:59] btullis: quick question on your airflow ops this morning for the analytics instance - Have you noted the already paused jobs so as to not unpause them after the ops? [10:11:17] I'm asking because the gdi dags were unpaused, while I think they should have been paused [10:11:32] (no big deal, they are configured to run manually) [10:11:56] I took a screenshot of each instance. [10:12:33] ok - I assume that means those jobs were in not-paused mode - weird :( [10:12:41] thanks for letting me know btullis [10:12:56] https://usercontent.irccloud-cdn.com/file/Q03whO18/analytics-airflow-paused%20dags.png [10:13:10] These were the only two that I *thought* were paused. [10:13:21] (03CR) 10Aqu: Use canonical_data.countries when populating the referer tables (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) (owner: 10Aqu) [10:14:44] It's a bit tricky though and not a very reliable method. The toggle switch images are themselves cached, so when I switch from one instance (ssh tunnel) to the next, the images are not necessarily showing the truth. I have to flick about between paused and active and all, before they show the true state. [10:15:34] It's quite possible that I made a mistake. Is it the analytics instance on which the GDI jobs you mentioned are running? 
[10:15:46] not great btullis - We should be able to get the info using the airflow API (I hope), and restore it as-is by saving the state from API calls and resending the settings [10:16:20] btullis: the gdi jobs are on the analytics instance, yes - And, they are unscheduled, so pause/unpause doesn't really mean a thing there [10:16:40] I will ask milimetric and mforns on this --^ [10:16:48] to be sure [10:17:09] I have paused them because I thought they were in that state, but it might be me not knowing that they had been unpaused [10:18:16] (03CR) 10Joal: [C: 03+1] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) (owner: 10Aqu) [10:18:51] Yes, true. Not great. I haven't invested any time in automating this. I have been thinking quite a lot about how we could remove the SPOFs that are the cause of the planned maintenance windows though. [10:20:57] no worries btullis - thanks for the ops this morning <3 [10:37:34] 10Quarry, 10Toolforge, 10cloud-services-team (FY2023/2024-Q1): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (10SD0001) >>! In T348407#9257164, @fnegri wrote: > but I also want to consider another thing before opening this access: do we... [11:21:29] joal: I've been checking back through the analytics airflow webserver logs to check which dags I paused. [11:22:49] I'm pretty certain that the four gdi jobs were in an un-paused state when I started the work. [11:25:49] If we look at a job that was definitely paused and then un-paused, we can see this: [11:25:54] https://www.irccloud.com/pastebin/aWF4xOIi/ [11:26:24] thank you for checking btullis - that's no big deal really :) [11:26:39] Then if we look at a gdi job, this received three clicks, the third one being yours. [11:26:43] https://www.irccloud.com/pastebin/DiidvrGT/ [11:27:46] I'm not really that worried, but it makes me worry that it could have been a bigger deal. 
We should definitely try to get a programmatic way of doing this. [11:28:01] +1 for this :) [11:29:36] The search platform airflow instance has 42 dags, of which 28 are active and 14 are paused. Fiddly mouse work and error prone. [11:30:48] yeahhh [11:31:24] This is the best option I have seen so far for pausing multiple dags. It involves creating a dag for it :-) https://stackoverflow.com/a/77278842 [11:32:09] Or doing it in plain postgresql: https://stackoverflow.com/a/63286952 [11:35:37] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) [12:04:14] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) Great! Thanks all. `s3` and `s4` are running on dbstore1007 So my updated plan will be like this: * Announce planned maintenance for `s2-anal... [12:22:15] o/ some tasks in our airflow instance are in "scheduled" mode and do not seem to start [12:24:05] only one task seems to be running actively [12:24:44] I see sensors moving tho hmm... [12:38:18] dcausse: Would you like me to have a look? I did the same as with the analytics instance 1) paused all active dags, 2) rebooted the instance, 3) rebooted the database server, 4) restarted the airflow-scheduler service, 5) un-paused all dags that were paused in (1) [12:38:24] my bad the ones I wanted to see running are in the "sequential" pool [12:38:45] btullis: I think I just understood why it's not doing what I want :) [12:39:00] Cool, ok. [12:39:17] joal: yes, the GDI jobs being unscheduled makes them do nothing even when they are unpaused. they only will trigger with a manual run (hitting the play button). I remember having seen them unpaused last week, but it really does not matter. 
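A minimal sketch of the "programmatic way" discussed above, assuming the Airflow 2 stable REST API is enabled on the instance and reachable through the same ssh tunnel used for the web UI. `BASE_URL`, the `headers` argument (for whatever auth backend is configured) and the DAG names in the comments are hypothetical; only `GET /dags` and `PATCH /dags/{dag_id}` with `is_paused` come from the Airflow API itself. The idea is to record which DAGs were active before maintenance and touch only those afterwards, so DAGs that were already paused stay paused.

```python
import json
import urllib.request

# Assumption: the Airflow webserver API is reachable locally via an ssh tunnel.
BASE_URL = "http://localhost:8080/api/v1"


def api(method, path, payload=None, headers=None):
    """Send one JSON request to the Airflow REST API and decode the reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("Content-Type", "application/json")
    for key, value in (headers or {}).items():
        req.add_header(key, value)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def list_dags(headers=None, page_size=100):
    """Fetch every DAG, following the API's limit/offset pagination."""
    dags, offset = [], 0
    while True:
        page = api("GET", f"/dags?limit={page_size}&offset={offset}", headers=headers)
        dags.extend(page["dags"])
        offset += page_size
        if offset >= page["total_entries"]:
            return dags


def snapshot(dags):
    """Map dag_id -> is_paused, i.e. the state to restore after maintenance."""
    return {d["dag_id"]: bool(d["is_paused"]) for d in dags}


def active_dags(state):
    """DAGs that were running before maintenance: the only ones to pause and resume."""
    return sorted(dag_id for dag_id, paused in state.items() if not paused)


def set_paused(dag_id, paused, headers=None):
    """Pause or unpause a single DAG."""
    return api("PATCH", f"/dags/{dag_id}", {"is_paused": paused}, headers)


# Intended use (not executed here):
#   state = snapshot(list_dags())            # save this to a file before rebooting
#   for dag_id in active_dags(state): set_paused(dag_id, True)
#   ... reboot instance and database, restart the scheduler ...
#   for dag_id in active_dags(state): set_paused(dag_id, False)
```

Because the restore step iterates over the saved snapshot rather than over whatever the UI shows afterwards, a DAG that was paused before the work (or one that someone toggles during it) is never unpaused by mistake. Saving `state` as JSON on disk would also leave a record to check later instead of a screenshot.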
[12:45:50] I'm planning to go ahead with this change to deploy multiple versions of spark to the test cluster shortly: 963304: Deploy multiple spark shuffler services to the test cluster | https://gerrit.wikimedia.org/r/c/operations/puppet/+/963304 [12:55:16] (03CR) 10Elukey: [C: 03+1] "JGreen: Do you want us to merge and release a new version, or are you going to take care of it? (I saw the change lingering, let me know :" [analytics/kafkatee] - 10https://gerrit.wikimedia.org/r/961174 (owner: 10Jgreen) [12:56:04] !log deploying multiple spark shufflers to the test cluster [12:56:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:08:07] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-p... [13:08:13] (03CR) 10Sbisson: T348613 Add new wiki_highlights_experiments schema (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 (owner: 10Conniecc1) [13:12:39] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) [13:16:17] 10Data-Platform-SRE, 10Wikidata-Query-Service: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10bking) [13:16:29] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) Indeed. A `ca_monitoring` module with a puppet-agnostic script exporting metrics to prometheus might be the way to go then. 
[13:27:21] (03CR) 10TChin: [C: 03+2] Add unique-devices Iceberg schemas and scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689) (owner: 10Joal) [13:27:26] (03CR) 10TChin: [V: 03+2 C: 03+2] Add unique-devices Iceberg schemas and scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689) (owner: 10Joal) [13:28:00] (03CR) 10TChin: [V: 03+2 C: 03+2] Use canonical_data.countries when populating the referer tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) (owner: 10Aqu) [13:28:27] (03CR) 10TChin: [V: 03+2 C: 03+2] Update referer archive job to use icerberg table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/964573 (https://phabricator.wikimedia.org/T347693) (owner: 10Joal) [13:29:20] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10BTullis) >>! In T329398#9257835, @brouberol wrote: > Indeed. A `ca_monitoring` module with a puppet-agnostic script exporting metrics to prometheus might be the way to go then. I think it's more likely tha... [13:29:41] (03CR) 10TChin: [V: 03+2 C: 03+2] Add `forwarded` field to netflow druid realtime [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966495 (https://phabricator.wikimedia.org/T331707) (owner: 10Joal) [13:30:28] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/3 Fix typo in control.templ... 
[13:30:47] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/3 Fix typo in control.templ... [13:32:25] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) It does! I'm not yet familiar enough with Puppet to explain my train of thoughts with the appropriate terms, but I was indeed thinking about something like this. I didn't know about prometheus::n... [13:33:17] (03CR) 10TChin: [V: 03+2 C: 03+2] "Merging for deployment" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/963835 (https://phabricator.wikimedia.org/T348578) (owner: 10Milimetric) [13:34:33] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10bking) [13:34:35] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) [13:36:36] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10CodeReviewBot) tchin merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/519 Update analytics druid netf... 
[13:39:26] !log deploying refinery [13:39:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:48:27] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/4 Fix the postinst and prer... [13:48:42] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/4 Fix the postinst and prer... [14:01:15] !log deploying airflow analytics [14:01:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:14:03] 10Data-Engineering, 10serviceops, 10Event-Platform: Traffic for eventstreams-internal seems to be zero for the past months - https://phabricator.wikimedia.org/T348763 (10Ottomata) Good question. I had expected Product teams to use this more often, but perhaps the ssh tunnel barrier is enough for them to nev... [14:24:32] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Icebox, 10Epic: [EPIC] Deprecate mw.eventLog.logEvent() - https://phabricator.wikimedia.org/T317874 (10Ottomata) Exciting! [14:25:42] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10Ottomata) Q: at the moment, as is, we don't actually lose any events, right? We just end... 
[14:27:11] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Documentation, 10Epic, 10Event-Platform: Event Platform Value Stream Documentation Tasks - https://phabricator.wikimedia.org/T329628 (10Ottomata) 05Open→03Resolved Being bold and resolving, lots of docs available. [14:27:42] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (10Ottomata) [14:28:03] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (10Ottomata) 05Open→03Resolved a:03Ottomata Being bold and resolving. [14:29:01] 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-EventLogging, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [14:30:43] 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-extensions-EventLogging, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) 05Open→03Resolved a:03Ottomata We've made good progress for the Stream Processing compo... 
[14:36:13] 10Data-Engineering, 10Tech-Docs-Team: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (10TBurmeister) [14:37:00] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (10TBurmeister) 05Open→03In progress p:05Triage→03Medium [14:37:38] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (10TBurmeister) [14:37:41] 10Data-Engineering, 10Data-Catalog, 10Documentation: Data Catalog Documentation Style Guide - https://phabricator.wikimedia.org/T310229 (10TBurmeister) [14:40:34] (03CR) 10Ottomata: "Actually, even if we fix the main bug, this change can't hurt, right? It never hurts to produce more canary events, and it is not unlikel" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965662 (owner: 10Aqu) [14:55:54] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Ladsgroup) >>! In T347899#9256819, @Sfaci wrote: > Hi, > > In the description of this ticket there is a list with... [15:04:19] 10Data-Platform-SRE, 10Patch-For-Review: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10BTullis) a:03brouberol [15:12:44] 10Data-Engineering, 10Data Products, 10Structured-Data-Backlog: Bump memory to enable large artifacts sync on HDFS - https://phabricator.wikimedia.org/T348958 (10xcollazo) We could bump memory or we could check whether there is a better API to do the transfer ( See https://gitlab.wikimedia.org/repos/data-eng... 
[15:32:20] 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (10BTullis) 05Open→03Resolved [15:32:25] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis) [15:33:03] 10Data-Platform-SRE: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (10Gehel) 05Open→03Resolved [15:33:08] 10Data-Platform-SRE: Service implementation for wdqs101[4,5,6] - https://phabricator.wikimedia.org/T314890 (10Gehel) 05Open→03Resolved [15:33:14] 10Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10Gehel) [15:33:18] 10Data-Platform-SRE: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (10Gehel) 05Open→03Resolved [15:33:20] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10SRE, 10vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (10Gehel) 05Open→03Resolved a:03Gehel [15:33:25] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (10Gehel) [15:35:53] 10Data-Platform-SRE, 10Patch-For-Review: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) 05Open→03In progress [15:35:58] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10Gehel) a:03bking [15:37:19] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) [15:41:57] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not 
found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Sfaci) Just wondering, for example, why this item `File:)(_-_Flickr_-_Time.Captured..jpg` is included as "is not co... [16:13:48] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10BTullis) @bking and I have been discussing this and we think that the best course of action would be to deploy this in multiple steps. e.g. Somethi... [16:17:48] !log restarting hadoop-yarn-nodemanager on an-test-worker1001 [16:17:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:19:42] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:57] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) The first restart of the hadoop-yarn-nodemanager service on an-test-worker1001 was unsuccessful. This is shown in `/var/lo... [16:21:46] Hey btullis - could you please merge/deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/966499? [16:21:57] this requires a restart of turnilo please btullis [16:23:23] 10Quarry, 10Toolforge, 10cloud-services-team (FY2023/2024-Q1): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (10fnegri) > Only the public tool databases (the ones with names ending in _p) are planned to be made accessible from Quarry.... [16:24:28] joal: Done. Patch deployed and turnilo restarted. [16:24:35] Thanks a million btullis :) [16:25:13] A pleasure. 
[16:26:29] (03PS1) 10Mforns: Revert "Add the wikifunctions_ui metrics platform schema to the allowlist" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966244 [16:26:45] (03PS2) 10Mforns: Revert "Add the wikifunctions_ui metrics platform schema to the allowlist" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966244 [16:27:53] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging revert to unbreak production." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966244 (owner: 10Mforns) [16:37:10] 10Data-Engineering, 10Abstract Wikipedia team, 10CX-cxserver, 10Citoid, and 5 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Jdforrester-WMF) [16:37:18] 10Data-Engineering, 10MediaWiki-extensions-UserMerge, 10Event-Platform, 10MW-1.42-notes (1.42.0-wmf.1; 2023-10-17), and 2 others: MergeUserTest::testMovePages: Trying to get property 'rd_namespace' of non-object - https://phabricator.wikimedia.org/T348881 (10matmarex) I think this is actually a real proble... [16:37:28] 10Data-Engineering, 10MediaWiki-extensions-UserMerge, 10Event-Platform, 10MW-1.41-release, and 3 others: MergeUserTest::testMovePages: Trying to get property 'rd_namespace' of non-object - https://phabricator.wikimedia.org/T348881 (10matmarex) [17:05:26] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cae0d6d1-edbc-4b22-8059-9236cb8823bc) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their service... [17:30:38] 10Data-Engineering, 10Abstract Wikipedia team, 10CX-cxserver, 10Citoid, and 7 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Krinkle) > Platform: > mediawiki/services/example-node-api > mediawiki/services/image-suggestion-api > mediawi... 
[17:33:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[17:49:16] Data-Engineering, Abstract Wikipedia team, CX-cxserver, Citoid, and 7 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Jdforrester-WMF) >>! In T349118#9259044, @Krinkle wrote: >> Platform: >> mediawiki/services/example-node-api >> me...
[17:49:33] Data-Engineering, Abstract Wikipedia team, CX-cxserver, Citoid, and 7 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Jdforrester-WMF)
[18:12:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[19:10:16] Data-Engineering, MediaWiki-extensions-UserMerge, Event-Platform, MW-1.41-release, and 3 others: ArticleDeleteComplete and PageDeleteComplete hooks receive a WikiPage with inconsistent redirect data - https://phabricator.wikimedia.org/T348881 (matmarex)
[19:10:42] Data-Engineering, MediaWiki-Page-deletion, MediaWiki-extensions-UserMerge, Event-Platform, and 4 others: ArticleDeleteComplete and PageDeleteComplete hooks receive a WikiPage with inconsistent redirect data - https://phabricator.wikimedia.org/T348881 (matmarex)
[19:19:10] Data-Engineering, Data Engineering and Event Platform Team, EventStreams, Event-Platform, Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (Ottomata) https://github.com/wikimedia/eventgate/pull/24
[19:23:26] Data-Engineering, Tool-Pageviews, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (Ladsgroup) I don't know that file (I didn't report it) so I can't say it was among the incorrect ones or not. The...
[19:45:41] Quarry, cloud-services-team: Should quarry use our standard secrets management - https://phabricator.wikimedia.org/T290184 (rook)
[19:45:51] Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (rook)
[19:47:10] Quarry, Toolforge, cloud-services-team (FY2023/2024-Q1): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (nskaggs) Is this a feature we want to also make accessible to Superset? I suspect it could use the same technical implementa...
[19:48:29] Quarry: Deduplicate config load - https://phabricator.wikimedia.org/T349135 (rook)
[19:51:24] Quarry, Toolforge, cloud-services-team (FY2023/2024-Q1): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (rook) >>! In T348407#9259469, @nskaggs wrote: > Is this a feature we want to also make accessible to Superset? I suspect it...
[21:24:16] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (bking) We could also deploy via a new namespace, but I wonder what implications that would have for our monitoring/tooling etc. Open to feedback/su...
[21:27:37] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 3), Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (Ottomata) I wonder if this is an async race bug introduced by https://gerrit.wi...
[21:46:50] Data-Platform-SRE, Wikidata-Query-Service: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (bking)
[22:13:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[23:08:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:38:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:58:18] (CR) David Martin: [C: +1] Add the wikifunctions_ui metrics platform schema to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/962657 (https://phabricator.wikimedia.org/T344277) (owner: MNeisler)