[02:48:00] (PS1) Amire80: Enable Punjabi language [analytics/wikistats2] - https://gerrit.wikimedia.org/r/955857 (https://phabricator.wikimedia.org/T344572)
[02:52:42] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on an-worker1126:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:53:29] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:05] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:16:27] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:03] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:22:42] (SystemdUnitFailed) resolved: hadoop-yarn-nodemanager.service Failed on an-worker1126:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:25:21] (CR) DCausse: [C: +1] Adapt schema to meet latest requirements. (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[06:27:21] Data-Platform-SRE, Discovery-Search (Current work): Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (JMeybohm) Open→Resolved a:JMeybohm >>! In T341042#9150714, @bking wrote: > @JMeybohm these hosts have been reimaged, are you still seeing...
[07:37:01] (CR) Joal: [C: +2] "Merging for later deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/955816 (https://phabricator.wikimedia.org/T344616) (owner: Joal)
[07:46:36] (Merged) jenkins-bot: Make refine SchemaLoader main function thread safe [analytics/refinery/source] - https://gerrit.wikimedia.org/r/955816 (https://phabricator.wikimedia.org/T344616) (owner: Joal)
[07:59:31] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Gehel) @Hannah_Bast: anything is possible :) That being said, that's not a feature that is possible out...
[08:13:33] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Hannah_Bast) @Gehel Thanks for the reply! But to clarify, what I am asking is **not** to do something dif...
[08:32:41] (Abandoned) Joal: Increase the max kafka message size for gobblin [analytics/refinery] - https://gerrit.wikimedia.org/r/954968 (https://phabricator.wikimedia.org/T307959) (owner: Btullis)
[08:43:14] Data-Platform-SRE, superset.wikimedia.org: Bad Gateway Error when uploading csv to Superset - https://phabricator.wikimedia.org/T300440 (BTullis)
[09:05:36] btullis: would you be available for a quick touchdown regarding the cookbook task? As expected, things are not simple, as we'd possibly need to add the opensearch client as a dependency to spicerack itself, and see whether we could transparently mirror the API of our `spicerack.elasticsearch_clusters.ElasticSearchCluster` object based on the opensearch client.
[09:05:58] So this is getting interesting
[09:06:00] * brouberol rubs hands
[09:06:36] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for EventBus? - https://phabricator.wikimedia.org/T345195 (gmodena)
[09:06:39] brouberol: Yes, sure thing.
[09:06:55] (given that it's Friday, you might want to push that to Monday. It'd be completely fine by me)
[09:08:42] hi folks!
[09:08:56] I was checking druid's version, any plans to upgrade to latest during this fiscal year?
[09:09:11] I am wondering if there could be more features or perf improvements that we could use
[09:09:22] Hi elukey. Nothing prioritized, but we probably should :-)
[09:26:16] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (dcausse) @Hannah_Bast Blazegraph does properly send the header `Accept: application/sparql-results+xml` b...
[09:34:31] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (dcausse) Note that even if we changed blazegraph to accept multiple formats for all endpoints by setting...
[09:54:25] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol)
[10:39:36] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Epic, Event-Platform, Patch-For-Review: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (gmodena)
[11:14:32] Data-Platform-SRE, superset.wikimedia.org: Bad Gateway Error when uploading csv to Superset - https://phabricator.wikimedia.org/T300440 (BTullis) Open→Declined It looks like the only database we have that supports CSV imports is `mysql_staging` If I go here: https://superset.wikimedia.org/csvtoda...
[11:55:17] Data-Engineering: Alert for snapshot100[4567] not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T345907 (fgiunchedi)
[12:13:20] Data-Engineering, Data-Platform-SRE, Dumps-Generation: Alert for snapshot100[4567] not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T345907 (BTullis) Given that these are [[https://wikitech.wikimedia.org/wiki/Dumps/Snapshot_hosts|snapshot hosts]] - I think that they are rel...
[12:20:37] Data-Engineering, Data-Platform-SRE, Dumps-Generation: Alert for snapshot100[4567] not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T345907 (fgiunchedi) Thank you for the extensive info @BTullis ! AFAICT the check will make sure we're not leaving hosts behind not in `dsh`...
[12:25:39] Data-Engineering, Data-Platform-SRE, Dumps-Generation: Alert for snapshot101[4567] not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T345907 (taavi)
[12:26:49] Data-Engineering, Data-Platform-SRE, Dumps-Generation: Alert for snapshot101[4567] not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T345907 (taavi) > I'm not immediately sure what the check_dsh_groups is needed for, but I can help do more investigation if required. That che...
[12:30:22] (PS4) Aqu: WIP: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862)
[12:37:19] (CR) CI reject: [V: -1] WIP: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: Aqu)
[13:03:15] Quarry: [feedback] - https://phabricator.wikimedia.org/T345913 (Xaosflux)
[13:04:05] Quarry: [feedback] - https://phabricator.wikimedia.org/T345913 (Xaosflux) Open→Invalid a:Xaosflux
[13:05:52] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for EventBus? - https://phabricator.wikimedia.org/T345195 (lbowmaker)
[13:14:23] Data-Engineering, Data-Platform-SRE, Dumps-Generation, Patch-For-Review: Alert for snapshot101[4567] not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T345907 (BTullis) Thanks @fgiunchedi and @taavi - I have created a patch to add them to the correct group and added a fe...
[13:19:42] Data-Engineering, Product-Analytics, Data Engineering and Event Platform Team (Sprint 1): Email notifications of new MediaWiki history snapshot availabilty - https://phabricator.wikimedia.org/T344854 (lbowmaker) Open→Resolved
[13:20:14] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: mw-page-content-change-enrich: alert on SLIs degradation only on active DC - https://phabricator.wikimedia.org/T342258 (lbowmaker) Open→Resolved
[13:20:46] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: mediawiki page_content_change should generate new meta.id field - https://phabricator.wikimedia.org/T341277 (lbowmaker) Open→Resolved
[13:21:32] Data-Engineering-Planning, Epic, Event-Platform (Sprint 14 B), Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (lbowmaker)
[13:22:09] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (lbowmaker) Open→Resolved
[13:23:23] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100%
[13:23:33] ^expected
[13:23:58] I think. Could have been in downtime, but nothing to worry about.
[13:25:22] Data-Engineering, Discovery-Search, serviceops-radar, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (lbowmaker)
[13:26:23] Data-Engineering, Data-Catalog, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (lbowmaker)
[13:28:01] Data-Engineering, Data-Platform-SRE, Dumps-Generation, Patch-For-Review: Alert for snapshot101[4567] not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T345907 (ArielGlenn) the ops-dumps email alias ought to get notified about things like this; that way all the right peop...
[13:28:42] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: eventutilities-python: cookicutter template example should be updated - https://phabricator.wikimedia.org/T345390 (lbowmaker)
[13:28:45] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Document the onboarding journey on Event Platfrom - https://phabricator.wikimedia.org/T345193 (lbowmaker)
[13:29:49] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Epic, Event-Platform, Patch-For-Review: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (lbowmaker)
[13:29:54] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (lbowmaker)
[13:30:18] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (lbowmaker)
[13:30:32] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, Patch-For-Review: Enum with an entry of `null` should fail jsonschema-tools validation - https://phabricator.wikimedia.org/T344511 (lbowmaker)
[13:49:17] Data-Engineering: Event Utilities partially downloads schemas - https://phabricator.wikimedia.org/T309717 (lbowmaker)
[13:53:27] * brouberol is afk for about 1h (Doctor appt for my daughter)
[14:11:06] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Hannah_Bast) @dcausse I am confused, where does https://data.nlg.gr/sparql come from? I thought the endpo...
[14:12:45] dcausse: ^ might need a follow up
[14:14:37] oops did I mix this ticket with a different endpoint?
[14:27:38] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (dcausse) @Hannah_Bast sorry about this I mixed this ticket with another one, supporting `https://qlever.c...
[14:46:39] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr)
[15:01:30] * brouberol is back
[15:17:04] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr)
[16:19:50] I've run a small experiment, trying to see whether the elasticsearch-py and opensearch-py clients share the same Python API (method names, args, etc). If they did, we could easily swap the ES client for the OS one, and automagically support opensearch in spicerack.
[16:20:16] Results: there are small inconsistencies in the API, meaning it won't be that easy
[16:20:16] Details: https://bin.balthazar-rouberol.com/edamesianv.py
[16:22:11] brouberol: we can probably swap one with the other with minimal work, even if that means minor changes to spicerack's Elasticsearch support
[16:22:47] it's actually a good thing, I think. 2 perfectly API-compatible packages at a time t might not guarantee compatibility over time, as projects diverge. This way, we'd implement 2 separate clients, with a bit of code duplication, instead of retrofitting one into the other, which might produce less-readable code anyway
[16:22:54] brouberol: Also, feel free to use https://phabricator.wikimedia.org/paste/ for pastebin type of stuff. It integrates nicely with the tickets and we can control the lifecycle/visibility etc.
[16:23:03] 👍
[16:23:27] bookmarked
[16:24:53] gehel: I take it that there aren't any plans to move away from ES on the main search clusters in the short term? :-)
[16:25:02] gehel: I think we could do both yes. It really comes down to what we prefer, in terms of maintenance. I don't (yet) have any strong preference either way
[16:26:17] There are medium term plans to migrate away from Elasticsearch to OpenSearch
[16:26:29] On that note, I wish you all a good weekend. It's babysitting time. Can I let you mull it over?
[16:26:49] OK, bye for now. Catch you next week.
[16:27:35] The issue isn't whether the clients expose different APIs; the question is whether the client is compatible with both OpenSearch and Elasticsearch.
[16:27:41] Enjoy the weekend!
[16:27:58] brouberol: it's been a while, but last I checked the opensearch python libraries weren't a complete drop-in replacement for the ES libraries, but they were trying to get there
[16:28:20] Elastic also did some dirty stuff with their elasticsearch-py, detecting and refusing to work with OpenSearch
[16:28:40] that's what I heard as well
[16:29:08] (for context, I don't have much experience with ES/OS, either on the admin or user side, so all this is pretty new to me)
[16:29:44] no problemo. Didn't mean to lure you back...feel free to take your weekend ;)
[16:44:50] Data-Platform-SRE, SRE, ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Jhancock.wm) I think this is fixed. I am seeing four disks in the idrac and bios. Can someone confirm?
[20:41:30] Data-Platform-SRE, Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (bking)
[20:44:51] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (bking)
[21:43:26] Analytics: Requesting Kerberos access for ahoelzl - https://phabricator.wikimedia.org/T345961 (Ahoelzl)
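Note on the 16:19:50 experiment: the script linked in the channel is not reproduced in this log, so the following is only a rough sketch of how such a comparison could be done, not what brouberol actually ran. It uses the standard-library `inspect` module to diff the public method names and parameter names of two classes. `FakeEsClient` and `FakeOsClient` are hypothetical stand-ins invented for illustration; they are not the real elasticsearch-py or opensearch-py classes, and their methods are assumptions.

```python
import inspect


def public_methods(cls):
    """Map each public callable attribute of cls to its signature."""
    return {
        name: inspect.signature(member)
        for name, member in inspect.getmembers(cls, callable)
        if not name.startswith("_")
    }


def diff_public_api(left, right):
    """Compare two classes by public method names and parameter names."""
    a, b = public_methods(left), public_methods(right)
    shared = sorted(a.keys() & b.keys())
    return {
        "only_left": sorted(a.keys() - b.keys()),
        "only_right": sorted(b.keys() - a.keys()),
        "mismatched": [
            name
            for name in shared
            if list(a[name].parameters) != list(b[name].parameters)
        ],
    }


# Hypothetical stand-ins, NOT the real client classes.
class FakeEsClient:
    def search(self, index, body=None): ...
    def count(self, index): ...
    def ping(self): ...


class FakeOsClient:
    def search(self, index, body=None, params=None): ...
    def count(self, index): ...
    def info(self): ...


report = diff_public_api(FakeEsClient, FakeOsClient)
print(report)
```

Pointing `diff_public_api` at the two real client classes would presumably surface the same kind of "small inconsistencies" reported at 16:20:16. It only checks names, though, so it says nothing about whether either client will actually talk to the other project's server, which is the compatibility question raised at 16:27:35.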