[00:15:51] RECOVERY - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:29:25] PROBLEM - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:35:11] RECOVERY - Check unit status of refinery-drop-raw-netflow-event on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-raw-netflow-event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:48:51] PROBLEM - Check unit status of refinery-drop-raw-netflow-event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-raw-netflow-event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:04:51] (CR) Ottomata: [C: +1] mediawiki/client/metrics_event: Add mediawiki.database property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689) (owner: Phuedx)
[03:12:30] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (Tgr) https://logstash.wikimedia.org/goto/42fcb0aed180c0a197cecdd26bc9d8ae{F35545830} {F35545831} `Failed loading schema at...
[03:14:09] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (Tgr) (On an aside, the schema version is hardcoded in puppet [[https://gerrit.wikimedia.org/g/cloud/instance-puppet/+/54ea6a0...
[04:15:31] RECOVERY - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:15:56] ottomata: ^ that's an UBN and I think someone with eventgate server access would have to look at it
[04:16:26] (I mean T319261)
[04:16:27] T319261: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261
[04:29:09] PROBLEM - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:45:03] RECOVERY - Check unit status of drop-features-actor-rollup-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-features-actor-rollup-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:58:43] PROBLEM - Check unit status of drop-features-actor-rollup-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-features-actor-rollup-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:07:22] Data-Engineering: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T319266 (MoritzMuehlenhoff)
[07:31:01] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (elukey) I think that the issue is related to how eventgate-logging-external reads the schemas, namely from local disk only:...
[07:35:44] Data-Engineering: Check home/HDFS leftovers of nikafor - https://phabricator.wikimedia.org/T319268 (MoritzMuehlenhoff)
[07:42:18] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (elukey) Tried a roll restart of codfw pods, not successful.
[07:51:10] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (elukey) It seems that the Docker image needs to be rebuilt with a new version of the schema registry, see this prev commit: h...
[07:58:28] btullis: o/
[07:58:30] around?
[08:01:53] I created https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/838067 to fix an issue with eventgate-external-logging
[08:02:01] also joal --^ if around
[08:16:10] elukey: yes, I'm around now. Thank you.
[08:19:32] in theory IIUC we just need to rebuild the docker image and deploy
[08:45:16] RECOVERY - Check unit status of drop-features-actor-rollup-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-features-actor-rollup-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:50:44] going afk in a few, feel free to go ahead with the patch without me :)
[08:52:16] elukey: Will do. According to this procedure, right? https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate-wikimedia_repository_change
[08:53:12] btullis: yep exactly! We can test staging/canary in theory before hitting prod
[08:53:30] (not sure about canary though, maybe I misremember)
[08:53:44] releases:
[08:53:44]   - name: production
[08:53:44]     <<: *default
[08:53:44]   - name: canary
[08:53:44]     <<: *default
[08:53:50] nono we have canary as well
[08:54:20] I only have +1 on that repo. I can't +2 it.
[08:55:02] lol me too
[08:55:48] https://gerrit.wikimedia.org/r/admin/repos/schemas/event/primary,access
[08:56:34] you should be able to submit https://gerrit.wikimedia.org/r/admin/groups/d34747bee94be39cff54b5fda1ae36b575107792,members
[08:56:37] but no +2?
[08:58:27] That's correct. Definitely no +2.
[08:58:27] (need to go, bbl)
[08:58:29] https://usercontent.irccloud-cdn.com/file/yE17HIRm/image.png
[08:58:34] PROBLEM - Check unit status of drop-features-actor-rollup-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-features-actor-rollup-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:29:10] This change has now been merged (thanks taavi) and I've prepared the required deployment-charts change. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/838107
[09:35:22] I have merged that change, proceeding to deploy eventgate services any minute.
[09:35:29] Data-Engineering, SRE, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (JMeybohm) This needs SRE support to depool eventstreams from one DC. helmfile destroy/helmfile appy...
[09:41:17] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (BTullis) I have merged @elukey's change to eventgate-wikimedia (https://gerrit.wikimedia.org/r/c/eventg...
[09:44:20] !log deploying new eventgate-logging-external service to staging
[09:44:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:45:02] !log deploying new eventgate-logging-external service to codfw
[09:45:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:52:47] (CR) Thiemo Kreuz (WMDE): Limit HTTP status code to 0…599 (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736 (owner: Thiemo Kreuz (WMDE))
[09:53:37] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (BTullis) I have deployed eventgate-logging-external and the new schema appears to be in place. {F355461...
[09:53:55] !log deployed eventgate-logging-external to eqiad (a few minutes ago)
[09:53:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:06:26] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis) Moving to 'Troubleshoot' column on the #ops-eqiad board. @cmjohnson have you been able to look into this at all? Thanks.
[10:39:57] (CR) Awight: Limit HTTP status code to 0…599 (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736 (owner: Thiemo Kreuz (WMDE))
[10:48:13] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (elukey) Both logstash and grafana look really good, I think that the task can be closed!
[11:15:06] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (phuedx) >>! In T319261#8282373, @BTullis wrote: > Is anyone else able to confirm whether or not this fi...
[11:39:44] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (elukey) p:Unbreak!→Medium
[12:14:28] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/832383 (https://phabricator.wikimedia.org/T312262) (owner: Neil P. Quinn-WMF)
[12:59:57] Quarry, GitLab (Project Migration): Move Quarry from Gerrit to GitHub - https://phabricator.wikimedia.org/T308978 (valerio.bozzolan)
[13:46:16] joal: This was the warning I mentioned in the meeting.
[13:46:18] https://usercontent.irccloud-cdn.com/file/Yca2P59F/image.png
[13:47:51] I'm going to note it and come back to it for now, but I'm pretty sure I've got around it before. Maybe it was when I was building alluxio, or perhaps atlas. Shouldn't be a biggie anyway.
[13:50:19] Yup btullis - know underperformance - nothing major
[14:38:45] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (Ottomata) Thank you all! Yes, only the 'analytics' eventgate clusters use dynamic schema lookup. All o...
[15:17:21] Data-Engineering-Icebox: Improve Bot Detection Heuristics - https://phabricator.wikimedia.org/T310846 (BTullis) This also relates closely to {T292449} We [[https://docs.google.com/document/d/1MkRw0GRti8u1SSPdaJytOJSNxdNe4ggBRVJg84DmxXg/edit#|discussed this]] in the Data Engineering Office hours with @Mayakp...
[15:25:18] (CR) Clare Ming: [C: +2] "tested locally with https://gerrit.wikimedia.org/r/c/mediawiki/libs/metrics-platform/+/810297/" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689) (owner: Phuedx)
[15:26:24] (Merged) jenkins-bot: mediawiki/client/metrics_event: Add mediawiki.database property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689) (owner: Phuedx)
[15:28:43] Data-Engineering, Event-Platform Value Stream, Data Pipelines (Sprint 02), Metrics-Platform-Planning (Metrics Platform Kanban), Patch-For-Review: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client inste... - https://phabricator.wikimedia.org/T286344
[15:50:22] Data-Engineering, Product-Analytics (Kanban): Superset Date Filter fix needed - https://phabricator.wikimedia.org/T318299 (BTullis) We [[https://docs.google.com/document/d/1MkRw0GRti8u1SSPdaJytOJSNxdNe4ggBRVJg84DmxXg/edit#|discussed this]] in the Data Engineering office hours. We understand the the filte...
[16:16:39] Data-Engineering, Product-Analytics, wmfdata-python, Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (EChetty)
[16:48:23] So if I'm producing to a new eventgate schema, what determines its table name in the hive event db?
[16:50:31] awight: the name of the schema, normalized (lower-case, dashes/special-chars to underscore)
[16:51:02] Data-Engineering, Product-Analytics, wmfdata-python, Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (xcollazo) a:xcollazo
[17:59:27] (PS1) Snwachukwu: Bump changelog.md to v0.2.8 before release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/838238
[18:00:45] (CR) Snwachukwu: [V: +2 C: +1] Bump changelog.md to v0.2.8 before release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/838238 (owner: Snwachukwu)
[18:02:41] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (brennen) Open→Resolved a:brennen Resolving per discussion here and normal-looking graphs.
[18:10:34] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (Jdlrobson) Thanks for the quick response here!
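Editor's note on the 16:50:31 answer above: the normalization it describes (lower-case, dashes/special-chars to underscore) can be sketched roughly as follows. This is only an illustration of that one-line description, not the actual refinery code; the function name `hive_table_name` and the example schema name are hypothetical, and treating every character outside `a-z`, `0-9` and `_` as a "special char" is an assumption.

```python
import re


def hive_table_name(schema_name: str) -> str:
    """Hypothetical helper: approximate the normalization described in chat.

    Assumption: any character other than a-z, 0-9 or underscore counts as
    "special" and is replaced one-for-one with an underscore.
    """
    lowered = schema_name.lower()
    return re.sub(r"[^a-z0-9_]", "_", lowered)


# Illustrative, made-up schema name.
print(hive_table_name("My-New.Schema"))
```

Runs of special characters are not collapsed here, since the chat message does not say whether the real normalization does that.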
[18:12:30] Data-Engineering, Event-Platform Value Stream, Instrument-ClientError, Patch-For-Review: Significant and unexpected drop in JavaScript error logging - https://phabricator.wikimedia.org/T319261 (brennen) a:brennen→None
[18:49:45] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:10:38] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:25:33] Data-Engineering, Event-Platform Value Stream (Sprint 02), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Milimetric) > I am sort of modeling a more abstract '[[ https://gerrit.wikimedia.org/r/c/schemas/event/...
[19:36:58] Analytics-Wikistats, Data Engineering Planning, Data Pipelines: Wikistats in Uzbek - https://phabricator.wikimedia.org/T314477 (Milimetric) >>! In T314477#8255150, @Nataev wrote: > Oʻzbekcha is the correct choice, as I've used the Latin script while translating. ack, makes sense, thx > As for the n...
[19:54:12] (PS1) Milimetric: [WIP] Collaborate on a new editors dataset [analytics/refinery] - https://gerrit.wikimedia.org/r/838256
[20:15:03] RECOVERY - Check unit status of refinery-drop-eventlogging-legacy-raw-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-eventlogging-legacy-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:28:39] PROBLEM - Check unit status of refinery-drop-eventlogging-legacy-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-eventlogging-legacy-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:18:36] Data-Engineering, Product-Analytics, wmfdata-python: Cannot query string data from MariaDB using Wmfdata-Python - https://phabricator.wikimedia.org/T319360 (nshahquinn-wmf)
[23:20:19] Data-Engineering, Product-Analytics, wmfdata-python: Cannot query string data from MariaDB using Wmfdata-Python - https://phabricator.wikimedia.org/T319360 (nshahquinn-wmf)
[23:34:10] Data-Engineering, Product-Analytics, wmfdata-python: Cannot query string data from MariaDB using Wmfdata-Python - https://phabricator.wikimedia.org/T319360 (nshahquinn-wmf) p:Triage→High Note that this is **currently blocking my work** and, unless it's something specific to my setup, will blo...
[23:44:01] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook