[00:31:59] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:51:10] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Help with data that's not appearing on charts - https://phabricator.wikimedia.org/T301895 (Milimetric) I was wondering if we could disable the //Line Chart// type then, if it's deprecated, and did some digging but it doesn't...
[02:11:04] Data-Engineering, SRE, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (Milimetric) @BBlack: this was never our pipeline. It looks like @dr0ptp4kt's [[ https://lists.wikimedia.org/pipermail/analytics/2015-February/003426.html | original idea ]]...
[03:06:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[03:11:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[06:43:14] hello! a friendly reminder that an-db-1.analytics.eqiad1.wikimedia.cloud and pontoon-1.analytics.eqiad1.wikimedia.cloud are still failing puppet runs
[07:18:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:27:02] Data-Engineering: Check home/HDFS leftovers of nikkin - https://phabricator.wikimedia.org/T307420 (MoritzMuehlenhoff)
[07:33:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:42:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:47:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:25:43] Quarry, Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, User-dcaro, cloud-services-team (Kanban): Request to add razzi to Quarry Cloud VPS project - https://phabricator.wikimedia.org/T307403 (dcaro) Open→Resolved a:dcaro +1 from me, should be done, let me know if yo...
[09:10:47] Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) I finally managed to verify and document the steps needed to put a service under Ingress. I did also update the general https://wikitech.wikimedia.or...
[11:23:41] (VarnishkafkaNoMessages) firing: ...
[11:23:41] varnishkafka for instance cp3058:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3058:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[11:28:41] (VarnishkafkaNoMessages) resolved: ...
[11:28:41] varnishkafka for instance cp3058:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3058:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[11:41:12] Data-Engineering, Data-Engineering-Kanban, Airflow: Adapt maxExecutors value by Dag - https://phabricator.wikimedia.org/T307447 (Antoine_Quhen)
[12:31:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[12:41:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[13:40:01] (CR) Mforns: "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/787840 (https://phabricator.wikimedia.org/T299007) (owner: Jenniferwang)
[13:51:58] Quarry, Patch-For-Review: Cannot stop unresponsive quarries - https://phabricator.wikimedia.org/T307297 (Fuzzy) This bug is a duplicate of T290146. Please mark this one as "duplicate". It appears there is a much deeper problem. The original quarry and other quarries used to work properly, are now stuck...
[14:12:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_04 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[14:16:45] Quarry, Patch-For-Review: Cannot stop unresponsive quarries - https://phabricator.wikimedia.org/T307297 (Aklapper) Anyone can {nav icon=anchor,name=Edit Related Tasks... > Close As Duplicate} in the upper right corner. However this task has a proposed patch associated so I wouldn't do that.
[14:30:47] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) >>! In T306181#7892221, @Ottomata wrote: >> perhaps this is a client browser opening a connection but send...
[14:32:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_04 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[14:52:55] (PS1) Vivian Rook: Remove stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/788719 (https://phabricator.wikimedia.org/T290146)
[14:53:56] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) It possible that the request aborted errors are actually requests being terminated mid-flight by the clie...
[14:54:33] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (rook) I propose we remove the feature. Would anyone here care to do a code review on https://gerrit.wikimedia.org/r/788719 seems to run, some brief tinkering in the dev env doe...
[14:54:47] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/788623 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:55:02] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/788614 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:55:06] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/788613 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:55:32] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/788340 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:55:37] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/788339 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:55:46] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/787901 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:55:52] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/787727 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:55:56] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/787729 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:56:10] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is all in the node_modules/ folder. These are libraries that are downloaded from somewhere else." [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/787785 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:56:39] (CR) jerkins-bot: [V: -1] Remove stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/788719 (https://phabricator.wikimedia.org/T290146) (owner: Vivian Rook)
[14:58:24] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (Certes) Please don't remove the Stop feature. It sometimes works, and is better than nothing, both for users who (sometimes) get their query unlocked for fixing and for the se...
[14:58:41] Data-Engineering-Kanban, Airflow, Documentation: [Airflow] Kick off documentation in wikitech - https://phabricator.wikimedia.org/T302400 (EChetty) p:Triage→High
[14:58:42] (CR) Thiemo Kreuz (WMDE): [C: -1] "These are copies of the jQuery library, it seems. This is not actually maintained in this codebase." [analytics/pageview-api] - https://gerrit.wikimedia.org/r/788326 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:00:37] (PS2) Vivian Rook: Remove stop query function [analytics/quarry/web] - https://gerrit.wikimedia.org/r/788719 (https://phabricator.wikimedia.org/T290146)
[15:01:41] (CR) Thiemo Kreuz (WMDE): [C: -1] "Pretty sure this will break all code that tries to use this option. It also looks like this is an external lib that's not actually maintai" [analytics/wikistats] - https://gerrit.wikimedia.org/r/788327 (owner: Klein Muçi)
[15:02:11] (CR) Thiemo Kreuz (WMDE): [C: -1] "This is the jQuery lib. A vendor/ directory is a strong sign that this is not actually maintained in this codebase." [analytics/reportcard] - https://gerrit.wikimedia.org/r/787904 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:02:45] (CR) Thiemo Kreuz (WMDE): [C: -1] "node_modules/ again." [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/787847 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:03:33] (CR) Thiemo Kreuz (WMDE): [C: -1] "Pretty sure this is an external library. Weird to see it being copy-pasted 3 times in the same codebase. But that might be intentional." [analytics/wikistats] - https://gerrit.wikimedia.org/r/787851 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:04:26] Data-Engineering, SRE, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (jhathaway) @Dzahn I mentioned over email, but I t...
[15:07:03] (CR) Thiemo Kreuz (WMDE): [C: +2] Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/787905 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:10:10] (CR) jerkins-bot: [V: -1] Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/787905 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:15:48] Data-Engineering, SRE, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (Dzahn) @jhathaway Yes and no. What I definitely d...
[15:18:25] Data-Engineering, Infrastructure-Foundations, Product-Analytics, Research, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (jbond) >>! In T300977#7836272, @Volans wrote: > If I may add my use case too, I would like to be able to restrict the...
[15:19:54] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (rook) hmm...maybe making it a separate button that is just always there alongside submit then...
[15:35:00] (Abandoned) Klein Muçi: Fix typo [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/788614 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:35:57] (Abandoned) Klein Muçi: Fix typo [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/788613 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:36:44] (Abandoned) Klein Muçi: Fix typo [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/788623 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:40:16] (Abandoned) Klein Muçi: Fix typo [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/788340 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:40:59] (Abandoned) Klein Muçi: Fix typo [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/788339 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:41:15] (Abandoned) Klein Muçi: Fix typo [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/787901 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:41:23] (Abandoned) Klein Muçi: Fix typo [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/787727 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:41:32] (Abandoned) Klein Muçi: Fix typo [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/787729 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:42:10] (Abandoned) Klein Muçi: Fix typo [analytics/pivot/deploy] - https://gerrit.wikimedia.org/r/787785 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:43:25] (Abandoned) Klein Muçi: Fix typo [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/787847 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[15:45:53] Data-Engineering, SRE, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (jhathaway) @Dzahn that makes sense, so I assume i...
[15:51:55] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) >>! In T306181#7899762, @Ottomata wrote: > It possible that the request aborted errors are actually reques...
[15:59:56] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Vgutierrez) to be accurate, the remote client talks to HAProxy over a TLS connection and HAProxy handles the traffi...
[16:03:28] (Abandoned) Klein Muçi: Fix typo [analytics/reportcard] - https://gerrit.wikimedia.org/r/787904 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[16:05:14] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) It does seem that a 400 bad request is being sent to the client. I think that perhaps the 500 reported b...
[16:34:30] Data-Engineering, SRE, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (Dzahn) The part that we don't (can't actually) re...
[16:34:45] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Thanks @Vgutierrez for the clarification on that. I hadn't picked up on the progress of the HAProxy migrat...
[16:42:30] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (Ahecht)
[16:43:20] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (Ahecht) p:Triage→Unbreak!
[16:48:24] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (bd808) p:Unbreak!→High Per https://www.mediawiki.org/wiki/Phabricator/Project_management#Priority_levels: UBN! is reserved for isses meeting the guidelines at https://wikitech.wikimedia.o...
[16:50:48] (PS2) Jenniferwang: Bug: T299007 Add the mediawiki_reading_depth event platform stream to the allowlist. [analytics/refinery] - https://gerrit.wikimedia.org/r/787840 (https://phabricator.wikimedia.org/T299007)
[17:03:50] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[17:15:45] Data-Engineering, Data-Engineering-Kanban, DC-Ops, Infrastructure-Foundations: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye - https://phabricator.wikimedia.org/T306148 (razzi) Open→Resolved Updated netbox status to "Active".
[17:20:20] (CR) Jenniferwang: "Hi," [analytics/refinery] - https://gerrit.wikimedia.org/r/787840 (https://phabricator.wikimedia.org/T299007) (owner: Jenniferwang)
[17:27:37] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (Ahecht) >>! In T307482#7900257, @bd808 wrote: > Per https://www.mediawiki.org/wiki/Phabricator/Project_management#Priority_levels: UBN! is reserved for isses meeting the guidelines at https://wik...
[17:29:47] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (Ahecht)
[17:34:51] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (nskaggs) Open→In progress
[18:22:10] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/787840 (https://phabricator.wikimedia.org/T299007) (owner: Jenniferwang)
[18:24:17] !log remove /etc/apache2/sites-available/50-superset-wikimedia-org.conf from an-tool1005 (superset staging) since it was removed from puppet but has no ensure: absent
[18:24:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:33:05] (CR) Mforns: Create Hive Query to generate Wikidata CoEditors metrics (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/780749 (https://phabricator.wikimedia.org/T306177) (owner: Snwachukwu)
[18:57:45] (Abandoned) Jenniferwang: Bug: T299007 Add the mediawiki_reading_depth event platform stream to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/753178 (https://phabricator.wikimedia.org/T299007) (owner: Jenniferwang)
[19:08:59] milimetric: I was looking into adding the mediawiki geoeditors jobs to the hackathon. Do you think we can solve the issues that made us pause the Airflow dags? Or is it better to wait for Spark3?
[19:11:32] mforns: we can't resolve it in that it'll still write nulls but we can handle it by looking at all dependent jobs carefully. We could do that and then wait for spark, but ideally we could just add spark 3 to a hackathon
[19:12:04] ^ oo
[19:12:13] maybe that is a good idea...am i part of the hackathon? not certain
[19:15:56] ottomata: of course you are, if you wantntntnt :)
[19:16:40] ottomata: if you want to work on Spark3 stuff in the hackathon, I'll add that as a potential task
[19:16:59] thanks milimetric
[19:28:51] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Product-Analytics, and 2 others: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (razzi) It's working! Visit https://superset-next.wikimedia.org/
[19:32:25] ottomata: superset-next.wikimedia.org is working! I stared at it not working (all static content giving 404s) for like 2 hours then restarted the superset service, which fixed it
[19:41:12] NIIICE!
[19:41:23] amazing, yeahhhh reboot it always works!
[19:43:30] ^ this needs to go on bash :P
[19:51:51] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (rook) a:rook
[19:54:18] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (rook) Looks like a few long running processes were locking up nfs on one of the workers. Thus quarry worked about half the time (the half that it got the not loaded worker). Cleared the old jobs...
[19:55:24] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[19:57:41] ottomata: Right off the bat people are hitting the require_u2f and unable to log into superset, so I'm disabling it for now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/788774
[19:57:41] small patch but security related
[20:00:03] ottomata: do you want me then to add Spark3 to the hackathon?
[20:01:38] (PS1) GoranSMilovanovic: dir_tree [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/788778
[20:02:06] (CR) GoranSMilovanovic: [V: +2 C: +2] dir_tree [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/788778 (owner: GoranSMilovanovic)
[20:05:00] (PS1) GoranSMilovanovic: data_dir [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/788779
[20:05:14] (CR) GoranSMilovanovic: [V: +2 C: +2] data_dir [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/788779 (owner: GoranSMilovanovic)
[20:06:58] (PS1) GoranSMilovanovic: keep_empty_data_dirs [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/788780
[20:07:09] (CR) GoranSMilovanovic: [V: +2 C: +2] keep_empty_data_dirs [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/788780 (owner: GoranSMilovanovic)
[20:34:04] Data-Engineering, Airflow: Migrate 1+ Refine jobs - https://phabricator.wikimedia.org/T307505 (mforns)
[20:34:21] Data-Engineering, Airflow: Migrate 1+ Refine jobs - https://phabricator.wikimedia.org/T307505 (mforns)
[20:34:23] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[20:41:46] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:59:19] Data-Engineering, Airflow: Migrate 1+ Druid load jobs - https://phabricator.wikimedia.org/T307508 (mforns)
[20:59:31] Data-Engineering, Airflow: Migrate 1+ Druid load jobs - https://phabricator.wikimedia.org/T307508 (mforns)
[20:59:33] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[21:04:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[21:19:14] lol, razzi, we call that "the hammer"
[21:28:57] Data-Engineering, SRE, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (Dzahn) But what isn't is that there seems to be a...
[22:05:47] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (MarioGom) Queries sometimes get stuck (T307263), and the Stop button (with the double click trick) seems to be the only workaround. So removing the Stop button altogether does...
[22:55:13] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (rook) @MarioGom if the stop function works sometimes, then would it be better to have it appear along side the submit function, rather than replace the submit function all toge...
[23:19:36] (Abandoned) Klein Muçi: Fix typo [analytics/pageview-api] - https://gerrit.wikimedia.org/r/788326 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[23:19:57] (Abandoned) Klein Muçi: Fix typo [analytics/wikistats] - https://gerrit.wikimedia.org/r/788327 (owner: Klein Muçi)
[23:20:28] (Abandoned) Klein Muçi: Fix typo [analytics/wikistats] - https://gerrit.wikimedia.org/r/787851 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)