[04:21:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:16] PROBLEM - Check unit status of monitor_refine_event_sanitized_main_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:52:40] !log restarted prometheus-mysqld-exporter on an-coord1001 due to apparent failure
[09:12:34] (CR) Btullis: [C: +2] Update the main logo that is used in DataHub [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806445 (https://phabricator.wikimedia.org/T310629) (owner: Btullis)
[09:28:46] Data-Engineering, Data-Engineering-Kanban: aqs1008.mgmt interface SSH check flapping - https://phabricator.wikimedia.org/T311042 (BTullis)
[09:32:49] (Merged) jenkins-bot: Update the main logo that is used in DataHub [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806445 (https://phabricator.wikimedia.org/T310629) (owner: Btullis)
[09:44:14] Data-Engineering, SRE, Traffic, Patch-For-Review, User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (phuedx) >>! In T306181#8013301, @Ottomata wrote: > Thanks ben! Seconded. Thanks for all of your w...
[09:44:36] Data-Engineering, Data-Engineering-Kanban: aqs1008.mgmt interface SSH check flapping - https://phabricator.wikimedia.org/T311042 (BTullis) When attempting to log into the management interface with SSH I got this: ` btullis@marlin-wsl:~/wmf/pw$ ssh -l root aqs1008.mgmt.eqiad.wmnet Warning: Permanently add...
[09:44:59] Data-Engineering, Data-Engineering-Kanban: aqs1008.mgmt interface SSH check flapping - https://phabricator.wikimedia.org/T311042 (BTullis) p:Triage→Low
[09:46:26] Data-Engineering, Data-Engineering-Kanban: aqs1008.mgmt interface SSH check flapping - https://phabricator.wikimedia.org/T311042 (BTullis) The check is fixed for now, but I'll monitor for stability. {F35259200,width=70%}
[10:17:59] I'm going to test out `sre.hadoop.roll-restart-masters` cookbook again today for T310293 unless anyone has any objections.
[10:18:00] T310293: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293
[10:42:22] Data-Engineering, Data-Engineering-Kanban: Add the conftool pooled/depooled status and weight into prometheus for each service - https://phabricator.wikimedia.org/T309189 (BTullis) There is a slight hiccup with this change, as pointed out by @fgiunchedi on https://gerrit.wikimedia.org/r/c/operations/pupp...
[10:47:54] !log proceeding with the hadoop.roll-restart-masters cookbook
[10:47:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:39:45] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Update branding for DataHub to include WMF customizations - https://phabricator.wikimedia.org/T310629 (BTullis) This has now been updated (and the varnish cache for the asset expired). Are we happy with it? Maybe we should make it a bit bigger?...
[11:44:56] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Update branding for DataHub to include WMF customizations - https://phabricator.wikimedia.org/T310629 (BTullis) p:Triage→Low
[11:48:23] (CR) Vivian Rook: [C: +2] Show username on 404 page when logged in [analytics/quarry/web] - https://gerrit.wikimedia.org/r/806504 (https://phabricator.wikimedia.org/T310888) (owner: Jiyu)
[11:51:36] (Merged) jenkins-bot: Show username on 404 page when logged in [analytics/quarry/web] - https://gerrit.wikimedia.org/r/806504 (https://phabricator.wikimedia.org/T310888) (owner: Jiyu)
[11:52:24] Quarry, Patch-For-Review: 404 page for no user shows login - https://phabricator.wikimedia.org/T310888 (rook) Open→Resolved
[11:55:17] Data-Engineering, Product-Analytics, SDAW-MediaSearch, Structured-Data-Backlog (Current Work): [M] No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (mfossati) a:mfossati
[11:55:39] (PS1) Masoud Shokohi: Review access change [analytics/aggregator/projectview/data] (refs/meta/config) - https://gerrit.wikimedia.org/r/806997
[11:57:40] The `sre.hadoop.roll-restart-masters` cookbook failed again at the same point. I will look to fail back manually.
[11:58:20] (CR) Masoud Shokohi: "to aggregation" [analytics/aggregator/projectview/data] (refs/meta/config) - https://gerrit.wikimedia.org/r/806997 (owner: Masoud Shokohi)
[12:04:04] RECOVERY - Check unit status of analytics-dumps-fetch-clickstream on clouddumps1001 is OK: OK: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:10:52] Data-Engineering, Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (BTullis) The cookbook failed again at the same place. 
{F35259786,width=60%} I will have to look at this again to see if there is anything more we can do to increase the stability of thi...
[12:47:46] Quarry, Patch-For-Review: Quarry history feature not showing history - https://phabricator.wikimedia.org/T306658 (rook) a:rook
[12:47:52] Data-Engineering, Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (BTullis) This time I left it for 38 minutes between starting the namenode on an-master1001 and attempting the failback. It still failed even with almost four times as long to settle. ` b...
[13:03:34] (PS2) Vivian Rook: Get non-coincidental history entries. [analytics/quarry/web] - https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658)
[13:14:27] (CR) Ottomata: Add Schema for Enriched MW Streams (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/799351 (https://phabricator.wikimedia.org/T308017) (owner: Luke Bowmaker)
[13:14:54] (CR) Ottomata: Add Schema for Enriched MW Streams (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/799351 (https://phabricator.wikimedia.org/T308017) (owner: Luke Bowmaker)
[13:23:57] PROBLEM - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7293 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:24:26] btullis: ^ is you working on namenodes, yes?
[13:28:11] Yes. This is me. We're running on the standby server, which means that this new check I created last week is alerting.
[13:29:45] c.f. 
https://phabricator.wikimedia.org/T309649#8008629
[13:32:27] ACKNOWLEDGEMENT - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7714 seconds old and 217 bytes Btullis T310293 - running on standby server temporarily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:32:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:50] !log sudo systemctl start monitor_refine_event_sanitized_main_immediate.service on an-launcher1002
[13:33:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:34:09] RECOVERY - Check unit status of monitor_refine_event_sanitized_main_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:34:46] (CR) Vivian Rook: "This is looking more functional" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658) (owner: Vivian Rook)
[13:35:07] aqu: ty, was just looking at that.
[13:36:09] I am trying to rerun sanitization on mediawiki_revision_create/datacenter=eqiad/year=2022/month=6/day=19/hour=10. Currently looking for the script...
[13:37:07] For this partition, the refine in `event` is indeed newer than the refine in `event_sanitized`.
[13:50:11] Data-Engineering, Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (BTullis) Trying once again, this time having left it for an hour after starting the namenode service on an-master1001. This is the GC count over the last three hours. {F35260058} ` btull... 
[13:51:44] Data-Engineering, Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (BTullis) Negative. It did not work. It failed with a slightly different error message this time though. ` btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1...
[13:53:58] heya aqu can I help?!
[13:54:00] heya teammm
[13:54:56] What do you think about that to rerun an hourly partition:
[13:54:56] spark2-submit \
[13:54:56] --class org.wikimedia.analytics.refinery.job.refine.RefineSanitize \
[13:54:56] /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.1.15.jar \
[13:54:56] --config_file /home/aqu/config_file.properties \
[13:54:56] --output_database event_sanitized \
[13:54:56] --since 52 \
[13:54:57] --until 51 \
[13:54:57] --table_include_regex mediawiki_revision_create
[13:55:09] mforns ^
[13:55:13] lookin!
[13:55:44] Analytics, Observability-Logging, Patch-For-Review: AQS Cassandra driver logs an incredible amount of small logs at a high rate - https://phabricator.wikimedia.org/T310760 (Eevans) >>! In T310760#8012888, @colewhite wrote: >>>! In T310760#8012470, @Eevans wrote: >> I don't know what these are; The bi...
[13:57:56] aqu: on which machine are you running it?
[13:58:05] to check the properties file
[13:58:05] stat1004
[13:58:11] k
[13:59:37] aqu: there's no config_file.properties in stat1004 no?
[14:00:11] sorry refine.properties
[14:00:18] ah , ok ok
[14:00:40] i wrote the path in irc, but it was beginning with a slash...
[14:03:29] hehe, happens to me all the time
[14:03:52] the properties file looks good to me!~
[14:04:20] you are passing output_database, but I think it is already in the config file no?
[14:05:24] the since and until can be ISO timestamps as well, maybe it's easier, but works with 51, 52
[14:05:50] maybe you need:
[14:05:57] --master yarn \
[14:05:57] --deploy-mode cluster \
[14:05:59] ?
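(Editorial note) The exchange above mentions that `--since` and `--until` accept either relative values such as 52 and 51 or ISO timestamps. A minimal Python sketch of that conversion follows, assuming the integers mean "hours before now"; the helper name `hours_ago_to_iso` and the fixed `now` value are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta, timezone


def hours_ago_to_iso(hours, now):
    """Return the timestamp `hours` before `now`, formatted like the log's ISO values.

    Assumption: a bare integer passed to --since/--until means "hours ago".
    """
    return (now - timedelta(hours=hours)).strftime("%Y-%m-%dT%H:%M:%S%z")


# Hypothetical "now"; the log itself does not state its date.
now = datetime(2022, 6, 21, 14, 5, tzinfo=timezone.utc)
since = hours_ago_to_iso(52, now)
until = hours_ago_to_iso(51, now)
print(since, until)
```

With that hypothetical `now`, the window covers the hour=10 partition of 2022-06-19 that is being rerun; explicit ISO timestamps avoid having to recompute the offsets if the job is launched later.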
[14:06:36] maybe you need also spark configurations, but I'm not sure
[14:11:16] Hello, we at Wikidata team are trying to visualize pageviews for a certain Namespace (EntitySchema) and subpaths of a Special page. CT304793
[14:11:17] How do I get pageview stats for namespace or a special page and is there a way to save them to visualize them in Grafana? Thanks!
[14:11:50] T304793
[14:11:50] T304793: Get metrics on usage of Entity Schemas - https://phabricator.wikimedia.org/T304793
[14:21:53] \o/ I'm back!
[14:32:39] woohoo
[14:35:42] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[14:52:27] (CR) Snwachukwu: Add projectview hql scripts to analytics/refinery/hql path. (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[14:52:38] (CR) Snwachukwu: [C: +1] Add projectview hql scripts to analytics/refinery/hql path. 
[analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[14:56:29] !log RefineSanitize from an-launcher1002: sudo -u analytics kerberos-run-command analytics spark2-submit --class org.wikimedia.analytics.refinery.job.refine.RefineSanitize --master yarn --deploy-mode client /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.1.15.jar --config_file /home/aqu/refine.properties --since "2022-06-19T09:52:00+0000" --until
[14:56:29] "2022-06-19T11:02:00+0000" --table_include_regex mediawiki_revision_create
[14:56:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:56:45] Data-Engineering, Airflow: [Airflow] Refactor HDFSArchiveOperator to run in Skein - https://phabricator.wikimedia.org/T310542 (Snwachukwu) a:Snwachukwu
[15:02:10] ottomata: ping meeting
[15:04:43] OH i went to a different meeting
[15:04:44] coming
[15:21:18] (CR) Joal: Add projectview hql scripts to analytics/refinery/hql path. 
(5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[15:28:12] Quarry: Have an easy way to ban users from Quarry - https://phabricator.wikimedia.org/T104322 (rook) Open→Declined
[15:28:16] Quarry, Patch-For-Review: Create simple admin management tool - https://phabricator.wikimedia.org/T224376 (rook)
[15:29:29] Quarry: Add 'download as SQL' option - https://phabricator.wikimedia.org/T71191 (rook) Open→Declined
[15:56:10] Analytics-Radar, Data-Engineering, ChangeProp, Event-Platform, and 4 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088 (daniel)
[16:13:08] Analytics-Radar, Product-Analytics, Campaign-Registration: Develop a consistent rule for which special pages count as pageviews - https://phabricator.wikimedia.org/T240676 (mpopov) Back in March I asked about an issue with us seeing views for Special pages that aren't allowlisted (as mentioned above)...
[16:14:19] (CR) Snwachukwu: [C: +1] Add projectview hql scripts to analytics/refinery/hql path. (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[16:16:39] (PS1) Joal: Update webrequest load warning and error thresholds [analytics/refinery] - https://gerrit.wikimedia.org/r/807175 (https://phabricator.wikimedia.org/T310576)
[16:21:30] (CR) Snwachukwu: [C: +1] Add projectview hql scripts to analytics/refinery/hql path. (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[16:25:56] (CR) Joal: Add projectview hql scripts to analytics/refinery/hql path. 
(1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[16:27:18] (PS5) Snwachukwu: Add projectview hql scripts to analytics/refinery/hql path. [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023)
[16:32:49] (PS6) Snwachukwu: Add projectview hql scripts to analytics/refinery/hql path. [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023)
[16:32:59] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[16:40:56] (CR) Snwachukwu: Add projectview hql scripts to analytics/refinery/hql path. (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[16:59:47] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Cmjohnson) @BTullis Can you confirm raid configuration and partman recipe to use please?
[17:01:23] (CR) Ottomata: [C: +1] Update webrequest load warning and error thresholds [analytics/refinery] - https://gerrit.wikimedia.org/r/807175 (https://phabricator.wikimedia.org/T310576) (owner: Joal)
[17:31:03] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow: pin dependency versions to prevent long installs - https://phabricator.wikimedia.org/T309046 (JAllemandou) I faced the same issue and the problem was due to a failed install of a previous package due to a missing dependency on the host (see...
[17:42:22] mforns, aqu: would you have a minute to talk about my airflow cassandra loading job ?
[17:54:49] Data-Engineering, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson)
[18:07:28] Data-Engineering, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson) @btullis can you confirm what the raid configuration is supposed to be please. 2 SSD Raid 1? and Raid 10 th...
[18:07:40] how hard would it be to get some basic data on web requests containing duplicate query parameters -- how prevalent they are, and whether there is any pattern (particular query parameters)
[18:10:26] I'm not sure if there's a clever way to do that with HQL built-ins (it looks like it might be possible to use explode() / lateral views?) or if it requires a bespoke UDF
[18:53:05] ori: easiest and most efficient would probably be to use a UDF with Spark
[18:53:37] ori: With spark no need to compile/release, you can write UDF in scala/python in a notebook and use them straight away
[18:57:10] joal was in a huddle, I can meet now if you're avail?
[19:01:16] ping me whenever joal :]
[19:10:26] joal: cool. do i need kerberos access for that?
[19:13:37] Analytics: Requesting Kerberos access for ori - https://phabricator.wikimedia.org/T311088 (ori)
[19:29:54] ori: you do - you need kerberos to access the cluster where data is stored and the compute associated
[19:31:02] joal: ack, filed a request
[19:44:41] Analytics, Observability-Logging: AQS Cassandra driver logs an incredible amount of small logs at a high rate - https://phabricator.wikimedia.org/T310760 (colewhite) >>! In T310760#8016779, @Eevans wrote: > I can't tell what these are without the missing `info` field; Hopefully things will be clearer whe...
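(Editorial note) The duplicate-query-parameter question above could be approached with a small Python function of the kind joal suggests writing in a notebook. The sketch below is only an illustration under stated assumptions: the function name `duplicate_query_params` is hypothetical, and it assumes the query string arrives as a single text value that may start with `?`. It uses only the Python standard library.

```python
from collections import Counter
from urllib.parse import parse_qsl


def duplicate_query_params(uri_query):
    """Return the sorted names of query parameters that appear more than once.

    Assumption: `uri_query` is the raw query string of a request, optionally
    prefixed with "?". Empty or missing input yields an empty list.
    """
    if not uri_query:
        return []
    query = uri_query.lstrip("?")
    # keep_blank_values=True so that "a=&a=1" still counts "a" twice.
    counts = Counter(name for name, _ in parse_qsl(query, keep_blank_values=True))
    return sorted(name for name, n in counts.items() if n > 1)


print(duplicate_query_params("?title=Foo&action=raw&action=edit"))
```

A function like this could presumably be wrapped as a Spark UDF (for example with `pyspark.sql.functions.udf`) and applied to the query-string column of the request data, then exploded and grouped by parameter name to see how prevalent each duplicate is; that wiring is left out here because the table and column names are not given in the log.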
[19:46:03] Analytics: Requesting Kerberos access for ori - https://phabricator.wikimedia.org/T311088 (Ottomata) Approved and fulfilling...
[19:53:13] ori, done!
[19:53:37] Analytics, Patch-For-Review: Requesting Kerberos access for ori - https://phabricator.wikimedia.org/T311088 (Ottomata) Open→Resolved a:Ottomata
[20:08:39] Analytics-Radar, Data-Engineering, ChangeProp, Event-Platform, and 4 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088 (Krinkle)
[20:13:18] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[20:14:42] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[20:15:29] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[20:17:06] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[20:20:45] <3 thanks!
[20:23:05] <3
[20:23:07] yw
[20:26:51] RECOVERY - AQS root url on aqs2001 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[20:54:14] Analytics, Observability-Logging, Patch-For-Review: AQS Cassandra driver logs an incredible amount of small logs at a high rate - https://phabricator.wikimedia.org/T310760 (Eevans) >>! In T310760#8017823, @colewhite wrote: >>>! In T310760#8016779, @Eevans wrote: >> I can't tell what these are without...