[00:09:05] (CR) Jforrester: [C: +1] "I don't have merge/deploy rights in this world, but it LGTM." [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/879595 (https://phabricator.wikimedia.org/T326825) (owner: Cicalese)
[01:34:39] (CR) Gergő Tisza: [C: +1] image-suggestions-feedback: Bump to version 2.0.0 (5 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/809150 (https://phabricator.wikimedia.org/T302925) (owner: Kosta Harlan)
[02:26:46] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:58:34] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:10:32] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (KCVelaga_WMF) @JAnstee_WMF Affiliate inputs QA at https://docs.google.com/spreadsheets/d/1yx4x96407HT9fTq1KrQxB_ZChKK8bJ9_NKGRPqynNjA/edit?pli=1#gid=0&range=Q3
[09:16:06] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (KCVelaga_WMF) a:ntsako→JAnstee_WMF
[09:50:22] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:13:05] (CR) Joal: "Two small things:" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[10:22:10] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough,
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:36:00] Data-Engineering-Planning, Epic, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) Thanks for those insights @MatthewVernon - I think I'll go ahead and try the packag...
[11:18:12] Data-Engineering, Equity-Landscape: Affiliates output rank metrics - https://phabricator.wikimedia.org/T306619 (KCVelaga_WMF) @JAnstee_WMF: affiliate outputs are QC'ed Transformations within the sheet from the inputs: https://docs.google.com/spreadsheets/d/1yx4x96407HT9fTq1KrQxB_ZChKK8bJ9_NKGRPqynNjA/ed...
[11:38:10] Data-Engineering-Planning, Data Pipelines, Patch-For-Review, Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (EChetty)
[11:38:27] Data-Engineering-Planning, Data Pipelines, Patch-For-Review, Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (EChetty)
[11:38:58] Data-Engineering-Planning, Data Pipelines, Patch-For-Review, Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (EChetty)
[11:39:46] Data-Engineering-Planning, Data Pipelines, Patch-For-Review, Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (EChetty)
[11:39:56] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): Investigate wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (EChetty) Open→Resolved
[11:40:58] Data-Engineering, Data Pipelines: Migrate 1+ Druid load jobs -
https://phabricator.wikimedia.org/T307508 (EChetty)
[11:41:05] Data-Engineering-Planning, Data Pipelines: Back-fill Wikidata reliability Graphite metrics - https://phabricator.wikimedia.org/T321838 (EChetty)
[11:41:10] Data-Engineering-Planning, Data Pipelines: Add Python Linter Checks to CI - https://phabricator.wikimedia.org/T318346 (EChetty)
[11:41:15] Data-Engineering-Planning, Data Pipelines, Product-Analytics: Review why total_edits on Mediawiki_History differs from the total_edits on Editors_Daily - https://phabricator.wikimedia.org/T316896 (EChetty)
[11:41:19] Data-Engineering-Planning, Data Pipelines: Implement periodical cleaning of Airflow databases - https://phabricator.wikimedia.org/T322036 (EChetty)
[11:41:23] Data-Engineering-Planning, Data Pipelines: NEW FEATURE REQUEST: sqoop (all) user properties from mariadb to wmf_raw.mediawiki_user_properties - https://phabricator.wikimedia.org/T323456 (EChetty)
[11:41:48] Data-Engineering-Planning, Data Pipelines, Patch-For-Review: Update sqoop for CheckUser table - https://phabricator.wikimedia.org/T326330 (EChetty)
[11:42:12] Data-Engineering-Planning, Data Pipelines, Product-Analytics: Add TikTok's in-app browser to ua-parser library - https://phabricator.wikimedia.org/T325611 (EChetty)
[11:45:47] Data-Engineering-Planning, Data Pipelines: When moving oozie webrequest-load to airflow/spark avoid the error-check corner case - https://phabricator.wikimedia.org/T324757 (EChetty)
[11:46:01] Data-Engineering-Planning, Data Pipelines: Drop MediaViewer and MultimediaViewer* tables - https://phabricator.wikimedia.org/T311229 (EChetty)
[11:47:12] Data-Engineering-Planning, Data Pipelines (Sprint 07): When moving oozie webrequest-load to airflow/spark avoid the error-check corner case - https://phabricator.wikimedia.org/T324757 (EChetty)
[11:47:26] Data-Engineering-Planning, Data Pipelines: NEW FEATURE REQUEST: Dataset with active and non-active
Wikis - https://phabricator.wikimedia.org/T323662 (EChetty) p:Triage→Medium
[11:47:46] Data-Engineering-Planning, Data Pipelines (Sprint 07): Drop MediaViewer and MultimediaViewer* tables - https://phabricator.wikimedia.org/T311229 (EChetty)
[11:48:58] Data-Engineering-Planning, Data Pipelines, Patch-For-Review: Update sqoop for CheckUser table - https://phabricator.wikimedia.org/T326330 (Zabe) In 3 days or so the `cuc_comment_id` will be fully populated (it already is everywhere except wikidatawiki), thus you can also migrate to read from that ins...
[12:12:02] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.3 - https://phabricator.wikimedia.org/T323458 (BTullis)
[12:49:48] PROBLEM - Host aqs2007 is DOWN: PING CRITICAL - Packet loss = 100%
[12:49:49] PROBLEM - Host aqs2008 is DOWN: PING CRITICAL - Packet loss = 100%
[12:50:26] PROBLEM - Host aqs2006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:50:26] PROBLEM - Host aqs2005 is DOWN: PING CRITICAL - Packet loss = 100%
[13:00:39] --^ There is an issue at the moment affecting codfw - It's being discussed in #mediawiki_security but I don't believe that we need to do anything at the moment.
[13:00:42] (PS1) Simone Cuomo: Add new action to be able to track sessions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/880953 (https://phabricator.wikimedia.org/T326663)
[13:00:54] ack btullis - thanks
[13:01:12] btullis: Would you mind checking if this will affect webrequest traffic data please?
[13:04:55] Data-Engineering-Planning, Data Pipelines: NEW FEATURE REQUEST: Dataset with active and non-active Wikis - https://phabricator.wikimedia.org/T323662 (EChetty) @kzimmerman Do we have an existing definition of active we want to use here? dan has: from editors where edits > 4 and from active_...
[13:07:19] (PS2) Simone Cuomo: Update searchPreview schema to be inline with required changes [schemas/event/secondary] - https://gerrit.wikimedia.org/r/880953 (https://phabricator.wikimedia.org/T326663)
[13:07:47] (CR) CI reject: [V: -1] Update searchPreview schema to be inline with required changes [schemas/event/secondary] - https://gerrit.wikimedia.org/r/880953 (https://phabricator.wikimedia.org/T326663) (owner: Simone Cuomo)
[13:20:31] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected
[13:20:31] 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:20:43] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:20:49] PROBLEM - aqs endpoints health on aqs2008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Test Get aggregate mediarequests returned
[13:20:49] expected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:21:11] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL:
/analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per fil
[13:21:11] sts returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:33] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:28:12] joal: Yes, I will check.
[13:32:01] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:34:19] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CR
[13:34:19] Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:40:25] RECOVERY - aqs endpoints health on aqs2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:41:45] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:45:15] PROBLEM - aqs endpoints health on
aqs2008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL:
[13:45:15] t per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:54:34] (CR) Matthias Mullie: Update searchPreview schema to be inline with required changes (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/880953 (https://phabricator.wikimedia.org/T326663) (owner: Simone Cuomo)
[13:54:35] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:57:59] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:01:34] (PS3) Simone Cuomo: Update searchPreview schema to be inline with required changes [schemas/event/secondary] - https://gerrit.wikimedia.org/r/880953 (https://phabricator.wikimedia.org/T326663)
[14:02:09] (CR) Simone Cuomo: "Yeah I just realised that while testing the UI!
All fixed now" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/880953 (https://phabricator.wikimedia.org/T326663) (owner: Simone Cuomo)
[14:04:25] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CR
[14:04:25] Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:06:01] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:12:37] RECOVERY - aqs endpoints health on aqs2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:12:57] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:18:41] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per fil
[14:18:41] sts returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:21:01] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is
CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Tes
[14:21:01] ggregate page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:22:37] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:23:35] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:28:27] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file
[14:28:27] s returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:28:39] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:29:07] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:32:21] RECOVERY - aqs endpoints
health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:34:47] (CR) Snwachukwu: Refactor and Expand External referer classification (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[14:35:14] PROBLEM - aqs endpoints health on aqs2008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by pa
[14:35:14] s returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:35:15] (VarnishkafkaNoMessages) firing: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:36:43] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:37:12] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpecte
[14:37:12] s 500 (expecting: 200):
/analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:38:05] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:41:39] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Tes
[14:41:39] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CR
[14:41:39] Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:42:07] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:46:57] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL:
/analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per fil
[14:46:57] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:54:59] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:55:51] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:59:49] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected
[14:59:49] 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:01:56] (CR) Mforns: [V: +2 C: +2] "LGTM!"
[analytics/refinery] - https://gerrit.wikimedia.org/r/873013 (https://phabricator.wikimedia.org/T293583) (owner: Addshore)
[15:05:45] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:06:13] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:10:37] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top
[15:10:37] ies by page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:11:03] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Test Get a
[15:11:03] e mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:22:27] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:27:39] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough,
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:31:29] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:33:15] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:34:57] RECOVERY - aqs endpoints health on aqs2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:50:17] (CR) Ottomata: image-suggestions-feedback: Bump to version 2.0.0 (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/809150 (https://phabricator.wikimedia.org/T302925) (owner: Kosta Harlan)
[16:00:45] Data-Engineering, Equity-Landscape: Affiliates output rank metrics - https://phabricator.wikimedia.org/T306619 (KCVelaga_WMF) a:KCVelaga_WMF→JAnstee_WMF
[16:02:15] Data-Engineering, Equity-Landscape: Overall Engagement output rank metric - https://phabricator.wikimedia.org/T306622 (KCVelaga_WMF) @JAnstee_WMF The QA of overall engagement metric is ready for your review: https://docs.google.com/spreadsheets/d/1GnKHC9yT5tN_xmEltCGdHEONI5GqjTiWaVI9zXNp4rQ/edit#gid=155...
[16:02:20] Data-Engineering, Equity-Landscape: Overall Engagement output rank metric - https://phabricator.wikimedia.org/T306622 (KCVelaga_WMF) a:KCVelaga_WMF→JAnstee_WMF
[16:07:11] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:22] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms
[16:20:43] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:25:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2036 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2036%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:28:21] The incident affecting the network at codfw is largely over, in that the network connectivity appears stable again. codfw is about to be repooled, I believe.
[16:29:53] joal: There aren't expected to be any issues with webrequest or anything else related to the event platform. Neither kafka nor hadoop was affected. aqs/cassandra is almost back to normal.
[16:30:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2036 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2036%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:40:49] ACKNOWLEDGEMENT - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Another BBU failure - I will add it to: T326127 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:45:33] Data-Engineering, SRE, ops-eqiad: Check BBU on an-worker1080, an-worker1084, and an-worker1086 - https://phabricator.wikimedia.org/T325984 (BTullis)
[16:47:50] (☞゚ヮ゚)☞
[16:48:13] (PS13) Snwachukwu: Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769)
[16:50:33] !log shutdown an-worker1086 for RAID BBU replacement
[16:50:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:52:59] (CR) CI reject: [V: -1] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[17:00:45] Data-Engineering-Planning, Data Pipelines (Sprint 07), Patch-For-Review: Update sqoop for CheckUser table - https://phabricator.wikimedia.org/T326330 (EChetty)
[17:01:20] Data-Engineering-Planning, Data Pipelines: Drop MediaViewer and MultimediaViewer* tables - https://phabricator.wikimedia.org/T311229 (EChetty)
[17:04:59] thanks btullis for
the heads up on codfw issue
[17:09:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2030 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2030%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:14:03] (CR) Joal: "All my comments have been tackled - the jenkins tests don't pass though :(" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[17:14:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp2030 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:24:46] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Papaul) @BTullis any update on this?
[17:25:22] (CR) BPirkle: [C: +2] Update pingback MediaWiki versions to include new values [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/879595 (https://phabricator.wikimedia.org/T326825) (owner: Cicalese)
[17:27:17] (CR) Snwachukwu: Refactor and Expand External referer classification (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[17:27:45] Data-Engineering-Planning, Epic, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) Having examined the puppet manifests that we have for ceph, I believe that we can r...
[17:28:20] (CR) BPirkle: [V: +2 C: +2] Update pingback MediaWiki versions to include new values [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/879595 (https://phabricator.wikimedia.org/T326825) (owner: Cicalese)
[17:30:50] Data-Engineering-Planning, Epic, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis)
[18:10:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:26:33] are there any examples of deployments that ship custom conda environments via spark or skein onto hadoop workers?
[18:27:49] starting to review our update to spark3, and probably the main thing is moving from virtualenvs to conda
[19:00:37] ebernhardson: I don't know of any :(
[19:06:41] no worries, i'm sure i'll figure something out :) might save a day or two if there were examples but it looks doable from docs
[19:09:17] Thanks a lot for debunking this ebernhardson - I guess we'll take examples :)
[19:14:56] the sizes are a bit scary though :S first creation of a conda env is 300M without even installing anything. But hopefully can figure out how to get it to reference the conda-analytics that's already on the nodes
[19:17:40] ebernhardson: this is a known issue unfortunately :( your feedback will be very welcome on that front
[19:18:40] joal: we do, no? our airflow deployment does it
[19:19:08] https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Developer_guide#Artifacts
[19:19:08] ottomata: I can't recall that we do - possibly I didn't know we do!
[19:19:43] that plus our SparkSubmitOperator with launcher=skein [19:19:56] ebernhardson: ...want to switch to airflow 2 and our airflow-dags repo? :) [19:20:04] we can make you a new airflow instance [19:20:28] i think maybe the image suggestions folks do this with spark3 now? cc xcollazo ? [19:21:19] https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Developer_guide#SparkSubmitOperator [19:21:20] ottomata: hmm, maybe. I'd have to review how much work that would be. it's about 20 dags [19:22:03] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/operators/spark.py#L63-69 [19:22:12] ebernhardson: if you are making a new spark 3 / conda job anyway, you could just start with that one? [19:22:38] and migrate the others as a separate task [19:23:12] well, we have one spark3 job already and all it involved was changing the spark-submit executable in the airflow connection. But that one is a plain pyspark with no additional deps [19:23:41] switched it last week [19:24:01] i'm sure you could do it all in yours, and/or we could move our SparkSubmitOperator to a more easily lib (we kept it in airflow-dags to make it easier to develop together) [19:24:31] i also have a custom SparkSubmitOperator, would have to see how they vary :) [19:24:34] if you switch though, you get artifact (conda env) deployment and Spark skein stuff built in [19:24:56] Operator: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/operators/spark.py [19:25:00] but more interesting is hook [19:25:14] we have a simple skein and a spark skein one [19:25:14] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/wmf_airflow_common/hooks [19:25:28] the SparkSkein one [19:25:29] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/hooks/spark.py#L185 [19:25:35] handles doing the right thing with the artifacts
[19:25:53] what's the benefit of skein over spark in cluster mode? [19:25:55] e.g. [19:25:55] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/hooks/spark.py#L289 [19:26:17] in docs of hook: [19:26:20] spark in cluster mode does keep work off of the airflow executor, but [19:26:20] still requires that e.g. python scripts or other resources needed to [19:26:20] launch the spark job are deployed locally to the executor. By using [19:26:20] skein, we can pull down files/archives. [19:26:51] for java there is no real diff, as you can do e.g. hdfs://path/to/app.jar [19:27:08] ahh, i suppose so far i've always shipped those with --files, which decompresses into the target. but indeed i didn't figure out how to have custom setup other than what's in the zip being unzipped [19:27:10] and the yarn app master in cluster mode will handle launching from that jar [19:27:27] for python, you can't launch unless your python script is where you are launching from. [19:28:42] skein yarn client also works a little nicer with airflow UI, you get spark master logs in airflow UI [19:29:58] yea, we ship the python script to run with --files as well. I can check this all out, it seems to replicate what we already have in spark2 [19:32:05] oh ebernhardson we also have a nice for_virtualenv factory for SparkSubmitOperator [19:32:06] example: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/platform_eng/dags/image_suggestions_dag.py#L265-272 [19:32:56] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/operators/spark.py#L211-290 [19:48:07] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Rats, neither the [[ https://gerrit.wikimedia.org/r/879618 | NetworkPolicy...
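Editor's note on the conda/skein discussion above: the idea of shipping a packed conda environment to the Hadoop workers can be sketched with plain spark-submit options. This is only an illustration, not the airflow-dags SparkSubmitOperator or its skein hook. The function name, the HDFS path, the script name, and the `venv` alias are all hypothetical; `--archives` with a `#alias` suffix and `spark.pyspark.python` are standard Spark options, though the exact interpreter setting a given cluster needs may differ.

```python
import shlex


def build_spark_submit_args(env_archive, script, alias="venv", deploy_mode="cluster"):
    """Build a spark-submit argv that ships a packed conda env to YARN.

    Hypothetical helper for illustration only; all paths are placeholders.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        # YARN unpacks the archive into each container's working
        # directory under the name given after '#'.
        "--archives", f"{env_archive}#{alias}",
        # Point PySpark at the interpreter inside the unpacked env.
        "--conf", f"spark.pyspark.python=./{alias}/bin/python",
        # The script path is resolved where spark-submit runs, which is
        # the limitation the hook docs quoted above describe.
        script,
    ]


if __name__ == "__main__":
    args = build_spark_submit_args(
        "hdfs:///user/example/artifacts/my_env.tgz",  # assumed location
        "my_job.py",  # assumed script name
    )
    print(shlex.join(args))
```

With a skein launcher, the same archive and the script itself could instead be pulled down into the YARN application master's container before spark-submit runs, which is the benefit described in the hook docs.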
[19:57:44] Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (mpopov) > Would that affect my ability to login to Jupyter because I haven't been able to? Yep. The original ticket has been re-opened and the username will need to be updated before you're able to log in. [20:36:55] (CR) Kosta Harlan: image-suggestions-feedback: Bump to version 2.0.0 (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/809150 (https://phabricator.wikimedia.org/T302925) (owner: Kosta Harlan) [20:44:10] (CR) Ottomata: image-suggestions-feedback: Bump to version 2.0.0 (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/809150 (https://phabricator.wikimedia.org/T302925) (owner: Kosta Harlan) [20:54:53] !log dropping old partitions from image_suggestions Hive tables as per https://phabricator.wikimedia.org/T325837 [20:54:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:55:57] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink application and flink-kubernetes-operator production docker images - https://phabricator.wikimedia.org/T316519 (Ottomata) Hm, am confused by a production-images vs blubber user thing. In operation/production-...
[22:26:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:31:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:43:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:48:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:53:42] (VarnishkafkaNoMessages) firing: (5) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:53:42] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp2027 is not 
sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:54:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:57:36] Data-Engineering, Data-Engineering-Kanban: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (TimTheK) I just tried 2.3.0 and I got the same error: Failed to execute goal on project atlas-testtools: Could not resolve dependencies... [22:58:42] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:58:42] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:59:12] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages