[00:00:18] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[00:05:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[03:30:03] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:17] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:09] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:27:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:39] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:46:50] Quarry, GitLab (Project Migration): Move quarry to gitlab or github - https://phabricator.wikimedia.org/T308978 (Aklapper) I believe that random third-party hosting locations like GitHub should be off the table. In my understanding, everything should be on Wikimedia GitLab in the long run, instead of Wi...
[06:49:19] Quarry, GitLab (Project Migration): Move quarry to gitlab or github - https://phabricator.wikimedia.org/T308978 (Aklapper) (In any case, please disable "Issues" on https://github.com/toolforge/quarry to avoid fragmentation and duplication - thanks.)
[09:52:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5004 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5004%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:57:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5004 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5004%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:28:13] Analytics, Analytics-Wikistats, Data Engineering Planning, Data Pipelines: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Iflaq) There is a related discussion on the Kashmiri Wikipedia about this task at [[https://ks.wikipedia.org/wiki/%D9%88%D9%90%DA%A9%DB%8C%D9%96%...
[10:34:36] Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (ayounsi)
[10:35:45] Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (ayounsi)
[11:21:51] Data-Engineering-Kanban, Data Engineering Planning, Data Pipelines: Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (EChetty)
[12:06:36] milimetric: My apologies for not catching the error in the webrequest CR
[12:39:41] Quarry, GitLab (Project Migration): Move quarry to gitlab or github - https://phabricator.wikimedia.org/T308978 (rook) Issues disabled. > everything should be on Wikimedia GitLab in the long run I generally agree. As I mentioned in the email once gitlab has necessary features enabled quarry can move t...
[13:09:27] It seems that we have experienced the "airflow failure without email" error again :(
[13:29:58] there are quite some red tasks
[13:31:19] I only noticed because we received an alert for a dependent job still running in oozie
[13:31:30] But this would have gone unnoticed otherwise
[13:32:38] I have found a way for us to check failed dagRuns through an API call - We could therefore devise a small systemd timer or an icinga alert to let us know
[13:33:13] the only issue is that the API parameter I use seems not yet present in the version we have - so I'll need to wait for the new version
[13:33:46] Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (Ottomata) +1 > if external resources (eg. git) needs to be fetched from the Internet JVM Dependencies are automatically fetched from the internet (maven central usually) and cached in our archiva.
[13:36:11] !log rerun failed airflow tasks
[13:36:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:45:54] Analytics-Clusters, Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (ayounsi) Cf. {T317182} for when it's time to tackle archiva (or before)
[13:46:44] joal: the systemd timer to check airflow is ironic but sounds like a solid idea
[13:46:51] :)
[13:46:54] (no worries about the webrequest thing, btw)
[13:47:24] is the airflow API you need in the new version Sandra is upgrading to?
[13:47:36] I'm gonna check this
[14:15:10] milimetric: I'd be even happier if it was not a systemd timer but an icinga (or alert-manager) alert
[14:15:15] (CR) Mforns: [C: +1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/830260 (owner: Milimetric)
[14:21:50] milimetric: in http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow/latest/release_notes.html , search for 20485 - This was added in version 2.3.0
[14:22:10] We plan on upgrading to version 2.3.2 - so it would work
[14:22:34] I'm gonna add all this on the ticket about failures, and will answer the SLA email
[14:29:26] Data-Engineering-Kanban, Data Engineering Planning, Data Pipelines: Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (JAllemandou) The problem has happened again today. I'm in favor of creating a dedicated for airflow failed tasks: if any da...
[14:34:37] sweet :)
[16:12:53] (PS1) Eigyan: Update mobilewebuiactionstracking with userGroups reference. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/830664 (https://phabricator.wikimedia.org/T316230)
[16:14:19] Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (BTullis) I'm also fine with this move and the use of the squid proxies.
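(Editor's note on the 13:32 to 14:29 discussion above: the log does not say which API parameter joal had in mind. The sketch below assumes it is the `state` filter on the list-DAG-runs endpoint of Airflow's stable REST API, with `~` standing in for "all DAGs". The base URL and the function names are hypothetical, and authentication is left to the caller. It is only an illustration of how a timer or alert check might list failed runs, not the team's actual check.)

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def failed_runs_url(base_url, limit=100):
    """Build the list-DAG-runs URL, using '~' to span all DAGs."""
    # ASSUMPTION: `state` is the filter meant in the chat; the log does not name it.
    query = urlencode({"state": "failed", "limit": limit})
    return f"{base_url.rstrip('/')}/api/v1/dags/~/dagRuns?{query}"


def summarize_failed(payload):
    """Return 'dag_id dag_run_id' strings for failed runs in an API response."""
    runs = payload.get("dag_runs", [])
    # Re-check the state client-side in case the filter was not applied.
    return [
        f"{run['dag_id']} {run['dag_run_id']}"
        for run in runs
        if run.get("state") == "failed"
    ]


def fetch_failed(base_url, headers=None):
    """Query the API and summarize failed runs (needs network access)."""
    request = Request(failed_runs_url(base_url), headers=headers or {})
    with urlopen(request, timeout=30) as response:
        return summarize_failed(json.load(response))
```

(A check built on this could exit non-zero whenever `fetch_failed(...)` returns a non-empty list, which would suit either a systemd timer or an icinga-style probe.)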
[16:23:23] Data-Engineering, Event-Platform Value Stream (Sprint 01), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) @gmodena, @Milimetric, @dcausse, I need some help with modeling current revision visibility c...
[16:27:26] Data-Engineering, Event-Platform Value Stream (Sprint 01), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) Hm, actually doing ^^ (skipping extraneous revision details in prior_state) makes dealing wit...
[16:37:40] (Abandoned) Vivian Rook: [WIP] Revert "Revert "Database selection"" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/286094 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk)
[16:44:30] Quarry, Patch-For-Review: Make available more options for number of shown rows of resultset (Quarry) - https://phabricator.wikimedia.org/T126540 (rook) Moving proposed change https://github.com/toolforge/quarry/pull/4
[16:55:58] (Abandoned) Vivian Rook: Use flask.jsonify instead of json.dumps [analytics/quarry/web] - https://gerrit.wikimedia.org/r/455237 (owner: Framawiki)
[17:02:50] Quarry: Change toggle highlighting button from btn-sm to btn-xs - https://phabricator.wikimedia.org/T317222 (rook)
[17:03:27] (PS2) Vivian Rook: view.html: Change toggle highlighting button from btn-sm to btn-xs [analytics/quarry/web] - https://gerrit.wikimedia.org/r/561339 (https://phabricator.wikimedia.org/T317222) (owner: Zhuyifei1999)
[17:03:31] ok folks, done for today
[17:22:14] Data-Engineering, Event-Platform Value Stream (Sprint 01), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) One more question for you all. The existent EventBus code uses an anonymous actor role whe...
[17:40:25] Quarry, Documentation-Review-Board, Key docs update 2021-22: Quarry docs - https://phabricator.wikimedia.org/T307011 (apaskulin) Looks great. Thanks, @KBach!
[18:25:23] btullis: anything I should know before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/830666 ?
[18:26:07] I see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Turnilo#Test_config_changes
[19:08:34] (CR) Jsn.sherman: [C: +1] "Looks good to me!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/830664 (https://phabricator.wikimedia.org/T316230) (owner: Eigyan)
[19:10:32] (all done)
[22:06:06] cdanis: thanks for checking in and apologies for missing the boat. All good though. 🙂
[22:27:47] (CR) EllenR: [C: +1] "nice Essex!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/830664 (https://phabricator.wikimedia.org/T316230) (owner: Eigyan)
[22:28:22] Data-Engineering, DBA, Data-Services, Toolforge, cloud-services-team (Kanban): Replica templatelinks table is broken for some sites - https://phabricator.wikimedia.org/T317258 (BrandonXLF)
[22:29:30] Data-Engineering, DBA, Data-Services, Toolforge, cloud-services-team (Kanban): Replica templatelinks table is broken for some sites - https://phabricator.wikimedia.org/T317258 (BrandonXLF)
[22:30:32] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Replica templatelinks table is broken for some sites - https://phabricator.wikimedia.org/T317258 (JJMC89)
[22:33:19] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Replica templatelinks table is broken for some sites - https://phabricator.wikimedia.org/T317258 (JJMC89) This is due to the work for {T299417}. > For dewiki, the table only has 3 columns when it should have 5 (it's missing tl_t...
[22:36:28] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Replica templatelinks table is broken for some sites - https://phabricator.wikimedia.org/T317258 (BrandonXLF) >>! In T317258#8219365, @JJMC89 wrote: > This is due to the work for {T299417}. > >> For dewiki, the table only has 3...
[22:59:48] (CR) Scardenasmolinar: [C: +2] "LGTM!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/830664 (https://phabricator.wikimedia.org/T316230) (owner: Eigyan)
[23:00:35] (Merged) jenkins-bot: Update mobilewebuiactionstracking with userGroups reference. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/830664 (https://phabricator.wikimedia.org/T316230) (owner: Eigyan)
[23:13:27] Data-Engineering-Kanban, Data Engineering Planning, Phabricator, Product-Analytics, wmfdata-python: Herald rule to add Product Analytics and Data Engineering tags to Wmfdata-Python tasks - https://phabricator.wikimedia.org/T304572 (JArguello-WMF)