[00:04:41] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:57] (PS1) Xcollazo: movement_metrics: Make wmfdata happy by not starting a Hive SET command with a newline char. [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/907550 (https://phabricator.wikimedia.org/T334302)
[00:16:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:22] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:34:41] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:22] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:24:41] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:26:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:41] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:33:22] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:52:32] (CR) Phedenskog: "Hi Larissa, I think you need to run npm run build-modified and rebuild so that all version files are re-generated." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012) (owner: Lgaulia)
[03:53:54] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run and hive.run - https://phabricator.wikimedia.org/T324135 (nshahquinn-wmf)
[03:56:21] Data-Engineering, Product-Analytics, Wmfdata-Python: PyHive ignores SET statements with a leading newline - https://phabricator.wikimedia.org/T334442 (nshahquinn-wmf)
[03:58:57] Data-Engineering, Product-Analytics, Wmfdata-Python: PyHive ignores SET statements with a leading newline - https://phabricator.wikimedia.org/T334442 (nshahquinn-wmf)
[04:10:24] Data-Engineering, Product-Analytics, Wmfdata-Python: PyHive ignores SET statements with a leading newline - https://phabricator.wikimedia.org/T334442 (nshahquinn-wmf) This issue caused T334302.
[07:34:18] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:39:19] Data-Engineering, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (elukey)
[07:40:08] Data-Engineering, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (elukey)
[07:42:39] Analytics-Radar, Data-Engineering, Data-Engineering-Wikistats, Diffusion-Repository-Administrators, and 4 others: Archive analytics/wikistats - https://phabricator.wikimedia.org/T332004 (hashar) @Nemo_bis can you rebase your pending patches to the `analytics/wikistats` repo and I will happily mer...
[07:57:36] Data-Engineering, Machine-Learning-Team, Research, Event-Platform Value Stream (Sprint 11), Patch-For-Review: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (dcausse) >>! In T331401#8765288, @Isaac wrote: >> Q: Will it be us...
[08:14:56] !log About to deploy analytics/refinery (To migrate webrequest load from Oozie to Airflow)
[08:14:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:37:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:47] Data-Engineering, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (elukey) @Jclark-ctr thanks! I tried to check the serial console but I still see the error msg about the preserved cache, and I can't really do much on the menu.. the mai...
[08:47:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:15] Hello steve_munene, do you have some time for a fix about the analytics/refinery deploy?
[09:14:30] Hello aqu yes I do
[09:30:30] Cool. Batcave? steve_munene>
[09:31:10] yes, please reshape link
[09:31:25] reshare
[09:32:27] https://meet.google.com/rxb-bjxn-nip?authuser=0
[09:37:47] (CR) Ayounsi: [C: +1] "The fields overall make sense to me. I don't know event-stream enough to review the full change though." [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[10:07:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:17:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:19:25] joal With steve_munene we haven't found the root cause.
[10:19:25] * but enlarging permission temporarily to .git/fat
[10:19:25] * and marking the local repo as safe (sudo -u hdfs git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/bed78f6e5c52ce9752af10064cd26dde01462db0 )
[10:19:25] did the job
[10:35:27] (PS1) Gerrit maintenance bot: Add guw.wikinews to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/907826 (https://phabricator.wikimedia.org/T334459)
[11:20:46] Data-Engineering-Planning, Event-Platform Value Stream, serviceops-radar: mediwiki-event-enrichment in k8s should use mwapi-async envoy listener for stream config in - https://phabricator.wikimedia.org/T333575 (JMeybohm) Actually (depending on the request volume) you may want to use `mw-api-int-async...
[11:29:16] Data-Engineering-Planning, Event-Platform Value Stream, serviceops-radar: mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for stream config in - https://phabricator.wikimedia.org/T333575 (Clement_Goubert)
[12:08:07] Data-Engineering-Planning, Event-Platform Value Stream, serviceops-radar: mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for stream config in - https://phabricator.wikimedia.org/T333575 (Clement_Goubert) What is the current request volume for these calls?
[12:16:20] (PS1) Lgaulia: Add first input delay schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/907871 (https://phabricator.wikimedia.org/T332012)
[12:17:07] (Abandoned) Lgaulia: Add first input delay schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012) (owner: Lgaulia)
[13:16:30] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 11): Event Driven Data Pipelines should be generated from a template - https://phabricator.wikimedia.org/T324980 (JArguello-WMF) a:tchin
[13:37:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:42:19] steve_munene: o/
[13:42:27] is analytics1069 expected to be down?
[13:42:40] I noticed by chance in icinga that it has been down for days
[13:43:07] (and it should be a journalnode host)
[13:43:36] (PS2) Bearloga: movement_metrics: Fix whitespace PyHive error [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/907550 (https://phabricator.wikimedia.org/T334302) (owner: Xcollazo)
[13:46:06] !log powercycle analytics1069, down for some days now, host stuck from the mgmt/serial console
[13:46:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:46:10] powercycled
[13:47:43] RECOVERY - Host analytics1069 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms
[13:47:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:03] (CR) Phedenskog: [C: +1] Add first input delay schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/907871 (https://phabricator.wikimedia.org/T332012) (owner: Lgaulia)
[13:53:09] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Puppet last ran 14 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:53:14] ok it is recovering
[13:57:58] Data-Engineering, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Tchanders)
[14:05:37] Data-Engineering, Product-Analytics, Wmfdata-Python: PyHive ignores SET statements with a leading newline - https://phabricator.wikimedia.org/T334442 (xcollazo) > Alternatively, we could simply deprecate the Hive module altogether. Hive's SQL-on-MapReduce functionality is officially deprecated, and i...
[14:10:08] (CR) CDanis: "Looks good!! Just one naming nit" [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[14:16:36] Data-Engineering, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (Jclark-ctr) @elukey I was able to clear configurations
[14:21:12] (CR) Bearloga: [V: +2 C: +2] "Thanks! Works great!" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/907550 (https://phabricator.wikimedia.org/T334302) (owner: Xcollazo)
[14:36:36] (CR) Snwachukwu: Add referer_name field to pageview_hourly table in hive. (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/906073 (https://phabricator.wikimedia.org/T334120) (owner: Snwachukwu)
[14:37:12] (CR) Snwachukwu: [V: +2 C: +2] Add referer_name field to pageview_hourly table in hive. [analytics/refinery] - https://gerrit.wikimedia.org/r/906073 (https://phabricator.wikimedia.org/T334120) (owner: Snwachukwu)
[14:45:50] Data-Engineering, Product-Analytics (Kanban): Product Analytics ETL Migration: Pilot (MediaSearch ETLs) - https://phabricator.wikimedia.org/T333208 (mpopov)
[14:48:33] Data-Engineering, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster
[14:56:34] Data-Engineering, Product-Analytics (Kanban): Product Analytics ETL Migration: Pilot (MediaSearch ETLs) - https://phabricator.wikimedia.org/T333208 (mpopov) Created the repo https://gitlab.wikimedia.org/repos/product-analytics/data-pipelines Uploaded the update & create queries to a non-main branch to b...
[14:57:43] Data-Engineering, Product-Analytics (Kanban): Product Analytics ETL Migration: Pilot (MediaSearch ETLs) - https://phabricator.wikimedia.org/T333208 (mpopov)
[15:04:45] Data-Engineering, Product-Analytics (Kanban): Job Failed: product-analytics-movement-metrics - https://phabricator.wikimedia.org/T334302 (mpopov) a:mpopov Huge thanks to @xcollazo for identifying what's causing the error in the job and writing up a hotfix. And thanks to @nshahquinn-wmf for identifyin...
[15:04:57] Data-Engineering, Product-Analytics (Kanban): Job Failed: product-analytics-movement-metrics - https://phabricator.wikimedia.org/T334302 (mpopov) p:Triage→High
[15:05:48] Data-Engineering, Product-Analytics (Kanban): Job Failed: product-analytics-movement-metrics - https://phabricator.wikimedia.org/T334302 (mpopov) @Mayakp.wiki Can you please verify & sign off?
[15:27:31] !log Deployed refinery using scap, then deployed onto hdfs.
[15:27:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:37:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:40:54] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (LSobanski)
[15:47:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:32] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 11), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) > I would suggest to experiment with proper values in DSE first (the charts values.yaml suggests 512Mi fo...
[16:00:05] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (LSobanski) p:Triage→Medium
[16:00:17] (PS2) Jameel Kaisar: Created experimental/geoip/network_latency 1.0.0 schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417)
[16:03:26] Data-Engineering, Event-Platform Value Stream, EventStreams, Patch-For-Review: Include image/file changes in page-links-change - https://phabricator.wikimedia.org/T333497 (Ottomata) > opens to the door to other questions like whether the intent to also include templatelinks, categorylinks, etc....
[16:04:11] Data-Engineering-Planning, Event-Platform Value Stream: Define Service Level Objective (SLO) for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T333833 (Ottomata)
[16:06:10] Data-Engineering, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster completed: - an-worker1132 (**PASS...
[16:08:09] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Jelto)
[16:10:05] Hi elukey thanks for analytics1069.
[16:10:57] steve_munene: np :)
[16:11:08] I reimaged an-worker1132, all good but I see one disk less
[16:11:13] that is weird
[16:12:51] Could this be related to the foreign drives?
[16:14:15] Data-Engineering, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (elukey) Reimaged the node, but I still see 11 4TB disks and not 12. Mega cli shows 12 physical disks but only 11 VDs, so probably we'll need to fix it. I downtimed the...
[16:14:38] steve_munene: I left a comment, is it ok if I leave the task to you? There is some info in https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration
[16:14:55] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ottomata) > we should be forward-looking and make it an actor_type column instead +1 However, in a relevant conversation in T308017#8309324, @tstarling said: > I don't...
[16:15:03] it may be that one disk needs to be "visible" as Virtual Disk so the OS can use it
[16:15:16] the host is downtimed with puppet disabled and hdfs/yarn down
[16:17:54] (stepping afk, will read tomorrow! Have a nice rest of the day :)
[16:27:38] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 11): Q4 eventutilities-python should bundle java deps. - https://phabricator.wikimedia.org/T327251 (Ottomata) > I actually don't see how we'd include the jars in the classpath without injecting them at runtime. Does something in pyflink auto...
[16:28:02] !log deployed airflow analytics to fix network flows internal dags in deployment
[16:28:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:29:52] Data-Engineering, Machine-Learning-Team, Research, Event-Platform Value Stream (Sprint 11), Patch-For-Review: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (Ottomata) Okay, thanks all. No `prior_state` for now then. We ca...
[16:34:31] (CR) Ottomata: Add event schema for ML classification change on current page state (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/905965 (https://phabricator.wikimedia.org/T331401) (owner: AikoChou)
[16:37:23] Data-Engineering-Planning, Event-Platform Value Stream, serviceops-radar: mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for stream config in - https://phabricator.wikimedia.org/T333575 (Ottomata) From MW event enrichment? Very low. Really only on job startup. Other jobs i...
[16:39:59] PROBLEM - MegaRAID on an-worker1110 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:40:01] Data-Engineering-Planning, Event-Platform Value Stream, serviceops-radar: mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for stream config in - https://phabricator.wikimedia.org/T333575 (Ottomata) Hm, actually tho, mediawiki-page-content-change-enrichment does more than just...
[16:54:47] joal: hello :] I talked with the SRE team, and they said that network_flows_internal does not need sanitization and that we can remove the corresponding job. I created an MR, can you please review? :] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/359 Thank you!!
[16:56:02] Data-Engineering, Product-Analytics: Delete the leading question mark from uri_query in the webrequest table - https://phabricator.wikimedia.org/T334495 (nettrom_WMF)
[17:06:06] Data-Engineering, Product-Analytics, Wmfdata-Python: PyHive ignores SET statements with a leading newline - https://phabricator.wikimedia.org/T334442 (mpopov) p:Triage→Low
[17:07:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:17:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:17:59] Data-Engineering, Product-Analytics: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history - https://phabricator.wikimedia.org/T266374 (mpopov) p:Medium→Low
[17:20:31] (CR) Ottomata: Created experimental/geoip/network_latency 1.0.0 schema (10 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[17:21:00] (CR) Ottomata: "I added a bunch of comments, but then got confused at the end. Read the comment at the end first before responding to all the others :)" [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[17:37:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:47:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:01:09] Data-Engineering, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (leila)
[18:02:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:38:18] Data-Engineering, Event-Platform Value Stream, observability: Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (Ottomata)
[18:38:35] Data-Engineering, Event-Platform Value Stream, observability: Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (Ottomata) I have to run for an appointment, but I can try and look at this more tomorrow.
[18:53:06] Data-Engineering-Planning, Data Pipelines (Sprint 11): Airflow ArchiveOperator should have a number of retries of 0 - https://phabricator.wikimedia.org/T332216 (xcollazo) a:xcollazo
[19:05:34] Data-Engineering, Product-Analytics (Kanban): Job Failed: product-analytics-movement-metrics - https://phabricator.wikimedia.org/T334302 (Mayakp.wiki) Open→Resolved I QAed the movement metric tables on wmf-product and data for March 2023 has been updated. Thanks @xcollazo and @mpopov for all your...
[19:17:12] !log deployed airflow fix for pageview_hourly dag memory error
[19:17:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:17:46] !log Unpaused pageview_hourly airflow dag.
[19:17:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:20:28] Data-Engineering, Event-Platform Value Stream, EventStreams, Patch-For-Review: Include image/file changes in page-links-change - https://phabricator.wikimedia.org/T333497 (Isaac) > Can you say more about this? IIUC, these are different kinds of links, yes? The page and image links are similar as...
[19:32:07] (PS1) Bartosz Dziewoński: Add user_is_temp field to Editing team schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/907960 (https://phabricator.wikimedia.org/T332437)
[19:37:07] (PS2) Bartosz Dziewoński: Update Editing team schemas for IP masking [schemas/event/secondary] - https://gerrit.wikimedia.org/r/907960 (https://phabricator.wikimedia.org/T332437)
[19:53:49] (CR) Bartosz Dziewoński: "I haven't done this before, please double-check." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/907960 (https://phabricator.wikimedia.org/T332437) (owner: Bartosz Dziewoński)
[20:07:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:17:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:22:17] !log deployed airflow analytics to remove network flows sanitization dag
[20:22:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:37:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:47:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:07:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:28] Data-Engineering, Discovery-Search: Determine which team should own airflow1005/update contact info - https://phabricator.wikimedia.org/T334522 (bking)
[21:17:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:22:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:32:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:51:22] Data-Engineering, Discovery-Search, Patch-For-Review: Determine which team should own airflow1005/update contact info - https://phabricator.wikimedia.org/T334522 (bking) OK, I've confirmed that my team is fine with this change. Sending a Puppet patch now.
[22:09:07] (PS3) Jameel Kaisar: Created experimental/geoip/network_latency 1.0.0 schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417)
[22:09:57] (CR) Jameel Kaisar: Created experimental/geoip/network_latency 1.0.0 schema (11 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[22:11:07] (PS4) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417)
[22:53:11] PROBLEM - eventlogging Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:53:11] PROBLEM - Webrequests Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:53:11] PROBLEM - statsv Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:54:51] RECOVERY - eventlogging Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:54:51] RECOVERY - Webrequests Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:54:51] RECOVERY - statsv Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:01:14] Data-Engineering, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (leila)
[23:03:19] Data-Engineering, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (leila) removed #Research tag, added T334511 as a subtask for us to take care of one item we should help you all with. I'm coor...
[23:04:35] PROBLEM - eventlogging Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:04:37] PROBLEM - Webrequests Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:04:37] PROBLEM - statsv Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:25:41] RECOVERY - eventlogging Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:25:41] RECOVERY - statsv Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:25:41] RECOVERY - Webrequests Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[23:37:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:47:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed