[01:47:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:49:28] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:02:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:08] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1112 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:32:40] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1112 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:35:46] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1112 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:48:39] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:48:40] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[03:56:46] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1112 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:51:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:01:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:20:47] <wikibugs>	 Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Nikerabbit) Checking the usage of ks-deva: * https://codesearch.wmcloud.org/search/?q=.&files=ks-deva&excludeFiles=&repos= * http...
[07:48:39] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:48:40] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[08:21:32] <wikibugs>	 Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (MaryMunyoki)
[08:21:57] <wikibugs>	 Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (MaryMunyoki)
[08:54:46] <wikibugs>	 Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Nikerabbit)
[09:26:42] <wikibugs>	 Data-Platform-SRE, Wikidata-Query-Service: Align throttling configuration naming for WDQC / WCQS - https://phabricator.wikimedia.org/T344413 (Gehel)
[09:28:49] <wikibugs>	 Data-Platform-SRE, sre-alert-triage: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (gmodena) That’s a false positive, we don’t have active traffic in codfw. There was WIP to fix it before I went on PTO but I guess i...
[09:29:00] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1106.eqiad.wmnet with OS bullseye
[09:29:05] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1107.eqiad.wmnet with OS bullseye
[09:29:38] <btullis>	 !log deploying airflow-analytics
[09:29:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:35:28] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) These are the values added for our initial idp test and a brief explanation on each.  `     - name: AUTH_OIDC_ENABLED       val...
[09:59:41] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) I have a feeling that for `AUTH_OIDC_USER_NAME_CLAIM` we may want to use `cn`.
[10:09:15] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1107.eqiad.wmnet with OS bullseye executed with errors: - an-worker1107 (**FAIL**)   - Downtimed on Icinga...
[10:10:05] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1106.eqiad.wmnet with OS bullseye completed: - an-worker1106 (**PASS**)   - Downtimed on Icinga/Alertmanag...
[10:10:41] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1107.eqiad.wmnet with OS bullseye
[10:19:17] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) Just adding a bit more context  >>! In T305874#9098561, @BTullis wrote: > I have a feeling that for `AUTH_OIDC_USER_NAME_CLAIM` we ma...
[10:32:30] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) That's really helpful. Thanks @jbond  > preferred_username maps to uid Great, so we can keep with the default of `preferred_usernam...
[10:49:44] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1107.eqiad.wmnet with OS bullseye completed: - an-worker1107 (**PASS**)   - Removed from Puppet and Puppet...
[10:53:53] <wikibugs>	 Data-Engineering, Data-Platform-SRE: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (BTullis) @nshahquinn-wmf - Could you try to log in again now please? I have imported your user account, added you to the necessary groups and transferred ownership of all of the assets...
[11:12:06] <wikibugs>	 Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis)
[11:23:47] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) >>! In T305874#9098663, @BTullis wrote: > Great, so we can keep with the default of `preferred_username` and this will map to the `ui...
[11:48:40] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:48:40] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[12:46:09] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) > Initiate a connection to staging datahub front-end via ssh -N -L 30443:k8s-ingress-staging.svc.eqiad.wmnet:30443 deploy1002.eqiad.w...
[13:06:56] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Add blk.wiktionary to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/945023 (https://phabricator.wikimedia.org/T343542) (owner: Gerrit maintenance bot)
[13:07:22] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Add su.wikisource to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/945025 (https://phabricator.wikimedia.org/T343548) (owner: Gerrit maintenance bot)
[13:21:57] <wikibugs>	 (CR) Milimetric: [C: +2] Update aqs scap targets with new hosts and tidy up [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/947385 (https://phabricator.wikimedia.org/T342213) (owner: Btullis)
[13:23:33] <milimetric>	 btullis: thanks so much for that knowledge gaps endpoint fix, hope the deploy worked for you
[13:33:10] <wikibugs>	 (PS4) Peter Fischer: cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[13:35:29] <wikibugs>	 (CR) Peter Fischer: [C: +1] cirrussearch: add fetch_failure schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[13:38:45] <btullis>	 milimetric: Thanks. Yes I think it was fine in the end. It took a few goes. There was a scary moment when I tried to deploy it without a scap environment: https://wikitech.wikimedia.org/w/index.php?title=Data_Engineering/Systems/AQS&diff=prev&oldid=2100312
[13:40:35] <btullis>	 It seemed to pick up a default list of targets somewhere, including lots of mw* and mwmaint* servers. no harm done though, just some slightly scary error messages.
[13:42:20] <wikibugs>	 Data-Engineering, Data-Platform-SRE: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (BTullis) Tentatively moving to Done whilst we await confirmation of whether or not it has worked.
[13:43:48] <milimetric>	 You are a brave man
[13:44:23] <btullis>	 <3 I think sometimes there is another word for it :-)
[13:58:31] <wikibugs>	 Data-Platform-SRE, Data-Services, cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (BTullis) @Kizule - I believe that this is fixed now. ` btullis@tools-sgebastion-10:~$ sql ptwikisource Reading table information for completion of table and column...
[14:02:18] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) >>! In T305874#9099070, @jbond wrote: > At this point you are already hijacking all 443 traffic so it might be easier to use `CAP_N...
[14:04:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:15:21] <wikibugs>	 Data-Engineering, Web-Team-Backlog: Deal with minified scripts in JS error logging - https://phabricator.wikimedia.org/T520 (MSantos) @Tgr is right, I'll add the suggested teams for triage and remove CTT. Let me know if you have any questions or concerns.
[14:46:18] <wikibugs>	 Data-Platform-SRE: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (BTullis) Resolved→Open Reopening this ticket, as it's still happening.
[14:47:30] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bo...
[14:49:31] <btullis>	 !log failing over hive to an-coord1002 to permit restart of hive on an-coord1001
[14:49:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:53:03] <wikibugs>	 Analytics, Data-Engineering-Icebox, Community-consensus-needed: Decide whether enable per-editor edits stats (community decision) - https://phabricator.wikimedia.org/T203826 (Multichill) Open→Declined I'm being bold here: I don't think we have community consensus for this and I see no effort...
[14:59:03] <btullis>	 !log restarting hive-server2 and hive-metastore services on an-coord1001 after failover.
[14:59:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:04:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:09:39] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) From the  [] Create WMDE airflow admin group review, the `aiflow-wmde-admins` group requires a system user in order to perform the "admin tasks" for the airflow instanc...
[15:12:28] <btullis>	 !log failing hive back to an-coord1001 following maintenance
[15:12:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:46:23] <wikibugs>	 Data-Engineering, Data-Engineering-Wikistats, Data Products, Data Pipelines (Sprint 12): Non-mobile UAs on mobile (2g/gprs, etc) IP-blocks - https://phabricator.wikimedia.org/T58628 (VirginiaPoundstone) Open→Stalled @MarkAHershberger is it fair to assume, given this unfortunate time lag of...
[15:48:40] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:48:40] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:49:32] <wikibugs>	 Data-Platform-SRE, Data-Persistence, SRE-swift-storage, Discovery-Search (Current work), Patch-For-Review: Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (Gehel)
[16:02:45] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookwo...
[16:07:38] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Install jupyterhub separately from conda-analytics - https://phabricator.wikimedia.org/T321512 (BTullis)
[16:47:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:28] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2003.codfw.wmnet with OS bo...
[16:54:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:01:01] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:48:22] <wikibugs>	 Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) Oh, this is a little less elegant than I first thought. The reason being that we use `AUTH_REMOTE_USER` and have it enabled with CAS via an Apache proxy. Here's a link to [[htt...
[18:02:44] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:07:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:08:34] <wikibugs>	 Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) I think I'm inclined to go with option 1 for now. It's the quicker and easier option.  Ideally, I'd like to get us to [[https://preset.io/blog/superset-2-1-release-notes/|Super...
[18:12:41] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2003.codfw.wmnet with OS bookwo...
[18:16:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:44] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:43:48] <wikibugs>	 Data-Engineering, Data-Platform-SRE: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (nshahquinn-wmf) Open→Resolved Yes, I can edit now! Thanks @BTullis 😁
[19:28:03] <wikibugs>	 Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (RKemper) Some observations from last two patches, tested on `wdqs2007` before reverting due to issues:  - Discrepancy between location of `wdqs-blazegraph` unit files:  ` wdqs2007: (bullseye, wdqs-public, ru...
[19:48:40] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:48:40] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:30:55] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (bking)
[20:31:24] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (bking)
[20:35:39] <wikibugs>	 Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (bking) ^^ Above changes were one-offs I made to troubleshoot. Sorry for the confusion!
[20:40:17] <wikibugs>	 Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye
[20:40:25] <wikibugs>	 Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye
[20:44:42] <wikibugs>	 Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors: - wdqs1010 (**FAIL**)   - Downtimed on Icin...
[20:49:27] <wikibugs>	 Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-...
[20:54:27] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (bking)
[21:27:33] <wikibugs>	 Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (EBernhardson) Airflow instance has been updated. I manually changed the permissions of the exis...
[21:34:40] <wikibugs>	 Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye completed: - wdqs1011 (**WARN**)   - Downtimed on Icinga/Alertman...
[23:48:40] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:48:40] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability