[01:47:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:49:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:02:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:08] PROBLEM - MegaRAID on an-worker1112 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:32:40] RECOVERY - MegaRAID on an-worker1112 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:35:46] PROBLEM - MegaRAID on an-worker1112 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[03:56:46] RECOVERY - MegaRAID on an-worker1112 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:51:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:01:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:20:47] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Nikerabbit) Checking the usage of ks-deva: * https://codesearch.wmcloud.org/search/?q=.&files=ks-deva&excludeFiles=&repos= * http...
[07:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[08:21:32] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (MaryMunyoki)
[08:21:57] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (MaryMunyoki)
[08:54:46] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Nikerabbit)
[09:26:42] Data-Platform-SRE, Wikidata-Query-Service: Align throttling configuration naming for WDQC / WCQS - https://phabricator.wikimedia.org/T344413 (Gehel)
[09:28:49] Data-Platform-SRE, sre-alert-triage: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (gmodena) That’s a false positive, we don’t have active traffic in codfw. There was WIP to fix it before I went on PTO but I guess i...
[09:29:00] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1106.eqiad.wmnet with OS bullseye
[09:29:05] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1107.eqiad.wmnet with OS bullseye
[09:29:38] !log deploying airflow-analytics
[09:29:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:35:28] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) These are the values added for our initial idp test and a brief explanation on each. ` - name: AUTH_OIDC_ENABLED val...
[09:59:41] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) I have a feeling that for `AUTH_OIDC_USER_NAME_CLAIM` we may want to use `cn`.
[10:09:15] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1107.eqiad.wmnet with OS bullseye executed with errors: - an-worker1107 (**FAIL**) - Downtimed on Icinga...
[10:10:05] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1106.eqiad.wmnet with OS bullseye completed: - an-worker1106 (**PASS**) - Downtimed on Icinga/Alertmanag...
[10:10:41] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1107.eqiad.wmnet with OS bullseye
[10:19:17] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) Just adding a bit more context >>! In T305874#9098561, @BTullis wrote: > I have a feeling that for `AUTH_OIDC_USER_NAME_CLAIM` we ma...
[10:32:30] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) That's really helpful. Thanks @jbond > preferred_username maps to uid Great, so we can keep with the default of `preferred_usernam...
[10:49:44] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1107.eqiad.wmnet with OS bullseye completed: - an-worker1107 (**PASS**) - Removed from Puppet and Puppet...
[10:53:53] Data-Engineering, Data-Platform-SRE: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (BTullis) @nshahquinn-wmf - Could you try to log in again now please? I have imported your user account, added you to the necessary groups and transferred ownership of all of the assets...
[11:12:06] Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis)
[11:23:47] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) >>! In T305874#9098663, @BTullis wrote: > Great, so we can keep with the default of `preferred_username` and this will map to the `ui...
[11:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[12:46:09] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) > Initiate a connection to staging datahub front-end via ssh -N -L 30443:k8s-ingress-staging.svc.eqiad.wmnet:30443 deploy1002.eqiad.w...
[13:06:56] (CR) Btullis: [V: +2 C: +2] Add blk.wiktionary to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/945023 (https://phabricator.wikimedia.org/T343542) (owner: Gerrit maintenance bot)
[13:07:22] (CR) Btullis: [V: +2 C: +2] Add su.wikisource to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/945025 (https://phabricator.wikimedia.org/T343548) (owner: Gerrit maintenance bot)
[13:21:57] (CR) Milimetric: [C: +2] Update aqs scap targets with new hosts and tidy up [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/947385 (https://phabricator.wikimedia.org/T342213) (owner: Btullis)
[13:23:33] btullis: thanks so much for that knowledge gaps endpoint fix, hope the deploy worked for you
[13:33:10] (PS4) Peter Fischer: cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[13:35:29] (CR) Peter Fischer: [C: +1] cirrussearch: add fetch_failure schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[13:38:45] milimetric: Thanks. Yes I think it was fine in the end. It took a few goes. There was a scary moment when I tried to deploy it without a scap environment: https://wikitech.wikimedia.org/w/index.php?title=Data_Engineering/Systems/AQS&diff=prev&oldid=2100312
[13:40:35] It seemed to pick up a default list of targets somewhere, including lots of mw* and mwmaint* servers. no harm done though, just some slightly scary error messages.
[13:42:20] Data-Engineering, Data-Platform-SRE: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (BTullis) Tentatively moving to Done whilst we await confirmation of whether or not it has worked.
[13:43:48] You are a brave man
[13:44:23] <3 I think sometimes there is another word for it :-)
[13:58:31] Data-Platform-SRE, Data-Services, cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (BTullis) @Kizule - I believe that this is fixed now. ` btullis@tools-sgebastion-10:~$ sql ptwikisource Reading table information for completion of table and column...
[14:02:18] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) >>! In T305874#9099070, @jbond wrote: > At this point you are already hijacking all 443 traffic so it might be easier to use `CAP_N...
[14:04:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:15:21] Data-Engineering, Web-Team-Backlog: Deal with minified scripts in JS error logging - https://phabricator.wikimedia.org/T520 (MSantos) @Tgr is right, I'll add the suggested teams for triage and remove CTT. Let me know if you have any questions or concerns.
[14:46:18] Data-Platform-SRE: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (BTullis) Resolved→Open Reopening this ticket, as it's still happening.
[14:47:30] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bo...
[14:49:31] !log failing over hive to an-coord1002 to permit restart of hive on an-coord1001
[14:49:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:53:03] Analytics, Data-Engineering-Icebox, Community-consensus-needed: Decide whether enable per-editor edits stats (community decision) - https://phabricator.wikimedia.org/T203826 (Multichill) Open→Declined I'm being bold here: I don't think we have community consensus for this and I see no effort...
[14:59:03] !log restarting hive-server2 and hive-metastore services on an-coord1001 after failover.
[14:59:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:04:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:09:39] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) From the [] Create WMDE airflow admin group review, the `aiflow-wmde-admins` group requires a system user in order to perform the "admin tasks" for the airflow instanc...
[15:12:28] !log failing hive back to an-coord1001 following maintenance
[15:12:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:46:23] Data-Engineering, Data-Engineering-Wikistats, Data Products, Data Pipelines (Sprint 12): Non-mobile UAs on mobile (2g/gprs, etc) IP-blocks - https://phabricator.wikimedia.org/T58628 (VirginiaPoundstone) Open→Stalled @MarkAHershberger is it fair to assume, give this unfortunate time lag of...
[15:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:49:32] Data-Platform-SRE, Data-Persistence, SRE-swift-storage, Discovery-Search (Current work), Patch-For-Review: Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (Gehel)
[16:02:45] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookwo...
[16:07:38] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Install jupyterhub separately from conda-analytics - https://phabricator.wikimedia.org/T321512 (BTullis)
[16:47:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:28] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2003.codfw.wmnet with OS bo...
[16:54:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:48:22] Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) Oh, this is a little less elegant than I first thought. The reason being that we use `AUTH_REMOTE_USER` and have it enabled with CAS via an Apache proxy. Here's a link to [[htt...
[18:02:44] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:07:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:08:34] Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) I think I'm inclined to go with option 1 for now. It's the quicker and easier option. Ideally, I'd like to get us to [[https://preset.io/blog/superset-2-1-release-notes/|Super...
[18:12:41] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2003.codfw.wmnet with OS bookwo...
[18:16:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:44] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:43:48] Data-Engineering, Data-Platform-SRE: nshahquinn-wmf cannot edit in DataHub - https://phabricator.wikimedia.org/T344212 (nshahquinn-wmf) Open→Resolved Yes, I can edit now! Thanks @BTullis 😁
[19:28:03] Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (RKemper) Some observations from last two patches, tested on `wdqs2007` before reverting due to issues: - Discrepancy between location of `wdqs-blazegraph` unit files: ` wdqs2007: (bullseye, wdqs-public, ru...
[19:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:30:55] Data-Platform-SRE, Discovery-Search: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (bking)
[20:31:24] Data-Platform-SRE, Discovery-Search: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (bking)
[20:35:39] Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (bking) ^^ Above changes were one-offs I made to troubleshoot. Sorry for the confusion!
[20:40:17] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye
[20:40:25] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye
[20:44:42] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors: - wdqs1010 (**FAIL**) - Downtimed on Icin...
[20:49:27] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-...
[20:54:27] Data-Platform-SRE, Discovery-Search: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (bking)
[21:27:33] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (EBernhardson) Airflow instance has been updated. I manually changed the permissions of the exis...
[21:34:40] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye completed: - wdqs1011 (**WARN**) - Downtimed on Icinga/Alertman...
[23:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability