[00:48:30] Data-Engineering, Data-Engineering-Dashiki, Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (Krinkle) >>! In T342267#9030197, @Milimetric wrote: > We can join to this and choose to do pretty much anything we...
[00:49:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:58:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[02:58:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[04:49:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:52] Data-Platform-SRE, DBA, cloud-services-team, Patch-For-Review: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (Marostegui)
[06:52:21] Data-Platform-SRE, DBA, cloud-services-team, Patch-For-Review: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (Marostegui) clouddb1021 has been upgraded to 10.6. I will keep a close eye, but if you notice something weird or complaints about some...
[06:58:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[06:58:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:44:36] Data-Engineering, Patch-For-Review, Product-Analytics (Kanban): Add client_dt to EditAttemptStep allowlist - https://phabricator.wikimedia.org/T341888 (KCVelaga_WMF)
[08:21:33] (PS2) DCausse: Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565)
[08:27:14] (CR) DCausse: "Thanks for the review!" [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[08:29:44] (SystemdUnitFailed) resolved: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:00] Data-Engineering, Data-Platform-SRE, Movement-Insights, Product-Analytics, Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (BTullis) >>! In T288983#9043068, @Mayakp.wiki wrote: > @BTullis Currently I am knee deep in OKR SDS1.1 work a...
[10:01:52] (PS8) Peter Fischer: Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202)
[10:58:43] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[10:58:43] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:51:27] Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (RobH)
[11:51:42] Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (RobH)
[12:07:01] Data-Engineering, Data Engineering and Event Platform Team, Discovery-Search (Current work), Event-Platform, and 2 others: Add support for redirects in CirrusSearch - https://phabricator.wikimedia.org/T325315 (CodeReviewBot) pfischer merged https://gitlab.wikimedia.org/repos/search-platform/cirru...
[12:19:44] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:44] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:54:44] Data-Platform-SRE, sre-alert-triage: Alert triage - https://phabricator.wikimedia.org/T342247 (fgiunchedi)
[13:12:18] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (LSobanski)
[13:12:52] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (LSobanski)
[13:18:53] Data-Platform-SRE, sre-alert-triage: Alert triage - https://phabricator.wikimedia.org/T342247 (BTullis) That's interesting. I don't know why the puppet run alerts are still unknown in icinga for these four analytics hosts. {F37150542} https://alerts.wikimedia.org/?q=alertname%3Dpuppet%20last%20run&q=team...
[13:22:44] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (BTullis) Tagging @bking and @RKemper in case they're unaware of this.
[13:24:38] Data-Platform-SRE, sre-alert-triage: Alert triage - https://phabricator.wikimedia.org/T342247 (BTullis) A manual run on analytics1070 as the nagios user reports success. ` nagios@analytics1070:/home/btullis$ /usr/bin/sudo /usr/local/lib/nagios/plugins/check_puppetrun -w 10800 -c 21600 OK: Puppet is curre...
[14:11:01] (PS1) Milimetric: Update spark2 references to spark3 [analytics/refinery] - https://gerrit.wikimedia.org/r/941939
[14:11:58] (CR) Milimetric: [V: +2 C: +2] Update spark2 references to spark3 [analytics/refinery] - https://gerrit.wikimedia.org/r/941939 (owner: Milimetric)
[14:16:35] Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (CodeReviewBot) btullis opened https://gitlab.wiki...
[14:31:37] Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (BTullis) I have now prepared a merge request to e...
[14:52:58] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) I created two new crush rules to use with any replicated pools. ` btullis@cephosd1001:~$ sudo ceph osd crush rule create-replicated hdd default host hdd btullis@cephosd1001:~$ sudo cep...
[15:03:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:03:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:45:01] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, Epic: [EPIC] Deprecate mw.eventLog.logEvent() - https://phabricator.wikimedia.org/T317874 (phuedx)
[15:49:40] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (MatthewVernon) >>! In T326945#9042137, @BTullis wrote: >>>! In T326945#9041443, @MatthewVernon wrote: >> [are you intending the RGW service to be general-purpose?] > > Good question. Yes, I th...
[15:59:00] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) As per the instructions here: https://docs.ceph.com/en/quincy/rados/operations/pools/#creating-a-pool ...I have now created four pools with the following commands: ` btullis@cephosd100...
[16:13:42] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) >>! In T326945#9045226, @MatthewVernon wrote: >>>! In T326945#9042137, @BTullis wrote: >>>>! In T326945#9041443, @MatthewVernon wrote: > That's interesting; we are expecting to have en...
[16:26:33] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, Epic: [EPIC] Deprecate mw.eventLog.logEvent() - https://phabricator.wikimedia.org/T317874 (phuedx)
[16:52:21] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) I created two test volumes, one on the HDDs and the other on the SSDs. Data is stored on the erasure coded pool. Metadata for both was on replicated pool on the same medium. ` btullis@...
[17:07:01] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, and 4 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (phuedx)
[17:07:46] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, and 4 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (phuedx)
[17:11:09] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, and 4 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (phuedx)
[18:35:54] (PS1) David Martin: Create a wiki list for Wikifunctions' call to sqoop-mediawiki-tables [analytics/refinery] - https://gerrit.wikimedia.org/r/941985 (https://phabricator.wikimedia.org/T342199)
[19:03:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:03:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[21:24:49] Data-Engineering, Data-Platform-SRE, Movement-Insights, Product-Analytics: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (Mayakp.wiki) > Maybe we can find some time to come back to review the permissions issues you're still facing and the...
[21:42:01] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (bking) Posting info from today's pairing session w @RKemper: Known-good host: ` bking@wcqs2002:~$ curl -kIL https://localhost/readiness-probe HTTP/1.1 200 OK server: nginx/1.14.2 date...
[22:07:55] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (bking) Troubleshooting steps taken so far on wcqs2001: - Restarted envoy, nginx, and wcqs-blazegraph . - Rebooted the host - Verified that Blazegraph WebUI is up and responding on...
[23:03:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:03:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability