[00:19:14] (SystemdUnitFailed) firing: (6) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:21:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:14] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:19:14] (SystemdUnitFailed) firing: (8) monitor_refine_event_sanitized_main_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:25:27] (SystemdUnitFailed) firing: (9) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:25:27] (SystemdUnitFailed) firing: (9) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:14] (SystemdUnitFailed) firing: (13) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:17:16] 10Data-Engineering: Check home/HDFS leftovers of daniram - https://phabricator.wikimedia.org/T355108 (10MoritzMuehlenhoff) [09:19:31] (03PS8) 10Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) [09:29:21] (03CR) 10CI reject: [V: 04-1] refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) (owner: 10Gmodena) [10:21:28] (03PS4) 10Cyndywikime: Add user_is_temp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) [10:27:45] (03CR) 10Joal: [V: 03+2 C: 03+2] "LGTM! Merging" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/972348 (owner: 10Phuedx) [10:29:49] (03PS9) 10Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) [10:34:41] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Platform, 10Discovery-Search, 10Release-Engineering-Team: SonarQube build are failing with Java 11 - https://phabricator.wikimedia.org/T355122 (10Gehel) [10:35:48] (PuppetFailure) firing: Puppet has failed on stat1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:38:49] (03CR) 10Joal: [C: 03+1] "Thanks Thomas :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: 10TChin) [10:41:02] (03CR) 10CI reject: [V: 04-1] refinery: log data quality alert severity [analytics/refinery/source] - 
10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) (owner: 10Gmodena) [10:45:48] (PuppetFailure) firing: (4) Puppet has failed on stat1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:47:47] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Products Sprint 07): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (10phuedx) [10:50:48] (PuppetFailure) firing: (5) Puppet has failed on stat1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:55:48] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:25:37] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (10BTullis) I deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/709713 so now the presto cluster is using PKI certificates. All of the servers are also... [11:27:37] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Platform, 10Discovery-Search, and 2 others: SonarQube build are failing with Java 11 - https://phabricator.wikimedia.org/T355122 (10hashar) [11:30:01] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Platform, 10Discovery-Search, and 2 others: SonarQube build are failing with Java 11 - https://phabricator.wikimedia.org/T355122 (10hashar) I though I got them all after @pwangai filed {T349983} but that only covered the job using the releng/sonar-scanner image. 
[11:37:20] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Platform, 10Discovery-Search, and 2 others: SonarQube build are failing with Java 11 - https://phabricator.wikimedia.org/T355122 (10hashar) 05Open→03Resolved a:03dcausse I have updated all the jobs. Please reopen / poke me on IRC if adjustments are need... [11:46:52] (03CR) 10Gmodena: [V: 03+2 C: 03+2] "Got a +1 from Joseph to merge." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) (owner: 10Gmodena) [11:47:33] PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:03] ^ expected [11:49:05] (KafkaReplicationFactorTooLow) firing: (2) Kafka topic codfw.mediawiki.job.mediaModerationScanFileJob replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [11:54:05] (KafkaReplicationFactorTooLow) resolved: (2) Kafka topic codfw.mediawiki.job.mediaModerationScanFileJob replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [11:56:37] RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:48] 10Data-Engineering, 10Metrics Platform Backlog: Make jsonschema-tools merge values of enums when merging allOf - https://phabricator.wikimedia.org/T345317 (10phuedx) [12:04:14] (03CR) 10Joal: [V: 03+2 C: 03+2] refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) (owner: 10Gmodena) 
[12:17:47] PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:50] (03PS9) 10Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) [12:21:35] (03PS10) 10Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) [12:24:03] which phabricator tag should I add to tasks that need reviews of proposed wiki replica view changes? like https://phabricator.wikimedia.org/T344108 [12:27:57] taavi: I /think/ that it's #data-platform from where it will get triaged, depending on whether it needs a security review from #data-engineering, or whether it goes straight to #data-platform-sre for deployment. But I confess I'm not 100% sure. [12:29:52] (03CR) 10CI reject: [V: 04-1] refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) (owner: 10Gmodena) [12:34:21] RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:40] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Platform, 10Discovery-Search, 10Release-Engineering-Team: SonarQube build are failing with Java 11 - https://phabricator.wikimedia.org/T355122 (10gmodena) Java CI jobs started failing with ` 13:29:44 [ERROR] Failed to execute goal on project refinery-spark:... 
[12:38:51] 10Data-Engineering, 10Data-Services, 10cloud-services-team, 10Patch-For-Review: Add global_edit_count to wikireplicas - https://phabricator.wikimedia.org/T344108 (10taavi) AIUI currently Data Engineering reviews the view changes to ensure the data is ok to publish and then WMCS (or Data Platform?) SREs dep... [12:39:36] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (10brouberol) I was thinking about the chart migration... [12:40:00] btullis: thanks, I tagged it #data-platform. let's use this task to figure out how this works and then document it on https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Deploy_wiki_replicas_view_change? [13:02:47] 10Data-Engineering (Sprint 7), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10CodeReviewBot) joal merged htt... [13:02:59] 10Data-Engineering (Sprint 7), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10CodeReviewBot) joal merged htt... [13:03:54] 10Data-Engineering (Sprint 7), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10CodeReviewBot) joal merged htt... 
[13:09:14] (SystemdUnitFailed) firing: (13) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:10] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Platform, 10Discovery-Search, 10Release-Engineering-Team: SonarQube build are failing with Java 11 - https://phabricator.wikimedia.org/T355122 (10dcausse) tools.jar should only be in jdk8, I'm surprised that this problem did not occur while sonar was runnin... [13:24:17] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Platform, 10Release-Engineering-Team, 10Discovery-Search (Current work): SonarQube build are failing with Java 11 - https://phabricator.wikimedia.org/T355122 (10Gehel) [13:31:41] PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:11] RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:34] (03PS1) 10DCausse: Switch to jdk17 for sonar [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/991010 (https://phabricator.wikimedia.org/T355122) [13:35:47] (03CR) 10Gmodena: [C: 03+1] Switch to jdk17 for sonar [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/991010 (https://phabricator.wikimedia.org/T355122) (owner: 10DCausse) [13:48:02] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10BTullis) I've now deployed the change so that Presto is using PKI certificates and each node has a keytab containing two principals.... 
[14:18:46] 10Data-Engineering (Sprint 7), 10Patch-For-Review: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (10CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/582 Feed Druid unique de... [14:22:05] PROBLEM - Presto Server on an-coord1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [14:22:40] (03CR) 10Gmodena: [V: 03+2] refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) (owner: 10Gmodena) [14:23:01] PROBLEM - Check systemd state on an-coord1004 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:48] (03CR) 10Gmodena: [V: 03+2 C: 03+2] "I reviewed and got a +1 from Joseph." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: 10Gmodena) [14:27:11] 10Data-Engineering (Sprint 7), 10Patch-For-Review: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (10CodeReviewBot) aqu closed https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/564 Unique devices druid... 
[14:38:49] (03PS2) 10Aqu: Unique devices druid ingestion job - Iceberg migration [analytics/refinery] - 10https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879) [14:39:49] (03CR) 10Gmodena: [C: 03+2] Switch to jdk17 for sonar [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/991010 (https://phabricator.wikimedia.org/T355122) (owner: 10DCausse) [14:39:58] 10Data-Engineering, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Fabfur) To add some information about having sequences in logs, I'd like to elaborate a bit more about our current options: 1. Let Varnish... [14:48:48] (03CR) 10Gmodena: [V: 03+2 C: 03+2] Switch to jdk17 for sonar [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/991010 (https://phabricator.wikimedia.org/T355122) (owner: 10DCausse) [14:55:49] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:00:53] RECOVERY - Check systemd state on an-coord1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:27] RECOVERY - Presto Server on an-coord1004 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [15:10:25] 10Data-Engineering, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10brouberol) Naive question for @Fabfur: could Varnish generate a request ID that is a [[ https://idtools.co/uuid/v7 | UUID v7 ]]? It's a uniqu... 
[15:16:14] do any of you know anything about the "wikimedia data portal" that https://toolsadmin.wikimedia.org/tools/membership/status/1632 mentions? [15:19:12] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Wikidata-Query-Service: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10Gehel) [15:19:41] 10Data-Platform-SRE: Migrate search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10Gehel) [15:20:08] 10Data-Platform-SRE, 10Prod-Kubernetes, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (10Gehel) [15:20:53] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, 10Sustainability (Incident Followup): Automatically depool wdqs servers that are "lagged" - https://phabricator.wikimedia.org/T270614 (10Gehel) [15:21:03] 10Data-Platform-SRE, 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (10Gehel) [15:21:13] 10Data-Platform-SRE, 10Elasticsearch, 10Sustainability (Incident Followup): Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 (10Gehel) [15:21:32] 10Data-Platform-SRE, 10Discovery-Search, 10Elasticsearch, 10Sustainability (Incident Followup): Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 (10Gehel) [15:23:42] 10Data-Platform-SRE, 10Discovery-Search: Create SLI/SLO for search index inconsistencies - https://phabricator.wikimedia.org/T349481 (10Gehel) [15:24:15] 10Data-Platform-SRE, 10Discovery-Search: Create SLI/SLO for search index inconsistencies - https://phabricator.wikimedia.org/T349481 (10Gehel) p:05Triage→03High [15:24:54] 10Data-Platform-SRE: Investigate performance 
differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (10Gehel) [15:29:14] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10BTullis) Oh, this is going to be harder to achieve than I first thought. I have gone back to this page: http://prestodb.io/blog/202... [15:36:15] 10Data-Platform-SRE: Review the use of scap + git-fat for Data Platform Engineering use cases - https://phabricator.wikimedia.org/T354936 (10Gehel) p:05Triage→03Medium [15:39:02] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10ops-eqiad: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10BTullis) a:05BTullis→03Jclark-ctr [15:39:50] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10ops-eqiad: Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis) a:05BTullis→03Jclark-ctr [15:40:03] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) 05Open→03Resolved [15:40:05] 10Data-Platform-SRE: Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (10BTullis) [15:40:08] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) [15:41:05] btullis: should we close T332573 ? Or is there something more to do? 
[15:41:06] T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 [15:41:43] Oh, looks like you just did :) [15:42:24] (03PS1) 10Gmodena: Update changelog for v0.2.28 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/991033 [15:46:29] !log releasing and deploying refinery source v0.2.28 [15:46:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:48:00] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10ABran-WMF) [15:55:34] (03CR) 10Gmodena: [C: 03+2] Update changelog for v0.2.28 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/991033 (owner: 10Gmodena) [15:56:19] (03CR) 10Gmodena: [V: 03+2 C: 03+2] Update changelog for v0.2.28 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/991033 (owner: 10Gmodena) [15:58:27] Starting build #133 for job analytics-refinery-maven-release-docker [16:03:49] (03CR) 10Ottomata: Add user_is_temp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [16:07:24] (03CR) 10Ottomata: [C: 03+1] Add user_is_temp property (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [16:13:46] Project analytics-refinery-maven-release-docker build #133: 09SUCCESS in 15 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/133/ [16:27:33] Starting build #93 for job analytics-refinery-update-jars-docker [16:27:57] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.28 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/991068 [16:27:58] Project analytics-refinery-update-jars-docker build #93: 09SUCCESS in 24 sec: 
https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/93/ [16:31:04] (03CR) 10Gmodena: [C: 03+2] Add refinery-source jars for v0.2.28 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/991068 (owner: 10Maven-release-user) [16:31:39] (03CR) 10Gmodena: [V: 03+2 C: 03+2] Add refinery-source jars for v0.2.28 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/991068 (owner: 10Maven-release-user) [16:35:02] !log Deployed refinery-source v0.2.28 using jenkins. Jars are on archiva. [16:35:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:36:54] !log starting refinery deployment using scap [16:36:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:46:48] btullis joal a git pull in /srv/deployment/analytics/refinery failed with [16:46:50] fatal: bad object fc850230e09422a438d10184d8f45baa9ac2e2d0 [16:46:50] error: https://gerrit.wikimedia.org/r/p/analytics/refinery.git did not send all necessary objects [16:47:14] does this ring a bell? [16:47:40] I ran the pull from deploy2002 [16:52:17] (KafkaReplicationFactorTooLow) firing: (328) Kafka topic DataHubUsageEvent_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [16:54:00] btullis joal I can't checkout the repo with git clone "ssh://gmodena@gerrit.wikimedia.org:29418/analytics/refinery" either [16:54:13] fatal: did not receive expected object b7135f11e49322a78fbecafc46d51473dd24ef41 [16:54:13] fatal: fetch-pack: invalid index-pack output [16:55:05] !log stopping refinery deployment using scap. Could not git pull latest changes from origin. [16:56:57] the latter might just be some config tweak I need to apply to my git config. Aborting refinery deployment while investigating what's up. 
[16:57:18] (KafkaReplicationFactorTooLow) resolved: (328) Kafka topic DataHubUsageEvent_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:00:01] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10bking) 05Open→03Resolved a:03bking I don't see any new unit failures since we deployed the patch. Closing... [17:04:47] gmodena: Do you have git-fat? [17:05:41] Arf - sorry, it happened on deploy node [17:05:49] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:05:55] It surely has git-fat [17:06:04] that's weird, I've not seen that before :( [17:07:58] joal: gmodena: looking now. I can reproduce as well. [17:09:21] joal btullis the interwebs tells me we can rm .git/refs/remotes/origin/master and hope for the best [17:09:34] but I'd like to understand what's going on :) [17:10:28] (SystemdUnitFailed) firing: (13) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:10:28] I get the same error: `fatal: bad object fc850230e09422a438d10184d8f45baa9ac2e2d0` when pulling master from my workstation, so it would seem to be a server-side issue with gerrit. 
[17:10:46] I think I would be tempted to ask in #wikimedia-releng [17:10:53] re local clone issues: I bumped pack.windowMemory and pack.packSizeLimit to a few GBs and nothing fixes it [17:11:00] btullis ack [17:14:05] (KafkaReplicationFactorTooLow) firing: (296) Kafka topic change-prop.retry.mediawiki.page_restore replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:15:49] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:19:18] (KafkaReplicationFactorTooLow) resolved: (296) Kafka topic change-prop.retry.mediawiki.page_restore replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [17:23:42] 10Data-Engineering, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10xcollazo) I must point out that the use case we have today that utilizes Varnish sequence number does not require uniqueness, but it does req... [17:39:28] (03PS40) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [17:39:58] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [17:45:03] btullis: could you have a look at https://phabricator.wikimedia.org/T354452 ? 
I'm not yet comfortable with all those wiki replicas:( [18:06:14] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10gmodena) [18:09:35] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10gmodena) Pulling from `gerrita-replica` works ` git clone "ssh://gmodena@gerrit-replica.wikimedia.org:29418/analytics/refinery" ` [18:09:47] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10SRE, 10ops-eqiad: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [18:10:59] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10SRE, 10ops-eqiad: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10VRiley-WMF) [18:11:14] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10SRE, 10ops-eqiad: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10VRiley-WMF) 05Open→03Resolved [18:11:16] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (10VRiley-WMF) [18:12:06] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10SRE, 10ops-eqiad: Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [18:15:09] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10SRE, 10ops-eqiad: Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10VRiley-WMF) [18:15:22] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Data-Persistence: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (10VRiley-WMF) [18:15:26] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10SRE, 10ops-eqiad: Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10VRiley-WMF) 05Open→03Resolved [18:27:28] 10Data-Engineering, 10Release-Engineering-Team: git 
clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10thcipriani) I suspect we need to run an `fsck` on this repo. Both of those objects exist on the gerrit host: ` thcipriani@gerrit1003:/srv/gerrit/git/analytics/... [18:29:20] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10gmodena) >>! In T355173#9463375, @gmodena wrote: > Pulling from `gerrita-replica` works > ` > git clone "ssh://gmodena@gerrit-replica.wikimedia.org:29418/analyti... [18:42:35] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10brennen) Is there possibly a git version difference involved between your local host and deploy2002? [18:42:40] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10thcipriani) I note the the `fc850230e09422a438d10184d8f45baa9ac2e2d0` object is still loose on gerrit2002, whereas on gerrit1003 it's only present in a packfile:... [18:45:57] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10gmodena) > Is there possibly a git version difference involved between your local host and deploy2002? There is. On deploy2002: ` gmodena@deploy2002$ git --ver... [18:49:51] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (10RKemper) Made some various improvements to the dashboard: collated SLIs into a single row, added threshold markers for every SLI, added y axis labell... 
[18:54:25] 10Data-Engineering, 10Movement-Insights, 10Traffic, 10Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10dr0ptp4kt) By the way, here are corresponding `wmf_raw.webrequest` fields for this latest SERP. Notice how two prefetech requests were... [18:55:06] 10Data-Engineering, 10Anti-Harassment, 10Data-Persistence, 10Temporary accounts, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Dreamy_Jazz) [18:57:02] 10Data-Engineering, 10Anti-Harassment, 10Data-Persistence, 10Temporary accounts, and 2 others: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (10Dreamy_Jazz) [18:59:33] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10thcipriani) >>! In T355173#9463496, @brennen wrote: > Is there possibly a git version difference involved between your local host and deploy2002? I think that m... [19:04:47] 10Data-Engineering, 10Data Products: [Bug report] Data quality issue in wmf.edit_hourly - https://phabricator.wikimedia.org/T355182 (10mpopov) [19:09:24] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10thcipriani) I tried: ` scp -r gerrit1003.wikimedia.org:/srv/gerrit/git/analytics/refinery.git /tmp/T355173-analytics-refinery.git git clone /tmp/T355173-analyti... [19:18:34] 10Data-Engineering, 10Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (10thcipriani) So @jeena suggested trying to clone using jgit. So I grabbed the jgit executable from https://www.eclipse.org/jgit/download/ and it does clone the re... 
[19:27:24] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (RKemper) We did the initial work to get envoy via PKI / cfssl operational in https://phabricator.wikimedia.org/T354555#9454855....
[19:33:53] Data-Engineering, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-EventLogging, ci-test-error: CentralNotice failing in browser test on master - https://phabricator.wikimedia.org/T354977 (Ejegg) Very odd - in [[ https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noseleni...
[19:36:20] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (RKemper) Talked with gehel, ebernhardson, and inflatador. We're going to start with `full-experimental.wikidata.org`, `main-exp...
[19:42:16] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (bking) Per IRC conversation with @Joe, he is the right person to approve this...
[19:58:27] Data-Engineering, Data Products: Data quality issue in wmf.edit_hourly - https://phabricator.wikimedia.org/T355182 (mpopov)
[20:00:21] Data-Engineering, Data Products: Data quality issue in wmf.edit_hourly - https://phabricator.wikimedia.org/T355182 (mpopov)
[20:01:54] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (RKemper) We'll need to add 3 entries to https://gerrit.wikimedia.or...
[21:10:28] (SystemdUnitFailed) firing: (13) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:16:04] (PuppetFailure) firing: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:15:00] Data-Engineering, Release-Engineering-Team: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar) It is almost 11pm, my old local clone of analytics/refinery.git had b7135f11e49322a78fbecafc46d51473dd24ef41 which is a tree object. fc850230e09422a438...
[22:21:12] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar) + #gerrit since I am entirely sure it is an issue somewhere in jgit even though it is unlikely we can fix it ourselves. I do pla...
[22:21:15] Data-Engineering, Gerrit, Release-Engineering-Team, Upstream: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 (hashar)