[00:01:05] (KafkaReplicationFactorTooLow) firing: ...
[00:01:05] Kafka topic codfw.mediawiki.job.LoginNotifyPurgeSeen replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.mediawiki.job.LoginNotifyPurgeSeen&viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[00:06:05] (KafkaReplicationFactorTooLow) resolved: ...
[00:06:05] Kafka topic codfw.mediawiki.job.LoginNotifyPurgeSeen replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.mediawiki.job.LoginNotifyPurgeSeen&viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[07:59:01] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Nikerabbit)
[08:11:59] (CR) Santiago Faci: [C: +2] "Looks good!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1003564 (https://phabricator.wikimedia.org/T357371) (owner: Clare Ming)
[08:13:37] (Merged) jenkins-bot: Update app base schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1003564 (https://phabricator.wikimedia.org/T357371) (owner: Clare Ming)
[09:02:00] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:02:55] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (MoritzMuehlenhoff)
[09:03:16] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (MoritzMuehlenhoff) Open→Resolved All done!
[09:03:20] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:32:42] Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (gmodena) > the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics...
[11:49:51] Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (gmodena) > Open question: do we want webrequest.frontent (or whatever we settle on) to be a versioned stream? https://wiki...
[11:53:49] Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Fabfur) >>! In T351117#9545687, @gmodena wrote: >> the currently suggested one is webrequest.frontend. @gmodena, the idea...
[12:07:30] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:43:19] (PS1) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[12:50:14] (CR) CI reject: [V: -1] [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363) (owner: Joal)
[12:50:59] (PS2) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[12:54:10] (PS6) Gmodena: development: Add webrequest schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: Ottomata)
[12:54:37] (CR) CI reject: [V: -1] development: Add webrequest schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: Ottomata)
[13:00:55] (PS3) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[13:06:30] (PS7) Gmodena: development: add webrequest schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: Ottomata)
[13:25:52] (PS4) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[13:35:58] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (Stevemunene) The `rsync-published.service` error is similar to what we encountered when bringing up stat1009 on T336036. The error occurs as we try to Rsync `$source` to `$destination/$::ho...
[13:42:32] Data-Engineering (Sprint 9): [Data Quality] Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (gmodena) Spoke a bit about this with @xcollazo. There's an API available for accessing partition metadata, which can be utilized to generate IDs compatib...
[13:45:18] (PS5) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[13:48:44] (PS1) Stevemunene: Add stat1010 and stat1011 to scap targets [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/1003042 (https://phabricator.wikimedia.org/T336040)
[13:54:16] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (Stevemunene) adding a link to the rsync-published.service resolution and potential discussion on the stat1011 ticket. https://phabricator.wikimed...
[13:55:31] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (Stevemunene) Once the patch to Add stat1010 and stat1011 to scap targets is merged, we shall add a note on the ops week deployment and keep an eye out during the deploy...
[13:56:10] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (Stevemunene) Once the patch to Add stat1010 and stat1011 to scap targets is merged, we shall add a note on the ops week deployment and keep an ey...
[14:01:01] (PS6) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[14:03:32] (PS7) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[14:04:26] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T356698 (Gehel) a:bking
[14:16:29] (PS8) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[14:26:25] (PS9) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[14:31:43] (PS10) Joal: [WIP] Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[14:41:43] (PS6) Snwachukwu: [WIP] Add Dynamic Pivot job for reportupdater reports [analytics/refinery/source] - https://gerrit.wikimedia.org/r/995271 (https://phabricator.wikimedia.org/T354552)
[14:48:47] (CR) CI reject: [V: -1] [WIP] Add Dynamic Pivot job for reportupdater reports [analytics/refinery/source] - https://gerrit.wikimedia.org/r/995271 (https://phabricator.wikimedia.org/T354552) (owner: Snwachukwu)
[15:08:22] Data-Engineering, Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (Gehel) Stalled→Open
[15:09:12] Data-Engineering, Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (Gehel) a:lbowmaker→None
[15:10:10] Data-Engineering, Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (Gehel) Deprecating Hue is unblocked, so let's do that instead of upgrading. See T341895.
[15:10:12] Data-Engineering, Data-Platform-SRE: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (Gehel) Open→Declined
[15:10:15] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (Gehel)
[15:23:54] Data-Platform-SRE, Data-Platform: Audit Search Platform-owned WMCS accounts - https://phabricator.wikimedia.org/T357162 (Gehel) p:Triage→High
[15:24:20] Data-Engineering, Data-Platform-SRE: Implement periodical cleaning of Airflow databases - https://phabricator.wikimedia.org/T322036 (Gehel) p:Triage→High
[15:24:52] Data-Platform-SRE, sre-alert-triage: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (Gehel) p:Triage→High
[15:25:14] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), sre-alert-triage: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (Gehel)
[15:26:23] Data-Engineering, Data-Platform-SRE: Alerts Review: determine if we can use Prometheus to alert based on historical datasets - https://phabricator.wikimedia.org/T357537 (Gehel) p:Triage→High
[15:26:45] Data-Engineering, Data-Platform-SRE: Alerts Review: determine if we can use Prometheus to alert based on historical datasets - https://phabricator.wikimedia.org/T357537 (Gehel)
[15:26:48] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (Gehel)
[15:37:31] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), sre-alert-triage: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (bking) Open→In progress a:bking
[15:40:29] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), sre-alert-triage: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (bking) Per T347505, these are graph split hosts, which means they don't run the updater at all. We need to remove this check from the...
[16:04:36] Data-Engineering, EventStreams, MediaWiki-General, Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page - https://phabricator.wikimedia.org/T354577 (Htriedman) thanks for awareness around your capacity, @DannyS712! @Ottoma...
[16:38:13] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: superset.wikimedia.org redirects to a CAS error page - https://phabricator.wikimedia.org/T357688 (brouberol)
[16:38:25] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: superset.wikimedia.org redirects to a CAS error page - https://phabricator.wikimedia.org/T357688 (brouberol) p:Triage→Unbreak!
[16:40:03] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: superset.wikimedia.org redirects to a CAS error page - https://phabricator.wikimedia.org/T357688 (brouberol) I've tried to change the service ID to _not_ match superset.wikimedia.org, just superset-k8s.wikimedia.org (https://gerrit.wikimedia.or...
[16:50:40] Data-Engineering, Data Pipelines: Add support for repository artifacts in Airflow - https://phabricator.wikimedia.org/T322690 (lbowmaker) Resolved→Open @mforns - thanks for clarifying. Re-opened
[16:54:10] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform, Patch-For-Review: superset.wikimedia.org redirects to a CAS error page - https://phabricator.wikimedia.org/T357688 (brouberol) Open→Resolved The service is back. This was due to the fact that we already had a `superset` and `superse...
[18:12:47] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1006.wikimedia.org` - cloudelastic1006.wiki...
[18:14:19] Data-Engineering (Sprint 8): [Maintenance] Delete sanitized events removed from sanitization list - https://phabricator.wikimedia.org/T347586 (gmodena) >>! In T347586#9545598, @gmodena wrote: > Data has been deleted from HDFS. It will be quarantined in `hdfs://analytics-hadoop/user/hdfs/.Trash/Current/wmf/da...
[18:19:01] (PS1) Clare Ming: Bump app base version to 1.2 for 2 product metrics mobile schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1003810 (https://phabricator.wikimedia.org/T357371)
[18:20:58] (CR) Clare Ming: "hi Sharvani - just bumping versions -- if it lgtu, please merge at your earliest - thanks!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1003810 (https://phabricator.wikimedia.org/T357371) (owner: Clare Ming)
[18:21:33] something seems up with hive partitions, typically canary events. not seeing hive partitions for eqiad in event.mediawiki_revision_recommendation_create or event.mediawiki_revision_score_drafttopic since 2024-02-14T14:00:00
[18:40:58] Data-Engineering, Community-Tech, Multiblocks, Data Products (Data Products Sprint 09), Event-Platform: Investigate if the new 'Multiblocks' user blocks feature affects the mediawiki.user-blocks-change event stream - https://phabricator.wikimedia.org/T356597 (Ottomata) @JWheeler-WMF EventBus...
[19:12:55] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search (Current work), Patch-For-Review: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 (CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-elasticsearch-image/-/merg...
[19:17:26] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search (Current work), Patch-For-Review: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 (CodeReviewBot) ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/cirrus-integration-test-runner/-/merge_...
[19:17:31] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search (Current work), Patch-For-Review: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 (CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/cirrus-integration-test-runner/-/merge_...
[20:33:05] Data-Engineering, Community-Tech, Multiblocks, Data Products (Data Products Sprint 09), Event-Platform: Investigate if the new 'Multiblocks' user blocks feature affects the mediawiki.user-blocks-change event stream - https://phabricator.wikimedia.org/T356597 (xcollazo) >>! In T356597#9535953,...
[20:38:04] Data-Platform-SRE, CirrusSearch, Discovery-Search, Wikimedia-production-error: RuntimeException: Received cirrusSearchElasticaWrite job for an unwritable cluster cloudelastic. - https://phabricator.wikimedia.org/T357713 (thcipriani) >>! In T357713#9548678, @brennen wrote: > Going back to IRC logs...
[20:53:24] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 3 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (MaryMunyoki)
[21:24:02] Data-Engineering (Sprint 9): [Data Quality] Update data_quality schemas to be compatible with Iceberg tables - https://phabricator.wikimedia.org/T356866 (xcollazo) >>! In T356866#9546027, @gmodena wrote: > Spoke a bit about this with @xcollazo. > > There's an API available for accessing partition metadata,...
[22:09:29] (CR) Clare Ming: "hi - just a refresher that these schemas are technically owned by the Android team." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1003810 (https://phabricator.wikimedia.org/T357371) (owner: Clare Ming)
[22:13:37] (CR) Clare Ming: "I was hoping to get this merged soonish so I can test end-to-end -- the other 2 MP article instruments are using MP's base schemas which h" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1003810 (https://phabricator.wikimedia.org/T357371) (owner: Clare Ming)
[22:19:08] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1005.wikimedia.org` - cloudelastic1005.wiki...