[00:03:16] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[00:04:20] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:19] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:14:55] Data-Engineering (Sprint 7): [Data Quality] Provide documentation for Data Quality Metrics on Wikitech - https://phabricator.wikimedia.org/T355624 (Ahoelzl)
[00:16:42] Data-Engineering: [Data Quality] Provide documentation for Data Quality Metrics on Wikitech - https://phabricator.wikimedia.org/T355624 (Ahoelzl)
[00:28:46] Data-Platform-SRE, collaboration-services: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (Dzahn)
[01:15:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:19] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:34:19] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:52:46] Data-Engineering, Fundraising Tech - Chaos Crew, Fundraising-Backlog, MediaWiki-Core-Tests, and 3 others: CentralNotice failing in browser test on master - https://phabricator.wikimedia.org/T354977 (Cstone)
[03:15:08] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on an-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:35:36] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:44:30] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Amire80)
[05:44:55] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Amire80)
[05:49:40] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Amire80)
[07:15:08] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on an-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:36:12] Data-Engineering, Data-Platform-SRE: Check home/HDFS leftovers of shubhankar - https://phabricator.wikimedia.org/T355501 (MGerlach) @MoritzMuehlenhoff could we keep Shubhankar's data on stat1008 for some time (e.g. in my home directory under `/home/mgerlach/shubhankar`)? Data on hdfs can be dropped (thou...
[08:39:40] Data-Engineering, Data-Platform-SRE: Check home/HDFS leftovers of shubhankar - https://phabricator.wikimedia.org/T355501 (MoritzMuehlenhoff) >>! In T355501#9479888, @MGerlach wrote: > @MoritzMuehlenhoff could we keep Shubhankar's data on stat1008 for some time (e.g. in my home directory under `/home/mger...
[09:20:35] Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (Gehel) A few comments: * It is not super clear what exactly we are measuring for each of the 4 metrics (full text search, autocomplete, search prev...
[09:21:19] Data-Platform-SRE, Discovery-Search: Migrate Search SLOs to prometheus based metrics - https://phabricator.wikimedia.org/T355589 (Gehel) p:Triage→Medium
[09:26:55] Data-Engineering, Data-Platform-SRE, Dumps-Generation, SRE, Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228 (hashar) When running the MediaWiki train, scap complained due to the ssh host key of `snapshot1016.eqiad.wmnet` not b...
[09:34:13] Data-Engineering, Data-Platform-SRE, Dumps-Generation, SRE: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228 (MoritzMuehlenhoff) >>! In T325228#9480025, @hashar wrote: > When running the MediaWiki train, scap complained due to the ssh host key of `s...
[09:35:36] (SystemdUnitFailed) firing: (21) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:05:37] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission an-master1001 - https://phabricator.wikimedia.org/T355653 (BTullis)
[10:06:14] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (BTullis)
[10:06:39] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (BTullis)
[10:06:41] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (BTullis)
[10:07:05] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (BTullis)
[10:07:14] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (BTullis)
[10:08:25] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (BTullis)
[10:33:28] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07), Patch-For-Review: Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (CodeReviewBot) phuedx updated https://gitlab.wikimedia.org/rep...
[10:34:13] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1002 for hosts: `an-master1001.eqiad.wmnet` - an-master1001.eqiad.wmnet (**...
[10:57:34] (DiskSpace) firing: Disk space stat1005:9100:/ 2.409% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:58:37] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1002 for hosts: `an-master1002.eqiad.wmnet` - an-master1002.eqiad.wmnet (**...
[11:26:16] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware, ops-eqiad: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (BTullis) a:BTullis→Jclark-ctr
[11:26:57] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware, ops-eqiad: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (BTullis) a:BTullis→Jclark-ctr
[11:39:07] Data-Platform-SRE, Data-Services, cloud-services-team: move cloudelastic behind cloudlb - https://phabricator.wikimedia.org/T346946 (cmooney) >>! In T346946#9477701, @bking wrote: > @taavi a few questions to clarify scope and amount of work required, since we've already been asked to [[ https://phabr...
[12:24:30] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (BTullis)
[12:24:57] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (BTullis)
[12:57:27] Data-Platform-SRE, Data-Services, cloud-services-team: move cloudelastic behind cloudlb - https://phabricator.wikimedia.org/T346946 (ayounsi) Thanks for the thorough comment ! My vote goes to option 1 :) * It's a design we've done 1000 times (expose a prod service externally through the LVS), so it...
[13:35:36] (SystemdUnitFailed) firing: (12) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:52:09] Data-Platform-SRE (2024.01.01 - 2024.01.21): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (cmooney) >>! In T351354#9477734, @ayounsi wrote: >> and how long do you think it would take to re-ip these servers? If there are any docs on how to do this let us know. >...
[13:54:27] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (Gehel)
[13:55:01] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (Gehel)
[13:55:08] Data-Platform-SRE, Data-Services, cloud-services-team: move cloudelastic behind cloudlb - https://phabricator.wikimedia.org/T346946 (cmooney) Open→Declined >>! In T346946#9480830, @ayounsi wrote: > My vote goes to option 1 :) Ok. I've no strong objection. > * "still have VM traffic connect...
[13:55:58] Data-Platform-SRE: Root cause Archiva outage from 2023-09-24 - https://phabricator.wikimedia.org/T347343 (Gehel)
[13:56:11] Data-Platform-SRE: Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (Gehel)
[13:56:31] Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (Gehel)
[13:56:43] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Gehel)
[13:56:52] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (Gehel)
[13:57:01] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (Gehel)
[13:57:07] Data-Platform-SRE: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (Gehel)
[13:58:03] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (Gehel)
[14:01:47] Data-Platform-SRE, collaboration-services: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (Gehel)
[14:01:52] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel)
[14:02:18] Data-Platform-SRE (2024.01.22 - 2024.02.11): ProbeDown - https://phabricator.wikimedia.org/T355272 (Gehel)
[14:06:35] Data-Engineering, Data-Platform-SRE, Dumps-Generation, SRE: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228 (Gehel) p:Triage→High
[14:08:56] Data-Platform-SRE, Data-Services, cloud-services-team: move cloudelastic behind cloudlb - https://phabricator.wikimedia.org/T346946 (bking) Thanks @cmooney , @taavi and @ayounsi . I've created T355617 for the private IP migration and will reach out after discussing the timetable with my team lead @G...
[14:15:44] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (Gehel) Open→Resolved a:Gehel
[14:15:48] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (Gehel)
[14:16:15] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (Gehel) Open→Resolved
[14:16:21] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel)
[14:16:27] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Discovery-Search (Current work): Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 (Gehel) Open→Resolved
[14:16:33] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel)
[14:19:20] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:46] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
[14:23:08] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (Gehel) a:Gehel→BTullis
[14:34:07] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Amire80)
[14:34:19] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:36:36] o/ I'm running something that will eat a lot of memory on stat1005, let me know if it's causing oom or something like that
[14:44:31] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (Gehel)
[14:54:06] btullis ^^ see Amir1 comment above
[14:57:49] (DiskSpace) firing: Disk space stat1005:9100:/ 2.4% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:05:36] Data-Engineering, Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (lbowmaker) All of this is just table > table lineage though right? In the one example we have in Datahub now, we show webrequest > aqs_hourly (not the transformation). Seem...
[15:12:04] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (cmooney) @bking what is the current status of {T351354} ? If those new nodes are not live yet we can hopefully move them all to private IPs before they are serving traffi...
[15:15:14] Amir1: Thanks. understood.
[15:16:20] I tried using a less memory-heavy approach using bloom filters but it was so slow, it took a day for a small wiki, would take three months for the analysis I want of commons
[15:41:05] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors: - elastic2094 (**FA...
[15:41:58] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Lucas_Werkmeister_WMDE)
[15:43:50] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Lucas_Werkmeister_WMDE) > the new version was [updated in values-test.yaml](https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/9...
[15:44:44] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Jdforrester-WMF) Yeah, think I've got the fix in {T355592} for my own service.
[15:49:47] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Lucas_Werkmeister_WMDE) Then it would also make sense that `values-test.yaml` didn’t cause an error, because that has `WIKIBASE_REPO_HOSTNAME...
[15:52:18] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Jdforrester-WMF) >>! In T355685#9481656, @Lucas_Werkmeister_WMDE wrote: > Then it would also make sense that `values-test.yaml` didn’t cause...
[16:07:07] Data-Platform-SRE, collaboration-services: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (bking) We had good luck switching to CFSSL (which doesn't require manually touching private puppet). If you're interested, the CR is [[ https://gerrit...
[16:09:31] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Lucas_Werkmeister_WMDE) I don’t know – it’s probably related to T334064, but I didn’t really understand a lot of what was going on in that ta...
[16:24:48] Data-Platform-SRE (2024.01.22 - 2024.02.11), collaboration-services: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (Gehel)
[16:25:10] Data-Platform-SRE (2024.01.22 - 2024.02.11), collaboration-services: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (Gehel) p:Triage→High
[16:29:30] Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.22 - 2024.02.11): [Iceberg Migration] Define sensor concept and implementation plan - https://phabricator.wikimedia.org/T354695 (Gehel)
[16:30:20] Data-Engineering (Sprint 7): [Data Quality] Provide documentation for Data Quality Metrics on Wikitech - https://phabricator.wikimedia.org/T355624 (Ahoelzl)
[16:30:44] Data-Engineering (Sprint 7): [Data Quality] Provide documentation for Data Quality Metrics on Wikitech - https://phabricator.wikimedia.org/T355624 (Ahoelzl) a:gmodena
[16:31:11] Data-Engineering (Sprint 7): [Data Quality] Provide documentation for Data Quality Metrics on Wikitech - https://phabricator.wikimedia.org/T355624 (Ahoelzl) https://wikitech.wikimedia.org/wiki/Data_Engineering/Data_Quality
[16:41:14] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (BTullis) a:BTullis
[16:41:41] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (Gehel)
[16:41:51] Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Gehel)
[17:06:27] PROBLEM - Host aqs2003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:07:39] RECOVERY - Host aqs2003 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms
[17:11:50] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (RKemper)
[17:12:21] Data-Platform-SRE (2024.01.22 - 2024.02.11), collaboration-services, Patch-For-Review: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (RKemper) We've rolled this out following the steps in https://wikitech.wikimedia.org/wiki/Cergen#Updat...
[17:12:34] (DiskSpace) resolved: Disk space stat1005:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:13:49] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (RKemper)
[17:15:22] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (RKemper) Experimental microsites are up and...
[17:16:34] (DiskSpace) firing: Disk space stat1005:9100:/ 2.549% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:46:25] Data-Platform-SRE (2024.01.22 - 2024.02.11), collaboration-services: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (Dzahn) This should be resolved now. We confirmed the steps and that new sites are working now.
[17:46:59] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (Dzahn)
[18:15:00] Data-Platform-SRE (2024.01.22 - 2024.02.11), collaboration-services: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (Dzahn)
[18:17:53] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (EBernhardson)
[18:35:36] (SystemdUnitFailed) firing: (12) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:38:07] (PS1) Aqu: [WIP] Use str_to_map built-in fct to parse x-analytics-header [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992475
[18:41:50] (PS2) Aqu: [WIP] Use str_to_map built-in fct to parse x-analytics-header [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992475 (https://phabricator.wikimedia.org/T355391)
[18:42:12] (PS1) Aqu: [WIP] Use str_to_map built-in fct to parse x-analytics-header [analytics/refinery] - https://gerrit.wikimedia.org/r/992477 (https://phabricator.wikimedia.org/T355391)
[18:43:40] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (VRiley-WMF) a:Jclark-ctr→VRiley-WMF
[18:47:45] (CR) CI reject: [V: -1] [WIP] Use str_to_map built-in fct to parse x-analytics-header [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992475 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[18:48:24] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (VRiley-WMF)
[18:48:36] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (VRiley-WMF) This has been completed
[18:48:50] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (VRiley-WMF)
[18:48:52] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (VRiley-WMF) Open→Resolved
[18:49:46] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (VRiley-WMF) a:Jclark-ctr→VRiley-WMF
[18:55:31] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (VRiley-WMF) This has been completed
[18:55:52] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (VRiley-WMF)
[18:56:02] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decom an-master100[1-2] - https://phabricator.wikimedia.org/T353775 (VRiley-WMF)
[18:56:04] Data-Platform-SRE (2024.01.22 - 2024.02.11), SRE, decommission-hardware, ops-eqiad: decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (VRiley-WMF) Open→Resolved
[19:25:24] Data-Platform-SRE (2024.01.22 - 2024.02.11), collaboration-services: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593 (RKemper) Open→Resolved
[19:25:29] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (RKemper)
[19:48:06] (CR) Mforns: "Note to @Eevans: I assumed the Cassandra table is going to be named aqs.local_group_default_T_aqs_config.data, but I just did that because" [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns)
[20:21:34] (DiskSpace) resolved: Disk space stat1005:9100:/ 2.543% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:48:04] (CR) Joal: [C: +1] "I dislike having to add the hive-exec bit, but I don't really see a better solution. Looks good to me :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992475 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[20:49:04] Data-Engineering (Sprint 7): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694 (Ahoelzl) Kwaku 1/18: > The latest update is that we have a PoC implemented in Benthos and are close to testing it. From the Phab task, you'd notice there's a discuss...
[20:50:10] (CR) Joal: "LGTM - Needs the related refinery-source patch released before this can be applied." [analytics/refinery] - https://gerrit.wikimedia.org/r/992477 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[21:06:26] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (bking) @cmooney Unfortunately, cloudelastic is a stateful, clustered service so we'll have to worry about quorum and split-brain. I'm [[ https://etherpad.wikimedia.org/p/c...
[21:22:12] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (cmooney) >>! In T355617#9482761, @bking wrote: > @cmooney Unfortunately, cloudelastic is a stateful, clustered service so we'll have to worry about quorum and split-brain...
[21:22:34] (DiskSpace) firing: Disk space stat1005:9100:/ 2.547% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:35:37] (SystemdUnitFailed) firing: (12) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:49:20] Data-Platform-SRE (2024.01.22 - 2024.02.11): Change TLS/load balancer configuration for cloudelastic - https://phabricator.wikimedia.org/T355720 (bking)
[23:13:26] (PS3) Aqu: Adopt a more resilient approach to use webrequest x-analytics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992475 (https://phabricator.wikimedia.org/T355391)
[23:14:46] (PS2) Aqu: Adopt a more resilient approach to use webrequest x-analytics [analytics/refinery] - https://gerrit.wikimedia.org/r/992477 (https://phabricator.wikimedia.org/T355391)
[23:17:51] Data-Platform-SRE (2024.01.22 - 2024.02.11): Change TLS/load balancer configuration for cloudelastic - https://phabricator.wikimedia.org/T355720 (bking) We'll need Cloudelastic TLS to look more like production Elastic's TLS config, see `modules/profile/manifests/elasticsearch/cirrus.pp` and `modules/elasti...