[00:00:30] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release [00:07:13] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release [00:08:06] 10SRE, 10LDAP-Access-Requests: Grant Access to ops for swfrench - https://phabricator.wikimedia.org/T355912 (10Scott_French) 05In progress→03Resolved [00:31:06] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/992654 (owner: 10TrainBranchBot) [00:38:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/992957 [00:38:33] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/992957 (owner: 10TrainBranchBot) [00:46:03] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:59:13] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/992957 (owner: 10TrainBranchBot) [01:01:57] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: security release [01:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [02:33:21] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units 
failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:07] PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [02:38:21] PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [02:39:22] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:13] RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [02:44:27] RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [03:00:21] (03PS12) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [03:14:22] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:03:11] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:41] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:19] PROBLEM - Check systemd state on db1155 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:23] PROBLEM - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:33] PROBLEM - Check systemd state on clouddb1021 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:36:26] (03PS2) 10Mxmxchere: etcd 3.4: Fix ETCD_CLIENT_CERT_AUTH=false [puppet] - 10https://gerrit.wikimedia.org/r/992629 [05:36:28] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/992629 (owner: 10Mxmxchere) [05:38:52] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [05:48:15] (MediaWikiEditFailures) firing: Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [05:53:15] (MediaWikiEditFailures) resolved: Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:10:01] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:10:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:10:59] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:04] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240126T0700) [07:54:54] !log failover ganeti master for codfw back to ganeti2022, switch maintenance is completed T355549 [07:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:05] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [07:59:35] PROBLEM - ganeti-wconfd running on ganeti2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240126T0800) [08:01:11] ^ can be ignored, monitoring blip [08:01:12] !log rebalance codfw/B following switch maintenance T355549 [08:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:20] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [08:34:35] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/992967 (https://phabricator.wikimedia.org/T354959) (owner: 10Muehlenhoff) [08:40:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:47:59] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:49:36] (03CR) 10Hashar: "Thanks, looks like that solved the rendering!" 
[puppet] - 10https://gerrit.wikimedia.org/r/993029 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [08:50:01] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:11] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:04:39] 10SRE, 10ops-eqiad, 10Goal, 10cloud-services-team (FY2023/2024-Q1-Q2): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10taavi) [09:05:17] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10taavi) 05Open→03Resolved a:03taavi [09:06:01] (03PS1) 10Majavah: cr-labs: Remove cloudrabbit term [homer/public] - 10https://gerrit.wikimedia.org/r/993061 [09:07:02] (03PS1) 10Majavah: wikimediacloud.org: Move Rabbit traffic back to all nodes [dns] - 10https://gerrit.wikimedia.org/r/993062 (https://phabricator.wikimedia.org/T345610) [09:13:54] (03PS1) 10Muehlenhoff: puppet::agent: Remove path condition for /run/puppet/disabled [puppet] - 10https://gerrit.wikimedia.org/r/993063 [09:27:14] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Stop using transition package [puppet] - 10https://gerrit.wikimedia.org/r/992891 (owner: 10Muehlenhoff) [09:33:52] (03CR) 10Muehlenhoff: [C: 03+1] "I think logically it's cleaner to have it only in the service unit? 
After all the timer is only meant to specify "when" something happens" [puppet] - 10https://gerrit.wikimedia.org/r/992888 (owner: 10Majavah) [09:36:39] (03PS2) 10Btullis: Update the spark-operator image name and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/993012 (https://phabricator.wikimedia.org/T354273) [09:38:52] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [09:51:56] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/993065 (https://phabricator.wikimedia.org/T349936) [09:59:20] (ProbeDown) firing: (2) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:00:36] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:01:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:03:01] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.440 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:04:52] (03CR) 10Jbond: [C: 03+1] systemd: timer_service: Move ConditionPathExists to correct section [puppet] - 10https://gerrit.wikimedia.org/r/992888 
(owner: 10Majavah) [10:05:50] (03CR) 10Jbond: [C: 03+1] puppet::agent: Remove path condition for /run/puppet/disabled [puppet] - 10https://gerrit.wikimedia.org/r/993063 (owner: 10Muehlenhoff) [10:07:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:08:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:13:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.837 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:13:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:21:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:21:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:23:45] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:25:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2169 in db2194 for T343674', diff saved to https://phabricator.wikimedia.org/P55737 and previous config saved to /var/cache/conftool/dbconfig/20240126-102550-arnaudb.json [10:25:57] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [10:31:23] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:31:48] 10SRE, 10Infrastructure-Foundations: Adapt 
profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (10MoritzMuehlenhoff) [10:32:53] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:35:12] (03PS1) 10Muehlenhoff: acme_chief: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/993068 (https://phabricator.wikimedia.org/T329529) [10:35:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:35:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.327 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:36:14] 10SRE, 10Infrastructure-Foundations, 10netops: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (10cmooney) 05Open→03Resolved a:03cmooney [10:36:28] !log prune obsolete nginx packages from eventschema hosts after migration to new library scheme T329529 [10:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:34] T329529: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 [10:37:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993068 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [10:40:14] (03CR) 10Ayounsi: [C: 03+1] cr-labs: Remove cloudrabbit term [homer/public] - 10https://gerrit.wikimedia.org/r/993061 (owner: 10Majavah) [10:44:58] !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Gitlab security upgrade [10:50:50] !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on 
GitLab host gitlab1004.wikimedia.org with reason: Gitlab security upgrade [10:51:15] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:52:01] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [11:03:29] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10SLyngshede-WMF) Hi @Arinaigu, let's try to untangle what is going wrong :-) You have two usernames, as you point out: because that's what the guides tell you to. One username is for meta.wikimed... [11:08:04] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arnoldokoth) 05Open→03In progress [11:09:13] RECOVERY - Disk space on stat1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [11:10:00] (03PS1) 10Muehlenhoff: contint: Remove obsolete firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/993072 [11:15:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:15:34] (03PS1) 10Slyngshede: Add MANIFEST.in [software/debmonitor] - 10https://gerrit.wikimedia.org/r/993074 [11:16:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:18:45] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:19:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[11:19:51] (03PS1) 10Muehlenhoff: hadoop:httpd: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/993075 [11:20:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:20:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/993074 (owner: 10Slyngshede) [11:21:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:21:09] (03CR) 10Slyngshede: [C: 03+2] Add MANIFEST.in [software/debmonitor] - 10https://gerrit.wikimedia.org/r/993074 (owner: 10Slyngshede) [11:22:56] (03Merged) 10jenkins-bot: Add MANIFEST.in [software/debmonitor] - 10https://gerrit.wikimedia.org/r/993074 (owner: 10Slyngshede) [11:23:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993075 (owner: 10Muehlenhoff) [11:23:47] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arnoldokoth) Hey @odimitrijevic / @Milimetric Kindly approve. 
[11:25:19] (03PS1) 10Slyngshede: Debian Build-Depends, add setuptools-scm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993076 [11:26:54] (03PS2) 10Slyngshede: Debian Build-Depends, add setuptools-scm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993076 [11:28:27] !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Gitlab security upgrade [11:29:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993076 (owner: 10Slyngshede) [11:30:27] (03CR) 10Slyngshede: [C: 03+2] Debian Build-Depends, add setuptools-scm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993076 (owner: 10Slyngshede) [11:33:22] (03Merged) 10jenkins-bot: Debian Build-Depends, add setuptools-scm [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993076 (owner: 10Slyngshede) [11:38:29] (03CR) 10Majavah: [C: 03+2] cr-labs: Remove cloudrabbit term [homer/public] - 10https://gerrit.wikimedia.org/r/993061 (owner: 10Majavah) [11:39:03] (03Merged) 10jenkins-bot: cr-labs: Remove cloudrabbit term [homer/public] - 10https://gerrit.wikimedia.org/r/993061 (owner: 10Majavah) [11:43:54] !log reprepro: copy helm-diff_3.1.3-2 from bullseye-wikimedia to bookworm-wikimedia [11:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:22] 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10BTullis) [11:46:24] 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10BTullis) [11:46:30] 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10BTullis) [12:00:58] PROBLEM - Disk 
space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 2230 MB (2% inode=83%): /tmp 2230 MB (2% inode=83%): /var/tmp 2230 MB (2% inode=83%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [12:03:44] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: partial-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:09:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.397 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:12:00] (03PS1) 10Slyngshede: Add JQuery dependency [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993083 [12:15:16] (03PS3) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [12:15:47] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10WMDECyn) [12:15:56] (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [12:17:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.318 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:48] (03PS4) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [12:29:07] (03PS1) 10Slyngshede: P:debmonitor::server update to accommodate deb package. [puppet] - 10https://gerrit.wikimedia.org/r/993086 [12:30:32] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:35:13] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw routed cluster svc - ayounsi@cumin1002" [12:36:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw routed cluster svc - ayounsi@cumin1002" [12:36:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:41:36] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) Cluster and cluster group created in Netbox : https://netbox.wikimedia.org/virtualization/cluster-groups/71/ Next (on Monday?) merge the... [12:43:45] Heads up, we'll be restarting gitlab in approximately 15 minutes to allow for a small update. There will be a few minutes of interruption. 
[12:48:58] (03CR) 10Muehlenhoff: Add JQuery dependency (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993083 (owner: 10Slyngshede) [12:51:22] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1213/console" [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [12:51:53] (03CR) 10Muehlenhoff: P:debmonitor::server update to accommodate deb package. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [13:06:02] (03CR) 10Muehlenhoff: Puppet: Routed Ganeti support (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:18:44] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:18:52] !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Gitlab security upgrade [13:22:32] (03PS2) 10Slyngshede: P:debmonitor::server update to accommodate deb package. 
[puppet] - 10https://gerrit.wikimedia.org/r/993086 [13:23:41] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [13:24:27] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10cmooney) 05Open→03Resolved a:03cmooney All done, things working well on the new switches / EVPN vlans :) [13:28:16] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1215/console" [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [13:29:14] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10WMDE-leszek) As an engineering manager at WMDE I endorse this request, and confirm @WMDECyn affiliation with WMDE. [13:31:52] (03PS1) 10Ladsgroup: mariadb: Fix check_private_data.py after pymysql upgrade to 1.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/993088 [13:33:37] (03PS2) 10Ladsgroup: mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088 [13:33:46] (03CR) 10Arnaudb: [C: 03+1] "some hosts are on 1.0.2!" 
[puppet] - 10https://gerrit.wikimedia.org/r/993088 (owner: 10Ladsgroup) [13:37:50] (03PS1) 10Ayounsi: wmf-netbox: add Ganeti BGP group support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/993089 (https://phabricator.wikimedia.org/T300152) [13:38:53] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:39:57] (03PS1) 10Ayounsi: Homer-public: add Ganeti BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/993090 (https://phabricator.wikimedia.org/T300152) [13:40:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:41:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:44:10] (03PS2) 10Ayounsi: Homer-public: add Ganeti BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/993090 (https://phabricator.wikimedia.org/T300152) [13:45:24] (03CR) 10Ayounsi: "Requires If6c7a30c9377f819c1e66fc66123e6a9deb6ad82" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/993089 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:47:19] (03PS3) 10Ladsgroup: mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088 [13:47:58] (03CR) 10CI reject: [V: 04-1] mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088 (owner: 10Ladsgroup) [13:48:50] (03PS4) 10Ladsgroup: mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088 [13:51:41] (03CR) 
10Arnaudb: [C: 03+1] "amazing" [puppet] - 10https://gerrit.wikimedia.org/r/993088 (owner: 10Ladsgroup) [13:52:05] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088 (owner: 10Ladsgroup) [13:57:30] (03PS15) 10Ayounsi: Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) [13:57:52] (03CR) 10Ayounsi: "Thanks !" [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [14:00:36] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:02:18] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [14:06:04] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:08:10] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:08:14] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:08:48] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs 
[14:08:49] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:13:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:14:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.342 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:21:32] (03PS1) 10Alexandros Kosiaris: mesh.configuration: Add sampling support in tracing (copy paste patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/993097 (https://phabricator.wikimedia.org/T351567) [14:21:34] (03PS1) 10Alexandros Kosiaris: tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) [14:22:25] (03CR) 10CI reject: [V: 04-1] tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [14:24:04] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:24:20] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries for Ganeti PKI support [puppet] - 10https://gerrit.wikimedia.org/r/993099 (https://phabricator.wikimedia.org/T350686) [14:24:34] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:24:48] (03PS12) 10Bking: 
cloudelastic: config changes for migration canary [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) [14:24:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [14:25:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [14:27:41] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:27:58] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:29:03] (03PS2) 10Alexandros Kosiaris: tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) [14:31:13] (03CR) 10Ssingh: Puppet: Routed Ganeti support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [14:32:16] (03Abandoned) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992645 (https://phabricator.wikimedia.org/T355066) (owner: 10DCausse) [14:32:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993099 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [14:33:26] !log decommissioning restbase2015/cassandra-{a,b,c} — T352469 [14:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:35] T352469: Decommission restbase20[13-20]) - https://phabricator.wikimedia.org/T352469 [14:34:07] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:34:23] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will 
create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:34:38] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:35:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:35:58] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2015.codfw.wmnet with reason: Decommissioning — T352469 [14:36:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2015.codfw.wmnet with reason: Decommissioning — T352469 [14:36:48] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:37:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:37:34] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:37:58] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:46:59] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:47:15] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [15:00:51] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet 
[15:01:10] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [15:01:18] RECOVERY - Check systemd state on db1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] Extend STORAGE_BACKEND config to support swift [software/netbox] - 10https://gerrit.wikimedia.org/r/980908 (https://phabricator.wikimedia.org/T310717) (owner: 10Ayounsi) [15:03:15] (03PS1) 10Eevans: cassandra: create template for aqsloader role & grants [puppet] - 10https://gerrit.wikimedia.org/r/993102 (https://phabricator.wikimedia.org/T355917) [15:06:57] (03PS1) 10Bking: cloudelastic: use CFSSL for TLS on canary [puppet] - 10https://gerrit.wikimedia.org/r/993103 (https://phabricator.wikimedia.org/T355617) [15:08:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993103 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:08:28] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993102 (https://phabricator.wikimedia.org/T355917) (owner: 10Eevans) [15:11:18] (03PS3) 10Bking: cloudelastic: add CNAME for migration canary [dns] - 10https://gerrit.wikimedia.org/r/993014 (https://phabricator.wikimedia.org/T355617) [15:11:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] D:service::docker Run Docker prune on pull. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) (owner: 10Slyngshede) [15:13:49] (03PS1) 10Eevans: added (fake) aqsloader creds (Cassandra role) [labs/private] - 10https://gerrit.wikimedia.org/r/993105 (https://phabricator.wikimedia.org/T355917) [15:16:08] (03CR) 10Eevans: [V: 03+2 C: 03+2] added (fake) aqsloader creds (Cassandra role) [labs/private] - 10https://gerrit.wikimedia.org/r/993105 (https://phabricator.wikimedia.org/T355917) (owner: 10Eevans) [15:27:40] (03PS1) 10Bking: cloudelastic: apply cloudelastic role to canary [puppet] - 10https://gerrit.wikimedia.org/r/993148 (https://phabricator.wikimedia.org/T355617) [15:31:18] (03CR) 10Ssingh: [C: 03+1] cloudelastic: config changes for migration canary [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:31:38] (03PS2) 10Bking: cloudelastic: apply cloudelastic role to canary [puppet] - 10https://gerrit.wikimedia.org/r/993148 (https://phabricator.wikimedia.org/T355617) [15:32:09] (03CR) 10CI reject: [V: 04-1] cloudelastic: apply cloudelastic role to canary [puppet] - 10https://gerrit.wikimedia.org/r/993148 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:33:04] (03CR) 10Bking: [C: 03+2] cloudelastic: config changes for migration canary [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:33:34] (03CR) 10JHathaway: [C: 03+1] cloudelastic: config changes for migration canary [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:37:00] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:28] (03CR) 10DCausse: "should you remove cloudelastic1010 from hieradata/role/eqiad/elasticsearch/cloudelastic.yaml and conftool-data/node/eqiad.yaml 
?" [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:42:45] (03PS1) 10Bking: cloudelastic: remove references to cloudelastic1010 [puppet] - 10https://gerrit.wikimedia.org/r/993150 (https://phabricator.wikimedia.org/T355617) [15:50:16] (03PS1) 10Hnowlan: mobileapps: add cassandra config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) [15:51:04] RECOVERY - Check systemd state on clouddb1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:24] RECOVERY - Check systemd state on clouddb1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:21] (03CR) 10DCausse: [C: 03+1] cloudelastic: remove references to cloudelastic1010 [puppet] - 10https://gerrit.wikimedia.org/r/993150 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [16:03:50] (03CR) 10Bking: [C: 03+2] cloudelastic: remove references to cloudelastic1010 [puppet] - 10https://gerrit.wikimedia.org/r/993150 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [16:15:50] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617 [16:16:24] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [16:23:07] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1010.wikimedia.org [16:29:33] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:30:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2169 in db2194 for T343674', diff saved to https://phabricator.wikimedia.org/P55740 and previous config saved to /var/cache/conftool/dbconfig/20240126-163057-arnaudb.json [16:31:18] T343674: 
Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [16:31:25] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:31:55] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:32:09] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1010.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [16:33:09] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617 [16:33:36] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [16:33:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1010.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [16:33:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1010.wikimedia.org [16:43:29] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:47:30] 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10odimitrijevic) Approved. 
[17:04:36] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:08:12] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sync cloudelastic1010 IPs - bking@cumin2002" [17:09:05] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sync cloudelastic1010 IPs - bking@cumin2002" [17:09:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:10:11] PROBLEM - very high load average likely xfs on ms-be2075 is CRITICAL: CRITICAL - load average: 120.58, 104.13, 74.77 https://wikitech.wikimedia.org/wiki/Swift [17:11:08] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010 [17:12:29] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010 [17:17:39] (03PS5) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [17:17:42] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS bullseye [17:18:28] (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [17:23:33] (03PS6) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [17:24:17] (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [17:26:58] (03CR) 10CDanis: [C: 03+1] tracing: Add local_service/support random sampling [deployment-charts] - 
10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [17:28:17] PROBLEM - very high load average likely xfs on ms-be2075 is CRITICAL: CRITICAL - load average: 104.25, 100.17, 90.48 https://wikitech.wikimedia.org/wiki/Swift [17:31:03] (03PS7) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [17:31:54] (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [17:32:29] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) Hi @SLyngshede-WMF , I've tried logging in with the "arinaigum" (not "arinaugum" as you have in your comment, I assumed that was a typo) again this morning, and I am still getting the s... [17:34:05] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage [17:34:18] 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Ahoelzl) a:05odimitrijevic→03Arnoldokoth [17:34:59] something odd happening to ms-be2075's hardware, lots of "Power-on or device reset occurred" in dmesg (a few a second) [17:37:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage [17:38:53] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [17:39:04] hnowlan: unfortunately 
that looks like the RAID controller, since it seems to affect all or many of the individual drives... or a cable... or it's overheating [17:39:09] (03PS8) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [17:39:47] hnowlan: lsblk -dno name,hctl,serial shows the serials and drives [17:40:01] (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [17:40:17] I'd create a ticket for ops-codfw [17:44:22] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 743443832 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:46:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 61128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:46:42] 10SRE, 10Wikimedia-Mailing-lists: Request for BHL-WIKI Group List - https://phabricator.wikimedia.org/T355941 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done: https://lists.wikimedia.org/postorius/lists/bhl-wiki.lists.wikimedia.org I made the list public with archive, feel free to change that.
[17:47:16] (03PS1) 10BCornwall: Update p::markmonitor to p::ncmonitor::markmonitor [labs/private] - 10https://gerrit.wikimedia.org/r/993168 [17:47:39] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Update p::markmonitor to p::ncmonitor::markmonitor [labs/private] - 10https://gerrit.wikimedia.org/r/993168 (owner: 10BCornwall) [17:49:22] (03PS50) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [17:49:24] (03PS7) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 [17:49:26] (03PS1) 10AOkoth: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) [17:50:20] (03PS2) 10AOkoth: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) [17:51:51] 10SRE, 10SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (10Arnoldokoth) a:03thcipriani [17:53:57] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [17:57:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [17:57:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1010.eqiad.wmnet with OS bullseye [18:00:36] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:04] (03PS13) 
10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [18:06:24] (03PS3) 10Bking: cloudelastic: enable DNS discovery/VIP for test service [puppet] - 10https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617) [18:11:06] (03PS2) 10Clare Ming: Update Android Metrics Platform stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360) [18:11:38] RECOVERY - very high load average likely xfs on ms-be2075 is OK: OK - load average: 62.09, 71.68, 79.83 https://wikitech.wikimedia.org/wiki/Swift [18:16:27] !log phab1004 - removing 2fa from TBurmeister (after video verification) T355958 [18:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:33] T355958: Account recovery help needed for Developer account Triciaburmeister / TBurmeister - https://phabricator.wikimedia.org/T355958 [18:24:53] (03PS1) 10Bking: cloudelastic: migrate cloudelastic1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993175 (https://phabricator.wikimedia.org/T354959) [18:27:35] !log cloudweb1003 - OATHAuth disabled for Triciaburmeister. 
(after video verification - T355958) [18:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:52] T355958: Account recovery help needed for Developer account Triciaburmeister / TBurmeister - https://phabricator.wikimedia.org/T355958 [18:32:11] 10SRE, 10ops-codfw, 10User-dcaro, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10nskaggs) a:05nskaggs→03None [18:58:56] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [19:03:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:04:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:04:08] (03PS1) 10Scott French: Ensure ssh-agent services are also enabled [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/993183 [19:04:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:06:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:09:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 
02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:10:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:10:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:41] (03CR) 10Scott French: "Ran into this on my first reboot after running the script. Let me know if you'd like me to go at this in a different way, or drop it in fa" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/993183 (owner: 10Scott French) [19:38:16] (03CR) 10Ryan Kemper: [C: 03+1] "needs eof newline but otherwise good" [puppet] - 10https://gerrit.wikimedia.org/r/993175 (https://phabricator.wikimedia.org/T354959) (owner: 10Bking) [19:39:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:47:10] (03PS2) 10Bking: cloudelastic: migrate cloudelastic1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993175 (https://phabricator.wikimedia.org/T354959) [19:50:10] (03CR) 10Bking: [C: 03+2] cloudelastic: migrate cloudelastic1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993175 (https://phabricator.wikimedia.org/T354959) (owner: 10Bking) [19:55:20] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:01:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [20:08:22] (03PS1) 10JHathaway: interface: add explicit Augeas lens [puppet] - 10https://gerrit.wikimedia.org/r/993190 [20:09:30] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993190 (owner: 10JHathaway) [20:24:55] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) Hi @xcollazo, Josh from ITS has created an ops-dumps groups in Google and given you access to it. He recommends we test this before I delete the alias o... [20:27:06] (03CR) 10Dzahn: "This is unclear to me. The user provides an SSH key on the ticket but also I hear all they need is access to some dashboards and in this p" [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [20:28:04] (03CR) 10Dzahn: "Also they say they need a Kerberos principal and that would mean you have to set an "krb" line here in data.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [20:31:49] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:09] looking [20:33:08] also looking [20:33:25] looking at the probes it's some slowness and some unavailability, starting at 20:26ish [20:33:50] phab is up for me, fwiw [20:34:19] * jhathaway nods [20:36:49] (ProbeDown) resolved: Service 
phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:38:01] https://grafana.wikimedia.org/goto/lPpM0VpSz?orgId=1 just a spike in heavier requests I guess, scraping maybe [20:38:29] we could chase that down further and requestctl it away if necessary but I don't see an immediate need unless it comes back [20:39:00] yeah saw the same, agreed [20:39:10] 👍 [20:39:12] pleasure doing business [20:39:14] There is prior art in requestctl if needed [20:39:25] oh, for filtering on phab? perfect, thanks [20:39:28] For Phab specifically [20:40:18] ah yeah, request-patterns/sites/phabricator [20:40:27] good to know [20:41:43] (03PS2) 10JHathaway: postgresql: add explicit Augeas lens [puppet] - 10https://gerrit.wikimedia.org/r/993190 [20:42:33] (03PS1) 10JHathaway: postgresql: add explicit Augeas lens [puppet] - 10https://gerrit.wikimedia.org/r/993191 [20:43:28] (03CR) 10Ladsgroup: "Is it needed now? We just deployed the new captcha altogether" [puppet] - 10https://gerrit.wikimedia.org/r/990715 (owner: 10Reedy) [20:43:37] (03PS3) 10JHathaway: interface: add explicit Augeas lens [puppet] - 10https://gerrit.wikimedia.org/r/993190 [20:43:52] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993190 (owner: 10JHathaway) [20:44:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993191 (owner: 10JHathaway) [21:01:05] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10xcollazo) >Did you get any emails about this and can you control that group? Didn't get an email, but I can see via groups.google.com that I do have access and... 
[21:12:56] (03PS1) 10Ryan Kemper: wdqs: make wdqs2025 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959) [21:13:30] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959) (owner: 10Ryan Kemper) [21:24:22] (03Abandoned) 10Ebernhardson: cirrus-updater: Increase producer memory from 2g to 3g [deployment-charts] - 10https://gerrit.wikimedia.org/r/993028 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [21:38:53] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [21:53:21] (03CR) 10Bking: [V: 03+1] wdqs: make wdqs2025 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959) (owner: 10Ryan Kemper) [21:53:27] (03CR) 10Bking: [C: 03+1] wdqs: make wdqs2025 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959) (owner: 10Ryan Kemper) [22:00:36] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:49] (PuppetZeroResources) firing: Puppet has failed generate resources on cloudelastic1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:04:37] !log bking@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudelastic1006.wikimedia.org 
[22:05:04] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudelastic1006.wikimedia.org [22:06:36] !log bking@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudelastic1006.wikimedia.org [22:06:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudelastic1006.wikimedia.org [22:30:19] (PuppetZeroResources) resolved: Puppet has failed generate resources on cloudelastic1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:37:55] (03CR) 10Bking: "Tentative migration plan is at https://etherpad.wikimedia.org/p/cloudelastic-T355617 . I'm always open to suggestions; ping me on IRC if y" [puppet] - 10https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [22:38:50] (03PS4) 10Bking: cloudelastic: enable DNS discovery/VIP for test service [puppet] - 10https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617) [22:39:08] (03CR) 10CI reject: [V: 04-1] cloudelastic: enable DNS discovery/VIP for test service [puppet] - 10https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [23:41:24] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) @xcollazo I commented the alias out of the file to test. And now our mail servers tell me this is a Google gsuite_account. I just sent a test mail to it.... [23:49:41] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) I reverted the temp change for now so that over the weekend everything works as before. You can still check if my test mail arrived and then we can close... 
[23:52:50] (03CR) 10Dzahn: [C: 04-1] "They phrase it "access to some of the analytics systems", so that seems like the SSH key does indeed need to be here and they want real sh" [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [23:59:32] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10Dzahn) Hi @WMDECyn please send an email from your WMDE email to Katie Francis -> https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) She will follow-up with you on signing the NDA. Once...