[00:00:30] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release
[00:07:13] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release
[00:08:06] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ops for swfrench - https://phabricator.wikimedia.org/T355912 (Scott_French) In progress→Resolved
[00:31:06] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992654 (owner: TrainBranchBot)
[00:38:27] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992957
[00:38:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992957 (owner: TrainBranchBot)
[00:46:03] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:13] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992957 (owner: TrainBranchBot)
[01:01:57] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: security release
[01:38:51] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:33:21] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:07] <icinga-wm>	 PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[02:38:21] <icinga-wm>	 PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[02:39:22] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:13] <icinga-wm>	 RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[02:44:27] <icinga-wm>	 RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[03:00:21] <wikibugs>	 (PS12) BCornwall: Add module for ncmonitor [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190)
[03:14:22] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:03:11] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:41] <icinga-wm>	 PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:19] <icinga-wm>	 PROBLEM - Check systemd state on db1155 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:23] <icinga-wm>	 PROBLEM - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:33] <icinga-wm>	 PROBLEM - Check systemd state on clouddb1021 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:26] <wikibugs>	 (PS2) Mxmxchere: etcd 3.4: Fix ETCD_CLIENT_CERT_AUTH=false [puppet] - https://gerrit.wikimedia.org/r/992629
[05:36:28] <wikibugs>	 (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/992629 (owner: Mxmxchere)
[05:38:52] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[05:48:15] <jinxer-wm>	 (MediaWikiEditFailures) firing: Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[05:53:15] <jinxer-wm>	 (MediaWikiEditFailures) resolved: Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[06:10:01] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:59] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240126T0700)
[07:54:54] <moritzm>	 !log failover ganeti master for codfw back to ganeti2022, switch maintenance is completed T355549
[07:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:05] <stashbot>	 T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549
[07:59:35] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240126T0800)
[08:01:11] <moritzm>	 ^ can be ignored, monitoring blip
[08:01:12] <moritzm>	 !log rebalance codfw/B following switch maintenance T355549
[08:01:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:20] <stashbot>	 T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549
[08:34:35] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/992967 (https://phabricator.wikimedia.org/T354959) (owner: Muehlenhoff)
[08:40:47] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:47:59] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:49:36] <wikibugs>	 (CR) Hashar: "Thanks, looks like that solved the rendering!" [puppet] - https://gerrit.wikimedia.org/r/993029 (https://phabricator.wikimedia.org/T354886) (owner: Hashar)
[08:50:01] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:50:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:04:39] <wikibugs>	 SRE, ops-eqiad, Goal, cloud-services-team (FY2023/2024-Q1-Q2): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (taavi)
[09:05:17] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (taavi) Open→Resolved a:taavi
[09:06:01] <wikibugs>	 (PS1) Majavah: cr-labs: Remove cloudrabbit term [homer/public] - https://gerrit.wikimedia.org/r/993061
[09:07:02] <wikibugs>	 (PS1) Majavah: wikimediacloud.org: Move Rabbit traffic back to all nodes [dns] - https://gerrit.wikimedia.org/r/993062 (https://phabricator.wikimedia.org/T345610)
[09:13:54] <wikibugs>	 (PS1) Muehlenhoff: puppet::agent: Remove path condition for /run/puppet/disabled [puppet] - https://gerrit.wikimedia.org/r/993063
[09:27:14] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ganeti: Stop using transition package [puppet] - https://gerrit.wikimedia.org/r/992891 (owner: Muehlenhoff)
[09:33:52] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "I think logically it's cleaner to have it only in the service unit? After all the timer is only meant to specify "when" something happens" [puppet] - https://gerrit.wikimedia.org/r/992888 (owner: Majavah)
[09:36:39] <wikibugs>	 (PS2) Btullis: Update the spark-operator image name and version [deployment-charts] - https://gerrit.wikimedia.org/r/993012 (https://phabricator.wikimedia.org/T354273)
[09:38:52] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[09:51:56] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/993065 (https://phabricator.wikimedia.org/T349936)
[09:59:20] <jinxer-wm>	 (ProbeDown) firing: (2) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:00:36] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:01:41] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:03:01] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.440 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:04:52] <wikibugs>	 (CR) Jbond: [C: +1] systemd: timer_service: Move ConditionPathExists to correct section [puppet] - https://gerrit.wikimedia.org/r/992888 (owner: Majavah)
[10:05:50] <wikibugs>	 (CR) Jbond: [C: +1] puppet::agent: Remove path condition for /run/puppet/disabled [puppet] - https://gerrit.wikimedia.org/r/993063 (owner: Muehlenhoff)
[10:07:47] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:08:01] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:13:55] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.837 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:13:59] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:21:31] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:21:43] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:23:45] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:25:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2169 in db2194 for T343674', diff saved to https://phabricator.wikimedia.org/P55737 and previous config saved to /var/cache/conftool/dbconfig/20240126-102550-arnaudb.json
[10:25:57] <stashbot>	 T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674
[10:31:23] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:31:48] <wikibugs>	 SRE, Infrastructure-Foundations: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (MoritzMuehlenhoff)
[10:32:53] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:35:12] <wikibugs>	 (PS1) Muehlenhoff: acme_chief: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/993068 (https://phabricator.wikimedia.org/T329529)
[10:35:15] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:35:27] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.327 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:36:14] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (cmooney) Open→Resolved a:cmooney
[10:36:28] <moritzm>	 !log prune obsolete nginx packages from eventschema hosts after migration to new library scheme T329529
[10:36:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:34] <stashbot>	 T329529: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529
[10:37:05] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/993068 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[10:40:14] <wikibugs>	 (CR) Ayounsi: [C: +1] cr-labs: Remove cloudrabbit term [homer/public] - https://gerrit.wikimedia.org/r/993061 (owner: Majavah)
[10:44:58] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Gitlab security upgrade
[10:50:50] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Gitlab security upgrade
[10:51:15] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[10:52:01] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[11:03:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (SLyngshede-WMF) Hi @Arinaigu, let's try to untangle what is going wrong :-)  You have two usernames, as you point out: because that's what the guides tell you to.  One username is for meta.wikimed...
[11:08:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Arnoldokoth) Open→In progress
[11:09:13] <icinga-wm>	 RECOVERY - Disk space on stat1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops
[11:10:00] <wikibugs>	 (PS1) Muehlenhoff: contint: Remove obsolete firewall rules [puppet] - https://gerrit.wikimedia.org/r/993072
[11:15:09] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:15:34] <wikibugs>	 (PS1) Slyngshede: Add MANIFEST.in [software/debmonitor] - https://gerrit.wikimedia.org/r/993074
[11:16:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:18:45] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:19:41] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:19:51] <wikibugs>	 (PS1) Muehlenhoff: hadoop:httpd: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/993075
[11:20:07] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:20:49] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/993074 (owner: Slyngshede)
[11:21:05] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:21:09] <wikibugs>	 (CR) Slyngshede: [C: +2] Add MANIFEST.in [software/debmonitor] - https://gerrit.wikimedia.org/r/993074 (owner: Slyngshede)
[11:22:56] <wikibugs>	 (Merged) jenkins-bot: Add MANIFEST.in [software/debmonitor] - https://gerrit.wikimedia.org/r/993074 (owner: Slyngshede)
[11:23:44] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/993075 (owner: Muehlenhoff)
[11:23:47] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Arnoldokoth) Hey @odimitrijevic / @Milimetric Kindly approve.
[11:25:19] <wikibugs>	 (PS1) Slyngshede: Debian Build-Depends, add setuptools-scm [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/993076
[11:26:54] <wikibugs>	 (PS2) Slyngshede: Debian Build-Depends, add setuptools-scm [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/993076
[11:28:27] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Gitlab security upgrade
[11:29:00] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/993076 (owner: Slyngshede)
[11:30:27] <wikibugs>	 (CR) Slyngshede: [C: +2] Debian Build-Depends, add setuptools-scm [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/993076 (owner: Slyngshede)
[11:33:22] <wikibugs>	 (Merged) jenkins-bot: Debian Build-Depends, add setuptools-scm [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/993076 (owner: Slyngshede)
[11:38:29] <wikibugs>	 (CR) Majavah: [C: +2] cr-labs: Remove cloudrabbit term [homer/public] - https://gerrit.wikimedia.org/r/993061 (owner: Majavah)
[11:39:03] <wikibugs>	 (Merged) jenkins-bot: cr-labs: Remove cloudrabbit term [homer/public] - https://gerrit.wikimedia.org/r/993061 (owner: Majavah)
[11:43:54] <taavi>	 !log reprepro: copy helm-diff_3.1.3-2 from bullseye-wikimedia to bookworm-wikimedia
[11:43:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:22] <wikibugs>	 ops-eqiad, decommission-hardware, Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (BTullis)
[11:46:24] <wikibugs>	 ops-eqiad, decommission-hardware, Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (BTullis)
[11:46:30] <wikibugs>	 ops-eqiad, decommission-hardware, Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (BTullis)
[12:00:58] <icinga-wm>	 PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 2230 MB (2% inode=83%): /tmp 2230 MB (2% inode=83%): /var/tmp 2230 MB (2% inode=83%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops
[12:03:44] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: partial-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:24] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:09:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.397 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:12:00] <wikibugs>	 (PS1) Slyngshede: Add JQuery dependency [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/993083
[12:15:16] <wikibugs>	 (PS3) Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507)
[12:15:47] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <LDAP/WMDE> for <WMDE Cyn> - https://phabricator.wikimedia.org/T355937 (WMDECyn)
[12:15:56] <wikibugs>	 (CR) CI reject: [V: -1] mobileapps: add Cassandra config support [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[12:17:34] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:38] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:56] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.318 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:19:00] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:22:48] <wikibugs>	 (PS4) Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507)
[12:29:07] <wikibugs>	 (PS1) Slyngshede: P:debmonitor::server update to accommodate deb package. [puppet] - https://gerrit.wikimedia.org/r/993086
[12:30:32] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[12:35:13] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw routed cluster svc - ayounsi@cumin1002"
[12:36:05] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw routed cluster svc - ayounsi@cumin1002"
[12:36:05] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:41:36] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, netops, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) Cluster and cluster group created in Netbox : https://netbox.wikimedia.org/virtualization/cluster-groups/71/  Next (on Monday?) merge the...
[12:43:45] <eoghan>	 Heads up, we'll be restarting gitlab in approximately 15 minutes to allow for a small update. There will be a few minutes of interruption.
[12:48:58] <wikibugs>	 (CR) Muehlenhoff: Add JQuery dependency (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/993083 (owner: Slyngshede)
[12:51:22] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1213/console" [puppet] - https://gerrit.wikimedia.org/r/993086 (owner: Slyngshede)
[12:51:53] <wikibugs>	 (CR) Muehlenhoff: P:debmonitor::server update to accommodate deb package. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/993086 (owner: Slyngshede)
[13:06:02] <wikibugs>	 (CR) Muehlenhoff: Puppet: Routed Ganeti support (11 comments) [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[13:18:44] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:18:52] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Gitlab security upgrade
[13:22:32] <wikibugs>	 (PS2) Slyngshede: P:debmonitor::server update to accommodate deb package. [puppet] - https://gerrit.wikimedia.org/r/993086
[13:23:41] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[13:24:27] <wikibugs>	 SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (cmooney) Open→Resolved a:cmooney All done, things working well on the new switches / EVPN vlans :)
[13:28:16] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1215/console" [puppet] - https://gerrit.wikimedia.org/r/993086 (owner: Slyngshede)
[13:29:14] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <LDAP/WMDE> for <WMDE Cyn> - https://phabricator.wikimedia.org/T355937 (WMDE-leszek) As an engineering manager at WMDE I endorse this request, and confirm @WMDECyn affiliation with WMDE.
[13:31:52] <wikibugs>	 (PS1) Ladsgroup: mariadb: Fix check_private_data.py after pymysql upgrade to 1.1.0 [puppet] - https://gerrit.wikimedia.org/r/993088
[13:33:37] <wikibugs>	 (PS2) Ladsgroup: mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - https://gerrit.wikimedia.org/r/993088
[13:33:46] <wikibugs>	 (CR) Arnaudb: [C: +1] "some hosts are on 1.0.2!" [puppet] - https://gerrit.wikimedia.org/r/993088 (owner: Ladsgroup)
[13:37:50] <wikibugs>	 (PS1) Ayounsi: wmf-netbox: add Ganeti BGP group support [software/homer/deploy] - https://gerrit.wikimedia.org/r/993089 (https://phabricator.wikimedia.org/T300152)
[13:38:53] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[13:39:57] <wikibugs>	 (PS1) Ayounsi: Homer-public: add Ganeti BGP group [homer/public] - https://gerrit.wikimedia.org/r/993090 (https://phabricator.wikimedia.org/T300152)
[13:40:20] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:41:40] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:44:10] <wikibugs>	 (03PS2) 10Ayounsi: Homer-public: add Ganeti BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/993090 (https://phabricator.wikimedia.org/T300152)
[13:45:24] <wikibugs>	 (03CR) 10Ayounsi: "Requires If6c7a30c9377f819c1e66fc66123e6a9deb6ad82" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/993089 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[13:47:19] <wikibugs>	 (03PS3) 10Ladsgroup: mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088
[13:47:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088 (owner: 10Ladsgroup)
[13:48:50] <wikibugs>	 (03PS4) 10Ladsgroup: mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088
[13:51:41] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] "amazing" [puppet] - 10https://gerrit.wikimedia.org/r/993088 (owner: 10Ladsgroup)
[13:52:05] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Fix check_private_data.py after pymysql upgrade to 1.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/993088 (owner: 10Ladsgroup)
[13:57:30] <wikibugs>	 (03PS15) 10Ayounsi: Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[13:57:52] <wikibugs>	 (03CR) 10Ayounsi: "Thanks !" [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[14:00:36] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:02:18] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[14:06:04] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:08:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:08:14] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:08:48] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:08:49] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:13:56] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:14:04] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:18] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:26] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.342 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:21:32] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mesh.configuration: Add sampling support in tracing (copy paste patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/993097 (https://phabricator.wikimedia.org/T351567)
[14:21:34] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566)
[14:22:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris)
[14:24:04] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:24:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries for Ganeti PKI support [puppet] - 10https://gerrit.wikimedia.org/r/993099 (https://phabricator.wikimedia.org/T350686)
[14:24:34] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:24:48] <wikibugs>	 (03PS12) 10Bking: cloudelastic: config changes for migration canary [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617)
[14:24:50] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[14:25:03] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[14:27:41] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:27:58] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:29:03] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566)
[14:31:13] <wikibugs>	 (03CR) 10Ssingh: Puppet: Routed Ganeti support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[14:32:16] <wikibugs>	 (03Abandoned) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992645 (https://phabricator.wikimedia.org/T355066) (owner: 10DCausse)
[14:32:41] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993099 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[14:33:26] <urandom>	 !log decommissioning restbase2015/cassandra-{a,b,c} — T352469
[14:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:35] <stashbot>	 T352469: Decommission restbase20[13-20]) - https://phabricator.wikimedia.org/T352469
[14:34:07] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:34:23] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:34:38] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:26] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:35:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:35:58] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2015.codfw.wmnet with reason: Decommissioning — T352469
[14:36:12] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2015.codfw.wmnet with reason: Decommissioning — T352469
[14:36:48] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:37:00] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:37:34] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:37:58] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:46:59] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:47:15] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[15:00:51] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[15:01:10] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[15:01:18] <icinga-wm>	 RECOVERY - Check systemd state on db1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Extend STORAGE_BACKEND config to support swift [software/netbox] - 10https://gerrit.wikimedia.org/r/980908 (https://phabricator.wikimedia.org/T310717) (owner: 10Ayounsi)
[15:03:15] <wikibugs>	 (03PS1) 10Eevans: cassandra: create template for aqsloader role & grants [puppet] - 10https://gerrit.wikimedia.org/r/993102 (https://phabricator.wikimedia.org/T355917)
[15:06:57] <wikibugs>	 (03PS1) 10Bking: cloudelastic: use CFSSL for TLS on canary [puppet] - 10https://gerrit.wikimedia.org/r/993103 (https://phabricator.wikimedia.org/T355617)
[15:08:15] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993103 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[15:08:28] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993102 (https://phabricator.wikimedia.org/T355917) (owner: 10Eevans)
[15:11:18] <wikibugs>	 (03PS3) 10Bking: cloudelastic: add CNAME for migration canary [dns] - 10https://gerrit.wikimedia.org/r/993014 (https://phabricator.wikimedia.org/T355617)
[15:11:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] D:service::docker Run Docker prune on pull. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) (owner: 10Slyngshede)
[15:13:49] <wikibugs>	 (03PS1) 10Eevans: added (fake) aqsloader creds (Cassandra role) [labs/private] - 10https://gerrit.wikimedia.org/r/993105 (https://phabricator.wikimedia.org/T355917)
[15:16:08] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] added (fake) aqsloader creds (Cassandra role) [labs/private] - 10https://gerrit.wikimedia.org/r/993105 (https://phabricator.wikimedia.org/T355917) (owner: 10Eevans)
[15:27:40] <wikibugs>	 (03PS1) 10Bking: cloudelastic: apply cloudelastic role to canary [puppet] - 10https://gerrit.wikimedia.org/r/993148 (https://phabricator.wikimedia.org/T355617)
[15:31:18] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] cloudelastic: config changes for migration canary [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[15:31:38] <wikibugs>	 (03PS2) 10Bking: cloudelastic: apply cloudelastic role to canary [puppet] - 10https://gerrit.wikimedia.org/r/993148 (https://phabricator.wikimedia.org/T355617)
[15:32:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudelastic: apply cloudelastic role to canary [puppet] - 10https://gerrit.wikimedia.org/r/993148 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[15:33:04] <wikibugs>	 (03CR) 10Bking: [C: 03+2] cloudelastic: config changes for migration canary [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[15:33:34] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] cloudelastic: config changes for migration canary [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[15:37:00] <icinga-wm>	 RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:28] <wikibugs>	 (03CR) 10DCausse: "should you remove cloudelastic1010 from hieradata/role/eqiad/elasticsearch/cloudelastic.yaml and conftool-data/node/eqiad.yaml ?" [puppet] - 10https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[15:42:45] <wikibugs>	 (03PS1) 10Bking: cloudelastic: remove references to cloudelastic1010 [puppet] - 10https://gerrit.wikimedia.org/r/993150 (https://phabricator.wikimedia.org/T355617)
[15:50:16] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: add cassandra config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507)
[15:51:04] <icinga-wm>	 RECOVERY - Check systemd state on clouddb1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:24] <icinga-wm>	 RECOVERY - Check systemd state on clouddb1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:21] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] cloudelastic: remove references to cloudelastic1010 [puppet] - 10https://gerrit.wikimedia.org/r/993150 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[16:03:50] <wikibugs>	 (03CR) 10Bking: [C: 03+2] cloudelastic: remove references to cloudelastic1010 [puppet] - 10https://gerrit.wikimedia.org/r/993150 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[16:15:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617
[16:16:24] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[16:23:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1010.wikimedia.org
[16:29:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:30:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2169 in db2194 for T343674', diff saved to https://phabricator.wikimedia.org/P55740 and previous config saved to /var/cache/conftool/dbconfig/20240126-163057-arnaudb.json
[16:31:18] <stashbot>	 T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674
[16:31:25] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[16:31:55] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[16:32:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1010.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[16:33:09] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617
[16:33:36] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[16:33:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1010.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[16:33:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:33:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1010.wikimedia.org
[16:43:29] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[16:47:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10odimitrijevic) Approved.
[17:04:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:08:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sync cloudelastic1010 IPs - bking@cumin2002"
[17:09:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sync cloudelastic1010 IPs - bking@cumin2002"
[17:09:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:10:11] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2075 is CRITICAL: CRITICAL - load average: 120.58, 104.13, 74.77 https://wikitech.wikimedia.org/wiki/Swift
[17:11:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010
[17:12:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010
[17:17:39] <wikibugs>	 (03PS5) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507)
[17:17:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS bullseye
[17:18:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan)
[17:23:33] <wikibugs>	 (03PS6) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507)
[17:24:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan)
[17:26:58] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris)
[17:28:17] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2075 is CRITICAL: CRITICAL - load average: 104.25, 100.17, 90.48 https://wikitech.wikimedia.org/wiki/Swift
[17:31:03] <wikibugs>	 (03PS7) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507)
[17:31:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan)
[17:32:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) Hi @SLyngshede-WMF , I've tried logging in with the "arinaigum" (not "arinaugum" as you have in your comment, I assumed that was a typo) again this morning, and I am still getting the s...
[17:34:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage
[17:34:18] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Ahoelzl) a:05odimitrijevic→03Arnoldokoth
[17:34:59] <hnowlan>	 something odd happening to ms-be2075's hardware, lots of "Power-on or device reset occurred" in dmesg (a few a second)
[17:37:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage
[17:38:53] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[17:39:04] <mutante>	 hnowlan: unfortunately that looks like the RAID controller, since it seems to affect all or many of the individual drives.. or a cable.. or it's overheating
[17:39:09] <wikibugs>	 (03PS8) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507)
[17:39:47] <mutante>	 hnowlan:  lsblk -dno name,hctl,serial    shows the serials and drives
[17:40:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan)
[17:40:17] <mutante>	 I'd create a ticket for ops-codfw
[17:44:22] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 743443832 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:46:32] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 61128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:46:42] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Request for BHL-WIKI Group List - https://phabricator.wikimedia.org/T355941 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done: https://lists.wikimedia.org/postorius/lists/bhl-wiki.lists.wikimedia.org  I made the list public with archive, feel free to change that.
[17:47:16] <wikibugs>	 (03PS1) 10BCornwall: Update p::markmonitor to p::ncmonitor::markmonitor [labs/private] - 10https://gerrit.wikimedia.org/r/993168
[17:47:39] <wikibugs>	 (03CR) 10BCornwall: [V: 03+2 C: 03+2] Update p::markmonitor to p::ncmonitor::markmonitor [labs/private] - 10https://gerrit.wikimedia.org/r/993168 (owner: 10BCornwall)
[17:49:22] <wikibugs>	 (03PS50) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[17:49:24] <wikibugs>	 (03PS7) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679
[17:49:26] <wikibugs>	 (03PS1) 10AOkoth: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606)
[17:50:20] <wikibugs>	 (03PS2) 10AOkoth: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606)
[17:51:51] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (10Arnoldokoth) a:03thcipriani
[17:53:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[17:57:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[17:57:38] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1010.eqiad.wmnet with OS bullseye
[18:00:36] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:04:04] <wikibugs>	 (03PS13) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190)
[18:06:24] <wikibugs>	 (03PS3) 10Bking: cloudelastic: enable DNS discovery/VIP for test service [puppet] - 10https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617)
[18:11:06] <wikibugs>	 (03PS2) 10Clare Ming: Update Android Metrics Platform stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360)
[18:11:38] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2075 is OK: OK - load average: 62.09, 71.68, 79.83 https://wikitech.wikimedia.org/wiki/Swift
[18:16:27] <mutante>	 !log phab1004 - removing 2fa from TBurmeister (after video verification) T355958
[18:16:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:33] <stashbot>	 T355958: Account recovery help needed for Developer account Triciaburmeister / TBurmeister - https://phabricator.wikimedia.org/T355958
[18:24:53] <wikibugs>	 (03PS1) 10Bking: cloudelastic: migrate cloudelastic1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993175 (https://phabricator.wikimedia.org/T354959)
[18:27:35] <mutante>	 !log cloudweb1003 - OATHAuth disabled for Triciaburmeister. (after video verification - T355958)
[18:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:52] <stashbot>	 T355958: Account recovery help needed for Developer account Triciaburmeister / TBurmeister - https://phabricator.wikimedia.org/T355958
[18:32:11] <wikibugs>	 10SRE, 10ops-codfw, 10User-dcaro, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10nskaggs) a:05nskaggs→03None
[18:58:56] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[19:03:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:04:04] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:04:08] <wikibugs>	 (03PS1) 10Scott French: Ensure ssh-agent services are also enabled [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/993183
[19:04:12] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:06:48] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:09:42] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:10:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:10:08] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:26:41] <wikibugs>	 (03CR) 10Scott French: "Ran into this on my first reboot after running the script. Let me know if you'd like me to go at this in a different way, or drop it in fa" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/993183 (owner: 10Scott French)
[19:38:16] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] "needs eof newline but otherwise good" [puppet] - 10https://gerrit.wikimedia.org/r/993175 (https://phabricator.wikimedia.org/T354959) (owner: 10Bking)
[19:39:10] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:40:32] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:47:10] <wikibugs>	 (03PS2) 10Bking: cloudelastic: migrate cloudelastic1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993175 (https://phabricator.wikimedia.org/T354959)
[19:50:10] <wikibugs>	 (03CR) 10Bking: [C: 03+2] cloudelastic: migrate cloudelastic1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993175 (https://phabricator.wikimedia.org/T354959) (owner: 10Bking)
[19:55:20] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:56:18] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[20:08:22] <wikibugs>	 (03PS1) 10JHathaway: interface: add explicit Augeas lens [puppet] - 10https://gerrit.wikimedia.org/r/993190
[20:09:30] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993190 (owner: 10JHathaway)
[20:24:55] <wikibugs>	 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) Hi @xcollazo,  Josh from ITS has created an ops-dumps group in Google and given you access to it.  He recommends we test this before I delete the alias o...
[20:27:06] <wikibugs>	 (03CR) 10Dzahn: "This is unclear to me. The user provides an SSH key on the ticket but also I hear all they need is access to some dashboards and in this p" [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth)
[20:28:04] <wikibugs>	 (03CR) 10Dzahn: "Also they say they need a Kerberos principal and that would mean you have to set an "krb" line here in data.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth)
[20:31:49] <jinxer-wm>	 (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:32:09] <rzl>	 looking
[20:33:08] <jhathaway>	 also looking
[20:33:25] <rzl>	 looking at the probes it's some slowness and some unavailability, starting at 20:26ish
[20:33:50] <rzl>	 phab is up for me, fwiw
[20:34:19] * jhathaway nods
[20:36:49] <jinxer-wm>	 (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:38:01] <rzl>	 https://grafana.wikimedia.org/goto/lPpM0VpSz?orgId=1 just a spike in heavier requests I guess, scraping maybe
[20:38:29] <rzl>	 we could chase that down further and requestctl it away if necessary but I don't see an immediate need unless it comes back
[20:39:00] <jhathaway>	 yeah saw the same, agreed
[20:39:10] <rzl>	 👍
[20:39:12] <rzl>	 pleasure doing business
[20:39:14] <sobanski>	 There is prior art in requestctl if needed
[20:39:25] <rzl>	 oh, for filtering on phab? perfect, thanks
[20:39:28] <sobanski>	 For Phab specifically
[20:40:18] <rzl>	 ah yeah, request-patterns/sites/phabricator
[20:40:27] <rzl>	 good to know
[20:41:43] <wikibugs>	 (PS2) JHathaway: postgresql: add explicit Augeas lens [puppet] - https://gerrit.wikimedia.org/r/993190
[20:42:33] <wikibugs>	 (PS1) JHathaway: postgresql: add explicit Augeas lens [puppet] - https://gerrit.wikimedia.org/r/993191
[20:43:28] <wikibugs>	 (CR) Ladsgroup: "Is it needed now? We just deployed the new captcha altogether" [puppet] - https://gerrit.wikimedia.org/r/990715 (owner: Reedy)
[20:43:37] <wikibugs>	 (PS3) JHathaway: interface: add explicit Augeas lens [puppet] - https://gerrit.wikimedia.org/r/993190
[20:43:52] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/993190 (owner: JHathaway)
[20:44:11] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/993191 (owner: JHathaway)
[21:01:05] <wikibugs>	 SRE, Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (xcollazo) >Did you get any emails about this and can you control that group?  Didn't get an email, but I can see via groups.google.com that I do have access and...
[21:12:56] <wikibugs>	 (PS1) Ryan Kemper: wdqs: make wdqs2025 puppet 7 [puppet] - https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959)
[21:13:30] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959) (owner: Ryan Kemper)
[21:24:22] <wikibugs>	 (Abandoned) Ebernhardson: cirrus-updater: Increase producer memory from 2g to 3g [deployment-charts] - https://gerrit.wikimedia.org/r/993028 (https://phabricator.wikimedia.org/T352335) (owner: Ebernhardson)
[21:38:53] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[21:53:21] <wikibugs>	 (CR) Bking: [V: +1] wdqs: make wdqs2025 puppet 7 [puppet] - https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959) (owner: Ryan Kemper)
[21:53:27] <wikibugs>	 (CR) Bking: [C: +1] wdqs: make wdqs2025 puppet 7 [puppet] - https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959) (owner: Ryan Kemper)
[22:00:36] <jinxer-wm>	 (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:03:49] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on cloudelastic1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:04:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudelastic1006.wikimedia.org
[22:05:04] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudelastic1006.wikimedia.org
[22:06:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudelastic1006.wikimedia.org
[22:06:54] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudelastic1006.wikimedia.org
[22:30:19] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on cloudelastic1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:37:55] <wikibugs>	 (CR) Bking: "Tentative migration plan is at https://etherpad.wikimedia.org/p/cloudelastic-T355617 . I'm always open to suggestions; ping me on IRC if y" [puppet] - https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:38:50] <wikibugs>	 (PS4) Bking: cloudelastic: enable DNS discovery/VIP for test service [puppet] - https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617)
[22:39:08] <wikibugs>	 (CR) CI reject: [V: -1] cloudelastic: enable DNS discovery/VIP for test service [puppet] - https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[23:41:24] <wikibugs>	 SRE, Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (Dzahn) @xcollazo I commented the alias out of the file to test. And now our mail servers tell me this is a Google gsuite_account. I just sent a test mail to it....
[23:49:41] <wikibugs>	 SRE, Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (Dzahn) I reverted the temp change for now so that over the weekend everything works as before. You can still check if my test mail arrived and then we can close...
[23:52:50] <wikibugs>	 (CR) Dzahn: [C: -1] "They phrase it "access to some of the analytics systems", so that seems like the SSH key does indeed need to be here and they want real sh" [puppet] - https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: AOkoth)
[23:59:32] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <LDAP/WMDE> for <WMDE Cyn> - https://phabricator.wikimedia.org/T355937 (Dzahn) Hi @WMDECyn   please send an email from your WMDE email to Katie Francis -> https://meta.wikimedia.org/wiki/User:KFrancis_(WMF)  She will follow-up with you on signing the NDA.  Once...