[00:00:25] <wikibugs>	 (CR) CI reject: [V: -1] openstack::nova: use TLS on rabbitmq connections [puppet] - https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268) (owner: Andrew Bogott)
[00:12:10] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "Looks good to me, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: Cwhite)
[00:19:26] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:19:36] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:21:34] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48536 bytes in 1.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:21:46] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:42:02] <icinga-wm_>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:36] <icinga-wm_>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:44] <icinga-wm_>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:14] <icinga-wm_>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:32] <icinga-wm_>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:44:50] <wikibugs>	 SRE, Community-Tech, Data-Persistence (Consultation), MediaWiki-extensions-Phonos, serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (Legoktm) >>! In T314789#8138964, @MusikAnimal wrote: > That's very helpf...
[01:45:12] <icinga-wm_>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json
[01:45:39] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[02:00:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json
[02:01:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:02:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:07:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:13:20] <icinga-wm_>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:15:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json
[02:19:42] <icinga-wm_>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[02:30:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json
[02:30:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[02:30:56] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[02:31:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[02:31:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json
[02:32:36] <icinga-wm_>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:34:26] <icinga-wm_>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:37:48] <icinga-wm_>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[02:43:28] <icinga-wm_>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:44:12] <icinga-wm_>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:04] <icinga-wm_>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:46] <icinga-wm_>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:52] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (nshahquinn-wmf) I don't know if they're relevant, but here are some past tickets related to Qualtrics emails from our domain: * {T164424} * {T176666}
[03:18:14] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[03:23:29] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:03] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:34] <wikibugs>	 (PS10) MdsShakil: Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820)
[04:11:39] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:17] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:13] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) It looks like pywikibot is a decent test rig for session replication races. I installed pywikibot on a Linode instance in the Dallas region, which...
[04:47:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:11:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370
[05:12:00] <stashbot>	 T314370: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T314370
[05:12:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370
[05:12:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1100 with weight 0 T314370', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json
[05:21:17] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:31:01] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/819550 (https://phabricator.wikimedia.org/T314370) (owner: Gerrit maintenance bot)
[05:31:05] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/819550 (https://phabricator.wikimedia.org/T314370) (owner: Gerrit maintenance bot)
[05:38:59] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:00] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:43:57] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:46:01] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:58:19] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:59:05] <Amir1>	 I'll restart mailman if it continues like this
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220809T0600).
[06:00:11] <Amir1>	 o/
[06:00:29] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:00:55] <Amir1>	 !log Starting s5 eqiad failover from db1130 to db1100 - T314370
[06:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:00] <stashbot>	 T314370: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T314370
[06:01:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T314370', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json
[06:01:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:01:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T314370', diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json
[06:02:12] <Amir1>	 done
[06:06:17] <wikibugs>	 (PS2) Ladsgroup: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/819551 (https://phabricator.wikimedia.org/T314370) (owner: Gerrit maintenance bot)
[06:06:31] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/819551 (https://phabricator.wikimedia.org/T314370) (owner: Gerrit maintenance bot)
[06:07:55] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[06:08:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1130 T314370', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json
[06:08:40] <stashbot>	 T314370: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T314370
[06:10:23] <Amir1>	 now it's time to clean up the old s5 master
[06:11:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
[06:11:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
[06:16:25] <wikibugs>	 SRE, SRE-swift-storage, Wikidata-Query-Service: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (fgiunchedi)
[06:16:52] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] trafficserver: allow x-wikimedia-debug to pick a php backend [puppet] - https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) (owner: Giuseppe Lavagetto)
[06:18:56] <wikibugs>	 SRE-swift-storage, Patch-For-Review, User-fgiunchedi: Expand thanos-swift sd[ab]3 SSDs - https://phabricator.wikimedia.org/T314275 (fgiunchedi)
[06:19:15] <Amir1>	 !log dbmaint s5@eqiad (T312863 T312984 T310011 T310485)
[06:19:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:20] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:19:21] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[06:19:21] <stashbot>	 T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[06:20:58] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (fgiunchedi) Hi @papaul, it looks like it might be the battery indeed, I'll let @MatthewVernon check/confirm
[06:21:06] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (fgiunchedi) Hi @papaul, it looks like it might be the battery indeed, I'll let @MatthewVernon check/confirm
[06:22:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[06:22:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[06:24:45] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (ori) We need to decide if we want to make this change, taking into consideration the fact that the resource savings (in dollar te...
[06:25:41] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:23] <wikibugs>	 (PS2) Giuseppe Lavagetto: role::alerting_host: run vopsbot [puppet] - https://gerrit.wikimedia.org/r/821255
[06:34:38] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:51] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (Siko_WMDE) @BCornwall, the associated e-mail address is: simon.kock@wikimedia.de  thank you :)
[06:59:05] <wikibugs>	 (CR) Ayounsi: [C: +2] sre.network.debug: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/812380 (owner: Ayounsi)
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220809T0700)
[07:02:12] <wikibugs>	 (Merged) jenkins-bot: sre.network.debug: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/812380 (owner: Ayounsi)
[07:13:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:16:52] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Joe) >>! In T211661#8139622, @ori wrote: >  As several people have pointed out in conversation with me, the fact that we're stori...
[07:19:34] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:09] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (diego) Hi @BCornwall, just to say this is a high priority for us. We have already lost 4 days of work with @munizaA being locked out of the servers.
[07:45:00] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:55] <wikibugs>	 SRE, SRE-swift-storage, Wikidata, Wikidata-Query-Service, wdwb-tech: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (Gehel) p:Triage→High
[08:04:35] <wikibugs>	 (CR) Jbond: homer: add pyproject.toml (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/821254 (owner: Jbond)
[08:04:52] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] flake8: move all flake8 config to setup.cfg [software/homer] - https://gerrit.wikimedia.org/r/819051 (owner: Jbond)
[08:08:08] <wikibugs>	 (PS1) Elukey: ml-services: add model version config for enwiki editquality goodfaith [deployment-charts] - https://gerrit.wikimedia.org/r/821596
[08:09:59] <wikibugs>	 (CR) David Caro: [C: +2] novafullstack: remove leaked VMs test, moved to alertmanager (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813275 (owner: David Caro)
[08:10:24] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:53] <wikibugs>	 (CR) David Caro: homer: add pyproject.toml (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/821254 (owner: Jbond)
[08:11:48] <wikibugs>	 (PS3) Thiemo Kreuz (WMDE): Remove meaningless restriction level "none" [mediawiki-config] - https://gerrit.wikimedia.org/r/776200
[08:17:20] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:35] <wikibugs>	 (PS1) Ayounsi: sre.network.debug: automatically analyse the remote interface [cookbooks] - https://gerrit.wikimedia.org/r/821597
[08:18:47] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add model version config for enwiki editquality goodfaith [deployment-charts] - https://gerrit.wikimedia.org/r/821596 (owner: Elukey)
[08:21:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[08:22:27] <wikibugs>	 SRE, ops-codfw, DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (jcrespo) @Ladsgroup @Marostegui for reference, this is the couple of one liners I am using on cumin2002 to check the latest 2 million rows for each table:   `lang=bash # mysql.py -BN -h es1021 -...
[08:24:18] <jynus>	 !log starting data check using es1021 and es2021, expect increased read traffic T314559
[08:24:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:21] <stashbot>	 T314559: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559
[08:26:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:27:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:27:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:27:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:28:50] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:28:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[08:29:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, having tests would be great" [software/klaxon] - https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603) (owner: CDanis)
[08:30:19] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:31:46] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: Cwhite)
[08:31:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:34:48] <wikibugs>	 (PS1) Elukey: ml-services: update articlequality's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/821599 (https://phabricator.wikimedia.org/T301878)
[08:36:49] <wikibugs>	 (PS1) Ayounsi: sre.network.debug: allow referencing directly an interface [cookbooks] - https://gerrit.wikimedia.org/r/821600
[08:36:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:39:28] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T314838 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:45:34] <wikibugs>	 (PS1) Giuseppe Lavagetto: Revert "lvs: Use conf2005 in codfw" [puppet] - https://gerrit.wikimedia.org/r/821341
[08:46:58] <wikibugs>	 (PS1) Filippo Giunchedi: o11y: fix logstash alerts to use 'datasource' grafana variable [alerts] - https://gerrit.wikimedia.org/r/821601
[08:47:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[08:48:21] <wikibugs>	 (PS2) Giuseppe Lavagetto: Revert "lvs: Use conf2005 in codfw" [puppet] - https://gerrit.wikimedia.org/r/821341
[08:48:23] <wikibugs>	 (PS1) Giuseppe Lavagetto: lvs: move ulsfo, eqsin off of conf2006 [puppet] - https://gerrit.wikimedia.org/r/821602
[08:49:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:52:04] <wikibugs>	 (Abandoned) Cathal Mooney: Automation changes to support new cloudsw configuration / vrf [homer/public] - https://gerrit.wikimedia.org/r/791460 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[08:52:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:52:11] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:52:15] <wikibugs>	 (CR) Btullis: [C: +2] Bootstrap etcd on the dse_k8s_etcd cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: Btullis)
[08:57:41] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:21] <wikibugs>	 (CR) Ayounsi: [V: +1] "Example output: https://phabricator.wikimedia.org/P32324" [cookbooks] - https://gerrit.wikimedia.org/r/821600 (owner: Ayounsi)
[08:59:54] <wikibugs>	 (CR) Ayounsi: "Example output (including following CR https://phabricator.wikimedia.org/P32324)" [cookbooks] - https://gerrit.wikimedia.org/r/821597 (owner: Ayounsi)
[09:00:49] <wikibugs>	 (PS5) Thiemo Kreuz (WMDE): Drop unused FlaggedRevs threshold level names [mediawiki-config] - https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: Awight)
[09:02:00] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (Joe)
[09:02:35] <wikibugs>	 (PS3) Giuseppe Lavagetto: role::alerting_host: run vopsbot [puppet] - https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840)
[09:03:44] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting, Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (Joe) As of now, the debianization is done in the software, but I'm waiting to build and upload a package until I've solved some of the outstanding issues.
[09:05:41] <wikibugs>	 (PS1) Cathal Mooney: Add cloud-gw-transport-eqiad linknet to cloud-transports capirca def [homer/public] - https://gerrit.wikimedia.org/r/821655 (https://phabricator.wikimedia.org/T314775)
[09:06:45] <wikibugs>	 (PS20) Jbond: PeeringDB API: initial commit [software/spicerack] - https://gerrit.wikimedia.org/r/816701 (owner: Ayounsi)
[09:07:06] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:20] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add cloud-gw-transport-eqiad linknet to cloud-transports capirca def [homer/public] - https://gerrit.wikimedia.org/r/821655 (https://phabricator.wikimedia.org/T314775) (owner: Cathal Mooney)
[09:08:14] <wikibugs>	 (Merged) jenkins-bot: Add cloud-gw-transport-eqiad linknet to cloud-transports capirca def [homer/public] - https://gerrit.wikimedia.org/r/821655 (https://phabricator.wikimedia.org/T314775) (owner: Cathal Mooney)
[09:08:35] <wikibugs>	 (PS21) Jbond: PeeringDB API: initial commit [software/spicerack] - https://gerrit.wikimedia.org/r/816701 (owner: Ayounsi)
[09:10:04] <wikibugs>	 (CR) FNegri: [C: +1] "LGTM, though I'm afraid the formatting will soon diverge again if we don't enforce it with some kind of script in Jenkins. 😊" [puppet] - https://gerrit.wikimedia.org/r/821181 (owner: David Caro)
[09:10:06] <wikibugs>	 (CR) Jbond: PeeringDB API: initial commit (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/816701 (owner: Ayounsi)
[09:11:15] <wikibugs>	 (CR) Vgutierrez: [C: +2] Revert "lvs: Use conf2005 in codfw" [puppet] - https://gerrit.wikimedia.org/r/821341 (owner: Giuseppe Lavagetto)
[09:11:50] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (Joe)
[09:12:23] <vgutierrez>	 !log rolling restart of pybal in codfw - T310070
[09:12:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:26] <stashbot>	 T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070
[09:12:35] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe)
[09:15:10] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-etcd1001 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:08] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot: UX improvements - https://phabricator.wikimedia.org/T314843 (10Joe)
[09:17:42] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] wmcs: autoformat our yaml files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821181 (owner: 10David Caro)
[09:18:37] <wikibugs>	 (03PS1) 10Gehel: admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687
[09:18:43] <wikibugs>	 (03PS1) 10Ayounsi: junos_set_interface_config: fix logic error [cookbooks] - 10https://gerrit.wikimedia.org/r/821688
[09:19:52] <wikibugs>	 (03PS2) 10Gehel: admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687
[09:20:44] <wikibugs>	 (03PS1) 10Ladsgroup: conftool-data: Remove db2135 from the list [puppet] - 10https://gerrit.wikimedia.org/r/821689 (https://phabricator.wikimedia.org/T314656)
[09:21:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update articlequality's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821599 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[09:21:12] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[09:24:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:24:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:25:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:27:04] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Triple-confirmed on puppet, zarcillo and actual data contents." [puppet] - 10https://gerrit.wikimedia.org/r/821689 (https://phabricator.wikimedia.org/T314656) (owner: 10Ladsgroup)
[09:27:20] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2007 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[09:28:44] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] conftool-data: Remove db2135 from the list [puppet] - 10https://gerrit.wikimedia.org/r/821689 (https://phabricator.wikimedia.org/T314656) (owner: 10Ladsgroup)
[09:29:50] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (10RhinosF1) For the IRC side, it's probably better to check cloak or account:  - 1: a nickname on IRC normally has a short period of time (although this can range from 0 to infinity dependin...
[09:31:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:38:44] <wikibugs>	 (03PS1) 10Btullis: Change ownership of etcd certificates for cfssl usecase [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129)
[09:39:39] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] lvs: move ulsfo, eqsin off of conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/821602 (owner: 10Giuseppe Lavagetto)
[09:40:19] <vgutierrez>	 !log rolling restart of pybal in ulsfo - T310070
[09:40:42] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36651/console" [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:42:00] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:42:50] <wikibugs>	 (03PS22) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[09:42:57] <wikibugs>	 (03CR) 10Ayounsi: PeeringDB API: initial commit (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[09:43:04] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:43:16] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "Adding joe to CC as a courtesy. I have verified that this has no effect on production etcd clusters, despite a change to the profile." [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:43:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Joe) a:05BCornwall→03Joe Hi @BCornwall I'm going to take care of this task together with @Clement_Goubert - we already have @akosiaris' approval from before the delays we had;...
[09:44:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Joe)
[09:46:35] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10LSobanski) Approved.
[09:48:24] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5002 is CRITICAL: CRITICAL: 0 connections established with conf2005.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[09:49:51] <jynus>	 ^should I be worried?
[09:51:16] <wikibugs>	 (03PS1) 10Clément Goubert: admin: move cgoubert from ldap_only_users to users [puppet] - 10https://gerrit.wikimedia.org/r/821693
[09:52:46] <jynus>	 oh, I just saw vgutierrez's log now
[09:52:53] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Clement_Goubert)
[09:53:05] <vgutierrez>	 !log rolling restart of pybal in eqsin - T310070
[09:53:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:08] <stashbot>	 T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070
[09:53:09] <vgutierrez>	 jynus: nope you shouldn't
[09:53:22] <jynus>	 yeah, too much scrollback
[09:54:16] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Clement_Goubert)
[09:55:05] <wikibugs>	 (03PS2) 10Clément Goubert: admin: move cgoubert from ldap_only_users to users [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902)
[09:58:14] <wikibugs>	 (03PS1) 10Jbond: admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694
[09:58:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM but add yourself to the sreadmins group" [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902) (owner: 10Clément Goubert)
[09:59:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 (owner: 10Jbond)
[10:00:42] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5002 is OK: OK: 4 connections established with conf2005.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[10:01:14] <wikibugs>	 (03CR) 10Elukey: Change ownership of etcd certificates for cfssl usecase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[10:01:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:02:24] <wikibugs>	 (03PS2) 10Jbond: admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694
[10:03:05] <wikibugs>	 (03PS3) 10Jbond: admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 (https://phabricator.wikimedia.org/T314846)
[10:04:55] <wikibugs>	 (03PS1) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882)
[10:05:06] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Change ownership of etcd certificates for cfssl usecase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[10:06:53] <wikibugs>	 (03PS3) 10Clément Goubert: admin: move cgoubert from ldap_only_users to users [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902)
[10:08:52] <wikibugs>	 (03CR) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu)
[10:11:26] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) (owner: 10Jbond)
[10:14:19] <wikibugs>	 (03CR) 10Clément Goubert: "Done." [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902) (owner: 10Clément Goubert)
[10:14:28] <_joe_>	 claime: ^^
[10:14:56] <wikibugs>	 (03PS10) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730
[10:15:00] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: move cgoubert from ldap_only_users to users [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902) (owner: 10Clément Goubert)
[10:15:10] <claime>	 _joe_: I maybe should have answered with something else than Done then :D
[10:15:21] <_joe_>	 nah :P
[10:15:58] <wikibugs>	 (03PS4) 10Jbond: admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 (https://phabricator.wikimedia.org/T314846)
[10:16:25] <claime>	 .win 13
[10:17:18] <wikibugs>	 (03PS1) 10Elukey: ml-services: update settings for draftquality to test the new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821696 (https://phabricator.wikimedia.org/T301878)
[10:19:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi)
[10:22:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 (https://phabricator.wikimedia.org/T314846) (owner: 10Jbond)
[10:22:45] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Joe)
[10:26:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) (owner: 10Jbond)
[10:27:22] <wikibugs>	 (03PS21) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723
[10:44:18] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:44:44] <wikibugs>	 (03PS2) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882)
[10:46:37] <wikibugs>	 (03PS2) 10Elukey: ml-services: update settings for draftquality to test the new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821696 (https://phabricator.wikimedia.org/T301878)
[10:47:20] <wikibugs>	 (03PS1) 10Filippo Giunchedi: dispatch: add container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821698 (https://phabricator.wikimedia.org/T313229)
[10:47:24] <wikibugs>	 (03CR) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu)
[10:49:06] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:49:21] <wikibugs>	 (03CR) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu)
[10:56:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update settings for draftquality to test the new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821696 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[10:58:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[11:03:32] <wikibugs>	 (03PS1) 10Elukey: ml-services: apply the staging's draftquality image to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821699 (https://phabricator.wikimedia.org/T301878)
[11:08:58] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] admin: update ssh key for mnz [puppet] - 10https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955) (owner: 10RhinosF1)
[11:09:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: apply the staging's draftquality image to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821699 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[11:09:28] <wikibugs>	 (03PS4) 10Ssingh: admin: update ssh key for mnz [puppet] - 10https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955) (owner: 10RhinosF1)
[11:11:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10ssingh) Diego has confirmed via email so merging this given MunizA has been locked out. @MunizaA: echoing @Dzahn's suggestion above to update the address if possib...
[11:12:48] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: apply the staging's draftquality image to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821699 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[11:13:34] <wikibugs>	 (03PS3) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587)
[11:13:38] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10ssingh) 05Open→03Resolved a:03ssingh Change has been merged.
[11:13:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10mfossati)
[11:15:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10mfossati)
[11:20:07] <wikibugs>	 (03CR) 10Jbond: "lgtm minor nit" [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi)
[11:21:31] <wikibugs>	 (03PS4) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587)
[11:21:43] <wikibugs>	 (03CR) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro)
[11:21:44] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:27:13] <wikibugs>	 (03PS3) 10Gehel: admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853)
[11:28:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10Gehel) Approved from my side
[11:32:26] <wikibugs>	 (03CR) 10Jbond: "lgtm but see inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi)
[11:35:38] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:26] <wikibugs>	 (03CR) 10Jbond: "LGTm but sorry i missed a comment" [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[11:47:15] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: reduce backup_keep_time to 2d [puppet] - 10https://gerrit.wikimedia.org/r/820712 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[12:04:18] <icinga-wm>	 RECOVERY - Check systemd state on mw2393 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) Thank you so much for this, @nshahquinn-wmf -- that seems super relevant! From these tickets it appears that we do use SMTP rather than Custo...
[12:09:40] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 2 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10EChetty)
[12:11:42] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:15:23] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821708 (https://phabricator.wikimedia.org/T314456)
[12:30:52] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:58] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:40] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Change ownership of etcd certificates for cfssl usecase [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[12:46:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:48:13] <wikibugs>	 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10TheDJ) >>! In T314789#8139432, @Legoktm wrote: > IMO the most important...
[12:49:41] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Add a check_esc_policy_config subcommand [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis)
[12:50:57] <wikibugs>	 (03Merged) 10jenkins-bot: Add a check_esc_policy_config subcommand [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis)
[12:51:45] <jinxer-wm>	 (JobUnavailable) resolved: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:53:59] <wikibugs>	 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) As a followup to this past weekend's mi...
[12:57:53] <wikibugs>	 (03CR) 10David Caro: homer: add pyproject.toml (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond)
[13:06:12] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 2 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10Ottomata) Hi, I don't know much about this, but I did a little bit of digging.  I can see that the flink session cluster jobmanager is taking ch...
[13:12:50] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:13:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821708 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira)
[13:16:41] <wikibugs>	 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10TheresNoTime) >>! In T314789#8140256, @TheDJ wrote: > You can use lame a...
[13:16:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[13:17:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[13:18:30] <wikibugs>	 (03PS2) 10Majavah: puppetmaster: remove 'allow_from' [puppet] - 10https://gerrit.wikimedia.org/r/799859
[13:19:25] <wikibugs>	 (03PS1) 10Btullis: Change the output directory that cfssl uses for etcd [puppet] - 10https://gerrit.wikimedia.org/r/821721 (https://phabricator.wikimedia.org/T313129)
[13:20:15] <wikibugs>	 (03CR) 10Majavah: puppetmaster: remove 'allow_from' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah)
[13:21:00] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36660/console" [puppet] - 10https://gerrit.wikimedia.org/r/821721 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[13:21:47] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36661/console" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah)
[13:21:48] <denisse|m>	 Hi godog , starting the logging:
[13:21:48] <denisse|m>	 !log netmon1002 to netmon1003 failover
[13:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:20] <godog>	 denisse|m: ok!
[13:22:46] <icinga-wm>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:30] <jynus>	 !log stop replication on db1117:m1 T309074
[13:23:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:33] <stashbot>	 T309074: Put netmon1003 in service - https://phabricator.wikimedia.org/T309074
[13:24:55] <wikibugs>	 (03CR) 10Bking: [V: 03+2] admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) (owner: 10Gehel)
[13:25:03] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) (owner: 10Gehel)
[13:25:13] <wikibugs>	 (03PS2) 10Andrea Denisse: netmon: failover to netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074)
[13:25:40] <wikibugs>	 (03PS4) 10Bking: admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) (owner: 10Gehel)
[13:25:48] <wikibugs>	 (03CR) 10Bking: [V: 03+2] admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) (owner: 10Gehel)
[13:28:32] <wikibugs>	 (03PS2) 10Andrea Denisse: netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074)
[13:29:02] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:14] <denisse|m>	 !log Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
[13:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:29] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[13:29:36] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[13:30:33] <jynus>	 I ran a backup consistent to 13:23:51 state
[13:30:38] <jynus>	 of librenms
[13:31:56] <denisse|m>	 jynus: Thanks a lot! :D
[13:32:21] <denisse|m>	 !log running '# authdns-update' in ns0.wikimedia.org
[13:32:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:42] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "I'm far from a partman expert, but I think this is correct!" [puppet] - 10https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi)
[13:34:00] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 2 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10elukey) ` root@thanos-fe1001:/home/elukey# source /etc/swift/account_AUTH_wdqs.env root@thanos-fe1001:/home/elukey# swift list  rdf-streaming-up...
[13:34:34] <wikibugs>	 (03PS1) 10Andrea Denisse: Revert "netmon: failover to netmon1003" [dns] - 10https://gerrit.wikimedia.org/r/821727
[13:36:04] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "netmon: failover to netmon1003" [dns] - 10https://gerrit.wikimedia.org/r/821727 (owner: 10Andrea Denisse)
[13:40:03] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Smokeping failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/821746 (https://phabricator.wikimedia.org/T309074)
[13:40:22] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] netmon: Smokeping failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/821746 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[13:40:24] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] netmon: Smokeping failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/821746 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[13:40:40] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "This looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan)
[13:42:00] <denisse|m>	 !log Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly, sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746
[13:42:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:43] <denisse|m>	 !log authdns updated successfully
[13:42:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:35] <XioNoX>	 upgrade discussion is here?
[13:43:48] <denisse|m>	 !log Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
[13:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:55] <godog>	 XioNoX: on -sre
[13:43:57] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] netmon: failover to netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[13:44:16] <XioNoX>	 cool
[13:44:19] <XioNoX>	 ping me if needed
[13:44:38] <godog>	 XioNoX: thank you! will do
[13:45:33] <denisse|m>	 !log puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded
[13:45:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:59] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[13:47:01] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[13:48:26] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:48:45] <wikibugs>	 (03PS1) 10Elukey: ml-services: update drafttopic docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878)
[13:48:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/819124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[13:50:44] <denisse|m>	 !log Running '# run-puppet-agent' in the netmon1002 host
[13:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:34] <denisse|m>	 !log Running '# run-puppet-agent' in the netmon1003 host
[13:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:50] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) While Disk Is Cheap (TM), container listing is not and our thumbs containers are the largest in terms of number-of...
[13:52:49] <denisse|m>	 !log Successfully ran '# run-puppet-agent' in the netmon1002 and netmon1003 hosts.
[13:52:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:59] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:54:01] <denisse|m>	 !log Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
[13:54:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:14] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] netmon: Add the netmon1003 host as a syslog destination in homer [homer/public] - 10https://gerrit.wikimedia.org/r/819124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[13:56:02] <wikibugs>	 (03Merged) 10jenkins-bot: netmon: Add the netmon1003 host as a syslog destination in homer [homer/public] - 10https://gerrit.wikimedia.org/r/819124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[13:57:23] <wikibugs>	 (03CR) 10Ayounsi: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[13:57:24] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:57:39] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:57:58] <wikibugs>	 (03PS23) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[13:58:11] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10MatthewVernon) Yes, it's the battery that's died (presumably related to the recent power work).
[13:58:49] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[13:58:57] <dcausse>	 ^ this is me
[13:59:14] <inflatador>	 ACK
[14:00:48] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10MatthewVernon) Yep, it's the battery (like ms-be2032).
[14:00:49] <elukey>	 .7
[14:00:50] <elukey>	 err
[14:01:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:07:13] <wikibugs>	 (03CR) 10Ayounsi: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[14:08:52] <wikibugs>	 (03PS1) 10DCausse: Stop checking codfw for wikidata/wdqs max lag detection [puppet] - 10https://gerrit.wikimedia.org/r/821753
[14:11:27] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 2 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10dcausse) It seems (still not 100% sure yet but seeing a lot of failures related to this) that the repeated failures are caused by the bad swift...
[14:16:13] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] Stop checking codfw for wikidata/wdqs max lag detection [puppet] - 10https://gerrit.wikimedia.org/r/821753 (owner: 10DCausse)
[14:16:22] <wikibugs>	 (03CR) 10Jbond: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[14:17:45] <wikibugs>	 10SRE, 10serviceops-radar, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10lmata) p:05Triage→03Medium a:03fgiunchedi
[14:18:17] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:31] <wikibugs>	 (03PS2) 10Ayounsi: sre.network.debug: automatically analyse the remote interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821597
[14:18:33] <wikibugs>	 (03PS2) 10Ayounsi: sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600
[14:21:52] <wikibugs>	 (03CR) 10Ayounsi: "thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi)
[14:28:49] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
[14:34:59] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10lmata) a:03herron @herron: from a POC standpoint, is this good enough? any thoughts on the last option for user management and SSO?
[14:38:44] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update drafttopic docker image and settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[14:40:24] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) a:05herron→03None >>! In T309033#8140569, @lmata wrote: > any thoughts on the last option for user management and SSO?   Please see https://p...
[14:41:06] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron)
[14:41:21] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10herron)
[14:41:25] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) 05Open→03Resolved a:03herron
[14:43:19] <wikibugs>	 (03PS2) 10Elukey: ml-services: update drafttopic docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878)
[14:43:35] <wikibugs>	 (03CR) 10Elukey: ml-services: update drafttopic docker image and settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[14:43:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[14:43:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:45:37] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[14:46:11] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron)
[14:47:30] <wikibugs>	 (03PS1) 10Majavah: P:ldap::client::labs: support multiple restricted_* groups [puppet] - 10https://gerrit.wikimedia.org/r/821756
[14:48:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[14:48:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:48:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update drafttopic docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[14:49:45] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:50:07] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye
[14:50:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, minor optimisation inline" [puppet] - 10https://gerrit.wikimedia.org/r/821756 (owner: 10Majavah)
[14:52:37] <wikibugs>	 (03PS2) 10Majavah: P:ldap::client::labs: support multiple restricted_* groups [puppet] - 10https://gerrit.wikimedia.org/r/821756
[14:52:57] <wikibugs>	 (03CR) 10Majavah: P:ldap::client::labs: support multiple restricted_* groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821756 (owner: 10Majavah)
[14:54:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:57:39] <denisse|m>	 !log finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors.
[14:57:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:16] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic: Roll out  query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori)
[14:59:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:59:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:59:52] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori)
[15:00:21] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) 05Open→03In progress
[15:04:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821756 (owner: 10Majavah)
[15:04:58] <wikibugs>	 (03PS2) 10BCornwall: admin: add Simon Kock to ldap_only admins (nda,wmde) [puppet] - 10https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: 10Dzahn)
[15:05:07] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
[15:05:31] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:12] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10BCornwall)
[15:08:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: set minimum 200G for swift sd[ab]3 [puppet] - 10https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi)
[15:08:24] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10BCornwall)
[15:08:27] <wikibugs>	 (03PS2) 10Filippo Giunchedi: install_server: set minimum 200G for swift sd[ab]3 [puppet] - 10https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275)
[15:08:49] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
[15:08:57] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Change the output directory that cfssl uses for etcd [puppet] - 10https://gerrit.wikimedia.org/r/821721 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[15:10:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[15:11:05] <dcausse>	 ^ this is me
[15:11:11] <wikibugs>	 (03PS1) 10Majavah: P:openstack::cumin::target: remove port forwarding support [puppet] - 10https://gerrit.wikimedia.org/r/821759
[15:15:09] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10BCornwall) Hi, @Siko_WMDE I'll need you to:  * Sign the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document (https://phabricator.wikimedia.or...
[15:18:56] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-etcd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:30] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: autoformat our yaml files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821181 (owner: 10David Caro)
[15:22:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi)
[15:24:40] <godog>	 dcaro: merged your change too
[15:25:22] <dcaro>	 godog: thanks! 
[15:25:37] <dcaro>	 I was waiting for you to finish up, but that's better :)
[15:25:48] <wikibugs>	 (03CR) 10Jbond: C:varnish: Rate limit hotlinking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond)
[15:26:36] <godog>	 yeah good (or bad) timing depending on how you want to look at it! :)
[15:26:59] <wikibugs>	 (03CR) 10RhinosF1: [C: 03+1] admin: add Simon Kock to ldap_only admins (nda,wmde) [puppet] - 10https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: 10Dzahn)
[15:27:12] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye
[15:27:24] <wikibugs>	 (03CR) 10RhinosF1: [C: 03+1] admin: add Simon Kock to ldap_only admins (nda,wmde) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: 10Dzahn)
[15:28:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:29:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10dcaro)
[15:30:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye
[15:30:18] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10fnegri)
[15:31:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:31:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:32:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:33:19] <wikibugs>	 (03PS1) 10FNegri: Add new Ceph hosts in racks E4-F4 [puppet] - 10https://gerrit.wikimedia.org/r/821766 (https://phabricator.wikimedia.org/T314870)
[15:34:22] <wikibugs>	 (03CR) 10Jbond: homer: add pyproject.toml (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond)
[15:36:14] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:01] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - 10https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835)
[15:38:41] <wikibugs>	 (03CR) 10David Caro: puppetmaster: remove 'allow_from' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah)
[15:42:04] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:42:41] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
[15:45:24] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:36] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
[15:50:37] <icinga-wm>	 PROBLEM - Check systemd state on es2021 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @TAndic would it be possible to attach the original or raw email messages of your new and old samples? I want to view the email headers to...
[15:54:07] <jynus>	 is clouddb1021 a schema change or something else going on?
[15:54:23] <jynus>	 who should I ping about it, data engineering?
[15:54:27] <jynus>	 or cloud?
[15:54:51] <jynus>	 I will have a look at es2021
[15:56:41] <wikibugs>	 (03PS2) 10DCausse: rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - 10https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835)
[15:57:27] <icinga-wm>	 RECOVERY - Check systemd state on es2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:44] <wikibugs>	 (03Abandoned) 10Andrea Denisse: netmon: Add DNS entries to test LibreNMS in Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/820554 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[15:58:36] <RhinosF1>	 jynus: based on https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&viewPanel=5&from=now-3h&to=now, it doesn't look like a schema change arriving, because no host had lag before; also it's flattening out now. that's my immediate thought, based on an uneducated read of the history.
[15:58:46] <jynus>	 weird, it fixed itself (es2021)
[15:58:55] <RhinosF1>	 DE maintain them now
[15:58:58] <jynus>	 it said the service didn't exist, but it did
[15:59:27] <RhinosF1>	 there was maint this morning though in s5 - https://sal.toolforge.org/log/zzFBgYIB6FQ6iqKi412c
[16:00:44] <wikibugs>	 (03CR) 10Jbond: sre.network.debug: automatically analyse the remote interface (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi)
[16:01:05] <wikibugs>	 (03PS3) 10Seddon: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie)
[16:01:13] <wikibugs>	 (03PS1) 10Clément Goubert: improvement: display update-known-host-productions zsh hint on macOS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770
[16:01:41] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye
[16:02:17] <wikibugs>	 (03PS2) 10Clément Goubert: improvement: display update-known-host-productions zsh hint on macOS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770
[16:06:27] <wikibugs>	 (03CR) 10Clément Goubert: "Can you check that out, no rush" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770 (owner: 10Clément Goubert)
[16:08:26] <wikibugs>	 (03CR) 10David Caro: "Thanks! (sorry, went on pto)" [puppet] - 10https://gerrit.wikimedia.org/r/818516 (owner: 10Cwhite)
[16:08:27] <jynus>	 there is some big delete ongoing - maybe it is being set up? No idea, but it seems otherwise healthy
[16:08:32] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] nova_fullstack_test: rename error.stack to stack_trace [puppet] - 10https://gerrit.wikimedia.org/r/818516 (owner: 10Cwhite)
[16:09:05] <jynus>	 https://grafana.wikimedia.org/goto/n7rZhGi4k?orgId=1
[16:12:27] <wikibugs>	 (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36667/console" [puppet] - 10https://gerrit.wikimedia.org/r/821766 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[16:13:48] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] Add new Ceph hosts in racks E4-F4 [puppet] - 10https://gerrit.wikimedia.org/r/821766 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[16:14:37] <wikibugs>	 (03CR) 10FNegri: [V: 03+1 C: 03+2] Add new Ceph hosts in racks E4-F4 [puppet] - 10https://gerrit.wikimedia.org/r/821766 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[16:26:14] <logmsgbot>	 !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[16:26:30] <logmsgbot>	 !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[16:29:42] <wikibugs>	 (03PS1) 10Vgutierrez: Revert PR #7465 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/821777
[16:31:29] <wikibugs>	 (03PS1) 10FNegri: Add cloudcephosd1025 to Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/821778 (https://phabricator.wikimedia.org/T314870)
[16:34:06] <wikibugs>	 (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36668/console" [puppet] - 10https://gerrit.wikimedia.org/r/821778 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[16:36:09] <wikibugs>	 (03CR) 10Jbond: "sorry to be really picky on a minor point but i think it's good to get these things right early 😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi)
[16:41:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Platonides) SQLite? I'm surprised it uses that, it may not be the best performant option for such a queue...
[16:44:12] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 56.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:53:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[16:53:49] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[16:54:12] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[16:54:16] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[17:00:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye
[17:13:00] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
[17:15:35] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
[17:21:00] <wikibugs>	 (03PS1) 10Btullis: Use the chained certificate for the etcd cfssl option [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129)
[17:22:56] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36669/console" [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[17:27:22] <wikibugs>	 (03PS2) 10Btullis: Use the chained certificate for the etcd cfssl option [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129)
[17:29:56] <vgutierrez>	 !log test trafficserver 9.1.2-1wm2 in cp6016 - T309651
[17:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:00] <stashbot>	 T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[17:34:07] <wikibugs>	 (03PS3) 10Ayounsi: sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600
[17:34:40] <wikibugs>	 (03CR) 10Ayounsi: "Thanks for your feedback :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi)
[17:35:51] <wikibugs>	 (03PS12) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[17:37:38] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36670/console" [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[17:37:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi)
[17:38:17] <wikibugs>	 (03CR) 10Ayounsi: sre.network.debug: automatically analyse the remote interface (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi)
[17:38:47] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye
[17:42:49] <wikibugs>	 (03PS4) 10Ayounsi: sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600
[17:47:17] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:54:07] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:57:16] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:58:10] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10dcausse) Current status: - all flink jobs are stopped in codfw - wdqs traffic is eqiad - wikidata maxlag is only checking eqiad - the rdf-stream...
[17:59:04] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[17:59:06] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:06:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Move asw2-d5-eqiad to spares - https://phabricator.wikimedia.org/T313115 (10Cmjohnson)
[18:06:46] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:09:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:11:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Move asw2-d5-eqiad to spares - https://phabricator.wikimedia.org/T313115 (10Cmjohnson) 05Open→03Resolved left in the rack to track
[18:12:25] <wikibugs>	 (03PS24) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[18:12:46] <wikibugs>	 (03PS1) 10Cathal Mooney: Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832)
[18:13:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[18:13:59] <wikibugs>	 (03CR) 10Ayounsi: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[18:14:42] <wikibugs>	 (03PS11) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730
[18:16:07] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:16:15] <wikibugs>	 (03PS2) 10Cathal Mooney: Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832)
[18:17:00] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori)
[18:17:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[18:22:01] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[18:30:35] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:41:59] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:43:29] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[18:43:35] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:50:41] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:53:50] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (jhathaway) From looking at the headers that @TAndic sent me the change in behavior appears as follows:  === Old Route ===  Qualtrics sends to gmail f...
[18:58:31] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:00:56] <wikibugs>	 (PS1) Bking: wdqs: bring more hosts online [puppet] - https://gerrit.wikimedia.org/r/821785 (https://phabricator.wikimedia.org/T314890)
[19:01:44] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/821785 (https://phabricator.wikimedia.org/T314890) (owner: Bking)
[19:02:16] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/821785 (https://phabricator.wikimedia.org/T314890) (owner: Bking)
[19:06:09] <wikibugs>	 (CR) Bking: [C: +2] wdqs: bring more hosts online [puppet] - https://gerrit.wikimedia.org/r/821785 (https://phabricator.wikimedia.org/T314890) (owner: Bking)
[19:11:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (VirginiaPoundstone)
[19:19:05] <wikibugs>	 SRE, serviceops, Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (Jdforrester-WMF)
[19:23:29] <wikibugs>	 (PS3) DCausse: rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835)
[19:25:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[19:29:09] <wikibugs>	 (CR) DCausse: [C: +2] rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835) (owner: DCausse)
[19:34:24] <wikibugs>	 (Merged) jenkins-bot: rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835) (owner: DCausse)
[19:35:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[19:35:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[19:36:03] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[19:38:50] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[19:43:58] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:47:34] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 533 bytes in 1.096 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:51:56] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:54:29] <inflatador>	 ^^ just brought those wdqs servers online, will ack alerts
[19:55:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890
[19:55:30] <stashbot>	 T314890: Service implementation for wdqs101[4,5,6] - https://phabricator.wikimedia.org/T314890
[19:55:32] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:55:51] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890
[19:56:26] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890
[19:56:40] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890
[19:57:06] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890
[19:57:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890
[19:58:04] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[20:02:06] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:05:16] <icinga-wm>	 PROBLEM - DNS on db1190.mgmt is CRITICAL: Domain db1190.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:08:28] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[20:09:54] <wikibugs>	 (PS3) Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295)
[20:10:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json
[20:10:35] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[20:12:33] <wikibugs>	 (PS4) Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295)
[20:13:36] <icinga-wm>	 PROBLEM - DNS on db1195.mgmt is CRITICAL: Domain db1195.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:17:36] <icinga-wm>	 PROBLEM - DNS on db1191.mgmt is CRITICAL: Domain db1191.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:25:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json
[20:25:52] <icinga-wm>	 PROBLEM - DNS on db1189.mgmt is CRITICAL: Domain db1189.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:26:46] <icinga-wm>	 PROBLEM - DNS on db1194.mgmt is CRITICAL: Domain db1194.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:32:42] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:40:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json
[20:43:42] <icinga-wm>	 PROBLEM - DNS on db1193.mgmt is CRITICAL: Domain db1193.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:46:18] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[20:46:51] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[20:51:20] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet
[20:51:20] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet
[20:52:40] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:55:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json
[20:55:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:55:53] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[20:55:58] <icinga-wm>	 PROBLEM - DNS on db1192.mgmt is CRITICAL: Domain db1192.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:56:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:56:40] <icinga-wm>	 PROBLEM - DNS on db1187.mgmt is CRITICAL: Domain db1187.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:56:54] <icinga-wm>	 PROBLEM - DNS on db1185.mgmt is CRITICAL: Domain db1185.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:00:16] <logmsgbot>	 !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
[21:06:40] <icinga-wm>	 PROBLEM - DNS on db1186.mgmt is CRITICAL: Domain db1186.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:06:44] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[21:08:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:08:30] <James_F>	 Just in case someone's alarmed to see gate-and-submit-wmf jobs happening without a C+2 here, they're CI trial runs (on already-merged patches) as we're now going to be running PHP7.4 jobs for prod patches too; see T293924
[21:08:30] <stashbot>	 T293924: Also run PHP 7.4 jobs on wmf branch patches - https://phabricator.wikimedia.org/T293924
[21:15:04] <wikibugs>	 SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 3 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (dcausse) Unfortunately I could not finish the cleanup of the `flink_ha_storage` folder to properly resume operations from k8s. I resumed the job...
[21:24:30] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 135 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:30:19] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 39 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:40:55] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[21:43:03] <logmsgbot>	 !log bking@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[21:43:05] <logmsgbot>	 !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[21:43:12] <logmsgbot>	 !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[21:43:15] <logmsgbot>	 !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[21:43:22] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[21:43:26] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[21:49:14] <logmsgbot>	 !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[21:50:45] <logmsgbot>	 !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[21:52:22] <logmsgbot>	 !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[21:53:16] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[21:57:00] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146
[21:57:03] <stashbot>	 T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146
[21:57:15] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:57:24] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146
[22:00:32] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[22:01:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:02:29] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[22:02:33] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[22:05:39] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release changeprop-jobqueue/production on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:05:40] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[22:12:56] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[22:15:53] <inflatador>	 welp, that didn't go so well
[22:16:13] <inflatador>	 (reverting changeprop-jobqueue shortly)
[22:18:44] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[22:28:29] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[22:29:16] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:31:19] <logmsgbot>	 !log ryankemper@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[22:31:23] <logmsgbot>	 !log ryankemper@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[22:42:43] <wikibugs>	 (PS1) Bking: changeprop-jobqueue: further reduce memory [deployment-charts] - https://gerrit.wikimedia.org/r/821800 (https://phabricator.wikimedia.org/T314426)
[22:46:13] <logmsgbot>	 !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet
[22:46:47] <wikibugs>	 (CR) Bking: [C: +2] changeprop-jobqueue: further reduce memory [deployment-charts] - https://gerrit.wikimedia.org/r/821800 (https://phabricator.wikimedia.org/T314426) (owner: Bking)
[22:49:07] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[22:49:13] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[22:51:46] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[22:51:53] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[22:52:48] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[22:55:10] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:58:24] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:59:08] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[23:02:06] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:06:51] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[23:07:10] <logmsgbot>	 !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[23:10:39] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release changeprop-jobqueue/production on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:15:42] <icinga-wm>	 PROBLEM - SSH on cp1089.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:17:29] <logmsgbot>	 !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
[23:29:46] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[23:42:42] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[23:48:38] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook