[00:00:25] (03CR) 10CI reject: [V: 04-1] openstack::nova: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268) (owner: 10Andrew Bogott) [00:12:10] (03CR) 10Andrea Denisse: [C: 03+1] "Looks good to me, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite) [00:19:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:19:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:21:34] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48536 bytes in 1.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:21:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:42:02] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:36] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:10:44] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:14] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:32] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:44:50] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10Legoktm) >>! In T314789#8138964, @MusikAnimal wrote: > That's very helpf... 
[01:45:12] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json [01:45:39] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [02:00:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json [02:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:02:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:07:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:13:20] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [02:15:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config 
saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json [02:19:42] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [02:30:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json [02:30:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [02:30:56] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [02:31:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [02:31:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json [02:32:36] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [02:34:26] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:37:48] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [02:43:28] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:44:12] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:04] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:46] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:04:52] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10nshahquinn-wmf) I don't know if they're relevant, but here are some past tickets related to Qualtrics emails from our domain: * {T164424} * {T176666} [03:18:14] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [03:23:29] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:03] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:34] (03PS10) 10MdsShakil: Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) [04:11:39] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:16:17] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:13] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) It looks like pywikibot is a decent test rig for session replication races. I installed pywikibot on a Linode instance in the Dallas region, which... 
[04:47:21] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:11:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370 [05:12:00] T314370: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T314370 [05:12:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370 [05:12:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1100 with weight 0 T314370', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json [05:21:17] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:31:01] (03PS2) 10Ladsgroup: mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/819550 (https://phabricator.wikimedia.org/T314370) (owner: 10Gerrit maintenance bot) [05:31:05] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/819550 (https://phabricator.wikimedia.org/T314370) (owner: 10Gerrit maintenance bot) [05:38:59] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:42:00] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:43:57] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:46:01] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:58:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:59:05] I restart mailman if it continues like this [06:00:05] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220809T0600). [06:00:11] o/ [06:00:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:00:55] !log Starting s5 eqiad failover from db1130 to db1100 - T314370 [06:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:00] T314370: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T314370 [06:01:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T314370', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json [06:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:01:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T314370', 
diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json [06:02:12] done [06:06:17] (03PS2) 10Ladsgroup: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/819551 (https://phabricator.wikimedia.org/T314370) (owner: 10Gerrit maintenance bot) [06:06:31] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/819551 (https://phabricator.wikimedia.org/T314370) (owner: 10Gerrit maintenance bot) [06:07:55] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [06:08:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1130 T314370', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json [06:08:40] T314370: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T314370 [06:10:23] now it's time to clean up the old s5 master [06:11:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint [06:11:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint [06:16:25] 10SRE, 10SRE-swift-storage, 10Wikidata-Query-Service: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10fgiunchedi) [06:16:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: allow x-wikimedia-debug to pick a php backend [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) (owner: 10Giuseppe Lavagetto) [06:18:56] 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Expand thanos-swift sd[ab]3 SSDs - https://phabricator.wikimedia.org/T314275 (10fgiunchedi) [06:19:15] !log dbmaint s5@eqiad (T312863 T312984 T310011 T310485) [06:19:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:20] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:19:21] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [06:19:21] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [06:20:58] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10fgiunchedi) Hi @papaul, it looks like it might be the battery indeed, I'll let @MatthewVernon check/confirm [06:21:06] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10fgiunchedi) Hi @papaul, it looks like it might be the battery indeed, I'll let @MatthewVernon check/confirm [06:22:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [06:22:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [06:24:45] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) We need to decide if we want to make this change, taking into consideration the fact that the resource savings (in dollar te... 
[06:25:41] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:23] (03PS2) 10Giuseppe Lavagetto: role::alerting_host: run vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/821255 [06:34:38] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:51] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10Siko_WMDE) @BCornwall, the associated e-mail address is: simon.kock@wikimedia.de thank you :) [06:59:05] (03CR) 10Ayounsi: [C: 03+2] sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 (owner: 10Ayounsi) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220809T0700) [07:02:12] (03Merged) 10jenkins-bot: sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 (owner: 10Ayounsi) [07:13:20] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:16:52] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Joe) >>! In T211661#8139622, @ori wrote: > As several people have pointed out in conversation with me, the fact that we're stori... 
[07:19:34] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:35:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10diego) Hi @BCornwall, just to say this is a high priority for us. We are already lost 4 days of work with @munizaA been locked-out from the servers. [07:45:00] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:55] 10SRE, 10SRE-swift-storage, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10Gehel) p:05Triage→03High [08:04:35] (03CR) 10Jbond: homer: add pyproject.toml (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [08:04:52] (03CR) 10Jbond: [V: 03+2 C: 03+2] flake8: move all flake8 config to setup.cfg [software/homer] - 10https://gerrit.wikimedia.org/r/819051 (owner: 10Jbond) [08:08:08] (03PS1) 10Elukey: ml-services: add model version config for enwiki editquality goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/821596 [08:09:59] (03CR) 10David Caro: [C: 03+2] novafullstack: remove leaked VMs test, moved to alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro) [08:10:24] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:53] (03CR) 10David Caro: homer: add pyproject.toml (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [08:11:48] (03PS3) 10Thiemo Kreuz (WMDE): Remove meaningless restriction level 
"none" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 [08:17:20] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:35] (03PS1) 10Ayounsi: sre.network.debug: automatically analyse the remote interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 [08:18:47] (03CR) 10Elukey: [C: 03+2] ml-services: add model version config for enwiki editquality goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/821596 (owner: 10Elukey) [08:21:17] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [08:22:27] 10SRE, 10ops-codfw, 10DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (10jcrespo) @Ladsgroup @Marostegui for reference, this is the couple of one liners I am using on cumin2002 to check the latest 2 million rows for each table: `lang=bash # mysql.py -BN -h es1021 -... [08:24:18] !log starting data check using es1021 and es2021, expect increased read traffic T314559 [08:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:21] T314559: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 [08:26:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [08:27:04] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:27:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:27:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[08:28:50] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [08:29:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, having tests would be great" [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [08:30:19] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:31:46] (03CR) 10Filippo Giunchedi: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite) [08:31:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:34:48] (03PS1) 10Elukey: ml-services: update articlequality's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821599 (https://phabricator.wikimedia.org/T301878) [08:36:49] (03PS1) 10Ayounsi: sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 [08:36:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag 
[08:39:28] ACKNOWLEDGEMENT - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T314838 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:45:34] (03PS1) 10Giuseppe Lavagetto: Revert "lvs: Use conf2005 in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/821341 [08:46:58] (03PS1) 10Filippo Giunchedi: o11y: fix logstash alerts to use 'datasource' grafana variable [alerts] - 10https://gerrit.wikimedia.org/r/821601 [08:47:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [08:48:21] (03PS2) 10Giuseppe Lavagetto: Revert "lvs: Use conf2005 in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/821341 [08:48:23] (03PS1) 10Giuseppe Lavagetto: lvs: move ulsfo, eqsin off of conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/821602 [08:49:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:52:04] (03Abandoned) 10Cathal Mooney: Automation changes to support new cloudsw configuration / vrf [homer/public] - 10https://gerrit.wikimedia.org/r/791460 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [08:52:10] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:52:11] RECOVERY - MegaRAID on an-worker1089 is OK: OK: 
optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:52:15] (03CR) 10Btullis: [C: 03+2] Bootstrap etcd on the dse_k8s_etcd cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [08:57:41] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:21] (03CR) 10Ayounsi: [V: 03+1] "Example output: https://phabricator.wikimedia.org/P32324" [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi) [08:59:54] (03CR) 10Ayounsi: "Example output (including following CR https://phabricator.wikimedia.org/P32324)" [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi) [09:00:49] (03PS5) 10Thiemo Kreuz (WMDE): Drop unused FlaggedRevs threshold level names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [09:02:00] 10SRE, 10SRE-OnFire, 10Observability-Alerting: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) [09:02:35] (03PS3) 10Giuseppe Lavagetto: role::alerting_host: run vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840) [09:03:44] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) As of now, the debianization is done in the software, but I'm waiting to build and upload a package until I've solved some of the outstanding issues. 
[09:05:41] (03PS1) 10Cathal Mooney: Add cloud-gw-transport-eqiad linknet to cloud-transports capirca def [homer/public] - 10https://gerrit.wikimedia.org/r/821655 (https://phabricator.wikimedia.org/T314775) [09:06:45] (03PS20) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [09:07:06] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:20] (03CR) 10Cathal Mooney: [C: 03+2] Add cloud-gw-transport-eqiad linknet to cloud-transports capirca def [homer/public] - 10https://gerrit.wikimedia.org/r/821655 (https://phabricator.wikimedia.org/T314775) (owner: 10Cathal Mooney) [09:08:14] (03Merged) 10jenkins-bot: Add cloud-gw-transport-eqiad linknet to cloud-transports capirca def [homer/public] - 10https://gerrit.wikimedia.org/r/821655 (https://phabricator.wikimedia.org/T314775) (owner: 10Cathal Mooney) [09:08:35] (03PS21) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [09:10:04] (03CR) 10FNegri: [C: 03+1] "LGTM, though I'm afraid the formatting will soon diverge again if we don't enforce it with some kind of script in Jenkins. 
😊" [puppet] - 10https://gerrit.wikimedia.org/r/821181 (owner: 10David Caro) [09:10:06] (03CR) 10Jbond: PeeringDB API: initial commit (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [09:11:15] (03CR) 10Vgutierrez: [C: 03+2] Revert "lvs: Use conf2005 in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/821341 (owner: 10Giuseppe Lavagetto) [09:11:50] 10SRE, 10SRE-OnFire, 10Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (10Joe) [09:12:23] !log rolling restart of pybal in codfw - T310070 [09:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:26] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [09:12:35] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) [09:15:10] PROBLEM - Check systemd state on dse-k8s-etcd1001 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:08] 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot: UX improvements - https://phabricator.wikimedia.org/T314843 (10Joe) [09:17:42] (03CR) 10FNegri: [C: 03+1] wmcs: autoformat our yaml files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821181 (owner: 10David Caro) [09:18:37] (03PS1) 10Gehel: admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 [09:18:43] (03PS1) 10Ayounsi: junos_set_interface_config: fix logic error [cookbooks] - 10https://gerrit.wikimedia.org/r/821688 [09:19:52] (03PS2) 10Gehel: admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 [09:20:44] (03PS1) 10Ladsgroup: conftool-data: Remove db2135 from the list [puppet] - 10https://gerrit.wikimedia.org/r/821689 (https://phabricator.wikimedia.org/T314656) [09:21:03] (03CR) 
10Elukey: [C: 03+2] ml-services: update articlequality's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821599 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [09:21:12] PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [09:24:04] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:24:54] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:25:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:26:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:27:04] (03CR) 10Jcrespo: [C: 03+1] "Tripley-confirmed on puppet, zarcillo and actual data contents." 
[puppet] - 10https://gerrit.wikimedia.org/r/821689 (https://phabricator.wikimedia.org/T314656) (owner: 10Ladsgroup) [09:27:20] RECOVERY - PyBal connections to etcd on lvs2007 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [09:28:44] (03CR) 10Ladsgroup: [C: 03+2] conftool-data: Remove db2135 from the list [puppet] - 10https://gerrit.wikimedia.org/r/821689 (https://phabricator.wikimedia.org/T314656) (owner: 10Ladsgroup) [09:29:50] 10SRE, 10SRE-OnFire, 10Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (10RhinosF1) For the IRC side, it's probably better to check cloak or account: - 1: a nickname on IRC normally has a short period of time (although this can range from 0 to infinity dependin... [09:31:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:38:44] (03PS1) 10Btullis: Change ownership of etcd certificates for cfssl usecase [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) [09:39:39] (03CR) 10Vgutierrez: [C: 03+2] lvs: move ulsfo, eqsin off of conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/821602 (owner: 10Giuseppe Lavagetto) [09:40:19] !log rolling restart of pybal in ulsfo - T310070 [09:40:42] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36651/console" [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:42:00] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:42:50] (03PS22) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [09:42:57] (03CR) 10Ayounsi: PeeringDB API: initial commit (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [09:43:04] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:16] (03CR) 10Btullis: [V: 03+1] "Adding joe to CC as a courtesy. I have verified that this has no effect on production etcd clusters, despite a change to the profile." [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:43:58] 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Joe) a:05BCornwall→03Joe Hi @BCornwall I'm going to take care of this task together with @Clement_Goubert - we already have @akosiaris' approval from before the delays we had;... [09:44:23] 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Joe) [09:46:35] 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10LSobanski) Approved. [09:48:24] PROBLEM - PyBal connections to etcd on lvs5002 is CRITICAL: CRITICAL: 0 connections established with conf2005.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [09:49:51] ^should I be worried? 
[09:51:16] (03PS1) 10Clément Goubert: admin: move cgoubert from ldap_only_users to users [puppet] - 10https://gerrit.wikimedia.org/r/821693 [09:52:46] oh, I just saw vgutierrez's log now [09:52:53] 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Clement_Goubert) [09:53:05] !log rolling restart of pybal in eqsin - T310070 [09:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:08] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [09:53:09] jynus: nope you shouldn't [09:53:22] yeah, too much scrollback [09:54:16] 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Clement_Goubert) [09:55:05] (03PS2) 10Clément Goubert: admin: move cgoubert from ldap_only_users to users [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902) [09:58:14] (03PS1) 10Jbond: admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 [09:58:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM but add yourself to the sreadmins group" [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902) (owner: 10Clément Goubert) [09:59:42] (03CR) 10CI reject: [V: 04-1] admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 (owner: 10Jbond) [10:00:42] RECOVERY - PyBal connections to etcd on lvs5002 is OK: OK: 4 connections established with conf2005.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [10:01:14] (03CR) 10Elukey: Change ownership of etcd certificates for cfssl usecase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [10:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on 
kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:02:24] (03PS2) 10Jbond: admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 [10:03:05] (03PS3) 10Jbond: admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 (https://phabricator.wikimedia.org/T314846) [10:04:55] (03PS1) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) [10:05:06] (03CR) 10Btullis: [V: 03+1] Change ownership of etcd certificates for cfssl usecase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [10:06:53] (03PS3) 10Clément Goubert: admin: move cgoubert from ldap_only_users to users [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902) [10:08:52] (03CR) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [10:11:26] (03CR) 10Vgutierrez: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) (owner: 10Jbond) [10:14:19] (03CR) 10Clément Goubert: "Done." 
[puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902) (owner: 10Clément Goubert) [10:14:28] <_joe_> claime: ^^ [10:14:56] (03PS10) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 [10:15:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: move cgoubert from ldap_only_users to users [puppet] - 10https://gerrit.wikimedia.org/r/821693 (https://phabricator.wikimedia.org/T313902) (owner: 10Clément Goubert) [10:15:10] _joe_: I maybe should have answered with something else than Done then :D [10:15:21] <_joe_> nah :P [10:15:58] (03PS4) 10Jbond: admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 (https://phabricator.wikimedia.org/T314846) [10:16:25] .win 13 [10:17:18] (03PS1) 10Elukey: ml-services: update settings for draftquality to test the new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821696 (https://phabricator.wikimedia.org/T301878) [10:19:09] (03CR) 10CI reject: [V: 04-1] sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi) [10:22:13] (03CR) 10Jbond: [C: 03+2] admin: remove access for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/821694 (https://phabricator.wikimedia.org/T314846) (owner: 10Jbond) [10:22:45] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10Joe) [10:26:43] (03CR) 10Jbond: [C: 03+2] C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) (owner: 10Jbond) [10:27:22] (03PS21) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [10:44:18] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:44:44] 
(03PS2) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) [10:46:37] (03PS2) 10Elukey: ml-services: update settings for draftquality to test the new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821696 (https://phabricator.wikimedia.org/T301878) [10:47:20] (03PS1) 10Filippo Giunchedi: dispatch: add container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821698 (https://phabricator.wikimedia.org/T313229) [10:47:24] (03CR) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [10:49:06] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:49:21] (03CR) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [10:56:13] (03CR) 10Elukey: [C: 03+2] ml-services: update settings for draftquality to test the new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/821696 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [10:58:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[11:03:32] (03PS1) 10Elukey: ml-services: apply the staging's draftquality image to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821699 (https://phabricator.wikimedia.org/T301878) [11:08:58] (03CR) 10Ssingh: [C: 03+2] admin: update ssh key for mnz [puppet] - 10https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955) (owner: 10RhinosF1) [11:09:17] (03CR) 10Elukey: [C: 03+2] ml-services: apply the staging's draftquality image to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821699 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [11:09:28] (03PS4) 10Ssingh: admin: update ssh key for mnz [puppet] - 10https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955) (owner: 10RhinosF1) [11:11:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10ssingh) Diego has confirmed via email so merging this given MunizA has been locked out. @MunizaA: echoing @Dzahn's suggestion above to update the address if possib... [11:12:48] (03Merged) 10jenkins-bot: ml-services: apply the staging's draftquality image to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821699 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [11:13:34] (03PS3) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) [11:13:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10ssingh) 05Open→03Resolved a:03ssingh Change has been merged. 
[11:13:41] 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10mfossati) [11:15:53] 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10mfossati) [11:20:07] (03CR) 10Jbond: "lgtm minor nit" [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi) [11:21:31] (03PS4) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) [11:21:43] (03CR) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro) [11:21:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:27:13] (03PS3) 10Gehel: admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) [11:28:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10Gehel) Approved from my side [11:32:26] (03CR) 10Jbond: "lgtm but see inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi) [11:35:38] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:26] (03CR) 10Jbond: "LGTm but sorry i missed a comment" [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [11:47:15] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: reduce backup_keep_time to 2d [puppet] - 
10https://gerrit.wikimedia.org/r/820712 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:04:18] RECOVERY - Check systemd state on mw2393 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:26] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) Thank you so much for this, @nshahquinn-wmf -- that seems super relevant! From these tickets it appears that we do use SMTP rather than Custo... [12:09:40] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 2 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10EChetty) [12:11:42] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:15:23] (03PS1) 10Kevin Bazira: ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821708 (https://phabricator.wikimedia.org/T314456) [12:30:52] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:58] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:40] (03CR) 10Btullis: [V: 03+1 C: 03+2] Change ownership of etcd certificates for cfssl usecase [puppet] - 10https://gerrit.wikimedia.org/r/821691 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [12:46:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:48:13] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10TheDJ) >>! In T314789#8139432, @Legoktm wrote: > IMO the most important... [12:49:41] (03CR) 10CDanis: [C: 03+2] Add a check_esc_policy_config subcommand [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [12:50:57] (03Merged) 10jenkins-bot: Add a check_esc_policy_config subcommand [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [12:51:45] (JobUnavailable) resolved: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:53:59] 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) As a followup to this past weekend's mi... [12:57:53] (03CR) 10David Caro: homer: add pyproject.toml (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [13:06:12] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 2 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10Ottomata) Hi, I don't know much about this, but I did a little bit of digging. I can see that the flink session cluster jobmanager is taking ch... 
[13:12:50] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:13:30] (03CR) 10Elukey: [C: 03+2] ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/821708 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira) [13:16:41] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10TheresNoTime) >>! In T314789#8140256, @TheDJ wrote: > You can use lame a... [13:16:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:17:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[13:18:30] (03PS2) 10Majavah: puppetmaster: remove 'allow_from' [puppet] - 10https://gerrit.wikimedia.org/r/799859 [13:19:25] (03PS1) 10Btullis: Change the output directory that cfssl uses for etcd [puppet] - 10https://gerrit.wikimedia.org/r/821721 (https://phabricator.wikimedia.org/T313129) [13:20:15] (03CR) 10Majavah: puppetmaster: remove 'allow_from' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [13:21:00] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36660/console" [puppet] - 10https://gerrit.wikimedia.org/r/821721 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [13:21:47] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36661/console" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [13:21:48] Hi godog , starting the logging: [13:21:48] !log netmon1002 to netmon1003 failover [13:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:20] denisse|m: ok! 
[13:22:46] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:30] !log stop replication on db1117:m1 T309074 [13:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:33] T309074: Put netmon1003 in service - https://phabricator.wikimedia.org/T309074 [13:24:55] (03CR) 10Bking: [V: 03+2] admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) (owner: 10Gehel) [13:25:03] (03CR) 10Bking: [V: 03+2 C: 03+2] admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) (owner: 10Gehel) [13:25:13] (03PS2) 10Andrea Denisse: netmon: failover to netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074) [13:25:40] (03PS4) 10Bking: admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) (owner: 10Gehel) [13:25:48] (03CR) 10Bking: [V: 03+2] admin: add mfossati to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/821687 (https://phabricator.wikimedia.org/T314853) (owner: 10Gehel) [13:28:32] (03PS2) 10Andrea Denisse: netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) [13:29:02] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:14] !log Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 [13:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:29] (03CR) 10Andrea Denisse: [C: 03+2] 
netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [13:29:36] (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [13:30:33] I ran a backup consistent to 13:23:51 state [13:30:38] of librenms [13:31:56] jynus: Thanks a lot! :D [13:32:21] !log running '# authdns-update' in ns0.wikimedia.org [13:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:42] (03CR) 10MVernon: [C: 03+1] "I'm far from a partman expert, but I think this is correct!" [puppet] - 10https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [13:34:00] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 2 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10elukey) ` root@thanos-fe1001:/home/elukey# source /etc/swift/account_AUTH_wdqs.env root@thanos-fe1001:/home/elukey# swift list rdf-streaming-up... 
[13:34:34] (03PS1) 10Andrea Denisse: Revert "netmon: failover to netmon1003" [dns] - 10https://gerrit.wikimedia.org/r/821727 [13:36:04] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:39] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "netmon: failover to netmon1003" [dns] - 10https://gerrit.wikimedia.org/r/821727 (owner: 10Andrea Denisse) [13:40:03] (03PS1) 10Andrea Denisse: netmon: Smokeping failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/821746 (https://phabricator.wikimedia.org/T309074) [13:40:22] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Smokeping failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/821746 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [13:40:24] (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] netmon: Smokeping failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/821746 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [13:40:40] (03CR) 10MVernon: [C: 03+1] "This looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [13:42:00] !log Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly, sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746 [13:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:43] !log authdns updated successfully [13:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:35] upgrade discussion is here? 
[13:43:48] !log Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179 [13:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:55] XioNoX: on -sre [13:43:57] (03CR) 10Andrea Denisse: [C: 03+2] netmon: failover to netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [13:44:16] cool [13:44:19] ping me if needed [13:44:38] XioNoX: thank you! will do [13:45:33] !log puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded [13:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:59] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:47:01] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:48:26] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:45] (03PS1) 10Elukey: ml-services: update drafttopic docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) [13:48:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/819124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [13:50:44] !log Running '# run-puppet-agent' in the netmon1002 host [13:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:34] !log Running '# run-puppet-agent' in the netmon1003 host [13:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:50] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) While Disk Is Cheap (TM), container listing is not and our thumbs containers are the largest in terms of number-of... [13:52:49] !log Successfully ran '# run-puppet-merge' in the netmon1002 and netmon1003 hosts. [13:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:59] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:54:01] !log Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124 [13:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:14] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Add the netmon1003 host as a syslog destination in homer [homer/public] - 10https://gerrit.wikimedia.org/r/819124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [13:56:02] (03Merged) 10jenkins-bot: netmon: Add the netmon1003 host as a syslog destination in homer [homer/public] - 10https://gerrit.wikimedia.org/r/819124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [13:57:23] (03CR) 10Ayounsi: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [13:57:24] !log 
kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:57:39] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:57:58] (03PS23) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [13:58:11] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10MatthewVernon) Yes, it's the battery that's died (presumably related to the recent power work). [13:58:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [13:58:57] ^ this is me [13:59:14] ACK [14:00:48] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10MatthewVernon) Yep, it's the battery (like ms-be2032). 
[14:00:49] .7 [14:00:50] err [14:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:07:13] (03CR) 10Ayounsi: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [14:08:52] (03PS1) 10DCausse: Stop checking codfw for wikidata/wdqs max lag detection [puppet] - 10https://gerrit.wikimedia.org/r/821753 [14:11:27] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 2 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10dcausse) It seems (still not 100% sure yet but seeing a lot of failures related to this) that the repeated failures are caused by the bad swift... [14:16:13] (03CR) 10Bking: [V: 03+2 C: 03+2] Stop checking codfw for wikidata/wdqs max lag detection [puppet] - 10https://gerrit.wikimedia.org/r/821753 (owner: 10DCausse) [14:16:22] (03CR) 10Jbond: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [14:17:45] 10SRE, 10serviceops-radar, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10lmata) p:05Triage→03Medium a:03fgiunchedi [14:18:17] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:31] (03PS2) 10Ayounsi: sre.network.debug: automatically analyse the remote interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 [14:18:33] (03PS2) 10Ayounsi: sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 [14:21:52] (03CR) 10Ayounsi: "thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi) [14:28:49] !log bking@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw [14:34:59] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10lmata) a:03herron @herron: from a POC standpoint, is this good enough? any thoughts on the last option for user management and SSO? [14:38:44] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update drafttopic docker image and settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [14:40:24] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) a:05herron→03None >>! In T309033#8140569, @lmata wrote: > any thoughts on the last option for user management and SSO? Please see https://p... 
[14:41:06] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) [14:41:21] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10herron) [14:41:25] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) 05Open→03Resolved a:03herron [14:43:19] (03PS2) 10Elukey: ml-services: update drafttopic docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) [14:43:35] (03CR) 10Elukey: ml-services: update drafttopic docker image and settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [14:43:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [14:43:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:45:37] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [14:46:11] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) [14:47:30] (03PS1) 10Majavah: P:ldap::client::labs: support multiple restricted_* groups [puppet] - 10https://gerrit.wikimedia.org/r/821756 [14:48:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [14:48:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:48:50] (03CR) 10Elukey: [C: 03+2] ml-services: update drafttopic docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821749 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [14:49:45] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:50:07] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye [14:50:35] (03CR) 10Jbond: [C: 03+1] "LGTM, minor optimisation inline" [puppet] - 10https://gerrit.wikimedia.org/r/821756 (owner: 10Majavah) [14:52:37] (03PS2) 10Majavah: P:ldap::client::labs: support multiple restricted_* groups [puppet] - 10https://gerrit.wikimedia.org/r/821756 [14:52:57] (03CR) 10Majavah: P:ldap::client::labs: 
support multiple restricted_* groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821756 (owner: 10Majavah) [14:54:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:57:39] !log finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors. [14:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:16] 10SRE, 10MediaWiki-General, 10Traffic: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) [14:59:12] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:59:31] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:59:52] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) [15:00:21] 10SRE, 10MediaWiki-General, 10Traffic: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) 05Open→03In progress [15:04:44] (03CR) 10Jbond: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821756 (owner: 10Majavah) [15:04:58] (03PS2) 10BCornwall: admin: add Simon Kock to ldap_only admins (nda,wmde) [puppet] - 10https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: 10Dzahn) [15:05:07] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage [15:05:31] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:12] 
10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10BCornwall) [15:08:21] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: set minimum 200G for swift sd[ab]3 [puppet] - 10https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [15:08:24] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10BCornwall) [15:08:27] (03PS2) 10Filippo Giunchedi: install_server: set minimum 200G for swift sd[ab]3 [puppet] - 10https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275) [15:08:49] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage [15:08:57] (03CR) 10Btullis: [V: 03+1 C: 03+2] Change the output directory that cfssl uses for etcd [puppet] - 10https://gerrit.wikimedia.org/r/821721 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [15:10:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:11:05] ^ this is me [15:11:11] (03PS1) 10Majavah: P:openstack::cumin::target: remove port forwarding support [puppet] - 10https://gerrit.wikimedia.org/r/821759 [15:15:09] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10BCornwall) Hi, @Siko_WMDE I'll need you to: * Sign the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document (https://phabricator.wikimedia.or... 
[15:18:56] RECOVERY - Check systemd state on dse-k8s-etcd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:30] (03CR) 10David Caro: [C: 03+2] wmcs: autoformat our yaml files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821181 (owner: 10David Caro) [15:22:37] (03CR) 10Filippo Giunchedi: "Thank you for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [15:24:40] dcaro: merged your change too [15:25:22] godog: thanks! [15:25:37] I was waiting for you to finish up, but that's better :) [15:25:48] (03CR) 10Jbond: C:varnish: Rate limit hotlinking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [15:26:36] yeah good (or bad) timing depending on how you want to look at it! :) [15:26:59] (03CR) 10RhinosF1: [C: 03+1] admin: add Simon Kock to ldap_only admins (nda,wmde) [puppet] - 10https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: 10Dzahn) [15:27:12] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye [15:27:24] (03CR) 10RhinosF1: [C: 03+1] admin: add Simon Kock to ldap_only admins (nda,wmde) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: 10Dzahn) [15:28:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:29:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10dcaro) [15:30:09] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye [15:30:18] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10fnegri) [15:31:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:31:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:32:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:33:19] (03PS1) 10FNegri: Add new Ceph hosts in racks E4-F4 [puppet] - 10https://gerrit.wikimedia.org/r/821766 (https://phabricator.wikimedia.org/T314870) [15:34:22] (03CR) 10Jbond: homer: add pyproject.toml (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [15:36:14] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:01] (03PS1) 10DCausse: rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - 10https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835) [15:38:41] (03CR) 10David Caro: puppetmaster: remove 'allow_from' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [15:42:04] PROBLEM - MariaDB Replica Lag: s5 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:42:41] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage [15:45:24] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:36] !log bking@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage [15:50:37] PROBLEM - Check systemd state on es2021 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:29] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @TAndic would it be possible to attach the original or raw email messages of your new and old samples? I want to view the email headers to... [15:54:07] is clouddb1021 a schema change or something else going on? [15:54:23] who should I ping about it, data engineering? [15:54:27] or cloud? [15:54:51] I will have a look at es2021 [15:56:41] (03PS2) 10DCausse: rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - 10https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835) [15:57:27] RECOVERY - Check systemd state on es2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:44] (03Abandoned) 10Andrea Denisse: netmon: Add DNS entries to test LibreNMS in Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/820554 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [15:58:36] jynus: based on https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&viewPanel=5&from=now-3h&to=now, it doesn't look like a schema change that arrived because no host had lag before, also it's flattening out now. that's my immediate and based on uneducated history thought. 
[15:58:46] weird, it fixed itself (es2021) [15:58:55] DE maintain them now [15:58:58] it said the service didn't exist, but it did [15:59:27] there was maint this morning though in s5 - https://sal.toolforge.org/log/zzFBgYIB6FQ6iqKi412c [16:00:44] (03CR) 10Jbond: sre.network.debug: automatically analyse the remote interface (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi) [16:01:05] (03PS3) 10Seddon: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [16:01:13] (03PS1) 10Clément Goubert: improvement: display update-known-host-productions zsh hint on macOS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770 [16:01:41] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye [16:02:17] (03PS2) 10Clément Goubert: improvement: display update-known-host-productions zsh hint on macOS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770 [16:06:27] (03CR) 10Clément Goubert: "Can you check that out, no rush" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770 (owner: 10Clément Goubert) [16:08:26] (03CR) 10David Caro: "Thanks! (sorry, went on pto)" [puppet] - 10https://gerrit.wikimedia.org/r/818516 (owner: 10Cwhite) [16:08:27] there is some big delete ongoing- maybe it is being setup? 
No idea, but it seems otherwise healthy [16:08:32] (03CR) 10David Caro: [C: 03+2] nova_fullstack_test: rename error.stack to stack_trace [puppet] - 10https://gerrit.wikimedia.org/r/818516 (owner: 10Cwhite) [16:09:05] https://grafana.wikimedia.org/goto/n7rZhGi4k?orgId=1 [16:12:27] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36667/console" [puppet] - 10https://gerrit.wikimedia.org/r/821766 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [16:13:48] (03CR) 10David Caro: [C: 03+1] Add new Ceph hosts in racks E4-F4 [puppet] - 10https://gerrit.wikimedia.org/r/821766 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [16:14:37] (03CR) 10FNegri: [V: 03+1 C: 03+2] Add new Ceph hosts in racks E4-F4 [puppet] - 10https://gerrit.wikimedia.org/r/821766 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [16:26:14] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:26:30] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:29:42] (03PS1) 10Vgutierrez: Revert PR #7465 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/821777 [16:31:29] (03PS1) 10FNegri: Add cloudcephosd1025 to Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/821778 (https://phabricator.wikimedia.org/T314870) [16:34:06] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36668/console" [puppet] - 10https://gerrit.wikimedia.org/r/821778 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [16:36:09] (03CR) 10Jbond: "sorry to be really picky on a minor point but i think its good to get theses things right early 😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi) [16:41:19] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused 
thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Platonides) SQLite? I'm surprised it uses that, it may not be the best performant option for such a queue... [16:44:12] RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 56.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:53:46] !log bking@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:53:49] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [16:54:12] !log bking@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:54:16] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [17:00:15] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye [17:13:00] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage [17:15:35] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage [17:21:00] (03PS1) 10Btullis: Use the chained certificate for the etcd cfssl option [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) [17:22:56] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36669/console" [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [17:27:22] (03PS2) 10Btullis: Use the chained certificate for the etcd cfssl option [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) [17:29:56] !log test trafficserver 9.1.2-1wm2 in cp6016 - T309651 [17:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:00] T309651: Package and deploy ATS 9.1.2 
- https://phabricator.wikimedia.org/T309651 [17:34:07] (03PS3) 10Ayounsi: sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 [17:34:40] (03CR) 10Ayounsi: "Thanks for your feedback :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi) [17:35:51] (03PS12) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:37:38] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36670/console" [puppet] - 10https://gerrit.wikimedia.org/r/821780 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [17:37:58] (03CR) 10CI reject: [V: 04-1] sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 (owner: 10Ayounsi) [17:38:17] (03CR) 10Ayounsi: sre.network.debug: automatically analyse the remote interface (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/821597 (owner: 10Ayounsi) [17:38:47] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye [17:42:49] (03PS4) 10Ayounsi: sre.network.debug: allow referencing directly an interface [cookbooks] - 10https://gerrit.wikimedia.org/r/821600 [17:47:17] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:54:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:57:16] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:58:10] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10dcausse) Current status: - all flink jobs are stopped in 
codfw - wdqs traffic is eqiad - wikidata maxlag is only checking eqiad - the rdf-stream... [17:59:04] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [17:59:06] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:06:04] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Move asw2-d5-eqiad to spares - https://phabricator.wikimedia.org/T313115 (10Cmjohnson) [18:06:46] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:09:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:11:02] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Move asw2-d5-eqiad to spares - https://phabricator.wikimedia.org/T313115 (10Cmjohnson) 05Open→03Resolved left in the rack to track [18:12:25] (03PS24) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [18:12:46] (03PS1) 10Cathal Mooney: Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) [18:13:44] (03CR) 10CI reject: [V: 04-1] Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [18:13:59] (03CR) 10Ayounsi: PeeringDB API: initial commit (031 comment) 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [18:14:42] (03PS11) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 [18:16:07] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:16:15] (03PS2) 10Cathal Mooney: Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) [18:17:00] 10SRE, 10MediaWiki-General, 10Traffic: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) [18:17:16] (03CR) 10CI reject: [V: 04-1] Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [18:22:01] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [18:30:35] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:41:59] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:43:29] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [18:43:35] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:50:41] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:53:50] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) From looking at the headers that @TAndic sent me the change in behavior appears as follows: === Old Route === Qualtrics sends to gmail f... 
[18:58:31] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:00:56] (03PS1) 10Bking: wdqs: bring more hosts online [puppet] - 10https://gerrit.wikimedia.org/r/821785 (https://phabricator.wikimedia.org/T314890) [19:01:44] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821785 (https://phabricator.wikimedia.org/T314890) (owner: 10Bking) [19:02:16] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/821785 (https://phabricator.wikimedia.org/T314890) (owner: 10Bking) [19:06:09] (03CR) 10Bking: [C: 03+2] wdqs: bring more hosts online [puppet] - 10https://gerrit.wikimedia.org/r/821785 (https://phabricator.wikimedia.org/T314890) (owner: 10Bking) [19:11:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) [19:19:05] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Jdforrester-WMF) [19:23:29] (03PS3) 10DCausse: rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - 10https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835) [19:25:31] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:29:09] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - 10https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835) (owner: 10DCausse) [19:34:24] (03Merged) 10jenkins-bot: rdf-streaming-updater: use the S3 client for flink ha [deployment-charts] - 10https://gerrit.wikimedia.org/r/821768 (https://phabricator.wikimedia.org/T314835) (owner: 10DCausse) [19:35:41] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance [19:35:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance [19:36:03] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [19:38:50] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [19:43:58] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:47:34] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 533 bytes in 1.096 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:51:56] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:54:29] ^^ just brought those wdqs servers online, will ack alerts [19:55:27] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890 [19:55:30] T314890: Service implementation for wdqs101[4,5,6] - https://phabricator.wikimedia.org/T314890 [19:55:32] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:55:51] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890 [19:56:26] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890 [19:56:40] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890 [19:57:06] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890 [19:57:08] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890 [19:58:04] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:02:06] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:05:16] PROBLEM - DNS on db1190.mgmt is CRITICAL: Domain db1190.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:08:28] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [20:09:54] (03PS3) 10Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) [20:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json [20:10:35] T312863: Schema change to change primary key of templatelinks - 
https://phabricator.wikimedia.org/T312863 [20:12:33] (03PS4) 10Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) [20:13:36] PROBLEM - DNS on db1195.mgmt is CRITICAL: Domain db1195.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:17:36] PROBLEM - DNS on db1191.mgmt is CRITICAL: Domain db1191.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:25:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json [20:25:52] PROBLEM - DNS on db1189.mgmt is CRITICAL: Domain db1189.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:26:46] PROBLEM - DNS on db1194.mgmt is CRITICAL: Domain db1194.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:32:42] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:40:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json [20:43:42] PROBLEM - DNS on db1193.mgmt is CRITICAL: Domain db1193.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:46:18] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [20:46:51] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:51:20] !log bking@cumin1001 START - 
Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet [20:51:20] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet [20:52:40] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:55:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json [20:55:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:55:53] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:55:58] PROBLEM - DNS on db1192.mgmt is CRITICAL: Domain db1192.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:56:40] PROBLEM - DNS on db1187.mgmt is CRITICAL: Domain db1187.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:54] PROBLEM - DNS on db1185.mgmt is CRITICAL: Domain db1185.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:00:16] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet [21:06:40] PROBLEM - DNS on db1186.mgmt is CRITICAL: Domain db1186.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:06:44] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[21:08:17] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:08:30] Just in case someone's alarmed to see gate-and-submit-wmf jobs happening without a C+2 here, they're CI trial runs (on already-merged patches) as we're now going to be running PHP7.4 jobs for prod patches too; see T293924 [21:08:30] T293924: Also run PHP 7.4 jobs on wmf branch patches - https://phabricator.wikimedia.org/T293924 [21:15:04] SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 3 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (dcausse) Unfortunately I could not finish the cleanup of the `flink_ha_storage` folder to properly resume operations from k8s. I resumed the job... [21:24:30] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 135 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:30:19] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 39 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:40:55] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [21:43:03] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [21:43:05] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [21:43:12] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [21:43:15] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [21:43:22] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [21:43:26] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [21:49:14] !log 
bking@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [21:50:45] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [21:52:22] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [21:53:16] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [21:57:00] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146 [21:57:03] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 [21:57:15] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:57:24] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146 [22:00:32] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:01:14] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:02:29] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [22:02:33] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [22:05:39] (HelmReleaseBadStatus) firing: Helm release changeprop-jobqueue/production on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:05:40] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [22:12:56] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:15:53] welp, that didn't go so well 
[22:16:13] (reverting changeprop-jobqueue shortly) [22:18:44] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [22:28:29] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:29:16] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:31:19] !log ryankemper@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [22:31:23] !log ryankemper@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [22:42:43] (PS1) Bking: changeprop-jobqueue: further reduce memory [deployment-charts] - https://gerrit.wikimedia.org/r/821800 (https://phabricator.wikimedia.org/T314426) [22:46:13] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet [22:46:47] (CR) Bking: [C: +2] changeprop-jobqueue: further reduce memory [deployment-charts] - https://gerrit.wikimedia.org/r/821800 (https://phabricator.wikimedia.org/T314426) (owner: Bking) [22:49:07] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [22:49:13] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [22:51:46] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [22:51:53] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [22:52:48] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:55:10] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:24] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[22:59:08] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [23:02:06] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:51] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [23:07:10] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [23:10:39] (HelmReleaseBadStatus) resolved: Helm release changeprop-jobqueue/production on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:15:42] PROBLEM - SSH on cp1089.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:17:29] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet [23:29:46] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:42:42] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [23:48:38] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook